`R` is a very fluid language amenable to meta-programming, or alterations of the language itself. This has allowed the late user-driven introduction of a number of powerful features such as magrittr pipes, the foreach system, futures, data.table, and dplyr. Please read on for some small meta-programming effects we have been experimenting with.

## Encoding categorical variables: one-hot and beyond

## (or: how to correctly use `xgboost` from `R`)

`R` has "one-hot" encoding hidden in most of its modeling paths. Asking an `R` user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.

For example we can see evidence of one-hot encoding in the variable names chosen by a linear regression:

```
dTrain <- data.frame(x= c('a','b','b', 'c'),
y= c(1, 2, 1, 2))
summary(lm(y~x, data= dTrain))
```

```
##
## Call:
## lm(formula = y ~ x, data = dTrain)
##
## Residuals:
## 1 2 3 4
## -2.914e-16 5.000e-01 -5.000e-01 2.637e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0000 0.7071 1.414 0.392
## xb 0.5000 0.8660 0.577 0.667
## xc 1.0000 1.0000 1.000 0.500
##
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared: 0.5, Adjusted R-squared: -0.5
## F-statistic: 0.5 on 2 and 1 DF, p-value: 0.7071
```
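To make the hidden encoding visible, here is a minimal sketch using base `R`'s `model.matrix()` on the same toy data. The variable names `mm` and `oneHot` are ours, introduced only for illustration.

```
# Same toy data as in the regression above.
dTrain <- data.frame(x = c('a', 'b', 'b', 'c'),
                     y = c(1, 2, 1, 2),
                     stringsAsFactors = TRUE)

# model.matrix() builds the numeric design matrix that lm() fits against.
# With the default treatment contrasts, level 'a' is the reference level:
# it is absorbed into the intercept, so only 'b' and 'c' get indicator columns.
mm <- model.matrix(y ~ x, data = dTrain)
print(colnames(mm))

# Dropping the intercept yields a full one-hot encoding (one column per level).
oneHot <- model.matrix(~ 0 + x, data = dTrain)
print(oneHot)
```

The column names `xb` and `xc` in the first matrix are the same names that appear in the coefficient table, which is where the indicator encoding shows through.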


## Teaching pivot / un-pivot

## Authors: John Mount and Nina Zumel

## Introduction

In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.

One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering”) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain as it takes a specific group of rows and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.
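The factoring described above can be sketched in base `R`. This is an illustration under our own assumptions, not the article's implementation: the data frame `wide`, the helper `unpivotRow()`, and the column names are all hypothetical.

```
# Hypothetical wide data: one row per id, one column per measurement.
wide <- data.frame(id = c(1, 2),
                   height = c(10, 20),
                   weight = c(100, 200))
measureCols <- c('height', 'weight')

# Un-pivot: each single row becomes a group of (id, key, value) rows.
unpivotRow <- function(row, keyCols) {
  data.frame(id = row$id,
             key = keyCols,
             value = unlist(row[1, keyCols], use.names = FALSE),
             stringsAsFactors = FALSE)
}
tall <- do.call(rbind,
                lapply(seq_len(nrow(wide)),
                       function(i) unpivotRow(wide[i, , drop = FALSE], measureCols)))

# Pivot, step 1: a single row to single row mapping. Each tall row becomes a
# sparse wide row holding its value in the column named by its key.
sparse <- data.frame(id = tall$id)
for (k in measureCols) {
  sparse[[k]] <- ifelse(tall$key == k, tall$value, NA)
}

# Pivot, step 2: a grouped aggregation collapsing the sparse rows for each id.
# na.action = na.pass keeps rows containing NA so max() can skip them itself.
pivoted <- aggregate(. ~ id, data = sparse,
                     FUN = function(v) max(v, na.rm = TRUE),
                     na.action = na.pass)
print(pivoted)
```

Using `max()` as the aggregator assumes each (id, key) pair occurs at most once; with duplicates a different aggregation would have to be chosen deliberately.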

Please read on for our thoughts on teaching pivoting data.

## You can’t do that in statistics

There are a number of statistical principles that are perhaps more honored in the breach than in the observance. For fun I am going to name a few, and show why they are not always the “precision surgical knives of thought” one would hope for (working more like large hammers).

## Visualizing relational joins

I want to discuss a nice series of figures used to teach relational join semantics in *R for Data Science* by Garrett Grolemund and Hadley Wickham, O’Reilly 2016. Below is an example from their book illustrating an inner join:

Please read on for my discussion of this diagram and teaching joins.

## Coordinatized Data: A Fluid Data Specification

## Authors: John Mount and Nina Zumel.

## Introduction

It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).

Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are *inessential* implementation details when *reasoning* about your data. Many *algorithms* are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.

In this article we will try to separate representation from semantics. We will advocate for thinking in terms of *coordinatized data*, and demonstrate advanced data wrangling in `R`.
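As a small sketch of the idea, the same facts can be stored in two layouts and still be addressed by the same coordinates. The data, the coordinate names (`id`, `measure`), and the lookup helpers below are hypothetical and only meant to illustrate the point.

```
# Hypothetical sensor readings stored in a wide layout...
wide <- data.frame(id = c('s1', 's2'),
                   temp = c(20.5, 22.0),
                   humidity = c(0.40, 0.35),
                   stringsAsFactors = FALSE)

# ...and the same readings stored in a tall layout.
tall <- data.frame(id = c('s1', 's1', 's2', 's2'),
                   measure = c('temp', 'humidity', 'temp', 'humidity'),
                   value = c(20.5, 0.40, 22.0, 0.35),
                   stringsAsFactors = FALSE)

# Look up a value by its coordinates (id, measure) in each layout.
lookupWide <- function(d, id, measure) d[d$id == id, measure]
lookupTall <- function(d, id, measure) d$value[d$id == id & d$measure == measure]

fromWide <- lookupWide(wide, 's2', 'humidity')
fromTall <- lookupTall(tall, 's2', 'humidity')
print(c(fromWide, fromTall))
```

Only the lookup code changes between layouts; the coordinates and the value they name do not.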


## Debugging Pipelines in R with Bizarro Pipe and Eager Assignment

This is a note on debugging `magrittr` pipelines in `R` using Bizarro Pipe and eager assignment.
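The excerpt does not define the Bizarro Pipe; as we understand it, the term refers to writing `->.;` where a `magrittr` pipeline would use `%>%`. The sketch below rests on that assumption and uses only base `R`. Eager assignment is not shown.

```
# A three-stage calculation written with nested calls.
nested <- cos(sin(sqrt(4)))

# The same calculation using the `->.;` glyph (assumed to be the Bizarro Pipe).
# Each stage right-assigns its result to the variable `.` and ends the
# statement, so the next stage reads `.` as an ordinary R variable.
4 ->.; sqrt(.) ->.; sin(.) ->.; cos(.) -> piped

# Because every stage is a plain statement, the pipeline can be split across
# lines and intermediate values inspected while debugging.
4 ->.;
sqrt(.) ->.;
print(.)      # inspect the intermediate value
sin(.) ->.;
cos(.) -> stepped
```

Since each stage is a complete statement, a debugger can step through the stages one at a time, which is harder to do inside a single `%>%` expression.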

## Datashader is a big deal

I recently got back from Strata West 2017 (where I ran a very well received workshop on `R` and `Spark`). One thing that really stood out for me at the exhibition hall was `Bokeh` plus `datashader` from Continuum Analytics.

I had the privilege of having Peter Wang himself demonstrate `datashader` for me and answer a few of my questions.

I am so excited about `datashader` capabilities I literally *will not wait* for the functionality to be exposed in `R` through `rbokeh`. I am going to leave my usual `knitr`/`rmarkdown` world and dust off `Jupyter Notebook` just to use `datashader` plotting. This is worth trying, even for diehard `R` users.

## Practical Data Science with R: ACM SIGACT News Book Review and Discount!

Our book Practical Data Science with R has just been reviewed in Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory (ACM SIGACT) News by Dr. Allan M. Miller (U.C. Berkeley)!

The book is half off at Manning March 21st 2017 using the following code (please share/Tweet):

Deal of the Day March 21: Half off my book Practical Data Science with R. Use code `dotd032117au` at https://www.manning.com/dotd

Please read on for links and excerpts from the review.

## Another R [Non-]Standard Evaluation Idea

Jonathan Carroll had an interesting `R` language idea: to use `@`-notation to request value substitution in a non-standard evaluation environment (inspired by MySQL User-Defined Variables).

He even picked the right image: