Cross-validation is a way to safely reuse training data in nested model situations. This includes both the case of setting hyperparameters before fitting a model, and the case of fitting models (let’s call them *base learners*) that are then used as variables in downstream models, as shown in Figure 1. In either situation, using the same data twice can lead to models that are overtuned to idiosyncrasies in the training data, and more likely to overfit.

In general, if any stage of your modeling pipeline involves looking at the outcome (we’ll call that a *y-aware* stage), you cannot directly use the same data in the following stage of the pipeline. If you have enough data, you can use separate data in each stage of the modeling process (for example, one set of data to learn hyperparameters, another set of data to train the model that uses those hyperparameters). Otherwise, you should use cross-validation to reduce the nested model bias.
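As a minimal sketch of the “separate data in each stage” option, the split might look like the following. The data frame `d` and the stage names are hypothetical, chosen only for illustration; the point is that no row is seen by both stages.

```r
# Hypothetical example data: 100 rows identified by id
set.seed(2019)
d <- data.frame(id = 1:100, y = rnorm(100))

# Randomly assign each row to exactly one stage of the pipeline
stage <- sample(c("calibrate", "train"), nrow(d), replace = TRUE)

# Stage 1 data: e.g. choose hyperparameters or fit base learners
d_calibrate <- d[stage == "calibrate", , drop = FALSE]
# Stage 2 data: fit the model that consumes the stage 1 results
d_train <- d[stage == "train", , drop = FALSE]

# The two stages share no rows
length(intersect(d_calibrate$id, d_train$id))
```

When there is not enough data to afford a split like this, cross-validation plays the same role while reusing the rows.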

Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation?

The short answer: no, you shouldn’t. But, as we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.”

Suppose you have a system with two categorical variables. The variable `x_s` has 10 levels, and the variable `x_n` has 100 levels. The outcome `y` is a function of `x_s`, but not of `x_n` (but you, the analyst building the model, don’t know this). Here’s the head of the data.

```
## x_s x_n y
## 2 s_10 n_72 0.34228110
## 3 s_01 n_09 -0.03805102
## 4 s_03 n_18 -0.92145960
## 9 s_08 n_43 1.77069352
## 10 s_08 n_17 0.51992928
## 11 s_01 n_78 1.04714355
```

With most modeling techniques, a categorical variable with K levels is equivalent to K or K-1 numerical (indicator or dummy) variables, so this system actually has around 110 variables. In real life situations where a data scientist is working with high-cardinality categorical variables, or with a lot of categorical variables, the number of actual variables can begin to swamp the size of training data, and/or bog down the machine learning algorithm.
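To make the “K or K-1” count concrete, here is a small base R sketch using `model.matrix()`. The factor `x` is a made-up example with four levels, not one of the article’s variables.

```r
# Hypothetical categorical variable with K = 4 levels
x <- factor(c("a", "b", "c", "d", "a", "c"))
K <- nlevels(x)

# Default treatment coding: an intercept plus K-1 dummy columns
mm_dummy <- model.matrix(~ x)
dummy_cols <- setdiff(colnames(mm_dummy), "(Intercept)")

# Indicator coding with the intercept removed: K columns
mm_indicator <- model.matrix(~ 0 + x)

length(dummy_cols)   # number of dummy variables
ncol(mm_indicator)   # number of indicator variables
```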

One way to deal with these issues is to represent each categorical variable by a single variable model (or base learner), and then use the predictions of those base learners as the inputs to a bigger model. So instead of fitting a model with 110 indicator variables, you can fit a model with two numerical variables. This is a simple example of nested models.

We refer to this procedure as “impact coding,” and it is one of the data treatments available in the `vtreat` package, specifically for dealing with high-cardinality categorical variables. But for now, let’s go back to the original problem.

For this simple example, you might try representing each variable as the expected value of `y - mean(y)` in the training data, conditioned on the variable’s level. So the ith “coefficient” of the one-variable model would be given by:

$$v_i = E[y \mid x = s_i] - E[y]$$

where $s_i$ is the $i$th level. Let’s show this with the variable `x_s` (the code for all the examples in this article is here):

```
## x_s meany coeff
## 1 s_01 0.7998263 0.8503282
## 2 s_02 -1.3815640 -1.3310621
## 3 s_03 -0.7928449 -0.7423430
## 4 s_04 -0.8245088 -0.7740069
## 5 s_05 0.7547054 0.8052073
## 6 s_06 0.1564710 0.2069728
## 7 s_07 -1.1747557 -1.1242539
## 8 s_08 1.3520153 1.4025171
## 9 s_09 1.5789785 1.6294804
## 10 s_10 -0.7313895 -0.6808876
```
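A table like the one above can be computed in a few lines of base R. The following is only a sketch on made-up data (the data frame `d` and its three levels are hypothetical, not the article’s data set), showing the conditional mean, the coefficient, and how the one-variable model is applied back to the rows.

```r
# Hypothetical toy data; not the article's data set
set.seed(2019)
d <- data.frame(
  x_s = sample(c("s_01", "s_02", "s_03"), 60, replace = TRUE),
  stringsAsFactors = FALSE
)
d$y <- ifelse(d$x_s == "s_01", 1, -0.5) + rnorm(nrow(d))

grand_mean <- mean(d$y)

# Conditional mean of y for each level of x_s
level_means <- tapply(d$y, d$x_s, mean)

# One row per level: conditional mean and impact coefficient
impact <- data.frame(
  x_s = names(level_means),
  meany = as.numeric(level_means),
  coeff = as.numeric(level_means) - grand_mean,
  stringsAsFactors = FALSE
)

# Apply the one-variable model: look up each row's coefficient by level
d$vs <- impact$coeff[match(d$x_s, impact$x_s)]
print(impact)
```

Note that every row’s own `y` contributes to the coefficient that row receives, which is exactly the data reuse this article warns about.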

In other words, whenever the value of `x_s` is `s_01`, the one-variable model `vs` returns the value 0.8503282, and so on. If you do this for both variables, you get a training set that looks like this:

```
## x_s x_n y vs vn
## 2 s_10 n_72 0.34228110 -0.6808876 0.64754957
## 3 s_01 n_09 -0.03805102 0.8503282 0.54991135
## 4 s_03 n_18 -0.92145960 -0.7423430 0.01923877
## 9 s_08 n_43 1.77069352 1.4025171 1.90394159
## 10 s_08 n_17 0.51992928 1.4025171 0.26448341
## 11 s_01 n_78 1.04714355 0.8503282 0.70342961
```

Now fit a linear model for `y` as a function of `vs` and `vn`.

```
model_raw = lm(y ~ vs + vn,
               data=dtrain_treated)
summary(model_raw)
```

```
##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.33068 -0.57106 0.00342 0.52488 2.25472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05050 0.05597 -0.902 0.368
## vs 0.77259 0.05940 13.006 <2e-16 ***
## vn 0.61201 0.06906 8.862 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8761 on 242 degrees of freedom
## Multiple R-squared: 0.6382, Adjusted R-squared: 0.6352
## F-statistic: 213.5 on 2 and 242 DF, p-value: < 2.2e-16
```

Note that this model gives significant coefficients to both `vs` and `vn`, even though `y` is not a function of `x_n` (or `vn`). Because you used the same data to fit the one-variable base learners and to fit the larger model, you have overfit.

The correct way to impact code (or to nest models in general) is to use cross-validation techniques. Impact coding with cross-validation is already implemented in `vtreat`; note the similarity between this diagram and Figure 1 above.

The training data is used both to fit the base learners (as we did above) and also to create a data frame of cross-validated base learner predictions (called a *cross-frame* in `vtreat`). This cross-frame is used to train the overall model. Let’s fit the correct nested model, using `vtreat`.

```
library(vtreat)
library(wrapr)
xframeResults = mkCrossFrameNExperiment(dtrain,
                                        qc(x_s, x_n), "y",
                                        codeRestriction = qc(catN),
                                        verbose = FALSE)
# the plan uses the one-variable models to treat data
treatmentPlan = xframeResults$treatments
# the cross-frame
dtrain_treated = xframeResults$crossFrame
head(dtrain_treated)
```

```
## x_s_catN x_n_catN y
## 1 -0.6337889 0.91241547 0.34228110
## 2 0.8342227 0.82874089 -0.03805102
## 3 -0.7020597 0.18198634 -0.92145960
## 4 1.3983175 1.99197404 1.77069352
## 5 1.3983175 0.11679580 0.51992928
## 6 0.8342227 0.06421659 1.04714355
```

```
variables = setdiff(colnames(dtrain_treated), "y")
model_X = lm(mk_formula("y", variables),
             data=dtrain_treated)
summary(model_X)
```

```
##
## Call:
## lm(formula = mk_formula("y", variables), data = dtrain_treated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2157 -0.7343 0.0225 0.7483 2.9639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04169 0.06745 -0.618 0.537
## x_s_catN 0.92968 0.06344 14.656 <2e-16 ***
## x_n_catN 0.10204 0.06654 1.533 0.126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 242 degrees of freedom
## Multiple R-squared: 0.4753, Adjusted R-squared: 0.471
## F-statistic: 109.6 on 2 and 242 DF, p-value: < 2.2e-16
```

This model correctly determines that `x_n` (and its one-variable model `x_n_catN`) does not affect the outcome. We can compare the performance of this model to the naive model on holdout data.

| | rmse | rsquared |
|---|---|---|
| ypred_naive | 1.303778 | 0.2311538 |
| ypred_crossval | 1.093955 | 0.4587089 |

The correct model has a much smaller root-mean-squared error and a much larger R-squared than the naive model when applied to new data.
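To make the cross-frame idea concrete without any packages, here is a hand-rolled sketch of out-of-fold impact coding. It is an assumption-laden illustration, not `vtreat`’s implementation: the helper `impact_code()`, the 5-fold plan, and the pure-noise data are all hypothetical. Each row is coded by a one-variable model that never saw that row.

```r
# Hypothetical data: a 50-level variable and an outcome that ignores it
set.seed(2019)
n <- 200
d <- data.frame(
  x_n = sample(sprintf("n_%02d", 1:50), n, replace = TRUE),
  stringsAsFactors = FALSE
)
d$y <- rnorm(n)  # pure noise: x_n carries no signal

# Illustrative helper: fit the impact code on `train`, apply it to `apply_to`
impact_code <- function(train, apply_to) {
  coefs <- tapply(train$y, train$x_n, mean) - mean(train$y)
  v <- as.numeric(coefs)[match(apply_to$x_n, names(coefs))]
  v[is.na(v)] <- 0  # level not seen in train: fall back to zero impact
  v
}

# Naive coding: each row is coded by a model fit on that same row
d$vn_naive <- impact_code(d, d)

# Cross-frame coding: each row is coded by a model fit on the other folds
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
d$vn_cross <- NA_real_
for (f in seq_len(k)) {
  held_out <- fold == f
  d$vn_cross[held_out] <- impact_code(d[!held_out, ], d[held_out, ])
}

# Training R-squared of the downstream model under each coding
r2_naive <- summary(lm(y ~ vn_naive, data = d))$r.squared
r2_cross <- summary(lm(y ~ vn_cross, data = d))$r.squared
print(c(naive = r2_naive, cross = r2_cross))
```

Since `y` is unrelated to `x_n` here, any apparent fit from the naive coding is an artifact of reusing the data.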

But cross-validation is so complicated. Can’t we just regularize? As we’ll show in the appendix of this article, for a one-variable model, L2-regularization is simply Laplace smoothing. Again, we’ll represent each “coefficient” of the one-variable model as the Laplace smoothed value minus the grand mean.

$$v_i = \frac{\sum_{x_j = s_i} y_j}{\mathrm{count}_i + \lambda} - E[y]$$

where $\mathrm{count}_i$ is the frequency of $s_i$ in the training data, and $\lambda$ is the smoothing parameter (usually 1). If $\lambda = 1$ then the first term on the right is just adding one to the frequency of the level and then taking the “adjusted conditional mean” of `y`.

Again, let’s show this for the variable `x_s`.

```
## x_s sum_y count_y grandmean vs
## 1 s_01 20.795484 26 -0.05050187 0.8207050
## 2 s_02 -37.302227 27 -0.05050187 -1.2817205
## 3 s_03 -22.199656 28 -0.05050187 -0.7150035
## 4 s_04 -14.016649 17 -0.05050187 -0.7282009
## 5 s_05 19.622340 26 -0.05050187 0.7772552
## 6 s_06 3.129419 20 -0.05050187 0.1995218
## 7 s_07 -35.242672 30 -0.05050187 -1.0863585
## 8 s_08 36.504412 27 -0.05050187 1.3542309
## 9 s_09 33.158549 21 -0.05050187 1.5577086
## 10 s_10 -16.821957 23 -0.05050187 -0.6504130
```
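As a quick arithmetic check of the smoothing formula, the sketch below recomputes two of the coefficients from the `sum_y`, `count_y`, and `grandmean` values printed in the table above (rows `s_01` and `s_04`). Only those copied numbers are used; the variable names are ours.

```r
# Values copied from rows s_01 and s_04 of the table above
sum_y <- c(s_01 = 20.795484, s_04 = -14.016649)
count_y <- c(s_01 = 26, s_04 = 17)
grandmean <- -0.05050187
lambda <- 1

# Unsmoothed impact code: conditional mean minus grand mean
unsmoothed <- sum_y / count_y - grandmean

# Laplace-smoothed impact code: add lambda to the count before dividing
vs <- sum_y / (count_y + lambda) - grandmean

print(round(rbind(unsmoothed, vs), 7))
```

The smoothed values agree with the `vs` column above, and the unsmoothed values agree with the `coeff` column of the earlier table; with counts this large, the smoothing changes each coefficient only slightly.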

After applying the one-variable models for `x_s` and `x_n` to the data, the head of the resulting treated data looks like this:

```
## x_s x_n y vs vn
## 2 s_10 n_72 0.34228110 -0.6504130 0.44853367
## 3 s_01 n_09 -0.03805102 0.8207050 0.42505898
## 4 s_03 n_18 -0.92145960 -0.7150035 0.02370493
## 9 s_08 n_43 1.77069352 1.3542309 1.28612835
## 10 s_08 n_17 0.51992928 1.3542309 0.21098803
## 11 s_01 n_78 1.04714355 0.8207050 0.61015422
```

Now fit the overall model:

```
##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30354 -0.57688 -0.02224 0.56799 2.25723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06665 0.05637 -1.182 0.238
## vs 0.81142 0.06203 13.082 < 2e-16 ***
## vn 0.85393 0.09905 8.621 8.8e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8819 on 242 degrees of freedom
## Multiple R-squared: 0.6334, Adjusted R-squared: 0.6304
## F-statistic: 209.1 on 2 and 242 DF, p-value: < 2.2e-16
```

Again, both variables look significant. Even with regularization, the model is still overfit. Comparing the performance of the models on holdout data, you see that the regularized model does a little better than the naive model, but not as well as the correctly cross-validated model.

| | rmse | rsquared |
|---|---|---|
| ypred_naive | 1.303778 | 0.2311538 |
| ypred_crossval | 1.093955 | 0.4587089 |
| ypred_reg | 1.267648 | 0.2731756 |

Unfortunately, regularization is not enough to overcome nested model bias. Whenever you apply a y-aware process to your data, you have to use cross-validation methods (or a separate data set) at the next stage of your modeling pipeline.

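Without regularization, the optimal one-variable model for `y` in terms of a categorical variable with K levels $\{s_j\}$ is a set of K coefficients **v** such that

$$\sum_{i=1}^{N} (y_i - v_{x_i})^2$$

is minimized (N is the number of data points, and $v_{x_i}$ is the coefficient of the level that $x_i$ takes). L2-regularization adds a penalty to the magnitude of **v**, so that the goal is to minimize

$$\sum_{i=1}^{N} (y_i - v_{x_i})^2 + \lambda \sum_{j=1}^{K} v_j^2$$

where $\lambda$ is a known smoothing hyperparameter, usually set (in this case) to 1.

To minimize the above expression for a single coefficient $v_j$, take the derivative with respect to $v_j$ and set it to zero:

$$\sum_{x_i = s_j} -2\,(y_i - v_j) + 2 \lambda v_j = 0$$

$$-\sum_{x_i = s_j} y_i + (\mathrm{count}_j + \lambda)\, v_j = 0$$

where $\mathrm{count}_j$ is the number of times the level $s_j$ appears in the training data. Now solve for $v_j$:

$$v_j = \frac{\sum_{x_i = s_j} y_i}{\mathrm{count}_j + \lambda}$$

This is Laplace smoothing. Note that it is also the one-variable equivalent of ridge regression.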
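The closed form can be sanity-checked numerically. The sketch below, on made-up outcomes for a single level, minimizes the penalized sum of squares (squared error plus $\lambda v^2$) with base R’s `optimize()` and compares the result to the sum of the outcomes divided by the count plus $\lambda$. The vector `y_level` is hypothetical.

```r
# Hypothetical outcomes observed for one level of the categorical variable
y_level <- c(1.2, 0.7, -0.3, 2.1)
lambda <- 1

# Penalized objective for that level's coefficient v
objective <- function(v) sum((y_level - v)^2) + lambda * v^2

# Closed-form (Laplace smoothed) solution
closed_form <- sum(y_level) / (length(y_level) + lambda)

# Numerical minimizer of the same objective
numeric_min <- optimize(objective, interval = c(-10, 10), tol = 1e-8)$minimum

print(c(closed_form = closed_form, numeric_min = numeric_min))
```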

`rquery` is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of `R`’s `base::transform()`, or `dplyr`’s `dplyr::mutate()`, and uses a pipe in the style popularized in `R` with `magrittr`. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional `SQL` “window functions.” More on the background and context of `rquery` can be found here.

The `R`/`rquery` version of this introduction is here, and the `Python`/`data_algebra` version of this introduction is here.

In transform formulations data manipulation is written as transformations that produce new `data.frame`s, instead of as alterations of a primary data structure (as is the case with `data.table`). Transform systems *can* use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.

In `rquery`’s case the primary set of data operators is as follows:

- `drop_columns`
- `select_columns`
- `rename_columns`
- `select_rows`
- `order_rows`
- `extend`
- `project`
- `natural_join`
- `convert_records` (supplied by the `cdata` package).

These operations break into a small number of themes:

- Simple column operations (selecting and re-naming columns).
- Simple row operations (selecting and re-ordering rows).
- Creating new columns or replacing columns with new calculated values.
- Aggregating or summarizing data.
- Combining results between two `data.frame`s.
- General conversion of record layouts (supplied by the `cdata` package).

The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps. `rquery` supplies a high performance implementation of these methods that scales from in-memory scale up through big data scale (to just about anything that supplies a sufficiently powerful `SQL` interface, such as PostgreSQL, Apache Spark, or Google BigQuery).

We will work through simple examples/demonstrations of the `rquery` data manipulation operators.

The simple column operations are as follows.

- `drop_columns`
- `select_columns`
- `rename_columns`

These operations are easy to demonstrate.

We set up some simple data.

```
d <- data.frame(
x = c(1, 1, 2),
y = c(5, 4, 3),
z = c(6, 7, 8)
)
knitr::kable(d)
```

x | y | z |
---|---|---|

1 | 5 | 6 |

1 | 4 | 7 |

2 | 3 | 8 |

For example: `drop_columns` works as follows. `drop_columns` creates a new `data.frame` without certain columns.

```
library(rquery)
drop_columns(d, c('y', 'z'))
```

```
## x
## 1: 1
## 2: 1
## 3: 2
```

In all cases the first argument of a `rquery` operator is either the data to be processed, or an earlier `rquery` pipeline to be extended. We will talk about composing `rquery` operations after we work through examples of all of the basic operations.

We can write the above in piped notation (using the `wrapr` pipe in this case):

```
d %.>%
drop_columns(., c('y', 'z')) %.>%
knitr::kable(.)
```

x |
---|

1 |

1 |

2 |

Notice the first argument is an explicit “dot” in `wrapr` pipe notation.

`select_columns`’s action is also obvious from example.

```
d %.>%
select_columns(., c('x', 'y')) %.>%
knitr::kable(.)
```

x | y |
---|---|

1 | 5 |

1 | 4 |

2 | 3 |

The simple row operations are:

- `select_rows`
- `order_rows`

`select_rows` keeps the set of rows that meet a given predicate expression.

```
d %.>%
select_rows(., x == 1) %.>%
knitr::kable(.)
```

x | y | z |
---|---|---|

1 | 5 | 6 |

1 | 4 | 7 |

`order_rows` re-orders rows by a selection of column names (and allows reverse ordering by naming which columns to reverse in the optional `reverse` argument). Multiple columns can be selected in the order, each column breaking ties in the earlier comparisons.

```
d %.>%
order_rows(.,
c('x', 'y'),
reverse = 'x') %.>%
knitr::kable(.)
```

x | y | z |
---|---|---|

2 | 3 | 8 |

1 | 4 | 7 |

1 | 5 | 6 |

General `rquery` operations do not depend on row-order and are not guaranteed to preserve row-order, so if you do want to order rows you should make it the last step of your pipeline.

The important create or replace column operation is:

- `extend`

`extend` accepts arbitrary expressions to create new columns (or replace existing ones). For example:

```
d %.>%
extend(., zzz := y / x) %.>%
knitr::kable(.)
```

x | y | z | zzz |
---|---|---|---|

1 | 5 | 6 | 5.0 |

1 | 4 | 7 | 4.0 |

2 | 3 | 8 | 1.5 |

We can use `=` or `:=` for column assignment. In these examples we will use `:=` to keep column assignment clearly distinguishable from argument binding.

`extend` allows for very powerful per-group operations akin to what `SQL` calls “window functions”. When the optional `partitionby` argument is set to a vector of column names then aggregate calculations can be performed per-group. For example:

```
shift <- data.table::shift
d %.>%
extend(.,
max_y := max(y),
shift_z := shift(z),
row_number := row_number(),
cumsum_z := cumsum(z),
partitionby = 'x',
orderby = c('y', 'z')) %.>%
knitr::kable(.)
```

x | y | z | max_y | shift_z | row_number | cumsum_z |
---|---|---|---|---|---|---|

1 | 4 | 7 | 5 | NA | 1 | 7 |

1 | 5 | 6 | 5 | 7 | 2 | 13 |

2 | 3 | 8 | 3 | NA | 1 | 8 |

Notice the aggregates were performed per-partition (a set of rows with matching partition key values, specified by `partitionby`) and in the order determined by the `orderby` argument (without the `orderby` argument order is not guaranteed, so always set `orderby` for windowed operations that depend on row order!).
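For readers who want to see the partition and order semantics spelled out, the same four window calculations can be sketched in base R with `order()` and `ave()`. This is only an illustration of the semantics on the example data `d`; it is not how `rquery` implements `extend`.

```r
# The example data used above
d <- data.frame(
  x = c(1, 1, 2),
  y = c(5, 4, 3),
  z = c(6, 7, 8)
)

# orderby = c('y', 'z'), applied within each partition of x
d <- d[order(d$x, d$y, d$z), ]

# partitionby = 'x': ave() applies FUN separately to each group of x
d$max_y <- ave(d$y, d$x, FUN = max)
d$shift_z <- ave(d$z, d$x, FUN = function(v) c(NA, head(v, -1)))
d$row_number <- ave(seq_len(nrow(d)), d$x, FUN = seq_along)
d$cumsum_z <- ave(d$z, d$x, FUN = cumsum)
print(d)
```

Because the rows are sorted before the per-group functions run, the order-dependent columns (`shift_z`, `row_number`, `cumsum_z`) are well defined, which is the role `orderby` plays above.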

More on the window functions can be found here.

The main aggregation method for `rquery` is:

- `project`

`project` performs per-group calculations, and returns only the grouping columns (specified by `groupby`) and derived aggregates. For example:

```
d %.>%
project(.,
max_y := max(y),
count := n(),
groupby = 'x') %.>%
knitr::kable(.)
```

x | max_y | count |
---|---|---|

1 | 5 | 2 |

2 | 3 | 1 |

Notice we only get one row for each unique combination of the grouping variables. We can also aggregate into a single row by not specifying any `groupby` columns.

```
d %.>%
project(.,
max_y := max(y),
count := n()) %.>%
knitr::kable(.)
```

max_y | count |
---|---|

5 | 3 |

To combine multiple tables in `rquery` one uses what we call the `natural_join` operator. In the `rquery` `natural_join`, rows are matched by column keys and any two columns with the same name are *coalesced* (meaning the first table with a non-missing value supplies the answer). This is easiest to demonstrate with an example.

Let’s set up new example tables.

```
d_left <- data.frame(
k = c('a', 'a', 'b'),
x = c(1, NA, 3),
y = c(1, NA, NA),
stringsAsFactors = FALSE
)
knitr::kable(d_left)
```

k | x | y |
---|---|---|

a | 1 | 1 |

a | NA | NA |

b | 3 | NA |

```
d_right <- data.frame(
k = c('a', 'b', 'q'),
y = c(10, 20, 30),
stringsAsFactors = FALSE
)
knitr::kable(d_right)
```

k | y |
---|---|

a | 10 |

b | 20 |

q | 30 |

To perform a join we specify which set of columns are our row-matching conditions (using the `by` argument) and what type of join we want (using the `jointype` argument). For example we can use `jointype = 'LEFT'` to augment our `d_left` table with additional values from `d_right`.

```
natural_join(d_left, d_right,
by = 'k',
jointype = 'LEFT') %.>%
knitr::kable(.)
```

k | x | y |
---|---|---|

a | 1 | 1 |

a | NA | 10 |

b | 3 | 20 |

In a left-join (as above) if the right-table has unique keys then we get a table with the same structure as the left-table, but with more information per row. This is a very useful type of join in data science projects. Notice columns with matching names are coalesced into each other, which we interpret as “take the value from the left table, unless it is missing.”
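The coalescing rule can be illustrated with base R alone. The sketch below uses `merge()` for the left join and then fills missing left-table values of `y` from the right table; the `_right` suffix is our own naming choice, and this is an illustration of the semantics rather than `rquery`’s implementation.

```r
d_left <- data.frame(
  k = c('a', 'a', 'b'),
  x = c(1, NA, 3),
  y = c(1, NA, NA),
  stringsAsFactors = FALSE
)
d_right <- data.frame(
  k = c('a', 'b', 'q'),
  y = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Left join on k; the right table's y arrives as y_right
m <- merge(d_left, d_right, by = "k", all.x = TRUE,
           suffixes = c("", "_right"))

# Coalesce: keep the left value unless it is missing
m$y <- ifelse(is.na(m$y), m$y_right, m$y)
m$y_right <- NULL
print(m)
```

The key `q` appears only in the right table, so a left join drops it.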

Record transformation is “simple once you get it”. However, we suggest reading up on that as a separate topic here.

We could, of course, perform complicated data manipulation by sequencing `rquery` operations. For example to select one row with minimal `y` per-`x` group we could work in steps as follows.

```
. <- d
. <- extend(.,
row_number := row_number(),
partitionby = 'x',
orderby = c('y', 'z'))
. <- select_rows(.,
row_number == 1)
. <- drop_columns(.,
"row_number")
knitr::kable(.)
```

x | y | z |
---|---|---|

1 | 4 | 7 |

2 | 3 | 8 |

The above discipline has the advantage that it is easy to debug, as we can run line by line and inspect intermediate values. We can even use the Bizarro pipe to make this look like a pipeline of operations.

```
d ->.;
extend(.,
row_number := row_number(),
partitionby = 'x',
orderby = c('y', 'z')) ->.;
select_rows(.,
row_number == 1) ->.;
drop_columns(.,
"row_number") ->.;
knitr::kable(.)
```

x | y | z |
---|---|---|

1 | 4 | 7 |

2 | 3 | 8 |

Or we can use the `wrapr` pipe on the data, which we call “immediate mode” (for more on modes please see here).

```
d %.>%
extend(.,
row_number := row_number(),
partitionby = 'x',
orderby = c('y', 'z')) %.>%
select_rows(.,
row_number == 1) %.>%
drop_columns(.,
"row_number") %.>%
knitr::kable(.)
```

x | y | z |
---|---|---|

1 | 4 | 7 |

2 | 3 | 8 |

`rquery` operators can also act on `rquery` pipelines instead of acting on data. We can write our operations as follows:

```
ops <- local_td(d) %.>%
extend(.,
row_number := row_number(),
partitionby = 'x',
orderby = c('y', 'z')) %.>%
select_rows(.,
row_number == 1) %.>%
drop_columns(.,
"row_number")
cat(format(ops))
```

```
## mk_td("d", c(
## "x",
## "y",
## "z")) %.>%
## extend(.,
## row_number := row_number(),
## partitionby = c('x'),
## orderby = c('y', 'z'),
## reverse = c()) %.>%
## select_rows(.,
## row_number == 1) %.>%
## drop_columns(.,
## c('row_number'))
```

And we can re-use this pipeline, both on local data and to generate `SQL` to be run in remote databases. Applying this operator pipeline to our `data.frame` `d` is performed as follows.

```
d %.>%
ops %.>%
knitr::kable(.)
```

x | y | z |
---|---|---|

1 | 4 | 7 |

2 | 3 | 8 |

What we are trying to illustrate above: there is a continuum of notations possible between:

- Working over values with explicit intermediate variables.
- Working over values with a pipeline.
- Working over operators with a pipeline.

Being able to see these as all related gives some flexibility in decomposing problems into solutions. We have some more advanced notes on the differences in working modalities here and here.

`rquery` supplies a very teachable grammar of data manipulation based on Codd’s relational algebra and experience with pipelined data transforms (such as `base::transform()`, `dplyr`, and `data.table`).

For in-memory situations `rquery` uses `data.table` as the implementation provider (through the small adapter package `rqdatatable`) and is routinely faster than any other `R` data manipulation system *except* `data.table` itself.

For bigger than memory situations `rquery` can translate to any sufficiently powerful `SQL` dialect, allowing `rquery` pipelines to be executed on PostgreSQL, Apache Spark, or Google BigQuery.

In addition the `data_algebra` Python package supplies a nearly identical system for working with data in Python. The two systems can even share data manipulation code between each other (allowing very powerful R/Python inter-operation or helping port projects from one to the other).

If you ever wanted to see what Nina Zumel and John Mount are like when we have the help of editors, this book is your chance!

One thing I noticed in working through the galleys: it becomes easy to see why Dr. Nina Zumel is first author.

2/3rds of the book is her work.

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.

This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design):

(Note: Nina Zumel leads on the course design, which is the heavy lifting; John Mount just got tasked to be the one delivering it.)

Zumel, Mount, *Practical Data Science with R, 2nd Edition* is coming out in print *very* soon. Here is a discount code to help you get a good deal on the book:

Take 37% off Practical Data Science with R, Second Edition by entering fcczumel3 into the discount code box at checkout at manning.com.

John Mount at the LinkedIn campus

Nina Zumel designed most of the material, and John Mount has been delivering it and bringing her feedback. We’ve just started our 9th cohort. We adjust the course each time. Our students teach us a lot about how one thinks about data science. We bring that forward to each round of the course.

Roughly the goal is the following.

If every engineer, product manager, and project manager had some hands-on experience with data science and AI (deep neural nets), then they would be more likely both to think of using these techniques in their work *and* to introduce the instrumentation required to have useful data in the first place.

This will have huge downstream benefits for LinkedIn. Our group is thrilled to be a part of this.

We are looking for more companies that want an on-site data science intensive for their teams (either in Python or in R).

`vtreat`’s cross validation works, which I want to share here.

`vtreat` is a system that makes data preparation for machine learning a “one-liner” (available in `R` or available in `Python`). We have a set of starting off points here. These documents describe what `vtreat` does for you; you just find the one that matches your task and you should have a good start for solving data science problems in `R` or in `Python`.

The latest documentation is a bit about how `vtreat` works, and how to control some of the details of the work it is doing for you.

The new documentation is:

Please give one of the examples a try, and consider adding `vtreat` to your data science workflow.

To understand computations in R, two slogans are helpful:

- Everything that exists is an object.
- Everything that happens is a function call.

John Chambers

In R, the “`[`” array access operator is a function call. And it is one a user can re-bind to a new effect of their own choosing.
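A small sketch of what “`[` is a function call” means in practice: the bracket notation and an explicit call to the function named `[` give the same result, and the function can be passed around like any other. The vector `v` is just example data.

```r
v <- c(10, 20, 30)

# Bracket notation and an explicit function call are the same operation
a <- v[2]
b <- `[`(v, 2)

# Because `[` is an ordinary function, it can be handed to sapply()
firsts <- sapply(list(1:3, 4:6), `[`, 1)

print(list(a = a, b = b, firsts = firsts))
```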

Let’s see what sort of mischief we can get into using this capability.

Yeah, yeah, but your scientists were so preoccupied with whether or not they could that they didn’t stop to think if they should.

Jurassic Park (1993) – Jeff Goldblum as Dr. Ian Malcolm

How about defining a new `[`-based function call notation? The idea is: we could write `sin[5]` in place of `sin(5)`, thus unifying the notations for function call and array access. Some languages do in fact have unified function call and array access (though often using “`(`” for both). Example languages include Fortran and Matlab.

Let’s add R to the list of such languages. We could define `[` to have R-traditional lazy argument semantics.

```
# lazy argument version
`[` <- function(x, ...) {
  args <- as.list(substitute(alist(...)))
  args <- do.call(base::`[`, args = list(args, -1))
  if(is.function(x)) {
    return(do.call(x, args = args))
  }
  return(do.call(base::`[`, args = c(list(x), args)))
}
```

Or we could define `[` to have eager argument semantics.

```
# eager argument version
`[` <- function(x, ...) {
  args <- list(...)
  if(is.function(x)) {
    return(do.call(x, args = args))
  }
  return(do.call(base::`[`, args = c(list(x), args)))
}
```

Let’s try the eager version.

```
sin[5]
#> [1] -0.9589243
c(10,20)[2]
#> [1] 20
c(1,2)[-2]
#> [1] 1
d = data.frame(x= 1:5, y= 2)
d[2, 'y', drop = FALSE]
#>   y
#> 2 2
paste0['1', 'c']
#> [1] "1c"
```

One of the advantages of eager evaluation is: if you know a function is in fact going to use all of its arguments, it often makes sense to compute them all ahead of time. For example: we don’t want a function that runs an expensive step on its first argument to then error-out due to issues that could have been addressed in its second argument.

Notice below how with lazy evaluation it takes 100 seconds to notice the second argument to `f(,)` is bad. With eager evaluation we detect this instantly.

```
f <- function(v1, v2) {
  Sys.sleep(v1)  # simulate expensive step
  v2  # oops, inexpensive next step fails
}
date()
#> [1] "Wed Oct 2 11:14:06 2019"
f(100, stop())
#> Error in f(100, stop()):
date()
#> [1] "Wed Oct 2 11:15:46 2019"
```

With eager evaluation we detect the issue much quicker.

```
date()
#> [1] "Wed Oct 2 11:15:46 2019"
f[100, stop()]
#> Error in f[100, stop()]:
date()
#> [1] "Wed Oct 2 11:15:46 2019"
```

Eager languages are more common. Examples include Python, C, C++, Java, and many more. So students are more likely to be already familiar with eager evaluation. Eager languages are also typically considered easier to debug, as it is much easier to infer evaluation order from the source code.

Lazy languages, such as Haskell and R, can save the time wasted in computing values of unused arguments. They also allow users to introduce their own new evaluation control structures, and therefore tend to be very user extensible.
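The difference is easy to see in plain R, without re-binding `[`. In the sketch below (function names are ours), the lazy function never touches its second argument, so the `stop()` passed there is never run; the second function uses `force()` to evaluate the argument up front, which is one way to opt in to eager behavior.

```r
# Lazy: b is never used, so its expression is never evaluated
lazy_first <- function(a, b) a

# Eager by request: force(b) evaluates b before anything else happens
eager_first <- function(a, b) {
  force(b)
  a
}

r_lazy <- lazy_first(1, stop("never evaluated"))
r_eager <- tryCatch(
  eager_first(1, stop("evaluated")),
  error = function(e) "error"
)
print(list(lazy = r_lazy, eager = r_eager))
```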

`Python` `vtreat` to prepare data for multinomial classification mode. And I have finally finished porting the documentation to `R` `vtreat`. So we now have good introductions on how to use `vtreat` to prepare data for the common tasks of:

- **Regression**: `R` regression example, `Python` regression example.
- **Classification**: `R` classification example, `Python` classification example.
- **Unsupervised data preparation**: `R` unsupervised example, `Python` unsupervised example.
- **Multinomial classification**: `R` multinomial classification example, `Python` multinomial classification example.

That is now 8 introductions to start with. To use `vtreat` you only have to work through *one* introduction (the one helping with the task you have at hand in the language you are using).

As I have said before:

- `vtreat` helps with project blocking issues commonly seen in real world data: missing values, re-coding categorical variables, and dealing with high cardinality categorical variables.
- If you aren’t using a tool like `vtreat` in your data science projects: you are really missing out (and making more work for yourself).

```
# build an "ideal" linear process.
set.seed(34524)
N = 100
x1 = runif(N)
x2 = runif(N)
noise = 0.25*rnorm(N)
y = x1 + x2 + noise
df = data.frame(x1=x1, x2=x2, y=y)
# Fit a linear regression model
model = lm(y~x1+x2, data=df)
summary(model)
```

```
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73508 -0.16632 0.02228 0.19501 0.55190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.16706 0.07111 2.349 0.0208 *
## x1 0.90047 0.09435 9.544 1.30e-15 ***
## x2 0.81444 0.09288 8.769 6.07e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2662 on 97 degrees of freedom
## Multiple R-squared: 0.6248, Adjusted R-squared: 0.6171
## F-statistic: 80.78 on 2 and 97 DF, p-value: < 2.2e-16
```

```
# plot it
library(ggplot2)
df$pred = predict(model, newdata=df)
df$residual = with(df, y-pred)
# standard residual plot
ggplot(df, aes(x=pred, y=residual)) +
geom_point(alpha=0.5) + geom_hline(yintercept=0, color="red") +
geom_smooth(se=FALSE) +
ggtitle("Standard residual plot",
subtitle = "linear model and process")
```

In the above plot, we’re plotting the residuals as a function of model prediction, and comparing them to the line `y = 0`, using a smoothing curve through the residuals. The idea is that for a well-fit model, the smoothing curve should approximately lie on the line `y = 0`. This is true not only for linear models, but for any model that captures most of the explainable variance, and for which the unexplainable variance (the noise) is IID and zero mean.

If the residuals aren’t zero mean independently of the model’s predictions, then either you are missing some explanatory variables, or your model does not have the correct structure or an appropriate inductive bias. A simple example of the second case is trying to fit a linear model to a process where the outcome is quadratically (or otherwise non-linearly) related to the inputs. To see this, let’s make an example quadratic system while deliberately failing to supply that structure to the model.

```
# a simple quadratic example
x3 = runif(N)
qf = data.frame(x1=x1, x2=x2, x3=x3)
qf$y = x1 + x2 + 2*x3^2 + 0.25*noise
# Fit a linear regression model
model2 = lm(y~x1+x2+x3, data=qf)
# summary(model2)
qf$pred = predict(model2, newdata=qf)
qf$residual = with(qf, y-pred)
ggplot(qf, aes(x=pred, y=residual)) +
  geom_point(alpha=0.5) + geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Standard residual plot",
          subtitle = "linear model, quadratic process")
```

In this case, the smoothing line on the residuals doesn’t approximate the line `y = 0`; when the model predicts a value in the range 1 to about 2.3, it tends to be overpredicting; otherwise, it tends to underpredict. This is an instance of a pathology called “structure in the residuals.”
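To see that the structure really comes from the missing quadratic term, here is a hedged sketch using a simplified one-variable variant of the example (an assumption made to keep the pattern easy to read; it is not the three-variable system above). The helper `binned_residual_means` is a hypothetical name introduced for illustration. It averages residuals within equal-count bins of the prediction, for a model without the squared term and for one with it.

```r
# Sketch: structure in the residuals, and what happens once the model has the right form.
# One-variable simplification; seed, sample size, and noise level are assumptions.
set.seed(2019)
n = 1000
d = data.frame(x = runif(n))
d$y = 2 * d$x^2 + 0.25 * rnorm(n)     # quadratic process plus IID zero-mean noise

# Hypothetical helper: mean residual within equal-count bins of the prediction.
binned_residual_means = function(fit, data, nbins = 5) {
  pred = predict(fit, newdata = data)
  residual = data$y - pred
  breaks = quantile(pred, probs = seq(0, 1, length.out = nbins + 1))
  bins = cut(pred, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  tapply(residual, bins, mean)
}

fit_linear = lm(y ~ x, data = d)               # missing the quadratic structure
fit_quadratic = lm(y ~ x + I(x^2), data = d)   # structure supplied

means_linear = binned_residual_means(fit_linear, d)
means_quadratic = binned_residual_means(fit_quadratic, d)
print(round(rbind(linear = means_linear, quadratic = means_quadratic), 3))
```

In this setup the linear model underpredicts in the lowest and highest prediction bins and overpredicts in the middle one, while the bin averages for the model with the squared term all stay near zero.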

What happens if you erroneously plot the residuals versus the true outcome, instead of the predictions? Let’s try this with the model for the linear process (which we know is a well-fit model):

```
# the wrong residual graph
ggplot(df, aes(x=y, y=residual)) +
  geom_point(alpha=0.5) + geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Incorrect residual plot",
          subtitle = "linear model and process")
```

If you make this plot when you meant to make the other, you will give yourself a nasty shock. Plotting residuals versus the outcome will *always* look more or less like the above graph. You might think that for a good model, the outcome and the prediction are close to each other, so the residual graphs should look about the same no matter which quantity you plot on the x-axis, right? Why do they look so different?

One reason that the proper residual graph (for a well fit model) should smooth out to the line `y = 0` is known as *reversion to mediocrity*, or *regression to the mean*.

Imagine that you have an ideal process that always produces a single value *y*. You don’t actually observe this “true value”; instead, what you observe is *y* plus (IID, zero mean) noise. You can build a “model” for this process that predicts the mean of the observations, in this case the value 0.1033149. Then you can calculate the residuals of your “model” in the usual way.

When you plot the residuals as a function of the prediction, all the data points fall at the same horizontal coordinate of the graph, centered around zero, and approximately equally distributed between positive and negative. The “smoothing line” through this graph is simply the point (0.1033149, 0) – that is, the graph is centered at zero.

On the other hand, if you plot the residuals as a function of the *observed outcome*, all the observations will be sorted so that the observations with positive noise are to the right of the observations with negative noise, and the smoothing line through the graph no longer looks like the line `y = 0`.
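The constant-process thought experiment is easy to simulate. This is a minimal sketch under assumed values: the "true value" of 0.1, the seed, and the sample size are hypothetical choices, so the observed mean will not match the number quoted above.

```r
# Sketch: a constant process observed with noise, "modeled" by the observed mean.
# The true value, seed, and sample size are hypothetical choices.
set.seed(2019)
n = 100
true_value = 0.1
obs = true_value + rnorm(n)      # what you observe: true value plus IID zero-mean noise

pred = rep(mean(obs), n)         # the "model": always predict the mean of the observations
residual = obs - pred

print(length(unique(pred)))      # 1: every point shares one x-coordinate in the correct plot
print(mean(residual))            # zero, up to floating-point error
print(cor(residual, obs))        # 1, up to floating-point error
```

Because the residual here is just the observation shifted by a constant, plotting residual against observation gives a perfect diagonal line, even though the "model" is as good as it can be.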

For a process that varies as a function of the input, you can think of the prediction corresponding to an input `X` as the mean of all the observations corresponding to `X`, and the idea is the same.
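For the special case of an ordinary least squares fit with an intercept, evaluated on its own training data, the effect can be stated exactly. Since `y = pred + residual` and the residuals are uncorrelated with the predictions, the correlation between residual and outcome works out to `sqrt(1 - R^2)`, which is positive for any imperfect fit. The sketch below checks this on simulated data (seed, sample size, and noise level are assumptions); other model types need not obey the identity exactly.

```r
# Sketch: residuals are uncorrelated with predictions, but not with outcomes.
# Holds exactly for least squares with an intercept, on the training data.
set.seed(2019)
n = 500
d = data.frame(x1 = runif(n), x2 = runif(n))
d$y = d$x1 + d$x2 + 0.25 * rnorm(n)

fit = lm(y ~ x1 + x2, data = d)
pred = fitted(fit)
residual = residuals(fit)
r_squared = summary(fit)$r.squared

print(cor(residual, pred))     # zero, up to floating-point error
print(cor(residual, d$y))      # agrees with the next line
print(sqrt(1 - r_squared))
```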

Incidentally, this regression to the mean is also why model predictions tend to have less range than the original training data.

Sometimes instead of plotting residuals versus the predictions, I plot observations versus predictions. In this case, you want to check that the predictions lie approximately on the line `y = x`. This isn’t a standard diagnostic plot, but it does give a better sense of the magnitude of the errors relative to the magnitudes of the outcomes. Again, the important thing to remember is that the *predictions go on the x-axis*.

Here’s the correct plot:

```
# standard prediction plot
ggplot(df, aes(x=pred, y=y)) +
  geom_point(alpha=0.5) + geom_abline(color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Standard prediction plot")
```

And here’s the wrong plot:

```
# the "wrong" way
ggplot(df, aes(x=y, y=pred)) +
  geom_point(alpha=0.5) + geom_abline(color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Incorrect prediction plot")
```

Notice how the wrong plot again seems to show pathological structure where none exists.
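One way to quantify the difference between the two orientations is to fit a straight line through each plot. The sketch below assumes an ordinary least squares model with an intercept, evaluated on its own (simulated) training data; the seed, sample size, and noise level are arbitrary. Under those assumptions, regressing the outcome on the prediction gives intercept 0 and slope 1, while regressing the prediction on the outcome gives a slope equal to the model's R-squared.

```r
# Sketch: trend lines through the correct and the incorrect prediction plots.
# Exact for least squares with an intercept, on the training data.
set.seed(2019)
n = 500
d = data.frame(x1 = runif(n), x2 = runif(n))
d$y = d$x1 + d$x2 + 0.25 * rnorm(n)

fit = lm(y ~ x1 + x2, data = d)
d$pred = fitted(fit)
r_squared = summary(fit)$r.squared

# Correct orientation: predictions on the x-axis.
coef_correct = coef(lm(y ~ pred, data = d))
# Wrong orientation: observations on the x-axis.
coef_wrong = coef(lm(pred ~ y, data = d))

print(coef_correct)   # intercept 0 and slope 1, up to floating-point error
print(coef_wrong)     # slope equals r_squared
print(r_squared)
```

So in the wrong orientation the trend is always flatter than the line `y = x` unless the fit is perfect, which is the same shrinkage that gives predictions less range than the training outcomes.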

The above examples show why you should always take care to plot your model diagnostics as functions of the *predictions* and not of the *observations*. Most students have heard this already, but we feel that demonstrating why will be more memorable than simply saying “make it so.”

In this note we describe some great tools for working with such data.

For an example: consider the KDD 2009 contest data. Though this data is structured, it is not immediately compatible with a number of high-quality machine learning packages (such as xgboost). As we see in the following Python extract, xgboost raises an exception on this data due to the issues we raised above (non-numeric column types, and also missing values):

```
fitter = xgboost.XGBClassifier(n_estimators=10, max_depth=3, objective='binary:logistic')
try:
    fitter.fit(d_train, churn_train)
except Exception as ex:
    print(ex)
# DataFrame.dtypes for data must be int, float or bool.
# Did not expect the data types in fields Var191, Var192, Var193, ...
```

vtreat is a family of packages (in R and in Python) to prepare structured data for machine learning or data science projects in a statistically sound manner. The goal of vtreat is to transform arbitrary structured data into “clean” pure numeric data. This “clean” data has no missing values, and retains most of the information relating explanatory variables to the dependent variable to be predicted.

The vtreat principles include:

- do a very good job
- work fast, and at production scale
- minimize interference: leave as many opportunities open to the user and downstream modeling software as possible.

The last point (minimizing interference / maximizing opportunity) is a subtle but important one. vtreat does not commit the user to any one language (it is currently available in both R and Python, leaving the choice of language to the user) and tries to keep its dependencies low to moderate (for instance, it does not bring in a deep learning system, which leaves the choice of such systems open for later steps).

The overall intent is that by automating the domain independent steps in data preparation we leave the analyst with much more time to work on even more critical domain dependent steps.

In all cases, designing a vtreat transform should be a one-liner. Later application of the transform should also be a one-liner (the “one line” is `prepare()` in R, and `.transform()` in Python).
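As a rough illustration of the two one-liners in R, here is a hedged sketch for a numeric outcome. It assumes the `vtreat` package is installed (the code is skipped otherwise) and uses `vtreat::designTreatmentsN()` to design the transform and `vtreat::prepare()` to apply it. The data, the column names `x_num`, `x_cat`, and `y`, and the helper `make_data` are invented for the example; consult the vtreat documentation for the recommended workflow on real projects.

```r
# Sketch: design a vtreat transform on one data set, apply it to another.
# All data and names below are made up for illustration.
if (requireNamespace("vtreat", quietly = TRUE)) {
  set.seed(2019)
  make_data = function(n) {
    d = data.frame(
      x_num = rnorm(n),
      x_cat = sample(c("a", "b", "c"), n, replace = TRUE),
      stringsAsFactors = FALSE
    )
    d$y = d$x_num + ifelse(d$x_cat == "a", 1, 0) + 0.1 * rnorm(n)
    d$x_num[sample(n, n %/% 10)] = NA   # inject missing values after computing y
    d$x_cat[sample(n, n %/% 10)] = NA
    d
  }
  d_train = make_data(200)
  d_new = make_data(50)

  # Design step: one line.
  plan = vtreat::designTreatmentsN(d_train, c("x_num", "x_cat"), "y", verbose = FALSE)
  # Application step: one line, here on data that was not used to design the plan.
  treated = vtreat::prepare(plan, d_new)
  print(head(treated))
}
```

The treated frame should contain only numeric columns with no missing values, which is the form that packages such as xgboost expect.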

To trust “one liners” one needs a good discussion of the theory behind them (both for learning and to cite), and worked examples.

The vtreat theory can be found here: <arXiv:1611.09477>. This helps you both learn how the vtreat transforms work, and also is itself a quick way to document them when used in your own work (such as in a “methods” section).

And we have a growing family of documentation and simple examples, organized by task, here:

- **Regression**: `R` regression example, `Python` regression example.
- **Classification**: `R` classification example, `Python` classification example.
- **Unsupervised tasks**: `R` unsupervised example, `Python` unsupervised example.
- **Multinomial classification**: `R` multinomial classification example, `Python` multinomial classification example.

Overall vtreat R documentation (including how to install) can be found here, and vtreat Python documentation (including how to install) here.

As we have said before: if you aren’t using something like vtreat in your data science projects, you are really missing out (and making more work for yourself).

We really hope you try vtreat for one of your projects. We think you will have a great experience.

]]>