Slides from my PyData2019 data_algebra lightning talk are here.
Practical Data Science with R, 2nd Edition: Introduction Video
Nina and I have prepared a quick introduction video for Practical Data Science with R, 2nd Edition.
We are really proud of both editions of the book. This book can help an R user directly experience the data science style of working with data and machine learning techniques.
The book is available now at:
 Directly from the publisher Manning, now (often with significant discounts!).
 Via preorder from Amazon.com.

Get a signed copy from us! We will be giving away some e-copies and a few signed physical copies at various conferences and meetups (for example, at PyData LA 2019).
Please check it out!
Nina Zumel and John Mount speaking on vtreat at PyData LA 2019
As we have announced before, we have ported the R version of vtreat to a new Python version of vtreat.
Our latest news is: we are speaking about the Python version at PyData LA 2019 (Thursday 10:50 AM–11:35 AM in Track 2 Room).
Continue reading Nina Zumel and John Mount speaking on vtreat at PyData LA 2019
Practical Data Science with R, 2nd Edition, IS OUT!!!!!!!
Practical Data Science with R, 2nd Edition author Dr. Nina Zumel, with a fresh author’s copy of her book!
When Cross-Validation is More Powerful than Regularization
Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller coefficients are less sensitive to idiosyncrasies in the training data, and hence less likely to overfit.
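For instance, in R a regularized linear fit can be produced with the glmnet package. The following is only a minimal sketch, not the article's code: the model matrix X, the outcome y, and the penalty value are assumed example inputs.
library(glmnet)
# sketch: ridge (alpha = 0) and lasso (alpha = 1) regularized linear fits
# X (a numeric model matrix), y (the outcome), and lambda = 0.1 are assumed example values
ridge_fit <- glmnet(X, y, alpha = 0, lambda = 0.1)
lasso_fit <- glmnet(X, y, alpha = 1, lambda = 0.1)
coef(ridge_fit)  # coefficients are shrunk toward zero relative to lm()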
Cross-validation is a way to safely reuse training data in nested model situations. This includes both the case of setting hyperparameters before fitting a model, and the case of fitting models (let’s call them base learners) that are then used as variables in downstream models, as shown in Figure 1. In either situation, using the same data twice can lead to models that are over-tuned to idiosyncrasies in the training data, and more likely to overfit.
In general, if any stage of your modeling pipeline involves looking at the outcome (we’ll call that a y-aware stage), you cannot directly use the same data in the following stage of the pipeline. If you have enough data, you can use separate data in each stage of the modeling process (for example, one set of data to learn hyperparameters, another set of data to train the model that uses those hyperparameters). Otherwise, you should use cross-validation to reduce the nested model bias.
Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation?
The short answer: no, you shouldn’t. But as we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.”
A simple example
Suppose you have a system with two categorical variables. The variable x_s has 10 levels, and the variable x_n has 100 levels. The outcome y is a function of x_s, but not of x_n (but you, the analyst building the model, don’t know this). Here’s the head of the data.
## x_s x_n y
## 2 s_10 n_72 0.34228110
## 3 s_01 n_09 0.03805102
## 4 s_03 n_18 0.92145960
## 9 s_08 n_43 1.77069352
## 10 s_08 n_17 0.51992928
## 11 s_01 n_78 1.04714355
With most modeling techniques, a categorical variable with K levels is equivalent to K or K−1 numerical (indicator or dummy) variables, so this system actually has around 110 variables. In real-life situations where a data scientist is working with high-cardinality categorical variables, or with a lot of categorical variables, the number of actual variables can begin to swamp the size of training data, and/or bog down the machine learning algorithm.
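To see the expansion concretely, here is a small sketch; the factor below is made-up illustration data, not part of the article's example.
# a categorical variable expands into indicator (dummy) columns when modeling
f <- factor(c("a", "b", "c", "a"))
model.matrix(~ f)      # intercept plus K-1 indicator columns
model.matrix(~ 0 + f)  # K indicator columns, one per level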
One way to deal with these issues is to represent each categorical variable by a single variable model (or base learner), and then use the predictions of those base learners as the inputs to a bigger model. So instead of fitting a model with 110 indicator variables, you can fit a model with two numerical variables. This is a simple example of nested models.
We refer to this procedure as “impact coding,” and it is one of the data treatments available in the vtreat package, specifically for dealing with high-cardinality categorical variables. But for now, let’s go back to the original problem.
The naive way
For this simple example, you might try representing each variable as the expected value of y − mean(y) in the training data, conditioned on the variable’s level. So the ith “coefficient” of the one-variable model would be given by:
v_{i} = E[y | x = s_{i}] − E[y]
where s_{i} is the ith level. Let’s show this with the variable x_s (the code for all the examples in this article is here):
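(As a rough sketch of what such a calculation looks like in base R, using the training frame dtrain with columns x_s and y that appears later in the article:)
# sketch: per-level conditional means of y, minus the grand mean
grand_mean <- mean(dtrain$y)
sm_s <- aggregate(y ~ x_s, data = dtrain, FUN = mean)
names(sm_s) <- c("x_s", "meany")
sm_s$coeff <- sm_s$meany - grand_mean
sm_s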
## x_s meany coeff
## 1 s_01 0.7998263 0.8503282
## 2 s_02 1.3815640 1.3310621
## 3 s_03 0.7928449 0.7423430
## 4 s_04 0.8245088 0.7740069
## 5 s_05 0.7547054 0.8052073
## 6 s_06 0.1564710 0.2069728
## 7 s_07 1.1747557 1.1242539
## 8 s_08 1.3520153 1.4025171
## 9 s_09 1.5789785 1.6294804
## 10 s_10 0.7313895 0.6808876
In other words, whenever the value of x_s is s_01, the one-variable model vs returns the value 0.8503282, and so on. If you do this for both variables, you get a training set that looks like this:
## x_s x_n y vs vn
## 2 s_10 n_72 0.34228110 0.6808876 0.64754957
## 3 s_01 n_09 0.03805102 0.8503282 0.54991135
## 4 s_03 n_18 0.92145960 0.7423430 0.01923877
## 9 s_08 n_43 1.77069352 1.4025171 1.90394159
## 10 s_08 n_17 0.51992928 1.4025171 0.26448341
## 11 s_01 n_78 1.04714355 0.8503282 0.70342961
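(A treated frame like the one above could be assembled by looking the per-level coefficients up for each row, roughly as in this sketch; sm_s and sm_n are assumed per-variable coefficient tables like the one sketched earlier.)
# sketch: attach the one-variable model outputs as new columns
dtrain_treated <- dtrain
dtrain_treated$vs <- sm_s$coeff[match(dtrain$x_s, sm_s$x_s)]
dtrain_treated$vn <- sm_n$coeff[match(dtrain$x_n, sm_n$x_n)]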
Now fit a linear model for y as a function of vs and vn.
model_raw = lm(y ~ vs + vn,
               data = dtrain_treated)
summary(model_raw)
##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.33068 -0.57106 0.00342 0.52488 2.25472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.05050 0.05597 0.902 0.368
## vs 0.77259 0.05940 13.006 <2e-16 ***
## vn 0.61201 0.06906 8.862 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8761 on 242 degrees of freedom
## Multiple R-squared: 0.6382, Adjusted R-squared: 0.6352
## F-statistic: 213.5 on 2 and 242 DF, p-value: < 2.2e-16
Note that this model gives significant coefficients to both vs and vn, even though y is not a function of x_n (or vn). Because you used the same data to fit the one-variable base learners and to fit the larger model, you have overfit.
The right way: cross-validation
The correct way to impact code (or to nest models in general) is to use cross-validation techniques. Impact coding with cross-validation is already implemented in vtreat; note the similarity between this diagram and Figure 1 above.
The training data is used both to fit the base learners (as we did above) and also to create a data frame of cross-validated base learner predictions (called a cross-frame in vtreat). This cross-frame is used to train the overall model. Let’s fit the correct nested model, using vtreat.
library(vtreat)
library(wrapr)
xframeResults = mkCrossFrameNExperiment(dtrain,
                                        qc(x_s, x_n), "y",
                                        codeRestriction = qc(catN),
                                        verbose = FALSE)
# the plan uses the one-variable models to treat data
treatmentPlan = xframeResults$treatments
# the cross-frame
dtrain_treated = xframeResults$crossFrame
head(dtrain_treated)
## x_s_catN x_n_catN y
## 1 0.6337889 0.91241547 0.34228110
## 2 0.8342227 0.82874089 0.03805102
## 3 0.7020597 0.18198634 0.92145960
## 4 1.3983175 1.99197404 1.77069352
## 5 1.3983175 0.11679580 0.51992928
## 6 0.8342227 0.06421659 1.04714355
variables = setdiff(colnames(dtrain_treated), "y")
model_X = lm(mk_formula("y", variables),
             data = dtrain_treated)
summary(model_X)
##
## Call:
## lm(formula = mk_formula("y", variables), data = dtrain_treated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2157 -0.7343 0.0225 0.7483 2.9639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04169 0.06745 0.618 0.537
## x_s_catN 0.92968 0.06344 14.656 <2e-16 ***
## x_n_catN 0.10204 0.06654 1.533 0.126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 242 degrees of freedom
## Multiple R-squared: 0.4753, Adjusted R-squared: 0.471
## F-statistic: 109.6 on 2 and 242 DF, p-value: < 2.2e-16
This model correctly determines that x_n (and its one-variable model x_n_catN) does not affect the outcome. We can compare the performance of this model to the naive model on holdout data.
               rmse      R-squared
ypred_naive    1.303778  0.2311538
ypred_crossval 1.093955  0.4587089
The correct model has a much smaller root-mean-squared error and a much larger R-squared than the naive model when applied to new data.
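(Holdout metrics like these can be computed along the lines of the following sketch; dtest is an assumed holdout data frame with the same columns as dtrain, and vtreat’s prepare() applies the stored treatment plan to new data.)
# sketch: score the cross-validated model on holdout data
dtest_treated <- prepare(treatmentPlan, dtest)
pred <- predict(model_X, newdata = dtest_treated)
rmse <- sqrt(mean((dtest$y - pred)^2))
rsq  <- 1 - sum((dtest$y - pred)^2) / sum((dtest$y - mean(dtest$y))^2)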
An attempted alternative: regularized models
But cross-validation is so complicated. Can’t we just regularize? As we’ll show in the appendix of this article, for a one-variable model, L2-regularization is simply Laplace smoothing. Again, we’ll represent each “coefficient” of the one-variable model as the Laplace-smoothed value minus the grand mean:
v_{i} = ( ∑_{x_j = s_i} y_{j} ) / (count_{i} + λ) − E[y]
where count_{i} is the frequency of s_{i} in the training data, and λ is the smoothing parameter (usually 1). If λ = 1, then the first term on the right is just adding one to the frequency of the level and then taking the “adjusted conditional mean” of y.
Again, let’s show this for the variable x_s.
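(A table like the one below can be built roughly as follows; the data frame name dtrain and λ = 1 are assumptions carried over from the earlier sketches.)
# sketch: Laplace-smoothed (lambda = 1) coefficients for x_s
lambda <- 1
grand_mean <- mean(dtrain$y)
sums   <- aggregate(y ~ x_s, data = dtrain, FUN = sum)
counts <- aggregate(y ~ x_s, data = dtrain, FUN = length)
sm_s <- data.frame(x_s = sums$x_s, sum_y = sums$y, count_y = counts$y, grandmean = grand_mean)
sm_s$vs <- sm_s$sum_y / (sm_s$count_y + lambda) - grand_mean
sm_s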
## x_s sum_y count_y grandmean vs
## 1 s_01 20.795484 26 0.05050187 0.8207050
## 2 s_02 37.302227 27 0.05050187 1.2817205
## 3 s_03 22.199656 28 0.05050187 0.7150035
## 4 s_04 14.016649 17 0.05050187 0.7282009
## 5 s_05 19.622340 26 0.05050187 0.7772552
## 6 s_06 3.129419 20 0.05050187 0.1995218
## 7 s_07 35.242672 30 0.05050187 1.0863585
## 8 s_08 36.504412 27 0.05050187 1.3542309
## 9 s_09 33.158549 21 0.05050187 1.5577086
## 10 s_10 16.821957 23 0.05050187 0.6504130
After applying the one-variable models for x_s and x_n to the data, the head of the resulting treated data looks like this:
## x_s x_n y vs vn
## 2 s_10 n_72 0.34228110 0.6504130 0.44853367
## 3 s_01 n_09 0.03805102 0.8207050 0.42505898
## 4 s_03 n_18 0.92145960 0.7150035 0.02370493
## 9 s_08 n_43 1.77069352 1.3542309 1.28612835
## 10 s_08 n_17 0.51992928 1.3542309 0.21098803
## 11 s_01 n_78 1.04714355 0.8207050 0.61015422
Now fit the overall model:
##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30354 -0.57688 0.02224 0.56799 2.25723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.06665 0.05637 1.182 0.238
## vs 0.81142 0.06203 13.082 < 2e-16 ***
## vn 0.85393 0.09905 8.621 8.8e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8819 on 242 degrees of freedom
## Multiple R-squared: 0.6334, Adjusted R-squared: 0.6304
## F-statistic: 209.1 on 2 and 242 DF, p-value: < 2.2e-16
Again, both variables look significant. Even with regularization, the model is still overfit. Comparing the performance of the models on holdout data, you see that the regularized model does a little better than the naive model, but not as well as the correctly cross-validated model.
               rmse      R-squared
ypred_naive    1.303778  0.2311538
ypred_crossval 1.093955  0.4587089
ypred_reg      1.267648  0.2731756
The Moral of the Story
Unfortunately, regularization is not enough to overcome nested model bias. Whenever you apply a y-aware process to your data, you have to use cross-validation methods (or a separate data set) at the next stage of your modeling pipeline.
Appendix: Derivation of Laplace Smoothing as L2-Regularization
Without regularization, the optimal one-variable model for y in terms of a categorical variable with K levels {s_{j}} is a set of K coefficients v such that

∑_{i=1}^{N} (y_{i} − v_{j(i)})²

is minimized, where N is the number of data points and j(i) denotes the level taken by the categorical variable on the ith data point. L2-regularization adds a penalty on the magnitude of v, so that the goal is to minimize

∑_{i=1}^{N} (y_{i} − v_{j(i)})² + λ ∑_{j=1}^{K} v_{j}²

where λ is a known smoothing hyperparameter, usually set (in this case) to 1.
To minimize the above expression for a single coefficient v_{j}, take the derivative with respect to v_{j} and set it to zero:

−2 ∑_{x_i = s_j} (y_{i} − v_{j}) + 2 λ v_{j} = 0
(count_{j} + λ) v_{j} = ∑_{x_i = s_j} y_{i}

where count_{j} is the number of times the level s_{j} appears in the training data. Now solve for v_{j}:

v_{j} = ( ∑_{x_i = s_j} y_{i} ) / (count_{j} + λ)

This is Laplace smoothing. Note that it is also the one-variable equivalent of ridge regression.
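(One way to convince yourself of that last claim: compute the ridge solution over the indicator columns of x_s directly and compare it with the Laplace-smoothed means. This is only a sketch, reusing the dtrain frame from the examples above.)
# sketch: ridge regression on indicator columns reproduces Laplace smoothing
lambda <- 1
M <- model.matrix(~ 0 + x_s, data = dtrain)   # one indicator column per level, no intercept
v_ridge <- solve(t(M) %*% M + lambda * diag(ncol(M)), t(M) %*% dtrain$y)
counts <- as.numeric(table(dtrain$x_s))
sums   <- as.numeric(tapply(dtrain$y, dtrain$x_s, sum))
v_laplace <- sums / (counts + lambda)
# v_ridge and v_laplace should agree, level by level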
rquery
New Introduction to rquery
Introduction
rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of R’s base::transform() or dplyr’s dplyr::mutate(), and uses a pipe in the style popularized in R with magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.
The R/rquery version of this introduction is here, and the Python/data_algebra version of this introduction is here.
In transform formulations, data manipulation is written as transformations that produce new data.frames, instead of as alterations of a primary data structure (as is the case with data.table). Transform systems can use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.
In rquery’s case, the primary set of data operators is as follows:
drop_columns
select_columns
rename_columns
select_rows
order_rows
extend
project
natural_join
convert_records (supplied by the cdata package).
These operations break into a small number of themes:
 Simple column operations (selecting and renaming columns).
 Simple row operations (selecting and reordering rows).
 Creating new columns or replacing columns with new calculated values.
 Aggregating or summarizing data.
 Combining results between two data.frames.
 General conversion of record layouts (supplied by the cdata package).
The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps. rquery supplies a high-performance implementation of these methods that scales from in-memory scale up through big-data scale (to just about anything that supplies a sufficiently powerful SQL interface, such as PostgreSQL, Apache Spark, or Google BigQuery).
We will work through simple examples/demonstrations of the rquery data manipulation operators.
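As a small taste of the notation, here is a sketch of a pipeline over a tiny made-up data frame; the data and column names are illustrative assumptions, and in-memory execution is supplied by the rqdatatable package.
library(rquery)
library(rqdatatable)  # in-memory implementation of rquery pipelines
library(wrapr)        # the %.>% "dot arrow" pipe
# made-up example data
d <- data.frame(x = c(1, 2, 3, 4), g = c("a", "a", "b", "b"))
ops <- local_td(d) %.>%
  extend(., x2 := x * x) %.>%                        # create a new calculated column
  project(., sum_x2 := sum(x2), groupby = "g") %.>%  # aggregate within groups
  order_rows(., "g")                                 # reorder rows
d %.>% ops  # apply the operator tree to the data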
Practical Data Science with R 2nd Edition update
We are in the last stages of proofing the galleys/typesetting of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019. So this edition will definitely be out soon!
If you ever wanted to see what Nina Zumel and John Mount are like when we have the help of editors, this book is your chance!
One thing I noticed in working through the galleys: it becomes easy to see why Dr. Nina Zumel is first author. Two-thirds of the book is her work.
Free R/data science Extract: Evaluating a Classification Model with a Spam Filter
We are excited to share a free extract of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019: Evaluating a Classification Model with a Spam Filter.
This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.
It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.
This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design):
(Note: Nina Zumel leads on the course design, which is the heavy lifting; John Mount just got tasked to be the one delivering it.)
Zumel, Mount, Practical Data Science with R, 2nd Edition is coming out in print very soon. Here is a discount code to help you get a good deal on the book:
Take 37% off Practical Data Science with R, Second Edition by entering fcczumel3 into the discount code box at checkout at manning.com.
AI for Engineers
For the last year we (Nina Zumel and myself, John Mount) have had the honor of teaching the AI200 portion of LinkedIn’s AI Academy.
John Mount at the LinkedIn campus
Nina Zumel designed most of the material, and John Mount has been delivering it and bringing her feedback. We’ve just started our 9th cohort. We adjust the course each time. Our students teach us a lot about how one thinks about data science. We bring that forward to each round of the course.
Roughly the goal is the following.
If every engineer, product manager, and project manager had some hands-on experience with data science and AI (deep neural nets), then they would be more likely both to think of using these techniques in their work and to introduce the instrumentation required to have useful data in the first place.
This will have huge downstream benefits for LinkedIn. Our group is thrilled to be a part of this.
We are looking for more companies that want an on-site data science intensive for their teams (either in Python or in R).
vtreat Cross Validation
Nina Zumel finished new documentation on how vtreat’s cross-validation works, which I want to share here.
vtreat is a system that makes data preparation for machine learning a “one-liner” (available in R or in Python). We have a set of starting-off points here. These documents describe what vtreat does for you; just find the one that matches your task, and you should have a good start for solving data science problems in R or in Python.
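In R, the “one-liner” style looks roughly like the following sketch; the data frame d, the variable names, and the outcome column "y" are assumed example values, not taken from any particular document.
library(vtreat)
# sketch: cross-validated data preparation for a numeric outcome
cfe <- mkCrossFrameNExperiment(d, varlist = c("x1", "x2"), outcomename = "y")
d_prepared     <- cfe$crossFrame   # cross-validated treated training data
treatment_plan <- cfe$treatments   # plan for preparing future data
# new_prepared <- prepare(treatment_plan, new_data)   # apply the plan to held-out data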
The latest documentation is a bit about how vtreat works, and how to control some of the details of the work it is doing for you.
The new documentation is:
Please give one of the examples a try, and consider adding vtreat to your data science workflow.