## sklearn Pipe Step Interface for vtreat

We’ve been experimenting with this for a while, and the next release of the R vtreat package will include a back-port of the Python vtreat package’s sklearn pipe step interface (in addition to the standard R interface).


## New vtreat Feature: Nested Model Bias Warning

For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later naively used to train a model leads to an undesirable nested model bias. The vtreat package (both the R version and the Python version) incorporates a cross-frame method that allows one to use all of the training data both to learn variable re-encodings and to correctly train a subsequent model (for an example please see our recent PyData LA talk).

The next version of vtreat will warn the user if they have improperly used the same data for both vtreat impact code inference and downstream modeling. So in addition to us warning you not to do this, the package now also checks and warns against this situation. vtreat has had methods for avoiding nested model bias for a very long time; we are now adding new warnings to confirm users are using them.

## Set up the Example

This example is excerpted from some of our classification documentation.

## New Year’s Resolution 2020: Work on more R Data Science Projects

We had such a positive reception to our last Introduction to Data Science promotion that we are going to try to make the course available to more people by lowering the base price to \$29.99. We are also creating a one-month promotional price of \$20.99. To get a permanent subscription to the course for less than \$21, just visit this link https://www.udemy.com/course/introduction-to-data-science/ and use the discount code ITDS21 any time in January of 2020.

Combine this with the new second edition of Practical Data Science with R, and you have a great study set to succeed at substantial statistical modeling and analytics tasks using the R programming language.

(Note: Lego mini-fig not included!)

## New Timings for a Grouped In-Place Aggregation Task

I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.

## PyData Los Angeles 2019 talk: Preparing Messy Real World Data for Supervised Machine Learning

Video of our PyData Los Angeles 2019 talk Preparing Messy Real World Data for Supervised Machine Learning is now available. In this talk we describe how to use vtreat, a package available in R and in Python, to correctly re-code real world data for supervised machine learning tasks.

Please check it out. (Slides are also here.)
## Why to try Practical Data Science with R, 2nd Edition

I thought we would try to express why somebody interested in using the R language (and package ecosystem) for supervised machine learning, data wrangling, analytics projects, and other data science topics should give Practical Data Science with R, 2nd Edition a try.

Nina Zumel and I shared the book with two incredible data scientists (Jeremy Howard and Rachel Thomas), and they helped answer the question with the following as the Practical Data Science with R, 2nd Edition foreword:

Practical Data Science with R, Second Edition, is a hands-on guide to data science, with a focus on techniques for working with structured or tabular data, using the R language and statistical packages. The book emphasizes machine learning, but is unique in the number of chapters it devotes to topics such as the role of the data scientist in projects, managing results, and even designing presentations. In addition to working out how to code up models, the book shares how to collaborate with diverse teams, how to translate business goals into metrics, and how to organize work and reports. If you want to learn how to use R to work as a data scientist, get this book.

We have known Nina Zumel and John Mount for a number of years. We have invited them to teach with us at Singularity University. They are two of the best data scientists we know. We regularly recommend their original research on cross-validation and impact coding (also called target encoding). In fact, chapter 8 of Practical Data Science with R teaches the theory of impact coding and uses it through the authors’ own R package: vtreat.

Practical Data Science with R takes the time to describe what data science is, and how a data scientist solves problems and explains their work. It includes careful descriptions of classic supervised learning methods, such as linear and logistic regression.
We liked the survey style of the book and extensively worked examples using contest-winning methodologies and packages such as random forests and xgboost. The book is full of useful, shared experience and practical advice. We notice they even include our own trick of using random forest variable importance for initial variable screening.

Overall, this is a great book, and we highly recommend it.

Jeremy Howard and Rachel Thomas

About the foreword authors.

Jeremy Howard is an entrepreneur, business strategist, developer, and educator. Jeremy is a founding researcher at fast.ai, a research institute dedicated to making deep learning more accessible. He is also a faculty member at the University of San Francisco, and is chief scientist at doc.ai and platform.ai. Previously, Jeremy was the founding CEO of Enlitic, which was the first company to apply deep learning to medicine, and was selected as one of the world’s top 50 smartest companies by MIT Tech Review two years running. He was the president and chief scientist of the data science platform Kaggle, where he was the top-ranked participant in international machine learning competitions two years running.

Rachel Thomas is director of the USF Center for Applied Data Ethics and cofounder of fast.ai, which has been featured in The Economist, MIT Tech Review, and Forbes. She was selected by Forbes as one of 20 Incredible Women in AI, earned her math PhD at Duke, and was an early engineer at Uber. Rachel is a popular writer and keynote speaker. In her TEDx talk, she shares what scares her about AI and why we need people from all backgrounds involved with AI.
Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning, 2019 is available from:

## A Richer Category for Data Wrangling

I’ve been writing a lot about category theory interpretations of data-processing pipelines and some of the improvements we feel it is driving in both the data_algebra and in rquery/rqdatatable.

I think I’ve found an even better category theory re-formulation of the package, which I will describe here.

## Better SQL Generation via the data_algebra

In our recent note What is new for rquery December 2019 we mentioned an ugly processing pipeline that translates into SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra.

## When Cross-Validation is More Powerful than Regularization

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller coefficients are less sensitive to idiosyncrasies in the training data, and hence, less likely to overfit.

Cross-validation is a way to safely reuse training data in nested model situations. This includes both the case of setting hyperparameters before fitting a model, and the case of fitting models (let’s call them base learners) that are then used as variables in downstream models, as shown in Figure 1. In either situation, using the same data twice can lead to models that are overtuned to idiosyncrasies in the training data, and more likely to overfit.
In general, if any stage of your modeling pipeline involves looking at the outcome (we’ll call that a y-aware stage), you cannot directly use the same data in the following stage of the pipeline. If you have enough data, you can use separate data in each stage of the modeling process (for example, one set of data to learn hyperparameters, another set of data to train the model that uses those hyperparameters). Otherwise, you should use cross-validation to reduce the nested model bias.

Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation?

The short answer: no, you shouldn’t. But as we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.”

## A simple example

Suppose you have a system with two categorical variables. The variable x_s has 10 levels, and the variable x_n has 100 levels. The outcome y is a function of x_s, but not of x_n (but you, the analyst building the model, don’t know this). Here’s the head of the data.

##     x_s  x_n           y
## 2  s_10 n_72  0.34228110
## 3  s_01 n_09 -0.03805102
## 4  s_03 n_18 -0.92145960
## 9  s_08 n_43  1.77069352
## 10 s_08 n_17  0.51992928
## 11 s_01 n_78  1.04714355

With most modeling techniques, a categorical variable with K levels is equivalent to K or K-1 numerical (indicator or dummy) variables, so this system actually has around 110 variables. In real life situations where a data scientist is working with high-cardinality categorical variables, or with a lot of categorical variables, the number of actual variables can begin to swamp the size of training data, and/or bog down the machine learning algorithm.

One way to deal with these issues is to represent each categorical variable by a single variable model (or base learner), and then use the predictions of those base learners as the inputs to a bigger model.
So instead of fitting a model with 110 indicator variables, you can fit a model with two numerical variables. This is a simple example of nested models.

We refer to this procedure as “impact coding,” and it is one of the data treatments available in the vtreat package, specifically for dealing with high-cardinality categorical variables. But for now, let’s go back to the original problem.

## The naive way

For this simple example, you might try representing each variable as the expected value of y - mean(y) in the training data, conditioned on the variable’s level. So the ith “coefficient” of the one-variable model would be given by:

$v_i = E[y \mid x = s_i] - E[y]$

Where si is the ith level. Let’s show this with the variable x_s (the code for all the examples in this article is here):

##     x_s      meany      coeff
## 1  s_01  0.7998263  0.8503282
## 2  s_02 -1.3815640 -1.3310621
## 3  s_03 -0.7928449 -0.7423430
## 4  s_04 -0.8245088 -0.7740069
## 5  s_05  0.7547054  0.8052073
## 6  s_06  0.1564710  0.2069728
## 7  s_07 -1.1747557 -1.1242539
## 8  s_08  1.3520153  1.4025171
## 9  s_09  1.5789785  1.6294804
## 10 s_10 -0.7313895 -0.6808876

In other words, whenever the value of x_s is s_01, the one variable model vs returns the value 0.8503282, and so on. If you do this for both variables, you get a training set that looks like this:

##     x_s  x_n           y         vs         vn
## 2  s_10 n_72  0.34228110 -0.6808876 0.64754957
## 3  s_01 n_09 -0.03805102  0.8503282 0.54991135
## 4  s_03 n_18 -0.92145960 -0.7423430 0.01923877
## 9  s_08 n_43  1.77069352  1.4025171 1.90394159
## 10 s_08 n_17  0.51992928  1.4025171 0.26448341
## 11 s_01 n_78  1.04714355  0.8503282 0.70342961

Now fit a linear model for y as a function of vs and vn.

model_raw = lm(y ~ vs + vn,
data=dtrain_treated)
summary(model_raw)
##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.33068 -0.57106  0.00342  0.52488  2.25472
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05050    0.05597  -0.902    0.368
## vs           0.77259    0.05940  13.006   <2e-16 ***
## vn           0.61201    0.06906   8.862   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8761 on 242 degrees of freedom
## Multiple R-squared:  0.6382, Adjusted R-squared:  0.6352
## F-statistic: 213.5 on 2 and 242 DF,  p-value: < 2.2e-16

Note that this model gives significant coefficients to both vs and vn, even though y is not a function of x_n (or vn). Because you used the same data to fit the one variable base learners and to fit the larger model, you have overfit.

## The right way: cross-validation

The correct way to impact code (or to nest models in general) is to use cross-validation techniques. Impact coding with cross-validation is already implemented in vtreat; note the similarity between this diagram and Figure 1 above.

The training data is used both to fit the base learners (as we did above) and also to create a data frame of cross-validated base learner predictions (called a cross-frame in vtreat). This cross-frame is used to train the overall model. Let’s fit the correct nested model, using vtreat.

library(vtreat)
library(wrapr)
xframeResults = mkCrossFrameNExperiment(dtrain,
qc(x_s, x_n), "y",
codeRestriction = qc(catN),
verbose = FALSE)
# the plan uses the one-variable models to treat data
treatmentPlan = xframeResults$treatments
# the cross-frame
dtrain_treated = xframeResults$crossFrame

head(dtrain_treated)
##     x_s_catN   x_n_catN           y
## 1 -0.6337889 0.91241547  0.34228110
## 2  0.8342227 0.82874089 -0.03805102
## 3 -0.7020597 0.18198634 -0.92145960
## 4  1.3983175 1.99197404  1.77069352
## 5  1.3983175 0.11679580  0.51992928
## 6  0.8342227 0.06421659  1.04714355
variables = setdiff(colnames(dtrain_treated), "y")

model_X = lm(mk_formula("y", variables),
data=dtrain_treated)
summary(model_X)
##
## Call:
## lm(formula = mk_formula("y", variables), data = dtrain_treated)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2157 -0.7343  0.0225  0.7483  2.9639
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04169    0.06745  -0.618    0.537
## x_s_catN     0.92968    0.06344  14.656   <2e-16 ***
## x_n_catN     0.10204    0.06654   1.533    0.126
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 242 degrees of freedom
## Multiple R-squared:  0.4753, Adjusted R-squared:  0.471
## F-statistic: 109.6 on 2 and 242 DF,  p-value: < 2.2e-16

This model correctly determines that x_n (and its one-variable model x_n_catN) do not affect the outcome. We can compare the performance of this model to the naive model on holdout data.
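To make the contrast concrete, here is a minimal base-R sketch of naive versus out-of-fold impact coding for a pure-noise categorical variable. It is an illustration under stated assumptions, not vtreat’s implementation: the helper names `impact_code` and `cross_impact_code`, the synthetic data, and the choice of 5 folds are all hypothetical.

```r
# Illustrative sketch only; helper names and data are hypothetical,
# and this is not how vtreat itself is implemented.
set.seed(2020)

n <- 300
d <- data.frame(
  x_n = sample(sprintf("n_%03d", 1:100), n, replace = TRUE),  # 100-level noise variable
  stringsAsFactors = FALSE
)
d$y <- rnorm(n)  # the outcome does not depend on x_n at all

# Impact code estimated on `train` and applied to the rows of `apply_to`:
# per-level mean of the outcome minus the grand mean.
impact_code <- function(train, apply_to, var, outcome) {
  grand_mean <- mean(train[[outcome]])
  level_means <- tapply(train[[outcome]], train[[var]], mean)
  idx <- match(apply_to[[var]], names(level_means))
  codes <- as.numeric(level_means)[idx] - grand_mean
  codes[is.na(codes)] <- 0  # levels not seen in `train` get the neutral code 0
  codes
}

# Out-of-fold version: each row is coded by a model that never saw that row.
cross_impact_code <- function(d, var, outcome, k = 5) {
  fold <- sample(rep_len(seq_len(k), nrow(d)))
  codes <- numeric(nrow(d))
  for (f in seq_len(k)) {
    held_out <- fold == f
    codes[held_out] <- impact_code(d[!held_out, , drop = FALSE],
                                   d[held_out, , drop = FALSE],
                                   var, outcome)
  }
  codes
}

d$vn_naive <- impact_code(d, d, "x_n", "y")     # same rows used to fit and to code
d$vn_cross <- cross_impact_code(d, "x_n", "y")  # cross-validated codes

cor_naive <- cor(d$vn_naive, d$y)
cor_cross <- cor(d$vn_cross, d$y)
print(c(naive = cor_naive, crossval = cor_cross))
```

The naive code correlates with y on the training data only because each level mean was estimated from the very rows it is then applied to; the out-of-fold code has no such advantage, which is the property a cross-frame is meant to provide.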

##                    rmse  rsquared
## ypred_naive    1.303778 0.2311538
## ypred_crossval 1.093955 0.4587089

The correct model has a much smaller root-mean-squared error and a much larger R-squared than the naive model when applied to new data.
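For readers who want to reproduce this kind of comparison, here is a small sketch of the two holdout metrics. The function names `rmse` and `rsquared` and the toy vectors are hypothetical, and the R-squared definition used (one minus residual sum of squares over total sum of squares) is an assumption; the article’s own code may compute it differently.

```r
# Hypothetical helpers; the exact definitions used in the article are assumed.
rmse <- function(y, ypred) {
  sqrt(mean((y - ypred)^2))
}

rsquared <- function(y, ypred) {
  1 - sum((y - ypred)^2) / sum((y - mean(y))^2)
}

# Toy holdout outcome and predictions, for illustration only.
y_holdout <- c(1, 2, 3, 4)
y_pred    <- c(1.5, 2, 2.5, 4)

metrics <- c(rmse = rmse(y_holdout, y_pred),
             rsquared = rsquared(y_holdout, y_pred))
print(metrics)
```

Note that with this definition R-squared on holdout data can be negative when a model predicts worse than the holdout mean.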

## An attempted alternative: regularized models.

But cross-validation is so complicated. Can’t we just regularize? As we’ll show in the appendix of this article, for a one-variable model, L2-regularization is simply Laplace smoothing. Again, we’ll represent each “coefficient” of the one-variable model as the Laplace smoothed value minus the grand mean.

$v_i = \sum\nolimits_{x_j = s_i} y_j / (\text{count}_i + \lambda) - E[y]$

Where counti is the frequency of si in the training data, and λ is the smoothing parameter (usually 1). If λ = 1 then the first term on the right is just adding one to the frequency of the level and then taking the “adjusted conditional mean” of y.

Again, let’s show this for the variable x_s.

##     x_s      sum_y count_y   grandmean         vs
## 1  s_01  20.795484      26 -0.05050187  0.8207050
## 2  s_02 -37.302227      27 -0.05050187 -1.2817205
## 3  s_03 -22.199656      28 -0.05050187 -0.7150035
## 4  s_04 -14.016649      17 -0.05050187 -0.7282009
## 5  s_05  19.622340      26 -0.05050187  0.7772552
## 6  s_06   3.129419      20 -0.05050187  0.1995218
## 7  s_07 -35.242672      30 -0.05050187 -1.0863585
## 8  s_08  36.504412      27 -0.05050187  1.3542309
## 9  s_09  33.158549      21 -0.05050187  1.5577086
## 10 s_10 -16.821957      23 -0.05050187 -0.6504130
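As a quick arithmetic check, the vs column can be recomputed from the other columns of the table. The sketch below uses a hypothetical helper `laplace_code`, takes λ = 1, and copies the sum_y, count_y, and grandmean values of the first two rows from the table above.

```r
# Hypothetical helper: Laplace-smoothed conditional mean minus the grand mean.
laplace_code <- function(sum_y, count, grand_mean, lambda = 1) {
  sum_y / (count + lambda) - grand_mean
}

# Inputs copied from the rows for s_01 and s_02 in the table above.
vs <- laplace_code(sum_y = c(20.795484, -37.302227),
                   count = c(26, 27),
                   grand_mean = -0.05050187)
print(round(vs, 7))
```

The results agree with the tabulated vs values for s_01 and s_02 (0.8207050 and -1.2817205) to the printed precision.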

After applying the one variable models for x_s and x_n to the data, the head of the resulting treated data looks like this:

##     x_s  x_n           y         vs         vn
## 2  s_10 n_72  0.34228110 -0.6504130 0.44853367
## 3  s_01 n_09 -0.03805102  0.8207050 0.42505898
## 4  s_03 n_18 -0.92145960 -0.7150035 0.02370493
## 9  s_08 n_43  1.77069352  1.3542309 1.28612835
## 10 s_08 n_17  0.51992928  1.3542309 0.21098803
## 11 s_01 n_78  1.04714355  0.8207050 0.61015422

Now fit the overall model:

##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.30354 -0.57688 -0.02224  0.56799  2.25723
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06665    0.05637  -1.182    0.238
## vs           0.81142    0.06203  13.082  < 2e-16 ***
## vn           0.85393    0.09905   8.621  8.8e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8819 on 242 degrees of freedom
## Multiple R-squared:  0.6334, Adjusted R-squared:  0.6304
## F-statistic: 209.1 on 2 and 242 DF,  p-value: < 2.2e-16

Again, both variables look significant. Even with regularization, the model is still overfit. Comparing the performance of the models on holdout data, you see that the regularized model does a little better than the naive model, but not as well as the correctly cross-validated model.

##                    rmse  rsquared
## ypred_naive    1.303778 0.2311538
## ypred_crossval 1.093955 0.4587089
## ypred_reg      1.267648 0.2731756

## The Moral of the Story

Unfortunately, regularization is not enough to overcome nested model bias. Whenever you apply a y-aware process to your data, you have to use cross-validation methods (or a separate data set) at the next stage of your modeling pipeline.

### Appendix: Derivation of Laplace Smoothing as L2-Regularization

Without regularization, the optimal one-variable model for y in terms of a categorical variable with K levels {sj} is a set of K coefficients v such that

$f(\mathbf{v}) := \sum\limits_{j=1}^K \sum\nolimits_{x_i = s_j} (y_i - v_j)^2$

is minimized (the inner sum runs over the data points whose level is sj, so all N data points are counted exactly once). L2-regularization adds a penalty to the magnitude of v, so that the goal is to minimize

$f(\mathbf{v}) := \sum\limits_{j=1}^K \sum\nolimits_{x_i = s_j} (y_i - v_j)^2 + \lambda \sum\limits_{j=1}^K {v_j}^2$

where λ is a known smoothing hyperparameter, usually set (in this case) to 1.

To minimize the above expression for a single coefficient vj, take the derivative with respect to vj and set it to zero:

$\sum\nolimits_{x_i = s_j} -2 (y_i - v_j) + 2 \lambda v_j = 0\\ \sum\nolimits_{x_i = s_j }-y_i + \sum\nolimits_{x_i = s_j} v_j + \lambda v_j = 0\\ \sum\nolimits_{x_i = s_j }-y_i + \text{count}_j v_j + \lambda v_j = 0$

Where countj is the number of times the level sj appears in the training data. Now solve for vj:

$v_j (\text{count}_j + \lambda) = \sum\nolimits_{x_i = s_j} y_i\\ v_j = \sum\nolimits_{x_i = s_j} y_i / (\text{count}_j + \lambda)$

This is Laplace smoothing. Note that it is also the one-variable equivalent of ridge regression.
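The closed form can also be checked numerically. Because the penalized objective separates across levels, it is enough to look at one level at a time. The sketch below uses synthetic outcomes for a single hypothetical level (the names `y_level` and `penalized_loss` are illustrative) and compares the formula with a one-dimensional numerical minimizer from base R.

```r
# Numerical check of the closed form for one level; all data here are synthetic.
set.seed(1)
y_level <- rnorm(20, mean = 2)  # outcomes of the rows that share one level s_j
lambda <- 1

# Penalized squared-error objective restricted to this level's coefficient v.
penalized_loss <- function(v) {
  sum((y_level - v)^2) + lambda * v^2
}

closed_form <- sum(y_level) / (length(y_level) + lambda)
numeric_min <- optimize(penalized_loss, interval = c(-10, 10), tol = 1e-8)$minimum

print(c(closed_form = closed_form, numeric = numeric_min))
```

The two values should agree up to the tolerance of the optimizer, which is a useful sanity check on the derivation above.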