Posted on 3 Comments on A Theory of Nested Cross Simulation

## A Theory of Nested Cross Simulation

[Reader’s Note. Some of our articles are applied and some of our articles are more theoretical. The following article is more theoretical, and requires fairly formal notation to even work through. However, it should be of interest as it touches on some of the fine points of cross-validation that are quite hard to perceive or discuss without the notational framework. We thought about including some “simplifying explanatory diagrams” but so many entities are being introduced and manipulated by the processes we are describing we found equation notation to be in fact cleaner than the diagrams we attempted and rejected.]

• Picking hyper-parameters, fitting a model, and then evaluating the model.
• Variable preparation/pruning, fitting a model, and then evaluating the model.

In each case you are building a pipeline where “y-aware” (or outcome aware) choices and transformations made at each stage affect later stages. This can introduce undesirable nested model bias and over-fitting.

Our current standard advice to avoid nested model bias is either:

• Split your data into 3 or more disjoint pieces, such as separate variable preparation/pruning, model fitting, and model evaluation.
• Reserve a test-set for evaluation and use “simulated out of sample data” or “cross-frame”/“cross simulation” techniques to simulate dividing data among the first two model construction stages.

The first practice is simple and computationally efficient, but statistically inefficient. This may not matter if you have a lot of data, as in “big data”. The second procedure is more statistically efficient, but is also more complicated and has some computational cost. For convenience the cross simulation method is supplied as a ready to go procedure in our `R` data cleaning and preparation package `vtreat`.

What would it look like if we insisted on using cross simulation or simulated out of sample techniques for all three (or more) stages? Please read on to find out.

Hyperbole and a Half copyright Allie Brosh (use allowed in some situations with attribution)

Edit: we are going to be writing on a situation of some biases that do leak into the cross-frame “new data simulation.” So think of cross-frames as bias (some small amount is introduced) / variance (reduced be appearing to have a full sized data set at all stages) trade-off.

Posted on Categories Statistics, Tutorials5 Comments on Variables can synergize, even in a linear model

## Introduction

Suppose we have the task of predicting an outcome `y` given a number of variables `v1,..,vk`. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve explain-ability, even reduce over-fit, and improve the quality of the resulting model.

For some informative discussion on such issues please see the following:

In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.

Posted on Categories Computer Science, math programming, Statistics, Tutorials2 Comments on Variable pruning is NP hard

## Variable pruning is NP hard

I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing some examples is that summaries such as model quality (especially out of sample quality) and variable significances are not quite as simple as one would hope (they in fact lack a lot of the monotone structure or submodular structure that would make things easy).

That being said we have a lot of powerful and effective heuristics to discuss in upcoming articles. I am going to leave such positive results for my later articles and here concentrate on an instructive technical negative result: picking a good subset of variables is theoretically quite hard. Continue reading Variable pruning is NP hard

Posted on 1 Comment on vtreat version 0.5.26 released on CRAN

## vtreat version 0.5.26 released on CRAN

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.26 has been released on CRAN.

‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

‘vtreat’ is an R package that incorporates a number of transforms and simulated out of sample (cross-frame simulation) procedures that can:

• Decrease the amount of hand-work needed to prepare data for predictive modeling.
• Improve actual model performance on new out of sample or application data.
• Lower your procedure documentation burden (through ready vtreat documentation and tutorials).
• Increase model reliability (by re-coding unexpected situations).
• Increase model expressiveness (by allowing use of more variable types, especially large cardinality categorical variables).

‘vtreat’ can be used to prepare data for either regression or classification.

Please read on for what ‘vtreat’ does and what is new. Continue reading vtreat version 0.5.26 released on CRAN

Posted on 8 Comments on A demonstration of vtreat data preparation

## A demonstration of vtreat data preparation

In previous writings we have gone to great lengths to document, explain and motivate `vtreat`. That necessarily gets long and unnecessarily feels complicated.

In this example we are going to show what building a predictive model using `vtreat` best practices looks like assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, but not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but want what small effort is required to add `vtreat` to your predictive modeling practice.

Posted on Categories Administrativia, Statistics, Tutorials

## Improved vtreat documentation

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here).

vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner. Continue reading Improved vtreat documentation

Posted on Categories Statistics, Tutorials

## Prepping Data for Analysis using R

Nina and I are proud to share our lecture: “Prepping Data for Analysis using R” from ODSC West 2015.

It is about 90 minutes, and covers a lot of the theory behind the `vtreat` data preparation library.

We also have a Github repository including all the lecture materials here. Continue reading Prepping Data for Analysis using R

Posted on 9 Comments on How Do You Know if Your Data Has Signal?

## How Do You Know if Your Data Has Signal?

Image by Liz Sullivan, Creative Commons. Source: Wikimedia

An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure what are the true causes or predictors of the phenomenon you are interested in, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; and more importantly, they can lead to overfit models that generalize poorly on new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data, even when that data has no signal. We showed some examples of this previously in our “Bad Bayes” blog post.

In this latest “Statistics as it should be” article, we will look at a heuristic to help determine which of your input variables have signal. Continue reading How Do You Know if Your Data Has Signal?

Posted on 1 Comment on Working with Sessionized Data 2: Variable Selection

## Working with Sessionized Data 2: Variable Selection

In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we sessionized the data by considering all possible aggregations (window widths) of the data as features. Such naive sessionization can quickly lead to very wide data sets, with potentially more features than you have datums (and collinear features, as well). In this post, we will use the same example, but try to select our features more intelligently.

Illustration: Boris Artzybasheff
photo: James Vaughan, some rights reserved

The Example Problem

Recall that you have a mobile app with both free (A) and paid (B) actions; if a customer’s tasks involve too many paid actions, they will abandon the app. Your goal is to detect when a customer is in a state when they are likely to abandon, and offer them (perhaps through an in-app ad) a more economical alternative, for example a “Pro User” subscription that allows them to do what they are currently doing at a lower rate. You don’t want to be too aggressive about showing customers this ad, because showing it to someone who doesn’t need the subscription service is likely to antagonize them (and convince them to stop using your app).

You want to build a model that predicts whether a customer will abandon the app (“exit”) within seven days. Your training set is a set of 648 customers who were present on a specific reference day (“day 0”); their activity on day 0 and the ten days previous to that (days 1 through 10), and how many days until each customer exited (`Inf` for customers who never exit), counting from day 0. For each day, you constructed all possible windows within those ten days, and counted the relative rates of A events and B events in each window. This gives you 132 features per row. You also have a hold-out set of 660 customers, with the same structure. You can download the wide data set used for these examples as an `.rData` file here. The explanation of the variable names is in the previous post in this series.

In the previous installment, we built a regularized (ridge) logistic regression model over all 132 features. This model didn’t perform too badly, but in general there is more danger of overfitting when working with very wide data sets; in addition, it is quite expensive to analyze a large number of variables with standard implementations of logistic regression. In this installment, we will look for potentially more robust and less expensive ways of analyzing this data.