
On Nested Models

We have recently been working on, and presenting on, nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested models, but they are easy to get wrong.

At first glance nested models seem like they should be anathema. Using data to build a model and then applying the model or transform to that same data breaks the exchangeability that statistical machine learning depends on for correct behavior. It leads to overfit. The overfit can be big (where you have a chance to notice it) or small (where you miss it, and unknowingly end up with somewhat inferior models). However, when one looks further, we see such nested procedures are already common statistical practice:

  • Using training data to build a principal components projection.
  • Stacking or super-learning (for a good intro see the talks and writings of Dr. Erin LeDell).
  • Variable selection.
  • Dimension reduction.
  • Variable transform/centering (such as caret::preProcess()).
  • Our own y-aware data preparation.
  • Deep models (such as multi-layer neural nets).
  • Estimation of Bayesian hyper-parameters.

Our point is: the above procedures are useful, but they are strictly correct only when a disjoint set of calibration data is used for the preparation design (and then never re-used in training, test, or later application). The strictness is taught and remembered for the marquee steps (such as model fitting and evaluation), and sometimes forgotten for the "safe steps" (such as principal components determination).

In the age of "big data" the statistical inefficiency of losing some data is far less than the statistical inefficiency of breaking your exchangeability. The recommended experimental design is similar to the Cal/Train/Test split taught in "The Elements of Statistical Learning", 2nd edition, Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman (though most practitioners squander the technique on needless hyper-parameter noodling).

We can build better models by making sure every bit of data is only used once. We already know not to score on the data we trained on (as that gives model quality estimates an undesired upward bias), but beyond that the precaution is not applied often enough. We now call this out as a general procedure: we should (in principle) never use any of our data twice (even during training). For example: any data used in variable conditioning or dimension reduction should not be re-used during model construction. If we have a lot of data (i.e., we are in the big data regime) this is not a problem. If we do not have enough data for this discipline we should simulate it through cross-validation procedures.
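As a minimal sketch of this discipline, assuming synthetic data and base R only (the data, the three-way role assignment, and the `project` helper are illustrative and not from the original analysis), the following uses calibration rows to design a principal components projection, training rows to fit the model, and test rows to score it:

```r
set.seed(2016)  # arbitrary seed, only so the sketch is reproducible

# synthetic data (hypothetical): 10 noisy views of one latent signal z
n <- 3000
z <- rnorm(n)
X <- sapply(1:10, function(i) z + rnorm(n))
colnames(X) <- paste0("x", 1:10)
d <- data.frame(X, y = z + rnorm(n))
xvars <- colnames(X)

# assign each row to exactly one of three disjoint roles
role <- sample(rep(c("cal", "train", "test"), length.out = n))
dCal   <- d[role == "cal", ]
dTrain <- d[role == "train", ]
dTest  <- d[role == "test", ]

# stage 1: design the projection on calibration rows only
pca <- prcomp(dCal[, xvars], center = TRUE, scale. = TRUE)

# hypothetical helper: apply the fixed projection, keep the first k components
project <- function(dframe, k = 3) {
  pcs <- predict(pca, newdata = dframe[, xvars])[, seq_len(k), drop = FALSE]
  data.frame(pcs, y = dframe$y)
}

# stage 2: fit the model on training rows only
model <- lm(y ~ ., data = project(dTrain))

# stage 3: estimate quality on test rows only
testPred <- predict(model, newdata = project(dTest))
testRsq <- 1 - sum((dTest$y - testPred)^2) / sum((dTest$y - mean(dTest$y))^2)
print(testRsq)
```

No row influences more than one stage, so the test estimate is not flattered by either the projection design or the model fit.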

I’ll restate my points here:

  • Current data science practice is quietly losing statistical power through inappropriate re-use of data in different stages of the process (the analyst looking, variable pruning, variable treatment, dimension reduction, and so on).
  • In truly "big data" situations this can be fixed by never re-using data. Frankly, a working definition of having a lot of data could be that you would not suffer from losing some of it.
  • In more common situations we can attempt to simulate fresh data through automated cross-validation procedures such as cross data or cross frames.
  • This may have applications in deep learning: it probably makes sense to stratify your training data and use a different fixed disjoint subset in training each layer of a neural net (both for calculation and gradient estimation).

Beyond correct statistical practice there is evidence that "read once" procedures are of bounded power, whether they use each instance of randomness only once (as in N. Nisan, "On read-once vs. multiple access to randomness in logspace," Proceedings of the Fifth Annual Structure in Complexity Theory Conference, Barcelona, 1990, pp. 179-184, doi: 10.1109/SCT.1990.113966) or use data only once. This is yet again an opportunity for improving generalization.

Let’s illustrate the ideas with a simple nested modeling procedure in R. Our nested operation is simply: scoreModel(buildModel(data)). Or, in magrittr-style pipe notation: data %>% buildModel %>% scoreModel. I call out the pipe notation because Dr. Nina Zumel noticed there is a good opportunity for pipe notation in designing data treatment, which suggests an opportunity for good formal tools and methods that automate cross-validation based simulations of fresh data. Another way to simulate fresh data involves the use of differential privacy, and this too could be automated.

On to our example:

library("magrittr")  # supplies the %>% pipe used below

# supply uniform interface fit and predict.
# to use other fitters change these functions
# (function bodies here assume a linear model via lm())
fitModel <- function(data,formula) {
  lm(formula,data)
}

applyModel <- function(model,newdata) {
  predict(model,newdata=newdata)
}

# down stream application, in our case computing
# unadjusted in-sample R^2.  In super learning
# could be a derived model over many input columns.
rsq <- function(pred,y) {
  1-sum((y-pred)^2)/sum((y-mean(y))^2)
}

# example data, intentionally no relation
d <- data.frame(x=rnorm(5),y=rnorm(5))

Standard "fit and apply" pattern.

d %>% fitModel(y~x) -> modelToReturn
modelToReturn %>% applyModel(newdata=d) -> predictions
# Unadjusted R^2 above zero (misleading).  Deliberately not adjusted so we can see the problem.
rsq(predictions,d$y)
## [1] 0.4193942
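A single draw could be a fluke, so it helps to repeat the experiment. The sketch below (base R only; the seed, the replication count, and the name `inSampleRsq` are arbitrary choices for illustration) refits a one-variable linear model on many fresh five-row noise data sets. For Gaussian noise and a single predictor plus intercept, the in-sample R^2 has expected value 1/(n-1), which is 0.25 at n=5, even though x and y are unrelated by construction.

```r
set.seed(5)  # arbitrary seed, only so the sketch is reproducible
n <- 5

# in-sample (unadjusted) R^2 on data where x and y are unrelated
inSampleRsq <- replicate(2000, {
  dSim <- data.frame(x = rnorm(n), y = rnorm(n))
  summary(lm(y ~ x, data = dSim))$r.squared
})

# average apparent fit, to compare against 1/(n-1)
print(mean(inSampleRsq))
# fraction of noise data sets that look at least as good as R^2 = 0.4
print(mean(inSampleRsq > 0.4))
```

The point is that an in-sample R^2 of the size shown above is unremarkable for pure noise at this sample size.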

Next we define a general procedure that simulates out-of-sample results by cross-validation, for any model that supplies a fitModel/applyModel pair. The idea is that simulateOutOfSampleTrainEval simulates having used fresh data (disjoint from our training examples) through cross-validation methods. This is a very general and powerful procedure which should be applied more often (such as in controlling principal components analysis, variable significance estimation, and empirical Bayes prior/hyper-parameter estimation).

#' Simulate out of sample fitting and application.
#' @param d data.frame to work with
#' @param modelsToFit list of list(fitModel,formula,applyModel,modelName) entries to apply
#' @return data frame with derived predictions (in cross-validated manner to simulate out of sample training and application).
simulateOutOfSampleTrainEval <- function(d,modelsToFit) {
  eSets <- vtreat::buildEvalSets(nrow(d))
  preds <- lapply(modelsToFit,
                  function(pi) {
                    # could parallelize the next step
                    evals <- lapply(eSets, 
                                    function(ei) { 
                                      d[ei$train,] %>% pi$fitModel(pi$formula) %>% 
                                        pi$applyModel(d[ei$app,])
                                    })
                    # re-assemble results into original row order
                    pred <- numeric(nrow(d))
                    for(eii in seq_len(length(eSets))) {
                      pred[eSets[[eii]]$app] <- evals[[eii]]
                    }
                    pred <- data.frame(x=pred,stringsAsFactors = FALSE)
                    colnames(pred) <- pi$modelName
                    pred
                  })
  # combine the per-model columns into one data frame
  # (assumption: base cbind is used for this combining step)
  do.call(cbind,preds)
}

With the above function these cross-validated procedures are no harder to apply than standard in-sample procedures (though there is some runtime cost).

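# (modelName 'linearRegression' is an assumed label; any column name works)
modelsToFit <- list(
  list(fitModel=fitModel,
       formula=y~x,
       applyModel=applyModel,
       modelName='linearRegression'))

d %>% fitModel(y~x) -> modelToReturn
d %>% simulateOutOfSampleTrainEval(modelsToFit) -> predout
# Out of sample R^2 below zero, not misleading.
rsq(predout$linearRegression,d$y)
## [1] -0.568004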

In a super learning context we would use simulateOutOfSampleTrainEval to fit a family of models and assemble their results into a data frame for additional fitting.
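To make the super learning case concrete, here is a small self-contained sketch in base R. It does not use the functions defined in this post; the synthetic data, the two lm()-based sub-models, the five-fold assignment, and the names `crossPredict` and `stackFrame` are all assumptions chosen for illustration. Each sub-model contributes a column of cross-validated predictions, and the second-stage model is fit only on those out-of-sample columns.

```r
set.seed(25)  # arbitrary seed, only so the sketch is reproducible
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- d$x1 + d$x2^2 + rnorm(n)

# sub-models (hypothetical choices): each is a formula fit by lm()
baseFormulas <- list(linear    = y ~ x1 + x2,
                     quadratic = y ~ x1 + I(x2^2))

# assign every row to one of k folds
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# cross-validated prediction: each row is scored by a model that never saw it
crossPredict <- function(f) {
  pred <- numeric(n)
  for (j in seq_len(k)) {
    held <- fold == j
    pred[held] <- predict(lm(f, data = d[!held, ]), newdata = d[held, ])
  }
  pred
}

# one column of simulated out-of-sample predictions per sub-model
stackFrame <- data.frame(lapply(baseFormulas, crossPredict), y = d$y)

# second-stage model fit over the sub-model prediction columns
metaModel <- lm(y ~ linear + quadratic, data = stackFrame)
print(coef(metaModel))
```

To score genuinely new data under this design, one would refit each sub-model on all of the training rows and pass those predictions to the second-stage model.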

For nested modeling (or stacking / super-learning) the above procedure looks like the following.


Data-adaptive variable preparation is also essentially modeling. So any modeling that involves such preparation is essentially a nested model. Proper training procedures for nested models involve different (or fresh) data for each stage, or simulating such data through cross-validation methods.

For data treatment the procedure looks like the following.


vtreat implements this directly through its mkCrossFrameCExperiment and mkCrossFrameNExperiment methods (and the development version exposes the buildEvalSets method we used in our explicit examples here).
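The cross-frame idea can also be sketched by hand. The following is a base R illustration only, not vtreat's implementation: the synthetic data, the `impactCode` helper, and the five-fold assignment are assumptions. A 100-level categorical variable with no relation to y is re-encoded as the per-level mean of y. Designing and applying the encoding on the same rows manufactures an apparent relation, while the cross-frame encoding, where each row is encoded using only rows from other folds, does not.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
n <- 500

# noise-only data: a 100-level categorical variable unrelated to y
d <- data.frame(g = sample(paste0("lev", 1:100), n, replace = TRUE),
                y = rnorm(n), stringsAsFactors = FALSE)

# hypothetical helper: y-aware encoding that replaces each level by the mean
# of y for that level in the design rows; unseen levels get the grand mean
impactCode <- function(design, target) {
  means <- tapply(design$y, design$g, mean)
  code <- as.numeric(means)[match(target$g, names(means))]
  code[is.na(code)] <- mean(design$y)
  code
}

# naive: design the encoding and apply it on the same rows
d$naive <- impactCode(d, d)

# cross frame: each row is encoded using only rows from the other folds
fold <- sample(rep(1:5, length.out = n))
d$cross <- numeric(n)
for (j in 1:5) {
  held <- fold == j
  d$cross[held] <- impactCode(d[!held, ], d[held, ])
}

# apparent fit of a downstream model on each encoding
rsqNaive <- summary(lm(y ~ naive, data = d))$r.squared
rsqCross <- summary(lm(y ~ cross, data = d))$r.squared
print(c(naive = rsqNaive, cross = rsqCross))
```

Since y is pure noise here, any sizable R^2 from the naive encoding is overfit introduced by the treatment step itself.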

One thought on “On Nested Models”

  1. You should have a look some time at Vovk’s conformal prediction if you haven’t already. Not always applicable, but one-pass algorithms which come with confidence intervals are one way to skin this cat.
