vtreat
in the examples we show in this note, but you can easily implement the approach independently of vtreat
.
As with other geometric algorithms, principal components analysis is sensitive to the units of the data. In standard ("x-only") PCA, we often attempt to alleviate this problem by rescaling the x variables to their "natural units": that is, we rescale x by its own standard deviation. By individually rescaling each x variable to its "natural unit," we hope (but cannot guarantee) that all the data as a group will be in some "natural metric space," and that the structure we hope to discover in the data will manifest itself in this coordinate system. As we saw in the previous note, if the structure that we hope to discover is the relationship between x and y, we have even less guarantee that we are in the correct space, since the decomposition of the data was done without knowledge of y.
Y-aware PCA is simply PCA with a different scaling: we rescale the x data to be in y-units. That is, we want scaled variables x’ such that a unit change in x’ corresponds to a unit change in y. Under this rescaling, all the independent variables are in the same units, which are indeed the natural units for the problem at hand: characterizing their effect on y. (We also center the transformed variables x’ to be zero mean, as is done with standard centering and scaling).
It’s easy to determine the scaling for a variable x by fitting a linear regression model between x and y:
y = m * x + b
The coefficient m is the slope of the best fit line, so a unit change in x corresponds (on average) to a change of m units in y. If we rescale (and recenter) x as
x' := m * x - mean(m * x)
then x’ is in y units. This y-aware scaling is both complementary to variable pruning and powerful enough to perform well on its own.
In vtreat
, the treatment plan created by designTreatmentsN()
will store the information needed for y-aware scaling, so that if you then prepare
your data with the flag scale=TRUE
, the resulting treated frame will be scaled appropriately.
First, let’s build our example. We will use the same data set as our earlier "X only" discussion.
In this data set, there are two (unobservable) processes: one that produces the output yA
and one that produces the output yB
.We only observe the mixture of the two: y = yA + yB + eps
, where eps
is a noise term. Think of y
as measuring some notion of success and the x
variables as noisy estimates of two different factors that can each drive success.
We’ll set things up so that the first five variables (x.01, x.02, x.03, x.04, x.05) have all the signal. The odd numbered variables correspond to one process (yB
) and the even numbered variables correspond to the other (yA
). Then, to simulate the difficulties of real world modeling, we’ll add lots of pure noise variables (noise*
). The noise variables are unrelated to our y of interest — but are related to other "y-style" processes that we are not interested in. We do this because in real applications, there is no reason to believe that unhelpful variables have limited variation or are uncorrelated with each other, though things would certainly be easier if we could so assume. As we showed in the previous note, this correlation undesirably out-competed the y induced correlation among signaling variables when using standard PCA.
All the variables are also deliberately mis-scaled to model some of the difficulties of working with under-curated real world data.
Let’s start with our train and test data.
# make data
set.seed(23525)
dTrain <- mkData(1000)
dTest <- mkData(1000)
Let’s look at our outcome y and a few of our variables.
summary(dTrain[, c("y", "x.01", "x.02", "noise1.01", "noise1.02")])
## y x.01 x.02
## Min. :-5.08978 Min. :-4.94531 Min. :-9.9796
## 1st Qu.:-1.01488 1st Qu.:-0.97409 1st Qu.:-1.8235
## Median : 0.08223 Median : 0.04962 Median : 0.2025
## Mean : 0.08504 Mean : 0.02968 Mean : 0.1406
## 3rd Qu.: 1.17766 3rd Qu.: 0.93307 3rd Qu.: 1.9949
## Max. : 5.84932 Max. : 4.25777 Max. :10.0261
## noise1.01 noise1.02
## Min. :-30.5661 Min. :-30.4412
## 1st Qu.: -5.6814 1st Qu.: -6.4069
## Median : 0.5278 Median : 0.3031
## Mean : 0.1754 Mean : 0.4145
## 3rd Qu.: 5.9238 3rd Qu.: 6.8142
## Max. : 26.4111 Max. : 31.8405
Next, we’ll design a treatment plan for the frame, and examine the variable significances, as estimated by vtreat
.
# design treatment plan
treatmentsN <- designTreatmentsN(dTrain,setdiff(colnames(dTrain),'y'),'y',
verbose=FALSE)
scoreFrame = treatmentsN$scoreFrame
scoreFrame$vartype = ifelse(grepl("noise", scoreFrame$varName), "noise", "signal")
dotplot_identity(scoreFrame, "varName", "sig", "vartype") +
coord_flip() + ggtitle("vtreat variable significance estimates")+
scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))
Note that the noise variables typically have large significance values, denoting statistical insignificance. Usually we recommend doing some significance pruning on variables before moving on — see here for possible consequences of not pruning an over-abundance of variables, and here for a discussion of one way to prune, based on significance. For this example, however, we will attempt dimensionality reduction without pruning.
Now let’s prepare the treated frame, with scaling turned on. We will deliberately turn off variable pruning by setting pruneSig = 1
. In real applications, you would want to set pruneSig
to a value less than one to prune insignificant variables. However, here we turn off variable pruning to show that you can recover some of pruning’s benefits via scaling effects, because the scaled noise variables should not have a major effect in the principal components analysis. Pruning by significance is in fact a good additional precaution complementary to scaling by effects.
# prepare the treated frames, with y-aware scaling
examplePruneSig = 1.0
dTrainNTreatedYScaled <- prepare(treatmentsN,dTrain,pruneSig=examplePruneSig,scale=TRUE)
dTestNTreatedYScaled <- prepare(treatmentsN,dTest,pruneSig=examplePruneSig,scale=TRUE)
# get the variable ranges
ranges = vapply(dTrainNTreatedYScaled, FUN=function(col) c(min(col), max(col)), numeric(2))
rownames(ranges) = c("vmin", "vmax")
rframe = as.data.frame(t(ranges)) # make ymin/ymax the columns
rframe$varName = rownames(rframe)
varnames = setdiff(rownames(rframe), "y")
rframe = rframe[varnames,]
rframe$vartype = ifelse(grepl("noise", rframe$varName), "noise", "signal")
# show a few columns
summary(dTrainNTreatedYScaled[, c("y", "x.01_clean", "x.02_clean", "noise1.02_clean", "noise1.02_clean")])
## y x.01_clean x.02_clean
## Min. :-5.08978 Min. :-2.65396 Min. :-2.51975
## 1st Qu.:-1.01488 1st Qu.:-0.53547 1st Qu.:-0.48904
## Median : 0.08223 Median : 0.01063 Median : 0.01539
## Mean : 0.08504 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.17766 3rd Qu.: 0.48192 3rd Qu.: 0.46167
## Max. : 5.84932 Max. : 2.25552 Max. : 2.46128
## noise1.02_clean noise1.02_clean.1
## Min. :-0.0917910 Min. :-0.0917910
## 1st Qu.:-0.0186927 1st Qu.:-0.0186927
## Median : 0.0003253 Median : 0.0003253
## Mean : 0.0000000 Mean : 0.0000000
## 3rd Qu.: 0.0199244 3rd Qu.: 0.0199244
## Max. : 0.0901253 Max. : 0.0901253
barbell_plot(rframe, "varName", "vmin", "vmax", "vartype") +
coord_flip() + ggtitle("y-scaled variables: ranges") +
scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))
Notice that after the y-aware rescaling, the signal carrying variables have larger ranges than the noise variables.
Now we do the principal components analysis. In this case it is critical that the scale
parameter in prcomp
is set to FALSE
so that it does not undo our own scaling. Notice the magnitudes of the singular values fall off quickly after the first two to five values.
vars <- setdiff(colnames(dTrainNTreatedYScaled),'y')
# prcomp defaults to scale. = FALSE, but we already scaled/centered in vtreat- which we don't want to lose.
dmTrain <- as.matrix(dTrainNTreatedYScaled[,vars])
dmTest <- as.matrix(dTestNTreatedYScaled[,vars])
princ <- prcomp(dmTrain, center = FALSE, scale. = FALSE)
dotplot_identity(frame = data.frame(pc=1:length(princ$sdev),
magnitude=princ$sdev),
xvar="pc",yvar="magnitude") +
ggtitle("Y-Scaled variables: Magnitudes of singular values")
When we look at the variable loadings of the first five principal components, we see that we recover the even/odd loadings of the original signal variables. PC1
has the odd variables, and PC2
has the even variables. These two principal components carry most of the signal. The next three principal components complete the basis for the five original signal variables. The noise variables have very small loadings, compared to the signal variables.
proj <- extractProjection(2,princ)
rot5 <- extractProjection(5,princ)
rotf = as.data.frame(rot5)
rotf$varName = rownames(rotf)
rotflong = gather(rotf, "PC", "loading", starts_with("PC"))
rotflong$vartype = ifelse(grepl("noise", rotflong$varName), "noise", "signal")
dotplot_identity(rotflong, "varName", "loading", "vartype") +
facet_wrap(~PC,nrow=1) + coord_flip() +
ggtitle("Y-Scaled Variable loadings, first five principal components") +
scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))
Let’s look at the projection of the data onto its first two principal components, using color to code the y value. Notice that y increases both as we move up and as we move right. We have recovered two features that correlate with an increase in y. In fact, PC1
corresponds to the odd signal variables, which correspond to process yB, and PC2
corresponds to the even signal variables, which correspond to process yA.
# apply projection
projectedTrain <- as.data.frame(dmTrain %*% proj,
stringsAsFactors = FALSE)
# plot data sorted by principal components
projectedTrain$y <- dTrainNTreatedYScaled$y
ScatterHistN(projectedTrain,'PC1','PC2','y',
"Y-Scaled Training Data projected to first two principal components")
Now let’s fit a linear regression model to the first two principal components.
model <- lm(y~PC1+PC2,data=projectedTrain)
summary(model)
##
## Call:
## lm(formula = y ~ PC1 + PC2, data = projectedTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3470 -0.7919 0.0172 0.7955 3.9588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.08504 0.03912 2.174 0.03 *
## PC1 0.78611 0.04092 19.212 <2e-16 ***
## PC2 1.03243 0.04469 23.101 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 997 degrees of freedom
## Multiple R-squared: 0.4752, Adjusted R-squared: 0.4742
## F-statistic: 451.4 on 2 and 997 DF, p-value: < 2.2e-16
projectedTrain$estimate <- predict(model,newdata=projectedTrain)
trainrsq = rsq(projectedTrain$estimate,projectedTrain$y)
ScatterHist(projectedTrain,'estimate','y','Recovered model versus truth (y aware PCA train)',
smoothmethod='identity',annot_size=3)
This model, with only two variables, explains 47.52% of the variation in y. This is comparable to the variance explained by the model fit to twenty principal components using x-only PCA (as well as a model fit to all the original variables) in the previous note.
Let’s see how the model does on hold-out data.
# apply projection
projectedTest <- as.data.frame(dmTest %*% proj,
stringsAsFactors = FALSE)
# plot data sorted by principal components
projectedTest$y <- dTestNTreatedYScaled$y
ScatterHistN(projectedTest,'PC1','PC2','y',
"Y-Scaled Test Data projected to first two principal components")
projectedTest$estimate <- predict(model,newdata=projectedTest)
testrsq = rsq(projectedTest$estimate,projectedTest$y)
testrsq
## [1] 0.5063724
ScatterHist(projectedTest,'estimate','y','Recovered model versus truth (y aware PCA test)',
smoothmethod='identity',annot_size=3)
We see that this two-variable model captures about 50.64% of the variance in y on hold-out — again, comparable to the hold-out performance of the model fit to twenty principal components using x-only PCA. These two principal components also do a much better job of capturing the internal structure of the data — that is, the relationship of the signaling variables to the yA
and yB
processes — than the first two principal components of the x-only PCA.
caret::preProcess
?In this note, we used vtreat
, a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner, followed by principal components regression. One could instead use caret
. The caret
package, as described in the documentation, "is a set of functions that attempt to streamline the process for creating predictive models."
caret::preProcess
is designed to implement a number of sophisticated x alone transformations, groupings, prunings, and repairs (see caret/preprocess.html#all, which demonstrates "the function on all the columns except the last, which is the outcome" on the schedulingData dataset). So caret::preProcess
is a super-version of the PCA step.
We could use it as follows either alone or before vtreat design/prepare as a initial pre-processor. Using it alone is similar to PCA for this data set, as our example doesn’t have some of the additional problems caret::preProcess
is designed to help with.
library('caret')
origVars <- setdiff(colnames(dTrain),'y')
# can try variations such adding/removing non-linear steps such as "YeoJohnson"
prep <- preProcess(dTrain[,origVars],
method = c("center", "scale", "pca"))
prepared <- predict(prep,newdata=dTrain[,origVars])
newVars <- colnames(prepared)
prepared$y <- dTrain$y
print(length(newVars))
## [1] 44
modelB <- lm(paste('y',paste(newVars,collapse=' + '),sep=' ~ '),data=prepared)
print(summary(modelB)$r.squared)
## [1] 0.5004569
print(summary(modelB)$adj.r.squared)
## [1] 0.4774413
preparedTest <- predict(prep,newdata=dTest[,origVars])
testRsqC <- rsq(predict(modelB,newdata=preparedTest),dTest$y)
testRsqC
## [1] 0.4824284
The 44 caret
-chosen PCA variables are designed to capture 95% of the in-sample explainable variation of the variables. The linear regression model fit to the selected variables explains about 50.05% of the y variance on training and 48.24% of the y variance on test. This is quite good, comparable to our previous results. However, note that caret
picked more than the twenty principal components that we picked visually in the previous note, and needed far more variables than we needed with y-aware PCA.
Because caret::preProcess
is x-only processing, the first few variables capture much less of the y variation. So we can’t model y without using a lot of the derived variables. To show this, let’s try fitting a model using only five of caret
‘s PCA variables.
model5 <- lm(paste('y',paste(newVars[1:5],collapse=' + '),sep=' ~ '),data=prepared)
print(summary(model5)$r.squared)
## [1] 0.1352
print(summary(model5)$adj.r.squared)
## [1] 0.1308499
The first 5 variables only capture about 13.52% of the in-sample variance; without being informed about y, we can’t know which variation to preserve and which we can ignore. We certainly haven’t captured the two subprocesses that drive y in an inspectable manner.
If your goal is regression, there are other workable y-aware dimension reducing procedures, such as L2-regularized regression or partial least squares. Both methods are also related to principal components analysis (see Hastie, etal 2009).
Bair, etal proposed a variant of principal components regression that they call Supervised PCR. In supervised PCR, as described in their 2006 paper, a univariate linear regression model is fit to each variable (after scaling and centering), and any variable whose coefficient (what we called m above) has a magnitude less than some threshold \(\theta\) is pruned. PCR is then done on the remaining variables. Conceptually, this is similar to the significance pruning that vtreat
offers, except that the pruning criterion is "effects-based" (that is, it’s based on the magnitude of a parameter, or the strength of an effect) rather than probability-based, such as pruning on significance.
One issue with an effects-based pruning criterion is that the appropriate pruning threshold varies from problem to problem, and not necessarily in an obvious way. Bair, etal find an appropriate threshold via cross-validation. Probability-based thresholds are in some sense more generalizable from problem to problem, since the score is always in probability units — the same units for all problems. A simple variation of supervised PCR might prune on the significance of the coefficient m, as determined by its t-statistic. This would be essentially equivalent to significance pruning of the variables via vtreat
before standard PCR.
Note that vtreat
uses the significance of the one-variable model fits, not coefficient significance to estimate variable significance. When both the dependent and independent variables are numeric, the model significance and the coefficient significance are identical (see Weisberg, Applied Linear Regression). In more general modeling situations where either the outcome is categorical or the original input variable is categorical with many degrees of freedom, they are not the same, and, in our opinion, using the model significance is preferable.
In general modeling situations where you are not specifically interested in the structure of the feature space, as described by the principal components, then we recommend significance pruning of the variables. As a rule of thumb, we suggest setting your significance pruning threshold based on the rate at which you can tolerate bad variables slipping into the model. For example, setting the pruning threshold at \(p=0.05\) would let pure noise variables in at the rate of about 1 in 20 in expectation. So a good upper bound on the pruning threshold might be 1/nvar, where nvar is the number of variables. We discuss this issue briefly here in the vtreat
documentation.
vtreat
does not supply any joint variable dimension reduction as we feel dimension reduction is a modeling task. vtreat
is intended to limit itself to only necessary "prior to modeling" processing and includes significance pruning reductions because such pruning can be necessary prior to modeling.
In our experience, there are two camps of analysts: those who never use principal components regression and those who use it far too often. While principal components analysis is a useful data conditioning method, it is sensitive to distances and geometry. Therefore it is only to be trusted when the variables are curated, pruned, and in appropriate units. Principal components regression should not be used blindly; it requires proper domain aware scaling, initial variable pruning, and posterior component pruning. If the goal is regression many of the purported benefits of principal components regression can be achieved through regularization.
The general principals are widely applicable, and often re-discovered and re-formulated in useful ways (such as autoencoders).
In our next note, we will look at some ways to pick the appropriate number of principal components procedurally.
Bair, Eric, Trevor Hastie, Debashis Paul and Robert Tibshirani, "Prediction by Supervised Principal Components", Journal of the American Statistical Association, Vol. 101, No. 473 (March 2006), pp. 119-137.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, 2nd Edition, 2009.
Weisberg, Sanford, Applied Linear Regression, Third Edition, Wiley, 2005.
devtools
to install WVPlots
(announced here and used to produce some of the graphs shown here). I thought I would write a note with a few instructions to help.
These are things you should not have to do often, and things those of us already running R
have stumbled through and forgotten about. These are also the kind of finicky system dependent non-repeatable interactive GUI steps you largely avoid once you have a scriptable system like fully R up and running.
First you will need install (likely admin) privileges on your machine and a network connection that is not blocking and of cran, RStudio or Github.
Make sure you have up to date copies of both R and RStudio. We have to assume you are somewhat familiar with R and RStudio (so suggest a tutorial if you are not).
Once you have these we will try to “knit” or render a R markdown document. To do this start RStudio select File->"New File"->"R Markdown"
as shown below (menus may be different on different systems, you will have to look around a bit).
Then click “OK”. Then press the “Knit HTML” button as shown in the next figure.
This will ask you to pick a filename to save as (anything ending in “.Rmd” will do). If RStudio asks to install anything let it. In the end you should get a rendered copy of RStudio’s example document. If any of this doesn’t work you can look to RStudio documentation.
Assuming the above worked paste the following commands into RStudio’s “Console” window (entering a “return” after the paste to ensure execution).
[Note any time we say paste or type, watch out for any errors caused by conversion of normal machine quotes to insidious smart quotes.]
install.packages(c('RCurl','ggplot2','tidyr',
'devtools','knitr'))
The set of packages you actually need can usually be found by looking at the R
you wish to run and looking for any library()
or ::
commands. R scripts and worksheets tend not to install packages on their own as that would be a bit invasive.
If the above commands execute without error (messages and warnings are okay) you can then try the command below to install WVPlots
:
devtools::install_github('WinVector/WVPlots',
build_vignettes=TRUE)
If the above fails (some Windows users are seeing “curl” errors) it can be a problem with your machine (perhaps permissions, or no curl library installed), network, anti-virus, or firewall software. If it does fail you can try to install WVPlots
yourself by doing the following:
WVPlots_0.1.tar.gz
. install.packages(c('ROCR', 'ggplot2', 'gridExtra', 'mgcv', 'plyr', 'reshape2', 'stringr', 'knitr', 'testthat'))
(we are installing the dependencies of WVPlots
by hand, the dependencies are found by looking at the WVPLots DESCRIPTION file, and excluding grid
as it is part of the base system and doesn’t need to be installed).install.packages('~/Downloads/WVPlots_0.1.tar.gz',repos=NULL)
(replacing '~/Downloads/WVPlots_0.1.tar.gz'
with wherever you downloaded WVPlots_0.1.tar.gz
to).If the above worked you can test the WVPlots
package by typing library("WVPlots")
.
Now you can try knitting one of our example worksheets.
XonlyPCA.Rmd
by right-clicking on the “Raw” button (towards the top right).XonlyPCA.Rmd.txt
to XonlyPCA.Rmd
.File->"Open File"
to open XonlyPCA.Rmd
.If this isn’t working then something is either not installed or configured correctly, or something is blocking access (such as anti-virus software or firewall software). The best thing to do is find another local R
user and debug together.
The purpose of this article is to set the stage for presenting dimensionality reduction techniques appropriate for predictive modeling, such as y-aware principal components analysis, variable pruning, L2-regularized regression, supervised PCR, or partial least squares. We do this by working detailed examples and building the relevant graphs. In our follow-up article we describe and demonstrate the idea of y-aware scaling.
Note we will try to say "principal components" (plural) throughout, following Everitt’s The Cambridge Dictionary of Statistics, though this is not the only common spelling (e.g. Wikipedia: Principal component regression). We will work all of our examples in R.
In principal components regression (PCR), we use principal components analysis (PCA) to decompose the independent (x) variables into an orthogonal basis (the principal components), and select a subset of those components as the variables to predict y. PCR and PCA are useful techniques for dimensionality reduction when modeling, and are especially useful when the independent variables are highly colinear.
Generally, one selects the principal components with the highest variance — that is, the components with the largest singular values — because the subspace defined by these principal components captures most of the variation in the data, and thus represents a smaller space that we believe captures most of the qualities of the data. Note, however, that standard PCA is an "x-only" decomposition, and as Jolliffe (1982) shows through examples from the literature, sometimes lower-variance components can be critical for predicting y, and conversely, high variance components are sometimes not important.
Mosteller and Tukey (1977, pp. 397-398) argue similarly that the components with small variance are unlikely to be important in regression, apparently on the basis that nature is "tricky, but not downright mean". We shall see in the examples below that without too much effort we can find examples where nature is "downright mean". — Jolliffe (1982)
The remainder of this note presents principal components analysis in the context of PCR and predictive modeling in general. We will show some of the issues in using an x-only technique like PCA for dimensionality reduction. In a follow-up note, we’ll discuss some y-aware approaches that address these issues.
First, let’s build our example. In this sort of teaching we insist on toy or synthetic problems so we actually know the right answer, and can therefore tell which procedures are better at modeling the truth.
In this data set, there are two (unobservable) processes: one that produces the output yA
and one that produces the output yB
. We only observe the mixture of the two: y = yA + yB + eps
, where eps
is a noise term. Think of y
as measuring some notion of success and the x
variables as noisy estimates of two different factors that can each drive success. We’ll set things up so that the first five variables (x.01, x.02, x.03, x.04, x.05) have all the signal. The odd numbered variables correspond to one process (yB
) and the even numbered variables correspond to the other (yA
).
Then, to simulate the difficulties of real world modeling, we’ll add lots of pure noise variables (noise*
). The noise variables are unrelated to our y of interest — but are related to other "y-style" processes that we are not interested in. As is common with good statistical counterexamples, the example looks like something that should not happen or that can be easily avoided. Our point is that the data analyst is usually working with data just like this.
Data tends to come from databases that must support many different tasks, so it is exactly the case that there may be columns or variables that are correlated to unknown and unwanted additional processes. The reason PCA can’t filter out these noise variables is that without use of y, standard PCA has no way of knowing what portion of the variation in each variable is important to the problem at hand and should be preserved. This can be fixed through domain knowledge (knowing which variables to use), variable pruning and y-aware scaling. Our next article will discuss these procedures; in this article we will orient ourselves with a demonstration of both what a good analysis and what a bad analysis looks like.
All the variables are also deliberately mis-scaled to model some of the difficulties of working with under-curated real world data.
# build example where even and odd variables are bringing in noisy images
# of two different signals.
mkData <- function(n) {
for(group in 1:10) {
# y is the sum of two effects yA and yB
yA <- rnorm(n)
yB <- rnorm(n)
if(group==1) {
d <- data.frame(y=yA+yB+rnorm(n))
code <- 'x'
} else {
code <- paste0('noise',group-1)
}
yS <- list(yA,yB)
# these variables are correlated with y in group 1,
# but only to each other (and not y) in other groups
for(i in 1:5) {
vi <- yS[[1+(i%%2)]] + rnorm(nrow(d))
d[[paste(code,formatC(i,width=2,flag=0),sep='.')]] <- ncol(d)*vi
}
}
d
}
Notice the copy of y in the data frame has additional "unexplainable variance" so only about 66% of the variation in y is predictable.
Let’s start with our train and test data.
# make data
set.seed(23525)
dTrain <- mkData(1000)
dTest <- mkData(1000)
Let’s look at our outcome y and a few of our variables.
summary(dTrain[, c("y", "x.01", "x.02",
"noise1.01", "noise1.02")])
## y x.01 x.02
## Min. :-5.08978 Min. :-4.94531 Min. :-9.9796
## 1st Qu.:-1.01488 1st Qu.:-0.97409 1st Qu.:-1.8235
## Median : 0.08223 Median : 0.04962 Median : 0.2025
## Mean : 0.08504 Mean : 0.02968 Mean : 0.1406
## 3rd Qu.: 1.17766 3rd Qu.: 0.93307 3rd Qu.: 1.9949
## Max. : 5.84932 Max. : 4.25777 Max. :10.0261
## noise1.01 noise1.02
## Min. :-30.5661 Min. :-30.4412
## 1st Qu.: -5.6814 1st Qu.: -6.4069
## Median : 0.5278 Median : 0.3031
## Mean : 0.1754 Mean : 0.4145
## 3rd Qu.: 5.9238 3rd Qu.: 6.8142
## Max. : 26.4111 Max. : 31.8405
Usually we recommend doing some significance pruning on variables before moving on — see here for possible consequences of not pruning an over-abundance of variables, and here for a discussion of one way to prune, based on significance. For this example, however, we will deliberately attempt dimensionality reduction without pruning (to demonstrate the problem). Part of what we are trying to show is to not assume PCA performs these steps for you.
First, let’s look at the ideal situation. If we had sufficient domain knowledge (or had performed significance pruning) to remove the noise, we would have no pure noise variables. In our example we know which variables carry signal and therefore can limit down to them before doing the PCA as follows.
goodVars <- colnames(dTrain)[grep('^x.',colnames(dTrain))]
dTrainIdeal <- dTrain[,c('y',goodVars)]
dTestIdeal <- dTrain[,c('y',goodVars)]
Let’s perform the analysis and look at the magnitude of the singular values.
# do the PCA
dmTrainIdeal <- as.matrix(dTrainIdeal[,goodVars])
princIdeal <- prcomp(dmTrainIdeal,center = TRUE,scale. = TRUE)
# extract the principal components
rot5Ideal <- extractProjection(5,princIdeal)
# prepare the data to plot the variable loadings
rotfIdeal = as.data.frame(rot5Ideal)
rotfIdeal$varName = rownames(rotfIdeal)
rotflongIdeal = gather(rotfIdeal, "PC", "loading",
starts_with("PC"))
rotflongIdeal$vartype = ifelse(grepl("noise",
rotflongIdeal$varName),
"noise", "signal")
# plot the singular values
dotplot_identity(frame = data.frame(pc=1:length(princIdeal$sdev),
magnitude=princIdeal$sdev),
xvar="pc",yvar="magnitude") +
ggtitle("Ideal case: Magnitudes of singular values")
The magnitudes of the singular values tell us that the first two principal components carry most of the signal. We can also look at the variable loadings of the principal components. The plot of the variable loadings is a graphical representation of the coordinates of the principal components. Each coordinate corresponds to the contribution of one of the original variables to that principal component.
dotplot_identity(rotflongIdeal, "varName", "loading", "vartype") +
facet_wrap(~PC,nrow=1) + coord_flip() +
ggtitle("x scaled variable loadings, first 5 principal components") +
scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))
We see that we recover the even/odd loadings of the original signal variables. PC1
has the odd variables, and PC2
has the even variables. The next three principal components complete the basis for the five original variables.
Since most of the signal is in the first two principal components, we can look at the projection of the data into that plane, using color to code y.
# signs are arbitrary on PCA, so instead of calling predict we pull out
# (and alter) the projection by hand
projectedTrainIdeal <-
as.data.frame(dmTrainIdeal %*% extractProjection(2,princIdeal),
stringsAsFactors = FALSE)
projectedTrainIdeal$y <- dTrain$y
ScatterHistN(projectedTrainIdeal,'PC1','PC2','y',
"Ideal Data projected to first two principal components")
Notice that the value of y increases both as we move up and as we move right. We have recovered two orthogonal features that each correlate with an increase in y (in general the signs of the principal components — that is, which direction is "positive" — are arbitrary, so without precautions the above graph can appear flipped). Recall that we constructed the data so that the odd variables (represented by PC1
) correspond to process yB and the even variables (represented by PC2
) correspond to process yA. We have recovered both of these relations in the figure.
This is why you rely on domain knowledge, or barring that, at least prune your variables. For this example variable pruning would have gotten us to the above ideal case. In our next article we will show how to perform the significance pruning.
To demonstrate the problem of x-only PCA on unpruned data in a predictive modeling situation, let’s analyze the same data without limiting ourselves to the known good variables. We are pretending (as is often the case) we don’t have the domain knowledge indicating which variables are useful and we have neglected to significance prune the variables before PCA. In our experience, this is a common mistake in using PCR, or, more generally, with using PCA in predictive modeling situations.
This example will demonstrate how you lose modeling power when you don’t apply the methods in a manner appropriate to your problem. Note that the appropriate method for your data may not match the doctrine of another field, as they may have different data issues.
We deliberately mis-scaled the original data when we generated it. Mis-scaled data is a common problem in data science situations, but perhaps less common in carefully curated scientific situations. In a messy data situation like the one we are emulating, the best practice is to re-scale the x variables; however, we’ll first naively apply PCA to the data as it is. This is to demonstrate the sensitivity of PCA to the units of the data.
vars <- setdiff(colnames(dTrain),'y')
duTrain <- as.matrix(dTrain[,vars])
prinU <- prcomp(duTrain,center = TRUE,scale. = FALSE)
dotplot_identity(frame = data.frame(pc=1:length(prinU$sdev),
magnitude=prinU$sdev),
xvar="pc",yvar="magnitude") +
ggtitle("Unscaled case: Magnitudes of singular values")
There is no obvious knee in the magnitudes of the singular values, so we are at a loss as to how many variables we should use. In addition, when we look at the variable loading of the first five principal components, we will see another problem:
rot5U <- extractProjection(5,prinU)
rot5U = as.data.frame(rot5U)
rot5U$varName = rownames(rot5U)
rot5U = gather(rot5U, "PC", "loading",
starts_with("PC"))
rot5U$vartype = ifelse(grepl("noise",
rot5U$varName),
"noise", "signal")
dotplot_identity(rot5U, "varName", "loading", "vartype") +
facet_wrap(~PC,nrow=1) + coord_flip() +
ggtitle("unscaled variable loadings, first 5 principal components") +
scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))
The noise variables completely dominate the loading of the first several principal components. Because of the way we deliberately mis-scaled the data, the noise variables are of much larger magnitude than the signal variables, and so the true signal is masked when we decompose the data.
Since the magnitudes of the singular values don’t really give us a clue as to how many components to use in our model, let’s try using all of them. This actually makes no sense, because using all the principal components is equivalent to using all the variables, thus defeating the whole purpose of doing PCA in the first place. But let’s do it anyway (as many unwittingly do).
# get all the principal components
# not really a projection as we took all components!
projectedTrain <- as.data.frame(predict(prinU,duTrain),
stringsAsFactors = FALSE)
vars = colnames(projectedTrain)
projectedTrain$y <- dTrain$y
varexpr = paste(vars, collapse="+")
fmla = paste("y ~", varexpr)
model <- lm(fmla,data=projectedTrain)
summary(model)
##
## Call:
## lm(formula = fmla, data = projectedTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1748 -0.7611 0.0111 0.7821 3.6559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.504e-02 3.894e-02 2.184 0.029204 *
## PC1 1.492e-04 4.131e-04 0.361 0.717983
## PC2 1.465e-05 4.458e-04 0.033 0.973793
## PC3 -7.372e-04 4.681e-04 -1.575 0.115648
## PC4 6.894e-04 5.211e-04 1.323 0.186171
## PC5 7.529e-04 5.387e-04 1.398 0.162577
## PC6 -2.382e-04 5.961e-04 -0.400 0.689612
## PC7 2.555e-04 6.142e-04 0.416 0.677546
## PC8 5.850e-04 6.701e-04 0.873 0.382908
## PC9 -6.890e-04 6.955e-04 -0.991 0.322102
## PC10 7.472e-04 7.650e-04 0.977 0.328993
## PC11 -7.034e-04 7.839e-04 -0.897 0.369763
## PC12 7.062e-04 8.039e-04 0.878 0.379900
## PC13 1.098e-04 8.125e-04 0.135 0.892511
## PC14 -8.137e-04 8.405e-04 -0.968 0.333213
## PC15 -5.163e-05 8.716e-04 -0.059 0.952776
## PC16 1.945e-03 9.015e-04 2.158 0.031193 *
## PC17 -3.384e-04 9.548e-04 -0.354 0.723143
## PC18 -9.339e-04 9.774e-04 -0.955 0.339587
## PC19 -6.110e-04 1.005e-03 -0.608 0.543413
## PC20 8.747e-04 1.042e-03 0.839 0.401494
## PC21 4.538e-04 1.083e-03 0.419 0.675310
## PC22 4.237e-04 1.086e-03 0.390 0.696428
## PC23 -2.011e-03 1.187e-03 -1.694 0.090590 .
## PC24 3.451e-04 1.204e-03 0.287 0.774416
## PC25 2.156e-03 1.263e-03 1.707 0.088183 .
## PC26 -6.293e-04 1.314e-03 -0.479 0.631988
## PC27 8.401e-04 1.364e-03 0.616 0.538153
## PC28 -2.578e-03 1.374e-03 -1.876 0.061014 .
## PC29 4.354e-04 1.423e-03 0.306 0.759691
## PC30 4.098e-04 1.520e-03 0.270 0.787554
## PC31 5.509e-03 1.650e-03 3.339 0.000875 ***
## PC32 9.097e-04 1.750e-03 0.520 0.603227
## PC33 5.617e-04 1.792e-03 0.314 0.753964
## PC34 -1.247e-04 1.870e-03 -0.067 0.946837
## PC35 -6.470e-04 2.055e-03 -0.315 0.752951
## PC36 1.435e-03 2.218e-03 0.647 0.517887
## PC37 4.906e-04 2.246e-03 0.218 0.827168
## PC38 -2.915e-03 2.350e-03 -1.240 0.215159
## PC39 -1.917e-03 2.799e-03 -0.685 0.493703
## PC40 4.827e-04 2.820e-03 0.171 0.864117
## PC41 -6.016e-05 3.060e-03 -0.020 0.984321
## PC42 6.750e-03 3.446e-03 1.959 0.050425 .
## PC43 -3.537e-03 4.365e-03 -0.810 0.417996
## PC44 -4.845e-03 5.108e-03 -0.948 0.343131
## PC45 8.643e-02 5.456e-03 15.842 < 2e-16 ***
## PC46 7.882e-02 6.267e-03 12.577 < 2e-16 ***
## PC47 1.202e-01 6.693e-03 17.965 < 2e-16 ***
## PC48 -9.042e-02 1.163e-02 -7.778 1.92e-14 ***
## PC49 1.309e-01 1.670e-02 7.837 1.23e-14 ***
## PC50 2.893e-01 3.546e-02 8.157 1.08e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.231 on 949 degrees of freedom
## Multiple R-squared: 0.5052, Adjusted R-squared: 0.4791
## F-statistic: 19.38 on 50 and 949 DF, p-value: < 2.2e-16
estimate <- predict(model,newdata=projectedTrain)
trainrsq <- rsq(estimate,projectedTrain$y)
Note that most of the variables that achieve significance are the very last ones! We will leave it to the reader to confirm that using even as many as the first 25 principal components — half the variables — explains little of the variation in y. If we wanted to use PCR to reduce the dimensionality of the problem, we have failed. This is an example of what Jolliffe would have called a "downright mean" modeling problem, which we caused by mis-scaling the data. Note the r-squared of 0.5052 for comparison, later.
So now let’s do what we should have done in the first place: scale the data.
Standard practice is to center the data at mean zero and scale it to unit standard deviation, which is easy with the scale
command.
dTrainNTreatedUnscaled <- dTrain
dTestNTreatedUnscaled <- dTest
# scale the data
dTrainNTreatedXscaled <-
as.data.frame(scale(dTrainNTreatedUnscaled[,colnames(dTrainNTreatedUnscaled)!='y'],
center=TRUE,scale=TRUE),stringsAsFactors = FALSE)
dTrainNTreatedXscaled$y <- dTrainNTreatedUnscaled$y
dTestNTreatedXscaled <-
as.data.frame(scale(dTestNTreatedUnscaled[,colnames(dTestNTreatedUnscaled)!='y'],
center=TRUE,scale=TRUE),stringsAsFactors = FALSE)
dTestNTreatedXscaled$y <- dTestNTreatedUnscaled$y
# get the variable ranges
ranges = vapply(dTrainNTreatedXscaled, FUN=function(col) c(min(col), max(col)), numeric(2))
rownames(ranges) = c("vmin", "vmax")
rframe = as.data.frame(t(ranges)) # make ymin/ymax the columns
rframe$varName = rownames(rframe)
varnames = setdiff(rownames(rframe), "y")
rframe = rframe[varnames,]
rframe$vartype = ifelse(grepl("noise", rframe$varName),
"noise", "signal")
summary(dTrainNTreatedXscaled[, c("y", "x.01", "x.02",
"noise1.01", "noise1.02")])
## y x.01 x.02
## Min. :-5.08978 Min. :-3.56466 Min. :-3.53178
## 1st Qu.:-1.01488 1st Qu.:-0.71922 1st Qu.:-0.68546
## Median : 0.08223 Median : 0.01428 Median : 0.02157
## Mean : 0.08504 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.17766 3rd Qu.: 0.64729 3rd Qu.: 0.64710
## Max. : 5.84932 Max. : 3.02949 Max. : 3.44983
## noise1.01 noise1.02
## Min. :-3.55505 Min. :-3.04344
## 1st Qu.:-0.67730 1st Qu.:-0.67283
## Median : 0.04075 Median :-0.01098
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.66476 3rd Qu.: 0.63123
## Max. : 3.03398 Max. : 3.09969
barbell_plot(rframe, "varName", "vmin", "vmax", "vartype") +
coord_flip() + ggtitle("x scaled variables: ranges") +
scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))
Note that the signal and noise variables now have commensurate ranges.
vars = setdiff(colnames(dTrainNTreatedXscaled), "y")
dmTrain <- as.matrix(dTrainNTreatedXscaled[,vars])
dmTest <- as.matrix(dTestNTreatedXscaled[,vars])
princ <- prcomp(dmTrain,center = TRUE,scale. = TRUE)
dotplot_identity(frame = data.frame(pc=1:length(princ$sdev),
magnitude=princ$sdev),
xvar="pc",yvar="magnitude") +
ggtitle("x scaled variables: Magnitudes of singular values")
Now the magnitudes of the singular values suggest that we can try to model the data with only the first twenty principal components. But first, let’s look at the variable loadings of the first five principal components.
rot5 <- extractProjection(5,princ)
rotf = as.data.frame(rot5)
rotf$varName = rownames(rotf)
rotflong = gather(rotf, "PC", "loading", starts_with("PC"))
rotflong$vartype = ifelse(grepl("noise", rotflong$varName),
"noise", "signal")
dotplot_identity(rotflong, "varName", "loading", "vartype") +
facet_wrap(~PC,nrow=1) + coord_flip() +
ggtitle("x scaled variable loadings, first 5 principal components") +
scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))
The signal variables now have larger loadings than they did in the unscaled case, but the noise variables still dominate the projection, in aggregate swamping out the contributions from the signal variables. The two processes that produced y have diffused amongst the principal components, rather than mostly concentrating in the first two, as they did in the ideal case. This is because we constructed the noise variables to have variation and some correlations with each other — but not be correlated with y. PCA doesn’t know that we are interested only in variable correlations that are due to y, so it must decompose the data to capture as much variation, and as many variable correlations, as possible.
In other words, PCA must represent all processes present in the data, regardless of whether we are trying to predict those particular processes or not. Without the knowledge of the y that we are trying to predict, PCA is forced to prepare for any possible future prediction task.
Let’s build a model using only the first twenty principal components, as our above analysis suggests we should.
# get all the principal components
# not really a projection as we took all components!
projectedTrain <- as.data.frame(predict(princ,dmTrain),
stringsAsFactors = FALSE)
projectedTrain$y <- dTrainNTreatedXscaled$y
ncomp = 20
# here we will only model with the first ncomp principal components
varexpr = paste(paste("PC", 1:ncomp, sep=''), collapse='+')
fmla = paste("y ~", varexpr)
model <- lm(fmla,data=projectedTrain)
summary(model)
##
## Call:
## lm(formula = fmla, data = projectedTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2612 -0.7939 -0.0096 0.7898 3.8352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.085043 0.039391 2.159 0.031097 *
## PC1 0.107016 0.025869 4.137 3.82e-05 ***
## PC2 -0.047934 0.026198 -1.830 0.067597 .
## PC3 0.135933 0.026534 5.123 3.62e-07 ***
## PC4 -0.162336 0.026761 -6.066 1.87e-09 ***
## PC5 0.356880 0.027381 13.034 < 2e-16 ***
## PC6 -0.126491 0.027534 -4.594 4.92e-06 ***
## PC7 0.092546 0.028093 3.294 0.001022 **
## PC8 -0.134252 0.028619 -4.691 3.11e-06 ***
## PC9 0.280126 0.028956 9.674 < 2e-16 ***
## PC10 -0.112623 0.029174 -3.860 0.000121 ***
## PC11 -0.065812 0.030564 -2.153 0.031542 *
## PC12 0.339129 0.030989 10.943 < 2e-16 ***
## PC13 -0.006817 0.031727 -0.215 0.829918
## PC14 0.086316 0.032302 2.672 0.007661 **
## PC15 -0.064822 0.032582 -1.989 0.046926 *
## PC16 0.300566 0.032739 9.181 < 2e-16 ***
## PC17 -0.339827 0.032979 -10.304 < 2e-16 ***
## PC18 -0.287752 0.033443 -8.604 < 2e-16 ***
## PC19 0.297290 0.034657 8.578 < 2e-16 ***
## PC20 0.084198 0.035265 2.388 0.017149 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.246 on 979 degrees of freedom
## Multiple R-squared: 0.4776, Adjusted R-squared: 0.467
## F-statistic: 44.76 on 20 and 979 DF, p-value: < 2.2e-16
projectedTrain$estimate <- predict(model,newdata=projectedTrain)
ScatterHist(projectedTrain,'estimate','y','Recovered 20 variable model versus truth (train)',
smoothmethod='identity',annot_size=3)
trainrsq <- rsq(projectedTrain$estimate,projectedTrain$y)
This model explains 47.76% of the variation in the training set. We do about as well on test.
projectedTest <- as.data.frame(predict(princ,dmTest),
stringsAsFactors = FALSE)
projectedTest$y <- dTestNTreatedXscaled$y
projectedTest$estimate <- predict(model,newdata=projectedTest)
testrsq <- rsq(projectedTest$estimate,projectedTest$y)
testrsq
## [1] 0.5033022
This is pretty good; recall that we had about 33% unexplainable variance in the data, so we would not expect any modeling algorithm to get better than an r-squared of about 0.67.
We can confirm that this performance is as good as simply regressing on all the variables without the PCA, so we have at least not lost information via our dimensionality reduction.
# fit a model to the original data
vars <- setdiff(colnames(dTrain),'y')
formulaB <- paste('y',paste(vars,collapse=' + '),sep=' ~ ')
modelB <- lm(formulaB,data=dTrain)
dTrainestimate <- predict(modelB,newdata=dTrain)
rsq(dTrainestimate,dTrain$y)
## [1] 0.5052081
dTestestimate <- predict(modelB,newdata=dTest)
rsq(dTestestimate,dTest$y)
## [1] 0.4751995
We will show in our next article how to get a similar test r-squared from this data using a model with only two variables.
Scaling the variables improves the performance of PCR on this data relative to not scaling, but we haven’t completely solved the problem (though some analysts are fooled into thinking thusly). We have not explicitly recovered the two processes that drive y, and recovering such structure in the data is one of the purposes of PCA — if we did not care about the underlying structure of the problem, we could simply fit a model to the original data, or use other methods (like significance pruning) to reduce the problem dimensionality.
It is a misconception in some fields that the variables must be orthogonal before fitting a linear regression model. This is not true. A linear model fit to collinear variables can still predict well; the only downside is that the coefficients of the model are not necessarily as easily interpretable as they are when the variables are orthogonal (and ideally, centered and scaled, as well). If your data has so much collinearity that the design matrix is ill-conditioned, causing the model coefficients to be inappropriately large or unstable, then regularization (ridge, lasso, or elastic-net regression) is a good solution. More complex predictive modeling approaches, for example random forest or gradient boosting, also tend to be more immune to collinearity.
So if you are doing PCR, you presumably are interested in the underlying structure of the data, and in this case, we haven’t found it. Projecting onto the first few principal components fails to show much of a relation between these components and y.
We can confirm the first two x-scaled principal components are not informative with the following graph.
proj <- extractProjection(2,princ)
# apply projection
projectedTrain <- as.data.frame(dmTrain %*% proj,
stringsAsFactors = FALSE)
projectedTrain$y <- dTrainNTreatedXscaled$y
# plot data sorted by principal components
ScatterHistN(projectedTrain,'PC1','PC2','y',
"x scaled Data projected to first two principal components")
We see that y is not well ordered by PC1
and PC2
here, as it was in the ideal case, and as it will be with the y-aware PCA.
In our next article we will show that we can explain almost 50% of the y variance in this data using only two variables. This is quite good as even the "all variable" model only picks up about that much of the relation and y by design has about 33% unexplainable variation. In addition to showing the standard methods (including variable pruning) we will introduce a technique we call "y-aware scaling."
Click here for part 2.
Everitt, B. S. The Cambridge Dictionary of Statistics, 2nd edition, Cambridge University Press, 2005.
Jolliffe, Ian T. "A Note on the Use of Principal Components in Regression," Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 31, No. 3 (1982), pp. 300-303
As part of the promotion you can also use the supplied discount code mlcielenlt
for half off some R titles including R in Action, Second Edition and our own Practical Data Science with R. Combine these with our half off code (C3
) for our R video course Introduction to Data Science and you can get a lot of top quality data science material at a deep discount.
I’ve been editing a two-part three-part series Nina Zumel is writing on some of the pitfalls of improperly applied principal components analysis/regression and how to avoid them (we are using the plural spelling as used in following Everitt The Cambridge Dictionary of Statistics). The series is looking absolutely fantastic and I think it will really help people understand, properly use, and even teach the concepts.
The series includes fully worked graphical examples in R and is why we added the ScatterHistN
plot to WVPlots (plot shown below, explained in the upcoming series).
Frankly the material would have worked great as an additional chapter for Practical Data Science with R (but instead everybody is going to get it for free).
Please watch here for the series.
]]>As a follow on to “On Nested Models” we work R examples demonstrating “cross validated training frames” (or “cross frames”) in vtreat.
Consider the following data frame. The outcome only depends on the “good” variables, not on the (high degree of freedom) “bad” variables. Modeling such a data set runs a high risk of overfit.
set.seed(22626)
mkData <- function(n) {
d <- data.frame(xBad1=sample(paste('level',1:1000,sep=''),n,replace=TRUE),
xBad2=sample(paste('level',1:1000,sep=''),n,replace=TRUE),
xBad3=sample(paste('level',1:1000,sep=''),n,replace=TRUE),
xGood1=rnorm(n),
xGood2=rnorm(n))
# outcome only depends on "good" variables
d$y <- rnorm(nrow(d))+0.2*d$xGood1 + 0.3*d$xGood2>0.5
# the random group used for splitting the data set, not a variable.
d$rgroup <- sample(c("cal","train","test"),nrow(d),replace=TRUE)
d
}
d <- mkData(2000)
# devtools::install_github("WinVector/WVPlots")
# library('WVPlots')
plotRes <- function(d,predName,yName,title) {
print(title)
tab <- table(truth=d[[yName]],pred=d[[predName]]>0.5)
print(tab)
diag <- sum(vapply(seq_len(min(dim(tab))),
function(i) tab[i,i],numeric(1)))
acc <- diag/sum(tab)
# if(requireNamespace("WVPlots",quietly=TRUE)) {
# print(WVPlots::ROCPlot(d,predName,yName,title))
# }
print(paste('accuracy',acc))
}
Bad practice: use the same set of data to prepare variable encoding and train a model.
dTrain <- d[d$rgroup!='test',,drop=FALSE]
dTest <- d[d$rgroup=='test',,drop=FALSE]
treatments <- vtreat::designTreatmentsC(dTrain,c('xBad1','xBad2','xBad3','xGood1','xGood2'),
'y',TRUE,
rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
## [1] "desigining treatments Thu May 5 07:17:01 2016"
## [1] "design var xBad1 Thu May 5 07:17:01 2016"
## [1] "design var xBad2 Thu May 5 07:17:01 2016"
## [1] "design var xBad3 Thu May 5 07:17:01 2016"
## [1] "design var xGood1 Thu May 5 07:17:01 2016"
## [1] "design var xGood2 Thu May 5 07:17:01 2016"
## [1] "scoring treatments Thu May 5 07:17:01 2016"
## [1] "have treatment plan Thu May 5 07:17:01 2016"
## [1] "rescoring complex variables Thu May 5 07:17:01 2016"
## [1] "done rescoring complex variables Thu May 5 07:17:01 2016"
dTrainTreated <- vtreat::prepare(treatments,dTrain,
pruneSig=c() # Note: usually want pruneSig to be a small fraction, setting to null to illustrate problem
)
m1 <- glm(y~xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean + xGood2_clean,
data=dTrainTreated,family=binomial(link='logit'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(summary(m1)) # notice low residual deviance
##
## Call:
## glm(formula = y ~ xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean +
## xGood2_clean, family = binomial(link = "logit"), data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.70438 0.00000 0.00000 0.03995 2.61063
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5074 0.3350 -1.515 0.12983
## xBad1_catB 2.9432 0.5549 5.304 1.13e-07 ***
## xBad2_catB 2.5338 0.5857 4.326 1.52e-05 ***
## xBad3_catB 3.4172 0.6092 5.610 2.03e-08 ***
## xGood1_clean 0.7288 0.2429 3.001 0.00269 **
## xGood2_clean 0.7788 0.2585 3.012 0.00259 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1724.55 on 1331 degrees of freedom
## Residual deviance: 132.59 on 1326 degrees of freedom
## AIC: 144.59
##
## Number of Fisher Scoring iterations: 12
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
## pred
## truth FALSE TRUE
## FALSE 848 18
## TRUE 6 460
## [1] "accuracy 0.981981981981982"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
## pred
## truth FALSE TRUE
## FALSE 360 114
## TRUE 153 41
## [1] "accuracy 0.600299401197605"
Notice above that we see a training accuracy of 98% and a test accuracy of 60%.
Now try a proper calibration/train/test split:
dCal <- d[d$rgroup=='cal',,drop=FALSE]
dTrain <- d[d$rgroup=='train',,drop=FALSE]
dTest <- d[d$rgroup=='test',,drop=FALSE]
treatments <- vtreat::designTreatmentsC(dCal,c('xBad1','xBad2','xBad3','xGood1','xGood2'),
'y',TRUE,
rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
## [1] "desigining treatments Thu May 5 07:17:01 2016"
## [1] "design var xBad1 Thu May 5 07:17:01 2016"
## [1] "design var xBad2 Thu May 5 07:17:01 2016"
## [1] "design var xBad3 Thu May 5 07:17:01 2016"
## [1] "design var xGood1 Thu May 5 07:17:01 2016"
## [1] "design var xGood2 Thu May 5 07:17:01 2016"
## [1] "scoring treatments Thu May 5 07:17:01 2016"
## [1] "have treatment plan Thu May 5 07:17:01 2016"
## [1] "rescoring complex variables Thu May 5 07:17:01 2016"
## [1] "done rescoring complex variables Thu May 5 07:17:02 2016"
dTrainTreated <- vtreat::prepare(treatments,dTrain,
pruneSig=c() # Note: usually want pruneSig to be a small fraction, setting to null to illustrate problem
)
m1 <- glm(y~xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean + xGood2_clean,
data=dTrainTreated,family=binomial(link='logit'))
print(summary(m1))
##
## Call:
## glm(formula = y ~ xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean +
## xGood2_clean, family = binomial(link = "logit"), data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5853 -0.9177 -0.6876 1.1651 2.3241
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.73798 0.12720 -5.802 6.56e-09 ***
## xBad1_catB -0.02380 0.02637 -0.903 0.367
## xBad2_catB -0.02495 0.02608 -0.957 0.339
## xBad3_catB 0.02058 0.02508 0.821 0.412
## xGood1_clean 0.39234 0.08632 4.545 5.49e-06 ***
## xGood2_clean 0.56252 0.09673 5.816 6.04e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.55 on 642 degrees of freedom
## Residual deviance: 769.28 on 637 degrees of freedom
## AIC: 781.28
##
## Number of Fisher Scoring iterations: 4
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
## pred
## truth FALSE TRUE
## FALSE 378 40
## TRUE 157 68
## [1] "accuracy 0.693623639191291"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
## pred
## truth FALSE TRUE
## FALSE 422 52
## TRUE 149 45
## [1] "accuracy 0.699101796407186"
Notice above that we now see training and test accuracies of 70%. We have defeated overfit in two ways: training performance is closer to test performance, and test performance is better. Also we see that the model now properly considers the “bad” variables to be insignificant.
Below is a more statistically efficient practice: building a cross training frame.
Consider any trained statistical model (in this case our treatment plan and variable selection plan) as a two-argument function f(A,B). The first argument is the training data and the second argument is the application data. In our case f(A,B) is: designTreatmentsC(A) %>% prepare(B)
, and it produces a treated data frame.
When we use the same data in both places to build our training frame, as in
TrainTreated = f(TrainData,TrainData),
we are not doing a good job simulating the future application of f(,), which will be f(TrainData,FutureData).
To improve the quality of our simulation we can call
TrainTreated = f(CalibrationData,TrainData)
where CalibrationData and TrainData are disjoint datasets (as we did in the earlier example) and expect this to be a good imitation of future f(CalibrationData,FutureData).
Another approach is to build a “cross validated” version of f. We split TrainData into a list of 3 disjoint row intervals: Train1,Train2,Train3. Instead of computing f(TrainData,TrainData) compute:
TrainTreated = f(Train2+Train3,Train1) + f(Train1+Train3,Train2) + f(Train1+Train2,Train3)
(where + denotes rbind()
).
The idea is this looks a lot like f(TrainData,TrainData) except it has the important property that no row in the right-hand side is ever worked on by a model built using that row (a key characteristic that future data will have) so we have a good imitation of f(TrainData,FutureData).
In other words: we use cross validation to simulate future data. The main thing we are doing differently is remembering that we can apply cross validation to any two argument function f(A,B) and not only to functions of the form f(A,B) = buildModel(A) %>% scoreData(B)
. We can use this formulation in stacking or super-learning with f(A,B) of the form buildSubModels(A) %>% combineModels(B)
(to produce a stacked or ensemble model); the idea applies to improving ensemble methods in general.
See:
In fact (though it was developed independently) you can think of vtreat as a superlearner.
In super learning cross validation techniques are used to simulate having built sub-model predictions on novel data. The simulated out of sample-applications of these sub models (and not the sub models themselves) are then used as input data for the next stage learner. In future application the actual sub-models are applied and their immediate outputs is used by the super model.
In vtreat the sub-models are single variable treatments and the outer model construction is left to the practitioner (using the cross-frames for simulation and not the treatmentplan). In application the treatment plan is used.
Below is the example cross-run. The function mkCrossFrameCExperiment
returns a treatment plan for use in preparing future data, and a cross-frame for use in fitting a model.
dTrain <- d[d$rgroup!='test',,drop=FALSE]
dTest <- d[d$rgroup=='test',,drop=FALSE]
prep <- vtreat::mkCrossFrameCExperiment(dTrain,
c('xBad1','xBad2','xBad3','xGood1','xGood2'),
'y',TRUE,
rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
dTrainTreated <- prep$crossFrame
treatments <- prep$treatments
print(treatments$scoreFrame[,c('varName','lsig','csig')])
## varName lsig csig
## 1 xBad1_catP 8.172932e-01 8.172186e-01
## 2 xBad1_catB 5.675615e-01 5.676541e-01
## 3 xBad2_catP 7.446537e-01 7.441869e-01
## 4 xBad2_catB 5.792325e-01 5.793585e-01
## 5 xBad3_catP 4.356331e-01 4.342227e-01
## 6 xBad3_catB 1.786048e-01 1.770493e-01
## 7 xGood1_clean 6.529637e-12 6.072599e-12
## 8 xGood2_clean 8.584085e-21 8.286789e-21
Now fit the model to the cross-frame rather than to prepare(treatments, dTrain)
(the treated training data).
m1 <- glm(y~xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean + xGood2_clean,
data=dTrainTreated,family=binomial(link='logit'))
print(summary(m1))
##
## Call:
## glm(formula = y ~ xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean +
## xGood2_clean, family = binomial(link = "logit"), data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6842 -0.9236 -0.6573 1.1824 2.3257
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.690824 0.091313 -7.565 3.87e-14 ***
## xBad1_catB 0.001813 0.017218 0.105 0.916
## xBad2_catB -0.023835 0.017128 -1.392 0.164
## xBad3_catB 0.024460 0.016978 1.441 0.150
## xGood1_clean 0.404827 0.061885 6.542 6.09e-11 ***
## xGood2_clean 0.570083 0.064988 8.772 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1724.6 on 1331 degrees of freedom
## Residual deviance: 1587.7 on 1326 degrees of freedom
## AIC: 1599.7
##
## Number of Fisher Scoring iterations: 4
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
## pred
## truth FALSE TRUE
## FALSE 776 90
## TRUE 335 131
## [1] "accuracy 0.680930930930931"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
## pred
## truth FALSE TRUE
## FALSE 421 53
## TRUE 145 49
## [1] "accuracy 0.703592814371258"
The model fit to the cross-frame behaves similarly to the model produced via the process f(CalibrationData, TrainData).
At first glance nested models seem like they should be anathema. Using data to build a model and then applying the model or transform to that same data breaks the exchangeability that statistical machine learning depends on for correct behavior. It leads to overfit. The overfit can be big (where you have a chance to notice it) or small (where you miss it, but have unknowingly have somewhat inferior models). However when one looks further we see such nested procedures are already common statistical practice:
carret::preProcess()
).Our point is: the above procedures are useful, but they are strictly correct only when a disjoint set of calibration data is used for the preparation design (and then never re-used in training, test, or later application). The strictness is taught and remembered for the marquee steps (such as model fitting and evaluation), and sometimes forgotten for the "safe steps" (such as principal components determination).
In the age of "big data" the statistical inefficiency of losing some data is far less than the statistical inefficiency of breaking your exchangeability. The recommended experimental design is similar to the Cal/Train/Test split taught in "The Elements of Statistical Learning" , 2nd edition, Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie (though most practitioners squander the technique on needless hyper-parameter noodling).
We can build better models by making sure every bit of data is only used once. We already know not to use data we train on in scoring (as it damages model quality estimates with an undesired upward bias in quality estimate), but beyond that the precaution is not applied often enough. We now call out this as a general procedure: we should (in principle) never use any of our data twice (even during training). For example: any data used in variable conditioning or dimension reduction should not be re-used during model construction. If we have a lot of data (i.e. in the big data regime) this is not a problem. If we do not have enough data for this discipline we should simulate it through cross-validation procedures.
I’ll restate my points here:
Beyond correct statistical practice there is evidence that "read once" procedures (either using each instance of randomness only once as in N. Nisan, "On read-once vs. multiple access to randomness in logspace," Structure in Complexity Theory Conference, 1990, Proceedings., Fifth Annual, Barcelona, 1990, pp. 179-184. doi: 10.1109/SCT.1990.113966 or data only once) are of bounded power, which is yet again an opportunity for improving generalization.
Let’s illustrate the ideas with a simple nested modeling procedure in R. Our nested operation is the simple: scoreModel(buildModel(data))
. Or in Magrittr style notation data %>% buildModel %>% scoreModel
. I call the pipe notation out as Dr. Nina Zumel noticed there is a good opportunity for pipe notation in designing data treatment suggesting an opportunity for good formal tools and methods that automate cross-validation based simulations of fresh data. Another way to simulate fresh data involves the use of differential privacy, and this too could be automated.
On to our example:
# supply uniform interface fit and predict.
# to use other fitters change these functions
fitModel <- function(data,formula) {
lm(formula,data)
}
applyModel <- function(model,newdata) {
predict(model,newdata=newdata)
}
# down stream application, in our case computing
# unadjusted in-sample R^2. In super learning
# could be a derived model over many input columns.
rsq <- function(pred,y) {
1-sum((y-pred)^2)/sum((y-mean(y))^2)
}
set.seed(2352356)
# example data, intentionally no relation
d <- data.frame(x=rnorm(5),y=rnorm(5))
Standard "fit and apply" pattern.
d %>% fitModel(y~x) -> modelToReturn
modelToReturn %>% applyModel(newdata=d) -> predictions
# Unadjusted R^2 above zero (misleading). Diliberately non adjusted so we can see the problem.
rsq(predictions,d$y)
## [1] 0.4193942
Define a general procedure for simulated out of sample results by cross validating for any model that defines a fitModel, applyModel pair. The idea is simulateOutOfSampleTrainEval is going to simulate having used fresh data (disjoint from our training example) through cross validation methods. This is a very general and powerful procedure which should be applied more often (such as in controlling principal components analysis, variable significance estimation, and empirical Bayes prior/hyper-parameter estimation).
#' Simulate out of sample fitting and application.
#'
#' @param d data.frame to work with
#' @param modelsToFit list of list(fitModel,formula,applyModel,modelName) triples to apply
#' @return data frame with derived predictions (in cross-validated manner to simulate out of sample training and application).
#'
simulateOutOfSampleTrainEval <- function(d,modelsToFit) {
eSets <- vtreat::buildEvalSets(nrow(d))
preds <- lapply(modelsToFit,
function(pi) {
# could parallelize the next step
evals <- lapply(eSets,
function(ei) {
d[ei$train,] %>% pi$fitModel(pi$formula) %>%
pi$applyModel(d[ei$app,])
})
# re-assemble results into original row order
pred <- numeric(nrow(d))
for(eii in seq_len(length(eSets))) {
pred[eSets[[eii]]$app] <- evals[[eii]]
}
pred <- data.frame(x=pred,stringsAsFactors = FALSE)
colnames(pred) <- pi$modelName
pred
})
dplyr::bind_cols(preds)
}
Cross-validated fit and apply pattern (safe for nesting models, as in variable treatment or in super learning). With the above function these cross-validated procedures are not harder to apply that standard in-sample procedures (though there is some runtime cost).
modelsToFit <- list(
list(fitModel=fitModel,
formula=y~x,
applyModel=applyModel,
modelName='linearRegression'))
d %>% fitModel(y~x) -> modelToReturn
d %>% simulateOutOfSampleTrainEval(modelsToFit) -> predout
# Out of sample R^2 below zero, not misleading.
rsq(predout$linearRegression,d$y)
## [1] -0.568004
In a super learning context we would use simulateOutOfSampleTrainEval to fit a family of models and assemble their results into a data frame for additional fitting.
For nested modeling (or stacking / super-learning) the above procedure looks like the following.
Data-adaptive variable preparation is also essentially modeling. So any modeling that involves such preparation is essentially a nested model. Proper training procedures for nested models involves different (or fresh) data for each stage or simulating such data through cross-validation methods.
For data treatment the procedure looks like the following.
vtreat implements this directly through its mkCrossFrameCExperiment and mkCrossFrameNExperiment methods (and the development version exposes the buildEvalSets method we used in our explicit examples here).
]]>vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner.
Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, and so on) or standard statistical methods (regression, generalized regression, generalized additive models) there are common data issues that can cause modeling to fail. vtreat deals with a number of these in a principled and automated fashion.
In particular vtreat emphasizes a concept called “y-aware pre-processing” and implements:
scale
argument to vtreat::prepare()
, using some of the ideas discussed here). This allows correct/sensible application of principal component analysis pre-processing in a machine learning context.The idea is: even with a sophisticated machine learning algorithm there are many ways messy real world data can defeat the modeling process, and vtreat helps with at least ten of them. We emphasize: these problems are already in your data, you simply build better and more reliable models if you attempt to mitigate them. Automated processing is no substitute for actually looking at the data, but vtreat supplies efficient, reliable, documented, and tested implementations of many of the commonly needed transforms.
To help explain the methods we have prepared some documentation:
vtreat::prepare(scale=TRUE)
.This all can seem pretty complicated: but we feel the complication is coming from the delicacy of the necessary task (preparing data for modeling) and not from the procedures.
]]>All code and examples can be found here and in WVPlots.
For those that don’t want to watch the video or follow links, below is the (non-printing) version of the wrapper.
#' Capture arguments of exception throwing plot for later debugging.
#'
#' Run fn, save arguments on failure.
#'
#' @param saveFile path to save RDS to.
#' @param fn function to call
#' @param ... arguments for fn
#' @return fn(...) normally, but if f(...) throws an exception,
#' save to saveFile RDS of list r such that do.call(r$fn,r$args)
#' repeats the call to fn with args.
#'
DebugFn <- function(saveFile,fn,...) {
args <- list(...)
tryCatch({
res = do.call(fn,args)
res
},
error = function(e) {
saveRDS(object=list(fn=fn,args=args),file=saveFile)
stop(paste0("Wrote '",saveFile,"' on catching '",as.character(e),"'"))
})
}
And how to use it.
> f <- function(a,b) { a[[b]] }
> f(0,0)
Error in a[[b]] : attempt to select less than one element
> DebugFn('example.RDS','f',0,0)
Error in value[[3L]](cond) :
Wrote 'example.RDS' on catching 'Error in a[[b]]: attempt to select less than one element'
> p <- readRDS('example.RDS')
> print(p)
$fn
[1] "f"
$args
$args[[1]]
[1] 0
$args[[2]]
[1] 0
> do.call(p$fn,p$args)
Error in a[[b]] : attempt to select less than one element
>
If you are using RStudio you may need to toggle the RStudo->Debug->”On Error” message settings to “Message Only” on the capture run (see below).
However, for context I strongly recommend watching the short video and checking the included resources. Also, please check here and here for more ideas.
]]>Here is the discount count in Tweetable form (please Tweet/share!):
Deal of the Day April 9: Half off my book Practical Data Science with R. Use code
dotd040916au
at https://www.manning.com/books/practical-data-science-with-r
In celebration of this we are offering our video instruction course Introduction to Data Science (Nina Zumel, John Mount 2015) is also half off with “code C3
” (https://www.udemy.com/introduction-to-data-science/?couponCode=C3).