From the feedback, I am not sure everybody noticed that, in addition to being easy and effective, the method is actually novel (we haven’t yet found an academic reference to it, or seen it already in use at the numerous clients we have visited). Likely it has been applied before (as it is a simple method), but it is not currently considered a standard method (something we would like to change).
In this note I’ll discuss some of the context of y-aware scaling.
y-aware scaling is a transform that has been available as “scale mode” in the vtreat R package since before the first public release on August 7, 2014 (derived from earlier proprietary work). It was always motivated by a “dimensional analysis” or “get the units consistent” argument. It is intended as a pre-processing step before operations that are metric sensitive, such as KNN classification and principal components regression. We didn’t really work on proving theorems about it, because in certain contexts it can be recognized as “the right thing to do.” It derives from considering inputs (independent variables or columns) as single-variable models, and the combining of such variables as a nested model or ensemble model construction (chapter 6 of Practical Data Science with R, Nina Zumel and John Mount, Manning 2014, was somewhat organized with this idea behind the scenes). Considering y (the outcome to be modeled) during dimension reduction prior to predictive modeling is a natural concern, but it seems to be anathema in principal components analysis.
y-aware scaling is in fact simple: it involves multiplying each input by the slope coefficient of its single-variable linear regression against the outcome (for a regression problem) or of its single-variable logistic regression (for a classification problem). This is different from multiplying by the outcome y itself, which would not be available during the application phase of a predictive model. The fact that it is simple makes it a bit hard to accept that it is both effective and novel. We are not saying it is unprecedented, but it is certainly not central in the standard literature (despite being an easy and effective technique).
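To make the mechanics concrete, here is a minimal hand-rolled sketch of the regression case (an illustration of the idea only, not the vtreat implementation; in practice use vtreat’s scale mode, which also handles categorical variables and significance estimates — the function name `yAwareScale` below is my own):

```r
# Hand-rolled y-aware scaling sketch (illustration only, not vtreat's code).
# Each numeric input is centered, then multiplied by the slope of its
# single-variable linear regression against y, so that one unit of the
# scaled variable corresponds to about one unit of expected change in y.
yAwareScale <- function(d, yName) {
  xNames <- setdiff(colnames(d), yName)
  y <- d[[yName]]
  scaled <- lapply(xNames, function(xi) {
    x <- d[[xi]]
    beta <- coef(lm(y ~ x))[['x']]  # slope of the single-variable fit
    beta * (x - mean(x))            # center, then scale into y-units
  })
  names(scaled) <- xNames
  as.data.frame(scaled)
}

# tiny made-up example
d <- data.frame(x1 = c(1, 2, 3, 4, 5), x2 = c(10, 30, 20, 50, 40))
d$y <- 2 * d$x1 + 0.1 * d$x2
dScaled <- yAwareScale(d, 'y')
```

A consequence of this scaling is that each transformed column regresses onto y with slope 1, so Euclidean distances in the transformed space are expressed in units of change in y.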
There is an extensive literature on scaling, filtering, transforming, and pre-conditioning data for principal components analysis (for example see “Centering, scaling, and transformations: improving the biological information content of metabolomics data”, Robert A van den Berg, Huub CJ Hoefsloot, Johan A Westerhuis, Age K Smilde, and Mariët J van der Werf, BMC Genomics 2006, 7:142). However, these are all what we call x-only transforms.
When you consult references (such as The Elements of Statistical Learning, 2nd edition, Trevor Hastie, Robert Tibshirani, Jerome Friedman, Springer 2009; and Applied Predictive Modeling, Max Kuhn, Kjell Johnson, Springer 2013) you basically see only two y-sensitive principal components style techniques (in addition to recommendations to use regularized regression):
I would like to repeat (it is already implied in Nina’s article): y-aware scaling is not equivalent to either of these methods.
Supervised PCA is simply pruning the variables by inspecting small regressions prior to the PCA steps. To my mind, the fact that in 2006 one could get a publication by encouraging this natural step and giving it a name makes our point: principal components users do not consider using the outcome or y-variable in their data preparation. I’ll repeat: filtering and pruning variables is common in many forms of data analysis, so it is remarkable how much work was required to sell the idea of supervised PCA.
Partial Least Squares Regression is an interesting y-aware technique, but it is a different (and more complicated) technique than y-aware scaling. Here is an example (in R) showing the two methods having very different performance on an (admittedly artificial) problem: PLS.md.
In conclusion, I encourage you to take the time to read up on y-aware scaling and consider using it during your dimension reduction steps prior to predictive modeling.
After reading the article we have a few follow-up thoughts on the topic.
Our group has written on the use of differential privacy to improve machine learning algorithms (by slowing down the exhaustion of novelty in your data):
However, these are situations without competing interests: we are just trying to build a better model. What about the original application of differential privacy: trading modeling effectiveness against protecting those one has collected data on? Is un-audited differential privacy an effective protection, or is it a fig-leaf that merely checks off data privacy regulations?
A few of the points to ponder:
We’ll end with: we think the applications of differential privacy techniques to improving machine performance are still the most promising applications, as they don’t have the difficulty of trying to serve competing interests (modeling effectiveness versus privacy). A great example of this is the fascinating paper “The Ladder: A Reliable Leaderboard for Machine Learning Competitions” by Avrim Blum and Moritz Hardt. I’d like to think that clever applications such as the preceding drive the current (post-2015) interest in the topic of differential privacy. But it looks like all anyone cares about is Apple’s announcement.
What I am trying to say: claiming the use of differential privacy should not be a “get out of regulation free card.” At best it is a tool that can be part of implementing privacy protection, and one that definitely requires ongoing detailed oversight and auditing.
Win-Vector LLC’s Dr. Nina Zumel has a three part series on Principal Components Regression that we think is well worth your time.
You can read part 1 here. Principal Components Regression (PCR) is the use of Principal Components Analysis (PCA) as a dimension reduction step prior to linear regression. It is one of the best known dimensionality reduction techniques and a staple procedure in many scientific fields. PCA is used because:
We often find ourselves having to remind readers that this last reason is not actually a positive. The standard derivation of PCA involves trotting out the math and showing the determination of eigenvector directions. It yields visually attractive diagrams such as the following.
Wikipedia: PCA
And this leads to a deficiency in much of the teaching of the method: glossing over the operational consequences and outcomes of applying the method. The mathematics is important to the extent it allows you to reason about the appropriateness of the method, the consequences of the transform, and the pitfalls of the technique. The mathematics is also critical to the correct implementation, but that is what one hopes is already supplied in a reliable analysis platform (such as R). Dr. Zumel uses the expressive and graphical power of R to work through the use of Principal Components Regression in an operational series of examples. She works through how Principal Components Regression is typically mis-applied, and continues on to how to correctly apply it. Taking the extra time to work through the all too common errors allows her to demonstrate and quantify the benefits of correct technique. Dr. Zumel will follow part 1 with a shorter part 2 article demonstrating important "y-aware" techniques that squeeze much more modeling power out of your data in predictive analytic situations (which is what regression actually is). Some of the methods are already in the literature, but are still not used widely enough. We hope the demonstrated techniques and included references will give you a perspective to improve how you use or even teach Principal Components Regression. Please read on here.
In part 2 of her series on Principal Components Regression Dr. Nina Zumel illustrates so-called y-aware techniques. These often neglected methods use the fact that for predictive modeling problems we know the dependent variable, outcome or y, so we can use this during data preparation in addition to using it during modeling. Dr. Zumel shows the incorporation of y-aware preparation into Principal Components Analyses can capture more of the problem structure in fewer variables. Such methods include:
This recovers more domain structure and leads to better models. Using the foundation set in the first article Dr. Zumel quickly shows how to move from a traditional x-only analysis that fails to preserve a domain-specific relation of two variables to outcome to a y-aware analysis that preserves the relation. Or in other words how to move away from a middling result where different values of y (rendered as three colors) are hopelessly intermingled when plotted against the first two found latent variables as shown below.
Dr. Zumel shows how to perform a decisive analysis where y is somewhat sortable by each of the first two latent variables, and the first two latent variables capture complementary effects, making them good mutual candidates for further modeling (as shown below).
Click here (part 2 y-aware methods) for the discussion, examples, and references. Part 1 (x only methods) can be found here.
In her series on principal components analysis for regression in R Win-Vector LLC‘s Dr. Nina Zumel broke the demonstration down into the following pieces:
In the earlier parts Dr. Zumel demonstrates common poor practice versus best practice and quantifies the degree of available improvement. In part 3 she moves from the usual "pick the number of components by eyeballing it" non-advice and teaches decisive decision procedures. For picking the number of components to retain for analysis there are a number of standard techniques in the literature including:
Dr. Zumel shows that the last method (designing a formal statistical test) is particularly easy to encode as a permutation test in the y-aware setting (there is also an obvious, similarly good, bootstrap test). This is well-founded and pretty much state of the art. It is also a great example of why to use a scriptable analysis platform (such as R), as it is easy to wrap arbitrarily complex methods into functions and then directly perform empirical tests on these methods. The "broken stick" style test below yields a graph identifying five principal components as significant:
However, Dr. Zumel goes on to show that in a supervised learning or regression setting we can further exploit the structure of the problem and replace the traditional component magnitude tests with simple model fit significance pruning. The significance method in this case gets the stronger result of finding the two principal components that encode the known even and odd loadings of the example problem:
In fact that is sort of her point: significance pruning either on the original variables or on the derived latent components is enough to give us the right answer. In general we get much better results when (in a supervised learning or regression situation) we use knowledge of the dependent variable (the "y" or outcome) and do all of the following:
The above will become much clearer and much more specific if you click here to read part 3.
Exploring Data Science gives you a free sample of important data science topics chosen from great Manning books. Each chapter was chosen by John Mount and Nina Zumel and includes a brief orientation/introduction. The topics are:
This 191 page e-book is free, but only officially licensed/available from manning.com. To get your free PDF click here and use Manning’s online shopping system. You will have to enter your email and other details, but Manning Publications is a reputable vendor well worth having an account with.
Please check it out! And please help us promote this fun offering by posting, sharing, and Tweeting.
Update: in addition to the excerpted chapters and new introductions, the free e-book contains special discount codes for any of the books mentioned! So this is really something to consider if you want to deepen or broaden your data science knowledge.
geom_step
is an interesting geom supplied by the R package ggplot2. It is an appropriate rendering option for financial market data and we will show how and why to use it in this article.
Let’s take a simple example of plotting market data. In this case we are plotting the "ask price" (the publicly published price an item is available for purchase at a given time), the "bid price" (the publicly published price an item can be sold for at a given time), and "trades" (past purchases and sales).
Most markets maintain these "quoted" prices as an order book and the public ask price is always greater than the public bid price (else we would have a "crossed market"). We can also track recent transactions or trades. Here is some example (made-up) data.
print(quotes)
## quoteTime date askPrice bidPrice
## 1 2016-01-04 09:14:00 2016-01-04 10.81 10.69
## 2 2016-01-04 11:45:17 2016-01-04 11.09 10.68
## 3 2016-01-04 15:25:00 2016-01-04 12.32 12.03
## 4 2016-01-05 10:12:13 2016-01-05 14.33 13.69
## 5 2016-01-06 09:02:00 2016-01-06 17.17 16.20
## 6 2016-01-06 15:10:00 2016-01-06 18.86 18.35
## 7 2016-01-06 15:27:00 2016-01-06 20.89 20.32
print(trades)
## tradeTime date tradePrice quantity
## 1 2016-01-04 09:14:00 2016-01-04 10.81 600
## 2 2016-01-04 11:45:17 2016-01-04 10.68 500
## 6 2016-01-06 15:10:00 2016-01-06 18.35 200
## 7 2016-01-06 15:27:00 2016-01-06 20.89 200
Notice each revision of the book (notification of a bid price, ask price, or both) happens at a specific time. Ask and bid prices are good until they are revised or withdrawn.
There is some subtlety as to what the "price" of a financial instrument (say, in this case, a stock) actually is.
Money only changes hands on trades, so past quotes that were never "hit" or traded against in some sense never happened (in fact abusing this is becoming a problem called "flashing"). Market participants can somewhat manipulate bids and asks as long as they don’t cross. Asks and bids represent risk, or a one-sided opinion on price, but cannot be trusted (especially when the "bid/ask gap" is very large).
Trades cost fees and transfer money, so they are evidence of two parties agreeing on price for a moment. But all trades you know about are in the past. Just because somebody purchased some shares of IBM in the past for $120 a share doesn’t mean you can do the same. You could only make such a purchase if there is an appropriate ask price in the market (or you place your own limit order forming a bid that somebody else hits).
What I am trying to say is the classic "ticker tape pattern" graph shown below drawing only trades and connecting them with sloping lines is not appropriate for plotting markets (especially when plotting high frequency or in-day data).
ggplot(data=trades,aes(x=tradeTime,y=tradePrice)) +
geom_line() + geom_point()
There is a lot wrong with such graphs.
geom_smooth
is also not appropriate, as the defaults use data from the past and future to perform the smoothing (it should instead use a trailing window, such as exponential smoothing). (Side note: if anybody has some good code to make geom_smooth
perform exponential smoothing in all cases, including grouping and facets, I would really like a copy. Right now I have to join in smoothed data as a new column, as I have never completely grokked all of the implementation interface requirements for new ggplot2
statistics in their full production complexity.)
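The join-in-a-column workaround mentioned above can be sketched as follows (my own illustrative helper `expSmooth` and smoothing constant, not production code):

```r
# Trailing (causal) exponential smoothing: each smoothed value depends only
# on the current and past observations, never on the future.
expSmooth <- function(x, alpha = 0.3) {
  s <- x
  for (i in seq_along(x)[-1]) {
    s[i] <- alpha * x[i] + (1 - alpha) * s[i - 1]
  }
  s
}

# usage sketch (column names from the trades example in this article):
#   trades$smoothPrice <- expSmooth(trades$tradePrice)
#   ggplot(trades, aes(x = tradeTime)) +
#     geom_point(aes(y = tradePrice)) +
#     geom_line(aes(y = smoothPrice))
```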
If all that seems complicated, scary, unpleasant and technical: that is the right way to think. Markets are not safe, simple, or pleasant. They can be reasoned about and worked with, but it is wrong to think they are simple or easy.
An (unfortunately) more complicated (and slightly less legible) graph is needed to try to faithfully present the information. Since asks and bids are good until withdrawn or revised, we render them with a step shape (such as generated by ggplot2::geom_step
) and since trades happen only at a single time (and are not a promise going forward) we render them with points. Such a graph is given below.
ggplot() +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice))) +
ylab('price') + xlab('time') + scale_y_log10(breaks=breaks)
The step functions propagate flat lines forward from quote revisions, correctly indicating what ask price and bid price were in effect at all times. Trades are shown as dots since they have no propagation. Each item drawn on the graph at a given time was actually known by that time (so a person or trading strategy would also have had access to such information at that time).
Trades that occur nearer the ask price can be considered "buyer initiated" and trades that occur nearer the bid price can be considered "seller initiated", which we can indicate through color.
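The lastKnownValue() helper used below is not shown in this excerpt; a plausible minimal version (argument order inferred from the call site, and assuming numeric or POSIXct times) might be:

```r
# Possible implementation of the lastKnownValue() helper (an inferred sketch,
# not the original): for each lookup time, return the value attached to the
# most recent observation time at or before it, or `default` if none exists.
lastKnownValue <- function(default, obsTimes, obsValues, lookupTimes) {
  ord <- order(obsTimes)
  obsTimes <- obsTimes[ord]
  obsValues <- obsValues[ord]
  # findInterval() gives the index of the last obsTime <= each lookupTime
  # (0 when there is none)
  idx <- findInterval(lookupTimes, obsTimes)
  ifelse(idx >= 1, obsValues[pmax(idx, 1)], default)
}
```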
mids <- (lastKnownValue(NA,quotes$quoteTime,quotes$askPrice,trades$tradeTime)+
lastKnownValue(NA,quotes$quoteTime,quotes$bidPrice,trades$tradeTime))/2
trades$type <- ifelse(trades$tradePrice>=mids,'buy','sell')
ggplot() +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice,color=type))) +
ylab('price') + xlab('time') + scale_y_log10(breaks=breaks) +
scale_color_brewer(palette = 'Dark2')
This is a good time to point out a problem in these graphs. We are mostly plotting times when the market is closed. Most of the space is wasted. In the graph below we indicate (fictitious) market hours by shading the "market open hours" to illustrate the issue.
print(openClose)
## date time what askPrice bidPrice
## 1 2016-01-04 2016-01-04 09:00:00 open NA NA
## 2 2016-01-04 2016-01-04 15:30:00 close 12.32 12.03
## 3 2016-01-05 2016-01-05 09:00:00 open 12.32 12.03
## 4 2016-01-05 2016-01-05 15:30:00 close 14.33 13.69
## 5 2016-01-06 2016-01-06 09:00:00 open 14.33 13.69
## 6 2016-01-06 2016-01-06 15:30:00 close 20.89 20.32
openClose %>% select(date,time,what) %>% spread(what,time) -> marketHours
ggplot() +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice,color=type))) +
geom_rect(data=marketHours,
mapping=aes(xmin=open,xmax=close,ymin=0,ymax=Inf),
fill='blue',alpha=0.3) +
ylab('price') + xlab('time') + scale_y_log10(breaks=breaks) +
scale_color_brewer(palette = 'Dark2')
The easiest way to fix this in ggplot2
would be to use facet_wrap
, but this crashes (at least for ggplot2
version 2.1.0
current on CRAN 2016-06-03) with a very cryptic error message, as shown below.
ggplot() +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice,color=type))) +
facet_wrap(~date,scale='free_x') +
ylab('price') + xlab('time') + scale_color_brewer(palette = 'Dark2')
## Error in grid.Call.graphics(L_lines, x$x, x$y, index, x$arrow): invalid line type
Despite the message "invalid line type", the error is not caused by the user’s selection of linetype. It is easier to see what is going on if we replace geom_step
with geom_line
as we show below.
ggplot() +
geom_line(data=quotes,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_line(data=quotes,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice,color=type))) +
facet_wrap(~date,scale='free_x') +
ylab('price') + xlab('time') + scale_y_log10(breaks=breaks) +
scale_color_brewer(palette = 'Dark2')
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
The above graph is now using sloped lines to connect ask price and bid price revisions (giving the false impression that these intermediate prices were ever available, and essentially "leaking information from the future" into the visual presentation). However, we get a graph and a more reasonable warning message: "geom_path: Each group consists of only one observation." There was only one quote revision on 2016-01-05, and since facet_wrap
treats each facet as a sub-graph (and not as a portal into a single larger graph), days with fewer than 2 quote revisions have trouble drawing paths. This trouble causes the (deceptive) blank facet for 2016-01-05 when we use simple sloped lines (geom_line
) and seems to error out on the more complicated geom_step
.
In my opinion geom_step
should "fail a bit more gently" on this example (as geom_line
already does). In any case the correct domain-specific fix is to regularize the data a bit by adding market open and close information. In many markets the open and closing prices are set by specific mechanisms (such as an opening auction and a closing volume- or time-weighted average). For our example we will just use the last known price (which we have already prepared).
openClose %>% mutate(quoteTime=time) %>%
bind_rows(quotes) %>%
arrange(time) %>%
select(date,askPrice,bidPrice,quoteTime) -> joinedData
ggplot() +
geom_step(data=joinedData,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_step(data=joinedData,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice,color=type))) +
facet_wrap(~date,scale='free_x') +
ylab('price') + xlab('time') + scale_y_log10(breaks=breaks) +
scale_color_brewer(palette = 'Dark2')
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).
The above graph is pretty good. In fact easily producing a graph like this in R using dygraphs
is currently an open issue.
In previous writings we have gone to great lengths to document, explain and motivate vtreat
. That necessarily gets long, and can unnecessarily make things feel complicated.
In this example we are going to show what building a predictive model using vtreat
best practices looks like, assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but we do want to show what small effort is required to add vtreat
to your predictive modeling practice.
First we set things up: load libraries, initialize parallel processing.
library('vtreat')
library('caret')
library('gbm')
library('doMC')
library('WVPlots') # see https://github.com/WinVector/WVPlots
# parallel for vtreat
ncores <- parallel::detectCores()
parallelCluster <- parallel::makeCluster(ncores)
# parallel for caret
registerDoMC(cores=ncores)
Then we load our data for analysis. We are going to build a model predicting an income level from other demographic features. The data is taken from here, and you can perform all of the demonstrated steps if you download the contents of the example git directory. Obviously this has a lot of moving parts (R, R Markdown, GitHub, R packages, devtools), but it is very easy to do a second time (the first time can involve a bit of learning and preparation).
# load data
# data from: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/
colnames <-
c(
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'capital-gain',
'capital-loss',
'hours-per-week',
'native-country',
'class'
)
dTrain <- read.table(
'adult.data.txt',
header = FALSE,
sep = ',',
strip.white = TRUE,
stringsAsFactors = FALSE,
na.strings = c('NA', '?', '')
)
colnames(dTrain) <- colnames
dTest <- read.table(
'adult.test.txt',
skip = 1,
header = FALSE,
sep = ',',
strip.white = TRUE,
stringsAsFactors = FALSE,
na.strings = c('NA', '?', '')
)
colnames(dTest) <- colnames
Now we use vtreat
to prepare the data for analysis. The goal of vtreat is to ensure a ready-to-dance data frame in a statistically valid manner. We are respecting the test/train split and building our data preparation plan only on the training data (though we do apply it to the test data). This step helps with a huge number of potential problems through automated repairs:
# define problem
yName <- 'class'
yTarget <- '>50K'
varNames <- setdiff(colnames,yName)
# build variable encoding plan and prepare simulated out of sample
# training frame (cross-frame)
# http://www.win-vector.com/blog/2016/05/vtreat-cross-frames/
system.time({
cd <- vtreat::mkCrossFrameCExperiment(dTrain,varNames,yName,yTarget,
parallelCluster=parallelCluster)
scoreFrame <- cd$treatments$scoreFrame
dTrainTreated <- cd$crossFrame
# pick our variables
newVars <- scoreFrame$varName[scoreFrame$sig<1/nrow(scoreFrame)]
dTestTreated <- vtreat::prepare(cd$treatments,dTest,
pruneSig=NULL,varRestriction=newVars)
})
## user system elapsed
## 11.340 2.760 30.872
#print(newVars)
Now we train our model. In this case we are using the caret package to tune parameters.
# train our model using caret
system.time({
yForm <- as.formula(paste(yName,paste(newVars,collapse=' + '),sep=' ~ '))
# from: http://topepo.github.io/caret/training.html
fitControl <- trainControl(
method = "cv",
number = 3)
model <- train(yForm,
data = dTrainTreated,
method = "gbm",
trControl = fitControl,
verbose = FALSE)
print(model)
dTest$pred <- predict(model,newdata=dTestTreated,type='prob')[,yTarget]
})
## Stochastic Gradient Boosting
##
## 32561 samples
## 64 predictor
## 2 classes: '<=50K', '>50K'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 21707, 21708, 21707
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8476398 0.5083558
## 1 100 0.8556555 0.5561726
## 1 150 0.8577746 0.5699958
## 2 50 0.8560855 0.5606650
## 2 100 0.8593102 0.5810931
## 2 150 0.8625042 0.5930111
## 3 50 0.8593717 0.5789289
## 3 100 0.8649919 0.6017707
## 3 150 0.8660975 0.6073645
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## user system elapsed
## 61.908 2.227 36.850
Finally we take a look at the results on the held-out test data.
WVPlots::ROCPlot(dTest,'pred',yName,'predictions on test')
WVPlots::DoubleDensityPlot(dTest,'pred',yName,'predictions on test')
confusionMatrix <- table(truth=dTest[[yName]],pred=dTest$pred>=0.5)
print(confusionMatrix)
## pred
## truth FALSE TRUE
## <=50K. 11684 751
## >50K. 1406 2440
testAccuracy <- (confusionMatrix[1,1]+confusionMatrix[2,2])/sum(confusionMatrix)
testAccuracy
## [1] 0.8675143
Notice the achieved test accuracy is in the ballpark of what was reported for this dataset.
(From the [adult.names description](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names).)
Error accuracy reported as follows, after removal of unknowns from train/test sets:
C4.5        : 84.46 +- 0.30
Naive-Bayes : 83.88 +- 0.30
NBTree      : 85.90 +- 0.28
We can also compare accuracy on the "complete cases":
dTestComplete <- dTest[complete.cases(dTest[,varNames]),]
confusionMatrixComplete <- table(truth=dTestComplete[[yName]],
pred=dTestComplete$pred>=0.5)
print(confusionMatrixComplete)
## pred
## truth FALSE TRUE
## <=50K. 10618 742
## >50K. 1331 2369
testAccuracyComplete <- (confusionMatrixComplete[1,1]+confusionMatrixComplete[2,2])/
sum(confusionMatrixComplete)
testAccuracyComplete
## [1] 0.8623506
# clean up
parallel::stopCluster(parallelCluster)
These two scores are within noise bounds of each other, but it is our experience that missingness is often actually informative, so in addition to imputing missing values you would like to preserve some notation indicating the missingness (which vtreat
does in fact do).
And that is all there is to this example. I’d like to emphasize that vtreat steps were only a few lines in one of the blocks of code. vtreat
treatment can take some time, but it is usually bearable. By design it is easy to add vtreat to your predictive analytics projects.
The point is: we got competitive results on real world data, in a single try (using vtreat to prepare data and caret to tune parameters). The job of the data scientist is to actually work longer on a problem and do better. But having a good start helps.
The theory behind vtreat is fairly important to the correctness of our implementation, and we would love for you to read through some of it:
But operationally, please think of vtreat
as just adding a couple of lines to your analysis scripts. Again, the raw R markdown source can be found here and a rendered copy (with results and graphs) here.
Before starting the discussion, let’s quickly redo our y-aware PCA. Please refer to our previous post for a full discussion of this data set and this approach.
#
# make data
#
set.seed(23525)
dTrain <- mkData(1000)
dTest <- mkData(1000)
#
# design treatment plan
#
treatmentsN <- designTreatmentsN(dTrain,
setdiff(colnames(dTrain),'y'),'y',
verbose=FALSE)
#
# prepare the treated frames, with y-aware scaling
#
examplePruneSig = 1.0
dTrainNTreatedYScaled <- prepare(treatmentsN,dTrain,
pruneSig=examplePruneSig,scale=TRUE)
dTestNTreatedYScaled <- prepare(treatmentsN,dTest,
pruneSig=examplePruneSig,scale=TRUE)
#
# do the principal components analysis
#
vars <- setdiff(colnames(dTrainNTreatedYScaled),'y')
# prcomp defaults to scale. = FALSE, but we already
# scaled/centered in vtreat- which we don't want to lose.
dmTrain <- as.matrix(dTrainNTreatedYScaled[,vars])
dmTest <- as.matrix(dTestNTreatedYScaled[,vars])
princ <- prcomp(dmTrain, center = FALSE, scale. = FALSE)
If we examine the magnitudes of the resulting singular values, we see that we should use from two to five principal components for our analysis. In fact, as we showed in the previous post, the first two singular values accurately capture the two unobservable processes that contribute to y, and a linear model fit to these two components captures most of the explainable variance in the data, both on training and on hold-out data.
We picked the number of principal components to use by eye; but it’s tricky to implement code based on the strategy "look for a knee in the curve." So how might we automate picking the appropriate number of components in a reliable way?
Jackson (1993) and Peres-Neto et al. (2005) are two excellent surveys and evaluations of the different published approaches to picking the number of components in standard PCA. Those methods include:
caret::preProcess.
The papers also cover other approaches, as well as different variations of the above.
Kabacoff (R in Action, 2nd Edition, 2015) suggests comparing the magnitudes of the singular values to those extracted from random matrices of the same shape as the original data. Let’s assume that the original data has k variables, and that PCA on the original data extracts the k singular values s_{i} and the k principal components PC_{i}. To pick the appropriate number of principal components:
The idea is that if there is more variation in a given direction than you would expect at random, then that direction is probably meaningful. If you assume that higher variance directions are more useful than lower variance directions (the usual assumption), then one handy variation is to find the first i such that s_{i} < r_{i}, and keep the first i-1 principal components.
This approach is similar to what the authors of the survey papers cited above refer to as the broken-stick method. In their research, the broken-stick method was among the best performing approaches for a variety of simulated and real-world examples.
With the proper adjustment, all of the above heuristics work as well in the y-adjusted case as they do with traditional x-only PCA.
Since in our case we know y, we can — and should — take advantage of this information. We will use a variation of the broken-stick method, but rather than comparing our data to a random matrix, we will compare our data to alternative datasets where x has no relation to y. We can do this by randomly permuting the y values. This preserves the structure of x — that is, the correlations and relationships of the x variables to each other — but it changes the units of the problem, that is, the y-aware scaling. We are testing whether or not a given principal component appears more meaningful in a metric space induced by the true y than it does in a random metric space, one that preserves the distribution of y, but not the relationship of y to x.
You can read a more complete discussion of permutation tests and their application to variable selection (significance pruning) in this post.
In our example, we’ll use N=100 experiment replications, and rather than using the means of the singular values from our experiments as the thresholds, we’ll use the 98th percentiles. This represents a threshold value that a singular value induced in a random space would exceed only about one over the number of variables (1/50 = 0.02) of the time.
#
# Resample y, do y-aware PCA,
# and return the singular values
#
getResampledSV = function(data,yindices) {
# resample y
data$y = data$y[yindices]
# treatment plan
treatplan = vtreat::designTreatmentsN(data,
setdiff(colnames(data), 'y'),
'y', verbose=FALSE)
# y-aware scaling
dataTreat = vtreat::prepare(treatplan, data, pruneSig=1, scale=TRUE)
# PCA
vars = setdiff(colnames(dataTreat), 'y')
dmat = as.matrix(dataTreat[,vars])
princ = prcomp(dmat, center=FALSE, scale=FALSE)
# return the magnitudes of the singular values
princ$sdev
}
#
# Permute y, do y-aware PCA,
# and return the singular values
#
getPermutedSV = function(data) {
n = nrow(data)
getResampledSV(data,sample(n,n,replace=FALSE))
}
#
# Run the permutation tests and collect the outcomes
#
niter = 100 # should be >> nvars
nvars = ncol(dTrain)-1
# matrix: 1 column for each iter, nvars rows
svmat = vapply(1:niter, FUN=function(i) {getPermutedSV(dTrain)}, numeric(nvars))
rownames(svmat) = colnames(princ$rotation) # rows are principal components
colnames(svmat) = paste0('rep',1:niter) # each col is an iteration
# plot the distribution of values for the first singular value
# compare it to the actual first singular value
ggplot(as.data.frame(t(svmat)), aes(x=PC1)) +
geom_density() + geom_vline(xintercept=princ$sdev[[1]], color="red") +
ggtitle("Distribution of magnitudes of first singular value, permuted data")
Here we show the distribution of the magnitude of the first singular value on the permuted data, and compare it to the magnitude of the actual first singular value (the red vertical line). We see that the actual first singular value is far larger than the magnitude you would expect from data where x is not related to y. Let’s compare all the singular values to their permutation test thresholds. The dashed line is the mean value of each singular value from the permutation tests; the shaded area represents the 98th percentile.
# transpose svmat so we get one column for every principal component
# Get the mean and empirical confidence level of every singular value
as.data.frame(t(svmat)) %>% dplyr::summarize_each(funs(mean)) %>% as.numeric() -> pmean
confF <- function(x) as.numeric(quantile(x,1-1/nvars))
as.data.frame(t(svmat)) %>% dplyr::summarize_each(funs(confF)) %>% as.numeric() -> pupper
pdata = data.frame(pc=seq_len(length(pmean)), magnitude=pmean, upper=pupper)
# we will use the first place where the singular value falls
# below its threshold as the cutoff.
# Obviously there are multiple comparison issues on such a stopping rule,
# but for this example the signal is so strong we can ignore them.
below = which(princ$sdev < pdata$upper)
lastSV = below[[1]] - 1
This test suggests that we should use 5 principal components, which is consistent with what our eye sees. This is perhaps not the "correct" knee in the graph, but it is undoubtedly a knee.
Empirically estimating the quantiles from the permuted data so that we can threshold the non-informative singular values will have some undesirable bias and variance, especially if we do not perform enough experiment replications. This suggests that instead of estimating quantiles ad-hoc, we should use a systematic method: The Bootstrap. Bootstrap replication breaks the input to output association by re-sampling with replacement rather than using permutation, but comes with built-in methods to estimate bias-adjusted confidence intervals. The methods are fairly technical, and on this dataset the results are similar, so we don’t show them here, although the code is available in the R markdown document used to produce this note.
Alternatively, we can treat the principal components that we extracted via y-aware PCA simply as transformed variables — which is what they are — and significance prune them in the standard way. As our article on significance pruning discusses, we can estimate the significance of a variable by fitting a one variable model (in this case, a linear regression) and looking at that model’s significance value. You can pick the pruning threshold by considering the rate of false positives that you are willing to tolerate; as a rule of thumb, we suggest one over the number of variables.
In regular significance pruning, you would take any variable with estimated significance value lower than the threshold. Since in the PCR situation we presume that the variables are ordered from most to least useful, you can again look for the first position i where the variable appears insignificant, and use the first i-1 variables.
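As a concrete sketch of this stopping rule (the significance values below are made up for illustration):

```r
# Walk the components in order (most to least useful) and stop at the
# first one whose estimated significance fails the pruning threshold.
sigs <- c(1e-30, 1e-12, 0.31, 0.01, 0.48)  # made-up significances, in component order
threshold <- 0.02                          # e.g. one over the number of variables
bad <- which(sigs >= threshold)
nKeep <- if (length(bad) == 0) length(sigs) else bad[[1]] - 1
nKeep  # 2: the 3rd component is the first to appear insignificant
```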
We’ll use vtreat to get the significance estimates for the principal components, and one over the number of variables (1/50 = 0.02) as the pruning threshold.
# get all the principal components
# not really a projection as we took all components!
projectedTrain <- as.data.frame(predict(princ,dTrainNTreatedYScaled),
stringsAsFactors = FALSE)
vars = colnames(projectedTrain)
projectedTrain$y = dTrainNTreatedYScaled$y
# designing the treatment plan for the transformed data
# produces a data frame of estimated significances
tplan = designTreatmentsN(projectedTrain, vars, 'y', verbose=FALSE)
threshold = 1/length(vars)
scoreFrame = tplan$scoreFrame
scoreFrame$accept = scoreFrame$sig < threshold
# pick the number of variables in the standard way:
# the number of variables that pass the significance prune
nPC = sum(scoreFrame$accept)
Significance pruning picks 2 principal components, again consistent with our visual assessment. This time, we picked the correct knee: as we saw in the previous post, the first two principal components were sufficient to describe the explainable structure of the problem.
Since one of the purposes of PCR/PCA is to discover the underlying structure in the data, it’s generally useful to examine the singular values and the variable loadings on the principal components. However, an analysis should also be repeatable, and hence automatable, and it’s not straightforward to automate something as vague as "look for a knee in the curve" when selecting the number of principal components to use. We’ve covered two ways to programmatically select the appropriate number of principal components in a predictive modeling context.
To conclude this entire series, here is our recommended best practice for principal components regression:
Thanks to Cyril Pernet, who blogs at NeuroImaging and Statistics, for requesting this follow-up post and pointing us to the Jackson reference.
Jackson, Donald A. "Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches", Ecology Vol 74, no. 8, 1993.
Kabacoff, Robert I. R In Action, 2nd edition, Manning, 2015.
Efron, Bradley and Robert J. Tibshirani. An Introduction to the Bootstrap, Chapman and Hall/CRC, 1998.
Peres-Neto, Pedro, Donald A. Jackson and Keith M. Somers. "How many principal components? Stopping rules for determining the number of non-trivial axes revisited", Computational Statistics & Data Analysis, Vol 49, no. 4, 2005.
One package of interest is ranger, a fast parallel C++ implementation of random forest machine learning. Ranger is a great package, and at first glance it appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest package. Actually this appearance is due to the strange choice of the default value respect.unordered.factors=FALSE in ranger::ranger(), which we strongly advise overriding to respect.unordered.factors=TRUE in applications.
To illustrate the issue we build a simple data set (split into training and evaluation) where the dependent (or outcome) variable y is given as the number of input level codes that end in an odd digit minus the number that end in an even digit.
Some example data is given below
print(head(dTrain))
## x1 x2 x3 x4 y
## 77 lev_008 lev_004 lev_007 lev_011 0
## 41 lev_016 lev_015 lev_019 lev_012 0
## 158 lev_007 lev_019 lev_001 lev_015 4
## 69 lev_010 lev_017 lev_018 lev_009 0
## 6 lev_003 lev_014 lev_016 lev_017 0
## 18 lev_004 lev_015 lev_014 lev_007 0
Given enough data this relation is easily learnable. In our example we have only 100 training rows and 20 possible levels for each input variable, so we at best get a noisy impression of how each independent (or input) variable affects y.
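For reference, the construction can be sketched as follows. This is a hypothetical re-creation of the generator, not the code from the shared example, which may differ in details such as the random sampling scheme:

```r
# Each x is a level code "lev_001" ... "lev_020"; y counts how many of the
# four codes end in an odd digit, minus how many end in an even digit.
mkExample <- function(n, nlevels = 20) {
  levs <- sprintf("lev_%03d", seq_len(nlevels))
  d <- data.frame(x1 = sample(levs, n, replace = TRUE),
                  x2 = sample(levs, n, replace = TRUE),
                  x3 = sample(levs, n, replace = TRUE),
                  x4 = sample(levs, n, replace = TRUE),
                  stringsAsFactors = FALSE)
  # +1 if the level code ends in an odd digit, -1 if even
  oddity <- function(code) ifelse(as.numeric(substr(code, 5, 7)) %% 2 == 1, 1, -1)
  d$y <- oddity(d$x1) + oddity(d$x2) + oddity(d$x3) + oddity(d$x4)
  d
}

dExample <- mkExample(100)
```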
What the default ranger training setting respect.unordered.factors=FALSE does is decide that string-valued variables (such as we have here) are to be treated as “ordered”. This allows ranger to skip any of the expensive re-encoding of such variables as contrasts, dummies or indicators. This is achieved in ranger by only using ordered cuts in its underlying trees, and is equivalent to re-encoding the categorical variable as its numeric order codes. These variables are thus essentially treated as numeric, and ranger appears to run faster over fairly complicated variables.
The above is good if all of your categorical variables were in fact known to have ordered relations with the outcome. We must emphasize that this is very rarely the case in practice, as one of the main reasons for using categorical variables is that we may not a priori know the relation between the variable levels and the outcome, and would like the downstream machine learning to estimate it. The default respect.unordered.factors=FALSE in fact weakens the expressiveness of the ranger model (which is why it is faster).
This is simpler to see with an example. Consider fitting a ranger model on our example data (all code/data shared including classification and use of parallel here).
If we try to build a ranger model on the data using the default settings we get the following:
# default ranger model, treat categoricals as ordered (a very limiting treatment)
m1 <- ranger(y~x1+x2+x3+x4,
data=dTrain, write.forest=TRUE)
Keep in mind the 0.24 R-squared on test.
If we set respect.unordered.factors=TRUE, ranger takes a lot longer to run (as it is doing more work in actually respecting the individual levels of our categorical variables) but gets a much better result (test R-squared 0.54).
m2 <- ranger(y~x1+x2+x3+x4,
data=dTrain, write.forest=TRUE,
respect.unordered.factors=TRUE)
The loss of modeling power seen with the default respect.unordered.factors=FALSE is similar to the undesirable loss of modeling power seen if one hash-encodes categorical levels. The default behavior of ranger is essentially equivalent to calling as.numeric(as.factor()) on the categorical columns. Everyone claims they would never do such a thing (hash or call as.numeric()), but we strongly suggest inspecting your team’s work for these bad but tempting shortcuts.
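To see concretely what that encoding does (a small illustration, not code from the post):

```r
# The "ordered" treatment replaces each level by its position in the
# alphabetically sorted level list -- an arbitrary numeric code.
codes <- c("lev_008", "lev_016", "lev_007", "lev_010")
as.numeric(as.factor(codes))  # 2 4 1 3: alphabetical positions, not meaningful values
```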
If even one of the variables had 64 or more levels, ranger would throw an exception and not complete training (as the randomForest library also does). The correct way to feed large categoricals to a random forest model remains to explicitly introduce the dummy/indicators yourself, or to re-encode them as impact/effect sub-models. Both of these are services supplied by the vtreat package, so we demonstrate the technique here.
# vtreat re-encoded model
ct <- vtreat::mkCrossFrameNExperiment(dTrain,
c('x1','x2','x3','x4'),
'y')
newvars <- ct$treatments$scoreFrame$varName[(ct$treatments$scoreFrame$code=='catN') &
(ct$treatments$scoreFrame$sig<1)]
m3 <- ranger(paste('y',paste(newvars,collapse=' + '),sep=' ~ '),
data=ct$crossFrame,
write.forest=TRUE)
dTestTreated <- vtreat::prepare(ct$treatments,dTest,
pruneSig=c(),varRestriction=newvars)
dTest$rangerNestedPred <- predict(m3,data=dTestTreated)$predictions
WVPlots::ScatterHist(dTest,'rangerNestedPred','y',
'ranger vtreat nested prediction on test',
smoothmethod='identity',annot_size=3)
The point is: a test R-squared of 0.6 or 0.54 is a lot better than an R-squared of 0.24. You do not want to settle for 0.24 if 0.6 is within easy reach. So at the very least, when using ranger set respect.unordered.factors=TRUE; for unordered factors (the most common kind) the default makes things easy for ranger at the expense of model quality.
Instructions explaining the use of vtreat can be found here. We use vtreat in the examples we show in this note, but you can easily implement the approach independently of vtreat.
As with other geometric algorithms, principal components analysis is sensitive to the units of the data. In standard ("x-only") PCA, we often attempt to alleviate this problem by rescaling the x variables to their "natural units": that is, we rescale x by its own standard deviation. By individually rescaling each x variable to its "natural unit," we hope (but cannot guarantee) that all the data as a group will be in some "natural metric space," and that the structure we hope to discover in the data will manifest itself in this coordinate system. As we saw in the previous note, if the structure that we hope to discover is the relationship between x and y, we have even less guarantee that we are in the correct space, since the decomposition of the data was done without knowledge of y.
Y-aware PCA is simply PCA with a different scaling: we rescale the x data to be in y-units. That is, we want scaled variables x’ such that a unit change in x’ corresponds to a unit change in y. Under this rescaling, all the independent variables are in the same units, which are indeed the natural units for the problem at hand: characterizing their effect on y. (We also center the transformed variables x’ to be zero mean, as is done with standard centering and scaling).
It’s easy to determine the scaling for a variable x by fitting a linear regression model between x and y:
y = m * x + b
The coefficient m is the slope of the best fit line, so a unit change in x corresponds (on average) to a change of m units in y. If we rescale (and recenter) x as
x' := m * x - mean(m * x)
then x’ is in y units. This y-aware scaling is both complementary to variable pruning and powerful enough to perform well on its own.
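As a minimal sketch of this scaling done by hand, independent of vtreat (`yAwareScale` is our own illustrative helper, not a library function):

```r
# For each input column: fit a one-variable linear regression on y,
# multiply the column by the fitted slope, and center the result.
yAwareScale <- function(X, y) {
  as.data.frame(lapply(X, function(x) {
    m <- coef(lm(y ~ x))[["x"]]  # slope: average change in y per unit change in x
    xs <- m * x
    xs - mean(xs)                # center to zero mean
  }))
}

set.seed(1)
X <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
y <- 2 * X$x1 + rnorm(100)
Xs <- yAwareScale(X, y)
# After scaling, a unit change in a scaled variable corresponds
# (on average, in-sample) to a unit change in y:
coef(lm(y ~ Xs$x1))[[2]]  # 1 (up to floating point)
```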
In vtreat, the treatment plan created by designTreatmentsN() stores the information needed for y-aware scaling, so that if you then prepare your data with the flag scale=TRUE, the resulting treated frame will be scaled appropriately.
Our current example is a regression example; for the techniques needed for a classification example, please see here.
First, let’s build our example. We will use the same data set as our earlier "X only" discussion.
In this data set, there are two (unobservable) processes: one that produces the output yA and one that produces the output yB. We only observe the mixture of the two: y = yA + yB + eps, where eps is a noise term. Think of y as measuring some notion of success and the x variables as noisy estimates of two different factors that can each drive success.
We’ll set things up so that the first five variables (x.01, x.02, x.03, x.04, x.05) have all the signal. The odd numbered variables correspond to one process (yB) and the even numbered variables correspond to the other (yA). Then, to simulate the difficulties of real world modeling, we’ll add lots of pure noise variables (noise*). The noise variables are unrelated to our y of interest — but are related to other "y-style" processes that we are not interested in. We do this because in real applications, there is no reason to believe that unhelpful variables have limited variation or are uncorrelated with each other, though things would certainly be easier if we could so assume. As we showed in the previous note, this correlation undesirably out-competed the y-induced correlation among the signaling variables when using standard PCA.
All the variables are also deliberately mis-scaled to model some of the difficulties of working with under-curated real world data.
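For orientation, a generator matching this description might look like the following. This is a hypothetical sketch only; the actual mkData from the earlier post differs in details such as the number of noise variables and the exact mis-scaling:

```r
# Sketch of a data generator: two hidden processes yA and yB, five
# mis-scaled signal variables, plus correlated pure-noise variables.
mkDataSketch <- function(n) {
  yA <- rnorm(n); yB <- rnorm(n)        # the two unobservable processes
  d <- data.frame(y = yA + yB + rnorm(n))
  # five signal variables: odd ones track yB, even ones track yA,
  # each deliberately mis-scaled by a different power of ten
  for (i in 1:5) {
    proc <- if (i %% 2 == 1) yB else yA
    d[[sprintf("x.%02d", i)]] <- (proc + 0.1 * rnorm(n)) * 10^(i - 3)
  }
  # pure noise variables, correlated with an irrelevant "y-style" process
  other <- rnorm(n)
  for (i in 1:10) {
    d[[sprintf("noise1.%02d", i)]] <- (other + rnorm(n)) * 10
  }
  d
}
```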
Let’s start with our train and test data.
# make data
set.seed(23525)
dTrain <- mkData(1000)
dTest <- mkData(1000)
Let’s look at our outcome y and a few of our variables.
summary(dTrain[, c("y", "x.01", "x.02", "noise1.01", "noise1.02")])
## y x.01 x.02
## Min. :-5.08978 Min. :-4.94531 Min. :-9.9796
## 1st Qu.:-1.01488 1st Qu.:-0.97409 1st Qu.:-1.8235
## Median : 0.08223 Median : 0.04962 Median : 0.2025
## Mean : 0.08504 Mean : 0.02968 Mean : 0.1406
## 3rd Qu.: 1.17766 3rd Qu.: 0.93307 3rd Qu.: 1.9949
## Max. : 5.84932 Max. : 4.25777 Max. :10.0261
## noise1.01 noise1.02
## Min. :-30.5661 Min. :-30.4412
## 1st Qu.: -5.6814 1st Qu.: -6.4069
## Median : 0.5278 Median : 0.3031
## Mean : 0.1754 Mean : 0.4145
## 3rd Qu.: 5.9238 3rd Qu.: 6.8142
## Max. : 26.4111 Max. : 31.8405
Next, we’ll design a treatment plan for the frame and examine the variable significances, as estimated by vtreat.
# design treatment plan
treatmentsN <- designTreatmentsN(dTrain,setdiff(colnames(dTrain),'y'),'y',
verbose=FALSE)
scoreFrame = treatmentsN$scoreFrame
scoreFrame$vartype = ifelse(grepl("noise", scoreFrame$varName), "noise", "signal")
dotplot_identity(scoreFrame, "varName", "sig", "vartype") +
coord_flip() + ggtitle("vtreat variable significance estimates")+
scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))
Note that the noise variables typically have large significance values, denoting statistical insignificance. Usually we recommend doing some significance pruning on variables before moving on — see here for possible consequences of not pruning an over-abundance of variables, and here for a discussion of one way to prune, based on significance. For this example, however, we will attempt dimensionality reduction without pruning.
Now let’s prepare the treated frame, with scaling turned on. We will deliberately turn off variable pruning by setting pruneSig = 1. In real applications, you would want to set pruneSig to a value less than one to prune insignificant variables. However, here we turn off variable pruning to show that you can recover some of pruning’s benefits via scaling effects, because the scaled noise variables should not have a major effect in the principal components analysis. Pruning by significance is in fact a good additional precaution, complementary to scaling by effects.
# prepare the treated frames, with y-aware scaling
examplePruneSig = 1.0
dTrainNTreatedYScaled <- prepare(treatmentsN,dTrain,pruneSig=examplePruneSig,scale=TRUE)
dTestNTreatedYScaled <- prepare(treatmentsN,dTest,pruneSig=examplePruneSig,scale=TRUE)
# get the variable ranges
ranges = vapply(dTrainNTreatedYScaled, FUN=function(col) c(min(col), max(col)), numeric(2))
rownames(ranges) = c("vmin", "vmax")
rframe = as.data.frame(t(ranges)) # make ymin/ymax the columns
rframe$varName = rownames(rframe)
varnames = setdiff(rownames(rframe), "y")
rframe = rframe[varnames,]
rframe$vartype = ifelse(grepl("noise", rframe$varName), "noise", "signal")
# show a few columns
summary(dTrainNTreatedYScaled[, c("y", "x.01_clean", "x.02_clean", "noise1.02_clean", "noise1.02_clean")])
## y x.01_clean x.02_clean
## Min. :-5.08978 Min. :-2.65396 Min. :-2.51975
## 1st Qu.:-1.01488 1st Qu.:-0.53547 1st Qu.:-0.48904
## Median : 0.08223 Median : 0.01063 Median : 0.01539
## Mean : 0.08504 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.17766 3rd Qu.: 0.48192 3rd Qu.: 0.46167
## Max. : 5.84932 Max. : 2.25552 Max. : 2.46128
## noise1.02_clean noise1.02_clean.1
## Min. :-0.0917910 Min. :-0.0917910
## 1st Qu.:-0.0186927 1st Qu.:-0.0186927
## Median : 0.0003253 Median : 0.0003253
## Mean : 0.0000000 Mean : 0.0000000
## 3rd Qu.: 0.0199244 3rd Qu.: 0.0199244
## Max. : 0.0901253 Max. : 0.0901253
barbell_plot(rframe, "varName", "vmin", "vmax", "vartype") +
coord_flip() + ggtitle("y-scaled variables: ranges") +
scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))
Notice that after the y-aware rescaling, the signal carrying variables have larger ranges than the noise variables.
Now we do the principal components analysis. In this case it is critical that the scale. parameter in prcomp is set to FALSE so that it does not undo our own scaling. Notice that the magnitudes of the singular values fall off quickly after the first two to five values.
vars <- setdiff(colnames(dTrainNTreatedYScaled),'y')
# prcomp defaults to scale. = FALSE, but we already scaled/centered in vtreat- which we don't want to lose.
dmTrain <- as.matrix(dTrainNTreatedYScaled[,vars])
dmTest <- as.matrix(dTestNTreatedYScaled[,vars])
princ <- prcomp(dmTrain, center = FALSE, scale. = FALSE)
dotplot_identity(frame = data.frame(pc=1:length(princ$sdev),
magnitude=princ$sdev),
xvar="pc",yvar="magnitude") +
ggtitle("Y-Scaled variables: Magnitudes of singular values")
When we look at the variable loadings of the first five principal components, we see that we recover the even/odd loadings of the original signal variables. PC1 has the odd variables, and PC2 has the even variables. These two principal components carry most of the signal. The next three principal components complete the basis for the five original signal variables. The noise variables have very small loadings compared to the signal variables.
proj <- extractProjection(2,princ)
rot5 <- extractProjection(5,princ)
rotf = as.data.frame(rot5)
rotf$varName = rownames(rotf)
rotflong = gather(rotf, "PC", "loading", starts_with("PC"))
rotflong$vartype = ifelse(grepl("noise", rotflong$varName), "noise", "signal")
dotplot_identity(rotflong, "varName", "loading", "vartype") +
facet_wrap(~PC,nrow=1) + coord_flip() +
ggtitle("Y-Scaled Variable loadings, first five principal components") +
scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))
Let’s look at the projection of the data onto its first two principal components, using color to code the y value. Notice that y increases both as we move up and as we move right. We have recovered two features that correlate with an increase in y. In fact, PC1 corresponds to the odd signal variables, which correspond to process yB, and PC2 corresponds to the even signal variables, which correspond to process yA.
# apply projection
projectedTrain <- as.data.frame(dmTrain %*% proj,
stringsAsFactors = FALSE)
# plot data sorted by principal components
projectedTrain$y <- dTrainNTreatedYScaled$y
ScatterHistN(projectedTrain,'PC1','PC2','y',
"Y-Scaled Training Data projected to first two principal components")
Now let’s fit a linear regression model to the first two principal components.
model <- lm(y~PC1+PC2,data=projectedTrain)
summary(model)
##
## Call:
## lm(formula = y ~ PC1 + PC2, data = projectedTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3470 -0.7919 0.0172 0.7955 3.9588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.08504 0.03912 2.174 0.03 *
## PC1 0.78611 0.04092 19.212 <2e-16 ***
## PC2 1.03243 0.04469 23.101 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 997 degrees of freedom
## Multiple R-squared: 0.4752, Adjusted R-squared: 0.4742
## F-statistic: 451.4 on 2 and 997 DF, p-value: < 2.2e-16
projectedTrain$estimate <- predict(model,newdata=projectedTrain)
trainrsq = rsq(projectedTrain$estimate,projectedTrain$y)
ScatterHist(projectedTrain,'estimate','y','Recovered model versus truth (y aware PCA train)',
smoothmethod='identity',annot_size=3)
This model, with only two variables, explains 47.52% of the variation in y. This is comparable to the variance explained by the model fit to twenty principal components using x-only PCA (as well as a model fit to all the original variables) in the previous note.
Let’s see how the model does on hold-out data.
# apply projection
projectedTest <- as.data.frame(dmTest %*% proj,
stringsAsFactors = FALSE)
# plot data sorted by principal components
projectedTest$y <- dTestNTreatedYScaled$y
ScatterHistN(projectedTest,'PC1','PC2','y',
"Y-Scaled Test Data projected to first two principal components")
projectedTest$estimate <- predict(model,newdata=projectedTest)
testrsq = rsq(projectedTest$estimate,projectedTest$y)
testrsq
## [1] 0.5063724
ScatterHist(projectedTest,'estimate','y','Recovered model versus truth (y aware PCA test)',
smoothmethod='identity',annot_size=3)
We see that this two-variable model captures about 50.64% of the variance in y on hold-out — again, comparable to the hold-out performance of the model fit to twenty principal components using x-only PCA. These two principal components also do a much better job of capturing the internal structure of the data — that is, the relationship of the signaling variables to the yA and yB processes — than the first two principal components of the x-only PCA.
What about caret::preProcess? In this note, we used vtreat, a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner, followed by principal components regression. One could instead use caret. The caret package, as described in the documentation, "is a set of functions that attempt to streamline the process for creating predictive models."
caret::preProcess is designed to implement a number of sophisticated x-alone transformations, groupings, prunings, and repairs (see caret/preprocess.html#all, which demonstrates "the function on all the columns except the last, which is the outcome" on the schedulingData dataset). So caret::preProcess is a super-version of the PCA step. We could use it either alone or before the vtreat design/prepare steps as an initial pre-processor. Using it alone is similar to PCA for this data set, as our example doesn’t have some of the additional problems caret::preProcess is designed to help with.
library('caret')
origVars <- setdiff(colnames(dTrain),'y')
# can try variations such adding/removing non-linear steps such as "YeoJohnson"
prep <- preProcess(dTrain[,origVars],
method = c("center", "scale", "pca"))
prepared <- predict(prep,newdata=dTrain[,origVars])
newVars <- colnames(prepared)
prepared$y <- dTrain$y
print(length(newVars))
## [1] 44
modelB <- lm(paste('y',paste(newVars,collapse=' + '),sep=' ~ '),data=prepared)
print(summary(modelB)$r.squared)
## [1] 0.5004569
print(summary(modelB)$adj.r.squared)
## [1] 0.4774413
preparedTest <- predict(prep,newdata=dTest[,origVars])
testRsqC <- rsq(predict(modelB,newdata=preparedTest),dTest$y)
testRsqC
## [1] 0.4824284
The 44 caret-chosen PCA variables are designed to capture 95% of the in-sample explainable variation of the variables. The linear regression model fit to the selected variables explains about 50.05% of the y variance on training and 48.24% of the y variance on test. This is quite good, comparable to our previous results. However, note that caret picked more than the twenty principal components that we picked visually in the previous note, and needed far more variables than we needed with y-aware PCA.
Because caret::preProcess is x-only processing, the first few variables capture much less of the y variation, so we can’t model y without using a lot of the derived variables. To show this, let’s try fitting a model using only five of caret’s PCA variables.
model5 <- lm(paste('y',paste(newVars[1:5],collapse=' + '),sep=' ~ '),data=prepared)
print(summary(model5)$r.squared)
## [1] 0.1352
print(summary(model5)$adj.r.squared)
## [1] 0.1308499
The first 5 variables only capture about 13.52% of the in-sample variance; without being informed about y, we can’t know which variation to preserve and which we can ignore. We certainly haven’t captured the two subprocesses that drive y in an inspectable manner.
If your goal is regression, there are other workable y-aware dimension reducing procedures, such as L2-regularized regression or partial least squares. Both methods are also related to principal components analysis (see Hastie, et al. 2009).
Bair, et al. proposed a variant of principal components regression that they call Supervised PCR. In supervised PCR, as described in their 2006 paper, a univariate linear regression model is fit to each variable (after scaling and centering), and any variable whose coefficient (what we called m above) has a magnitude less than some threshold \(\theta\) is pruned. PCR is then done on the remaining variables. Conceptually, this is similar to the significance pruning that vtreat offers, except that the pruning criterion is "effects-based" (that is, based on the magnitude of a parameter, or the strength of an effect) rather than probability-based, such as pruning on significance.
One issue with an effects-based pruning criterion is that the appropriate pruning threshold varies from problem to problem, and not necessarily in an obvious way. Bair, et al. find an appropriate threshold via cross-validation. Probability-based thresholds are in some sense more generalizable from problem to problem, since the score is always in probability units, the same units for all problems. A simple variation of supervised PCR might prune on the significance of the coefficient m, as determined by its t-statistic. This would be essentially equivalent to significance pruning of the variables via vtreat before standard PCR.
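The contrast between the two pruning criteria can be seen on a single variable (the data and thresholds below are illustrative only):

```r
# Fit a one-variable model and read off both possible pruning statistics.
set.seed(2)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)
fit <- summary(lm(y ~ x))
m   <- fit$coefficients["x", "Estimate"]  # effect size (the slope)
sig <- fit$coefficients["x", "Pr(>|t|)"]  # coefficient significance

keepByEffect <- abs(m) > 0.25  # effects-based: threshold is in problem-specific units
keepBySig    <- sig  < 0.05    # probability-based: threshold is portable across problems
```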
Note that vtreat uses the significance of the one-variable model fits, not the coefficient significance, to estimate variable significance. When both the dependent and independent variables are numeric, the model significance and the coefficient significance are identical (see Weisberg, Applied Linear Regression). In more general modeling situations, where either the outcome is categorical or the original input variable is categorical with many degrees of freedom, they are not the same, and, in our opinion, using the model significance is preferable.
In general modeling situations, where you are not specifically interested in the structure of the feature space as described by the principal components, we recommend significance pruning of the variables. As a rule of thumb, we suggest setting your significance pruning threshold based on the rate at which you can tolerate bad variables slipping into the model. For example, setting the pruning threshold at \(p=0.05\) would let pure noise variables in at a rate of about 1 in 20 in expectation. So a good upper bound on the pruning threshold might be 1/nvar, where nvar is the number of variables. We discuss this issue briefly here in the vtreat documentation.
vtreat does not supply any joint-variable dimension reduction, as we feel dimension reduction is a modeling task. vtreat is intended to limit itself to only the necessary "prior to modeling" processing, and includes significance pruning because such pruning can be necessary prior to modeling.
In our experience, there are two camps of analysts: those who never use principal components regression and those who use it far too often. While principal components analysis is a useful data conditioning method, it is sensitive to distances and geometry. Therefore it is only to be trusted when the variables are curated, pruned, and in appropriate units. Principal components regression should not be used blindly; it requires proper domain aware scaling, initial variable pruning, and posterior component pruning. If the goal is regression many of the purported benefits of principal components regression can be achieved through regularization.
The general principles are widely applicable, and are often re-discovered and re-formulated in useful ways (such as autoencoders).
In our next note, we will look at some ways to pick the appropriate number of principal components procedurally.
Bair, Eric, Trevor Hastie, Debashis Paul and Robert Tibshirani, "Prediction by Supervised Principal Components", Journal of the American Statistical Association, Vol. 101, No. 473 (March 2006), pp. 119-137.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, 2nd Edition, 2009.
Weisberg, Sanford, Applied Linear Regression, Third Edition, Wiley, 2005.
devtools to install WVPlots (announced here and used to produce some of the graphs shown here). I thought I would write a note with a few instructions to help.
These are things you should not have to do often, and things those of us already running R have stumbled through and forgotten about. These are also the kind of finicky, system-dependent, non-repeatable interactive GUI steps you largely avoid once you have a scriptable system like R fully up and running.
First you will need install (likely admin) privileges on your machine and a network connection that is not blocking any of CRAN, RStudio, or GitHub.
Make sure you have up-to-date copies of both R and RStudio. We have to assume you are somewhat familiar with R and RStudio (we suggest working through a tutorial if you are not).
Once you have these we will try to "knit" or render an R markdown document. To do this start RStudio and select File->"New File"->"R Markdown" as shown below (menus may differ between systems, so you may have to look around a bit).
Then click “OK”. Then press the “Knit HTML” button as shown in the next figure.
This will ask you to pick a filename to save as (anything ending in “.Rmd” will do). If RStudio asks to install anything let it. In the end you should get a rendered copy of RStudio’s example document. If any of this doesn’t work you can look to RStudio documentation.
Assuming the above worked paste the following commands into RStudio’s “Console” window (entering a “return” after the paste to ensure execution).
[Note any time we say paste or type, watch out for any errors caused by conversion of normal machine quotes to insidious smart quotes.]
install.packages(c('RCurl','ggplot2','tidyr',
'devtools','knitr'))
The set of packages you actually need can usually be found by looking at the R code you wish to run and looking for any library() or :: commands. R scripts and worksheets tend not to install packages on their own, as that would be a bit invasive.
If the above commands execute without error (messages and warnings are okay) you can then try the command below to install WVPlots:
devtools::install_github('WinVector/WVPlots',
build_vignettes=TRUE)
If the above fails (some Windows users are seeing "curl" errors) it can be a problem with your machine (perhaps permissions, or no curl library installed), network, anti-virus, or firewall software. If it does fail you can try to install WVPlots yourself by doing the following:
1. Download WVPlots_0.1.tar.gz.
2. Run install.packages(c('ROCR', 'ggplot2', 'gridExtra', 'mgcv', 'plyr', 'reshape2', 'stringr', 'knitr', 'testthat')) (we are installing the dependencies of WVPlots by hand; the dependencies are found by looking at the WVPlots DESCRIPTION file, excluding grid as it is part of the base system and doesn't need to be installed).
3. Run install.packages('~/Downloads/WVPlots_0.1.tar.gz', repos=NULL) (replacing '~/Downloads/WVPlots_0.1.tar.gz' with wherever you downloaded WVPlots_0.1.tar.gz to).
4. If the above worked you can test the WVPlots package by typing library("WVPlots").
Now you can try knitting one of our example worksheets.
1. Download XonlyPCA.Rmd by right-clicking on the "Raw" button (towards the top right).
2. If it saved as XonlyPCA.Rmd.txt, rename XonlyPCA.Rmd.txt to XonlyPCA.Rmd.
3. Use File->"Open File" to open XonlyPCA.Rmd.

If this isn't working then something is either not installed or not configured correctly, or something is blocking access (such as anti-virus software or firewall software). The best thing to do is find another local R user and debug together.