In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in a more automated fashion.
It is often said that “R is its packages.”
One package of interest is ranger, a fast parallel C++ implementation of random forest machine learning. Ranger is a great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest package. Actually, this appearance is due to the strange choice of default value for respect.unordered.factors in ranger::ranger(), which we strongly advise overriding to respect.unordered.factors=TRUE in applications. Continue reading On ranger respect.unordered.factors
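As a minimal sketch of the override advised above (assuming the ranger package is installed; the data frame, level names, and effect sizes are made up for illustration and are not from the original note), the call might look like this:

```r
# Sketch only: requires the ranger package (install.packages("ranger")).
library(ranger)

set.seed(2016)
n <- 2000
level_names <- paste0("lev", 1:100)

# Hypothetical data: a single categorical predictor with many levels.
d <- data.frame(x = factor(sample(level_names, n, replace = TRUE)))

# The outcome depends on the level itself, not on any ordering of level names.
level_effect <- setNames(rnorm(length(level_names)), level_names)
d$y <- as.numeric(level_effect[as.character(d$x)]) + rnorm(n, sd = 0.1)

model <- ranger(
  y ~ x,
  data = d,
  num.trees = 100,
  respect.unordered.factors = TRUE  # set explicitly rather than relying on the default
)

# In-sample predictions, just to confirm the model fits and predicts.
pred <- predict(model, data = d)$predictions
```

Setting the argument explicitly also keeps the script's behavior independent of whatever default the installed ranger version happens to use.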
In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components Analysis, or Y-Aware PCA. We will use our variable treatment package vtreat in the examples we show in this note, but you can easily implement the approach independently of
Some readers have been having a bit of trouble using devtools to install WVPlots (announced here and used to produce some of the graphs shown here). I thought I would write a note with a few instructions to help.
These are things you should not have to do often, and things those of us already running R have stumbled through and forgotten about. These are also the kind of finicky, system-dependent, non-repeatable interactive GUI steps you largely avoid once you have a scriptable system like R fully up and running. Continue reading Installing WVPlots and “knitting R markdown”
In this note, we discuss principal components regression and some of the issues with it:
- The need for scaling.
- The need for pruning.
- The lack of “y-awareness” of the standard dimensionality reduction step.
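The issues listed above can be illustrated with a small base-R sketch. The synthetic data and the helper fit_pcr() are hypothetical, chosen only to make the effects visible; this is ordinary x-only PCR, not the y-aware method.

```r
set.seed(2016)
n <- 500

# Hypothetical data: x1 carries the signal, x2 is pure noise on a much larger scale.
d <- data.frame(
  x1 = rnorm(n),
  x2 = 100 * rnorm(n)
)
d$y <- 2 * d$x1 + rnorm(n, sd = 0.5)
xvars <- c("x1", "x2")

# Without scaling, the first component is dominated by the large-variance x2.
pca_raw <- prcomp(d[, xvars], center = TRUE, scale. = FALSE)
# With scaling, each variable is standardized to unit variance before rotation.
pca_scaled <- prcomp(d[, xvars], center = TRUE, scale. = TRUE)

# Principal components regression: regress y on the first k components only
# (keeping just k components is the pruning step).
fit_pcr <- function(pca, y, k) {
  proj <- as.data.frame(pca$x[, seq_len(k), drop = FALSE])
  proj$y <- y
  lm(y ~ ., data = proj)
}

k <- 1
r2_raw <- summary(fit_pcr(pca_raw, d$y, k))$r.squared
r2_scaled <- summary(fit_pcr(pca_scaled, d$y, k))$r.squared
```

Comparing r2_raw with r2_scaled shows how much the fit depends on scaling. Even the scaled version chooses its components without looking at y, so the leading component still mixes the noise variable with the signal.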
Our publisher Manning Publications is celebrating the release of a new data science in Python title Introducing Data Science by offering it and other Manning titles at half off until Wednesday, May 18.
As part of the promotion you can also use the supplied discount code mlcielenlt for half off some R titles including R in Action, Second Edition and our own Practical Data Science with R. Combine these with our half off code (C3) for our R video course Introduction to Data Science and you can get a lot of top quality data science material at a deep discount.
Just a “heads-up.”
I’ve been editing a three-part series Nina Zumel is writing on some of the pitfalls of improperly applied principal components analysis/regression and how to avoid them (we are using the plural spelling, following Everitt’s The Cambridge Dictionary of Statistics). The series is looking absolutely fantastic and I think it will really help people understand, properly use, and even teach the concepts.
Frankly the material would have worked great as an additional chapter for Practical Data Science with R (but instead everybody is going to get it for free).
Please watch here for the series.
The complete series is now up: