Nina Zumel introduced y-aware scaling in her recent article Principal Components Regression, Pt. 2: Y-Aware Methods. I really encourage you to read the article and add the technique to your repertoire. The method combines well with other methods and can drive better predictive modeling results.
From feedback I am not sure everybody noticed that in addition to being easy and effective, the method is actually novel (we haven’t yet found an academic reference to it or seen it already in use after visiting numerous clients). Likely it has been applied before (as it is a simple method), but it is not currently considered a standard method (something we would like to change).
In this note I’ll discuss some of the context of y-aware scaling.
Continue reading y-aware scaling in context
I want to recommend an excellent article on the recent claimed use of differential privacy to actually preserve user privacy: “A Few Thoughts on Cryptographic Engineering” by Matthew Green.
After reading the article we have a few follow-up thoughts on the topic.
Continue reading Another note on differential privacy
Win-Vector LLC’s Dr. Nina Zumel has a three-part series on Principal Components Regression that we think is well worth your time.
- Part 1: the proper preparation of data (including scaling) and use of principal components analysis (particularly for supervised learning or regression).
- Part 2: the introduction of y-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.
- Part 3: how to pick the number of components to retain for analysis.
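To make the Part 2 idea concrete, here is a minimal sketch of y-aware scaling for a single numeric column, as we understand the technique: fit a one-variable linear regression of the outcome on the column, then multiply the centered column by the fitted slope, so that a unit change in the rescaled column corresponds to a unit change in the outcome. The function name `y_aware_scale` and the toy data are illustrative assumptions, not code from the series.

```python
from statistics import mean


def y_aware_scale(x, y):
    """Rescale column x by the slope of the one-variable regression y ~ x.

    Hypothetical helper for illustration: returns slope * (x - mean(x)).
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        # A constant column carries no information about y.
        return [0.0 for _ in x]
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx  # least-squares slope of y on x
    return [slope * (xi - x_bar) for xi in x]


# Toy data (assumed): one column tracks y, the other is large but unrelated.
y = [1.0, 2.0, 3.0, 4.0]
strong = [10.0, 20.0, 30.0, 40.0]
noise = [500.0, -500.0, -500.0, 500.0]

scaled_strong = y_aware_scale(strong, y)
scaled_noise = y_aware_scale(noise, y)
```

In this toy case the unrelated column, despite its large raw variance, is shrunk to zero, while the related column ends up in the units of y. A principal components analysis run on the rescaled columns would therefore spend its leading components on variation that moves with the outcome.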
Continue reading Why you should read Nina Zumel’s 3 part series on principal components analysis and regression
We are pleased to announce a new free e-book from Manning Publications: Exploring Data Science. Exploring Data Science is a collection of five chapters hand-picked by John Mount and Nina Zumel, introducing you to various areas in data science and explaining which methodologies work best for each.
Continue reading Free e-book: Exploring Data Science
geom_step is an interesting geom supplied by the R package ggplot2. It is an appropriate rendering option for financial market data and we will show how and why to use it in this article.
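In R, ggplot2 draws the step path for you. The sketch below, in plain Python, only illustrates the geometry of a horizontal-then-vertical step: each value is held flat until the next observation arrives and then jumps. Our assumption about why this suits market data is that a quoted price stays in force until the next trade, so drawing a sloped line between trades would suggest prices that were never observed. The helper name `step_vertices` and the sample data are hypothetical.

```python
def step_vertices(xs, ys):
    """Return the vertices of a horizontal-then-vertical step path.

    Hypothetical helper for illustration; xs are assumed to be sorted.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    path = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i > 0:
            # Hold the previous value until the new observation time.
            path.append((x, ys[i - 1]))
        path.append((x, y))
    return path


# Assumed sample: trade times and the price at each trade.
times = [0, 1, 4]
prices = [100.0, 101.5, 99.0]
path = step_vertices(times, prices)
```

Joining the returned vertices with straight segments yields only horizontal and vertical lines, which is the visual difference from an ordinary line plot.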
Continue reading Using geom_step
This article is a demonstration of the use of the R vtreat variable preparation package followed by caret-controlled training.
In previous writings we have gone to great lengths to document, explain and motivate vtreat. That necessarily gets long and unnecessarily feels complicated.
In this example we are going to show what building a predictive model using vtreat best practices looks like, assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, but not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but we want to show what small effort is required to add vtreat to your predictive modeling practice.
Continue reading A demonstration of vtreat data preparation