Another note on differential privacy

Categories: Opinion

I want to recommend an excellent article on the recent claimed use of differential privacy to actually preserve user privacy: “A Few Thoughts on Cryptographic Engineering” by Matthew Green.

After reading the article we have a few follow-up thoughts on the topic.

Why you should read Nina Zumel’s 3 part series on principal components analysis and regression

Categories: Administrativia, Exciting Techniques, Expository Writing, Statistics, Tutorials

Short form:

Win-Vector LLC’s Dr. Nina Zumel has a three part series on Principal Components Regression that we think is well worth your time.

  • Part 1: the proper preparation of data (including scaling) and use of principal components analysis (particularly for supervised learning or regression).
  • Part 2: the introduction of y-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.
  • Part 3: how to pick the number of components to retain for analysis.

Free e-book: Exploring Data Science

Categories: Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

We are pleased to announce a new free e-book from Manning Publications: Exploring Data Science. Exploring Data Science is a collection of five chapters hand picked by John Mount and Nina Zumel, introducing you to various areas in data science and explaining which methodologies work best for each.

Using geom_step

Categories: Tutorials

geom_step is an interesting geom supplied by the R package ggplot2. It is an appropriate rendering option for financial market data and we will show how and why to use it in this article.
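
As a minimal sketch of the idea (the `quotes` data frame below is made up for illustration and is not from the article), compare a step rendering with an ordinary line rendering. A line plot draws interpolated prices between observations, which were never quoted; a step plot carries the last observed price forward until the next observation arrives.

```r
library(ggplot2)

# Hypothetical quotes: each price is observed at an irregular time and is
# assumed to stay in force until the next observation.
quotes <- data.frame(
  time  = c(0, 1, 4, 6, 9),
  price = c(100, 102, 101, 105, 103)
)

# geom_line connects observations with sloped segments, which implies
# intermediate prices that were never actually observed.
p_line <- ggplot(quotes, aes(x = time, y = price)) +
  geom_line() +
  geom_point() +
  ggtitle("geom_line: interpolates between quotes")

# geom_step with direction = "hv" (the default) moves horizontally first,
# then vertically, so each price is held until the next quote.
p_step <- ggplot(quotes, aes(x = time, y = price)) +
  geom_step(direction = "hv") +
  geom_point() +
  ggtitle("geom_step: last quote holds until the next one")

print(p_step)
```

Printing `p_line` next to `p_step` makes the difference easy to see.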

A demonstration of vtreat data preparation

Categories: Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

This article is a demonstration of the use of the R vtreat variable preparation package followed by caret controlled training.

In previous writings we have gone to great lengths to document, explain and motivate vtreat. That necessarily gets long and unnecessarily feels complicated.

In this example we show what building a predictive model using vtreat best practices looks like, assuming you are already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but we want to show how little effort is required to add vtreat to your predictive modeling practice.
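
To give a flavor of the shape of such a workflow, here is a rough sketch on made-up data. It is not the article's code: the data frame `d`, its column names, and the train/test split are all invented for illustration, and a plain `glm()` stands in for the caret controlled training step to keep the example small.

```r
library(vtreat)

set.seed(2016)
n <- 300
# Hypothetical messy data: a numeric column with missing values and a
# categorical column with missing values.
d <- data.frame(
  x_num = rnorm(n),
  x_cat = sample(c("a", "b", "c", NA), n, replace = TRUE),
  stringsAsFactors = FALSE
)
d$x_num[sample(n, 20)] <- NA
signal <- ifelse(is.na(d$x_num), 0, d$x_num) + (d$x_cat %in% "a")
d$y <- (signal + rnorm(n)) > 0.5

is_train <- seq_len(n) <= 200
d_train <- d[is_train, ]
d_test  <- d[!is_train, ]

# Design the treatment plan and get a cross-validated training frame
# in one step (binary classification, target value TRUE).
cfe <- vtreat::mkCrossFrameCExperiment(d_train, c("x_num", "x_cat"), "y", TRUE)
plan          <- cfe$treatments
train_treated <- cfe$crossFrame
model_vars    <- plan$scoreFrame$varName

# Apply the same plan to the held-out data.
test_treated <- vtreat::prepare(plan, d_test, pruneSig = NULL)

# Stand-in for the modeling step: a logistic regression on the treated
# variables. Some indicator columns may be collinear, which glm tolerates
# (possibly with warnings).
f <- as.formula(paste("y ~", paste(model_vars, collapse = " + ")))
model <- glm(f, data = train_treated, family = binomial())
pred  <- predict(model, newdata = test_treated, type = "response")
```

The point of the sketch is only that the treated frames are all-numeric and free of missing values, so they can be handed to essentially any modeling function.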

Principal Components Regression, Pt. 3: Picking the Number of Components

Categories: Mathematics, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in a more automated fashion.
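
One generic way to automate the choice, sketched here in base R, is to pick the number of components that minimizes cross-validated prediction error. This is a common stand-in and not necessarily the procedure the full note uses; the synthetic data, the five-fold split, and every name below are assumptions made for illustration.

```r
set.seed(42)
n <- 200
# Two hidden factors drive both the predictors and the outcome.
l1 <- rnorm(n, sd = 2)
l2 <- rnorm(n, sd = 1)
X <- cbind(
  matrix(l1, n, 5) + matrix(rnorm(n * 5), n, 5),  # 5 noisy copies of l1
  matrix(l2, n, 5) + matrix(rnorm(n * 5), n, 5)   # 5 noisy copies of l2
)
colnames(X) <- paste0("x", seq_len(ncol(X)))
y <- l1 + 2 * l2 + rnorm(n)

n_folds <- 5
folds <- sample(rep(seq_len(n_folds), length.out = n))

# Cross-validated RMSE of principal components regression on the first k
# components; the PCA is refit inside each fold to avoid leakage.
cv_rmse <- sapply(seq_len(ncol(X)), function(k) {
  sq_err <- numeric(n)
  for (f in seq_len(n_folds)) {
    tr <- folds != f
    pca <- prcomp(X[tr, ], center = TRUE, scale. = TRUE)
    z_tr <- pca$x[, seq_len(k), drop = FALSE]
    z_te <- predict(pca, X[!tr, ])[, seq_len(k), drop = FALSE]
    fit <- lm(y ~ ., data = data.frame(y = y[tr], z_tr))
    pred <- predict(fit, newdata = data.frame(z_te))
    sq_err[!tr] <- (y[!tr] - pred)^2
  }
  sqrt(mean(sq_err))
})

best_k <- which.min(cv_rmse)
print(round(cv_rmse, 3))
print(best_k)
```

In practice one might prefer the smallest k whose error is close to the minimum, rather than the strict minimizer, to favor simpler models.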

On ranger respect.unordered.factors

Categories: Expository Writing, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

It is often said that “R is its packages.”

One package of interest is ranger, a fast parallel C++ implementation of random forest machine learning. Ranger is a great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest package. Actually this appearance is due to the strange choice of default value respect.unordered.factors=FALSE in ranger::ranger(), which we strongly advise overriding to respect.unordered.factors=TRUE in applications.
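
A small sketch of how the override looks in a call follows. The data is synthetic and the level names, sample size, and tree count are arbitrary choices for illustration. The exact handling that `TRUE` selects has differed across ranger versions, so it is worth checking `?ranger` for your installed version.

```r
library(ranger)

set.seed(2016)
n <- 500
# A categorical predictor whose effect on y has nothing to do with the
# alphabetical order of its levels.
levs <- sprintf("lev%02d", 1:20)
effect <- setNames(rnorm(length(levs)), levs)
d <- data.frame(x = factor(sample(levs, n, replace = TRUE), levels = levs))
d$y <- unname(effect[as.character(d$x)]) + rnorm(n, sd = 0.1)

# Factor levels treated as if they were ordered numeric codes.
m_ignore <- ranger(y ~ x, data = d, num.trees = 100,
                   respect.unordered.factors = FALSE, seed = 1)

# Factor treated as a genuinely unordered categorical variable.
m_respect <- ranger(y ~ x, data = d, num.trees = 100,
                    respect.unordered.factors = TRUE, seed = 1)

# Out-of-bag mean squared error for each fit.
print(c(ignore = m_ignore$prediction.error,
        respect = m_respect$prediction.error))
```

We make no claim here about which fit wins on this toy data; the sketch only shows where the argument goes and how to compare the two settings on your own data.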

Principal Components Regression, Pt. 2: Y-Aware Methods

Categories: data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components Analysis, or Y-Aware PCA. We will use our variable treatment package vtreat in the examples we show in this note, but you can easily implement the approach independently of vtreat.
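
As a rough illustration of implementing the idea without vtreat, the base R sketch below rescales each input by the slope of its single-variable regression against y, so that a unit change in the rescaled input corresponds to roughly a unit change in the outcome, and then runs an ordinary PCA on the rescaled data. The helper `y_aware_scale` and the synthetic data are assumptions made for this example.

```r
set.seed(7)
n <- 500
# x_signal is small in magnitude but drives y;
# x_noise is large in magnitude but unrelated to y.
x_signal <- rnorm(n, sd = 1)
x_noise  <- rnorm(n, sd = 100)
y <- 3 * x_signal + rnorm(n)
X <- data.frame(x_signal = x_signal, x_noise = x_noise)

# Hypothetical helper: center x and multiply by the slope of lm(y ~ x).
y_aware_scale <- function(x, y) {
  slope <- unname(coef(lm(y ~ x))[2])
  slope * (x - mean(x))
}

X_scaled <- as.data.frame(lapply(X, y_aware_scale, y = y))

# The data is already centered and deliberately scaled, so PCA must not
# re-center or re-scale it.
pca_y_aware <- prcomp(X_scaled, center = FALSE, scale. = FALSE)

# For contrast: PCA on the raw data is dominated by the large noise column.
pca_raw <- prcomp(X, center = TRUE, scale. = FALSE)

print(round(pca_y_aware$rotation[, 1], 3))
print(round(pca_raw$rotation[, 1], 3))
```

With this scaling the first principal component loads almost entirely on `x_signal`, whereas on the raw data it loads almost entirely on `x_noise`.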

Installing WVPlots and “knitting R markdown”

Categories: Administrativia, Tutorials

Some readers have been having a bit of trouble using devtools to install WVPlots (announced here and used to produce some of the graphs shown here). I thought I would write a note with a few instructions to help.

These are things you should not have to do often, and things those of us already running R have stumbled through and forgotten about. These are also the kind of finicky, system-dependent, non-repeatable interactive GUI steps you largely avoid once you have a fully scriptable system like R up and running.

Principal Components Regression, Pt.1: The Standard Method

Categories: data science, Expository Writing, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

In this note, we discuss principal components regression and some of the issues with it:

  • The need for scaling.
  • The need for pruning.
  • The lack of “y-awareness” of the standard dimensionality reduction step.
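
The first two points can be seen in a few lines of base R. This is a sketch on made-up data (the variable names and units are invented): without scaling, the variable with the largest units swallows nearly all of the variance, and the regression step is then run on only the leading components.

```r
set.seed(11)
n <- 300
# Two independent predictors on very different scales (hypothetical units).
height_m   <- rnorm(n, mean = 1.7, sd = 0.1)
income_usd <- rnorm(n, mean = 50000, sd = 15000)
y <- 10 * height_m + income_usd / 10000 + rnorm(n)
X <- cbind(height_m = height_m, income_usd = income_usd)

pca_raw    <- prcomp(X, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of total variance attributed to the first principal component.
var_share_pc1 <- function(pca) pca$sdev[1]^2 / sum(pca$sdev^2)

print(var_share_pc1(pca_raw))     # close to 1: income_usd dominates
print(var_share_pc1(pca_scaled))  # much lower once both are on unit scale

# Principal components regression: fit y on the retained components only.
# Here we keep just PC1 of the scaled PCA (the "pruning" step).
pcr_data <- data.frame(y = y, pca_scaled$x)
pcr_fit  <- lm(y ~ PC1, data = pcr_data)
print(summary(pcr_fit)$r.squared)
```

Note that neither PCA call ever looks at `y`, which is the third point in the list above.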
