FeaturedPosted on Categories Opinion, Practical Data Science, StatisticsTags , 2 Comments on Practical Data Science with R2

Practical Data Science with R2

The secret is out: Nina Zumel and I are busy working on Practical Data Science with R2, the second edition of our best selling book on learning data science using the R language.

Our publisher, Manning, has a great slide deck describing the book (and a discount code!!!) here:

Pdsr2s

We also just got back our part-1 technical review for the new book. Here is a quote from the technical review we are particularly proud of:

The dot notation for base R and the dplyr package did make me stand up and think. Certain things suddenly made sense.

Continue reading Practical Data Science with R2

Posted on Categories Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine LearningTags Leave a comment on Practical Data Science with R 2nd Edition update

Practical Data Science with R 2nd Edition update

We are in the last stages of proofing the galleys/typesetting of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019. So this edition will definitely be out soon!

If you ever wanted to see what Nina Zumel and John Mount are like when we have the help of editors, this book is your chance!

One thing I noticed in working through the galleys: it becomes easy to see why Dr. Nina Zumel is first author.

2/3rds of the book is her work.

Posted on Categories Administrativia, data science, Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , Leave a comment on Free R/datascience Extract: Evaluating a Classification Model with a Spam Filter

Free R/datascience Extract: Evaluating a Classification Model with a Spam Filter

We are excited to share a free extract of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019: Evaluating a Classification Model with a Spam Filter.

Zumel eacmwasf 02

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.

This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design):

One of the best data science courses I’ve taken. The course focuses on model selection and evaluation which are usually underestimated. Thanks to John Mount, the teacher and the co-authors of Practical Data Science with R. hashtag#AI200

(Note: Nina Zumel, leads on the course design, which is the heavy lifting, John Mount just got tasked to be the one delivering it.)

Zumel, Mount, Practical Data Science with R, 2nd Edition is coming out in print very soon. Here is a discount code to help you get a good deal on the book:

Take 37% off Practical Data Science with R, Second Edition by entering fcczumel3 into the discount code box at checkout at manning.com.

Posted on Categories Administrativia, data science, OpinionTags , Leave a comment on AI for Engineers

AI for Engineers

For the last year we (Nina Zumel, and myself: John Mount) have had the honor of teaching the AI200 portion of LinkedIn’s AI Academy.

John Mount at LinkedIn

John Mount at the LinkedIn campus

Nina Zumel designed most of the material, and John Mount has been delivering it and bringing her feedback. We’ve just started our 9th cohort. We adjust the course each time. Our students teach us a lot about how one thinks about data science. We bring that forward to each round of the course.

Roughly the goal is the following.

If every engineer, product manager, and project manager had some hands-on experience with data science and AI (deep neural nets), then they are both more likely to think of using these techniques in their work and of introducing the instrumentation required to have useful data in the first place.

This will have huge downstream benefits for LinkedIn. Our group is thrilled to be a part of this.

We are looking for more companies that want an on-site data science intensive for their teams (either in Python or in R).

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine LearningTags , , , Leave a comment on vtreat Cross Validation

vtreat Cross Validation

Nina Zumel finished new documentation on how vtreat‘s cross validation works, which I want to share here.

vtreat is a system that makes data preparation for machine learning a “one-liner” (available in R or available in Python). We have a set of starting off points here. These documents describe what vtreat does for you, you just find the one that matches your task and you should have a good start for solving data science problems in R or in Python.

The latest documentation is a bit about how vtreat works, and how to control some of the details of the work it is doing for you.

The new documentation is:

Please give one of the examples a try, and consider adding vtreat to your data science workflow.

Posted on Categories ProgrammingTags 1 Comment on You Can Override Just About Anything in R

You Can Override Just About Anything in R

To understand computations in R, two slogans are helpful:

  • Everything that exists is an object.
  • Everything that happens is a function call.

John Chambers

In R, the “[” array access operator is a function call. And it is one a user can re-bind to the new effect of their own choosing.

Let’s see what sort of mischief we can get into using this capability.

Continue reading You Can Override Just About Anything in R

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , , , Leave a comment on New vtreat Documentation (Starting with Multinomial Classification)

New vtreat Documentation (Starting with Multinomial Classification)

Nina Zumel finished some great new documentation showing how to use Python vtreat to prepare data for multinomial classification mode. And I have finally finished porting the documentation to R vtreat. So we now have good introductions on how to use vtreat to prepare data for the common tasks of:

That is now 8 introductions to start with. To use vtreat you only have to work through one introduction (the one helping with the task you have at hand in the language you are using).

As I have said before:

  • vtreat helps with project blocking issues commonly seen in real world data: missing values, re-coding categorical variables, and dealing high cardinality categorical variables.
  • If you aren’t using a tool like vtreat in your data science projects: you are really missing out (and making more work for yourself).
Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , Leave a comment on Why Do We Plot Predictions on the x-axis?

Why Do We Plot Predictions on the x-axis?

When studying regression models, One of the first diagnostic plots most students learn is to plot residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example.

# build an "ideal" linear process.
set.seed(34524)
N = 100
x1 = runif(N)
x2 = runif(N)
noise = 0.25*rnorm(N)
y = x1 + x2 + noise
df = data.frame(x1=x1, x2=x2, y=y)

# Fit a linear regression model
model = lm(y~x1+x2, data=df)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73508 -0.16632  0.02228  0.19501  0.55190 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.16706    0.07111   2.349   0.0208 *  
## x1           0.90047    0.09435   9.544 1.30e-15 ***
## x2           0.81444    0.09288   8.769 6.07e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2662 on 97 degrees of freedom
## Multiple R-squared:  0.6248, Adjusted R-squared:  0.6171 
## F-statistic: 80.78 on 2 and 97 DF,  p-value: < 2.2e-16
# plot it
library(ggplot2)

df$pred = predict(model, newdata=df)
df$residual = with(df, y-pred)

# standard residual plot
ggplot(df, aes(x=pred, y=residual)) + 
  geom_point(alpha=0.5) + geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) + 
  ggtitle("Standard residual plot",
          subtitle = "linear model and process")

Standard Residual Plot

In the above plot, we’re plotting the residuals as a function of model prediction, and comparing them to the line y = 0, using a smoothing curve through the residuals. The idea is that for a well-fit model, the smoothing curve should approximately lie on the line y = 0. This is true not only for linear models, but for any model that captures most of the explainable variance, and for which the unexplainable variance (the noise) is IID and zero mean.

If the residuals aren’t zero mean independently of the model’s predictions, then either you are missing some explanatory variables, or your model does not have the correct structure, or an appropriate inductive bias. A simple example of the second case is trying to fit a linear model to a process where the outcome is quadratically (or otherwise non-linearly) related to the outcome. To see this, let’s make an example quadratic system while deliberately failing to supply that structure to the model.

# a simple quadratic example
x3 = runif(N)
qf = data.frame(x1=x1, x2=x2, x3=x3)
qf$y = x1 + x2 + 2*x3^2 + 0.25*noise

# Fit a linear regression model
model2 = lm(y~x1+x2+x3, data=qf)
# summary(model2)

qf$pred = predict(model2, newdata=qf)
qf$residual = with(qf, y-pred)

ggplot(qf, aes(x=pred, y=residual)) + 
  geom_point(alpha=0.5) + geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) + 
  ggtitle("Standard residual plot",
          subtitle = "linear model, quadratic process")

Standard Residual Plot, quadratic process

In this case, the smoothing line on the residuals doesn’t approximate the line y = 0; when the model predicts a value in the range 1 to about 2.3, it tends to be overpredicting; otherwise, it tends to underpredict. This is an instance of a pathology called “structure in the residuals.”

The Peril of Outcomes on the x-axis

What happens if you erroneously plot the residuals versus the true outcome, instead of the predictions? Let’s try this with the model for the linear process (which we know is a well-fit model):

# the wrong residual graph
ggplot(df, aes(x=y, y=residual)) + 
  geom_point(alpha=0.5) + geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) + 
  ggtitle("Incorrect residual plot",
          subtitle = "linear model and process")

Incorrect residual plot, linear process

If you make this plot when you meant to make the other, you will give yourself a nasty shock. Plotting residuals versus the outcome will always look more or less like the above graph. You might think that for a good model, the outcome and the prediction are close to each other, so the residual graphs should look about the same no matter which quantity you plot on the x-axis, right? Why do they look so different?

Reversion to mediocrity (or the mean)

One reason that the proper residual graph (for a well fit model) should smooth out to the line y=0 is known as reversion to mediocrity, or regression to the mean.

Imagine that you have an ideal process that always produces a single value y. You don’t actually observe this “true value”; instead, what you observe is y plus (IID, zero mean) noise. You can build a “model” for this process that predicts the mean of the observations, in this case the value 0.1033149. Then you can calculate the residuals of your “model” in the usual way.

When you plot the residuals as a function of the prediction, all the datums fall at the same horizontal coordinate of the graph, centered around zero, and approximately equally distributed between positive and negative. The “smoothing line” through this graph is simply the point (0.1033149, 0) – that is, the graph is centered at zero.

residuals correctly

On the other hand, if you plot the residuals as a function of the observed outcome, all the observations will be sorted so that the observations with positive noise are to the right of the observations with negative noise, and the smoothing line through the graph no longer looks like the line y = 0.

residuals wrong

For a process that varies as a function of the input, you can think of the prediction corresponding to an input X as the mean of all the observations corresponding to X, and the idea is the same.

Incidentally, this regression to the mean is also why model predictions tend to have less range than the original training data.

Plotting observations versus predictions

Sometimes instead of plotting residuals versus the predictions, I plot observations versus predictions. In this case, you want to check that the predictions lie approximately on the line y = x. This isn’t a standard diagnostic plot, but it does give a better sense of the magnitude of the errors relative to the magnitudes of the outcomes. Again, the important thing to remember is that the predictions go on the x-axis.

Here’s the correct plot:

# standard prediction plot
ggplot(df, aes(x=pred, y=y)) + 
  geom_point(alpha=0.5) + geom_abline(color="red") + 
  geom_smooth(se=FALSE) + 
  ggtitle("Standard prediction plot")

standard prediction plot

And here’s the wrong plot:

# the "wrong" way
ggplot(df, aes(x=y, y=pred)) + 
  geom_point(alpha=0.5) + geom_abline(color="red") +
  geom_smooth(se=FALSE) + 
  ggtitle("Incorrect prediction plot")

wrong prediction plot

Notice how the wrong plot again seems to show pathological structure where none exists.

Conclusion

The above examples show why you should always take care to plot your model diagnostics as functions of the predictions and not of the observations. Most students have heard this already, but we feel that demonstrating why will be more memorable that simply saying “make it so.”

Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , Leave a comment on How to Prepare Data

How to Prepare Data

Real world data can present a number of challenges to data science workflows. Even properly structured data (each interesting measurement already landed in distinct columns), can present problems, such as missing values and high cardinality categorical variables.

In this note we describe some great tools for working with such data.

Continue reading How to Prepare Data

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on Preparing Data for Supervised Classification

Preparing Data for Supervised Classification

Nina Zumel has been polishing up new vtreat for Python documentation and tutorials. They are coming out so good that I find to be fair to the R community I must start to back-port this new documentation to vtreat for R.

Continue reading Preparing Data for Supervised Classification

Posted on Categories Practical Data Science, Statistics, TutorialsTags , , , , , Leave a comment on The Advantages of Record Transform Specifications

The Advantages of Record Transform Specifications

Nina Zumel had a really great article on how to prepare a nice Keras performance plot using R.


Keras plot

I will use this example to show some of the advantages of cdata record transform specifications.

Continue reading The Advantages of Record Transform Specifications