
## Why Do We Plot Predictions on the x-axis?

When studying regression models, one of the first diagnostic plots most students learn is the plot of residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example.

```r
# build an "ideal" linear process
set.seed(34524)
N = 100
x1 = runif(N)
x2 = runif(N)
noise = 0.25*rnorm(N)
y = x1 + x2 + noise
df = data.frame(x1=x1, x2=x2, y=y)

# fit a linear regression model
model = lm(y~x1+x2, data=df)
summary(model)
```
```
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.73508 -0.16632  0.02228  0.19501  0.55190
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.16706    0.07111   2.349   0.0208 *
## x1           0.90047    0.09435   9.544 1.30e-15 ***
## x2           0.81444    0.09288   8.769 6.07e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2662 on 97 degrees of freedom
## Multiple R-squared:  0.6248, Adjusted R-squared:  0.6171
## F-statistic: 80.78 on 2 and 97 DF,  p-value: < 2.2e-16
```
```r
# plot it
library(ggplot2)

df$pred = predict(model, newdata=df)
df$residual = with(df, y-pred)

# standard residual plot
ggplot(df, aes(x=pred, y=residual)) +
  geom_point(alpha=0.5) +
  geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Standard residual plot",
          subtitle = "linear model and process")
```

In the above plot, we’re plotting the residuals as a function of model prediction, and comparing them to the line `y = 0`, using a smoothing curve through the residuals. The idea is that for a well-fit model, the smoothing curve should approximately lie on the line `y = 0`. This is true not only for linear models, but for any model that captures most of the explainable variance, and for which the unexplainable variance (the noise) is IID and zero mean.

If the residuals aren’t zero mean independently of the model’s predictions, then either you are missing some explanatory variables, or your model does not have the correct structure, or an appropriate inductive bias. A simple example of the second case is trying to fit a linear model to a process where the outcome is quadratically (or otherwise non-linearly) related to the inputs. To see this, let’s make an example quadratic system while deliberately failing to supply that structure to the model.

```r
# a simple quadratic example
x3 = runif(N)
qf = data.frame(x1=x1, x2=x2, x3=x3)
qf$y = x1 + x2 + 2*x3^2 + 0.25*noise

# fit a linear regression model
model2 = lm(y~x1+x2+x3, data=qf)
# summary(model2)

qf$pred = predict(model2, newdata=qf)
qf$residual = with(qf, y-pred)

ggplot(qf, aes(x=pred, y=residual)) +
  geom_point(alpha=0.5) +
  geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Standard residual plot",
          subtitle = "linear model, quadratic process")
```

In this case, the smoothing line on the residuals doesn’t approximate the line `y = 0`; when the model predicts a value in the range 1 to about 2.3, it tends to be overpredicting; otherwise, it tends to underpredict. This is an instance of a pathology called “structure in the residuals.”

### The Peril of Outcomes on the x-axis

What happens if you erroneously plot the residuals versus the true outcome, instead of the predictions? Let’s try this with the model for the linear process (which we know is a well-fit model):

```r
# the wrong residual graph
ggplot(df, aes(x=y, y=residual)) +
  geom_point(alpha=0.5) +
  geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Incorrect residual plot",
          subtitle = "linear model and process")
```

If you make this plot when you meant to make the other, you will give yourself a nasty shock. A plot of residuals versus the outcome will always look more or less like the above graph. You might think that for a good model the outcome and the prediction are close to each other, so the residual graphs should look about the same no matter which quantity you plot on the x-axis, right? Why do they look so different?

### Reversion to mediocrity (or the mean)

One reason that the proper residual graph (for a well-fit model) should smooth out to the line `y = 0` is known as reversion to mediocrity, or regression to the mean.

Imagine that you have an ideal process that always produces a single value y. You don’t actually observe this “true value”; instead, what you observe is y plus (IID, zero mean) noise. You can build a “model” for this process that predicts the mean of the observations, in this case the value 0.1033149. Then you can calculate the residuals of your “model” in the usual way.

When you plot the residuals as a function of the prediction, all the points fall at the same horizontal coordinate of the graph, centered around zero, and approximately equally distributed between positive and negative. The “smoothing line” through this graph is simply the point (0.1033149, 0) – that is, the graph is centered at zero.

On the other hand, if you plot the residuals as a function of the observed outcome, all the observations will be sorted so that the observations with positive noise are to the right of the observations with negative noise, and the smoothing line through the graph no longer looks like the line `y = 0`.

For a process that varies as a function of the input, you can think of the prediction corresponding to an input `X` as the mean of all the observations corresponding to `X`, and the idea is the same.
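This thought experiment is easy to simulate. The sketch below re-creates it in a few lines of base R; the seed, true value, and noise level are arbitrary choices for illustration:

```r
# simulate the ideal constant process: a single true value plus IID noise
set.seed(2019)
n <- 100
y_true <- 0.1
obs <- y_true + 0.25*rnorm(n)

# the "model": predict the mean of the observations
pred <- mean(obs)
residual <- obs - pred

# plotted against the prediction, every point sits at x = pred,
# and the residuals are centered on zero
mean(residual)      # essentially zero

# plotted against the outcome, the residual is a deterministic,
# increasing function of the observation: the points sort themselves,
# with negative residuals on the left and positive ones on the right
cor(obs, residual)  # exactly 1
```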

Incidentally, this regression to the mean is also why model predictions tend to have less range than the original training data.
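A quick self-contained check of this range-shrinkage effect (re-simulating data in the style of the earlier example):

```r
# linear process plus noise, as before
set.seed(34524)
N = 100
x1 = runif(N)
x2 = runif(N)
y = x1 + x2 + 0.25*rnorm(N)
model = lm(y ~ x1 + x2)

# for least squares with an intercept, var(y) = var(pred) + var(residual),
# so the predictions necessarily have less spread than the outcomes
sd(predict(model)) < sd(y)  # TRUE
```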

### Plotting observations versus predictions

Sometimes instead of plotting residuals versus the predictions, I plot observations versus predictions. In this case, you want to check that the points lie approximately on the line `y = x`. This isn’t a standard diagnostic plot, but it does give a better sense of the magnitude of the errors relative to the magnitudes of the outcomes. Again, the important thing to remember is that the predictions go on the x-axis.

Here’s the correct plot:

```r
# standard prediction plot
ggplot(df, aes(x=pred, y=y)) +
  geom_point(alpha=0.5) +
  geom_abline(color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Standard prediction plot")
```

And here’s the wrong plot:

```r
# the "wrong" way
ggplot(df, aes(x=y, y=pred)) +
  geom_point(alpha=0.5) +
  geom_abline(color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Incorrect prediction plot")
```

Notice how the wrong plot again seems to show pathological structure where none exists.

### Conclusion

The above examples show why you should always take care to plot your model diagnostics as functions of the predictions, not of the observations. Most students have heard this already, but we feel that demonstrating why is more memorable than simply saying “make it so.”


## How to Prepare Data

Real-world data can present a number of challenges to data science workflows. Even properly structured data (where each measurement of interest already lands in its own column) can present problems, such as missing values and high-cardinality categorical variables.

In this note we describe some great tools for working with such data.


## Preparing Data for Supervised Classification

Nina Zumel has been polishing up the new `vtreat` for `Python` documentation and tutorials. They are coming out so well that, to be fair to the `R` community, I find I must start back-porting this new documentation to `vtreat` for `R`.


## The Advantages of Record Transform Specifications

Nina Zumel had a really great article on how to prepare a nice `Keras` performance plot using `R`.

I will use this example to show some of the advantages of `cdata` record transform specifications.


## Practical Data Science with R update

Just got the following note from a new reader:

Thank you for writing Practical Data Science with R. It’s challenging for me, but I am learning a lot by following your steps and entering the commands.

Wow, this is exactly what Nina Zumel and I hoped for. We wish we could make everything easy, but an appropriate amount of challenge is required for significant learning and accomplishment.

Of course we try to avoid inessential problems. All of the code examples from the book can be found here (and all the data sets here).

The second edition is coming out very soon. Please check it out.


## WVPlots 1.1.2 on CRAN

I have put a new release of the `WVPlots` package up on CRAN. This release adds palette and/or color controls to most of the plotting functions in the package.

`WVPlots` was originally a catch-all package of `ggplot2` visualizations that we at Win-Vector tended to use repeatedly, and wanted to turn into “one-liners.” A consequence of this is that the older visualizations had our preferred color schemes hard-coded in. More recent additions to the package sometimes had palette or color controls, but not in a consistent way. Making color controls more consistent has been a “todo” for a while—one that I’d been putting off. A recent request from user Brice Richard (thanks Brice!) has pushed me to finally make the changes.

Most visualizations in the package that color-code by group now have a `palette` argument that takes the name of a Brewer palette for the graph; `Dark2` is usually the default. To use the `ggplot2` default palette, or to set an alternative palette, such as viridis or a manually specified color scheme, set `palette=NULL`. Here are some examples:

```r
library(WVPlots)
library(ggplot2)

mpg = ggplot2::mpg
mpg$trans = gsub("\\(.*$", '', mpg$trans)

# default palette: Dark2
DoubleDensityPlot(mpg, "cty", "trans", "City driving mpg by transmission type")
```

```r
# set a different Brewer color palette
DoubleDensityPlot(mpg, "cty", "trans",
                  "City driving mpg by transmission type",
                  palette = "Accent")
```

```r
# set a custom palette
cmap = c("auto" = "#7b3294", "manual" = "#008837")

DoubleDensityPlot(mpg, "cty", "trans",
                  "City driving mpg by transmission type",
                  palette=NULL) +
  scale_color_manual(values=cmap) +
  scale_fill_manual(values=cmap)
```

For other plots, the user can now specify the desired color for different elements of the graph.

```r
title = "Count of cars by number of carburetors and cylinders"

# default fill: darkblue
# (the function call was lost in formatting; ShadowPlot on mtcars
#  matches the title and the "darkblue" default fill)
ShadowPlot(mtcars, "carb", "cyl",
           title = title)
```

```r
# specify fill
ShadowPlot(mtcars, "carb", "cyl",
           title = title,
           fillcolor = "#a6611a")
```

We hope that these changes make `WVPlots` even more useful to our users. For examples of several of the visualizations in `WVPlots`, see this example vignette. For the complete list of visualizations, see the reference page.


## Advanced Data Reshaping in Python and R

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through the use of coordinatized data concepts (relying heavily on Codd’s “rule of access”).

The advantages of data_algebra and cdata are:

• The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
• The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
• The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).
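As an illustration of the transform-by-example idea, here is a minimal sketch using cdata’s convenience functions (the data frame and column names are invented for this example; for the general case one builds an explicit control table):

```r
library(cdata)

# "block" records: one row per (id, measurement) pair
d <- data.frame(
  id = c(1, 1, 2, 2),
  measurement = c("height", "weight", "height", "weight"),
  value = c(180, 75, 165, 60))

# move to "row" records: one row per id, one column per measurement
d_rows <- pivot_to_rowrecs(
  d,
  columnToTakeKeysFrom = "measurement",
  columnToTakeValuesFrom = "value",
  rowKeyColumns = "id")

# d_rows now has columns: id, height, weight
```

The inverse transform (rows back to blocks) is available as `unpivot_to_blocks()`, and the same specification idea carries over to data_algebra on the Python side.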

## New Getting Started with vtreat Documentation

Win Vector LLC‘s Dr. Nina Zumel has just released some new vtreat documentation.

vtreat is an all-in-one-step data preparation system that helps defend your machine learning algorithms from:

• Missing values
• Large cardinality categorical variables
• Novel levels from categorical variables
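To make that concrete, here is a minimal sketch of the R vtreat workflow for a binary classification task (the small data frame and its column names are invented for illustration):

```r
library(vtreat)

# example data with a missing value and a categorical variable
d <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  code = c("a", "b", "a", "c", "b", "a"),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE))

# design variable treatments for predicting y == TRUE
treatments <- designTreatmentsC(d, c("x", "code"), "y", TRUE,
                                verbose = FALSE)

# prepare() yields an all-numeric frame with no missing values,
# safe to hand to most machine learning algorithms
d_treated <- prepare(treatments, d)
```

Novel categorical levels seen only at application time are also handled by `prepare()`, rather than crashing the downstream model.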

I hoped she could get the Python vtreat documentation up to parity with the R vtreat documentation. But I think she really hit the ball out of the park, and went way past that.

The new documentation consists of three “getting started” guides. These guides deliberately overlap, so you don’t have to read them all: just read the one suited to your problem and go.

The new guides:

Perhaps we can back-port the new guides to the R version at some point.