
## Why Do We Plot Predictions on the x-axis?

When studying regression models, one of the first diagnostic plots most students learn is the plot of residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example.

```r
# build an "ideal" linear process.
set.seed(34524)
N = 100
x1 = runif(N)
x2 = runif(N)
noise = 0.25*rnorm(N)
y = x1 + x2 + noise
df = data.frame(x1=x1, x2=x2, y=y)

# Fit a linear regression model
model = lm(y~x1+x2, data=df)
summary(model)
```
```
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.73508 -0.16632  0.02228  0.19501  0.55190
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.16706    0.07111   2.349   0.0208 *
## x1           0.90047    0.09435   9.544 1.30e-15 ***
## x2           0.81444    0.09288   8.769 6.07e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2662 on 97 degrees of freedom
## Multiple R-squared:  0.6248, Adjusted R-squared:  0.6171
## F-statistic: 80.78 on 2 and 97 DF,  p-value: < 2.2e-16
```
```r
# plot it
library(ggplot2)

df$pred = predict(model, newdata=df)
df$residual = with(df, y-pred)

# standard residual plot
ggplot(df, aes(x=pred, y=residual)) +
  geom_point(alpha=0.5) + geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Standard residual plot",
          subtitle = "linear model and process")
```

In the above plot, we’re plotting the residuals as a function of model prediction, and comparing them to the line `y = 0`, using a smoothing curve through the residuals. The idea is that for a well-fit model, the smoothing curve should approximately lie on the line `y = 0`. This is true not only for linear models, but for any model that captures most of the explainable variance, and for which the unexplainable variance (the noise) is IID and zero mean.
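As a numeric companion to the plot, here is a minimal sketch that checks two things the picture suggests: the residuals average to zero, and they are uncorrelated with the predictions. It assumes an ordinary least squares fit with an intercept, evaluated on its own training data, where both quantities are zero by construction (up to rounding). The names `d`, `fit`, `mean_resid`, and `cor_pred_resid` are illustrative and not part of the original example.

```r
# Simulate a linear process similar to the one above (illustrative names).
set.seed(34524)
N <- 100
d <- data.frame(x1 = runif(N), x2 = runif(N))
d$y <- d$x1 + d$x2 + 0.25 * rnorm(N)

# Fit by least squares and compute in-sample predictions and residuals.
fit <- lm(y ~ x1 + x2, data = d)
d$pred <- predict(fit, newdata = d)
d$residual <- d$y - d$pred

# For least squares with an intercept, both values are zero up to rounding.
mean_resid <- mean(d$residual)
cor_pred_resid <- cor(d$pred, d$residual)
print(c(mean_resid = mean_resid, cor_pred_resid = cor_pred_resid))
```

Because these two numbers are zero for any such fit, they cannot reveal non-linear structure on their own; the smoothing curve in the plot is what shows whether the residuals drift away from zero over some range of predictions.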

If the residuals aren’t zero mean independently of the model’s predictions, then either you are missing some explanatory variables, or your model does not have the correct structure or an appropriate inductive bias. A simple example of the second case is trying to fit a linear model to a process where the outcome is quadratically (or otherwise non-linearly) related to the inputs. To see this, let’s make an example quadratic system while deliberately failing to supply that structure to the model.

```r
# a simple quadratic example
x3 = runif(N)
qf = data.frame(x1=x1, x2=x2, x3=x3)
qf$y = x1 + x2 + 2*x3^2 + 0.25*noise

# Fit a linear regression model
model2 = lm(y~x1+x2+x3, data=qf)
# summary(model2)

qf$pred = predict(model2, newdata=qf)
qf$residual = with(qf, y-pred)

ggplot(qf, aes(x=pred, y=residual)) +
  geom_point(alpha=0.5) + geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Standard residual plot",
          subtitle = "linear model, quadratic process")
```

In this case, the smoothing line on the residuals doesn’t approximate the line `y = 0`; when the model predicts a value in the range 1 to about 2.3, it tends to be overpredicting; otherwise, it tends to underpredict. This is an instance of a pathology called “structure in the residuals.”
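One rough way to put a number on “structure in the residuals” is to ask how much of the residual variance a simple curve in the predictions can explain. The sketch below does this for a quadratic process fit two ways: with a purely linear model, and with a model that is given the squared term. The sample size, noise level, seed, and the helper `structure_score` are all assumptions made for illustration; they are not taken from the example above.

```r
# Simulate a process that is quadratic in x3 (illustrative settings).
set.seed(2024)
N <- 1000
q <- data.frame(x1 = runif(N), x2 = runif(N), x3 = runif(N))
q$y <- q$x1 + q$x2 + 2 * q$x3^2 + 0.05 * rnorm(N)

# Hypothetical helper: R-squared of a quadratic curve fit to residual ~ prediction.
# Near zero means no visible curvature; larger values mean structure.
structure_score <- function(fit) {
  check <- data.frame(pred = fitted(fit), residual = residuals(fit))
  summary(lm(residual ~ poly(pred, 2), data = check))$r.squared
}

fit_linear <- lm(y ~ x1 + x2 + x3, data = q)         # wrong structure
fit_quadratic <- lm(y ~ x1 + x2 + I(x3^2), data = q) # correct structure

score_linear <- structure_score(fit_linear)
score_quadratic <- structure_score(fit_quadratic)
print(c(linear = score_linear, quadratic = score_quadratic))
```

The mis-specified linear fit should score well above zero, while the correctly structured fit should score close to zero. This is only a crude summary; the residual plot remains the more informative diagnostic.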

### The Peril of Outcomes on the x-axis

What happens if you erroneously plot the residuals versus the true outcome, instead of the predictions? Let’s try this with the model for the linear process (which we know is a well-fit model):

```r
# the wrong residual graph
ggplot(df, aes(x=y, y=residual)) +
  geom_point(alpha=0.5) + geom_hline(yintercept=0, color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Incorrect residual plot",
          subtitle = "linear model and process")
```

If you make this plot when you meant to make the other, you will give yourself a nasty shock. Plotting residuals versus the outcome will always look more or less like the above graph. You might think that for a good model, the outcome and the prediction are close to each other, so the residual graphs should look about the same no matter which quantity you plot on the x-axis, right? Why do they look so different?

### Reversion to mediocrity (or the mean)

One reason that the proper residual graph (for a well-fit model) should smooth out to the line `y = 0` is known as reversion to mediocrity, or regression to the mean.

Imagine that you have an ideal process that always produces a single value y. You don’t actually observe this “true value”; instead, what you observe is y plus (IID, zero mean) noise. You can build a “model” for this process that predicts the mean of the observations, in this case the value 0.1033149. Then you can calculate the residuals of your “model” in the usual way.

When you plot the residuals as a function of the prediction, all the data points fall at the same horizontal coordinate of the graph, centered around zero, and approximately equally distributed between positive and negative. The “smoothing line” through this graph is simply the point (0.1033149, 0); that is, the graph is centered at zero. On the other hand, if you plot the residuals as a function of the observed outcome, all the observations will be sorted so that the observations with positive noise are to the right of the observations with negative noise, and the smoothing line through the graph no longer looks like the line `y = 0`. For a process that varies as a function of the input, you can think of the prediction corresponding to an input `X` as the mean of all the observations corresponding to `X`, and the idea is the same.
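The constant-process thought experiment can be sketched in a few lines. The true value, noise level, sample size, and seed below are arbitrary assumptions, so the numbers will not match the mean quoted in the text; the point is only the shape of the two relationships.

```r
# A process that always produces the same value, observed with zero-mean noise.
set.seed(1)
true_value <- 0.1                        # hypothetical "true" value
obs <- true_value + 0.25 * rnorm(50)     # what we actually observe

# The "model" predicts the mean of the observations for every row.
constant_pred <- rep(mean(obs), length(obs))
residual <- obs - constant_pred

# Versus the prediction: every point shares one x value, and the
# residuals average to zero there.
mean_residual <- mean(residual)

# Versus the observation: residual = obs - mean(obs), so the points
# fall exactly on a line of slope 1 rather than around y = 0.
slope_vs_obs <- unname(coef(lm(residual ~ obs))[2])
print(c(mean_residual = mean_residual, slope_vs_obs = slope_vs_obs))
```

Sorting by the observation is the same as sorting by the noise, which is why the second view shows a trend even though the “model” is as good as it can be.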

Incidentally, this regression to the mean is also why model predictions tend to have less range than the original training data.
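For a least squares fit with an intercept, both effects can be stated exactly. The outcome is the prediction plus the residual, and the two pieces are uncorrelated in-sample, so the covariance of the outcome with the residual equals the residual variance. It follows that the correlation between outcome and residual is `sqrt(1 - R^2)`, and that the variance of the predictions is `R^2` times the variance of the outcome. The sketch below checks both identities numerically on simulated data; the variable names are illustrative.

```r
# Simulate a linear process and fit it by least squares (illustrative names).
set.seed(34524)
N <- 100
d <- data.frame(x1 = runif(N), x2 = runif(N))
d$y <- d$x1 + d$x2 + 0.25 * rnorm(N)
fit <- lm(y ~ x1 + x2, data = d)

pred <- fitted(fit)
residual <- residuals(fit)
r_squared <- summary(fit)$r.squared

# Identity 1: residuals are positively correlated with the outcome,
# which is why the residual-versus-outcome plot always slopes upward.
cor_outcome_resid <- cor(d$y, residual)

# Identity 2: predictions vary less than the outcome they model.
var_ratio <- var(pred) / var(d$y)

print(c(cor_outcome_resid = cor_outcome_resid,
        sqrt_one_minus_r2 = sqrt(1 - r_squared),
        var_ratio = var_ratio,
        r_squared = r_squared))
```

Only a model with `R^2` equal to 1 would show no trend against the outcome, so a slope in that plot says nothing about whether the model is mis-specified.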

### Plotting observations versus predictions

Sometimes instead of plotting residuals versus the predictions, I plot observations versus predictions. In this case, you want to check that the predictions lie approximately on the line `y = x`. This isn’t a standard diagnostic plot, but it does give a better sense of the magnitude of the errors relative to the magnitudes of the outcomes. Again, the important thing to remember is that the predictions go on the x-axis.

Here’s the correct plot:

```r
# standard prediction plot
ggplot(df, aes(x=pred, y=y)) +
  geom_point(alpha=0.5) + geom_abline(color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Standard prediction plot")
```

And here’s the wrong plot:

```r
# the "wrong" way
ggplot(df, aes(x=y, y=pred)) +
  geom_point(alpha=0.5) + geom_abline(color="red") +
  geom_smooth(se=FALSE) +
  ggtitle("Incorrect prediction plot")
```

Notice how the wrong plot again seems to show pathological structure where none exists.

### Conclusion

The above examples show why you should always take care to plot your model diagnostics as functions of the predictions and not of the observations. Most students have heard this already, but we feel that demonstrating why will be more memorable than simply saying “make it so.”


## How to de-Bias Standard Deviation Estimates

This note is about attempting to remove the bias brought in by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is not important to remove it for reasonably sized samples, and (despite that) give a very complete bias management solution.

In statistical work in the age of big data we often get hung up on differences that are statistically significant (reliable enough to show up again and again in repeated measurements), but clinically insignificant (visible in aggregation, but too small to make any real difference to individuals).

An example would be: a diet that changes individual weight by an ounce on average with a standard deviation of a pound. With a large enough population the diet is statistically significant. It could also be used to shave an ounce off a national average weight. But, for any one individual: this diet is largely pointless.

The concept is teachable, but we have always stumbled over the naming “statistical significance” versus “practical clinical significance.”

I am suggesting trying the word “substantial” (and its antonym “insubstantial”) to describe if changes are physically small or large.

This comes down to having to remind people that “p-values are not effect sizes”. In this article we recommended reporting three statistics: a units-based effect size (such as expected delta pounds), a dimensionless effect size (such as Cohen’s d), and a reliability of experiment size measure (such as a statistical significance, which at best measures only one possible risk: re-sampling risk).

The merit is: if we don’t confound different meanings, we may be less confusing. A downside is: some of these measures are a bit technical to discuss. I’d be interested in hearing opinions and about teaching experiences along these distinctions.
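To make the diet example concrete, here is a small sketch that reports the three kinds of statistics side by side. It assumes a mean weight change of one ounce (1/16 pound), a standard deviation of one pound, and a simple two-sided z-test of the mean change against zero with the standard deviation treated as known. The sample sizes are arbitrary choices for illustration.

```r
# Units-based effect size: one ounce, expressed in pounds.
effect_lb <- 1 / 16
sd_lb <- 1

# Dimensionless effect size (Cohen's d): effect divided by standard deviation.
cohens_d <- effect_lb / sd_lb

# Significance of the mean change at several (hypothetical) sample sizes,
# using a two-sided z-test with known standard deviation.
n <- c(100, 1000, 10000, 100000)
z <- cohens_d * sqrt(n)
p_value <- 2 * pnorm(-abs(z))

print(data.frame(n = n, effect_lb = effect_lb, cohens_d = cohens_d,
                 z = z, p_value = p_value))
```

The effect size columns never change, while the p-value can be pushed as low as you like by collecting more data. That is the sense in which an effect can be statistically significant and still insubstantial.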


## Upcoming data preparation and modeling article series

I am pleased to announce that `vtreat` version 0.6.0 is now available to `R` users on CRAN. `vtreat` is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an `R` user we strongly suggest you incorporate `vtreat` into your projects.


## Data Preparation, Long Form and tl;dr Form

Data preparation and cleaning are some of the most important steps of predictive analytic and data science tasks. They are laborious, are where most of the errors are made, are your last line of defense against wild data, and hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self-contained or genteel as tweaking machine learning models or hyperparameter tuning; and that is one of the reasons data preparation represents such an important practical opportunity for improvement. Our group is distributing a detailed writeup of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what `vtreat` does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-`R` environments (such as `Python`/`Pandas`/`scikit-learn`, `Spark`, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form.


## Be careful evaluating model predictions

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.

This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio (one pound-force second is about 4.448 newton-seconds), and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
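The same effect is easy to reproduce with any constant change of units. The sketch below uses made-up masses, with the “predictions” reported in grams while the truth is in kilograms; the data, names, and unit choice are assumptions for illustration and have nothing to do with the orbiter’s actual software.

```r
# Hypothetical true values in kilograms.
set.seed(7)
truth_kg <- runif(20, min = 1, max = 10)

# "Predictions" that are exactly right, but reported in grams.
pred_g <- 1000 * truth_kg

# Correlation ignores the re-scaling and reports a perfect score...
correlation <- cor(pred_g, truth_kg)

# ...while a score computed on the values actually in hand does not.
rmse <- sqrt(mean((pred_g - truth_kg)^2))

print(c(correlation = correlation, rmse = rmse))
```

A correlation of 1 here only says that some re-scaling of the predictions would be perfect; the root mean squared error reflects the predictions as delivered.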

The need for a convenient direct F-test without accidentally triggering the implicit re-scaling that is associated with calculating a correlation is one of the reasons we supply the sigr R library. However, even then things can become confusing.

## You should re-encode high cardinality categorical variables

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes.

In a sort of “burying the lede” way I feel we may not have sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.

In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, 1 being perfect and below zero worse than using an average). The training pane represents performance on the training data (perfect, but over-fit) and the test pane represents performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.
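A much smaller version of this kind of failure can be sketched directly. This is not the experiment behind the graph; it is an illustrative setup in which the outcome is pure noise, a single categorical variable has 200 levels, and each level appears twice in the training data. The column name `zip`, the level names, and the helper `r_squared` are all hypothetical.

```r
# A high cardinality categorical variable with no real relation to the outcome.
set.seed(2017)
n_levels <- 200
levels_all <- paste0("lev_", seq_len(n_levels))

# Training data: every level appears exactly twice; y is pure noise.
train <- data.frame(
  zip = factor(rep(levels_all, times = 2), levels = levels_all),
  y = rnorm(2 * n_levels)
)
# Test data: levels drawn from the same set, fresh noise for y.
test <- data.frame(
  zip = factor(sample(levels_all, 400, replace = TRUE), levels = levels_all),
  y = rnorm(400)
)

# Throw the raw variable into a linear model with no re-encoding.
fit <- lm(y ~ zip, data = train)

# R-squared relative to predicting the mean: 1 is perfect, below 0 is
# worse than using an average.
r_squared <- function(y, pred) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

train_r2 <- r_squared(train$y, predict(fit, newdata = train))
test_r2 <- r_squared(test$y, predict(fit, newdata = test))
print(c(train_r2 = train_r2, test_r2 = test_r2))
```

The model memorizes a per-level average of the training noise, so it looks useful on training data and does worse than a constant on held-out data.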

Please read on for how to fix this.


## Data science for executives and managers

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made up of practitioners (who we hope are already planning to attend), so we are asking you, our technical readers, to help promote this talk to a broader audience of executives and managers.

Our message is: if you have to manage data science projects, you need to know how to evaluate results.

In these talks we will lay out how data science results should be examined and evaluated. If you can’t make ODSC (or do attend and like what you see), please reach out to us and we can arrange to present an appropriate targeted summarized version to your executive team.


## On accuracy

In our last article on the algebra of classifier measures we encouraged readers to work through Nina Zumel’s original “Statistics to English Translation” series. This series has become slightly harder to find as we have used the original category designation “statistics to English translation” for additional work.

To make things easier, here are links to the original three articles, which work through scores and significance, and include a glossary.

A lot of what Nina is presenting can be summed up in the diagram below (also by her). If in the diagram the first row is truth (say red disks are infected) which classifier is the better initial screen for infection? Should you prefer the model 1 80% accurate row or the model 2 70% accurate row? This example helps break dependence on “accuracy as the only true measure” and promote discussion of additional measures.

## A budget of classifier evaluation measures

Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?”

• Read Nina Zumel’s excellent series on scoring classifiers.
• Keep notes.
• Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you want a flexible score) and “deviance” late in a project (when you want a strict score).
• When working on practical problems work with your business partners to find out which of precision/recall, or sensitivity/specificity most match their business needs. If you have time show them and explain the ROC plot and invite them to price and pick points along the ROC curve that most fit their business goals. Finance partners will rapidly recognize the ROC curve as “the efficient frontier” of classifier performance and be very comfortable working with this summary.
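As a small aid for keeping the paired names straight, the sketch below computes accuracy, precision/recall, and sensitivity/specificity from one made-up set of ten labeled predictions. The data and variable names are purely illustrative.

```r
# Ten hypothetical cases: the true class and a classifier's decision.
truth <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
pred  <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

# The four cells of the confusion matrix.
tp <- sum(pred & truth)    # true positives
fp <- sum(pred & !truth)   # false positives
fn <- sum(!pred & truth)   # false negatives
tn <- sum(!pred & !truth)  # true negatives

accuracy    <- (tp + tn) / length(truth)
precision   <- tp / (tp + fp)  # of the cases flagged, how many are real
recall      <- tp / (tp + fn)  # of the real cases, how many are flagged
sensitivity <- recall          # same quantity under its medical name
specificity <- tn / (tn + fp)  # of the negative cases, how many are cleared

print(c(accuracy = accuracy, precision = precision, recall = recall,
        sensitivity = sensitivity, specificity = specificity))
```

Recall and sensitivity are the same number under two names, which is part of why the vocabulary can feel larger than the set of underlying ideas.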

That being said it always seems like there is a bit of gamesmanship in that somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Some choice of measure is signaling your pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background) and hoping to befuddle others.

Stanley Wyatt illustration from “Mathmanship” Nicholas Vanserg, 1958, collected in A Stress Analysis of a Strapless Evening Gown, Robert A. Baker, Prentice-Hall, 1963

The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.