Authors: John Mount (more articles) and Nina Zumel (more articles).
When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 2 of our four part mini-series “How do you know if your model is going to work?” we develop in-training set measures.
Previously we worked on:
- Part 1: Defining the scoring problem
In-training set measures
The most tempting procedure is to score your model on the data used to train it. The attraction is this avoids the statistical inefficiency of denying some of your data to the training procedure.
Run it once procedure
A common way to asses score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data as shown below.
What we have done is take five popular machine learning techniques (random forest, logistic regression, gbm, GAM logistic regression, and elastic net logistic regression) and plotted their performance in terms of AUC and normalized deviance on their own training data. For AUC larger numbers are better, and for deviance smaller numbers are better. Because we have evaluated multiple models we are starting to get a sense of scale. We should suspect an AUC of 0.7 on training data is good (though random forest achieved an AUC on training of almost 1.0), and we should be acutely aware that evaluating models on their own training data has an upward bias (the model has seen the training data, so it has a good chance of doing well on it; or training data is not exchangeable with future data for the purpose of estimating model performance).
There are two more Gedankenexperiment models that any machine data scientist should always have in mind:
- The null model (on the graph as “null model”). This is the performance of the best constant model (model that returns the same answer for all datums). In this case it is a model scores each and every row as having an identical 7% chance of churning. This is an important model that you want to better than. It is also a model you are often competing against as a data science as it is the “what if we treat everything in this group the same” option (often the business process you are trying to replace).
The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5) and packages like logistic regression routinely report this statistic.
- The best single variable model (on the graph as “best single variable model”). This is the best model built using only one variable or column (in this case using a GAM logistic regression as the modeling method). This is another model the data scientist wants to out perform as it represents the “maybe one of the columns is already the answer case” (if so that would be very good for the business as they could get good predictions without modeling infrastructure).
The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model you have not outperformed what an analyst can find with a single pivot table.
At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:
Continue reading How do you know if your model is going to work? Part 2: In-training set measures
Authors: John Mount (more articles) and Nina Zumel (more articles).
“Essentially, all models are wrong, but some are useful.”
Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.
We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?
Geocentric illustration Bartolomeu Velho, 1568 (Bibliothèque Nationale, Paris)
Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.
In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).
To organize the ideas into digestible chunks, we are presenting this article as a four part series (to finished in the next 3 Tuesdays). This part (part 1) sets up the specific problem.
Continue reading How do you know if your model is going to work? Part 1: The problem
I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about some of the statistical mistakes I have made, as a mere working data scientist, and more of a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).
Used wrong (image Justin Baeder, some rights reserved).
Continue reading I was wrong about statistics
In Gelman and Nolan’s paper “You Can Load a Die, But You Can’t Bias a Coin” The American Statistician, November 2002, Vol. 56, No. 4 it is argued you can’t easily produce a coin that is biased when flipped (and caught). A number of variations that can be easily biased (such as spinning) are also discussed.
Obviously Gelman and Nolan are smart and careful people. And we are discussing a well-regarded peer-reviewed article. So we don’t expect there is a major error. What we say is the abstraction they are using doesn’t match the physical abstraction I would pick. I pick a different one and I get different results. This is what I would like to discuss. Continue reading I still think you can manufacture an unfair coin
Deal of the Day August 1: Half off my book Practical Data Science with R. Use code dotd0801au at www.manning.com/zumel/
A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling of claiming novelty in putting old wine in new bottles. Rob Tibshirani nailed this type of distinction in is famous machine learning versus statistics glossary.
I’ve written about statistics v.s. machine learning , but I would like to explain why we (the authors of this blog) often use the term data science. Nina Zumel explained being a data scientist very well, I am going to take a swipe at explaining data science.
We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered parts of a larger end to end process. The process we are interested in is the deployment of useful data driven models into production. The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques and applying statistics criticisms. The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use. I tend to use it a lot, because while I love the tools and techniques our true loyalty is to the whole process (and I want to emphasize this to our readers).
The phrase “data science” as in use it today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid 1990’s (this group emphasized simulation based modeling of small molecules and their biological interactions, the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven from computer simulations). For some people data science is considered a new calling and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion. Continue reading Data Science, Machine Learning, and Statistics: what is in a name?
Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope after reading this feel a least a small urge to double check your work and presentations to make sure you have not reported correlation where R-squared, likelihood or root mean square error (RMSE) would have been more appropriate.
It is tempting (but wrong) to use correlation to track the performance of model predictions. The temptation arises because we often (correctly) use correlation to evaluate possible model inputs. And the correlation function is often a convenient built-in function. Continue reading Don’t use correlation to track prediction performance
Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them. Continue reading Level fit summaries can be tricky in R
Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for all inputs, not in the technical statistical sense of immune to deviations from assumptions or outliers.)
Even a detailed reference such as “Categorical Data Analysis” (Alan Agresti, Wiley, 1990) leaves off with an empirical observation: “the convergence … for the Newton-Raphson method is usually fast” (chapter 4, section 4.7.3, page 117). This is a book that if there is a known proof that the estimation step is a contraction (one very strong guarantee of convergence) you would expect to see the proof reproduced. I always suspected there was some kind of Brouwer fixed-point theorem based folk-theorem proving absolute convergence of the Newton-Raphson method in for the special case of logistic regression. This can not be the case as the Newton-Raphson method can diverge even on trivial full-rank well-posed logistic regression problems. Continue reading How robust is logistic regression?
One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — namely, the model’s explicit estimates of variables’ explanatory value, and explicit insight into and control of variable to variable dependence.
Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.
Continue reading Modeling Trick: Impact Coding of Categorical Variables with Many Levels