Continuing our series of reading out loud from a single page of a statistics book we look at page 224 of the 1972 Dover edition of Leonard J. Savage’s “The Foundations of Statistics.” On this page we are treated to an example attributed to Leo A. Goodman in 1953 that illustrates how for normally distributed data the maximum likelihood, unbiased, and minimum variance estimators of variance are in fact typically three different values. So in the spirit of gamesmanship you always have at least two reasons to call anybody else’s estimator incorrect. Continue reading
What is the Gauss-Markov theorem?
From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:
A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.
This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.
Suppose you are building a model of the form:
y(i) = B . x(i) + e(i)
B is a vector (to be inferred),
i is an index that runs over the available data (say
x(i) is a per-example vector of features, and
y(i) is the scalar quantity to be modeled. Only
y(i) are observed. The
e(i) term is the un-modeled component of
y(i) and you typically hope that the
e(i) can be thought of unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak/strong assumptions you put on the
e(i) (and other quantities) depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of
B under weak assumptions.
How to interpret the theorem
The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the
e(i) and without distributional assumptions about the
x(i). However, if you are using Bayesian methods or generative models for predictions you may want to use additional stronger conditions (perhaps even normality of errors and even distributional assumptions on the
We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.
When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:
- Missing values (
- Problematic numerical values (
NaN, sentinel values like 999999999 or -1)
- Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values
Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.
Been reading a lot of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd edition lately. Overall in the Bayesian framework some ideas (such as regularization, and imputation) are way easier to justify (though calculating some seemingly basic quantities becomes tedious). A big advantage (and weakness) of this formulation is statistics has a much less “shrink wrapped” feeling than the classic frequentist presentations. You feel like the material is being written to peers instead of written to calculators (of the human or mechanical variety). In the Bayesian formulation you don’t feel like you will be yelled at for using 1 tablespoon of sugar when the recipe calls for 3 teaspoons (at least if you live in the United States).
Some other stuff reads differently after this though. Continue reading
The goal of Zumel/Mount: Practical Data Science with R is to teach, through guided practice, the skills of a data scientist. We define a data scientist as the person who organizes client input, data, infrastructure, statistics, mathematics and machine learning to deploy useful predictive models into production.
Our plan to teach is to:
- Order the material by what is expected from the data scientist.
- Emphasize the already available bread and butter machine learning algorithms that most often work.
- Provide a large set of worked examples.
- Expose the reader to a number of realistic data sets.
Some of these choices may put-off some potential readers. But it is our goal to try and spend out time on what a data scientist needs to do. Our point: the data scientist is responsible for end to end results, which is not always entirely fun. If you want to specialize in machine learning algorithms or only big data infrastructure, that is a fine goal. However, the job of the data scientist is to understand and orchestrate all of the steps (working with domain experts, curating data, using data tools, and applying machine learning and statistics).
Once you define what a data scientist does, you find fewer people want to work as one.
We expand a few of our points below. Continue reading
Many data science projects and presentations are needlessly derailed by not having set shared business relevant quantitative expectations early on (for some advice see Setting expectations in data science projects). One of the most common issues is the common layman expectation of “perfect prediction” from classification projects. It is important to set expectations correctly so your partners know what you are actually working towards and do not consider late choices of criteria disappointments or “venue shopping.” Continue reading
As a data scientist I have seen variations of principal component analysis and factor analysis so often blindly misapplied and abused that I have come to think of the technique as unprincipled component analysis. PCA is a good technique often used to reduce sensitivity to overfitting. But this stated design intent leads many to (falsely) believe that any claimed use of PCA prevents overfit (which is not always the case). In this note we comment on the intent of PCA like techniques, common abuses and other options.
The idea is to illustrate what can quietly go wrong in an analysis and what tests to perform to make sure you see the issue. The main point is some analysis issues can not be fixed without going out and getting more domain knowledge, more variables or more data. You can’t always be sure that you have insufficient data in your analysis (there is always a worry that some clever technique will make the current data work), but it must be something you are prepared to consider. Continue reading
We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.
The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a lot of training bias. You can get models that look good on training data even though they have no actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to having fit mere noise.
Often there is a feeling if a model is doing really well on training data then must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is you are working through variations of worthless models that only appear to be good on training data due to overfitting. And the more “tweaking, tuning, and fixing” you try only appears to improve things because as you peek at your test-data (which you really should have held some out until the entire end of project for final acceptance) your test data is becoming less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).
Any researcher that does not have proper per-feature significance checks or hold-out testing procedures will be fooled into promoting faulty models. Continue reading
I often need to build a predictive model that estimates rates. The example of our age is: ad click through rates (how often a viewer clicks on an ad estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.