Continuing our series of reading out loud from a single page of a statistics book we look at page 224 of the 1972 Dover edition of Leonard J. Savage’s “The Foundations of Statistics.” On this page we are treated to an example attributed to Leo A. Goodman in 1953 that illustrates how for normally distributed data the maximum likelihood, unbiased, and minimum variance estimators of variance are in fact typically three different values. So in the spirit of gamesmanship you always have at least two reasons to call anybody else’s estimator incorrect. Continue reading
Win-Vector LLC‘s John Mount will be speaking at Strata + Hadoop World 2014 this month. Please attend my panel on data inventories (a key driver of data science project success) and attend my “Practical Data Science with R” book office hour (get your book signed!). Thank you both O’Reilly Media, Inc. and Waterline Data Science for making this possible.
Current schedule/location details after the click. Continue reading
As we’ve mentioned on previous occasions, one of the defining characteristics of data science is the emphasis on the availability of “large” data sets, which we define as “enough data that statistical efficiency is not a concern” (note that a “large” data set need not be “big data,” however you choose to define it). In particular, we advocate the use of hold-out data to evaluate the performance of models.
There is one caveat: if you are evaluating a series of models to pick the best (and you usually are), then a single hold-out set is strictly speaking not enough. Hastie, et.al, say it best:
Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.
The ideal way to select a model from a set of candidates (or set parameters for a model, for example the regularization constant) is to use a training set to train the model(s), a calibration set to select the model or choose parameters, and a test set to estimate the generalization error of the final model.
In many situations, breaking your data into three sets may not be practical: you may not have very much data, or the the phenomena you’re interested in are rare enough that you need a lot of data to detect them. In those cases, you will need more statistically efficient estimates for generalization error or goodness-of-fit. In this article, we look at the PRESS statistic, and how to use it to estimate generalization error and choose between models.
The primary user-facing data types in the R statistical computing environment behave as vectors. That is: one dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so-on) but the most common data types are vectors. In fact vectors are so common in R that scalar values such as the number
5 are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not a R vector. In fact “factor” is not a first-class citizen in R, which can lead to some ugly bugs.
For example, consider the following R code.
levels <- c('a','b','c') f <- factor(c('c','a','a',NA,'b','a'),levels=levels) print(f) ##  c a a <NA> b a ## Levels: a b c print(class(f)) ##  "factor"
This example encoding a series of 6 observations into a known set of factor-levels (
'c'). As is the case with real data some of the positions might be missing/invalid values such as
NA. One of the strengths of R is we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level
'a' was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.
fRevised <- ifelse(is.na(f),'a',f) print(fRevised) ##  "3" "1" "1" "a" "2" "1" print(class(fRevised)) ##  "character"
Notice the new column
fRevised is an absolute mess (and not even of class/type factor). This sort of fix would have worked if
f had been a vector of characters or even a vector of integers, but for factors we get gibberish.
We are going to work through some more examples of this problem. Continue reading
From time to time we are asked “what is the company name Win-Vector LLC referring to?” It is a cryptic pun trying to be an encoding of “we deliver victory.”
The story is an inside joke referring to something really only funny to one of the founders. But a joke that amuses the teller is always enjoyed by at least one person. Win-Vector LLC’s John Mount had the honor of co-authoring a 1997 paper titled “The Polytope of Win Vectors.” The paper title is obviously mathematical terms in an odd combination. However the telegraphic grammar is coincidentally similar to deliberately ungrammatical gamer slang such as “full of win” and “so much win.”
If we treat “win” as a concrete noun (say something you can put in a sack) and “vector” in its non-mathematical sense (as an entity of infectious transmission) we have “Win-Vector LLC is an infectious delivery of victory.” I.e.: we deliver success to our clients. Of course, we have now attempt to explain a weak joke. It is not as grand as “winged victory,” but it does encode a positive company value: Win-Vector LLC delivers successful data science projects and training to clients.
Winged Victory: from Wikipedia
Let’s take this as an opportunity to describe what a win vector is. Continue reading
What is the Gauss-Markov theorem?
From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:
A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.
This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.
Suppose you are building a model of the form:
y(i) = B . x(i) + e(i)
B is a vector (to be inferred),
i is an index that runs over the available data (say
x(i) is a per-example vector of features, and
y(i) is the scalar quantity to be modeled. Only
y(i) are observed. The
e(i) term is the un-modeled component of
y(i) and you typically hope that the
e(i) can be thought of unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak/strong assumptions you put on the
e(i) (and other quantities) depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of
B under weak assumptions.
How to interpret the theorem
The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the
e(i) and without distributional assumptions about the
x(i). However, if you are using Bayesian methods or generative models for predictions you may want to use additional stronger conditions (perhaps even normality of errors and even distributional assumptions on the
We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.
When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:
- Missing values (
- Problematic numerical values (
NaN, sentinel values like 999999999 or -1)
- Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values
Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.
Win-Vector LLC’s Nina Zumel takes some time off to publish a literary book review: Reading Red Spectres: Russian Gothic Tales.
Nina Zumel also examines aspects of the supernatural in literature and in folk culture at her blog, multoghost.wordpress.com. She writes about folklore, ghost stories, weird fiction, or anything else that strikes her fancy. Follow her on Twitter @multoghost.
Page 94 of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd Edition (which we will call BDA3) provides a great example of what happens when common broad frequentist bias criticisms are over-applied to predictions from ordinary linear regression: the predictions appear to fall apart. BDA3 goes on to exhibit what might be considered the kind of automatic/mechanical fix responding to such criticisms would entail (producing a bias corrected predictor), and rightly shows these adjusted predictions are far worse than the original ordinary linear regression predictions. BDA3 makes a number of interesting points and is worth studying closely. We work their example in a bit more detail for emphasis. Continue reading
Two of the most common methods of statistical inference are frequentism and Bayesianism (see Bayesian and Frequentist Approaches: Ask the Right Question for some good discussion). In both cases we are attempting to perform reliable inference of unknown quantities from related observations. And in both cases inference is made possible by introducing and reasoning over well-behaved distributions of values.
As a first example, consider the problem of trying to estimate the speed of light from a series of experiments.
In this situation the frequentist method quietly does some heavy philosophical lifting before you even start work. Under the frequentist interpretation since the speed of light is thought to have a single value it does not make sense to model it as having a prior distribution of possible values over any non-trivial range. To get the ability to infer, frequentist philosophy considers the act of measurement repeatable and introduces very subtle concepts such as confidence intervals. The frequentist statement that a series of experiments places the speed of light in vacuum at 300,000,000 meters a second plus or minus 1,000,000 meters a second with 95% confidence does not mean there is a 95% chance that the actual speed of light is in the interval 299,000,000 to 301,000,000 (the common incorrect recollection of what a confidence interval is). It means if the procedure that generated the interval were repeated on new data, then 95% of the time the speed of light would be in the interval produced: which may not be the interval we are looking at right now. Frequentist procedures are typically easy on the practitioner (all of the heavy philosophic work has already been done) and result in simple procedures and calculations (through years of optimization of practice).
Bayesian procedures on the other hand are philosophically much simpler, but require much more from the user (production and acceptance of priors). The Bayesian philosophy is: given a generative model, a complete prior distribution (detailed probabilities of the unknown value posited before looking at the current experimental data) of the quantity to be estimated, and observations: then inference is just a matter of calculating the complete posterior distribution of the quantity to be estimated (by correct application of Bayes’ Law). Supply a bad model or bad prior beliefs on possible values of the speed of light and you get bad results (and it is your fault, not the methodology’s fault). The Bayesian method seems to ask more, but you have to remember it is trying to supply more (complete posterior distribution, versus subjunctive confidence intervals).
In this article we are going to work a simple (but important) problem where (for once) the Bayesian calculations are in fact easier than the frequentist ones. Continue reading