I have been working through (with some honest appreciation) a recent article comparing many classifiers on many data sets: “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, Dinani Amorim; 15(Oct):3133−3181, 2014 (which we will call “the DWN paper” in this note). This paper applies 179 popular classifiers to around 120 data sets (mostly from the UCI Machine Learning Repository). The work looks good and interesting, but we do have one quibble with the data-prep on 8 of the 123 shared data sets. Given the paper is already out (not just in pre-print) I think it is appropriate to comment publicly. Continue reading
Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft
Excel spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using
Excel-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of
Perl script to work around a bug).
But one thing that continues to confound us is how hard it is to read
Excel data correctly. When
Excel exports into
CSV/TSV style formats it uses fairly clever escaping rules about quotes and new-lines. Most
CSV/TSV readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is
Excel itself often transforms data without any user verification or control. For example:
Excel routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable
Excel transform: changing the strings “
TRUE” and “
FALSE” into 1 and 0 inside the actual “
.xlsx” file. That is
Excel does not faithfully store the strings “
TRUE” and “
FALSE” even in its native format. Most
Excel users do not know about this, so they certainly are in no position to warn you about it.
This would be a mere annoyance, except it turns out
Libre Office (or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.
We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug. Continue reading
In most of our data science teaching (including our book Practical Data Science with R) we emphasize the deliberately easy problem of “exchangeable prediction.” We define exchangeable prediction as: given a series of observations with two distinguished classes of variables/observations denoted “x”s (denoting control variables, independent variables, experimental variables, or predictor variables) and “y” (denoting an outcome variable, or dependent variable) then:
- Estimate an approximate functional relation
y ~ f(x).
- Apply that relation to new instances where
xis known and
yis not yet known.
An example of this would be to use measured characteristics of online shoppers to predict if they will purchase in the next month. Data more than a month old gives us a training set where both
y are known. Newer shoppers give us examples where only
x is currently known and it would presumably be of some value to estimate
y or estimate the probability of different
y values. The problem is philosophically “easy” in the sense we are not attempting inference (estimating unknown parameters that are not later exposed to us) and we are not extrapolating (making predictions about situations that are out of the range of our training data). All we are doing is essentially generalizing memorization: if somebody who shares characteristics of recent buyers shows up, predict they are likely to buy. We repeat: we are not forecasting or “predicting the future” as we are not modeling how many high-value prospects will show up, just assigning scores to the prospects that do show up.
The reliability of such a scheme rests on the concept of exchangeability. If the future individuals we are asked to score are exchangeable with those we had access to during model construction then we expect to be able to make useful predictions. How we construct the model (and how to ensure we indeed find a good one) is the core of machine learning. We can bring in any big name machine learning method (deep learning, support vector machines, random forests, decision trees, regression, nearest neighbors, conditional random fields, and so-on) but the legitimacy of the technique pretty much stands on some variation of the idea of exchangeability.
One effect antithetical to exchangeability is “concept drift.” Concept drift is when the meanings and distributions of variables or relations between variables changes over time. Concept drift is a killer: if the relations available to you during training are thought not to hold during later application then you should not expect to build a useful model. This one of the hard lessons that statistics tries so hard to quantify and teach.
We know that you should always prefer fixing your experimental design over trying a mechanical correction (which can go wrong). And there are no doubt “name brand” procedures for dealing with concept drift. However, data science and machine learning practitioners are at heart tinkerers. We ask: can we (to a limited extent) attempt to directly correct for concept drift? This article demonstrates a simple correction applied to a deliberately simple artificial example.
Image: Wikipedia: Elgin watchmaker
Continuing our series of reading out loud from a single page of a statistics book we look at page 224 of the 1972 Dover edition of Leonard J. Savage’s “The Foundations of Statistics.” On this page we are treated to an example attributed to Leo A. Goodman in 1953 that illustrates how for normally distributed data the maximum likelihood, unbiased, and minimum variance estimators of variance are in fact typically three different values. So in the spirit of gamesmanship you always have at least two reasons to call anybody else’s estimator incorrect. Continue reading
Win-Vector LLC‘s John Mount will be speaking at Strata + Hadoop World 2014 this month. Please attend my panel on data inventories (a key driver of data science project success) and attend my “Practical Data Science with R” book office hour (get your book signed!). Thank you both O’Reilly Media, Inc. and Waterline Data Science for making this possible.
Current schedule/location details after the click. Continue reading
As we’ve mentioned on previous occasions, one of the defining characteristics of data science is the emphasis on the availability of “large” data sets, which we define as “enough data that statistical efficiency is not a concern” (note that a “large” data set need not be “big data,” however you choose to define it). In particular, we advocate the use of hold-out data to evaluate the performance of models.
There is one caveat: if you are evaluating a series of models to pick the best (and you usually are), then a single hold-out set is strictly speaking not enough. Hastie, et.al, say it best:
Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.
The ideal way to select a model from a set of candidates (or set parameters for a model, for example the regularization constant) is to use a training set to train the model(s), a calibration set to select the model or choose parameters, and a test set to estimate the generalization error of the final model.
In many situations, breaking your data into three sets may not be practical: you may not have very much data, or the the phenomena you’re interested in are rare enough that you need a lot of data to detect them. In those cases, you will need more statistically efficient estimates for generalization error or goodness-of-fit. In this article, we look at the PRESS statistic, and how to use it to estimate generalization error and choose between models.
The primary user-facing data types in the R statistical computing environment behave as vectors. That is: one dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so-on) but the most common data types are vectors. In fact vectors are so common in R that scalar values such as the number
5 are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not a R vector. In fact “factor” is not a first-class citizen in R, which can lead to some ugly bugs.
For example, consider the following R code.
levels <- c('a','b','c') f <- factor(c('c','a','a',NA,'b','a'),levels=levels) print(f) ##  c a a <NA> b a ## Levels: a b c print(class(f)) ##  "factor"
This example encoding a series of 6 observations into a known set of factor-levels (
'c'). As is the case with real data some of the positions might be missing/invalid values such as
NA. One of the strengths of R is we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level
'a' was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.
fRevised <- ifelse(is.na(f),'a',f) print(fRevised) ##  "3" "1" "1" "a" "2" "1" print(class(fRevised)) ##  "character"
Notice the new column
fRevised is an absolute mess (and not even of class/type factor). This sort of fix would have worked if
f had been a vector of characters or even a vector of integers, but for factors we get gibberish.
We are going to work through some more examples of this problem. Continue reading
From time to time we are asked “what is the company name Win-Vector LLC referring to?” It is a cryptic pun trying to be an encoding of “we deliver victory.”
The story is an inside joke referring to something really only funny to one of the founders. But a joke that amuses the teller is always enjoyed by at least one person. Win-Vector LLC’s John Mount had the honor of co-authoring a 1997 paper titled “The Polytope of Win Vectors.” The paper title is obviously mathematical terms in an odd combination. However the telegraphic grammar is coincidentally similar to deliberately ungrammatical gamer slang such as “full of win” and “so much win.”
If we treat “win” as a concrete noun (say something you can put in a sack) and “vector” in its non-mathematical sense (as an entity of infectious transmission) we have “Win-Vector LLC is an infectious delivery of victory.” I.e.: we deliver success to our clients. Of course, we have now attempt to explain a weak joke. It is not as grand as “winged victory,” but it does encode a positive company value: Win-Vector LLC delivers successful data science projects and training to clients.
Winged Victory: from Wikipedia
Let’s take this as an opportunity to describe what a win vector is. Continue reading
What is the Gauss-Markov theorem?
From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:
A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.
This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.
Suppose you are building a model of the form:
y(i) = B . x(i) + e(i)
B is a vector (to be inferred),
i is an index that runs over the available data (say
x(i) is a per-example vector of features, and
y(i) is the scalar quantity to be modeled. Only
y(i) are observed. The
e(i) term is the un-modeled component of
y(i) and you typically hope that the
e(i) can be thought of unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak/strong assumptions you put on the
e(i) (and other quantities) depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of
B under weak assumptions.
How to interpret the theorem
The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the
e(i) and without distributional assumptions about the
x(i). However, if you are using Bayesian methods or generative models for predictions you may want to use additional stronger conditions (perhaps even normality of errors and even distributional assumptions on the
We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.