As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado, et.al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI Machine Learning Repository. For fun, we decided to do a follow-up study, using their data and several classifier implementations from `scikit-learn`

, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing if there is a “geometry” of classifiers: which classifiers produce predictions patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.

# Category Archives: data science

# A comment on preparing data for classifiers

I have been working through (with some honest appreciation) a recent article comparing many classifiers on many data sets: “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, Dinani Amorim; 15(Oct):3133−3181, 2014 (which we will call “the DWN paper” in this note). This paper applies 179 popular classifiers to around 120 data sets (mostly from the UCI Machine Learning Repository). The work looks good and interesting, but we do have one quibble with the data-prep on 8 of the 123 shared data sets. Given the paper is already out (not just in pre-print) I think it is appropriate to comment publicly. Continue reading A comment on preparing data for classifiers

# Excel spreadsheets are hard to get right

Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft `Excel`

spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using `Excel`

-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of `gdata`

‘s underlying `Perl`

script to work around a bug).

But one thing that continues to confound us is how hard it is to read `Excel`

data correctly. When `Excel`

exports into `CSV/TSV`

style formats it uses fairly clever escaping rules about quotes and new-lines. Most `CSV/TSV`

readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is `Excel`

itself often transforms data without any user verification or control. For example: `Excel`

routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable `Excel`

transform: changing the strings “`TRUE`

” and “`FALSE`

” into 1 and 0 inside the actual “`.xlsx`

” file. That is `Excel`

does not faithfully store the strings “`TRUE`

” and “`FALSE`

” even in its native format. Most `Excel`

users do not know about this, so they certainly are in no position to warn you about it.

This would be a mere annoyance, except it turns out `Libre Office`

(or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.

We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug. Continue reading Excel spreadsheets are hard to get right

# Can we try to make an adjustment?

In most of our data science teaching (including our book Practical Data Science with R) we emphasize the deliberately easy problem of “exchangeable prediction.” We define exchangeable prediction as: given a series of observations with two distinguished classes of variables/observations denoted “x”s (denoting control variables, independent variables, experimental variables, or predictor variables) and “y” (denoting an outcome variable, or dependent variable) then:

- Estimate an approximate functional relation
`y ~ f(x)`

. - Apply that relation to new instances where
`x`

is known and`y`

is not yet known.

An example of this would be to use measured characteristics of online shoppers to predict if they will purchase in the next month. Data more than a month old gives us a training set where both `x`

and `y`

are known. Newer shoppers give us examples where only `x`

is currently known and it would presumably be of some value to estimate `y`

or estimate the probability of different `y`

values. The problem is philosophically “easy” in the sense we are not attempting inference (estimating unknown parameters that are not later exposed to us) and we are not extrapolating (making predictions about situations that are out of the range of our training data). All we are doing is essentially generalizing memorization: if somebody who shares characteristics of recent buyers shows up, predict they are likely to buy. We repeat: we are *not* forecasting or “predicting the future” as we are not modeling how many high-value prospects will show up, just assigning scores to the prospects that do show up.

The reliability of such a scheme rests on the concept of exchangeability. If the future individuals we are asked to score are exchangeable with those we had access to during model construction then we expect to be able to make useful predictions. How we construct the model (and how to ensure we indeed find a good one) is the core of machine learning. We can bring in any big name machine learning method (deep learning, support vector machines, random forests, decision trees, regression, nearest neighbors, conditional random fields, and so-on) but the legitimacy of the technique pretty much stands on some variation of the idea of exchangeability.

One effect antithetical to exchangeability is “concept drift.” Concept drift is when the meanings and distributions of variables or relations between variables changes over time. Concept drift is a killer: if the relations available to you during training are thought not to hold during later application then you should not expect to build a useful model. This one of the hard lessons that statistics tries so hard to quantify and teach.

We know that you should always prefer fixing your experimental design over trying a mechanical correction (which can go wrong). And there are no doubt “name brand” procedures for dealing with concept drift. However, data science and machine learning practitioners are at heart tinkerers. We ask: can we (to a limited extent) attempt to directly correct for concept drift? This article demonstrates a simple correction applied to a deliberately simple artificial example.

Image: Wikipedia: Elgin watchmaker

# Win-Vector LLC’s John Mount at Strata + Hadoop World October 2014

Win-Vector LLC‘s John Mount will be speaking at Strata + Hadoop World 2014 this month. Please attend my panel on data inventories (a key driver of data science project success) and attend my “Practical Data Science with R” book office hour (get your book signed!). Thank you both O’Reilly Media, Inc. and Waterline Data Science for making this possible.

Current schedule/location details after the click. Continue reading Win-Vector LLC’s John Mount at Strata + Hadoop World October 2014

# Reading the Gauss-Markov theorem

**What is the Gauss-Markov theorem?**

From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:

A theorem that proves that if the error terms in a

multiple regressionhave the same variance and are uncorrelated, then the estimators of the parameters in the model produced byleast squares estimationare better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.

This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.

Suppose you are building a model of the form:

```
``` y(i) = B . x(i) + e(i)

where `B`

is a vector (to be inferred), `i`

is an index that runs over the available data (say `1`

through `n`

), `x(i)`

is a per-example vector of features, and `y(i)`

is the scalar quantity to be modeled. Only `x(i)`

and `y(i)`

are observed. The `e(i)`

term is the un-modeled component of `y(i)`

and you typically hope that the `e(i)`

can be thought of unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak/strong assumptions you put on the `e(i)`

(and other quantities) depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of `B`

under weak assumptions.

**How to interpret the theorem**

The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the `e(i)`

and without distributional assumptions about the `x(i)`

. However, if you are using Bayesian methods or generative models for predictions you *may want* to use additional stronger conditions (perhaps even normality of errors and *even* distributional assumptions on the `x`

s).

We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.

# Vtreat: designing a package for variable treatment

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:

- Missing values (
`NA`

or blanks) - Problematic numerical values (
`Inf`

,`NaN`

, sentinel values like 999999999 or -1) - Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.

Continue reading Vtreat: designing a package for variable treatment

# Skimming statistics papers for the ideas (instead of the complete procedures)

Been reading a lot of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd edition lately. Overall in the Bayesian framework some ideas (such as regularization, and imputation) are way easier to justify (though calculating some seemingly basic quantities becomes tedious). A big advantage (and weakness) of this formulation is statistics has a much less “shrink wrapped” feeling than the classic frequentist presentations. You feel like the material is being written to peers instead of written to calculators (of the human or mechanical variety). In the Bayesian formulation you don’t feel like you will be yelled at for using 1 tablespoon of sugar when the recipe calls for 3 teaspoons (at least if you live in the United States).

Some other stuff reads differently after this though. Continue reading Skimming statistics papers for the ideas (instead of the complete procedures)

# Trimming the Fat from glm() Models in R

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, `glm`

models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.

As many R users know (but often forget), a `glm`

model object carries a copy of its training data by default. You can use the settings `y=FALSE`

and `model=FALSE`

to turn this off.

set.seed(2325235) # Set up a synthetic classification problem of a given size # and two variables: one numeric, one categorical # (two levels). synthFrame = function(nrows) { d = data.frame(xN=rnorm(nrows), xC=sample(c('a','b'),size=nrows,replace=TRUE)) d$y = (d$xN + ifelse(d$xC=='a',0.2,-0.2) + rnorm(nrows))>0.5 d } # first show that model=F and y=F help reduce model size dTrain = synthFrame(1000) model1 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit')) model2 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'), y=FALSE) model3 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'), y=FALSE, model=FALSE) # # Estimate the object's size as the size of its serialization # length(serialize(model1, NULL)) # [1] 225251 length(serialize(model2, NULL)) # [1] 206341 length(serialize(model3, NULL)) # [1] 189562 dTest = synthFrame(100) p1 = predict(model1, newdata=dTest, type='response') p2 = predict(model2, newdata=dTest, type='response') p3 = predict(model3, newdata=dTest, type='response') sum(abs(p1-p2)) # [1] 0 sum(abs(p1-p3)) # [1] 0

# A clear picture of power and significance in A/B tests

A/B tests are one of the simplest reliable experimental designs.

Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.

“Practical guide to controlled experiments on the web: listen to your customers not to the HIPPO” Ron Kohavi, Randal M Henne, and Dan Sommerfield, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007 pp. 959-967.

The ideas is to test a variation (called “treatment” or “B”) in parallel with continuing to test a baseline (called “control” or “A”) to see if the variation drives a desired effect (increase in revenue, cure of disease, and so on). By running both tests at the same time it is hoped that any confounding or omitted factors are nearly evenly distributed between the two groups and therefore not spoiling results. This is a much safer system of testing than retrospective studies (where we look for features from data already collected).

Interestingly enough the multi-armed bandit alternative to A/B testing (a procedure that introduces online control) is one of the simplest non-trivial Markov decision processes. However, we will limit ourselves to traditional A/B testing for the remainder of this note. Continue reading A clear picture of power and significance in A/B tests