Win-Vector LLC’s Nina Zumel has a great new article on the issue of taste in design and problem solving: Design, Problem Solving, and Good Taste. I think it is a big issue: how can you expect good work if you can’t even discuss how to tell good from bad?
Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft
Excel spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using
Excel-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of
Perl script to work around a bug).
But one thing that continues to confound us is how hard it is to read
Excel data correctly. When
Excel exports into
CSV/TSV style formats it uses fairly clever escaping rules about quotes and new-lines. Most
CSV/TSV readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is
Excel itself often transforms data without any user verification or control. For example:
Excel routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable
Excel transform: changing the strings “
TRUE” and “
FALSE” into 1 and 0 inside the actual “
.xlsx” file. That is
Excel does not faithfully store the strings “
TRUE” and “
FALSE” even in its native format. Most
Excel users do not know about this, so they certainly are in no position to warn you about it.
This would be a mere annoyance, except it turns out
Libre Office (or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.
We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug. Continue reading Excel spreadsheets are hard to get right
In most of our data science teaching (including our book Practical Data Science with R) we emphasize the deliberately easy problem of “exchangeable prediction.” We define exchangeable prediction as: given a series of observations with two distinguished classes of variables/observations denoted “x”s (denoting control variables, independent variables, experimental variables, or predictor variables) and “y” (denoting an outcome variable, or dependent variable) then:
- Estimate an approximate functional relation
y ~ f(x).
- Apply that relation to new instances where
xis known and
yis not yet known.
An example of this would be to use measured characteristics of online shoppers to predict if they will purchase in the next month. Data more than a month old gives us a training set where both
y are known. Newer shoppers give us examples where only
x is currently known and it would presumably be of some value to estimate
y or estimate the probability of different
y values. The problem is philosophically “easy” in the sense we are not attempting inference (estimating unknown parameters that are not later exposed to us) and we are not extrapolating (making predictions about situations that are out of the range of our training data). All we are doing is essentially generalizing memorization: if somebody who shares characteristics of recent buyers shows up, predict they are likely to buy. We repeat: we are not forecasting or “predicting the future” as we are not modeling how many high-value prospects will show up, just assigning scores to the prospects that do show up.
The reliability of such a scheme rests on the concept of exchangeability. If the future individuals we are asked to score are exchangeable with those we had access to during model construction then we expect to be able to make useful predictions. How we construct the model (and how to ensure we indeed find a good one) is the core of machine learning. We can bring in any big name machine learning method (deep learning, support vector machines, random forests, decision trees, regression, nearest neighbors, conditional random fields, and so-on) but the legitimacy of the technique pretty much stands on some variation of the idea of exchangeability.
One effect antithetical to exchangeability is “concept drift.” Concept drift is when the meanings and distributions of variables or relations between variables changes over time. Concept drift is a killer: if the relations available to you during training are thought not to hold during later application then you should not expect to build a useful model. This one of the hard lessons that statistics tries so hard to quantify and teach.
We know that you should always prefer fixing your experimental design over trying a mechanical correction (which can go wrong). And there are no doubt “name brand” procedures for dealing with concept drift. However, data science and machine learning practitioners are at heart tinkerers. We ask: can we (to a limited extent) attempt to directly correct for concept drift? This article demonstrates a simple correction applied to a deliberately simple artificial example.
Image: Wikipedia: Elgin watchmaker