`Excel` spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using `Excel`-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of `gdata`’s underlying `Perl` script to work around a bug).
But one thing that continues to confound us is how hard it is to read `Excel` data correctly. When `Excel` exports into `CSV/TSV` style formats it uses fairly clever escaping rules about quotes and new-lines. Most `CSV/TSV` readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is `Excel` itself often transforms data without any user verification or control. For example: `Excel` routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable `Excel` transform: changing the strings “`TRUE`” and “`FALSE`” into 1 and 0 inside the actual “`.xlsx`” file. That is, `Excel` does not faithfully store the strings “`TRUE`” and “`FALSE`” even in its native format. Most `Excel` users do not know about this, so they certainly are in no position to warn you about it.
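As a small illustration of the quoting rules (in Python, whose standard `csv` module does implement them correctly), here is a round trip of a row whose fields contain a quote, a separator, and an embedded new-line:

```python
import csv
import io

# A row whose fields contain a quote, a comma, and an embedded
# new-line -- exactly the cases naive split-on-comma readers get wrong.
row = ['he said "hi"', 'a,b', 'line1\nline2']

buf = io.StringIO()
csv.writer(buf).writerow(row)
encoded = buf.getvalue()  # fields get quoted, inner quotes doubled

# A conforming reader recovers the original fields exactly.
decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded == row)  # True
```

A reader that merely splits on commas or new-lines would mangle all three fields.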

This would be a mere annoyance, except it turns out `LibreOffice` (or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.

We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug.

Our example `Excel` spreadsheet was produced using Microsoft `Excel` 2011 for OSX. We started a new sheet and typed in a few cells by hand. We formatted the header and the numeric column, but did not move off default settings for any of the `TRUE/FALSE` cells. The spreadsheet looks like the following:

Original `Excel` spreadsheet (TRUE/FALSE typed in as text, no formatting commands on those cells). You can also download the spreadsheet here.

On `OSX`, Apple `Numbers` can read the sheet correctly. We demonstrate this below.

Sheet looks okay in Apple Numbers.

However, `LibreOffice` doesn’t reverse the encoding (as it may not know some details of `Excel`’s encoding practices) *and* also shows corrupted data as we see below.

`TRUE/FALSE` represented as `1/0` in `LibreOffice`, and third row damaged. In practice we have seen the data damage is pervasive and not limited to columns whose original value was `FALSE`. It may be a presentation problem, as examining individual cells shows “`=TRUE()`” and “`=FALSE()`” as the contents of the affected cells (and apparently in the correct positions independent of what is being displayed).

Apple `Preview` and `Quick Look` both also fail to understand the `Excel` data encoding, as we show below.

Sheet damaged in Apple Preview (same for Apple Quick Look).

Our favorite analysis hammer (R) appears to read the data correctly (with only the undesired translation of `TRUE/FALSE` to `1/0`):

R appears to load what was stored correctly.

But what is going on? It turns out `Excel` `.xlsx` files are actually `zip` archives storing a directory tree of `xml` artifacts. By changing the file extension from `.xlsx` to `.zip` we can treat the spreadsheet as a `zip` archive and inflate it to see the underlying files. The inflated file tree is shown below.
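(In Python one can skip the renaming and open the workbook directly as a zip archive. A minimal sketch, using a tiny in-memory stand-in archive with the same layout since we can’t ship the actual workbook here:)

```python
import io
import zipfile

# Stand-in for a real .xlsx: a zip archive with the same member layout.
# With an actual workbook you would pass its file path to ZipFile instead.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    z.writestr("xl/sharedStrings.xml", "<sst/>")

# Because .xlsx files are zip archives, zipfile can list and read members.
with zipfile.ZipFile(buf) as z:
    names = z.namelist()
    sheet_xml = z.read("xl/worksheets/sheet1.xml").decode("utf-8")

print(names)
print(sheet_xml)
```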

The file tree representing the `Excel` workbook on disk. Of particular interest are the files `xl/worksheets/sheet1.xml` and `xl/sharedStrings.xml`. `sheet1.xml` contains the worksheet data and `sharedStrings.xml` is a shared string table containing all strings used in the worksheet (the worksheet stores no user supplied strings, only indexes into the shared string table). Let’s look into `sheet1.xml`:

The XML representing the sheet data.

The sheet data is arranged into rows that contain columns. It is easy to match these rows and cells to our original spreadsheet. For cells containing uninterpreted strings the `<c>` tag has an attribute set to `t="s"` (probably denoting the type is “string” and to use the `<v>` value as a string index). Notice floating point numbers are not treated as shared strings, but stored directly in the `<v>` tag. Further notice that the last three columns are stored as `0/1` and have the attribute `t="b"` set. My guess is this is declaring the type is “boolean”, which then must have the convention that `1` represents `TRUE` and `0` represents `FALSE`.

This doesn’t seem that complicated, but clearly of all the “`Excel` compatible” tools we tried only Apple `Numbers` knew all of the details of this encoding (and was able to reverse it). Other than `Numbers`, only `R`’s `gdata` package was able to extract usable data (and even it only recovered the encoded version of the field, not the original user value).

And these are our issues with working with data that has passed through `Excel`:

- `Excel` has a lot of non-controllable data transforms, including booleans and dates (in fact `Excel` mangles string fragments it even suspects could be made into dates). Some of these transforms are non-faithful or not reversible.
- Very few tools that claim to interoperate with `Excel` actually get the corner cases right. Even for simple well-documented data types like `Excel` `CSV` export. And definitely not for the native `.xlsx` format.

These transforms and conventions make exporting data harder (and riskier) than it has to be. To add insult to injury you often run into projects that are sharing `Excel` `.xlsx` spreadsheets where neither the reader nor the writer is `Excel`, so neither end is even good at working with the format. Because working with data that has passed through `Excel` is hard to get right, data that has passed through `Excel` is often wrong.

(Note: I definitely feel we do need to be thankful to open source and free software developers. These teams, in addition to generously supplying software without charge, are also working to preserve user freedoms and are often the only way to read older data. However, when we are using software for work we do need it to work correctly and be faithful to data. This problem is small *when you detect it*, but large if hidden in a larger project.)

- Estimate an approximate functional relation `y ~ f(x)`.
- Apply that relation to new instances where `x` is known and `y` is not yet known.

An example of this would be to use measured characteristics of online shoppers to predict if they will purchase in the next month. Data more than a month old gives us a training set where both `x` and `y` are known. Newer shoppers give us examples where only `x` is currently known and it would presumably be of some value to estimate `y` or estimate the probability of different `y` values. The problem is philosophically “easy” in the sense we are not attempting inference (estimating unknown parameters that are not later exposed to us) and we are not extrapolating (making predictions about situations that are out of the range of our training data). All we are doing is essentially generalizing memorization: if somebody who shares characteristics of recent buyers shows up, predict they are likely to buy. We repeat: we are *not* forecasting or “predicting the future” as we are not modeling how many high-value prospects will show up, just assigning scores to the prospects that do show up.

The reliability of such a scheme rests on the concept of exchangeability. If the future individuals we are asked to score are exchangeable with those we had access to during model construction then we expect to be able to make useful predictions. How we construct the model (and how to ensure we indeed find a good one) is the core of machine learning. We can bring in any big name machine learning method (deep learning, support vector machines, random forests, decision trees, regression, nearest neighbors, conditional random fields, and so-on) but the legitimacy of the technique pretty much stands on some variation of the idea of exchangeability.

One effect antithetical to exchangeability is “concept drift.” Concept drift is when the meanings and distributions of variables, or the relations between variables, change over time. Concept drift is a killer: if the relations available to you during training are thought not to hold during later application then you should not expect to build a useful model. This is one of the hard lessons that statistics tries so hard to quantify and teach.

We know that you should always prefer fixing your experimental design over trying a mechanical correction (which can go wrong). And there are no doubt “name brand” procedures for dealing with concept drift. However, data science and machine learning practitioners are at heart tinkerers. We ask: can we (to a limited extent) attempt to directly correct for concept drift? This article demonstrates a simple correction applied to a deliberately simple artificial example.

Image: Wikipedia: Elgin watchmaker

For this project we are getting into the realm of transductive inference. Traditionally we build a model based only on an initial fixed set of training data and then score each later application datum independently. In this write-up we will assume we have access to the later data we need to score during model construction (or at least the control variables or “x”s) and can use statistics about the data we are actually going to be asked to score to influence how we convert our training data (data for which both “x”s and “y” are known) into a model and predictions or scores.

Let’s describe our simple artificial problem. Suppose we have access to a number of instances of training data. These are ordered pairs of observations `(x_i, y_i) (i = 1 ... a)` where the `x_i` are vectors in `R^n` and the `y_i` are real numbers. A typical regression task is to find a `g` in `R^n` such that `g.x_i` is a good estimate of `y_i`. Now further assume the following generative model. Unobserved vectors `z_i` in `R^n` are generated according to some unknown distribution, and it is the case that `y_i = b.z_i + e_i` (for some unobserved `b` in `R^n`, and noise-term `e_i`) and our observed `x_i` are generated as `L1 z_i + s_i` (where `L1` is an unobserved linear transform and `s_i` is a vector noise term).

Graphically we can represent our problem as follows (we are using “`u ~ v`” to informally denote “`u` is distributed mean `v` plus iid noise/error”).
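A quick simulation of this generative model (a sketch in Python/numpy; the dimensions, seed, and noise levels are arbitrary choices for illustration, not from the original worked example):

```python
import numpy as np

rng = np.random.default_rng(7)
n, a = 5, 400

# Unobserved generative pieces (arbitrary for this sketch).
b = rng.normal(size=n)          # unobserved coefficient vector
L1 = rng.normal(size=(n, n))    # unobserved linear transform

z = rng.normal(size=(a, n))                    # unobserved z_i
y = z @ b + 0.1 * rng.normal(size=a)           # y_i = b.z_i + e_i
x = z @ L1.T + 0.1 * rng.normal(size=(a, n))   # x_i = L1 z_i + s_i

# Fit g so that g.x_i approximates y_i (ignoring the
# errors-in-variables issue, as the text does).
g, *_ = np.linalg.lstsq(x, y, rcond=None)
rmse = np.sqrt(np.mean((x @ g - y) ** 2))
print(rmse)  # small relative to the spread of y
```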

And we can estimate `g` without worrying over-much about details like `L1`. However, the fact we are not directly observing an un-noised `z_i` means we do not meet the standard conditions of simple least squares regression and are already in a more complicated errors-in-variables situation (which we will ignore). The additional difficulty we actually want to concentrate on is a form of concept drift. Suppose after the training period, when the time comes to apply the model, we no longer observe `x_i ~ L1 z_i`, but instead observe `q_i ~ L2 z_i` (where `L2` is a new unobserved linear operator, and `i = a+1 ... a+b`). In this case our fit estimate `g` may no longer supply the best possible predictions. We may want to use an adjusted linear model. We would like to adjust by `L1 L2^{-1}`, but we don’t directly observe `L1`, `L2`, or `L1 L2^{-1}`. The situation during application time (when we are trying to predict new unobserved `y_i` from `q_i`) is illustrated below.

This situation may seem a bit contrived, but it is actually fairly familiar in the world of engineering (relevant topics being system identification and techniques like the Kalman filter).

There are some standard statistical practices that could help in this situation. One would be to re-scale the observed `x_i` during training (either through principal components methods, or by running individual variables through a CDF). We are not huge fans of “x-alone” scaling and feel more for partial least squares or inverse regression ideas. Since we are assuming during the application phase the `y_i`s are not yet observable (say we have to make a block of predictions before we have a chance to observe any new `y_i`s) we will have to try to find an x-alone scaling solution. We want to try and estimate `L1 L2^{-1}` from the observed inertial-ellipsoids/covariance-matrices as illustrated below.

The issue is we are trying to find a change of basis without any so-called “registration marks.” We can try and estimate `E1 = L1 M` (where `M M^{T}` is the covariance matrix of the unobserved `z_i`) and `E2 = L2 M` from our data. So we could try to estimate `L1 L2^{-1}` as `E1 E2^{-1}`. But the problem is (in addition to having to use one of our estimates in a denominator, always a bad situation) without registration marks our frame of reference estimates `E1` and `E2` are only determined up to an orthonormal transformation. So we actually want to pick an estimate `L1 L2^{-1} ~ E1 W E2^{-1}` where `W` is an arbitrary orthogonal matrix (or orthonormal linear transformation). In our case we want to pick `W` so that `E1 W E2^{-1}` is near the identity. The principle being: don’t move anything without strong evidence a move is needed.

We don’t have simple code to pick an orthogonal `W` with `E1 W E2^{-1}` nearest the identity, though we could obviously give this to a general optimizer. We strongly agree with the principle that machine learning researchers should usually limit themselves to writing down the conditions of optimality and not cripple methods by over-specifying an (often inferior) optimizer. This point is made in “The Interplay of Optimization and Machine Learning Research,” Kristin P. Bennett and Emilio Parrado-Hernandez, Journal of Machine Learning Research, 2006, vol. 7, pp. 1265-1281, and in “The Elements of Statistical Learning,” Second Edition, Trevor Hastie, Robert Tibshirani, Jerome Friedman. But let’s ignore that and change the problem to one we happen to know the solution to.

There is a tempting and elegant solution that can pick an orthogonal `W` such that `W` is as near as possible to `E1^{-1} E2`. So we are asking for an orthogonal matrix near `E1^{-1} E2` instead of one that minimizes residual error. This form of problem is known as the orthogonal Procrustes problem, and we show how to solve it using singular value decomposition in the following worked iPython example. The gist is: we form the singular value decomposition of `E1^{-1} E2 = U D V^{T}` (`U`, `V` orthogonal matrices, `D` a non-negative diagonal matrix) and it turns out `W = U V^{T}` is the desired orthogonal estimate. So our estimate of `L1 L2^{-1}` should then be `E1 U V^{T} E2^{-1}`.

In our worked example the adjusted model has half the root mean square error of using the un-adjusted model.

The scatter plot of predicted versus actual using the un-adjusted (or `g`) model is as follows.

And the scatter plot from the better adjusted estimate is given below.

This is not as good as re-fitting after the concept change, but it is better than nothing. I am not sure I would use this adjustment in practice, but the derivation of the estimate is fun.

Obviously these hierarchical models I diagrammed are very much easier to interpret in a principled manner in a Bayesian setting (due to the need to integrate out the unobserved `z_i`). But, frankly, I don’t have enough experience with Stan to know how to efficiently specify such a beast (with data) for standard inference.

Let us work the example.

Consider `n` identically distributed independent normal random variables `x_1`, …, `x_n`. A common naive estimate of the unknown common mean `U` and variance `V` of the generating distribution is given as follows:

```
u = sum_{i=1...n} x_i / n
v = sum_{i=1...n} (x_i - u)^2 / n
```

That is: we are calculating simple estimates `u,v` that we hope will be close to the unknown true population values `U,V`. Unfortunately if you show this estimate to a statistical audience you will likely be open to ridicule. The problem is that the preferred estimate is not what we just wrote, but in fact:

```
u  = sum_{i=1...n} x_i / n
v' = sum_{i=1...n} (x_i - u)^2 / (n-1)
```

The proffered argument will be that the estimate `v` is biased (indeed, an undesirable property) and the estimate `v'` is unbiased. If one wants to be rude one can take pleasure in accusing the author (me) of not knowing the difference between sample variance and population variance.
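The bias claim is easy to see empirically. A quick simulation (a Python/numpy illustration; the true mean and variance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 5, 200000
# normal draws with true mean U = 3 and true variance V = 4
x = rng.normal(loc=3.0, scale=2.0, size=(trials, n))

u = x.mean(axis=1, keepdims=True)
q = ((x - u) ** 2).sum(axis=1)

v = q / n             # maximum likelihood estimate of V
vprime = q / (n - 1)  # unbiased estimate of V

# Averaged over many samples: E[v] = (n-1)/n * V = 3.2, E[v'] = V = 4.
print(v.mean(), vprime.mean())
```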

In my opinion the actual reason for disagreement is: statistics, at least when taught out of major, is largely taught as a prescriptive practice; you follow the exact specified procedure or you are wholly wrong.

Let us take the time to reason about our naive estimate a bit more. We have indeed made a mistake in using it. The mistake is we didn’t state the intended goal of the estimator. That is sloppy thinking; we should always have some goal in mind (right or wrong) and not blindly execute procedures. If the goal is an unbiased estimator we have indeed picked the wrong estimator. But suppose we had been more careful and said we wanted a maximum likelihood estimator. `u,v` is in fact maximum likelihood and `u,v'` is not. Unbiasedness is not the only possible performance criterion, and it is often incompatible with other estimation goals (see here, and here for more examples).

The usual derivation that `u,v'` is unbiased involves observing that if we define:

```
Q := (sum_{i=1...n} x_i^2) -
(1/n) (sum_{i=1...n} x_i)^2
```

A bit of algebra that is very familiar to statisticians shows that our earlier maximum likelihood estimate `v` is in fact equal to `Q/n`. We also note we can derive (using our knowledge of the non-central moments `U,V`) that `E[Q|U,V] = (n-1) V` (and *not* `n V`). And a small amount of algebra then gives you the unbiased estimate `u,v'`.
.

This seems superior and fine, until you notice the following. A glob of messy algebra gives you `E[Q^2|U,V] = (n^2 - 1) V^2` (claimed in Savage; the derivation will need the stated additional distributional assumption that the data are normal to ensure facts we need about the first four moments of the observed data hold). But this is enough to show that `Q/(n+1)` is lower variance than the maximum likelihood estimate `v = Q/n` and also lower variance than the unbiased estimate `v' = Q/(n-1)`. So if we had stated our goal was a more statistically efficient estimate of the unknown variance (or lower variance in our estimate of variance) we might have preferred an estimate of the form:

```
u = sum_{i=1...n} x_i / n
v'' = sum_{i=1...n} (x_i - u)^2 / (n+1)
```
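A simulation comparing the mean squared error of the three denominators (Python/numpy, an illustration rather than a proof; sample size and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, V, trials = 5, 1.0, 200000
x = rng.normal(scale=np.sqrt(V), size=(trials, n))

# Q = sum of squared deviations from the sample mean.
q = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# Mean squared error of a*Q against the true variance V, for the
# three candidate denominators n-1 (unbiased), n (MLE), n+1.
mse = {k: ((q / k - V) ** 2).mean() for k in (n - 1, n, n + 1)}
print(mse)  # the n+1 denominator gives the smallest error
```

This matches the algebra: for `n = 5` the expected values are `0.5 V^2`, `0.36 V^2`, and `1/3 V^2` respectively.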

What is going on is that the empirically observed variance is a different beast than the empirically observed mean, even for normal variates. For one, the empirically observed variance is a non-negative random variable (so it itself is certainly not normal). And unlike the empirical mean, we don’t get the maximum likelihood, zero bias, and minimal variance estimates all co-occurring.

The math isn’t too bad. From Savage:

```
E[ (a Q - V)^2 | U, V]
= (a^2 (n^2 - 1) - 2 a (n - 1) + 1) V^2
= ((a - 1/(n+1))^2 (n^2 - 1) + 2/(n+1)) V^2
>= 2 V^2 / (n+1)
```

And this bound is tight at `a = 1/(n+1)`. Note that the algebra is only valid when `n>1`, but `Q=0` when `n=1` or `V=0`, which means `a*Q=0` for all `a`. So we will assume `n>1` and `V>0`. Thus: when `n>1` and `V>0` we have `v'' = Q/(n+1)` is the unique least variance estimate of the form `a*Q` where `a` is a constant (not depending on `n`, `U`, `V`, or the `x_i`).

Frankly, we have never seen an estimate of the form `v'' = Q/(n+1)` in use. It is unlikely the additional distributional assumptions are worth the promised reduction in estimation variance. But the point is: we have exhibited three different “optimal” estimates for the variance, so it is a bit harder to claim one is always obviously preferred (especially without context).

Or (following the math with an attempt at interpretation): estimating the variance of even a population of normal variates is a common example of where there are lower variance estimators than the standard unbiased choices (without getting into the complications of Stein’s example, James-Stein estimators, or Hodges–Le Cam estimators). In fact it is such a common example it is often ignored.

Or (without the math): as long as our estimators are what statisticians call *consistent* and `n` is large (which is one of the great advantages of big data) we really can afford to be civil about the differences between these estimates.

Current schedule/location details after the click.

Hadoop Effortlessly: A Data Inventory is Key to Data Self-service 10/16/2014 1:45pm - 2:25pm EDT (40 minutes) Room: 1 E05 http://en.oreilly.com/stratany2014/public/schedule/detail/37956

Office Hour with John Mount (Win Vector LLC) 10/16/2014 2:35pm - 3:15pm EDT (40 minutes) Room: Table C http://en.oreilly.com/stratany2014/public/schedule/detail/37989

Also, look for us and “Practical Data Science with R” at Waterline Data Science’s Strata booth (booth 553).

Javits Center 655 W 34th Street New York, NY 10001

For more updates (events, book discounts), follow us on Twitter: @WinVectorLLC.

There is one caveat: if you are evaluating a series of models to pick the best (and you usually are), then a single hold-out set is strictly speaking not enough. Hastie, et al., say it best:

Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.

– Hastie, Tibshirani and Friedman, *The Elements of Statistical Learning*, 2nd edition.

The ideal way to select a model from a set of candidates (or set parameters for a model, for example the regularization constant) is to use a training set to train the model(s), a calibration set to select the model or choose parameters, and a test set to estimate the generalization error of the final model.

In many situations, breaking your data into three sets may not be practical: you may not have very much data, or the phenomena you’re interested in are rare enough that you need a lot of data to detect them. In those cases, you will need more statistically efficient estimates for generalization error or goodness-of-fit. In this article, we look at the PRESS statistic, and how to use it to estimate generalization error and choose between models.

**The PRESS Statistic**

You can think of the PRESS statistic as an “adjusted sum of squared error (SSE).” It is calculated as `PRESS = sum_{i=1...n} (y_i - yhat_{(-i)})^2`, where *n* is the number of data points in the training set, *y_i* is the outcome of the *i*th data point, and *yhat_{(-i)}* is the prediction for the *i*th point from a model fit on all the training data except the *i*th point.

For example, if you wanted to calculate the PRESS statistic for linear regression models in R, you could do it this way (though I wouldn’t recommend it):

```
# For explanation purposes only -
# DO NOT implement PRESS this way
brutePRESS.lm = function(fmla, dframe, outcome) {
  npts = dim(dframe)[1]
  ssdev = 0
  for(i in 1:npts) {
    # a data frame with all but the ith row
    d = dframe[-i,]
    # build a model using all but pt i
    m = lm(fmla, data=d)
    # then predict outcome[i]
    pred = predict(m, newdata=dframe[i,])
    # sum the squared deviations
    ssdev = ssdev + (pred - outcome[i])^2
  }
  ssdev
}
```
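For linear regression there is a well-known closed form that avoids refitting: the leave-one-out residual is `e_i / (1 - h_ii)`, where `h_ii` is the *i*th diagonal entry of the hat matrix. A sketch of the trick in Python/numpy (the data here is synthetic, purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one input
y = 2.0 + 3.0 * X[:, 1] + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))  # hat-matrix diagonal

# PRESS via the closed form: sum of squared leave-one-out residuals.
press = np.sum((resid / (1 - h)) ** 2)

# Brute-force check: refit with each row held out in turn.
brute = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    brute += float(y[i] - X[i] @ b) ** 2

print(np.isclose(press, brute))  # True
```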

We have implemented a couple of helper functions to calculate the PRESS statistic (and related measures) for linear regression models more efficiently. You can find the code here. The function `hold1OutLMPreds(fmla, dframe)` returns the vector `f`, where `f[i]` is the prediction on the ith row of `dframe` when fitting the linear regression model described by `fmla` on `dframe[-i,]`. The function `hold1OutMeans(y)` returns a vector `g` where `g[i] = mean(y[-i])`. With these functions, you can efficiently calculate the PRESS statistic for a linear regression model:

```
hopreds = hold1OutLMPreds(fmla, dframe)
devs = y - hopreds
PRESS = sum(devs^2)
```

One disadvantage of the SSE (and the PRESS) is that they are dependent on the data size; you can’t compare a single model’s performance across data sets of different size. You can remove that dependency by going to the root mean squared error (RMSE): `rmse = sqrt(sse/n)`, where `n` is the size of the data set. You can also calculate an equivalent “root mean PRESS” statistic:

```
n = length(y)
hopreds = hold1OutLMPreds(fmla, dframe)
devs = y - hopreds
rmPRESS = sqrt(mean(devs^2))
```

And you can also define a “PRESS R-squared”:

```
n = length(y)
hopreds = hold1OutLMPreds(fmla, dframe)
homeans = hold1OutMeans(y)
devs = y - hopreds
dely = y - homeans
PRESS = sum(devs^2)
PRESS.r2 = 1 - (PRESS/sum(dely^2))
```

The “PRESS R-squared” is one minus the ratio of the model’s PRESS over the “PRESS of y’s mean value;” it adjusts the estimate of how much variation the model explains by using leave-one-out cross validation rather than adjusting for the model’s degrees of freedom (as the more standard adjusted R-squared does).

You might also consider defining a PRESS R-squared using the in-sample total error (`y - mean(y)`) instead of the 1-hold-out mean; we decided on the latter in an “apples-to-apples” spirit. Note also that the PRESS R-squared can be negative if the model is very poor.

**An Example**

Let’s imagine a situation where we want to predict a quantity *y*, and we have many many potential inputs to use in our prediction. Some of these inputs are truly correlated with *y*; some of them are not. Of course, we don’t know which are which. We have some training data with which to build models, and we will get (but don’t yet have) hold-out data to evaluate the final model. How might we proceed?

First, let’s create a process to simulate this situation:

```
# build a data frame with pure noise columns
# and columns weakly correlated with y
buildExample1 <- function(nRows) {
  nNoiseCols <- 300
  nCorCols <- 20
  copyDegree <- 0.1
  noiseMagnitude <- 0.1
  d <- data.frame(y=rnorm(nRows))
  for(i in 1:nNoiseCols) {
    nm <- paste('noise',i,sep='_')
    d[,nm] <- noiseMagnitude*rnorm(nRows) +
      ifelse(runif(nRows)<=copyDegree, rnorm(nRows), 0)
  }
  for(i in 1:nCorCols) {
    nm <- paste('cor',i,sep='_')
    d[,nm] <- noiseMagnitude*rnorm(nRows) +
      ifelse(runif(nRows)<=copyDegree, d$y, 0)
  }
  d
}
```

This function will produce a dataset of `nRows` rows with 20 columns that are weakly correlated with `y` (called `cor_1, cor_2, ...`) and 300 columns (`noise_1, noise_2, ...`) that are independent of `y`. The process is designed so that the noise columns and the correlated columns have similar magnitudes and variances. The outcome can be expressed as a linear combination of the correlated inputs, so a linear regression model should give reasonable predictions.

Let's suppose we have two candidate models: one which uses all the variables, and one which magically uses only the intentionally correlated variables.

```
set.seed(22525)
train = buildExample1(1000)
output = "y"
inputs = setdiff(colnames(train), output)
truein = inputs[grepl("^cor",inputs)]

# all variables, including noise
# (noisy model)
fmla1 = paste(output, "~", paste(inputs, collapse="+"))
mod1 = lm(fmla1, data=train)

# only true inputs
# (clean model)
fmla2 = paste(output, "~", paste(truein, collapse="+"))
mod2 = lm(fmla2, data=train)
```

We can extract all the model coefficients that `lm()` deemed significant to p < 0.05 (that is, all the coefficients that are marked with at least one "*" in the model summary).

```
# 0.05 = "*" in the model summary
sigCoeffs = function(model, pmax=0.05) {
  cmat = summary(model)$coefficients
  pvals = cmat[,4]
  plo = names(pvals)[pvals < pmax]
  plo
}

# significant coefficients in the noisy model
sigCoeffs(mod1)
##  [1] "noise_41"  "noise_59"  "noise_66"  "noise_117" "noise_207"
##  [6] "noise_256" "noise_279" "noise_280" "cor_1"     "cor_2"
## [11] "cor_3"     "cor_4"     "cor_5"     "cor_6"     "cor_7"
## [16] "cor_8"     "cor_9"     "cor_10"    "cor_11"    "cor_12"
## [21] "cor_13"    "cor_14"    "cor_15"    "cor_16"    "cor_17"
## [26] "cor_18"    "cor_19"    "cor_20"
```

In other words, several of the noise inputs appear to be correlated with the output in the training data, just by chance. This means that the noisy model has overfit the data. Can we detect that? Let's look at the SSE and the PRESS:

```
##          name   sse PRESS
## 1 noisy model 203.3 448.6
## 2 clean model 285.8 306.8
```

Looking at the in-sample SSE, the noisy model looks better than the clean model; the PRESS says otherwise. We can see the same thing if we look at the R-squared style measures:

```
##          name     R2  R2adj PRESSr2
## 1 noisy model 0.7931 0.6956  0.5442
## 2 clean model 0.7091 0.7031  0.6884
```

Again, R-squared makes the noisy model look better than the clean model. The adjusted R-squared correctly indicates that the additional variables in the noisy model do not improve the fit, and slightly prefers the clean model. The PRESS R-squared identifies the clean model as the better model, with a much larger margin of difference than the adjusted R-squared.

**The PRESS statistic versus Hold-out Data**

Of course, while the PRESS statistic is statistically efficient, it is not always computationally efficient, especially with modeling techniques other than linear regression. The calculation of the adjusted R-squared is not computationally demanding, and it also identified the better model in our experiment. One could ask, why not just use adjusted R-squared?

One reason is that the PRESS statistic is attempting to directly model future predictive performance. Our experiment suggests that it shows clearer distinctions between the models than the adjusted R-squared. But how well does the PRESS statistic estimate the "true" generalization error of a model?

To test this, we will hold the ground truth (that is, the data generation process) and the training set fixed. We will then repeat generating test sets, measuring the RMSE of the models' predictions against the test sets, and compare them to the training RMSE and root mean PRESS. This is akin to a situation where the training data and model fitting are accomplished facts, and we are hypothesizing possible future applications of the model.

Specifically, we used `buildExample1()` to generate one hundred test sets of size 100 (10% the size of the training set) and one hundred test sets of size 1000 (the size of the training set). We then evaluated both the clean model and the noisy model against all the test sets and compared the distributions of the hold-out root mean squared error (RMSE) against the in-sample RMSE and PRESS statistics. The results are shown below.

For each plot, the solid black vertical line is the mean of the distribution of test RMSE; we can assume that the observed mean is a good approximation to the "true" expected RMSE of the model. Not surprisingly, a smaller test set size leads to more variance in the observed RMSE, but after 100 trials, both the n=100 and n=1000 hold out sets lead to similar estimates of the expected RMSE (just under 0.7 for the noisy model, just under 0.6 for the clean model).

The dashed red lines give the root mean PRESS of both models on the training data, and the dashed blue lines give each model's training set RMSE. For both the noisy and clean models, the root mean PRESS gives a better estimate of the models' expected RMSE than the training set RMSE -- dramatically so with the noisy, overfit model.

Note, however, that in this experiment, a single hold-out set reliably preferred the clean model to the noisy one (that is, the hold-out SSE was always greater for the noisy model than the clean one when both models were applied to the same test data). The moral of the story: use hold-out data (both calibration and test sets) when that is feasible. When data is at a premium, then try more statistically efficient metrics like the PRESS statistic to "stretch" the data that you have.

In R even single values such as `5` are actually represented as length-1 vectors. We commonly think about working over vectors of "logical", "integer", "numeric", "complex", "character", and "factor" types. However, a "factor" is not an R vector; in fact, as we will see, a factor is implemented as a sequence of integers carrying a class annotation. For example, consider the following R code.

```
levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'),levels=levels)
print(f)
## [1] c a a <NA> b a
## Levels: a b c
print(class(f))
## [1] "factor"
```

This example encodes a series of 6 observations into a known set of factor-levels (`'a'`, `'b'`, and `'c'`). As is the case with real data, some of the positions might be missing/invalid values such as `NA`. One of the strengths of R is that we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level `'a'` was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.

```
fRevised <- ifelse(is.na(f),'a',f)
print(fRevised)
## [1] "3" "1" "1" "a" "2" "1"
print(class(fRevised))
## [1] "character"
```

Notice the new column `fRevised` is an absolute mess (and not even of class/type factor). This sort of fix would have worked if `f` had been a vector of characters or even a vector of integers, but for factors we get gibberish.
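A working alternative (a small sketch of ours, not from the original text): indexed assignment on the factor itself is overloaded to interpret `'a'` as a level, so the class is preserved:

```r
# safe fix: assign into the factor directly, staying inside the abstraction
levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'),levels=levels)
fRevised <- f
fRevised[is.na(fRevised)] <- 'a'  # 'a' is interpreted as a level, not an index
print(fRevised)
## [1] c a a a b a
## Levels: a b c
print(class(fRevised))
## [1] "factor"
```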

We are going to work through some more examples of this problem.

R is designed to support statistical computation. In R, analyses and calculations are often centered on a type called a data frame. A data frame is very much like a SQL table in that it is a sequence of rows (each row representing an instance of data) organized against a column schema. This is also very much like a spreadsheet where we have good column names and column types. (One caveat: in R, vectors that are all `NA` typically lose their type information and become type `"logical"`.) An example of an R data frame is given below.

```
d <- data.frame(x=c(1,-0.4),y=c('a','b'))
print(d)
##      x y
## 1  1.0 a
## 2 -0.4 b
```

An R data frame is actually implemented as a list of columns, each column being treated as a vector. This encourages a very powerful programming style where we specify transformations as operations over columns. An example of working over column vectors is given below:

```
d <- data.frame(x=c(1,-0.4),y=c('a','b'))
d$xSquared <- d$x^2
print(d)
##      x y xSquared
## 1  1.0 a     1.00
## 2 -0.4 b     0.16
```

Notice that we did not need to specify any for-loop, iteration, or range over the rows. We work over column vectors to great advantage in clarity and speed. This is fairly clever as traditional databases tend to be row-oriented (define operations as traversing rows) and spreadsheets tend to be cell-oriented (define operations over ranges of cells). We can confirm R’s implementation of data frames is in fact a list of column vectors (not merely some other structure behaving as such) through the unclass-trick:

```
print(class(unclass(d)))
## [1] "list"
print(unclass(d))
## $x
## [1] 1.0 -0.4
##
## $y
## [1] a b
## Levels: a b
##
## $xSquared
## [1] 1.00 0.16
##
## attr(,"row.names")
## [1] 1 2
```

The data frame `d` is implemented as a class/type annotation over a list of columns (`x`, `y`, and `xSquared`). Let's take a closer look at the class or type of the column `y`.

```
print(class(d$y))
## [1] "factor"
```

The class of `y` is `"factor"`. We gave R a sequence of strings and it promoted or coerced them into a sequence of factor levels. For statistical work this makes a lot of sense; we are more likely to want to work over factors (which we will define soon) than over strings. And at first glance R seems to like factors more than strings. For example `summary()` works better with factors than with strings:

```
print(summary(d))
##        x          y       xSquared
##  Min.   :-0.40   a:1   Min.   :0.16
##  1st Qu.:-0.05   b:1   1st Qu.:0.37
##  Median : 0.30         Median :0.58
##  Mean   : 0.30         Mean   :0.58
##  3rd Qu.: 0.65         3rd Qu.:0.79
##  Max.   : 1.00         Max.   :1.00
print(summary(data.frame(x=c(1,-0.4),y=c('a','b'),
   stringsAsFactors=FALSE)))
##        x              y
##  Min.   :-0.40   Length:2
##  1st Qu.:-0.05   Class :character
##  Median : 0.30   Mode  :character
##  Mean   : 0.30
##  3rd Qu.: 0.65
##  Max.   : 1.00
```

Notice how if `y` is a factor column we get nice counts of how often each factor-level occurred, but if `y` is a character type (forced by setting `stringsAsFactors=FALSE` to turn off conversion) we don't get a usable summary. So as a default behavior R promotes strings/characters to factors, and it supplies better summaries for factors than for characters. This would make you think that factors might be a preferred/safe data type in R. This turns out to not completely be the case. A careful R programmer must really decide when and where they want to allow factors in their code.

What is a factor? In principle a factor is a value where the value is known to be taken from a known finite set of possible values called levels. This is similar to an enumerated type. Typically we think of factor levels or categories as taking values from a fixed set of strings. Factors are very useful in encoding categorical responses or data. For example we can represent which continent a country is in with the factor levels `"Asia"`, `"Africa"`, `"North America"`, `"South America"`, `"Antarctica"`, `"Europe"`, and `"Australia"`. When the data has been encoded as a factor (perhaps during ETL) you not only have the continents indicated, you also know the complete set of continents and have a guarantee of no ad-hoc alternate responses (such as "SA" for South America). Additional machine-readable knowledge and constraints make downstream code much more compact, powerful, and safe.
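A small illustration of this guarantee (using the continent levels from the text; the observation vector is our own example): an ad-hoc response that is not in the declared level set surfaces as `NA` instead of silently becoming a new category.

```r
# encoding data as a factor stores the full level set and rejects ad-hoc codes
continents <- c("Asia","Africa","North America","South America",
                "Antarctica","Europe","Australia")
obs <- c("Asia","Europe","SA")  # "SA" is an ad-hoc response, not a level
enc <- factor(obs, levels=continents)
print(enc)
## [1] Asia   Europe <NA>
## Levels: Asia Africa North America South America Antarctica Europe Australia
# the invalid code is flagged as NA, so it can be found and fixed
print(which(is.na(enc)))
## [1] 3
```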

You can think of a factor vector as a sequence of strings with an additional annotation as to what universe of strings they are taken from. R actually implements a factor as a sequence of integers, where each integer is the index (starting from 1) of the string in the sequence of possible levels.

```
print(class(unclass(d$y)))
## [1] "integer"
print(unclass(d$y))
## [1] 1 2
## attr(,"levels")
## [1] "a" "b"
```

This implementation difference *should* not matter, except R exposes implementation details (more on this later). Exposing implementation details is generally considered to be a bad thing as we don’t know if code that uses factors is using the declared properties and interfaces or is directly manipulating the implementation.

Down-stream users or programmers are supposed to mostly work over the supplied abstraction, not over the implementation. Users should not routinely have direct access to the implementation details and certainly not be able to directly manipulate the underlying implementation. In many cases the user must be *aware* of some of the limitations of the implementation, but this is considered a necessary *undesirable* consequence of a leaky abstraction. An example of a necessarily leaky abstraction: abstracting base-2 floating point arithmetic as if it were arithmetic over the real numbers. For decent speed you need your numerics to be based on machine floating point (or some multi-precision extension of machine floating point), but you want to think of numerics abstractly as real numbers. With this leaky compromise the user doesn't have to have the entire IEEE Standard for Floating-Point Arithmetic (IEEE 754) open on their desk at all times. But the user should know the exceptions, like: `(3-2.9)<=0.1` tends to evaluate to `FALSE` (due to the implementation, and in violation of the claimed abstraction), and know the necessary defensive coding practices (such as being familiar with What Every Computer Scientist Should Know About Floating-Point Arithmetic).
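The leak, and the standard defensive idiom, can be seen directly (a small demonstration of ours):

```r
# IEEE double arithmetic violating the real-number abstraction
gap <- 3 - 2.9
print(gap <= 0.1)
## [1] FALSE
print(gap, digits=17)
## [1] 0.10000000000000009
# the defensive practice: compare with a tolerance instead of exactly
print(isTRUE(all.equal(gap, 0.1)))
## [1] TRUE
```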

Now: factors *can* be implemented perfectly and efficiently, so they *should* be implemented perfectly. At first glance it appears that they have been implemented correctly in R and the user is protected from the irrelevant implementation details. For example, if we try to manipulate the underlying integer array representing the factor levels we get caught.

```
d$y[1] <- 2
## Warning message:
## In `[<-.factor`(`*tmp*`, 1, value = c(NA, 1L)) :
##   invalid factor level, NA generated
```

This is good: when we tried to monkey with the implementation we got caught. This is how the R implementors try to ensure there is not a lot of user code directly manipulating the current representation of factors (leaving open the possibility of future bug-fixes and implementation improvements). Likely this safety was achieved by overloading/patching the `[<-` operator. However, as with most fix-to-finish designs, a few code paths are missed and there are places the user is exposed to the implementation of factors when they expected to be working over the abstraction. Here are a few examples:

```
f <- factor(c('a','b','a')) # make a factor example
print(class(f))
## [1] "factor"
print(f)
## [1] a b a
## Levels: a b
# c() operator collapses to implementation
print(class(c(f,f)))
## [1] "integer"
print(c(f,f))
## [1] 1 2 1 1 2 1
# ifelse(,,) operator collapses to implementation
print(ifelse(rep(TRUE,length(f)),f,f))
## [1] 1 2 1
# factors are not actually vectors
# this IS as claimed in help(vector)
print(is.vector(f))
## [1] FALSE
# factor implementations are not vectors either
# despite being "integer"
print(class(unclass(f)))
## [1] "integer"
print(is.vector(unclass(f)))
## [1] FALSE
# unlist of a factor is not a vector
# despite help(unlist):
# "Given a list structure x, unlist simplifies it to produce a vector"
print(is.vector(unlist(f)))
## [1] FALSE
print(unlist(f))
## [1] a b a
## Levels: a b
print(as.vector(f))
## [1] "a" "b" "a"
```

What we have done is found instances where a `factor` column does not behave as we would expect a character vector to behave. These defects in behavior are why I claim factors are not first class in R. They don't get the full-service expected behavior from a number of basic R operations (such as passing through `c()` or `ifelse(,,)` without losing their class label). It is hard to say a factor is treated as a first-class citizen that correctly "supports all the operations generally available to other entities" (quote taken from Wikipedia: First-class_citizen). R doesn't seem to trust leaving factor data types in factor data types (which should give one pause about doing the same).
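One defensive pattern is to do the concatenation yourself: round-trip through character and re-encode over the union of the levels. The helper below (`cFactors` is a hypothetical name of ours, not part of base R) sketches this; note that newer versions of R have since taught `c()` to combine factors properly.

```r
# hypothetical helper: concatenate factors without dropping to the implementation
cFactors <- function(...) {
  args <- list(...)
  allLevels <- unique(unlist(lapply(args, levels)))   # union of level sets
  factor(unlist(lapply(args, as.character)), levels=allLevels)
}
f1 <- factor(c('a','b','a'))
f2 <- factor(c('c','b'))
fBoth <- cFactors(f1, f2)
print(fBoth)
## [1] a b a c b
## Levels: a b c
print(class(fBoth))
## [1] "factor"
```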

The reason these differences are not mere curiosities is: in any code where we are expecting one behavior and we experience another, we have a bug. So these conversions or abstraction leaks cause system brittleness, which can lead to verbose, hard-to-test, overly defensive code (see Postel's law; not sure who to be angry with for some of the downsides of being required to code defensively).

September 9, 1947 Grace Murray Hopper “First actual case of bug being found.”

(image: Computer History Museum)

Why should we expect a factor to behave like a character vector? Why not expect it to behave like an integer vector? The reason is: we supplied a character vector and R's default behavior in `data.frame()` was to convert it to a factor. R's behavior only makes sense under the assumption there is some commonality of behavior between factors and character vectors. Otherwise R has made a surprising substitution and violated the principle of least astonishment. To press the point further: from an object oriented view (which is a common way to talk about the separation of concerns of interface and implementation), a valid substitution should at the very least follow some form of the Liskov substitution principle, with factor being a valid sub-type of character vector. But this is *not* possible between mutable versions of factor and character vector, so the substitution should not have been offered.

What we are trying to point out is: design is not always just a matter of taste. With enough design principles in mind (such as least astonishment, Liskov substitution, and a few others) you can actually say some design decisions are wrong (and maybe even some day some other design decisions are right). There are very few general principles of software system design, so you really don't want to ignore the few you have.

One possible criticism of my examples is: "You have done everything wrong, *everybody* knows to set `stringsAsFactors=FALSE`." I call this the "Alice's Adventures in Wonderland" defense. In my opinion the user is a guest, and it is fair for the guest to initially assume default settings are generally the correct or desirable settings. The relevant "Alice's Adventures in Wonderland" quote being:

At this moment the King, who had been for some time busily writing in his note-book, cackled out ‘Silence!’ and read out from his book, ‘Rule Forty-two. All persons more than a mile high to leave the court.’

Everybody looked at Alice.

‘I’m not a mile high,’ said Alice.

‘You are,’ said the King.

‘Nearly two miles high,’ added the Queen.

‘Well, I shan’t go, at any rate,’ said Alice: ‘besides, that’s not a regular rule: you invented it just now.’

‘It’s the oldest rule in the book,’ said the King.

‘Then it ought to be Number One,’ said Alice.

(text: Project Gutenberg)

(image from Wikipedia)

Another obvious criticism is: "You have worked hard to write bugs." That is not the case; I have worked hard to make consequences direct and obvious. Where I first noticed my bug was code deep in an actual project, which is similar to the following example. First let's build a synthetic data set where `y~f(x)`, with `x` a factor or categorical variable.

```
# build a synthetic data set
set.seed(36236)
n <- 50
d <- data.frame(x=sample(c('a','b','c','d','e'),n,replace=TRUE))
d$train <- FALSE
d$train[sample(1:n,n/2)] <- TRUE
print(summary(d$x))
##  a  b  c  d  e
##  4  7 12 14 13
# build noisy y = f(x), with f('a')==f('b')
vals <- rnorm(length(levels(d$x)))
vals[2] <- vals[1]
names(vals) <- levels(d$x)
d$y <- rnorm(n) + vals[d$x]
print(vals)
##          a          b          c          d          e
##  1.3394631  1.3394631  0.3536642  1.6990172 -0.5423986
# build a model
model1 <- lm(y~0+x,data=subset(d,train))
d$pred1 <- predict(model1,newdata=d)
print(summary(model1))
##
## Call:
## lm(formula = y ~ 0 + x, data = subset(d, train))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.53459 -0.43303 -0.07942  0.49278  2.20614
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## xa   2.9830     0.7470   3.993 0.000715 ***
## xb   2.0506     0.5282   3.882 0.000926 ***
## xc   1.2824     0.3993   3.212 0.004378 **
## xd   2.3644     0.3993   5.922  8.6e-06 ***
## xe  -1.1541     0.4724  -2.443 0.023974 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.056 on 20 degrees of freedom
## Multiple R-squared: 0.8046, Adjusted R-squared: 0.7558
## F-statistic: 16.47 on 5 and 20 DF, p-value: 1.714e-06
```

Our first model is good. But during the analysis phase we might come across some domain knowledge, such as: `'a'` and `'b'` are actually equivalent codes. We could reduce fitting variance by incorporating this knowledge in our feature engineering. In this example it won't be much of an improvement; we are not merging much and not eliminating many degrees of freedom. In a real production example this can be a very important step, where you may have a domain-supplied roll-up dictionary that merges a large number of levels. However, what happens is our new merged column gets quietly converted to a column of integers, which is then treated as a numeric column in the following modeling step. So the merge is in fact disastrous: we lose the categorical structure of the variable. We can, of course, re-institute the structure by calling `as.factor()` if we know about the problem (which we might not), but even then we have lost the string labels for new integer level labels (making debugging even harder). Let's see the failure we are anticipating; notice how the training adjusted R-squared disastrously drops from 0.7558 to 0.1417 after we attempt our "improvement."

```
# try (and fail) to build an improved model
# using domain knowledge f('a')==f('b')
d$xMerged <- ifelse(d$x=='b',factor('a',levels=levels(d$x)),d$x)
print(summary(as.factor(d$xMerged)))
##  1  3  4  5
## 11 12 14 13
# disaster! xMerged is now class integer
# which is treated as numeric in lm, losing a lot of information
model2 <- lm(y~0+xMerged,data=subset(d,train))
print(summary(model2))
##
## Call:
## lm(formula = y ~ 0 + xMerged, data = subset(d, train))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.3193 -0.5818  0.8281  1.6237  3.5451
##
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)
## xMerged   0.2564     0.1132   2.264   0.0329 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.98 on 24 degrees of freedom
## Multiple R-squared: 0.176, Adjusted R-squared: 0.1417
## F-statistic: 5.128 on 1 and 24 DF, p-value: 0.03286
```

There is an obvious method to merge the levels correctly: convert back to character (which we show below). The issue is: if you don’t know about the conversion to integer happening, you may not know to look for it and correct it.

```
# correct f('a')==f('b') merge
d$xMerged <- ifelse(d$x=='b','a',as.character(d$x))
model3 <- lm(y~0+xMerged,data=subset(d,train))
d$pred3 <- predict(model3,newdata=d)
print(summary(model3))
##
## Call:
## lm(formula = y ~ 0 + xMerged, data = subset(d, train))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.53459 -0.51084 -0.05408  0.71385  2.20614
##
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)
## xMergeda   2.3614     0.4317   5.470 1.99e-05 ***
## xMergedc   1.2824     0.3996   3.209  0.00422 **
## xMergedd   2.3644     0.3996   5.916 7.15e-06 ***
## xMergede  -1.1541     0.4729  -2.441  0.02361 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.057 on 21 degrees of freedom
## Multiple R-squared: 0.7945, Adjusted R-squared: 0.7553
## F-statistic: 20.3 on 4 and 21 DF, p-value: 5.693e-07
dTest <- subset(d,!train)
nTest <- dim(dTest)[[1]]
# Root Mean Square Error of original model on test data
print(sqrt(sum((dTest$y-dTest$pred1)^2)/nTest))
## [1] 1.330894
# Root Mean Square Error of f('a')==f('b') model on test data
print(sqrt(sum((dTest$y-dTest$pred3)^2)/nTest))
## [1] 1.297682
```
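Another safe merge pattern (a sketch of ours, not from the original project code) is to edit the levels attribute directly through the `levels<-` accessor; rows coded `'b'` are relabeled `'a'` without ever leaving the factor abstraction and without a trip through character:

```r
# merge levels 'b' into 'a' by renaming the level itself;
# duplicate level names are automatically collapsed into one level
x <- factor(c('a','b','c','b'))
levels(x)[levels(x)=='b'] <- 'a'
print(x)
## [1] a a c a
## Levels: a c
```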

Factors are definitely useful, and I am glad R has them. I just wish they had fewer odd behaviors. My rule of thumb is to use them as late as possible: set `stringsAsFactors=FALSE`, and if you need factors in some place, convert from character near that place.

Please see the following articles for more ideas on working with categorical variables and preparing data for analysis.

The story is an inside joke referring to something really only funny to one of the founders. But a joke that amuses the teller is always enjoyed by at least one person. Win-Vector LLC’s John Mount had the honor of co-authoring a 1997 paper titled “The Polytope of Win Vectors.” The paper title is obviously mathematical terms in an odd combination. However the telegraphic grammar is coincidentally similar to deliberately ungrammatical gamer slang such as “full of win” and “so much win.”

If we treat “win” as a concrete noun (say something you can put in a sack) and “vector” in its *non-mathematical* sense (as an entity of infectious transmission) we have “Win-Vector LLC is an infectious delivery of victory.” I.e.: we deliver success to our clients. Of course, we have now attempted to explain a weak joke. It is not as grand as “winged victory,” but it does encode a positive company value: Win-Vector LLC delivers successful data science projects and training to clients.

Winged Victory: from Wikipedia

Let’s take this as an opportunity to describe what a win vector is.

We take the phrase “win vector” from a technical article titled “The Polytope of Win Vectors” by J.E. Bartels, J. Mount, and D.J.A. Welsh (Annals of Combinatorics 1, 1997, pp. 1-15). The topic of this paper concerns the possible outcomes of game tournaments (or other things that can be expressed as tournaments). For example: we could have four teams (A, B, C, and D) scheduled to play each other a number of times, as indicated in the diagram below.

This graph is just saying that in the tournament A will play B 5 times, B will not play C, and so on. We assume each game can end in a win for one team (giving that team 1 point) or in a loss or tie (giving zero points). We can record a summary of the tournament outcomes as a vector (“vector” now back in its mathematical sense) that just records how often each team won. For example the vector [10,1,1,0] is a win vector compatible with the above diagram (it encodes A winning all matches and D losing all matches). The vector [0,0,0,5] is not a valid win vector for the diagram, as D did not play 5 games (so cannot have 5 wins). (The Win-Vector LLC logo is itself a stylized single game tournament diagram, with the directed arrow both representing victory and recalling vectors in the mathematical sense.)

The idea is that a win vector might be treated as a sufficient statistic for the tournament. Or more accurately the win vector may be all that is known about a previously run tournament. Such censored observations may be all that is possible in field biology where wins represent territory or offspring. The question is then: given knowledge of the tournament structure (the graph) and the summary of outcomes (the win vector) is there evidence one team is dominant, or are the effects random? So we have well-formed statistical questions about effect strength and significance.

The question of significance is: when we introduce a notion of effect strength, how likely are we to see an effect of that size assuming identical players? For example, if we make our notion of effect strength the maximum ratio of wins to plays seen in the win vector, should we consider this evidence of a strong player, or is it to be expected from random fluctuation? We need to estimate how strong a conditioning effect our tournament constraints impose on unobserved outcomes (to determine if irregularities in distribution are from player strengths or from tournament mis-design).

Relating distributions of unobserved details to observed totals (or margins) is one of the most fundamental problems in statistics. We have written on it many times (two examples: Google ad market reporting and checking scientific claims). In all cases you would be better off with direct detailed observations (i.e. without the censorship); but often you have to work with the data you have instead of the experiment you would design.

The math is a little easier to explain for a related problem: working out the number of ways to fill in a matrix with non-negative integers to meet given row and column totals. I’ll move on to discuss this contingency table problem a bit.

The statistical ideas largely come from “Testing for Independence in a Two-Way Table: New Interpretations of the Chi-Square Statistic”, Persi Diaconis and Bradley Efron, Ann. Statist., Vol. 13, No. 3, 1985, pp. 845-874. A contingency table is a matrix of non-negative integers, and the statistical problem is relating known row and column totals to possible fill-ins. In this paper the authors criticize some of the standard significance tests (chi-square, Fisher’s exact test) and propose a parameterized family of tests that at the extreme end considers a null-model of uniform fill-ins (each possible fill-in equally likely). Obviously a uniform model is very different from the more standard distributions, which tend to have cell counts more highly concentrated around their means. But the idea is: this proposed test takes more of the structure of the margin totals into account (or equivalently assumes away less of the margin-mediated cell dependencies) and has its own merits.

However, we are actually describing the work of mathematicians and theoretical computer scientists. In that style you only speak with “applied types” (such as theoretical statisticians) to justify working on a snappy math problem. In this case: counting the number of ways to fill in a contingency table, or the number of detailed results compatible with a given win vector (the link between counting and generation having been strongly established in “Randomised Algorithms for Counting and Generating Combinatorial Structures”, A.J. Sinclair, Ph.D. thesis, University of Edinburgh (1988) and related works).

The contingency table problem is partially solved in:

- “Sampling contingency tables” Martin Dyer, Ravi Kannan, John Mount, Random Structures and Algorithms Vol. 10, no. 4, July 1997 pp. 487-506.
- “Fast Unimodular Counting” John Mount, Combinatorics Probability and Computing, Vol. 9, No. 3, May 2000, pp 277-285.

The second paper (strengthening some results from my Ph.D. thesis) lets you calculate that the number of ways to fill in the following four by four contingency table with non-negative integers to meet the shown row and column totals is exactly `350854066054593772938684218633979710637454260` (about `3.508541e+44`).

```
 x(0,0) x(0,1) x(0,2) x(0,3) 154179
 x(1,0) x(1,1) x(1,2) x(1,3) 255424
 x(2,0) x(2,1) x(2,2) x(2,3) 277000
 x(3,0) x(3,1) x(3,2) x(3,3) 160179
 191780 288348 165221 201433
```

The point being: the table could arise as the summary from a data set with `846782` (`=191780 + 288348 + 165221 + 201433`) items; to characterize probabilities over such tables you need good methods to sample over the astronomical family of potential alternate fill-ins (and this is where you apply the link between counting and sampling for self-reducible problem families). We have example code, notes, an improved runtime proof, and results here.
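To make the counting problem concrete, here is a brute-force sketch (ours, and emphatically not the polynomial-time method of the papers above) for the smallest interesting case: a 2x2 table, where fixing the top-left cell determines the whole fill-in, so we just scan its feasible range.

```r
# count non-negative integer fill-ins of a 2x2 table with given margins
countFillIns2x2 <- function(r1, r2, c1, c2) {
  stopifnot(r1 + r2 == c1 + c2)  # margins must be consistent
  count <- 0
  for (x11 in 0:min(r1, c1)) {
    # the remaining three cells are forced by the margins
    x12 <- r1 - x11; x21 <- c1 - x11; x22 <- r2 - x21
    if (x12 >= 0 && x21 >= 0 && x22 >= 0) count <- count + 1
  }
  count
}
print(countFillIns2x2(3, 2, 2, 3))
## [1] 3
```

Even this toy version makes the combinatorial explosion plausible: the count grows with the margins, and for the 4x4 table above enumeration is hopeless, which is why the sampling/counting machinery is needed.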

“The Polytope of Win Vectors” introduced additional ideas from integral polymatroids to more strongly relate volume to number of integer vectors (and gets more complete theoretical results for its problem).

All the “big hammer” math is trying to extend some of the beauty of G.H. Hardy and J.E. Littlewood, “Some problems of Diophantine approximation: the lattice points of a right-angled triangle,” Hamburg. Math.Abh., 1 (1921) 212–249 to more general settings.

Or more succinctly: we just like the word “win.”

**What is the Gauss-Markov theorem?**

From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:

A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.

This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.

Suppose you are building a model of the form:

```
y(i) = B . x(i) + e(i)
```

where `B` is a vector (to be inferred), `i` is an index that runs over the available data (say `1` through `n`), `x(i)` is a per-example vector of features, and `y(i)` is the scalar quantity to be modeled. Only `x(i)` and `y(i)` are observed. The `e(i)` term is the un-modeled component of `y(i)`, and you typically hope that the `e(i)` can be thought of as unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak or strong the assumptions you put on the `e(i)` (and other quantities) are depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of `B` under weak assumptions.
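The least squares estimate itself is just the solution of the normal equations. A quick sketch on synthetic data (our own illustration, with an intercept and two features) confirms that the explicit solve and `lm()` agree:

```r
# least squares two ways: lm() versus the explicit normal equations
set.seed(2014)
n <- 100
X <- cbind(1, matrix(rnorm(2*n), ncol=2))  # design matrix with intercept column
B <- c(0.5, 2, -1)                         # true coefficients
y <- as.numeric(X %*% B) + rnorm(n)        # e(i) drawn as standard normal noise
Bhat <- solve(t(X) %*% X, t(X) %*% y)      # solve the normal equations
fit <- lm(y ~ X[,2] + X[,3])
print(max(abs(Bhat - coef(fit))))          # essentially zero: the two agree
```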

**How to interpret the theorem**

The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the `e(i)` and without distributional assumptions about the `x(i)`. However, if you are using Bayesian methods or generative models for predictions you *may want* to use additional stronger conditions (perhaps even normality of errors and *even* distributional assumptions on the `x`s).

We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.

**Wikipedia’s stated pre-conditions of the Gauss-Markov theorem**

To apply the Gauss-Markov theorem, Wikipedia says you must assume your data has the following properties:

```
E[e(i)] = 0                 (lack of structural errors, needed to avoid bias)
V[e(i)] = c                 (equal variance, one form of homoscedasticity)
cov[e(i),e(j)] = 0 for i!=j (non-correlation of errors)
```

It is always important to know precisely what probability model the expectation (`E[]`), variance (`V[]`), and covariance (`cov[]`) operators are working over in the Wikipedia conditions. This is usually left implicit, but it is critical to know exactly what is being asserted. When reading/listening about statistical or probabilistic work you should *always* insist on a concrete description of the probability model underlying all the notation (the `E[]`s and `V[]`s). A lot of confusion and subtle tricks get hidden by not sharing an explicit description of the probability model.

**Probability models**

Two plausible probability models are:

- Frequentist: unobserved parameters are held constant and all probabilities are over re-draws of the data. At first guess you would think this is the correct model for this problem, as the content of the Gauss-Markov theorem is about how estimates drawn from a larger population perform in expectation.
- x-Generative: This is not standard and not immediately implied by the notation (and represents a fairly strong set of assumptions). In this model all of the observed `x`s are held constant and the unobserved `e`s and `y`s are regenerated with respect to the `x`s. This is similar to a Bayesian generative model, except in the usual Bayesian formulation all observables (both `x`s and `y`s) are held fixed. We only introduce this model as it seems to be the simplest one which makes for a workable interpretation of the Wikipedia statements.

The issue is: the conditions as stated are not strong enough to ensure actual homoscedasticity (or even non-structure of errors/bias) needed to apply the Gauss-Markov theorem under a strict frequentist model. So we must go venue-shopping and find what model is likely intended. An easy way to do this is to design synthetic data that is considered well-behaved under one model and not under the other.

**A source of examples**

Let’s use a deliberately naive empirical view of data. Suppose the entire possible universe of data is `X(i),Y(i),Z(i) i=1...k` for some `k` (`k` and `X(i),Y(i),Z(i)` all finite real vectors). Our chosen explicit probability model for generating the observed data `x(i),y(i)` and unobserved `e(i)` is the following. We pick a length-`n` sequence of integers `s(1),...,s(n)`, where each `s(i)` is picked uniformly and independently from `1...k`, and add a bit of unique noise. Our sample data is then (only `x(i),y(i)` are observed; `e(i)` is an unobserved notional quantity):

```
(x(i),y(i),e(i)) = (X(s(i)),Y(s(i))+t(i),Z(s(i))+t(i)) for i=1...n,
where t(i) is an independent normal variable with mean 0 and variance 1
```

This is similar to a standard statistical model (empirical re-sampling from a fixed set, and designed to be similar to a sampling distribution). `Z(i)` represents an idealized error term and `e(i)` represents a per-sample unobserved realization of `Z(i)`. It is a nice model because the `e(i)` are independently identically distributed (and so are the `x(i)` and `y(i)`, though obviously there can be dependencies between the `x`s, `y`s, and `e`s). This model can be thought of as “too nice” as it isn’t powerful enough to capture the full power of the Gauss-Markov theorem (it can’t express non-independent identically distributed situations). However it can concretely embody situations that do meet the Gauss-Markov conditions and be used to work clarifying examples.

**Good examples under the frequentist probability model**

Let’s see what conditions on `X(i),Y(i),Z(i) i=1...k` are needed to meet the Gauss-Markov pre-conditions assuming a frequentist probability model.

- The first one is easy: `E[e(i)] = 0` if and only if `sum_{j=1...k} Z(j) = 0`.
- When we have `E[e(i)]=0` the second condition (homoscedasticity as stated) simplifies to `V[e(i)] = E[(e(i) - E[e(i)])^2] = E[e(i)^2] = E[Z^2] + 1`, which is independent of `i`.
- When we have `E[e(i)]=0` the third condition simplifies to `E[e(i) e(j)] = 0` for `i!=j`. This then follows immediately from our overly strong condition of the index selections `s(i)` being independent (giving us `E[e(i) e(j)] = E[e(i)] E[e(j)] = 0` for `i!=j`).

So all we need is `sum_{j=1...k} Z(j) = 0`, and then the other conditions hold. This seems too easy, and is evidence that the frequentist probability model is not the model intended by Wikipedia. We will confirm this with a specific counter-example later.

**Good examples under the x-generative probability model**

Under the x-generative probability model (and this is *not* standard terminology) the Wikipedia conditions are more properly written conditionally:


```
E[e(i)|x(i)] = 0
V[e(i)|x(i)] = c
cov[e(i),e(j)|x(i),x(j)] = 0 for i!=j
```


Or more precisely: if the conditions had been written in their conditional form we wouldn’t have had to contrive a phrase like “x-generative model” to ensure the correct interpretation. These conditions are strict. Checking or ensuring these properties is a problem when `x` is continuous and we have a joint description of how `x,y,e` are generated (instead of a hierarchical one). These conditions as stated are strong enough to support the Gauss-Markov theorem, but are probably in fact stronger than the minimum or canonical conditions. But let’s see how they work.

To meet these conditions our `Z(i)` must pretty much be free of dependence on `x(i)` (even one snuck through the index `i`). This is somewhat unsatisfying, as our overly simple modeling framework (producing `x,y,e` from `X,Y,Z`) combined with these strong conditions doesn’t really model much more than identical independence (so does not capture the full breadth of the Gauss-Markov theorem). The frequentist conditions are too lenient to work, and the x-generative/conditioned conditions seem too strong (at least when combined with our simplistic source of examples).

**A good example**

The following R example (also available here) shows a data set generated under our framework where the Gauss-Markov theorem applies (under either probability model). In this case the true `y` is produced as an actual linear function of `x` plus iid (independent identically distributed) noise. This model meets the pre-conditions of the Gauss-Markov theorem (under both the frequentist and x-generative models). We observe that the empirical samples average out to the correct theoretical coefficients taken from the original universal population. All of the calculations are designed to match the quantities discussed in the Wikipedia derivations.

```
library(ggplot2)
workProblem <- function(dAll,nreps,name,sampleSize=10) {
  xAll <- matrix(data=c(dAll$x0,dAll$x1),ncol=2)
  cAll <- solve(t(xAll) %*% xAll) %*% t(xAll)
  beta <- as.numeric(cAll %*% dAll$y)
  betaSamples <- matrix(data=0,nrow=2,ncol=nreps)
  nrows <- dim(dAll)[[1]]
  for(i in 1:nreps) {
    dSample <- dAll[sample.int(nrows,sampleSize,replace=TRUE),]
    individualError <- rnorm(sampleSize)
    dSample$y <- dSample$y + individualError
    dSample$e <- dSample$z + individualError
    xSample <- matrix(data=c(dSample$x0,dSample$x1),ncol=2)
    cSample <- solve(t(xSample) %*% xSample) %*% t(xSample)
    betaS <- as.numeric(cSample %*% dSample$y)
    betaSamples[,i] <- betaS
  }
  d <- c()
  for(i in 1:(dim(betaSamples)[[1]])) {
    coef <- paste('beta',(i-1),sep='')
    mean <- mean(betaSamples[i,])
    dev <- sqrt(var(betaSamples[i,])/nreps)
    d <- rbind(d,data.frame(nsamples=nreps,model=name,coef=coef,
       actual=beta[i],est=mean,estP=mean+2*dev,estM=mean-2*dev))
  }
  d
}
repCounts <- as.integer(floor(10^(0.25*(4:24))))
print('good example')
## [1] "good example"
set.seed(2623496)
dGood <- data.frame(x0=1,x1=0:10)
dGood$y <- 3*dGood$x0 + 2*dGood$x1
dGood$z <- dGood$y - predict(lm(y~0+x0+x1,data=dGood))
print(dGood)
## x0 x1 y z
## 1 1 0 3 -9.326e-15
## 2 1 1 5 -7.994e-15
## 3 1 2 7 -7.105e-15
## 4 1 3 9 -5.329e-15
## 5 1 4 11 -5.329e-15
## 6 1 5 13 -3.553e-15
## 7 1 6 15 -1.776e-15
## 8 1 7 17 -3.553e-15
## 9 1 8 19 0.000e+00
## 10 1 9 21 0.000e+00
## 11 1 10 23 0.000e+00
print(summary(lm(y~0+x0+x1,data=dGood)))
## Warning: essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ 0 + x0 + x1, data = dGood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77e-15 -1.69e-15 -5.22e-16 4.48e-16 6.53e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x0 3.00e+00 1.58e-15 1.9e+15 <2e-16 ***
## x1 2.00e+00 2.67e-16 7.5e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.8e-15 on 9 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.47e+32 on 2 and 9 DF, p-value: <2e-16
print(workProblem(dGood,10,'good/works',10000))
## nsamples model coef actual est estP estM
## 1 10 good/works beta0 3 3.006 3.016 2.995
## 2 10 good/works beta1 2 1.999 2.001 1.997
pGood <- c()
set.seed(2623496)
for(reps in repCounts) {
  pGood <- rbind(pGood,workProblem(dGood,reps,'goodData'))
}
ggplot(data=pGood,aes(x=nsamples)) +
geom_line(aes(y=actual)) +
geom_line(aes(y=est),linetype=2,color='blue') +
geom_ribbon(aes(ymax=estP,ymin=estM),alpha=0.2,fill='blue') +
facet_wrap(~coef,ncol=1,scales='free_y') + scale_x_log10() +
theme(axis.title.y=element_blank())
```

Notice the code is using the “return data frames” principle. The derived graph shows what we expect from an unbiased low-variance estimate: convergence to the correct values as we increase the number of repetitions.

**A bad example**

The following R example meets all of the *Wikipedia stated* conditions of the Gauss-Markov theorem under a frequentist probability model, but doesn’t even exhibit unbiased estimates on small samples, let alone minimal variance. It does produce correct estimates on large samples (so one could work with it): the ideal distribution and large samples are unbiased (though with some ugly structure), yet small samples appear biased.

This bad example is essentially given as `y = x^2`, where we haven’t made `x^2` available to the model (only `x`). So this data set doesn’t actually follow the assumed linear modeling structure. However, we can be sophists and claim the effect to model is `y = 10*x - 15 + e` (which is linear in the features we are making available) and the error term is in fact `e = x^2 - 10*x + 15 + individualError` (which does have an expected value of zero when `x` is sampled uniformly from the integers `0...10`).
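We can quickly confirm the claimed error term really does sum to zero over the universe (a one-off check, not part of the original simulation):

```r
# Check that the structured error term x^2 - 10*x + 15 sums to zero
# over the universe x = 0...10 (the frequentist E[e(i)] = 0 condition).
x <- 0:10
z <- x^2 - 10*x + 15
sum(z)
## [1] 0
```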

This data set is designed to slip past the Gauss-Markov theorem pre-conditions under the frequentist interpretation. As we have shown, all we need to do is check that `sum_{k} Z(k)` is zero, and the rest of the properties follow. In our case we have `sum_{k} Z(k) = sum_{x=0...10} (x^2 - 10*x + 15) = 0`. This data set does not slip past the Gauss-Markov theorem pre-conditions under the x-generative model, as the obviously structured error term is exactly what those conditions are designed to prohibit/avoid. This sets us up for the following syllogism.

- This data set satisfies the Gauss-Markov theorem pre-conditions under the frequentist model.
- Our R simulation shows the data set doesn’t satisfy the conclusions of the Gauss-Markov theorem.
- We can then conclude the Gauss-Markov theorem pre-conditions can’t be based on the frequentist model.

We confirm this with the following R-simulation.


```
dBad <- data.frame(x0=1,x1=0:10)
dBad$y <- dBad$x1^2 # or y = -15 + 10*x1 with structured error
dBad$z <- dBad$y - predict(lm(y~0+x0+x1,data=dBad))
print('bad example')
## [1] "bad example"
print(dBad)
## x0 x1 y z
## 1 1 0 0 15
## 2 1 1 1 6
## 3 1 2 4 -1
## 4 1 3 9 -6
## 5 1 4 16 -9
## 6 1 5 25 -10
## 7 1 6 36 -9
## 8 1 7 49 -6
## 9 1 8 64 -1
## 10 1 9 81 6
## 11 1 10 100 15
print(summary(lm(y~0+x0+x1,data=dBad)))
##
## Call:
## lm(formula = y ~ 0 + x0 + x1, data = dBad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0 -7.5 -1.0 6.0 15.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x0 -15.000 5.508 -2.72 0.023 *
## x1 10.000 0.931 10.74 2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.76 on 9 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.959
## F-statistic: 128 on 2 and 9 DF, p-value: 2.42e-07
print(workProblem(dBad,10,'bad/works',10000))
## nsamples model coef actual est estP estM
## 1 10 bad/works beta0 -15 -14.92 -14.81 -15.023
## 2 10 bad/works beta1 10 9.99 10.01 9.971
print(sum(dBad$z*dBad$x0))
## [1] -7.816e-14
print(sum(dBad$z*dBad$x1))
## [1] -1.013e-13
pBad <- c()
set.seed(2623496)
for(reps in repCounts) {
  pBad <- rbind(pBad,workProblem(dBad,reps,'badData'))
}
ggplot(data=pBad,aes(x=nsamples)) +
geom_line(aes(y=actual)) +
geom_line(aes(y=est),linetype=2,color='blue') +
geom_ribbon(aes(ymax=estP,ymin=estM),alpha=0.2,fill='blue') +
facet_wrap(~coef,ncol=1,scales='free_y') + scale_x_log10() +
theme(axis.title.y=element_blank())
```


Notice that even when we drive the number of repetitions high enough to collapse the error bars, we still have one of the coefficient estimates routinely below its ideal value. This is what a biased estimation procedure looks like. Again, it isn’t strictly correct to say the problem is due to heteroscedasticity, as we are seeing bias (not just systematic changes in the magnitude of variation).

The reason the average of small samples retains bias in this example is that least squares fitting is a non-linear function of the `x`s (it is only linear in the `y`s). Without an additional argument (such as the Gauss-Markov theorem) to appeal to, there is no a priori reason to believe an average of non-linear estimates will converge to the original population values. However, we feel it is much easier to teach a conclusion like this from stronger assumptions, such as independent identically distributed errors, than from homoscedasticity. The gain in generality from basing inference on homoscedasticity is not really so large, and the loss in clarity is expensive. The main downside of basing inference on independent identically distributed errors appears to be: you get accused of not knowing the Gauss-Markov theorem.
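This non-linearity in the `x`s is easy to demonstrate directly (an illustrative sketch: estimates add when the `y`s add, but do not average when the `x`s average):

```r
# Least squares beta = (X'X)^{-1} X' y is linear in y ...
set.seed(352)
x <- runif(20)
y1 <- rnorm(20); y2 <- rnorm(20)
b_sum   <- coef(lm(y1 + y2 ~ x))
b_parts <- coef(lm(y1 ~ x)) + coef(lm(y2 ~ x))
max(abs(b_sum - b_parts))   # essentially zero: estimates add when ys add

# ... but not linear in x
y  <- rnorm(20)
xA <- runif(20); xB <- runif(20)
b_mix <- coef(lm(y ~ I((xA + xB)/2)))
b_avg <- (coef(lm(y ~ xA)) + coef(lm(y ~ xB)))/2
max(abs(b_mix - b_avg))     # not zero: estimates do not average when xs average
```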

**What is homoscedasticity/heteroscedasticity**

Heteroscedasticity is a general *undesirable* modeling situation where the variability of some of your variables changes from sub-population to sub-population. That is what the Wikipedia requirement is trying to get at with `V[e(i)]=c`. However, as we move from informal text definitions to actual strict mathematics, we have to precisely specify: what is varying with respect to what, and which sub-populations do we consider identifiable?

Also be aware that while data with structured errors (the sign of errors being somewhat predictable from the `x`s, or even from omitted variables) cannot be homoscedastic, it is not traditional to call such situations heteroscedastic (but instead to point out the structural error and say that in the presence of such problems the homoscedastic/heteroscedastic question does not apply).

We would also point out that B.S. Everitt’s “The Cambridge Dictionary of Statistics” 2nd edition does not have primary entries for homoscedastic or heteroscedastic. Our opinion is not that Everitt forgot them or did not know of them. Rather, Everitt likely found that the criticism he would get for leaving these entries out of his dictionary would be less than the loss of clarity/conciseness that would come from including them (and the verbiage needed to respect their detailed historic definitions and conventions).

For our part: we have come to regret ever having used the term “heteroscedasticity” (which we have only attempted out of respect to our sources, which use the term). It is far simpler to introduce an ad-hoc term like *structural errors* and supply a precise definition and examples of what is meant in concise mathematical notation. What turns out to be complicated is using standard statistical terminology, which comes with a lot of conventions and historic linguistic baggage. Part of the problem is of course that our own background is mathematics, not statistics. In mathematics, term definitions tend to be revised to fit use and intended meaning, instead of being frozen to document priority (as is more common in the sciences).

**Summary/conclusions**

Many probability/statistical write-ups fail to explicitly identify what probability model actually underlies operators such as `E[]`, `V[]`, and `cov[]`. This is for brevity and is pretty much the standard convention. Common probability models to consider include: frequentist (all parameters held constant and data regenerated), Bayesian (all observables held constant and probability statements are over distributions of unobserved quantities and parameters), and ad-hoc generative/conditional distributions (as we used here). The issue is: different probability models give different answers. Usually this is not a problem because, by the same token, probability models encode so much about intent that you can usually infer the right one from knowing that intent.

Most common sampling questions use a frequentist model/interpretation (for example see Bayesian and Frequentist Approaches: Ask the Right Question). The issue is: under that rubric the statement that there is a `c` such that `V[e(i)] = c` doesn’t carry a lot of content. What is probably meant/intended are strong conditional distribution statements like `E[e(i)|x(i)]=0` and `V[e(i)|x(i)]=c`. A quick proof analysis shows the derivations in the Wikipedia article are definitely pushing the `E[]` operator through the `X`s as if the `X`s were constants independent of the sample/experiment. This is not correct in general (as our bad example showed), but is a legitimate step if all operators are conditioned on `X` (though again, that is a fairly strong condition).

Part of this is just a reminder that Wikipedia is an encyclopedia, not a primary source. The other part is: don’t let statistical bullies force you away from clear thoughts and definitions.

For example: it is considered vulgar or ignorant to assume something as strong as independent identically distributed errors. The feeling is: the conclusion of the Gauss-Markov theorem gives facts about only the first two moments of a distribution, so the invoked pre-conditions should only use facts about the first two moments of any input distributions. But philosophically, assuming identical errors makes sense: errors we can’t tell apart in some sense *must* be treated as identical (as we can’t tell them apart). A data scientist, if asked why they believe the residuals hidden in their data may be homoscedastic, is more likely to appeal to some sort of assumed independent generative structure in their problem (which is itself not as weak or as general as homoscedasticity) than to point to an empirical test of homoscedasticity (which can itself be unreliable).

A lot tends to be going on in statistics papers (probabilities, interpretation, reasoning over counterfactuals, math, and more) so expect technical terminology (or even argot), implied conventions, and telegraphic writing. Correct comprehension often requires introducing and working your own examples.

- Missing values (`NA` or blanks)
- Problematic numerical values (`Inf`, `NaN`, sentinel values like 999999999 or -1)
- Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.

**Missing Values; Missing Category Levels**

First, we’ll look at what to do when there are missing values or NAs in the data, and how to guard against category levels that don’t appear in the training data. Let’s make a small example data set that manifests these issues.

```
set.seed(9394092)
levels = c('a', 'b', 'c', 'd')
levelfreq = c(0.3, 0.3, 0.3, 0.1)
means = c(1, 6, 2, 7)
names(means) = levels
NArate = 1/30

X = sample(levels, 200, replace=TRUE, prob=levelfreq)
Y = rnorm(200) + means[X]

train = data.frame(x=X[1:150], y=Y[1:150], stringsAsFactors=FALSE)
test = data.frame(x=X[151:200], y=Y[151:200], stringsAsFactors=FALSE)

# remove a level from training
train = subset(train, x != 'd')

# sprinkle in some NAs
ntrain = dim(train)[1]; ntest = dim(test)[1]
train$x = ifelse(runif(ntrain) < NArate, NA, train$x)
test$x = ifelse(runif(ntest) < NArate, NA, test$x)

table(train$x)
##  a  b  c
## 40 44 42
sum(is.na(train$x))
## [1] 4
sum(is.na(test$x))
## [1] 2
```

This simulates a situation where a rare level failed to be collected in the training data. In addition, we’ve simulated a missing value mechanism. In this example, it’s a “faulty sensor” mechanism (missing values show up at random, as if a sensor were intermittently and randomly failing), though in general it may also be a systematic mechanism, where the `NA` means something specific, like the measurement not applying (say, "most recent pregnancy date" for a male subject).

We can build a linear regression model for predicting `y` from `x`:

```
# build a model
model1 = lm("y~x", data=train)
train$pred = predict(model1, newdata=train)  # this works
predict(model1, newdata=test)                # this fails
## Error in model.frame.default(Terms, newdata, na.action = na.action,
##   xlev = object$xlevels) : factor x has new levels d
```

The model fails on the holdout data because the new data has a value of `x` that was not observed in the training data. You can always refuse to predict in such cases, of course, but in some situations even a not-so-good prediction may be better than no prediction at all. Note also that `lm` quietly omitted the rows where `x` was missing while training, and the resulting model will return `NA` as the predicted outcome in such cases. This is again perfectly reasonable, but not always what you want, especially in cases where a large fraction of the data has missing values.

Are there alternative ways to handle these issues? If `NA`s show up in the data, the conservative assumption is that they are missing systematically; in this situation (when `x` is a categorical value), we can then treat them as just another category value, for example by pretreating the variable to convert `NA` to "Unknown." When novel values show up in the test data (or when `NA`s appear in the holdout data, but not in the training data), the best assumption we can make is that the novel value is in fact one of the values that we have already observed, with the probability of being any given value proportional to the training set frequencies.
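The idea can be sketched as follows (a simplified illustration of the encoding principle, not the actual `vtreat` implementation; `encodeLevel` is a made-up helper name):

```r
# Encode a categorical value as indicator columns; a level unseen in
# training (or an unexpected NA) is spread across the known levels
# in proportion to the training frequencies.
encodeLevel <- function(val, trainLevels, trainFreqs) {
  if(!is.na(val) && val %in% trainLevels) {
    as.numeric(trainLevels == val)   # ordinary one-hot indicator
  } else {
    trainFreqs / sum(trainFreqs)     # novel value: proportional blend
  }
}

trainLevels <- c('a', 'b', 'c')
trainFreqs  <- c(40, 44, 42)         # counts observed in training
encodeLevel('b', trainLevels, trainFreqs)
## [1] 0 1 0
encodeLevel('d', trainLevels, trainFreqs)  # a blend summing to 1
```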

We've implemented these data treatments, and others, in an R package called `vtreat`. The package is very much at the alpha stage, and is not yet available on CRAN; we'll explain how you can get the package later on in the post. For now, let's see how it works.

The first step is to use the training data to create a set of variable treatments, one for each variable of interest.

```
library(vtreat)  # our library, not public; we'll show how to install later
treatments = designTreatmentsN(train, c("x"), "y")
```

The function `designTreatmentsN()` takes as input the data frame of training data, the list of input columns, and the (numerical) outcome column. There is a similar function `designTreatmentsC()` for binary classification problems. The output of the function is a list of variable treatment objects (of class `treatmentplan`), one per input variable.

```
treatments
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'->'x_lev_NA','x_lev_x.a','x_lev_x.b','x_lev_x.c')"
##
##
## $vars
## [1] "x_lev_NA"  "x_lev_x.a" "x_lev_x.b" "x_lev_x.c"
##
## $varScores
##  x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c
##    1.0310    0.6948    0.2439    0.8959
##
## $varMoves
##  x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c
##      TRUE      TRUE      TRUE      TRUE
##
## $outcomename
## [1] "y"
##
## $meanY
## [1] 3.246
##
## $ndat
## [1] 130
##
## attr(,"class")
## [1] "treatmentplan"
```

The `vars` field of a `treatmentplan` object gives the names of the new variables that were formed from the original variable: a categorical variable like `x` is converted to several indicator variables, one for each known level of `x` -- including `NA`, if it is observed in the training data. `varMoves` is TRUE if the new variable in question varies (that is, if it has more than one value in the training data). `meanY` is the base mean of the outcome variable (unconditioned on the inputs). `ndat` is the number of data points.

The field `varScores` is a rough indicator of variable importance, based on the PRESS statistic. The PRESS statistic of a model is the sum of the squared hold-one-out prediction errors: that is, the sum of `(y_i - f_i)^2`, where `y_i` is the outcome corresponding to the ith data point, and `f_i` is the prediction of the model built by using all the training data *except* the ith data point. We calculate the `varScore` of the jth input variable `x_j` to be the PRESS statistic of the one-dimensional linear regression model that uses only `x_j`, divided by the PRESS statistic of the unconditioned mean of `y`. A varScore of 0 means the model predicts perfectly. A varScore close to one means that the variable predicts only about as well as the global mean; a varScore above 1 means that the model predicts the outcome worse than the global mean. So the lower the varScore, the better. You can use `varScores` to prune uninformative variables, as we will show later.
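The varScore calculation can be sketched as follows (our own simplified rendering of the described procedure, not the package's code; `pressStat` is a made-up helper name):

```r
# PRESS: sum of squared hold-one-out prediction errors.
pressStat <- function(x, y, useX=TRUE) {
  n <- length(y)
  errs <- vapply(seq_len(n), function(i) {
    if(useX) {
      # model fit on all points except the ith, then predict the ith
      fit <- lm(y ~ x, data=data.frame(x=x[-i], y=y[-i]))
      as.numeric(y[i] - predict(fit, newdata=data.frame(x=x[i])))
    } else {
      y[i] - mean(y[-i])   # hold-one-out global mean
    }
  }, numeric(1))
  sum(errs^2)
}

set.seed(114)
x <- runif(50)
y <- 2*x + rnorm(50, sd=0.5)
varScore <- pressStat(x, y, TRUE) / pressStat(x, y, FALSE)
varScore   # well below 1 here, since x is informative
```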

Once you have created the treatment plans using `designTreatmentsN()`, you can treat the training and test data frames using the function `prepare()`. This creates new data frames that express the outcome in terms of the new transformed variables. `prepare()` takes as input a list of treatment plans and a data set to be treated. The optional argument `pruneLevel` lets you specify a threshold for `varScores`; variables with a varScore higher than `pruneLevel` will be eliminated. By default, `prepare()` will prune away any variables with a varScore greater than 0.99; we will use `pruneLevel=NULL` to force `prepare()` to create all possible variables.

```
# pruneLevel=NULL turns pruning OFF
train.treat = prepare(treatments, train, pruneLevel=NULL)
test.treat = prepare(treatments, test, pruneLevel=NULL)

train.treat[1:4,]
##   x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c     y
## 1        0         0         1         0 7.037
## 2        0         0         0         1 1.209
## 3        0         0         0         1 2.819
## 4        0         0         0         1 2.099
subset(train.treat, is.na(train$x))  # similarly for test
##    x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c       y
## 12        1         0         0         0 -0.4593
## 48        1         0         0         0  6.4741
## 49        1         0         0         0  5.3387
## 81        1         0         0         0  2.2319
```

The listing above shows that instead of the training data frame `(x, y)`, we now have a training data frame with four `x` indicator variables, one for each of the known `x`-values "a", "b", and "c" -- plus `NA`. According to the listing, the first four values for `x` in the training data were `c("b", "c", "c", "c")`. `NA`s are encoded as the variable `x_lev_NA`.

We can see how `prepare()` handles novel values in the test data:

```
# when we encounter a new variable value, we assign it all levels,
# proportional to training set frequencies
subset(test.treat, test$x=='d')
##   x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c     y
## 8  0.03077    0.3077    0.3385    0.3231 4.622
```

Looking back at the process by which we generated `y`, we can see that in this case the "d" level isn't actually a proportional combination of the other levels; still, this is the best assumption in the absence of any other information. Furthermore, in the more common situation of multiple input variables, this assumption allows us to take advantage of information that is available through those other variables.

Now we can fit a model using the transformed variables:

```
# get the names of the x variables
vars = setdiff(colnames(train.treat), "y")
fmla = paste("y ~ ", paste(vars, collapse=" + "))
model2 = lm(fmla, data=train.treat)
summary(model2)
##
## Call:
## lm(formula = fmla, data = train.treat)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.856 -0.756 -0.026  0.782  3.078
##
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    2.103      0.168   12.54  < 2e-16 ***
## x_lev_NA       1.293      0.569    2.27  0.02461 *
## x_lev_x.a     -0.830      0.240   -3.46  0.00075 ***
## x_lev_x.b      4.014      0.234   17.13  < 2e-16 ***
## x_lev_x.c         NA         NA      NA       NA
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 126 degrees of freedom
## Multiple R-squared: 0.794,  Adjusted R-squared: 0.789
## F-statistic: 162 on 3 and 126 DF,  p-value: <2e-16
```

The significance levels of the variables are consistent with the variable importance scores we observed in the treatment plan. The fact that one of the levels is `NA`'d out is to be expected; four levels imply three degrees of freedom (plus the intercept). The standard practice is to omit one level of a categorical as redundant. We don't do this in our treatment plan, as regularized models can actually benefit from having the extra level left in. You will get warnings about possibly misleading fits when applying the model; in this case, we know how the variables were constructed, and that there are no hidden degeneracies in the variables (at least none that we created), so we can disregard the warning.

```
# you get warnings about rank-deficient fits
train.treat$pred = predict(model2, newdata=train.treat)
## Warning: prediction from a rank-deficient fit may be misleading
test.treat$pred = predict(model2, newdata=test.treat)  # works!
## Warning: prediction from a rank-deficient fit may be misleading

# no NAs
summary(train.treat$pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.27    1.27    2.10    3.25    6.12    6.12
summary(test.treat$pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.27    1.27    2.10    3.37    6.12    6.12

# note that this model gives the same answers on training data
# as the default model
sum(abs(train$pred - train.treat$pred), na.rm=TRUE)
## [1] 9.566e-13
```

The last command of the above listing confirms that on the training data, the model learned from the treated data is equivalent to the model learned on the original data. Now we can look at model accuracy.

```
rmse = function(y, pred) {
  se = (y-pred)^2
  sqrt(mean(se))
}

# model does well where it really has x values
with(subset(train, !is.na(x)), rmse(y, pred))
## [1] 0.973
# not too bad on NAs
with(train.treat, rmse(y, pred))
## [1] 1.07
# model generalizes well on levels it's observed
with(subset(test.treat, test$x != "d"), rmse(y, pred))
## [1] 1.08
# less well on novel values
with(test.treat, rmse(y, pred))
## [1] 1.272
subset(test.treat, test$x=='d')[,c("y", "pred")]
##       y  pred
## 8 4.622 3.246
```

As expected, the model does not perform as well on novel data values (`x` = "d"), but at least it returns a prediction without crashing. Furthermore, if the novel levels are rare (as we would expect), then predicting them poorly will not affect the overall performance of the model too much.

Let's try preparing the data with the default pruning parameters (`pruneLevel=0.99`):

```
train.treat = prepare(treatments, train)
test.treat = prepare(treatments, test)

# The x_lev_NA variable has been pruned away
train.treat[1:4,]
##   x_lev_x.a x_lev_x.b x_lev_x.c     y
## 1         0         1         0 7.037
## 2         0         0         1 1.209
## 3         0         0         1 2.819
## 4         0         0         1 2.099
# NAs are now encoded as (0,0,0)
subset(train.treat, is.na(train$x))
##    x_lev_x.a x_lev_x.b x_lev_x.c       y
## 12         0         0         0 -0.4593
## 48         0         0         0  6.4741
## 49         0         0         0  5.3387
## 81         0         0         0  2.2319
# d is now encoded as the relative frequencies of a, b, and c
subset(test.treat, test$x=='d')
##   x_lev_x.a x_lev_x.b x_lev_x.c     y
## 8    0.3077    0.3385    0.3231 4.622
```

We no longer keep `NA` as a level, because it's not any more informative than the global mean; novel levels are still encoded as "all the known levels," proportionally weighted. If we use this data representation to model, we don't have a rank-deficient fit.

```
vars = setdiff(colnames(train.treat), "y")
fmla = paste("y ~ ", paste(vars, collapse=" + "))
model2 = lm(fmla, data=train.treat)
summary(model2)
##
## Call:
## lm(formula = fmla, data = train.treat)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.856 -0.756 -0.026  0.782  3.078
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.396      0.543    6.25  5.8e-09 ***
## x_lev_x.a     -2.123      0.570   -3.73  0.00029 ***
## x_lev_x.b      2.721      0.567    4.80  4.5e-06 ***
## x_lev_x.c     -1.293      0.569   -2.27  0.02461 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 126 degrees of freedom
## Multiple R-squared: 0.794,  Adjusted R-squared: 0.789
## F-statistic: 162 on 3 and 126 DF,  p-value: <2e-16
```

The model performance is similar to that of the model that included `x_lev_NA`.

```
train.treat$pred = predict(model2, newdata=train.treat)
test.treat$pred = predict(model2, newdata=test.treat)
sum(abs(train$pred - train.treat$pred), na.rm=TRUE)
## [1] 6.297e-13
with(train.treat, rmse(y, pred))
## [1] 1.07
with(test.treat, rmse(y, pred))
## [1] 1.272
```

**Numerical variables and Categorical variables with many levels**

The above examples looked at data treatment for a simple categorical variable with a moderate number of levels, some possibly missing. There are two other cases to consider. First, we would like basic data treatment for numerical variables, to protect against bad values like `NA`, `NaN`, or `Inf`.

Second, we'd like to gracefully manage categorical variables with a large number of possible levels, such as ZIP code, telephone area code, or even city or other geographical region. Such categorical variables can be problematic because they introduce computational or data size issues for some modeling algorithms. For example, the size of the design matrix when computing linear or logistic regression models grows as the square of the number of variables -- and a categorical variable with `N` levels is represented as `N-1` indicator variables. The `randomForest` implementation in R cannot handle categorical variables with more than 32 levels. Categoricals with a large number of levels are also a problem because it is more likely that some of the rarer levels will not appear in the training set, triggering the "novel level" problem on new data: if only a few of your customers come from Alaska or Rhode Island, then those states may not show up in your training set -- but they may show up when you deploy the model to your website.
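A quick way to see the indicator blow-up (an illustrative sketch using base R's `model.matrix`):

```r
# A factor with many levels expands into many indicator columns in the
# design matrix used by lm()/glm(): one intercept plus N-1 dummies.
d <- data.frame(zip=factor(paste('z', 1:50, sep='')), y=rnorm(50))
dim(model.matrix(y ~ zip, data=d))
## [1] 50 50
```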

There are often domain specific ways to handle categories with many levels. For example, a common trick with zip codes is to map them to a new variable whose value is related to zip code and relevant to the problem, such as average household income within that zip code. Obviously, this mapping won't be appropriate in all situations, so it's good to have an automatic procedure to fall back on.

Previously, we've discussed a technique that we call "impact coding" to manage this issue. We discuss this technique here and here; see also Chapter 6 of *Practical Data Science with R*. Impact coding converts a categorical variable `x_cat` into a numerical variable that corresponds to a one-variable Bayesian model for the outcome as a function of `x_cat`. The `vtreat` library implements impact coding as discussed in those posts, with a few improvements.
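The core idea can be sketched in a few lines. This is a simplified version for intuition only, not `vtreat`'s actual implementation (which adds smoothing and other improvements): each level of `x` is replaced by the conditional mean of `y` for that level, minus the grand mean of `y`.

```r
# Simplified impact coding: level -> mean(y | x = level) - mean(y)
d = data.frame(x = c("a","a","b","b","b","c"),
               y = c(1, 3, 10, 12, 14, 5),
               stringsAsFactors = FALSE)
grandMean = mean(d$y)               # 7.5
condMeans = tapply(d$y, d$x, mean)  # a: 2, b: 12, c: 5
d$x_impact = condMeans[d$x] - grandMean
# level "a" codes to -5.5, "b" to 4.5, "c" to -2.5
```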

Let's build another simple example, to demonstrate impact coding and the treatment of numerical variables.

```r
N = 100  # a variable with 100 levels
levels = paste('gp', 1:N, sep='')
fhi = c(0.15, 0.1, 0.1)  # the first three levels account for 35% of the data
fx = sum(fhi)/(N-length(fhi))
levelfreq = c(fhi, numeric(N-length(fhi))+fx)
means = sample.int(10, size=N, replace=TRUE)
names(means) = levels

X = sample(levels, 200, replace=TRUE, prob=levelfreq)
U = rnorm(200, mean=0.5)  # numeric variable
Y = rnorm(200) + means[X] + U

length(unique(X))  # the data set is missing levels
## [1] 68

train = data.frame(x=X[1:150], u=U[1:150], y=Y[1:150],
                   stringsAsFactors=FALSE)
test = data.frame(x=X[151:200], u=U[151:200], y=Y[151:200],
                  stringsAsFactors=FALSE)

# sprinkle a few NAs into u (for demonstration purposes)
train$u = ifelse(runif(150) < 0.01, NA, train$u)

length(setdiff(unique(test$x), unique(train$x)))  # and test has some levels train doesn't
## [1] 11
```

The `designTreatmentsN` function has two parameters that control when a categorical variable is impact coded. The parameter `minFraction` (default value: 0.02) controls what fraction of the time an indicator variable has to be "on" (that is, not zero) to be used (this is separate from the `pruneLevel` parameter in `prepare`). The purpose is to eliminate rare variables or rare levels. By default, we eliminate variables that are on less than 2% of the time.

When a categorical variable has a large number of levels, it's likely that many of them will be on less than 2% of the time. In that case, the corresponding indicator variables are eliminated, and all of those rare levels will encode to `c(0, 0, ...)`, in the way the `NA` level did in our second example above. Let's call the fraction of the data that gets encoded to zero due to rare levels the fraction of the data that we "lose". The parameter `maxMissing` (default value: 0.04) specifies what fraction of the data we are allowed to "lose" before automatically switching to an impact-coded variable. By default, if the eliminated levels correspond to more than 4% of the data, then the treatment plan will switch to impact coding.
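The bookkeeping described above can be sketched directly. This is illustrative only; `vtreat` performs the equivalent accounting internally with its own data structures.

```r
# 100 observations: two common levels and four rare ones (1% each)
x = c(rep("a", 50), rep("b", 46), "r1", "r2", "r3", "r4")

freqs = prop.table(table(x))          # frequency of each level
rareLevels = names(freqs)[freqs < 0.02]  # below the minFraction threshold
fractionLost = sum(freqs[rareLevels])    # data "lost" by dropping them

rareLevels    # "r1" "r2" "r3" "r4"
fractionLost  # 0.04 -- right at the default maxMissing boundary
```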

In the example above, three levels of the variable `x` account for 35% of the data, so each of the other levels accounts for roughly `(1-0.35)/97 = 0.0067` of the data, or less than 1% of the mass each. So all of those 97 levels would be eliminated, and we would "lose" 65% of the data if we kept the categorical representation! Therefore, the data treatment automatically converts `x` to an impact-coded variable.

```r
#
# create the treatment plan.
#
treatments = designTreatmentsN(train, c("x", "u"), "y")
treatments
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Scalable Impact Code'('x'->'x_catN')"
##
## $treatments[[2]]
## [1] "vtreat 'Scalable pass through'('u'->'u_clean')"
##
## $treatments[[3]]
## [1] "vtreat 'is.bad'('u'->'u_isBAD')"
##
##
## $vars
## [1] "x_catN"  "u_clean" "u_isBAD"
##
## $varScores
##  x_catN u_clean u_isBAD
##  0.1717  0.9183  1.0116
##
## $varMoves
##  x_catN u_clean u_isBAD
##    TRUE    TRUE    TRUE
##
## $outcomename
## [1] "y"
##
## $meanY
## [1] 5.493
##
## $ndat
## [1] 150
##
## attr(,"class")
## [1] "treatmentplan"
```

The variable `x_catN` is the impact-coded variable corresponding to `x`. If we refer to the mean of `y` conditioned on `x` as `y|x`, and to `meanY` as the grand (unconditioned) mean of `y`, then `x_catN = y|x - meanY`. Note that `x_catN` has a low `varScore`, indicating that it is a good, informative variable.

The variable `u_clean` is the numerical variable `u`, with all "bad" values (`NA`, `NaN`, `Inf`) converted to the mean of the "non-bad" values of `u` (we'll call this the "clean mean" of `u`). The variable `u_isBAD` is an indicator variable that is one whenever `u` is bad. If the bad values are due to a "faulty sensor" (that is, if they occur at random), then converting to the clean mean of `u` is the right thing to do. If the bad values are systematic, then `u_isBAD` can be used by the modeling algorithm to adjust for the systematic effect (assuming it survives the pruning -- which, in this case, it won't).
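A simplified sketch of this numeric treatment (again, for intuition only, not `vtreat`'s actual code):

```r
# Replace bad values (NA, NaN, Inf) with the "clean mean"
# and record where they occurred in an indicator column.
u = c(1, 2, NA, 4, Inf, 3)
bad = is.na(u) | is.nan(u) | is.infinite(u)
cleanMean = mean(u[!bad])          # mean of 1, 2, 4, 3 = 2.5
u_clean = ifelse(bad, cleanMean, u)
u_isBAD = as.numeric(bad)          # 0 0 1 0 1 0
```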

We can see how this works concretely by preparing the test and training sets.

```r
train.treat = prepare(treatments, train)
test.treat = prepare(treatments, test)

train.treat[1:5,]  # isBAD column didn't survive
##     x_catN u_clean     y
## 1  0.04809  1.0749 5.328
## 2  1.37053 -0.4429 5.413
## 3 -2.32535  1.4380 2.372
## 4 -5.02863  0.6611 0.464
## 5 -1.04404  0.8327 4.449

# ------------------------
# "bad" u values map to the "clean mean" of u
# ------------------------
train.treat[is.na(train$u),]
##     x_catN u_clean     y
## 74   1.371  0.5014 5.269
## 133  3.053  0.5014 8.546

# compare to u_clean, above
mean(train$u, na.rm=TRUE)
## [1] 0.5014

# -----------------------
# confirm (x_catN | x = xlevel) is mean(y | x=xlevel) - mean(y)
# -----------------------
subset(train.treat, train$x==levels[1])[1:2,]
##    x_catN u_clean     y
## 3  -2.325  1.4380 2.372
## 15 -2.325  0.0661 2.381

# compare to x_catN, above
mean(subset(train, x==levels[1])$y) - mean(train$y)
## [1] -2.325

# -----------------------
# missing levels map to 0, which is equivalent to
# mapping them to all known levels proportional to frequency
# -----------------------
missingInTest = setdiff(unique(test$x), unique(train$x))
subset(test.treat, test$x %in% missingInTest)[1:2,]
##       x_catN u_clean     y
## 1  4.737e-16  1.3802 1.754
## 13 4.737e-16  0.9062 6.862
```

Finally, we use the treated data to model.

```r
vars = setdiff(colnames(train.treat), "y")
fmla = paste("y ~ ", paste(vars, collapse=" + "))
model = lm(fmla, data=train.treat)
summary(model)
##
## Call:
## lm(formula = fmla, data = train.treat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.2077 -0.6131 -0.0113  0.5237  2.5923
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.1347     0.0800    64.2   <2e-16 ***
## x_catN        0.9846     0.0296    33.3   <2e-16 ***
## u_clean       0.7139     0.0760     9.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.862 on 147 degrees of freedom
## Multiple R-squared:  0.894, Adjusted R-squared:  0.892
## F-statistic:  617 on 2 and 147 DF,  p-value: <2e-16

train.treat$pred = predict(model, newdata=train.treat)
test.treat$pred = predict(model, newdata=test.treat)
with(train.treat, rmse(y,pred))
## [1] 0.8529
with(test.treat, rmse(y,pred))
## [1] 1.964

# evaluate only on the known levels
with(subset(test.treat, test$x %in% unique(train$x)), rmse(y, pred))
## [1] 1.138
```

As you can see, the model performs better on categories that it saw during training, but it still handles novel levels gracefully -- and remember, some modeling algorithms can't handle a large number of categories at all.

That describes the most basic data treatment procedures that our package implements. For binary classification and logistic regression problems, the package has another function, `designTreatmentsC()`, which creates treatment plans when the outcome is a binary class variable.

**Loading the vtreat package**

We have made `vtreat` available on GitHub; remember, this is an alpha package, so it will be rough around the edges. To install the package, download the `vtreat` tar file (at this writing, `vtreat_0.2.tar.gz`), as shown in the figure below:

Once you've downloaded it, you can install it from the R command line, as you would any other package. If your R working directory is the same directory where you've downloaded the tar file, then the command looks like this:

```r
install.packages('vtreat_0.2.tar.gz', repos=NULL, type='source')
```

Once it's installed, `library(vtreat)` will load the package. Type `help(vtreat)` to get a short description of how to use the package, along with some example code snippets.

`vtreat` has a few more features that we will cover in future posts, but this post has given you enough to get you started. Remember, automatic data treatment procedures are not a substitute for inspecting and exploring your data before modeling. However, once you've gotten a feel for the data, you will find that the procedures we have implemented are applicable to a wide variety of situations.

If you try the package, please do send along feedback, including any errors or bugs that you might discover.

For more on data treatment, see Chapter 4 of *Practical Data Science with R*.

Nina Zumel also examines aspects of the supernatural in literature and in folk culture at her blog, multoghost.wordpress.com. She writes about folklore, ghost stories, weird fiction, or anything else that strikes her fancy. Follow her on Twitter @multoghost.
