The DWN paper is an interesting empirical study that measures the performance of a good number of popular classifiers (179 by their own account) on about 120 data sets (mostly from UCI).

This actually represents a bit of work, as the UCI data sets are not all in exactly the same format. The data sets have varying file names, varying separators, varying missing value symbols, varying quoting/escaping conventions, and non-machine-readable headers; some data sets have row-ids, the column to be predicted sits in varying positions, some data arrives in zip files, and there are many other painful variations. I have always described UCI as “not quite machine readable.” Working with any one data set is easy, but the prospect of building an adapter for each of a large number of such data sets is unappealing. Combined with the fact that the data sets are often of small size, and often artificial/synthetic (designed to show off one particular inference method), few people work with more than a few of these data sets. The authors of DWN worked with well over 100 *and* shared their fully machine-readable results (`.arff` and apparently standardized `*_R.dat` files) in a convenient single downloadable tar-file (see their paper for the URL).

The stated conclusion of the paper is comforting, and not entirely unexpected: random forest methods are usually in the top 3 classifiers in terms of accuracy.

The problem is: we are always more accepting of an expected outcome. To confirm such a conclusion we will, of course, need more studies (on larger and more industry-typical data sets), better measures than accuracy (see here for some details), and a lot of digging into methodology (including data preparation).

To be clear: I like the paper. The authors (as good scientists) publicly shared their data and a bit of their preparation code. This is something most authors do not do, and should in fact be our standard for accepting work for evaluation.

But, let us get down to quibbles. Let’s unpack the data and look at an example. Suppose we start with “car,” a synthetic data set we have often used as an example. The UCI repository supplies 3 files: car.c45-names, car.data, and car.names.

- `car.names`: free-form description of the data set and format.
- `car.data`: comma-separated data (without header).
- `car.c45-names`: presumably a machine-readable header for a `C4.5` package.

The standard way to deal with this data is to (by hand) inspect `car.names` or `car.c45-names` and hand-build a custom command to load the data. Example R code to do this is given below:

```
library(RCurl)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
tab <- read.table(text=getURL(url,write=basicTextGatherer()),
                  header=FALSE, sep=',')
colnames(tab) <- c('buying', 'maint', 'doors',
                   'persons', 'lug_boot', 'safety', 'class')
options(width=50)
print(summary(tab))
```

Which (assuming `RCurl` is properly installed) yields:

```
  buying       maint        doors      persons
 high :432   high :432   2    :432   2   :576
 low  :432   low  :432   3    :432   4   :576
 med  :432   med  :432   4    :432   more:576
 vhigh:432   vhigh:432   5more:432

  lug_boot    safety      class
 big  :576   high:576   acc  : 384
 med  :576   low :576   good :  69
 small:576   med :576   unacc:1210
                        vgood:  65
```

For any one data set having to read the documentation and adapt that into custom loading code is not a big deal. However, having to do this for over 100 data sets is an effort. Let’s look into how the DWN paper did this.

The DWN paper `car` directory has 9 items:

- `car.data`: original file from UCI.
- `car.names`: original file from UCI.
- `le_datos.m`: Matlab custom data loading code.
- `car.txt`: facts about the data set.
- `car.arff`: derived `.arff` format version of the data set.
- `car.cost`: pricing of classification errors.
- `car_R.dat`: derived standard tab-separated values file with header.
- `conxuntos.dat`: likely a result file.
- `conxuntos_kfold.dat`: likely a result file.

The files I am interested in are `car_R.dat` and `le_datos.m`. `car_R.dat` looks to be a TSV (tab separated values) file with header, likely intended to be read into R. It looks like the file is in a very regular format with row numbers, feature columns first (named `f*`), and the category to be predicted last (named `clase` and re-encoded as an integer). Notice that all features (which in this case were originally strings or factors) have been re-encoded as floating point numbers. That is potentially a problem. Let’s try to dig into how this conversion may have been done. We look into `le_datos.m` and see the following code fragment:

```
for i_fich=1:n_fich
  f=fopen(fich{i_fich}, 'r');
  if -1==f
    error('erro en fopen abrindo %s\n', fich{i_fich});
  end
  for i=1:n_patrons(i_fich)
    fprintf(2,'%5.1f%%\r', 100*n_iter++/n_patrons_total);
    for j = 1:n_entradas
      t= fscanf(f,'%s',1);
      if j==1 || j==2
        val={'vhigh', 'high', 'med', 'low'};
      elseif j==3
        val={'2', '3', '4', '5-more'};
      elseif j==4
        val={'2', '4', 'more'};
      elseif j==5
        val={'small', 'med', 'big'};
      elseif j==6
        val={'low', 'med', 'high'};
      end
      n=length(val); a=2/(n-1); b=(1+n)/(1-n);
      for k=1:n
        if strcmp(t,val{k})
          x(i_fich,i,j)=a*k+b; break
        end
      end
    end
    t = fscanf(f,'%s',1); % lectura da clase
    for j=1:n_clases
      if strcmp(t,clase{j})
        cl(i_fich,i)=j; break
      end
    end
  end
  fclose(f);
end
```

It looks like for each categorical variable the researchers have hand-coded an ordered choice of levels. Then each level is replaced by an equally spaced code-number from `-1` through `1` (using the linear rule `x(i_fich,i,j)=a*k+b`). Then (in code not shown) possibly more transformations are applied to the numeric variables (such as centering and scaling to unit variance). This changes the original data, which looks like this:

```
  buying maint doors persons lug_boot safety class
1  vhigh vhigh     2       2    small    low  unacc
2  vhigh vhigh     2       2    small    med  unacc
3  vhigh vhigh     2       2    small   high  unacc
4  vhigh vhigh     2       2      med    low  unacc
5  vhigh vhigh     2       2      med    med  unacc
6  vhigh vhigh     2       2      med   high  unacc
```

to this:

```
        f1       f2       f3       f4       f5       f6 clase
1 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439 -1.22439     1
2 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439        0     1
3 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439  1.22439     1
4 -1.34125 -1.34125 -1.52084 -1.22439        0 -1.22439     1
```
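The linear coding rule from `le_datos.m` is simple enough to sketch outside of Matlab. The following Python reconstruction (not the authors' code) maps an ordered list of levels to equally spaced codes from -1 to 1; the values seen in `car_R.dat` differ from these raw codes because of the later (unshown) centering and scaling step.

```python
def linear_level_codes(levels):
    """Map an ordered list of n levels to equally spaced codes in [-1, 1],
    mirroring the rule a*k+b with a = 2/(n-1), b = (1+n)/(1-n)."""
    n = len(levels)
    a = 2.0 / (n - 1)
    b = (1.0 + n) / (1.0 - n)
    return {level: a * k + b for k, level in enumerate(levels, start=1)}

# The hand-chosen level order for the 'buying' column of the car data set:
codes = linear_level_codes(['vhigh', 'high', 'med', 'low'])
# 'vhigh' -> -1.0, 'high' -> -1/3, 'med' -> 1/3, 'low' -> 1.0
```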

It appears as if one of the machine learning libraries the authors are using only accepts numeric features (I think some of the Python `scikit-learn` methods have this limitation), or the authors believe they are using such a package. Whoever prepared this data seemed to be unaware that the standard way to convert categorical variables to numeric is the introduction of multiple indicator variables (see page 33 of chapter 2 of Practical Data Science with R for more details).

Indicator variables encoding US Census reported levels of education.

The point is: encoding multiple levels of a categorical variable into a single number may seem reversible to a person (as it is a 1-1 map), but some machine learning methods cannot undo the geometric detail lost in such an encoding. For example: with a linear method (be it regression, logistic regression, a linear SVM, or so on) we lose explanatory power unless the encoding has properly guessed both the correct order of the attributes *and* the relative magnitudes. Even tree-based methods (like decision trees, or even random forests) waste part of their explanatory power (roughly, degrees of freedom) trying to invert the encoding (leaving less power remaining to explain the original relation in the data). This sort of ad-hoc encoding may not cause much harm in this one example, but it is exactly what you don’t want to do when there are a great number of levels, when the order isn’t obvious, or when you are comparing different methods (as different methods are damaged to different degrees by this encoding).
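To make the contrast concrete, here is a minimal pure-Python sketch of the indicator-variable alternative (illustrative only): each level gets its own 0/1 column, so no artificial order or spacing is imposed on the learner.

```python
def indicator_encode(values, levels):
    """Encode a categorical column as indicator (one-hot) columns:
    one 0/1 column per level, no artificial order or spacing."""
    return [[1 if v == level else 0 for level in levels] for v in values]

levels = ['vhigh', 'high', 'med', 'low']
rows = indicator_encode(['vhigh', 'med'], levels)
# rows == [[1, 0, 0, 0], [0, 0, 1, 0]]
```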

This sort of “convert categorical features through an arbitrary function” step is something we have seen a few times. It is one of the reasons we explicitly discuss indicator variables in “Practical Data Science with R,” despite the common wisdom that “everybody already knows about them.” When you are trying to get the best possible results for a client, you don’t want to inflict avoidable errors in your data transforms.

If you absolutely don’t want to use indicator variables, consider impact coding or a safe automated transform such as vtreat. In both cases the actual training data is used to try to estimate the order and relative magnitudes of an encoding that would be useful for downstream modeling.

Is there any actual damage in this encoding? Let’s load the processed data set and see.

```
url2 <- 'http://winvector.github.io/uciCar/car_R.dat'
dTreated <- read.table(url2,
                       sep='\t', header=TRUE)
```

The original data set supports a pretty good logistic regression model for unacceptable cars:

```
set.seed(32353)
train <- rbinom(dim(tab)[[1]],1,0.5)==1
m1 <- glm(class=='unacc'~buying+maint+doors+persons+lug_boot+safety,
          family=binomial(link='logit'),
          data=tab[train,])
tab$pred <- predict(m1,newdata=tab,type="response")
print(table(class=tab[!train,'class'],
            unnacPred=tab[!train,'pred']>0.5))
##        unnacPred
## class   FALSE TRUE
##   acc     181   18
##   good     30    0
##   unacc    22  577
##   vgood    35    0
```

The transformed data set does not support as good a logistic regression model.

```
m2 <- glm(clase==1~f1+f2+f3+f4+f5+f6,
          family=binomial(link='logit'),
          data=dTreated[train,])
dTreated$pred <- predict(m2,newdata=dTreated,type="response")
print(table(class=dTreated[!train,'clase'],
            unnacPred=dTreated[!train,'pred']>0.5))
##        unnacPred
## class   FALSE TRUE
##     0      35    0
##     1      43  556
##     2     118   81
##     3      28    2
```

Now obviously some modeling methods are more sensitive to this mis-coding than others. In fact, for a moderate number of levels you would expect random forest methods to actually invert the coding. But the fact that some methods are more affected than others is one reason why you don’t want to perform this encoding before making comparisons. As to the question of why ever use logistic regression: because when you have a proper encoding of the data and the model structure is in fact somewhat linear, logistic regression can be a very good method.

In the DWN paper 8 data sets (out of 123) have the `a*k+b` fragment in their `le_datos.m` file. So likely the study was largely driven by data sets that natively have only numeric features. Also, we emphasize the DWN paper shared its data and a bit of its methods, which puts it light-years ahead of most published empirical studies. The only reason we can’t similarly critique other authors is that many other authors don’t share their work.

It always surprises statisticians that the indicator variable trick is not always first in mind. This means we forget to teach and re-teach the method enough. We also need to do more to root-out the incorrect alternatives to the method. Indicator encoding is sometimes hard to point out as it is either not done correctly or done silently.

In `R`, `strings` and `factors` can be treated as single columns or variables and are silently converted to indicator variables during model training and application (or the indicators can be explicitly built using `model.matrix()`). Oddly enough, `R` also goes out of its way to provide a publicly visible “convert to numbers by using interior codes” method (`data.matrix()`), which in my opinion is almost *always* the wrong method and lures unsuspecting programmers and engineers into error. I have written on this before, but if anything I failed to fully appreciate the pervasive nature of the incorrect practice.

`Python`’s `scikit-learn` supplies the correct encoding methods in `sklearn.feature_extraction.DictVectorizer` and `sklearn.preprocessing.OneHotEncoder`. I think a lot of Python users get confused because they do not appreciate that `Pandas` (which deals so well with data representation) and `scikit-learn` (which really only wants to work with numbers) are two independent packages (coded not to depend on each other), and some work is required to faithfully move data from one package to the other.
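Assuming `scikit-learn` is installed, the `DictVectorizer` route can be sketched as follows; the rows and feature values here are made-up illustrations. String-valued features each expand into one 0/1 column per (feature, level) pair.

```python
from sklearn.feature_extraction import DictVectorizer

# Each row as a dict of feature name -> level (strings stay categorical).
rows = [
    {'buying': 'vhigh', 'safety': 'low'},
    {'buying': 'med',   'safety': 'high'},
]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)   # one 0/1 column per (feature, level) pair
print(vec.feature_names_)
# ['buying=med', 'buying=vhigh', 'safety=high', 'safety=low']
print(X)
```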

Note: as expected, `randomForest` does better at reversing the re-encoding. Also, we accidentally left out the variable `f6` in an early version of this post.

```
library(randomForest)
m1F <- randomForest(as.factor(class=='unacc')~
                      buying+maint+doors+persons+lug_boot+safety,
                    data=tab[train,])
tab$predF <- predict(m1F,newdata=tab,type="response")
print(table(class=tab[!train,'class'],
            unnacPred=tab[!train,'predF']))
##        unnacPred
## class   FALSE TRUE
##   acc     193    6
##   good     30    0
##   unacc     9  590
##   vgood    35    0
m2F <- randomForest(as.factor(clase==1)~f1+f2+f3+f4+f5+f6,
                    data=dTreated[train,])
dTreated$predF <- predict(m2F,newdata=dTreated,type="response")
print(table(class=dTreated[!train,'clase'],
            unnacPred=dTreated[!train,'predF']))
##        unnacPred
## class   FALSE TRUE
##     0      35    0
##     1      10  589
##     2     193    6
##     3      30    0
```

And we can confirm the encoding is in fact reversible by showing that the variables and outcomes are in bijective correspondence. This means something as simple as changing the `type/class` declaration from `real` to `string/factor` would undo the coding problem. The machine learning software doesn’t need to know the original names of the levels; it just needs to know to treat the data as levels.

```
print(table(tab$class,dTreated$clase))
##            0    1    2    3
##   acc      0    0  384    0
##   good     0    0    0   69
##   unacc    0 1210    0    0
##   vgood   65    0    0    0
print(table(tab$buying,dTreated$f1))
##         -1.34125 -0.447084 0.447084 1.34125
##   high         0       432        0       0
##   low          0         0        0     432
##   med          0         0      432       0
##   vhigh      432         0        0       0
print(table(tab$maint,dTreated$f2))
##         -1.34125 -0.447084 0.447084 1.34125
##   high         0       432        0       0
##   low          0         0        0     432
##   med          0         0      432       0
##   vhigh      432         0        0       0
print(table(tab$doors,dTreated$f3))
##         -1.52084 -0.168982 0.506946 1.18287
##   2          432         0        0       0
##   3            0       432        0       0
##   4            0         0        0     432
##   5more        0         0      432       0
print(table(tab$persons,dTreated$f4))
##        -1.22439   0 1.22439
##   2         576   0       0
##   4           0 576       0
##   more        0   0     576
print(table(tab$lug_boot,dTreated$f5))
##         -1.22439   0 1.22439
##   big          0   0     576
##   med          0 576       0
##   small      576   0       0
print(table(tab$safety,dTreated$f6))
##        -1.22439   0 1.22439
##   high        0   0     576
##   low       576   0       0
##   med         0 576       0
```


`Excel` spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written about how we don’t recommend using `Excel`-like formats to exchange data. But we know that if you are going to work with others you are going to have to make accommodations (we even built our own modified version of `gdata`’s underlying `Perl` script to work around a bug).
But one thing that continues to confound us is how hard it is to read `Excel` data correctly. When `Excel` exports into `CSV/TSV`-style formats it uses fairly clever escaping rules for quotes and new-lines. Most `CSV/TSV` readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is that `Excel` itself often transforms data without any user verification or control. For example: `Excel` routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable `Excel` transform: changing the strings “`TRUE`” and “`FALSE`” into 1 and 0 inside the actual “`.xlsx`” file. That is, `Excel` does not faithfully store the strings “`TRUE`” and “`FALSE`” even in its native format. Most `Excel` users do not know about this, so they certainly are in no position to warn you about it.

This would be a mere annoyance, except it turns out `Libre Office` (or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data-mangling bug on this surprising Microsoft boolean type.

We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug.

Our example `Excel` spreadsheet was produced using Microsoft `Excel` 2011 for OSX. We started a new sheet and typed in a few cells by hand. We formatted the header and the numeric column, but did not move off default settings for any of the `TRUE/FALSE` cells. The spreadsheet looks like the following:

Original `Excel` spreadsheet (TRUE/FALSE typed in as text, no formatting commands on those cells). You can also download the spreadsheet here.

On `OSX`, Apple `Numbers` can read the sheet correctly. We demonstrate this below.

Sheet looks okay in Apple Numbers.

However, `Libre Office` doesn’t reverse the encoding (as it may not know some details of `Excel`’s encoding practices) *and* also shows corrupted data, as we see below.

`TRUE/FALSE` represented as `1/0` in `Libre Office`, and third row damaged.

In practice we have seen the data damage is pervasive and not limited to columns whose original value was `FALSE`. It may be a presentation problem, as examining individual cells shows “`=TRUE()`” and “`=FALSE()`” as the contents of the affected cells (and apparently in the correct positions, independent of what is being displayed).

Apple `Preview` and `Quick Look` both also fail to understand the `Excel` data encoding, as we show below.

Sheet damaged in Apple Preview (same for Apple Quick Look).

Our favorite analysis hammer (R) appears to read the data correctly (with only the undesired translation of `TRUE/FALSE` to `1/0`):

R appears to load what was stored correctly.

But what is going on? It turns out `Excel` `.xlsx` files are actually `zip` archives storing a directory tree of `xml` artifacts. By changing the file extension from `.xlsx` to `.zip` we can treat the spreadsheet as a `zip` archive and inflate it to see the underlying files. The inflated file tree is shown below.
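The inflation step is easy to script. A self-contained Python sketch (it builds an in-memory toy archive in place of a real workbook, since any real `.xlsx` opens the same way with the standard `zipfile` module, no rename to `.zip` required):

```python
import io
import zipfile

# Build a toy stand-in for a workbook: a zip holding the two parts we care about.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('xl/worksheets/sheet1.xml', '<worksheet/>')
    z.writestr('xl/sharedStrings.xml', '<sst/>')

# Open it back up as an archive and list the file tree.
with zipfile.ZipFile(buf) as z:
    names = z.namelist()
print(names)  # ['xl/worksheets/sheet1.xml', 'xl/sharedStrings.xml']
```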

The file tree representing the `Excel` workbook on disk.

Of particular interest are the files `xl/worksheets/sheet1.xml` and `xl/sharedStrings.xml`. `sheet1.xml` contains the worksheet data and `sharedStrings.xml` is a shared string table containing all strings used in the worksheet (the worksheet stores no user-supplied strings, only indexes into the shared string table). Let’s look into `sheet1.xml`:

The XML representing the sheet data.

The sheet data is arranged into rows that contain columns. It is easy to match these rows and cells to our original spreadsheet. For cells containing uninterpreted strings the `<c>` tag has an attribute set to `t="s"` (probably denoting the type is “string” and to use the `<v>` value as a string index). Notice floating point numbers are not treated as shared strings, but stored directly in the `<v>` tag. Further notice that the last three columns are stored as `0/1` and have the attribute `t="b"` set. My guess is this is declaring the type to be “boolean,” which then must have the convention that `1` represents `TRUE` and `0` represents `FALSE`.
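Under that reading of the format, decoding the boolean cells is a few lines of XML handling. A sketch with Python’s standard `xml.etree`; the fragment below is a simplified stand-in for real sheet data (real `sheet1.xml` carries an XML namespace and shared-string indirection that this sketch ignores):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one row of sheet data.
sheet = '''<row>
  <c r="A2"><v>3.2</v></c>
  <c r="B2" t="b"><v>1</v></c>
  <c r="C2" t="b"><v>0</v></c>
</row>'''

decoded = {}
for cell in ET.fromstring(sheet):
    ref, text = cell.get('r'), cell.find('v').text
    if cell.get('t') == 'b':   # boolean cell: 1 -> TRUE, 0 -> FALSE
        decoded[ref] = (text == '1')
    else:                      # untyped here: treat as a number
        decoded[ref] = float(text)
print(decoded)  # {'A2': 3.2, 'B2': True, 'C2': False}
```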

This doesn’t seem that complicated, but clearly of all the “`Excel`-compatible” tools we tried, only Apple `Numbers` knew all of the details of this encoding (and was able to reverse it). Other than `Numbers`, only `R`’s `gdata` package was able to extract usable data (and even it only recovered the encoded version of the field, not the original user value).

And these are our issues with working with data that has passed through `Excel`:

- `Excel` has a lot of non-controllable data transforms, including booleans and dates (in fact `Excel` mangles even string fragments it merely suspects could be made into dates). Some of these transforms are non-faithful or not reversible.
- Very few tools that claim to interoperate with `Excel` actually get the corner cases right. Not even for simple well-documented data types like `Excel` `CSV` export, and definitely not for the native `.xlsx` format.

These transforms and conventions make exporting data harder (and riskier) than it has to be. To add insult to injury, you often run into projects that are sharing `Excel` `.xlsx` spreadsheets where neither the reader nor the writer is `Excel`, so neither end is even good at working with the format. Because working with data that has passed through `Excel` is hard to get right, data that has passed through `Excel` is often wrong.

(Note: I definitely feel we do need to be thankful to open source and free software developers. These teams, in addition to generously supplying software without charge, are also working to preserve user freedoms and are often the only way to read older data. However, when we are using software for work we do need it to work correctly and be faithful to the data. This problem is small *when you detect it*, but large if hidden in a larger project.)

- Estimate an approximate functional relation `y ~ f(x)`.
- Apply that relation to new instances where `x` is known and `y` is not yet known.

An example of this would be to use measured characteristics of online shoppers to predict if they will purchase in the next month. Data more than a month old gives us a training set where both `x` and `y` are known. Newer shoppers give us examples where only `x` is currently known, and it would presumably be of some value to estimate `y` or estimate the probability of different `y` values. The problem is philosophically “easy” in the sense we are not attempting inference (estimating unknown parameters that are not later exposed to us) and we are not extrapolating (making predictions about situations that are out of the range of our training data). All we are doing is essentially generalizing memorization: if somebody who shares characteristics of recent buyers shows up, predict they are likely to buy. We repeat: we are *not* forecasting or “predicting the future,” as we are not modeling how many high-value prospects will show up, just assigning scores to the prospects that do show up.

The reliability of such a scheme rests on the concept of exchangeability. If the future individuals we are asked to score are exchangeable with those we had access to during model construction then we expect to be able to make useful predictions. How we construct the model (and how to ensure we indeed find a good one) is the core of machine learning. We can bring in any big name machine learning method (deep learning, support vector machines, random forests, decision trees, regression, nearest neighbors, conditional random fields, and so-on) but the legitimacy of the technique pretty much stands on some variation of the idea of exchangeability.

One effect antithetical to exchangeability is “concept drift.” Concept drift is when the meanings and distributions of variables, or relations between variables, change over time. Concept drift is a killer: if the relations available to you during training are thought not to hold during later application, then you should not expect to build a useful model. This is one of the hard lessons that statistics tries so hard to quantify and teach.

We know that you should always prefer fixing your experimental design over trying a mechanical correction (which can go wrong). And there are no doubt “name brand” procedures for dealing with concept drift. However, data science and machine learning practitioners are at heart tinkerers. We ask: can we (to a limited extent) attempt to directly correct for concept drift? This article demonstrates a simple correction applied to a deliberately simple artificial example.

Image: Wikipedia: Elgin watchmaker

For this project we are getting into the realm of transductive inference. Traditionally we build a model based only on an initial fixed set of training data and then score each later application datum independently. In this write-up we will assume we have access to the later data we need to score during model construction (or at least the control variables or “x”s) and can use statistics about the data we are actually going to be asked to score to influence how we convert our training data (data for which both “x”s and “y” are known) into a model and predictions or scores.

Let’s describe our simple artificial problem. Suppose we have access to a number of instances of training data. These are ordered pairs of observations `(x_i, y_i)` (`i = 1 ... a`) where the `x_i` are vectors in `R^n` and the `y_i` are real numbers. A typical regression task is to find a `g` in `R^n` such that `g.x_i` is a good estimate of `y_i`. Now further assume the following generative model. Unobserved vectors `z_i` in `R^n` are generated according to some unknown distribution, and it is the case that `y_i = b.z_i + e_i` (for some unobserved `b` in `R^n`, and noise term `e_i`) and our observed `x_i` are generated as `L1 z_i + s_i` (where `L1` is an unobserved linear transform and `s_i` is a vector noise term).
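A small numpy sketch of this generative setup (the dimension, sample size, noise scales, and the least squares fit for `g` are all arbitrary illustrative choices, not values from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n, a = 3, 500                      # dimension and number of training pairs

b = rng.normal(size=n)             # unobserved true coefficients
L1 = rng.normal(size=(n, n))       # unobserved linear transform

z = rng.normal(size=(a, n))        # unobserved latent vectors z_i
y = z @ b + 0.1 * rng.normal(size=a)            # y_i = b.z_i + e_i
x = z @ L1.T + 0.1 * rng.normal(size=(a, n))    # x_i = L1 z_i + s_i

# A regression estimate g with g.x_i ~ y_i (ordinary least squares fit).
g, *_ = np.linalg.lstsq(x, y, rcond=None)
```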

Graphically we can represent our problem as follows (we are using “`u ~ v`” to informally denote “`u` is distributed mean `v` plus iid noise/error”).

And we can estimate `g` without worrying over-much about details like `L1`. However, the fact that we are not directly observing an un-noised `z_i` means we do not meet the standard conditions of simple least squares regression and are already in a more complicated errors-in-variables situation (which we will ignore). The additional difficulty we actually want to concentrate on is a form of concept drift. Suppose after the training period, when the time comes to apply the model, we no longer observe `x_i ~ L1 z_i`, but instead observe `q_i ~ L2 z_i` (where `L2` is a new unobserved linear operator, and `i = a+1 ... a+b`). In this case our fit estimate `g` may no longer supply the best possible predictions. We may want to use an adjusted linear model. We would like to adjust by `L1 L2^{-1}`, but we don’t directly observe `L1`, `L2`, or `L1 L2^{-1}`. The situation during application time (when we are trying to predict new unobserved `y_i` from `q_i`) is illustrated below.

This situation may seem a bit contrived, but it is actually fairly familiar in the world of engineering (relevant topics being system identification and techniques like the Kalman filter).

There are some standard statistical practices that could help in this situation. One would be to re-scale the observed `x_i` during training (either through principal components methods, or by running individual variables through a CDF). We are not huge fans of “x-alone” scaling and lean more toward partial least squares or inverse regression ideas. Since we are assuming that during the application phase the `y_i`s are not yet observable (say we have to make a block of predictions before we have a chance to observe any new `y_i`s), we will have to try to find an x-alone scaling solution. We want to try to estimate `L1 L2^{-1}` from the observed inertial-ellipsoids/covariance-matrices as illustrated below.

The issue is we are trying to find a change of basis without any so-called “registration marks.” We can try to estimate `E1 = L1 M` (where `M M^{T}` is the covariance matrix of the unobserved `z_i`) and `E2 = L2 M` from our data. So we could try to estimate `L1 L2^{-1}` as `E1 E2^{-1}`. But the problem is (in addition to having to use one of our estimates in a denominator, always a bad situation) that without registration marks our frame-of-reference estimates `E1` and `E2` are only determined up to an orthonormal transformation. So we actually want to pick an estimate `L1 L2^{-1} ~ E1 W E2^{-1}` where `W` is an arbitrary orthogonal matrix (or orthonormal linear transformation). In our case we want to pick `W` so that `E1 W E2^{-1}` is near the identity. The principle being: don’t move anything without strong evidence a move is needed.

We don’t have simple code to pick orthogonal `W` with `E1 W E2^{-1}` nearest the identity, though we could obviously give this to a general optimizer. We strongly agree with the principle that machine learning researchers should usually limit themselves to writing down the conditions of optimality and not cripple methods by over-specifying an (often inferior) optimizer. This point is made in “The Interplay of Optimization and Machine Learning Research,” Kristin P. Bennett and Emilio Parrado-Hernandez, Journal of Machine Learning Research, 2006, vol. 7, pp. 1265-1281, and in “The Elements of Statistical Learning,” Second Edition, Trevor Hastie, Robert Tibshirani, Jerome Friedman. But let’s ignore that and change the problem to one we happen to know the solution to.

There is a tempting and elegant solution that can pick orthogonal `W` such that `W` is as near as possible to `E1^{-1} E2`. So we are asking for an orthogonal matrix near `E1^{-1} E2` instead of one that minimizes residual error. This form of problem is known as the orthogonal Procrustes problem, and we show how to solve it using singular value decomposition in the following worked iPython example. The gist is: we form the singular value decomposition `E1^{-1} E2 = U D V^{T}` (`U`, `V` orthogonal matrices, `D` a non-negative diagonal matrix) and it turns out `W = U V^{T}` is the desired orthogonal estimate. So our estimate of `L1 L2^{-1}` should then be `E1 U V^{T} E2^{-1}`.
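The recipe is short enough to sketch in numpy (with made-up random `E1` and `E2` standing in for the estimated frames; this is an illustration of the SVD step, not the worked example from the article):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
E1 = rng.normal(size=(n, n))       # stand-ins for the estimated frames
E2 = rng.normal(size=(n, n))

# Orthogonal Procrustes step: W = U V^T from the SVD of E1^{-1} E2.
U, D, Vt = np.linalg.svd(np.linalg.inv(E1) @ E2)
W = U @ Vt                         # nearest orthogonal matrix to E1^{-1} E2

adjustment = E1 @ W @ np.linalg.inv(E2)   # estimate of L1 L2^{-1}

# W is orthogonal by construction.
assert np.allclose(W @ W.T, np.eye(n))
```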

In our worked example the adjusted model has half the root mean square error of using the un-adjusted model.

The scatter plot of predicted versus actual using the un-adjusted (or `g`) model is as follows.

And the scatter plot from the better adjusted estimate is given below.

This is not as good as re-fitting after the concept change, but it is better than nothing. I am not sure I would use this adjustment in practice, but the derivation of the estimate is fun.

Obviously these hierarchical models I diagrammed are much easier to interpret in a principled manner in a Bayesian setting (due to the need to integrate out the unobserved `z_i`). But, frankly, I don’t have enough experience with Stan to know how to efficiently specify such a beast (with data) for standard inference.

Let us work the example.

Consider `n` identically distributed independent normal random variables `x_1`, …, `x_n`. A common naive estimate of the unknown common mean `U` and variance `V` of the generating distribution is given as follows:

```
u = sum_{i=1...n} x_i / n
v = sum_{i=1...n} (x_i - u)^2 / n
```

That is: we are calculating simple estimates `u,v` that we hope will be close to the unknown true population values `U,V`. Unfortunately, if you show this estimate to a statistical audience you will likely be open to ridicule. The problem is that the preferred estimate is not what we just wrote, but in fact:

```
u  = sum_{i=1...n} x_i / n
v' = sum_{i=1...n} (x_i - u)^2 / (n-1)
```

The proffered argument will be that the estimate `v` is biased (indeed, an undesirable property) and the estimate `v'` is unbiased. If one wants to be rude one can take pleasure in accusing the author (me) of not knowing the difference between sample variance and population variance.

In my opinion the actual reason for disagreement is: statistics, at least when taught out of major, is largely taught as a prescriptive practice; you follow the exact specified procedure or you are wholly wrong.

Let us take the time to reason about our naive estimate a bit more. We have indeed made a mistake in using it. The mistake is we didn’t state the intended goal of the estimator. That is sloppy thinking; we should always have some goal in mind (right or wrong) and not blindly execute procedures. If the goal is an unbiased estimator, we have indeed picked the wrong estimator. But suppose we had been more careful and said we wanted a maximum likelihood estimator. `u,v` is in fact maximum likelihood and `u,v'` is not. Unbiasedness is not the only possible performance criterion, and it is often incompatible with other estimation goals (see here, and here for more examples).

The usual derivation that `u,v'` is unbiased involves observing that if we define:

```
Q := (sum_{i=1...n} x_i^2) -
(1/n) (sum_{i=1...n} x_i)^2
```

A bit of algebra that is very familiar to statisticians shows that our earlier maximum likelihood estimate `v` is in fact equal to `Q/n`. We also note we can derive (using our knowledge of the non-central moments in terms of `U,V`) that `E[Q|U,V] = (n-1) V` (and *not* `n*V`). And a small amount of algebra then gives you the unbiased estimate `u,v'`.

This seems superior and fine, until you notice the following. A glob of messy algebra gives you `E[Q^2|U,V] = (n^2 - 1) V^2` (claimed in Savage; the derivation needs the stated additional distributional assumption that the data are normal, to ensure the facts we need about the first four moments of the observed data hold). But this is enough to show that `Q/(n+1)` is lower variance than the maximum likelihood estimate `v=Q/n` and also lower variance than the unbiased estimate `v'=Q/(n-1)`. So if we had stated our goal was a more statistically efficient estimate of the unknown variance (or lower variance in our estimate of variance) we might have preferred an estimate of the form:

```
u = sum_{i=1...n} x_i / n
v'' = sum_{i=1...n} (x_i - u)^2 / (n+1)
```

What is going on is that the empirically observed variance is a different beast than the empirically observed mean, even for normal variates. For one thing, the empirically observed variance is a non-negative random variable (so it itself is certainly not normal). And unlike the empirical mean, we don’t get the maximum likelihood, zero bias, and minimal variance estimates all co-occurring.

The math isn’t too bad. From Savage:

```
E[ (a Q - V)^2 | U, V]
= (a^2 (n^2 - 1) - 2 a (n - 1) + 1) V^2
= ((a - 1/(n+1))^2 (n^2 - 1) + 2/(n+1)) V^2
>= 2 V^2 / (n+1)
```

And this bound is tight at `a = 1/(n+1)`. Note that the algebra is only valid when `n>1`, but `Q=0` when `n=1` or `V=0`, which means `a*Q=0` for all `a`. So we will assume `n>1` and `V>0`. Thus: when `n>1` and `V>0`, `v'' = Q/(n+1)` is the unique least variance estimate of the form `a*Q` where `a` is a constant (not depending on `n`, `U`, `V`, or the `x_i`).

Frankly we have never seen an estimate of the form `v'' = Q/(n+1)` in use. It is unlikely the additional distributional assumptions are worth the promised reduction in estimation variance. But the point is: we have exhibited three different “optimal” estimates for the variance, so it is a bit harder to claim one is always obviously preferred (especially without context).

Or (following the math with an attempt at interpretation): estimating the variance of even a population of normal variates is a common example of where there are lower variance estimators than the standard unbiased choices (without getting into the complications of Stein’s example, James-Stein estimators, or Hodges–Le Cam estimators). In fact it is such a common example it is often ignored.

Or (without the math): as long as our estimators are what statisticians call *consistent* and `n` is large (which is one of the great advantages of big data) we really can afford to be civil about the differences between these estimates.

Current schedule/location details after the click.

Hadoop Effortlessly: A Data Inventory is Key to Data Self-service
10/16/2014 1:45pm - 2:25pm EDT (40 minutes), Room: 1 E05
http://en.oreilly.com/stratany2014/public/schedule/detail/37956

Office Hour with John Mount (Win Vector LLC)
10/16/2014 2:35pm - 3:15pm EDT (40 minutes), Room: Table C
http://en.oreilly.com/stratany2014/public/schedule/detail/37989

Also, look for us and “Practical Data Science with R” at Waterline Data Science’s Strata booth (booth 553).

Javits Center 655 W 34th Street New York, NY 10001

For more updates (events, book discounts), follow us on Twitter: @WinVectorLLC.

There is one caveat: if you are evaluating a series of models to pick the best (and you usually are), then a single hold-out set is, strictly speaking, not enough. Hastie, et al. say it best:

Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.

– Hastie, Tibshirani and Friedman, *The Elements of Statistical Learning*, 2nd edition.

The ideal way to select a model from a set of candidates (or set parameters for a model, for example the regularization constant) is to use a training set to train the model(s), a calibration set to select the model or choose parameters, and a test set to estimate the generalization error of the final model.

In many situations, breaking your data into three sets may not be practical: you may not have very much data, or the phenomena you’re interested in are rare enough that you need a lot of data to detect them. In those cases, you will need more statistically efficient estimates for generalization error or goodness-of-fit. In this article, we look at the PRESS statistic, and how to use it to estimate generalization error and choose between models.

**The PRESS Statistic**

You can think of the PRESS (predicted residual sum of squares) statistic as an “adjusted sum of squared error (SSE).” It is calculated as `PRESS = sum_{i=1...n} (y_i - f_{-i}(x_i))^2`, where *n* is the number of data points in the training set, *y_i* is the outcome of the *i*th data point, and *f_{-i}(x_i)* is the prediction for the *i*th data point from a model fit to the training data with the *i*th point held out.

For example, if you wanted to calculate the PRESS statistic for linear regression models in R, you could do it this way (though I wouldn’t recommend it):

```
# For explanation purposes only -
# DO NOT implement PRESS this way
brutePRESS.lm = function(fmla, dframe, outcome) {
  npts = dim(dframe)[1]
  ssdev = 0
  for(i in 1:npts) {
    # a data frame with all but the ith row
    d = dframe[-i,]
    # build a model using all but pt i
    m = lm(fmla, data=d)
    # then predict outcome[i]
    pred = predict(m, newdata=dframe[i,])
    # sum the squared deviations
    ssdev = ssdev + (pred - outcome[i])^2
  }
  ssdev
}
```
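For linear regression specifically, there is a standard identity that avoids refitting the model `n` times: the hold-one-out residual equals the ordinary residual divided by `1 - h_ii`, where `h_ii` is the *i*th leverage (hat value). The following sketch is my own illustration of that identity (it is not the authors' helper code referenced below):

```r
# efficient PRESS for lm() via the leverage identity:
# hold-one-out residual = residual / (1 - hat value)
fastPRESS.lm <- function(fmla, dframe) {
  m <- lm(fmla, data=dframe)
  sum((residuals(m) / (1 - hatvalues(m)))^2)
}

# example on a built-in data set
print(fastPRESS.lm(dist ~ speed, cars))
```

This computes exactly the same quantity as the brute-force loop above, from a single model fit.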

We have implemented a couple of helper functions to calculate the PRESS statistic (and related measures) for linear regression models more efficiently. You can find the code here. The function `hold1OutLMPreds(fmla, dframe)` returns the vector `f`, where `f[i]` is the prediction on the ith row of `dframe` when fitting the linear regression model described by `fmla` on `dframe[-i,]`. The function `hold1OutMeans(y)` returns a vector `g` where `g[i] = mean(y[-i])`. With these functions, you can efficiently calculate the PRESS statistic for a linear regression model:

```
hopreds = hold1OutLMPreds(fmla, dframe)
devs = y-hopreds
PRESS = sum(devs^2)
```

One disadvantage of the SSE (and the PRESS) is that they are dependent on the data size; you can’t compare a single model’s performance across data sets of different size. You can remove that dependency by going to the root mean squared error (RMSE): `rmse = sqrt(sse/n)`, where `n` is the size of the data set. You can also calculate an equivalent “root mean PRESS” statistic:

```
n = length(y)
hopreds = hold1OutLMPreds(fmla, dframe)
devs = y-hopreds
rmPRESS = sqrt(mean(devs^2))
```

And you can also define a “PRESS R-squared”:

```
n = length(y)
hopreds = hold1OutLMPreds(fmla, dframe)
homeans = hold1OutMeans(y)
devs = y-hopreds
dely = y-homeans
PRESS = sum(devs^2)
PRESS.r2 = 1 - (PRESS/sum(dely^2))
```

The “PRESS R-squared” is one minus the ratio of the model’s PRESS over the “PRESS of y’s mean value;” it adjusts the estimate of how much variation the model explains by using hold-one-out (leave-one-out) cross-validation rather than adjusting for the model’s degrees of freedom (as the more standard adjusted R-squared does).

You might also consider defining a PRESS R-squared using the in-sample total error (`y-mean(y)`) instead of the 1-hold-out mean; we decided on the latter in an “apples-to-apples” spirit. Note also that PRESS R-squared can be negative if the model is very poor.

**An Example**

Let’s imagine a situation where we want to predict a quantity *y*, and we have many many potential inputs to use in our prediction. Some of these inputs are truly correlated with *y*; some of them are not. Of course, we don’t know which are which. We have some training data with which to build models, and we will get (but don’t yet have) hold-out data to evaluate the final model. How might we proceed?

First, let’s create a process to simulate this situation:

```
# build a data frame with pure noise columns
# and columns weakly correlated with y
buildExample1 <- function(nRows) {
  nNoiseCols <- 300
  nCorCols <- 20
  copyDegree <- 0.1
  noiseMagnitude <- 0.1
  d <- data.frame(y=rnorm(nRows))
  for(i in 1:nNoiseCols) {
    nm <- paste('noise',i,sep='_')
    d[,nm] <- noiseMagnitude*rnorm(nRows) +
      ifelse(runif(nRows)<=copyDegree, rnorm(nRows), 0)
  }
  for(i in 1:nCorCols) {
    nm <- paste('cor',i,sep='_')
    d[,nm] <- noiseMagnitude*rnorm(nRows) +
      ifelse(runif(nRows)<=copyDegree, d$y, 0)
  }
  d
}
```

This function will produce a dataset of `nRows` rows with 20 columns that are weakly correlated with `y` (called `cor_1, cor_2...`) and 300 columns (`noise_1, noise_2...`) that are independent of `y`. The process is designed so that the noise columns and the correlated columns have similar magnitudes and variances. The outcome can be expressed as a linear combination of the correlated inputs, so a linear regression model should give reasonable predictions.

Let's suppose we have two candidate models: one which uses all the variables, and one which magically uses only the intentionally correlated variables.

```
set.seed(22525)
train = buildExample1(1000)
output = "y"
inputs = setdiff(colnames(train), output)
truein = inputs[grepl("^cor",inputs)]
# all variables, including noise
# (noisy model)
fmla1 = paste(output, "~", paste(inputs, collapse="+"))
mod1 = lm(fmla1, data=train)
# only true inputs
# (clean model)
fmla2 = paste(output, "~", paste(truein, collapse="+"))
mod2 = lm(fmla2, data=train)
```

We can extract all the model coefficients that `lm()` deemed significant to p < 0.05 (that is, all the coefficients that are marked with at least one "*" in the model summary).

```
# 0.05 = "*" in the model summary
sigCoeffs = function(model, pmax=0.05) {
  cmat = summary(model)$coefficients
  pvals = cmat[,4]
  plo = names(pvals)[pvals < pmax]
  plo
}
# significant coefficients in the noisy model
sigCoeffs(mod1)
## [1] "noise_41"  "noise_59"  "noise_66"  "noise_117" "noise_207"
## [6] "noise_256" "noise_279" "noise_280" "cor_1"     "cor_2"
## [11] "cor_3"    "cor_4"     "cor_5"     "cor_6"     "cor_7"
## [16] "cor_8"    "cor_9"     "cor_10"    "cor_11"    "cor_12"
## [21] "cor_13"   "cor_14"    "cor_15"    "cor_16"    "cor_17"
## [26] "cor_18"   "cor_19"    "cor_20"
```

In other words, several of the noise inputs appear to be correlated with the output in the training data, just by chance. This means that the noisy model has overfit the data. Can we detect that? Let's look at the SSE and the PRESS:

```
##          name   sse PRESS
## 1 noisy model 203.3 448.6
## 2 clean model 285.8 306.8
```

Looking at the in-sample SSE, the noisy model looks better than the clean model; the PRESS says otherwise. We can see the same thing if we look at the R-squared style measures:

```
##          name     R2  R2adj PRESSr2
## 1 noisy model 0.7931 0.6956  0.5442
## 2 clean model 0.7091 0.7031  0.6884
```

Again, R-squared makes the noisy model look better than the clean model. The adjusted R-squared correctly indicates that the additional variables in the noisy model do not improve the fit, and slightly prefers the clean model. The PRESS R-squared identifies the clean model as the better model, with a much larger margin of difference than the adjusted R-squared.

**The PRESS statistic versus Hold-out Data**

Of course, while the PRESS statistic is statistically efficient, it is not always computationally efficient, especially with modeling techniques other than linear regression. The calculation of the adjusted R-squared is not computationally demanding, and it also identified the better model in our experiment. One could ask, why not just use adjusted R-squared?

One reason is that the PRESS statistic is attempting to directly model future predictive performance. Our experiment suggests that it shows clearer distinctions between the models than the adjusted R-squared. But how well does the PRESS statistic estimate the "true" generalization error of a model?

To test this, we will hold the ground truth (that is, the data generation process) and the training set fixed. We will then repeatedly generate test sets, measure the RMSE of the models' predictions against each test set, and compare those values to the training RMSE and root mean PRESS. This is akin to a situation where the training data and model fitting are accomplished facts, and we are hypothesizing possible future applications of the model.

Specifically, we used `buildExample1()` to generate one hundred test sets of size 100 (10% of the size of the training set) and one hundred test sets of size 1000 (the size of the training set). We then evaluated both the clean model and the noisy model against all the test sets and compared the distributions of the hold-out root mean squared error (RMSE) against the in-sample RMSE and PRESS statistics. The results are shown below.

For each plot, the solid black vertical line is the mean of the distribution of test RMSE; we can assume that the observed mean is a good approximation to the "true" expected RMSE of the model. Not surprisingly, a smaller test set size leads to more variance in the observed RMSE, but after 100 trials, both the n=100 and n=1000 hold-out sets lead to similar estimates of the expected RMSE (just under 0.7 for the noisy model, just under 0.6 for the clean model).

The dashed red lines give the root mean PRESS of both models on the training data, and the dashed blue lines give each model's training set RMSE. For both the noisy and clean models, the root mean PRESS gives a better estimate of the model's expected RMSE than the training set RMSE does -- dramatically so for the noisy, overfit model.

Note, however, that in this experiment, a single hold-out set reliably preferred the clean model to the noisy one (that is, the hold-out SSE was always greater for the noisy model than the clean one when both models were applied to the same test data). The moral of the story: use hold-out data (both calibration and test sets) when that is feasible. When data is at a premium, then try more statistically efficient metrics like the PRESS statistic to "stretch" the data that you have.

In R, scalar values such as `5` are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not an R vector. For example, consider the following R code.

```
levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'),levels=levels)
print(f)
## [1] c    a    a    <NA> b    a
## Levels: a b c
print(class(f))
## [1] "factor"
```

This example encodes a series of 6 observations into a known set of factor levels (`'a'`, `'b'`, and `'c'`). As is the case with real data, some of the positions might be missing/invalid values such as `NA`. One of the strengths of R is that we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level `'a'` was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.

```
fRevised <- ifelse(is.na(f),'a',f)
print(fRevised)
## [1] "3" "1" "1" "a" "2" "1"
print(class(fRevised))
## [1] "character"
```

Notice the new column `fRevised` is an absolute mess (and not even of class/type factor). This sort of fix would have worked if `f` had been a vector of characters or even a vector of integers, but for factors we get gibberish.

We are going to work through some more examples of this problem.

R is designed to support statistical computation. In R, analyses and calculations are often centered on a type called a data frame. A data frame is very much like a SQL table in that it is a sequence of rows (each row representing an instance of data) organized against a column schema. This is also very much like a spreadsheet where we have good column names and column types. (One caveat: in R, vectors that are all `NA` typically lose their type information and become type `"logical"`.) An example of an R data frame is given below.

```
d <- data.frame(x=c(1,-0.4),y=c('a','b'))
print(d)
##      x y
## 1  1.0 a
## 2 -0.4 b
```

An R data frame is actually implemented as a list of columns, each column being treated as a vector. This encourages a very powerful programming style where we specify transformations as operations over columns. An example of working over column vectors is given below:

```
d <- data.frame(x=c(1,-0.4),y=c('a','b'))
d$xSquared <- d$x^2
print(d)
##      x y xSquared
## 1  1.0 a     1.00
## 2 -0.4 b     0.16
```

Notice that we did not need to specify any for-loop, iteration, or range over the rows. We work over column vectors to great advantage in clarity and speed. This is fairly clever as traditional databases tend to be row-oriented (define operations as traversing rows) and spreadsheets tend to be cell-oriented (define operations over ranges of cells). We can confirm R’s implementation of data frames is in fact a list of column vectors (not merely some other structure behaving as such) through the unclass-trick:

```
print(class(unclass(d)))
## [1] "list"
print(unclass(d))
## $x
## [1] 1.0 -0.4
##
## $y
## [1] a b
## Levels: a b
##
## $xSquared
## [1] 1.00 0.16
##
## attr(,"row.names")
## [1] 1 2
```

The data frame `d` is implemented as a class/type annotation over a list of columns (`x`, `y`, and `xSquared`). Let’s take a closer look at the class or type of the column `y`.

```
print(class(d$y))
## [1] "factor"
```

The class of `y` is `"factor"`. We gave R a sequence of strings and it promoted or coerced them into a sequence of factor levels. For statistical work this makes a lot of sense; we are more likely to want to work over factors (which we will define soon) than over strings. And at first glance R seems to like factors more than strings. For example `summary()` works better with factors than with strings:

```
print(summary(d))
##        x       y     xSquared
##  Min.   :-0.40   a:1   Min.   :0.16
##  1st Qu.:-0.05   b:1   1st Qu.:0.37
##  Median : 0.30         Median :0.58
##  Mean   : 0.30         Mean   :0.58
##  3rd Qu.: 0.65         3rd Qu.:0.79
##  Max.   : 1.00         Max.   :1.00
print(summary(data.frame(x=c(1,-0.4),y=c('a','b'),
   stringsAsFactors=FALSE)))
##        x             y
##  Min.   :-0.40   Length:2
##  1st Qu.:-0.05   Class :character
##  Median : 0.30   Mode  :character
##  Mean   : 0.30
##  3rd Qu.: 0.65
##  Max.   : 1.00
```

Notice how if `y` is a factor column we get nice counts of how often each factor-level occurred, but if `y` is a character type (forced by setting `stringsAsFactors=FALSE` to turn off conversion) we don’t get a usable summary. So as a default behavior R promotes strings/characters to factors, and it has better summaries for factors than for strings/characters. This would make you think that factors might be a preferred/safe data type in R. This turns out to not completely be the case. A careful R programmer must really decide when and where they want to allow factors in their code.

What is a factor? In principle a factor is a value where the value is known to be taken from a known finite set of possible values called levels. This is similar to an enumerated type. Typically we think of factor levels or categories taking values from a fixed set of strings. Factors are very useful in encoding categorical responses or data. For example we can represent which continent a country is in with the factor levels `"Asia"`, `"Africa"`, `"North America"`, `"South America"`, `"Antarctica"`, `"Europe"`, and `"Australia"`. When the data has been encoded as a factor (perhaps during ETL) you not only have the continents indicated, you also know the complete set of continents and have a guarantee of no ad-hoc alternate responses (such as “SA” for South America). Additional machine-readable knowledge and constraints make downstream code much more compact, powerful, and safe.

You can think of a factor vector as a sequence of strings with an additional annotation as to what universe of strings the strings are taken from. The R implementation of factor actually implements factor as a sequence of integers where each integer represents the index (starting from 1) of the string in the sequence of possible levels.

```
print(class(unclass(d$y)))
## [1] "integer"
print(unclass(d$y))
## [1] 1 2
## attr(,"levels")
## [1] "a" "b"
```

This implementation difference *should* not matter, except R exposes implementation details (more on this later). Exposing implementation details is generally considered to be a bad thing as we don’t know if code that uses factors is using the declared properties and interfaces or is directly manipulating the implementation.

Down-stream users or programmers are supposed to mostly work over the supplied abstraction not over the implementation. Users should not routinely have direct access to the implementation details and certainly not be able to directly manipulate the underlying implementation. In many cases the user must be *aware* of some of the limitations of the implementation, but this is considered a necessary *undesirable* consequence of a leaky abstraction. An example of a necessarily leaky abstraction: abstracting base-2 floating point arithmetic as if it were arithmetic over the real numbers. For decent speed you need your numerics to be based on machine floating point (or some multi-precision extension of machine floating point), but you want to think of numerics abstractly as real numbers. With this leaky compromise the user doesn’t have to have the entire IEEE Standard for Floating-Point Arithmetic (IEEE 754) open on their desk at all times. But the user should know the exceptions, like: `(3-2.9)<=0.1` tends to evaluate to `FALSE` (due to the implementation, and in violation of the claimed abstraction), and know the necessary defensive coding practices (such as being familiar with What Every Computer Scientist Should Know About Floating-Point Arithmetic).

Now: factors *can* be efficiently implemented perfectly, so they *should* be implemented perfectly. At first glance it appears that they have been implemented correctly in R and the user is protected from the irrelevant implementation details. For example if we try and manipulate the underlying integer array representing the factor levels we get caught.

```
d$y[1] <- 2
## Warning message:
## In `[<-.factor`(`*tmp*`, 1, value = c(NA, 1L)) :
##  invalid factor level, NA generated
```

This is good: when we tried to monkey with the implementation we got caught. This is how the R implementors try to ensure there is not a lot of user code directly monkeying with the current representation of factors (leaving open the possibility of future bug-fixes and implementation improvements). Likely this safety was gotten by overloading/patching the `[<-` operator. However, as with most fix-to-finish designs, a few code paths are missed and there are places the user is exposed to the implementation of factors when they expected to be working over the abstraction. Here are a few examples:

```
f <- factor(c('a','b','a')) # make a factor example
print(class(f))
## [1] "factor"
print(f)
## [1] a b a
## Levels: a b
# c() operator collapses to implementation
print(class(c(f,f)))
## [1] "integer"
print(c(f,f))
## [1] 1 2 1 1 2 1
# ifelse(,,) operator collapses to implementation
print(ifelse(rep(TRUE,length(f)),f,f))
## [1] 1 2 1
# factors are not actually vectors
# this IS as claimed in help(vector)
print(is.vector(f))
## [1] FALSE
# factor implementations are not vectors either
# despite being "integer"
print(class(unclass(f)))
## [1] "integer"
print(is.vector(unclass(f)))
## [1] FALSE
# unlist of a factor is not a vector
# despite help(unlist):
# "Given a list structure x, unlist simplifies it to produce a vector"
print(is.vector(unlist(f)))
## [1] FALSE
print(unlist(f))
## [1] a b a
## Levels: a b
print(as.vector(f))
## [1] "a" "b" "a"
```

What we have done is found instances where a `factor` column does not behave as we would expect a character vector to behave. These defects in behavior are why I claim factors are not first class in R. They don’t get the full-service expected behavior from a number of basic R operations (such as passing through `c()` or `ifelse(,,)` without losing their class label). It is hard to say a factor is treated as a first-class citizen that correctly “supports all the operations generally available to other entities” (quote taken from Wikipedia: First-class_citizen). R doesn’t seem to trust leaving factor data types in factor data types (which should give one pause about doing the same).

The reason these differences are not mere curiosities is: in any code where we are expecting one behavior and we experience another, we have a bug. So these conversions or abstraction leaks cause system brittleness, which can lead to verbose, hard-to-test, overly defensive code (see Postel’s law; not sure who to be angry with for some of the downsides of being required to code defensively).

September 9, 1947 Grace Murray Hopper “First actual case of bug being found.”

(image: Computer History Museum)

Why should we expect a factor to behave like a character vector? Why not expect it to behave like an integer vector? The reason is: we supplied a character vector and R’s default behavior in `data.frame()` was to convert it to a factor. R’s behavior only makes sense under the assumption there is some commonality of behavior between factors and character vectors. Otherwise R has made a surprising substitution and violated the principle of least astonishment. To press the point further: from an object oriented view (which is a common way to talk about the separation of concerns of interface and implementation) a valid substitution should at the very least follow some form of the Liskov substitution principle, with factor being a valid sub-type of character vector. But this is *not* possible between mutable versions of factor and character vector, so the substitution should not have been offered.

What we are trying to point out is: design is not always just a matter of taste. With enough design principles in mind (such as least astonishment, Liskov substitution, and a few others) you can actually say some design decisions are wrong (and maybe even some day say some other design decisions are right). There are very few general principles of software system design, so you really don’t want to ignore the few you have.

One possible criticism of my examples is: “You have done everything wrong, *everybody* knows to set `stringsAsFactors=FALSE`.” I call this the “Alice’s Adventures in Wonderland” defense. In my opinion the user is a guest and it is fair for the guest to initially assume default settings are generally the correct or desirable settings. The relevant “Alice’s Adventures in Wonderland” quote being:

At this moment the King, who had been for some time busily writing in his note-book, cackled out ‘Silence!’ and read out from his book, ‘Rule Forty-two. All persons more than a mile high to leave the court.’

Everybody looked at Alice.

‘I’m not a mile high,’ said Alice.

‘You are,’ said the King.

‘Nearly two miles high,’ added the Queen.

‘Well, I shan’t go, at any rate,’ said Alice: ‘besides, that’s not a regular rule: you invented it just now.’

‘It’s the oldest rule in the book,’ said the King.

‘Then it ought to be Number One,’ said Alice.

(text: Project Gutenberg)

(image from Wikipedia)

Another obvious criticism is: “You have worked hard to write bugs.” That is not the case; I have worked hard to make consequences direct and obvious. Where I first noticed my bug was code deep in an actual project, similar to the following example. First let’s build a synthetic data set where `y~f(x)` and `x` is a factor or categorical variable.

```
# build a synthetic data set
set.seed(36236)
n <- 50
d <- data.frame(x=sample(c('a','b','c','d','e'),n,replace=TRUE))
d$train <- FALSE
d$train[sample(1:n,n/2)] <- TRUE
print(summary(d$x))
##  a  b  c  d  e
##  4  7 12 14 13
# build noisy y = f(x), with f('a')==f('b')
vals <- rnorm(length(levels(d$x)))
vals[2] <- vals[1]
names(vals) <- levels(d$x)
d$y <- rnorm(n) + vals[d$x]
print(vals)
##          a          b          c          d          e
##  1.3394631  1.3394631  0.3536642  1.6990172 -0.5423986
# build a model
model1 <- lm(y~0+x,data=subset(d,train))
d$pred1 <- predict(model1,newdata=d)
print(summary(model1))
##
## Call:
## lm(formula = y ~ 0 + x, data = subset(d, train))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.53459 -0.43303 -0.07942  0.49278  2.20614
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## xa   2.9830     0.7470   3.993 0.000715 ***
## xb   2.0506     0.5282   3.882 0.000926 ***
## xc   1.2824     0.3993   3.212 0.004378 **
## xd   2.3644     0.3993   5.922  8.6e-06 ***
## xe  -1.1541     0.4724  -2.443 0.023974 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.056 on 20 degrees of freedom
## Multiple R-squared: 0.8046, Adjusted R-squared: 0.7558
## F-statistic: 16.47 on 5 and 20 DF, p-value: 1.714e-06
```

Our first model is good. But during the analysis phase we might come across some domain knowledge, such as: `'a'` and `'b'` are actually equivalent codes. We could reduce fitting variance by incorporating this knowledge in our feature engineering. In this example it won’t be much of an improvement; we are not merging much and not eliminating many degrees of freedom. In a real production example this can be a very important step, where you may have a domain-supplied roll-up dictionary that merges a large number of levels. However, what happens is our new merged column gets quietly converted to a column of integers, which is then treated as a numeric column in the following modeling step. So the merge is in fact disastrous: we lose the categorical structure of the variable. We can, of course, re-institute the structure by calling `as.factor()` if we know about the problem (which we might not), but even then we have lost the string labels for new integer level labels (making debugging even harder). Let’s see the failure we are anticipating; notice how the training adjusted R-squared disastrously drops from 0.7558 to 0.1417 after we attempt our “improvement.”

```
# try (and fail) to build an improved model
# using domain knowledge f('a')==f('b')
d$xMerged <- ifelse(d$x=='b',factor('a',levels=levels(d$x)),d$x)
print(summary(as.factor(d$xMerged)))
##  1  3  4  5
## 11 12 14 13
# disaster! xMerged is now class integer
# which is treated as numeric in lm, losing a lot of information
model2 <- lm(y~0+xMerged,data=subset(d,train))
print(summary(model2))
##
## Call:
## lm(formula = y ~ 0 + xMerged, data = subset(d, train))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.3193 -0.5818  0.8281  1.6237  3.5451
##
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)
## xMerged   0.2564     0.1132   2.264   0.0329 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.98 on 24 degrees of freedom
## Multiple R-squared: 0.176, Adjusted R-squared: 0.1417
## F-statistic: 5.128 on 1 and 24 DF, p-value: 0.03286
```

There is an obvious method to merge the levels correctly: convert back to character (which we show below). The issue is: if you don’t know about the conversion to integer happening, you may not know to look for it and correct it.

```
```# correct f('a')==f('b') merge
d$xMerged <- ifelse(d$x=='b','a',as.character(d$x))
model3 <- lm(y~0+xMerged,data=subset(d,train))
d$pred3 <- predict(model3,newdata=d)
print(summary(model3))
##
## Call:
## lm(formula = y ~ 0 + xMerged, data = subset(d, train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.53459 -0.51084 -0.05408 0.71385 2.20614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xMergeda 2.3614 0.4317 5.470 1.99e-05 ***
## xMergedc 1.2824 0.3996 3.209 0.00422 **
## xMergedd 2.3644 0.3996 5.916 7.15e-06 ***
## xMergede -1.1541 0.4729 -2.441 0.02361 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.057 on 21 degrees of freedom
## Multiple R-squared: 0.7945, Adjusted R-squared: 0.7553
## F-statistic: 20.3 on 4 and 21 DF, p-value: 5.693e-07
dTest <- subset(d,!train)
nTest <- dim(dTest)[[1]]
# Root Mean Square Error of original model on test data
print(sqrt(sum((dTest$y-dTest$pred1)^2)/nTest))
## [1] 1.330894
# Root Mean Square Error of f('a')==f('b') model on test data
print(sqrt(sum((dTest$y-dTest$pred3)^2)/nTest))
## [1] 1.297682
```

Factors are definitely useful, and I am glad R has them. I just wish they had fewer odd behaviors. My rule of thumb is to use them as late as possible: set `stringsAsFactors=FALSE`, and if you need factors in some place, convert from character near that place.

Please see the following articles for more ideas on working with categorical variables and preparing data for analysis.

]]>The story is an inside joke referring to something really only funny to one of the founders. But a joke that amuses the teller is always enjoyed by at least one person. Win-Vector LLC’s John Mount had the honor of co-authoring a 1997 paper titled “The Polytope of Win Vectors.” The paper title is obviously mathematical terms in an odd combination. However the telegraphic grammar is coincidentally similar to deliberately ungrammatical gamer slang such as “full of win” and “so much win.”

If we treat “win” as a concrete noun (say, something you can put in a sack) and “vector” in its *non-mathematical* sense (as an entity of infectious transmission) we have “Win-Vector LLC is an infectious delivery of victory.” I.e.: we deliver success to our clients. Of course, we have now attempted to explain a weak joke. It is not as grand as “winged victory,” but it does encode a positive company value: Win-Vector LLC delivers successful data science projects and training to clients.

Winged Victory: from Wikipedia

Let’s take this as an opportunity to describe what a win vector is.

We take the phrase “win vector” from a technical article titled “The Polytope of Win Vectors” by J.E. Bartels, J. Mount, and D.J.A. Welsh (Annals of Combinatorics 1, 1997, pp. 1-15). The paper concerns the possible outcomes of game tournaments (or other things that can be expressed as tournaments). For example: we could have four teams (A, B, C, and D) scheduled to play each other a number of times, as indicated in the diagram below.

This graph is just saying that in the tournament A will play B 5 times, B will not play C, and so on. We assume each game can end in a win for one team (giving them 1 point), or a loss or tie (giving them zero points). We can record a summary of the tournament outcomes as a vector (vector now back in its mathematical sense) that just records how often each team won. For example the vector [10,1,1,0] is a win vector compatible with the above diagram (it encodes A winning all matches and D losing all matches). The vector [0,0,0,5] is not a valid win vector for the diagram, as D did not play 5 games (so can not have 5 wins). (The Win-Vector LLC logo is itself a stylized single-game tournament diagram, with the directed arrow representing victory and being reminiscent of vectors in the mathematical sense.)
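To make the definition concrete, a small brute-force sketch can enumerate every win vector compatible with a schedule. This is in Python, with a hypothetical three-team schedule rather than the paper's four-team diagram, and `win_vectors` is our own illustrative helper:

```python
from itertools import product

def win_vectors(teams, schedule):
    """Enumerate all distinct win vectors of a tournament by brute force.
    schedule: list of (team_a, team_b, n_games) triples."""
    games = []
    for a, b, n in schedule:
        games.extend([(a, b)] * n)
    vectors = set()
    # each game ends in a tie (no points), a point for side a, or a point for side b
    for outcome in product([None, 0, 1], repeat=len(games)):
        tally = {t: 0 for t in teams}
        for (a, b), o in zip(games, outcome):
            if o is not None:
                tally[(a, b)[o]] += 1
        vectors.add(tuple(tally[t] for t in teams))
    return vectors

# hypothetical schedule: A plays B twice, B plays C once (A and C never meet)
vs = win_vectors(['A', 'B', 'C'], [('A', 'B', 2), ('B', 'C', 1)])
# (2, 1, 0) is a valid win vector: A sweeps A-B and B beats C.
# (0, 0, 2) is not: C only plays one game, so C cannot have two wins.
```

The same "D cannot have 5 wins" argument from the four-team example shows up here as vectors whose entries exceed a team's number of scheduled games.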

The idea is that a win vector might be treated as a sufficient statistic for the tournament. Or more accurately the win vector may be all that is known about a previously run tournament. Such censored observations may be all that is possible in field biology where wins represent territory or offspring. The question is then: given knowledge of the tournament structure (the graph) and the summary of outcomes (the win vector) is there evidence one team is dominant, or are the effects random? So we have well-formed statistical questions about effect strength and significance.

The question of significance is: when we introduce a notion of effect strength, how likely are we to see an effect of that size assuming identical players? For example, if we make our notion of effect strength the maximum ratio of wins to plays seen in the win vector, should we consider this evidence of a strong player, or is it to be expected from random fluctuation? We need to estimate how strong a conditioning effect our tournament constraints impose on unobserved outcomes (to determine if irregularities in distribution come from player strengths or from tournament mis-design).

Relating distributions of unobserved details to observed totals (or margins) is one of the most fundamental problems in statistics. We have written on it many times (two examples: Google ad market reporting and checking scientific claims). In all cases you would be better off with direct detailed observations (i.e. without the censorship); but often you have to work with the data you have instead of the experiment you would design.

The math is a little easier to explain for a related problem: working out the number of ways to fill in a matrix with non-negative integers to meet given row and column totals. I’ll move on to discuss this contingency table problem a bit.

The statistical ideas largely come from “Testing for Independence in a Two-Way Table: New Interpretations of the Chi-Square Statistic”, Persi Diaconis and Bradley Efron, Ann. Statist., Vol. 13, No. 3, 1985, pp. 845-874. A contingency table is a matrix of non-negative integers, and the statistical problem is relating known row and column totals to possible fill-ins. In this paper the authors criticize some of the standard significance tests (chi-square, Fisher’s exact test) and propose a parameterized family of tests that at the extreme end considers a null-model of uniform fill-ins (each possible fill-in equally likely). Obviously a uniform model is very different from the more standard distributions, which tend to have cell counts more highly concentrated around their means. But the idea is: this proposed test takes more of the structure of the margin totals into account (or equivalently assumes away fewer of the margin-mediated cell dependencies) and has its own merits.

However, we are actually describing the work of mathematicians and theoretical computer scientists. In that style you only speak with “applied types” (such as theoretical statisticians) to justify working on a snappy math problem. In this case: counting the number of ways to fill in a contingency table, or the number of detailed results compatible with a given win vector (the link between counting and generation having been strongly established in “Randomised Algorithms for Counting and Generating Combinatorial Structures”, A.J. Sinclair, Ph.D. thesis, University of Edinburgh (1988), and related works).

The contingency table problem is partially solved in:

- “Sampling contingency tables” Martin Dyer, Ravi Kannan, John Mount, Random Structures and Algorithms Vol. 10, no. 4, July 1997 pp. 487-506.
- “Fast Unimodular Counting” John Mount, Combinatorics Probability and Computing, Vol. 9, No. 3, May 2000, pp 277-285.

The second paper (strengthening some results from my Ph.D. thesis) lets you calculate that the number of ways to fill in the following four by four contingency table with non-negative integers to meet the shown row and column totals is exactly `350854066054593772938684218633979710637454260` (about `3.508541e+44`).

```
 x(0,0) x(0,1) x(0,2) x(0,3) 154179
x(1,0) x(1,1) x(1,2) x(1,3) 255424
x(2,0) x(2,1) x(2,2) x(2,3) 277000
x(3,0) x(3,1) x(3,2) x(3,3) 160179
191780 288348 165221 201433
```

The point being: the table could arise as the summary of a data set with `846782` (`= 191780 + 288348 + 165221 + 201433`) items; to characterize probabilities over such tables you need good methods to sample over the astronomical family of potential alternate fill-ins (and this is where you apply the link between counting and sampling for self-reducible problem families). We have example code, notes, an improved runtime proof, and results here.

“The Polytope of Win Vectors” introduced additional ideas from integral polymatroids to more strongly relate volume to number of integer vectors (and gets more complete theoretical results for its problem).

All the “big hammer” math is trying to extend some of the beauty of G.H. Hardy and J.E. Littlewood, “Some problems of Diophantine approximation: the lattice points of a right-angled triangle,” Hamburg. Math. Abh., 1 (1921), pp. 212–249, to more general settings.

Or more succinctly: we just like the word “win.”

]]>**What is the Gauss-Markov theorem?**

From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:

A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.

This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.

Suppose you are building a model of the form:

```
y(i) = B . x(i) + e(i)
```

where `B` is a vector (to be inferred), `i` is an index that runs over the available data (say `1` through `n`), `x(i)` is a per-example vector of features, and `y(i)` is the scalar quantity to be modeled. Only `x(i)` and `y(i)` are observed. The `e(i)` term is the un-modeled component of `y(i)`, and you typically hope the `e(i)` can be thought of as unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak or strong the assumptions you put on the `e(i)` (and other quantities) are depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of `B` under weak assumptions.
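For concreteness, the least squares estimate of `B` discussed below is the usual closed form `B = (X'X)^{-1} X' y` (the R code later in this post computes the same quantity via `solve(t(xAll) %*% xAll) %*% t(xAll)`). A quick numpy sketch with made-up coefficients `(3, 2)`, not data from this article:

```python
import numpy as np

# simulate y(i) = B . x(i) + e(i) with B = (3, 2) and e(i) ~ N(0, 1)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.uniform(0, 10, 1000)])  # x(i) rows
B_true = np.array([3.0, 2.0])
y = X @ B_true + rng.normal(size=1000)

# ordinary least squares: B_hat = (X'X)^{-1} X' y
B_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(B_hat)  # close to the true (3, 2)
```

The Gauss-Markov theorem is a claim about exactly this estimator: among linear unbiased estimators it has the smallest dispersion, given the stated conditions on the `e(i)`.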

**How to interpret the theorem**

The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the `e(i)` and without distributional assumptions about the `x(i)`. However, if you are using Bayesian methods or generative models for predictions you *may want* to use additional stronger conditions (perhaps even normality of errors and *even* distributional assumptions on the `x`s).

We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.

**Wikipedia’s stated pre-conditions of the Gauss-Markov theorem**

To apply the Gauss-Markov theorem the Wikipedia says you must assume your data has the following properties:


```
E[e(i)] = 0 (lack of structural errors, needed to avoid bias)
V[e(i)] = c (equal variance, one form of homoscedasticity)
cov[e(i),e(j)] = 0 for i!=j (non-correlation of errors)
```


It is always important to know precisely what probability model the expectation (`E[]`), variance (`V[]`), and covariance (`cov[]`) operators are working over in the Wikipedia conditions. This is usually left implicit, but it is critical to know exactly what is being asserted. When reading or listening to statistical or probabilistic work you should *always* insist on a concrete description of the probability model underlying all the notation (the `E[]`s and `V[]`s). A lot of confusion and subtle tricks get hidden by not sharing an explicit description of the probability model.

**Probability models**

Two plausible probability models are:

- Frequentist: unobserved parameters are held constant and all probabilities are over re-draws of the data. At first guess you would think this is the correct model for this problem, as the content of the Gauss-Markov theorem is about how estimates drawn from a larger population perform in expectation.
- x-Generative: This is not standard and not immediately implied by the notation (and represents a fairly strong set of assumptions). In this model all of the observed `x`s are held constant and the unobserved `e`s and `y`s are regenerated with respect to the `x`s. This is similar to a Bayesian generative model, except in the usual Bayesian formulation all observables (both `x`s and `y`s) are held fixed. We only introduce this model as it seems to be the simplest one which makes for a workable interpretation of the Wikipedia statements.

The issue is: the conditions as stated are not strong enough to ensure actual homoscedasticity (or even non-structure of errors/bias) needed to apply the Gauss-Markov theorem under a strict frequentist model. So we must go venue-shopping and find what model is likely intended. An easy way to do this is to design synthetic data that is considered well-behaved under one model and not under the other.

**A source of examples**

Let’s use a deliberately naive empirical view of data. Suppose the entire possible universe of data is `X(i),Y(i),Z(i) i=1...k` for some `k` (`k` and the `X(i),Y(i),Z(i)` all finite real vectors). Our chosen explicit probability model for generating the observed data `x(i),y(i)` and unobserved `e(i)` is the following. We pick a length-`n` sequence of integers `s(1),...,s(n)`, where each `s(i)` is picked uniformly and independently from `1...k`, and add a bit of unique noise. Our sample data is then (only `x(i),y(i)` are observed; `e(i)` is an unobserved notional quantity):


```
(x(i),y(i),e(i)) = (X(s(i)),Y(s(i))+t(i),Z(s(i))+t(i)) for i=1...n,
where t(i) is an independent normal variable with mean 0 and variance 1
```


This is similar to a standard statistical model (empirical re-sampling from a fixed set, and designed to be similar to a sampling distribution). `Z(i)` represents an idealized error term and `e(i)` represents a per-sample unobserved realization of `Z(i)`. It is a nice model because the `e(i)` are independently identically distributed (and so are the `x(i)` and `y(i)`, though obviously there can be dependencies between the `x`, `y`, and `e`s). This model can be thought of as “too nice” as it isn’t powerful enough to capture the full power of the Gauss-Markov theorem (it can’t express non-independent identically distributed situations). However it can concretely embody situations that do meet the Gauss-Markov conditions and be used to work clarifying examples.

**Good examples under the frequentist probability model**

Let’s see what conditions on `X(i),Y(i),Z(i) i=1...k` are needed to meet the Gauss-Markov pre-conditions assuming a frequentist probability model.

- The first one is easy: `E[e(i)] = 0` if and only if `sum_{j=1...k} Z(j) = 0`.
- When we have `E[e(i)]=0` the second condition (homoscedasticity as stated) simplifies to `V[e(i)] = E[(e(i) - E[e(i)])^2] = E[e(i)^2] = E[Z^2] + 1`, which is independent of `i`.
- When we have `E[e(i)]=0` the third condition simplifies to `E[e(i) e(j)] = 0` for `i!=j`, and this follows immediately from our overly strong condition of the index selections `s(i)` being independent (giving us `E[e(i) e(j)] = E[e(i)] E[e(j)] = 0` for `i!=j`).

So all we need is `sum_{j=1...k} Z(j) = 0`, and then the other conditions hold. This seems too easy, and is evidence that the frequentist probability model is not the model intended by Wikipedia. We will confirm this with a specific counter-example later.

**Good examples under the x-generative probability model**

Under the x-generative probability model (and this is *not* standard terminology) the Wikipedia conditions are more properly written conditionally:


```
E[e(i)|x(i)] = 0
V[e(i)|x(i)] = c
cov[e(i),e(j)|x(i),x(j)] = 0 for i!=j
```


Or more precisely: if the conditions had been written in their conditional form we wouldn’t have to contrive a phrase like “x-generative model” to ensure the correct interpretation. These conditions are strict. Checking or ensuring these properties is a problem when `x` is continuous and we have a joint description of how `x,y,e` are generated (instead of a hierarchical one). These conditions as stated are strong enough to support the Gauss-Markov theorem, but probably in fact stronger than the minimum or canonical conditions. But let’s see how they work.

To meet these conditions our `Z(i)` must pretty much be free of dependence on `x(i)` (even dependence snuck through the index `i`). This is somewhat unsatisfying, as our overly simple modeling framework (producing `x,y,e` from `X,Y,Z`) combined with these strong conditions doesn’t really model much more than identical independence (so it does not capture the full breadth of the Gauss-Markov theorem). The frequentist conditions are too lenient to work, and the x-generative/conditioned conditions seem too strong (at least when combined with our simplistic source of examples).

**A good example**

The following R example (also available here) shows a data set generated under our framework where the Gauss-Markov theorem applies (under either probability model). In this case the true `y` is produced as an actual linear function of `x` plus iid (independent identically distributed) noise. This model meets the pre-conditions of the Gauss-Markov theorem (under both the frequentist and x-generative models). We observe that the empirical samples average out to the correct theoretical coefficients taken from the original universal population. All of the calculations are designed to match the quantities discussed in the Wikipedia derivations.

```
library(ggplot2)
workProblem <- function(dAll,nreps,name,sampleSize=10) {
  # least squares coefficients over the whole universe: beta = (X'X)^{-1} X' y
  xAll <- matrix(data=c(dAll$x0,dAll$x1),ncol=2)
  cAll <- solve(t(xAll) %*% xAll) %*% t(xAll)
  beta <- as.numeric(cAll %*% dAll$y)
  betaSamples <- matrix(data=0,nrow=2,ncol=nreps)
  nrows <- dim(dAll)[[1]]
  for(i in 1:nreps) {
    # draw a sample with replacement and add fresh per-example noise
    dSample <- dAll[sample.int(nrows,sampleSize,replace=TRUE),]
    individualError <- rnorm(sampleSize)
    dSample$y <- dSample$y + individualError
    dSample$e <- dSample$z + individualError
    xSample <- matrix(data=c(dSample$x0,dSample$x1),ncol=2)
    cSample <- solve(t(xSample) %*% xSample) %*% t(xSample)
    betaS <- as.numeric(cSample %*% dSample$y)
    betaSamples[,i] <- betaS
  }
  # summarize the per-sample estimates (mean and +/- 2 standard errors)
  d <- c()
  for(i in 1:(dim(betaSamples)[[1]])) {
    coef <- paste('beta',(i-1),sep='')
    mean <- mean(betaSamples[i,])
    dev <- sqrt(var(betaSamples[i,])/nreps)
    d <- rbind(d,data.frame(nsamples=nreps,model=name,coef=coef,
       actual=beta[i],est=mean,estP=mean+2*dev,estM=mean-2*dev))
  }
  d
}
repCounts <- as.integer(floor(10^(0.25*(4:24))))
print('good example')
## [1] "good example"
set.seed(2623496)
dGood <- data.frame(x0=1,x1=0:10)
dGood$y <- 3*dGood$x0 + 2*dGood$x1
dGood$z <- dGood$y - predict(lm(y~0+x0+x1,data=dGood))
print(dGood)
## x0 x1 y z
## 1 1 0 3 -9.326e-15
## 2 1 1 5 -7.994e-15
## 3 1 2 7 -7.105e-15
## 4 1 3 9 -5.329e-15
## 5 1 4 11 -5.329e-15
## 6 1 5 13 -3.553e-15
## 7 1 6 15 -1.776e-15
## 8 1 7 17 -3.553e-15
## 9 1 8 19 0.000e+00
## 10 1 9 21 0.000e+00
## 11 1 10 23 0.000e+00
print(summary(lm(y~0+x0+x1,data=dGood)))
## Warning: essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ 0 + x0 + x1, data = dGood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77e-15 -1.69e-15 -5.22e-16 4.48e-16 6.53e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x0 3.00e+00 1.58e-15 1.9e+15 <2e-16 ***
## x1 2.00e+00 2.67e-16 7.5e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.8e-15 on 9 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.47e+32 on 2 and 9 DF, p-value: <2e-16
print(workProblem(dGood,10,'good/works',10000))
## nsamples model coef actual est estP estM
## 1 10 good/works beta0 3 3.006 3.016 2.995
## 2 10 good/works beta1 2 1.999 2.001 1.997
pGood <- c()
set.seed(2623496)
for(reps in repCounts) {
  pGood <- rbind(pGood,workProblem(dGood,reps,'goodData'))
}
ggplot(data=pGood,aes(x=nsamples)) +
  geom_line(aes(y=actual)) +
  geom_line(aes(y=est),linetype=2,color='blue') +
  geom_ribbon(aes(ymax=estP,ymin=estM),alpha=0.2,fill='blue') +
  facet_wrap(~coef,ncol=1,scales='free_y') + scale_x_log10() +
  theme(axis.title.y=element_blank())
```

Notice the code is using the “return data frames” principle. The derived graph shows what we expect from an unbiased low-variance estimate: convergence to the correct values as we increase the number of repetitions.

**A bad example**

The following R example meets all of the *Wikipedia stated* conditions of the Gauss-Markov theorem under a frequentist probability model, but doesn’t exhibit unbiased estimates on small samples, let alone minimal variance. It does produce correct estimates on large samples (so one could work with it): the ideal distribution and large-sample estimates are unbiased (though with some ugly structure), yet small samples appear biased.

This bad example is essentially given as `y = x^2`, where we haven’t made `x^2` available to the model (only `x`). So this data set doesn’t actually follow the assumed linear modeling structure. However, we can be sophists and claim the effect to model is `y = 10*x - 15 + e` (which is linear in the features we are making available) and the error term is in fact `e = x^2 - 10*x + 15 + individualError` (which does have an expected value of zero when `x` is sampled uniformly from the integers `0...10`).
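That expected-value claim is just finite arithmetic over the 11 population points, which we can check directly (a quick Python verification):

```python
# population error term of the bad example: z = x^2 - 10*x + 15 over x = 0..10
zs = [x**2 - 10*x + 15 for x in range(11)]
print(sum(zs))                                    # 0: so E[e(i)] = 0 under uniform sampling
print(sum(z * x for z, x in zip(zs, range(11))))  # 0: z is even orthogonal to x over the population
```

(The second sum corresponds to the `sum(dBad$z*dBad$x1)` check in the R code below.)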

This data set is designed to slip past the Gauss-Markov theorem pre-conditions under the frequentist interpretation. As we have shown, all we need to do is check that `sum_{k} Z(k)` is zero and the rest of the properties follow. In our case we have `sum_{k} Z(k) = sum_{x=0...10} (x^2 - 10*x + 15) = 0`. This data set does not slip past the Gauss-Markov theorem pre-conditions under the x-generative model, as the obviously structured error term is what those conditions are designed to prohibit/avoid. This sets us up for the following syllogism.

- This data set satisfies the Gauss-Markov theorem pre-conditions under the frequentist model.
- Our R simulation shows the data set doesn’t satisfy the conclusions of the Gauss-Markov theorem.
- We can then conclude the Gauss-Markov theorem pre-conditions can’t be based on the frequentist model.

We confirm this with the following R-simulation.


```
dBad <- data.frame(x0=1,x1=0:10)
dBad$y <- dBad$x1^2 # or y = -15 + 10*x1 with structured error
dBad$z <- dBad$y - predict(lm(y~0+x0+x1,data=dBad))
print('bad example')
## [1] "bad example"
print(dBad)
## x0 x1 y z
## 1 1 0 0 15
## 2 1 1 1 6
## 3 1 2 4 -1
## 4 1 3 9 -6
## 5 1 4 16 -9
## 6 1 5 25 -10
## 7 1 6 36 -9
## 8 1 7 49 -6
## 9 1 8 64 -1
## 10 1 9 81 6
## 11 1 10 100 15
print(summary(lm(y~0+x0+x1,data=dBad)))
##
## Call:
## lm(formula = y ~ 0 + x0 + x1, data = dBad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0 -7.5 -1.0 6.0 15.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x0 -15.000 5.508 -2.72 0.023 *
## x1 10.000 0.931 10.74 2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.76 on 9 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.959
## F-statistic: 128 on 2 and 9 DF, p-value: 2.42e-07
print(workProblem(dBad,10,'bad/works',10000))
## nsamples model coef actual est estP estM
## 1 10 bad/works beta0 -15 -14.92 -14.81 -15.023
## 2 10 bad/works beta1 10 9.99 10.01 9.971
print(sum(dBad$z*dBad$x0))
## [1] -7.816e-14
print(sum(dBad$z*dBad$x1))
## [1] -1.013e-13
pBad <- c()
set.seed(2623496)
for(reps in repCounts) {
  pBad <- rbind(pBad,workProblem(dBad,reps,'badData'))
}
ggplot(data=pBad,aes(x=nsamples)) +
  geom_line(aes(y=actual)) +
  geom_line(aes(y=est),linetype=2,color='blue') +
  geom_ribbon(aes(ymax=estP,ymin=estM),alpha=0.2,fill='blue') +
  facet_wrap(~coef,ncol=1,scales='free_y') + scale_x_log10() +
  theme(axis.title.y=element_blank())
```


Notice that even when we drive the number of repetitions high enough to collapse the error bars we still have one of the coefficient estimates routinely below its ideal value. This is what a biased estimation procedure looks like. Again, it isn’t strictly correct to say the problem is due to heteroscedasticity, as we are seeing bias (not just systematic changes in the magnitude of variation).

The reason the average of small samples retains bias in this example is: least squares fitting is a non-linear function of the `x`s (it is only linear in the `y`s). Without an additional argument (such as the Gauss-Markov theorem) to appeal to, there is no a priori reason to believe an average of non-linear estimates will converge to the original population values. However, we feel it is much easier to teach a conclusion like this from stronger assumptions, such as independent identically distributed errors, than from homoscedasticity. The gain in generality from basing inference on homoscedasticity is not really so large, and the loss in clarity is expensive. The main downside of basing inference on independent identically distributed errors appears to be: you get accused of not knowing the Gauss-Markov theorem.

**What is homoscedasticity/heteroscedasticity?**

Heteroscedasticity is a general *undesirable* modeling situation where the variability of some of your variables changes from sub-population to sub-population. That is what the Wikipedia requirement is trying to get at with `V[e(i)]=c`. However, as we move from informal text definitions to actual strict mathematics we have to precisely specify: what is varying with respect to what, and which sub-populations do we consider identifiable?

Also be aware that while data with structured errors (the sign of errors being somewhat predictable from the `x`s, or even from omitted variables) cannot be homoscedastic, it is not traditional to call such situations heteroscedastic (but instead to point out the structural error and say that in the presence of such problems the homoscedastic/heteroscedastic question does not apply).

We would also point out that B.S. Everitt’s “The Cambridge Dictionary of Statistics” 2nd edition does not have primary entries for homoscedastic or heteroscedastic. Our opinion is not that Everitt forgot them or did not know of them. Rather, Everitt likely judged that the criticism he would get for leaving these entries out of his dictionary would be less than the loss of clarity/conciseness that would come from including them (and the verbiage needed to respect their detailed historic definitions and conventions).

For our part: we have come to regret ever having used the term “heteroscedasticity” (which we adopted only out of respect to our sources, which use the term). It is far simpler to introduce an ad-hoc term like *structural errors* and supply a precise definition and examples of what is meant in concise mathematical notation. What turns out to be complicated is using standard statistical terminology, which comes with a lot of conventions and historic linguistic baggage. Part of the problem is of course that our own background is mathematics, not statistics. In mathematics term definitions tend to be revised to fit use and intended meaning, instead of being frozen to document priority (as is more common in the sciences).

**Summary/conclusions**

Many probability/statistical write-ups fail to explicitly identify what probability model actually underlies operators such as `E[]`, `V[]`, and `cov[]`. This is for brevity, and is pretty much the standard convention. Common probability models to consider include: frequentist (all parameters held constant and data regenerated), Bayesian (all observables held constant and probability statements made over distributions of unobserved quantities and parameters), and ad-hoc generative/conditional distributions (as we used here). The issue is: different probability models give different answers. Usually this is not a problem because, by the same token, probability models encode so much about intent that you can usually infer the right one from knowing the intent.

Most common sampling questions use a frequentist model/interpretation (for example see Bayesian and Frequentist Approaches: Ask the Right Question). The issue is: under that rubric the statement that there is a `c` such that `V[e(i)] = c` doesn’t carry a lot of content. What is probably meant/intended are strong conditional distribution statements like `E[e(i)|x(i)]=0` and `V[e(i)|x(i)]=c`. A quick proof analysis shows the derivations in the Wikipedia article are definitely pushing the `E[]` operator through the `X`s as if the `X`s were constants independent of the sample/experiment. This is not correct in general (as our bad example showed), but it is a legitimate step if all operators are conditioned on `X` (though again, that is a fairly strong condition).

Part of this is just a reminder that the Wikipedia is an encyclopedia, not a primary source. The other part is: don’t let statistical bullies force you away from clear thoughts and definitions.

For example: it is considered vulgar or ignorant to assume something as strong as independent identically distributed errors. The feeling is: the conclusion of the Gauss-Markov theorem gives facts about only the first two moments of a distribution, so the invoked pre-conditions should only use facts about the first two moments of any input distributions. But philosophically, assuming identical errors makes sense: errors we can’t tell apart in some sense *must* be treated as identical (as we can’t tell them apart). A data scientist, if asked why they believe the residuals hidden in their data may be homoscedastic, is more likely to appeal to some sort of assumed independent generative structure in their problem (which is itself not as weak or as general as homoscedasticity) than to point to an empirical test of homoscedasticity (which can itself be unreliable).

A lot tends to be going on in statistics papers (probabilities, interpretation, reasoning over counterfactuals, math, and more) so expect technical terminology (or even argot), implied conventions, and telegraphic writing. Correct comprehension often requires introducing and working your own examples.

]]>