Current schedule/location details after the click.

Hadoop Effortlessly: A Data Inventory is Key to Data Self-service 10/16/2014 1:45pm - 2:25pm EDT (40 minutes) Room: 1 E05 http://en.oreilly.com/stratany2014/public/schedule/detail/37956

Office Hour with John Mount (Win Vector LLC) 10/16/2014 2:35pm - 3:15pm EDT (40 minutes) Room: Table C http://en.oreilly.com/stratany2014/public/schedule/detail/37989

Also, look for us and “Practical Data Science with R” at Waterline Data Science’s Strata booth (booth 553).

Javits Center 655 W 34th Street New York, NY 10001

For more updates (events, book discounts), follow us on Twitter: @WinVectorLLC.

There is one caveat: if you are evaluating a series of models to pick the best (and you usually are), then a single hold-out set is, strictly speaking, not enough. Hastie et al. say it best:

Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.

– Hastie, Tibshirani and Friedman, *The Elements of Statistical Learning*, 2nd edition.

The ideal way to select a model from a set of candidates (or set parameters for a model, for example the regularization constant) is to use a training set to train the model(s), a calibration set to select the model or choose parameters, and a test set to estimate the generalization error of the final model.
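The three-way split described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `splitThreeWays()`, the 60/20/20 fractions, the toy data, and the two candidate models are all hypothetical and are not taken from the article.

```r
# Hypothetical helper: randomly label each row as train, cal, or test.
splitThreeWays <- function(nRows,
                           fractions = c(train = 0.6, cal = 0.2, test = 0.2)) {
  stopifnot(abs(sum(fractions) - 1) < 1e-8)
  sample(names(fractions), nRows, replace = TRUE, prob = fractions)
}

set.seed(2014)
d <- data.frame(x = rnorm(200))
d$y <- 2 * d$x + rnorm(200)
d$group <- splitThreeWays(nrow(d))

# root mean squared error of a model on a data frame with outcome column y
rmse <- function(model, dframe) {
  sqrt(mean((dframe$y - predict(model, newdata = dframe))^2))
}

# candidate models are fit on the training rows only
candidates <- list(
  linear = lm(y ~ x, data = d[d$group == "train", ]),
  cubic  = lm(y ~ poly(x, 3), data = d[d$group == "train", ])
)

# the calibration rows are used to pick between the candidates
calErr <- sapply(candidates, rmse, dframe = d[d$group == "cal", ])
best <- names(which.min(calErr))

# the test rows are used once, to estimate the error of the chosen model
testErr <- rmse(candidates[[best]], d[d$group == "test", ])
```

Because the test rows play no part in fitting or selection, `testErr` is not biased by the selection step in the way the quote warns about.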

In many situations, breaking your data into three sets may not be practical: you may not have very much data, or the phenomena you’re interested in are rare enough that you need a lot of data to detect them. In those cases, you will need more statistically efficient estimates for generalization error or goodness-of-fit. In this article, we look at the PRESS statistic, and how to use it to estimate generalization error and choose between models.

**The PRESS Statistic**

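You can think of the PRESS statistic as an “adjusted sum of squared error (SSE).” It is calculated as

$$\mathrm{PRESS} = \sum_{i=1}^{n} (y_i - f_i)^2$$

where *n* is the number of data points in the training set, *y_i* is the outcome of the *i*th data point, and *f_i* is the prediction for the *i*th data point from a model fit to all of the training data except that point.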

For example, if you wanted to calculate the PRESS statistic for linear regression models in R, you could do it this way (though I wouldn’t recommend it):

```
# For explanation purposes only -
# DO NOT implement PRESS this way
brutePRESS.lm = function(fmla, dframe, outcome) {
  npts = dim(dframe)[1]
  ssdev = 0
  for(i in 1:npts) {
    # a data frame with all but the ith row
    d = dframe[-i,]
    # build a model using all but pt i
    m = lm(fmla, data=d)
    # then predict outcome[i]
    pred = predict(m, newdata=dframe[i,])
    # sum the squared deviations
    ssdev = ssdev + (pred - outcome[i])^2
  }
  ssdev
}
```

We have implemented a couple of helper functions to calculate the PRESS statistic (and related measures) for linear regression models more efficiently. You can find the code here. The function `hold1OutLMPreds(fmla, dframe)` returns the vector `f`, where `f[i]` is the prediction on the ith row of `dframe` when fitting the linear regression model described by `fmla` on `dframe[-i,]`. The function `hold1OutMeans(y)` returns a vector `g` where `g[i] = mean(y[-i])`. With these functions, you can efficiently calculate the PRESS statistic for a linear regression model:

```
hopreds = hold1OutLMPreds(fmla, dframe)
devs = y-hopreds
PRESS = sum(devs^2)
```
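One way such a hold-one-out helper could avoid refitting the model *n* times is the standard leverage identity for ordinary least squares: the hold-one-out residual for row *i* equals the ordinary residual divided by `1 - h[i]`, where `h[i]` is the *i*th diagonal entry of the hat matrix. The sketch below assumes an unweighted `lm()` fit; `looPredsSketch()` is a hypothetical stand-in, not the authors' `hold1OutLMPreds()`, and the toy data is invented for the check.

```r
# Hypothetical stand-in for a hold-one-out prediction helper.
# Assumes ordinary (unweighted) least squares.
looPredsSketch <- function(fmla, dframe) {
  m <- lm(fmla, data = dframe)
  y <- fitted(m) + residuals(m)   # observed outcomes used in the fit
  h <- hatvalues(m)               # leverage of each row
  as.numeric(y - residuals(m) / (1 - h))
}

set.seed(5)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- d$x1 - 0.5 * d$x2 + rnorm(30)

fast <- looPredsSketch(y ~ x1 + x2, d)

# brute-force check: refit without row i, then predict row i
slow <- sapply(seq_len(nrow(d)), function(i) {
  predict(lm(y ~ x1 + x2, data = d[-i, ]), newdata = d[i, ])
})

PRESS <- sum((d$y - fast)^2)
SSE <- sum(residuals(lm(y ~ x1 + x2, data = d))^2)
```

The two prediction vectors agree up to floating-point error, and the PRESS is never smaller than the in-sample SSE, since each residual is divided by a number between 0 and 1.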

One disadvantage of the SSE (and the PRESS) is that they are dependent on the data size; you can’t compare a single model’s performance across data sets of different size. You can remove that dependency by going to the root mean squared error (RMSE): `rmse = sqrt(sse/n)`, where `n` is the size of the data set. You can also calculate an equivalent “root mean PRESS” statistic:

```
n = length(y)
hopreds = hold1OutLMPreds(fmla, dframe)
devs = y-hopreds
rmPRESS = sqrt(mean(devs^2))
```

And you can also define a “PRESS R-squared”:

```
n = length(y)
hopreds = hold1OutLMPreds(fmla, dframe)
homeans = hold1OutMeans(y)
devs = y-hopreds
dely = y-homeans
PRESS = sum(devs^2)
PRESS.r2 = 1 - (PRESS/sum(dely^2))
```

The “PRESS R-squared” is one minus the ratio of the model’s PRESS over the “PRESS of y’s mean value;” it adjusts the estimate of how much variation the model explains by using leave-one-out cross-validation rather than adjusting for the model’s degrees of freedom (as the more standard adjusted R-squared does).

You might also consider defining a PRESS R-squared using the in-sample total error (`y-mean(y)`) instead of the 1-hold-out mean; we decided on the latter in an “apples-to-apples” spirit. Note also that PRESS R-squared can be negative if the model is very poor.
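To make the two possible denominators concrete, here is a small sketch. The function `looMeansSketch()` is a hypothetical stand-in for a hold-one-out mean helper (it is not the authors' `hold1OutMeans()`), and the four-element `y` is made up.

```r
# Hypothetical sketch of a hold-one-out mean: g[i] = mean(y[-i]),
# computed without a loop.
looMeansSketch <- function(y) {
  (sum(y) - y) / (length(y) - 1)
}

y <- c(2, 4, 6, 9)
n <- length(y)
g <- looMeansSketch(y)

# two candidate denominators for a PRESS-style R-squared
totalInSample <- sum((y - mean(y))^2)  # in-sample total error
totalHoldOut  <- sum((y - g)^2)        # "PRESS of y's mean value"
```

Since `y[i] - g[i]` equals `(n/(n-1)) * (y[i] - mean(y))`, the hold-one-out denominator is always `(n/(n-1))^2` times the in-sample one, so the two definitions differ only slightly once `n` is large.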

**An Example**

Let’s imagine a situation where we want to predict a quantity *y*, and we have many many potential inputs to use in our prediction. Some of these inputs are truly correlated with *y*; some of them are not. Of course, we don’t know which are which. We have some training data with which to build models, and we will get (but don’t yet have) hold-out data to evaluate the final model. How might we proceed?

First, let’s create a process to simulate this situation:

```
# build a data frame with pure noise columns
# and columns weakly correlated with y
buildExample1 <- function(nRows) {
  nNoiseCols <- 300
  nCorCols <- 20
  copyDegree <- 0.1
  noiseMagnitude <- 0.1
  d <- data.frame(y=rnorm(nRows))
  for(i in 1:nNoiseCols) {
    nm <- paste('noise',i,sep='_')
    d[,nm] <- noiseMagnitude*rnorm(nRows) +
      ifelse(runif(nRows)<=copyDegree, rnorm(nRows), 0)
  }
  for(i in 1:nCorCols) {
    nm <- paste('cor',i,sep='_')
    d[,nm] <- noiseMagnitude*rnorm(nRows) +
      ifelse(runif(nRows)<=copyDegree,d$y,0)
  }
  d
}
```

This function will produce a dataset of `nRows` rows with 20 columns (called `cor_1, cor_2...`) that are weakly correlated with `y` and 300 columns (`noise_1, noise_2...`) that are independent of `y`. The process is designed so that the noise columns and the correlated columns have similar magnitudes and variances. The outcome can be expressed as a linear combination of the correlated inputs, so a linear regression model should give reasonable predictions.

Let's suppose we have two candidate models: one which uses all the variables, and one which magically uses only the intentionally correlated variables.

```
set.seed(22525)
train = buildExample1(1000)
output = "y"
inputs = setdiff(colnames(train), output)
truein = inputs[grepl("^cor",inputs)]

# all variables, including noise
# (noisy model)
fmla1 = paste(output, "~", paste(inputs, collapse="+"))
mod1 = lm(fmla1, data=train)

# only true inputs
# (clean model)
fmla2 = paste(output, "~", paste(truein, collapse="+"))
mod2 = lm(fmla2, data=train)
```

We can extract all the model coefficients that `lm()` deemed significant to p < 0.05 (that is, all the coefficients that are marked with at least one "*" in the model summary).

```
# 0.05 = "*" in the model summary
sigCoeffs = function(model, pmax=0.05) {
  cmat = summary(model)$coefficients
  pvals = cmat[,4]
  plo = names(pvals)[pvals < pmax]
  plo
}

# significant coefficients in the noisy model
sigCoeffs(mod1)
## [1] "noise_41" "noise_59" "noise_66" "noise_117" "noise_207"
## [6] "noise_256" "noise_279" "noise_280" "cor_1" "cor_2"
## [11] "cor_3" "cor_4" "cor_5" "cor_6" "cor_7"
## [16] "cor_8" "cor_9" "cor_10" "cor_11" "cor_12"
## [21] "cor_13" "cor_14" "cor_15" "cor_16" "cor_17"
## [26] "cor_18" "cor_19" "cor_20"
```

In other words, several of the noise inputs appear to be correlated with the output in the training data, just by chance. This means that the noisy model has overfit the data. Can we detect that? Let's look at the SSE and the PRESS:

```
##          name   sse PRESS
## 1 noisy model 203.3 448.6
## 2 clean model 285.8 306.8
```

Looking at the in-sample SSE, the noisy model looks better than the clean model; the PRESS says otherwise. We can see the same thing if we look at the R-squared style measures:

```
##          name     R2  R2adj PRESSr2
## 1 noisy model 0.7931 0.6956  0.5442
## 2 clean model 0.7091 0.7031  0.6884
```

Again, R-squared makes the noisy model look better than the clean model. The adjusted R-squared correctly indicates that the additional variables in the noisy model do not improve the fit, and slightly prefers the clean model. The PRESS R-squared identifies the clean model as the better model, with a much larger margin of difference than the adjusted R-squared.

**The PRESS statistic versus Hold-out Data**

Of course, while the PRESS statistic is statistically efficient, it is not always computationally efficient, especially with modeling techniques other than linear regression. The calculation of the adjusted R-squared is not computationally demanding, and it also identified the better model in our experiment. One could ask, why not just use adjusted R-squared?

One reason is that the PRESS statistic is attempting to directly model future predictive performance. Our experiment suggests that it shows clearer distinctions between the models than the adjusted R-squared. But how well does the PRESS statistic estimate the "true" generalization error of a model?

To test this, we will hold the ground truth (that is, the data generation process) and the training set fixed. We will then repeat generating test sets, measuring the RMSE of the models' predictions against the test sets, and compare them to the training RMSE and root mean PRESS. This is akin to a situation where the training data and model fitting are accomplished facts, and we are hypothesizing possible future applications of the model.

Specifically, we used `buildExample1()` to generate one hundred test sets of size 100 (10% the size of the training set) and one hundred test sets of size 1000 (the size of the training set). We then evaluated both the clean model and the noisy model against all the test sets and compared the distributions of the hold-out root mean squared error (RMSE) against the in-sample RMSE and PRESS statistics. The results are shown below.

For each plot, the solid black vertical line is the mean of the distribution of test RMSE; we can assume that the observed mean is a good approximation to the "true" expected RMSE of the model. Not surprisingly, a smaller test set size leads to more variance in the observed RMSE, but after 100 trials, both the n=100 and n=1000 hold-out sets lead to similar estimates of the expected RMSE (just under 0.7 for the noisy model, just under 0.6 for the clean model).

The dashed red lines give the root mean PRESS of both models on the training data, and the dashed blue lines give each model's training set RMSE. For both the noisy and clean models, the root mean PRESS gives a better estimate of the models' expected RMSE than the training set RMSE -- dramatically so with the noisy, overfit model.

Note, however, that in this experiment, a single hold-out set reliably preferred the clean model to the noisy one (that is, the hold-out SSE was always greater for the noisy model than the clean one when both models were applied to the same test data). The moral of the story: use hold-out data (both calibration and test sets) when that is feasible. When data is at a premium, then try more statistically efficient metrics like the PRESS statistic to "stretch" the data that you have.

`5` are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not an R vector. In fact “factor” is not a first-class citizen in R, which can lead to surprising bugs. For example, consider the following R code.

```
levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'),levels=levels)
print(f)
## [1] c    a    a    <NA> b    a
## Levels: a b c
print(class(f))
## [1] "factor"
```

This example encodes a series of 6 observations into a known set of factor levels (`'a'`, `'b'`, and `'c'`). As is the case with real data, some of the positions might be missing/invalid values such as `NA`. One of the strengths of R is that we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level `'a'` was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.

```
fRevised <- ifelse(is.na(f),'a',f)
print(fRevised)
## [1] "3" "1" "1" "a" "2" "1"
print(class(fRevised))
## [1] "character"
```

Notice the new column `fRevised` is an absolute mess (and not even of class/type factor). This sort of fix would have worked if `f` had been a vector of characters or even a vector of integers, but for factors we get gibberish.
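For contrast, here is a minimal sketch of two ways the same replacement could be done while keeping the result a factor. The variable names `fFixed` and `fRebuilt` are ours, and both options assume the default value `'a'` is already one of the declared levels.

```r
levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'), levels = levels)

# option 1: assign into the factor itself; this works because
# 'a' is already a declared level of f
fFixed <- f
fFixed[is.na(fFixed)] <- 'a'

# option 2: do the replacement on strings, then rebuild the factor
# against the same level set
fRebuilt <- factor(ifelse(is.na(f), 'a', as.character(f)), levels = levels)

print(fFixed)
print(class(fFixed))
```

Both options require knowing that `ifelse()` is the problem in the first place, which is the point of the example above.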

We are going to work through some more examples of this problem.

R is designed to support statistical computation. In R, analyses and calculations are often centered on a type called a data frame. A data frame is very much like a SQL table in that it is a sequence of rows (each row representing an instance of data) organized against a column schema. This is also very much like a spreadsheet where we have good column names and column types. (One caveat: in R, vectors that are all `NA` typically lose their type information and become type `"logical"`.) An example of an R data frame is given below.

```
d <- data.frame(x=c(1,-0.4),y=c('a','b'))
print(d)
##      x y
## 1  1.0 a
## 2 -0.4 b
```

An R data frame is actually implemented as a list of columns, each column being treated as a vector. This encourages a very powerful programming style where we specify transformations as operations over columns. An example of working over column vectors is given below:

```
d <- data.frame(x=c(1,-0.4),y=c('a','b'))
d$xSquared <- d$x^2
print(d)
##      x y xSquared
## 1  1.0 a     1.00
## 2 -0.4 b     0.16
```

Notice that we did not need to specify any for-loop, iteration, or range over the rows. We work over column vectors to great advantage in clarity and speed. This is fairly clever as traditional databases tend to be row-oriented (define operations as traversing rows) and spreadsheets tend to be cell-oriented (define operations over ranges of cells). We can confirm R’s implementation of data frames is in fact a list of column vectors (not merely some other structure behaving as such) through the unclass-trick:

```
print(class(unclass(d)))
## [1] "list"
print(unclass(d))
## $x
## [1]  1.0 -0.4
##
## $y
## [1] a b
## Levels: a b
##
## $xSquared
## [1] 1.00 0.16
##
## attr(,"row.names")
## [1] 1 2
```

The data frame `d` is implemented as a class/type annotation over a list of columns (`x`, `y`, and `xSquared`). Let’s take a closer look at the class or type of the column `y`.

```
print(class(d$y))
## [1] "factor"
```

The class of `y` is `"factor"`. We gave R a sequence of strings and it promoted or coerced them into a sequence of factor levels. For statistical work this makes a lot of sense; we are more likely to want to work over factors (which we will define soon) than over strings. And at first glance R seems to like factors more than strings. For example `summary()` works better with factors than with strings:

```
print(summary(d))
## x y xSquared
## Min. :-0.40 a:1 Min. :0.16
## 1st Qu.:-0.05 b:1 1st Qu.:0.37
## Median : 0.30 Median :0.58
## Mean : 0.30 Mean :0.58
## 3rd Qu.: 0.65 3rd Qu.:0.79
## Max. : 1.00 Max. :1.00
print(summary(data.frame(x=c(1,-0.4),y=c('a','b'),
                         stringsAsFactors=FALSE)))
## x y
## Min. :-0.40 Length:2
## 1st Qu.:-0.05 Class :character
## Median : 0.30 Mode :character
## Mean : 0.30
## 3rd Qu.: 0.65
## Max. : 1.00
```

Notice how if `y` is a factor column we get nice counts of how often each factor level occurred, but if `y` is a character type (forced by setting `stringsAsFactors=FALSE` to turn off conversion) we don’t get a usable summary. So as a default behavior R promotes strings/characters to factors and has better summaries for factors than for strings/characters. This would make you think that factors might be a preferred/safe data type in R. This turns out to not completely be the case. A careful R programmer must really decide when and where they want to allow factors in their code.

What is a factor? In principle a factor is a value where the value is known to be taken from a known finite set of possible values called levels. This is similar to an enumerated type. Typically we think of factor levels or categories taking values from a fixed set of strings. Factors are very useful in encoding categorical responses or data. For example we can represent which continent a country is in with the factor levels `"Asia"`, `"Africa"`, `"North America"`, `"South America"`, `"Antarctica"`, `"Europe"`, and `"Australia"`. When the data has been encoded as a factor (perhaps during ETL) you not only have the continents indicated, you also know the complete set of continents and have a guarantee of no ad-hoc alternate responses (such as “SA” for South America). Additional machine-readable knowledge and constraints make downstream code much more compact, powerful, and safe.

You can think of a factor vector as a sequence of strings with an additional annotation as to what universe of strings the strings are taken from. The R implementation of factor actually implements factor as a sequence of integers where each integer represents the index (starting from 1) of the string in the sequence of possible levels.

```
print(class(unclass(d$y)))
## [1] "integer"
print(unclass(d$y))
## [1] 1 2
## attr(,"levels")
## [1] "a" "b"
```

This implementation difference *should* not matter, except R exposes implementation details (more on this later). Exposing implementation details is generally considered to be a bad thing as we don’t know if code that uses factors is using the declared properties and interfaces or is directly manipulating the implementation.
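The integer-codes-plus-level-table representation described above can be seen by rebuilding the strings by hand. This is a small illustrative sketch; the example factor and the names `codes`, `labels`, and `rebuilt` are ours.

```r
f <- factor(c('b','a','b','c'), levels = c('a','b','c'))

codes <- as.integer(f)    # 1-based index of each value in the level table
labels <- levels(f)       # the level table itself
rebuilt <- labels[codes]  # look each code up in the level table

print(codes)
print(rebuilt)
```

Indexing the level table by the codes recovers exactly what `as.character(f)` returns, which is the abstraction a user would normally work with.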

Down-stream users or programmers are supposed to mostly work over the supplied abstraction, not over the implementation. Users should not routinely have direct access to the implementation details and certainly should not be able to directly manipulate the underlying implementation. In many cases the user must be *aware* of some of the limitations of the implementation, but this is considered a necessary *undesirable* consequence of a leaky abstraction. An example of a necessarily leaky abstraction: abstracting base-2 floating point arithmetic as if it were arithmetic over the real numbers. For decent speed you need your numerics to be based on machine floating point (or some multi-precision extension of machine floating point), but you want to think of numerics abstractly as real numbers. With this leaky compromise the user doesn’t have to have the entire IEEE Standard for Floating-Point Arithmetic (IEEE 754) open on their desk at all times. But the user should know the exceptions, such as: `(3-2.9)<=0.1` tends to evaluate to `FALSE` (due to the implementation, and in violation of the claimed abstraction), and know the necessary defensive coding practices (such as being familiar with What Every Computer Scientist Should Know About Floating-Point Arithmetic).
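One common defensive practice for the floating point example above is to compare with a tolerance instead of exactly. The sketch below uses base R's `all.equal()`; the variable names are ours, and the default tolerance is an assumption that may not suit every application.

```r
x <- 3 - 2.9

# exact comparison: the stored value of x is slightly above 0.1
naive <- (x <= 0.1)

# tolerant comparison: treat values within all.equal()'s default
# tolerance as equal
closeEnough <- isTRUE(all.equal(x, 0.1))
lessOrClose <- (x < 0.1) || closeEnough

print(c(naive = naive, closeEnough = closeEnough, lessOrClose = lessOrClose))
```

The right tolerance depends on the scale of the numbers involved, so this is a habit to adapt rather than a universal fix.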

Now: factors *can* be efficiently implemented perfectly, so they *should* be implemented perfectly. At first glance it appears that they have been implemented correctly in R and the user is protected from the irrelevant implementation details. For example if we try and manipulate the underlying integer array representing the factor levels we get caught.

```
d$y[1] <- 2
## Warning message:
## In `[<-.factor`(`*tmp*`, 1, value = c(NA, 1L)) :
##   invalid factor level, NA generated
```

This is good: when we tried to monkey with the implementation we got caught. This is how the R implementors try to ensure there is not a lot of user code directly monkeying with the current representation of factors (leaving open the possibility of future bug-fixes and implementation improvements). Likely this safety was achieved by overloading/patching the `[<-` operator. However, as with most fix-to-finish designs, a few code paths are missed and there are places the user is exposed to the implementation of factors when they expected to be working over the abstraction. Here are a few examples:

```
f <- factor(c('a','b','a')) # make a factor example
print(class(f))
## [1] "factor"
print(f)
## [1] a b a
## Levels: a b
# c() operator collapses to implementation
print(class(c(f,f)))
## [1] "integer"
print(c(f,f))
## [1] 1 2 1 1 2 1
# ifelse(,,) operator collapses to implementation
print(ifelse(rep(TRUE,length(f)),f,f))
## [1] 1 2 1
# factors are not actually vectors
# this IS as claimed in help(vector)
print(is.vector(f))
## [1] FALSE
# factor implementations are not vectors either
# despite being "integer"
print(class(unclass(f)))
## [1] "integer"
print(is.vector(unclass(f)))
## [1] FALSE
# unlist of a factor is not a vector
# despite help(unlist):
# "Given a list structure x, unlist simplifies it to produce a vector:"
print(is.vector(unlist(f)))
## [1] FALSE
print(unlist(f))
## [1] a b a
## Levels: a b
print(as.vector(f))
## [1] "a" "b" "a"
```

What we have done is found instances where a `factor` column does not behave as we would expect a character vector to behave. These defects in behavior are why I claim factors are not first class in R. They don’t get the full-service expected behavior from a number of basic R operations (such as passing through `c()` or `ifelse(,,)` without losing their class label). It is hard to say a factor is treated as a first-class citizen that correctly “supports all the operations generally available to other entities” (quote taken from Wikipedia: First-class_citizen). R doesn’t seem to trust leaving factor data types in factor data types (which should give one pause about doing the same).
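A defensive way to combine factors without depending on what `c()` does with them in the R version at hand is to go through character vectors and rebuild the factor explicitly. This is a sketch of that workaround, not a recommendation from the original text; the example factors are made up.

```r
f <- factor(c('a','b','a'))
g <- factor(c('c','a'))

# concatenate as strings, then re-declare the level set explicitly
combined <- factor(c(as.character(f), as.character(g)),
                   levels = union(levels(f), levels(g)))

print(combined)
print(class(combined))
```

The explicit `levels=` argument also documents which universe of values the combined column is supposed to come from.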

The reason these differences are not mere curiosities is: in any code where we are expecting one behavior and we experience another, we have a bug. So these conversions or abstraction leaks cause system brittleness, which can lead to verbose, hard-to-test, overly defensive code (see Postel’s law: not sure who to be angry with for some of the downsides of being required to code defensively).

September 9, 1947 Grace Murray Hopper “First actual case of bug being found.”

(image: Computer History Museum)

Why should we expect a factor to behave like a character vector? Why not expect it to behave like an integer vector? The reason is: we supplied a character vector and R’s default behavior in `data.frame()` was to convert it to a factor. R’s behavior only makes sense under the assumption there is some commonality of behavior between factors and character vectors. Otherwise R has made a surprising substitution and violated the principle of least astonishment. To press the point further: from an object oriented view (which is a common way to talk about the separation of concerns of interface and implementation) a valid substitution should at the very least follow some form of the Liskov substitution principle of a factor being a valid sub-type of character vector. But this is *not* possible between mutable versions of factor and character vector, so the substitution should not have been offered.

What we are trying to point out is: design is not always just a matter of taste. With enough design principles in mind (such as least astonishment, Liskov substitution, and a few others) you can actually say some design decisions are wrong (and maybe even some day some other design decisions are right). There are very few general principles of software system design, so you really don’t want to ignore the few you have.

One possible criticism of my examples is: “You have done everything wrong, *everybody* knows to set `stringsAsFactors=FALSE`.” I call this the “Alice’s Adventures in Wonderland” defense. In my opinion the user is a guest and it is fair for the guest to initially assume default settings are generally the correct or desirable settings. The relevant “Alice’s Adventures in Wonderland” quote being:

At this moment the King, who had been for some time busily writing in his note-book, cackled out ‘Silence!’ and read out from his book, ‘Rule Forty-two. All persons more than a mile high to leave the court.’

Everybody looked at Alice.

‘I’m not a mile high,’ said Alice.

‘You are,’ said the King.

‘Nearly two miles high,’ added the Queen.

‘Well, I shan’t go, at any rate,’ said Alice: ‘besides, that’s not a regular rule: you invented it just now.’

‘It’s the oldest rule in the book,’ said the King.

‘Then it ought to be Number One,’ said Alice.

(text: Project Gutenberg)

(image from Wikipedia)

Another obvious criticism is: “You have worked hard to write bugs.” That is not the case; I have worked hard to make consequences direct and obvious. Where I first noticed my bug was in code deep in an actual project, which is similar to the following example. First let’s build a synthetic data set where `y~f(x)` where `x` is a factor or categorical variable.

```
# build a synthetic data set
set.seed(36236)
n <- 50
d <- data.frame(x=sample(c('a','b','c','d','e'),n,replace=TRUE))
d$train <- FALSE
d$train[sample(1:n,n/2)] <- TRUE
print(summary(d$x))
## a b c d e
## 4 7 12 14 13
# build noisy = f(x), with f('a')==f('b')
vals <- rnorm(length(levels(d$x)))
vals[2] <- vals[1]
names(vals) <- levels(d$x)
d$y <- rnorm(n) + vals[d$x]
print(vals)
## a b c d e
## 1.3394631 1.3394631 0.3536642 1.6990172 -0.5423986
# build a model
model1 <- lm(y~0+x,data=subset(d,train))
d$pred1 <- predict(model1,newdata=d)
print(summary(model1))
##
## Call:
## lm(formula = y ~ 0 + x, data = subset(d, train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.53459 -0.43303 -0.07942 0.49278 2.20614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xa 2.9830 0.7470 3.993 0.000715 ***
## xb 2.0506 0.5282 3.882 0.000926 ***
## xc 1.2824 0.3993 3.212 0.004378 **
## xd 2.3644 0.3993 5.922 8.6e-06 ***
## xe -1.1541 0.4724 -2.443 0.023974 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.056 on 20 degrees of freedom
## Multiple R-squared: 0.8046, Adjusted R-squared: 0.7558
## F-statistic: 16.47 on 5 and 20 DF, p-value: 1.714e-06
```

Our first model is good. But during the analysis phase we might come across some domain knowledge, such as that `'a'` and `'b'` are actually equivalent codes. We could reduce fitting variance by incorporating this knowledge in our feature engineering. In this example it won’t be much of an improvement: we are not merging much and not eliminating many degrees of freedom. In a real production example this can be a very important step where you may have a domain supplied roll-up dictionary that merges a large number of levels. However, what happens is our new merged column gets quietly converted to a column of integers which is then treated as a numeric column in the following modeling step. So the merge is in fact disastrous: we lose the categorical structure of the variable. We can, of course, re-institute the structure by calling `as.factor()` if we know about the problem (which we might not), but even then we have lost the string labels for new integer level labels (making debugging even harder). Let’s see the failure we are anticipating; notice how the training adjusted R-squared disastrously drops from 0.7558 to 0.1417 after we attempt our “improvement.”

```
# try (and fail) to build an improved model
# using domain knowledge f('a')==f('b')
d$xMerged <- ifelse(d$x=='b',factor('a',levels=levels(d$x)),d$x)
print(summary(as.factor(d$xMerged)))
## 1 3 4 5
## 11 12 14 13
# disaster! xMerged is now class integer
# which is treated as numeric in lm, losing a lot of information
model2 <- lm(y~0+xMerged,data=subset(d,train))
print(summary(model2))
##
## Call:
## lm(formula = y ~ 0 + xMerged, data = subset(d, train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3193 -0.5818 0.8281 1.6237 3.5451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xMerged 0.2564 0.1132 2.264 0.0329 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.98 on 24 degrees of freedom
## Multiple R-squared: 0.176, Adjusted R-squared: 0.1417
## F-statistic: 5.128 on 1 and 24 DF, p-value: 0.03286
```

There is an obvious method to merge the levels correctly: convert back to character (which we show below). The issue is: if you don’t know about the conversion to integer happening, you may not know to look for it and correct it.

```
# correct f('a')==f('b') merge
d$xMerged <- ifelse(d$x=='b','a',as.character(d$x))
model3 <- lm(y~0+xMerged,data=subset(d,train))
d$pred3 <- predict(model3,newdata=d)
print(summary(model3))
##
## Call:
## lm(formula = y ~ 0 + xMerged, data = subset(d, train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.53459 -0.51084 -0.05408 0.71385 2.20614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xMergeda 2.3614 0.4317 5.470 1.99e-05 ***
## xMergedc 1.2824 0.3996 3.209 0.00422 **
## xMergedd 2.3644 0.3996 5.916 7.15e-06 ***
## xMergede -1.1541 0.4729 -2.441 0.02361 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.057 on 21 degrees of freedom
## Multiple R-squared: 0.7945, Adjusted R-squared: 0.7553
## F-statistic: 20.3 on 4 and 21 DF, p-value: 5.693e-07
dTest <- subset(d,!train)
nTest <- dim(dTest)[[1]]
# Root Mean Square Error of original model on test data
print(sqrt(sum((dTest$y-dTest$pred1)^2)/nTest))
## [1] 1.330894
# Root Mean Square Error of f('a')==f('b') model on test data
print(sqrt(sum((dTest$y-dTest$pred3)^2)/nTest))
## [1] 1.297682
```
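Another way to perform the same kind of merge, shown here only as an alternative sketch, is to rename levels in place with `levels<-`: assigning the same name to two levels merges them, and the column never stops being a factor. The small vector `x` below is made up for illustration and is not the article's data.

```r
x <- factor(c('a','b','c','b','e','d'))

# rename level 'b' to 'a'; duplicate level names are merged
xMerged <- x
levels(xMerged)[levels(xMerged) == 'b'] <- 'a'

print(xMerged)
print(levels(xMerged))
```

This keeps the string labels, so a later `lm()` call still sees a categorical variable with levels `a`, `c`, `d`, and `e`.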

Factors are definitely useful, and I am glad R has them. I just wish they had fewer odd behaviors. My rule of thumb is to use them as late as possible: set `stringsAsFactors=FALSE`, and if you need factors in some place, convert from character near that place.
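The rule of thumb above might look like this in practice. It is a minimal sketch: the toy data, the merge of `'b'` into `'a'`, and the `knownLevels` vector (assumed to come from domain knowledge) are all hypothetical.

```r
# keep the column as plain strings while cleaning
d <- data.frame(x = c('a','b','a','c'),
                y = c(1, 2, 1.5, 3),
                stringsAsFactors = FALSE)
d$x[d$x == 'b'] <- 'a'   # ordinary string edit, no factor surprises

# convert to a factor only at the point of use, with explicit levels
knownLevels <- c('a','c')
d$xf <- factor(d$x, levels = knownLevels)

model <- lm(y ~ 0 + xf, data = d)
print(coef(model))
```

Declaring the levels at conversion time also means any unexpected string shows up as `NA` right where the factor is built, rather than somewhere downstream.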

Please see the following articles for more ideas on working with categorical variables and preparing data for analysis.

The story is an inside joke referring to something really only funny to one of the founders. But a joke that amuses the teller is always enjoyed by at least one person. Win-Vector LLC’s John Mount had the honor of co-authoring a 1997 paper titled “The Polytope of Win Vectors.” The paper title is obviously a set of mathematical terms in an odd combination. However the telegraphic grammar is coincidentally similar to deliberately ungrammatical gamer slang such as “full of win” and “so much win.”

If we treat “win” as a concrete noun (say something you can put in a sack) and “vector” in its *non-mathematical* sense (as an entity of infectious transmission) we have “Win-Vector LLC is an infectious delivery of victory.” I.e.: we deliver success to our clients. Of course, we have now attempted to explain a weak joke. It is not as grand as “winged victory,” but it does encode a positive company value: Win-Vector LLC delivers successful data science projects and training to clients.

Winged Victory: from Wikipedia

Let’s take this as an opportunity to describe what a win vector is.

We take the phrase “win vector” from a technical article titled “The Polytope of Win Vectors” by J.E. Bartels, J. Mount, and D.J.A. Welsh (Annals of Combinatorics I, 1997, pp. 1-15). The topic of this paper concerns the possible outcomes of game tournaments (or other things that can be expressed as tournaments). For example: we could have four teams (A, B, C, and D) scheduled to play each other a number of times, as indicated in the diagram below.

This graph is just saying that in the tournament: A will play B 5 times, B will not play C, and so on. We assume each game can end in a win for one team (giving them 1 point), or a loss or tie (giving them zero points). We can record a summary of the tournament outcomes as a vector (vector now back to its mathematical sense) that just records how often each team won. For example the vector [10,1,1,0] is a win vector compatible with the above diagram (it encodes A winning all matches and D losing all matches). The vector [0,0,0,5] is not a valid win vector for the diagram as D did not play 5 games (so cannot have 5 wins). (The Win-Vector LLC logo is itself a stylized single game tournament diagram, with the directed arrow representing both victory and reminiscent of vectors in the mathematical sense.)
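To make the notion of a valid win vector concrete, here is a brute-force sketch. The schedule matrix is hypothetical: only “A plays B 5 times” and “B does not play C” come from the text, and the remaining counts were chosen so that [10,1,1,0] is achievable. The check assumes the scoring rule described above (each game gives one point to at most one of its two teams), under which a vector is achievable exactly when no subset of teams claims more wins than the number of games involving at least one team in the subset. The function name `isWinVector()` is ours.

```r
teams <- c("A", "B", "C", "D")
# HYPOTHETICAL schedule: games[i, j] = number of games between i and j
games <- matrix(0, 4, 4, dimnames = list(teams, teams))
games["A", "B"] <- 5; games["A", "C"] <- 3; games["A", "D"] <- 2
games["B", "D"] <- 1; games["C", "D"] <- 1   # B and C never meet
games <- games + t(games)                    # make the matrix symmetric

isWinVector <- function(w, games) {
  n <- length(w)
  if (any(w < 0)) return(FALSE)
  totalGames <- sum(games) / 2
  for (mask in 1:(2^n - 1)) {
    # decode the subset S of teams from the bits of mask
    inS <- (mask %/% 2^(0:(n - 1))) %% 2 == 1
    # games played entirely between teams outside S
    outside <- sum(games[!inS, !inS, drop = FALSE]) / 2
    # games involving at least one team in S
    touching <- totalGames - outside
    if (sum(w[inS]) > touching) return(FALSE)
  }
  TRUE
}

print(isWinVector(c(10, 1, 1, 0), games))
print(isWinVector(c(0, 0, 0, 5), games))
```

Enumerating all subsets is only practical for a handful of teams, which is enough to illustrate the definition.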

The idea is that a win vector might be treated as a sufficient statistic for the tournament. Or more accurately the win vector may be all that is known about a previously run tournament. Such censored observations may be all that is possible in field biology where wins represent territory or offspring. The question is then: given knowledge of the tournament structure (the graph) and the summary of outcomes (the win vector) is there evidence one team is dominant, or are the effects random? So we have well-formed statistical questions about effect strength and significance.

The question of significance is: when we introduce a notion of effect strength how likely are we to see an effect of that size assuming identical players? For example if we make our notion of effect strength the maximum ratio of wins to plays seen in the win vector should we consider this evidence of a strong player, or is it to be expected by random fluctuation? We need to estimate how strong a conditioning effect our tournament constraints impose on unobserved outcomes (to determine if irregularities in distribution are from player strengths or tournament mis-design).

Relating distributions of unobserved details to observed totals (or margins) is one of the most fundamental problems in statistics. We have written on it many times (two examples: Google ad market reporting and checking scientific claims). In all cases you would be better off with direct detailed observations (i.e. without the censorship); but often you have to work with the data you have instead of the experiment you would design.

The math is a little easier to explain for a related problem: working out the number of ways to fill in a matrix with non-negative integers to meet given row and column totals. I’ll move on to discuss this contingency table problem a bit.

The statistical ideas largely come from “Testing for Independence in a Two-Way Table: New Interpretations of the Chi-Square Statistic”, Persi Diaconis and Bradley Efron, Ann. Statist., Vol. 13, No. 3, 1985 pp. 845-874. A contingency table is a matrix of non-negative integers, and the statistical problem is relating known row and column totals to possible fill-ins. In this paper the authors criticize some of the standard significance tests (chi-square, Fisher’s exact test) and propose a parameterized family of tests that at the extreme end considers a null-model of uniform fill-ins (each possible fill in equally likely). Obviously a uniform model is very different than the more standard distributions which tend to have cell counts more highly concentrated around their means. But the idea is: this proposed test takes more of the structure of the margin totals into account (or equivalently assumes away less of the margin mediated cell dependencies) and has its own merits.

However, we are actually describing the work of mathematicians and theoretical computer scientists. In that style you only speak with “applied types” (such as theoretical statisticians) to justify working on a snappy math problem. In this case: counting the number of ways to fill in a contingency table or the number of detailed results compatible with a given win vector (the link between counting and generation having been strongly established in “Randomised Algorithms for Counting and Generating Combinatorial Structures”, A.J. Sinclair, Ph.D. thesis, University of Edinburgh (1988) and related works).

The contingency table problem is partially solved in:

- “Sampling contingency tables” Martin Dyer, Ravi Kannan, John Mount, Random Structures and Algorithms Vol. 10, no. 4, July 1997 pp. 487-506.
- “Fast Unimodular Counting” John Mount, Combinatorics Probability and Computing, Vol. 9, No. 3, May 2000, pp 277-285.

The second paper (strengthening some results from my Ph.D. thesis) lets you calculate that the number of ways to fill in the following four by four contingency table with non-negative integers to meet the shown row and column totals is exactly `350854066054593772938684218633979710637454260` (about `3.508541e+44`).

```
x(0,0)  x(0,1)  x(0,2)  x(0,3)  154179
x(1,0)  x(1,1)  x(1,2)  x(1,3)  255424
x(2,0)  x(2,1)  x(2,2)  x(2,3)  277000
x(3,0)  x(3,1)  x(3,2)  x(3,3)  160179
191780  288348  165221  201433
```

The point being: the table could arise as the summary from a data set with `846782` (`=191780 + 288348 + 165221 + 201433`) items; to characterize probabilities over such tables you need good methods to sample over the astronomical family of potential alternate fill-ins (and this is where you apply the link between counting and sampling for self-reducible problem families). We have example code, notes, improved runtime proof, and results here.
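As a rough illustration of the counting problem, the R sketch below checks that the margins of the table agree and counts fill-ins for tiny tables by plain recursion. The function `countTables()` is a hypothetical helper written for this note; it is exhaustive enumeration, not the method of either paper, and it is hopeless on margins as large as the ones shown.

```r
# Margins of the four by four table from the text.
rowTotals <- c(154179, 255424, 277000, 160179)
colTotals <- c(191780, 288348, 165221, 201433)
sum(rowTotals) == sum(colTotals)  # both margins must describe the same total

# Hypothetical brute-force counter: number of non-negative integer matrices
# with the given row and column totals. Only usable on tiny margins.
countTables <- function(rowTotals, colTotals) {
  stopifnot(sum(rowTotals) == sum(colTotals))
  # With one row left, that row is forced to equal the remaining column totals.
  if (length(rowTotals) == 1) return(1)
  # All ways to split 'remaining' across cells bounded by 'caps'.
  fillRow <- function(remaining, caps) {
    if (length(caps) == 1) {
      return(if (remaining <= caps) list(remaining) else list())
    }
    out <- list()
    for (v in 0:min(remaining, caps[1])) {
      for (rest in fillRow(remaining - v, caps[-1])) {
        out[[length(out) + 1]] <- c(v, rest)
      }
    }
    out
  }
  total <- 0
  for (row in fillRow(rowTotals[1], colTotals)) {
    total <- total + countTables(rowTotals[-1], colTotals - row)
  }
  total
}

countTables(c(2, 2), c(2, 2))        # 3 fill-ins
countTables(c(1, 1, 1), c(1, 1, 1))  # 6 fill-ins (the permutation matrices)
```

Even for modest margins the count explodes, which is why sampling methods are needed instead of enumeration.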

“The Polytope of Win Vectors” introduced additional ideas from integral polymatroids to more strongly relate volume to number of integer vectors (and gets more complete theoretical results for its problem).

All the “big hammer” math is trying to extend some of the beauty of G.H. Hardy and J.E. Littlewood, “Some problems of Diophantine approximation: the lattice points of a right-angled triangle,” Hamburg. Math.Abh., 1 (1921) 212–249 to more general settings.

Or more succinctly: we just like the word “win.”

]]>**What is the Gauss-Markov theorem?**

From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:

A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.

This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.

Suppose you are building a model of the form:

```
y(i) = B . x(i) + e(i)
```

where `B` is a vector (to be inferred), `i` is an index that runs over the available data (say `1` through `n`), `x(i)` is a per-example vector of features, and `y(i)` is the scalar quantity to be modeled. Only `x(i)` and `y(i)` are observed. The `e(i)` term is the un-modeled component of `y(i)` and you typically hope that the `e(i)` can be thought of as unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak/strong assumptions you put on the `e(i)` (and other quantities) depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of `B` under weak assumptions.
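For readers who prefer code to notation, here is a minimal R sketch of this setup. The particular `B`, the sample size, and the normal noise are arbitrary choices made for illustration (the theorem itself does not need normality); the point is only to show `B` being inferred from the observed `x(i)` and `y(i)` by least squares.

```r
set.seed(2014)
n <- 100
B <- c(3, 2)                              # the "true" vector (made up for this sketch)
x <- cbind(x0 = 1, x1 = runif(n, 0, 10))  # per-example feature vectors, one per row
e <- rnorm(n)                             # unobserved error term
y <- as.numeric(x %*% B) + e              # observed outcome

# Least squares estimate of B from the normal equations
Bhat <- solve(t(x) %*% x, t(x) %*% y)

# The same estimate from lm() (no extra intercept: x0 already plays that role)
fit <- lm(y ~ 0 + x)
cbind(normalEquations = as.numeric(Bhat), lm = coef(fit))
```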

**How to interpret the theorem**

The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the `e(i)` and without distributional assumptions about the `x(i)`. However, if you are using Bayesian methods or generative models for predictions you *may want* to use additional stronger conditions (perhaps even normality of errors and *even* distributional assumptions on the `x`s).

We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.

**Wikipedia’s stated pre-conditions of the Gauss-Markov theorem**

To apply the Gauss-Markov theorem the Wikipedia says you must assume your data has the following properties:

```
E[e(i)] = 0 (lack of structural errors, needed to avoid bias)
V[e(i)] = c (equal variance, one form of homoscedasticity)
cov[e(i),e(j)] = 0 for i!=j (non-correlation of errors)
```

It is always important to know precisely what probability model the expectation (`E[]`), variance (`V[]`), and covariance (`cov[]`) operators are working over in the Wikipedia conditions. This is usually left implicit, but it is critical to know exactly what is being asserted. When reading/listening about statistical or probabilistic work you should *always* insist on a concrete description of the probability model underlying all the notation (the `E[]`s and `V[]`s). A lot of confusion and subtle tricks get hidden by not sharing an explicit description of the probability model.

**Probability models**

Two plausible probability models are:

- Frequentist: unobserved parameters are held constant and all probabilities are over re-draws of the data. At first guess you would think this is the correct model for this problem, as the content of the Gauss-Markov theorem is about how estimates drawn from a larger population perform in expectation.
- x-Generative: This is not standard and not immediately implied by the notation (and represents a fairly strong set of assumptions). In this model all of the observed `x`s are held constant and unobserved `e`s and `y`s are regenerated with respect to the `x`s. This is similar to a Bayesian generative model, except in the usual Bayesian formulation all observables (both `x`s and `y`s) are held fixed. We only introduce this model as it seems to be the simplest one which makes for a workable interpretation of the Wikipedia statements.

The issue is: the conditions as stated are not strong enough to ensure actual homoscedasticity (or even non-structure of errors/bias) needed to apply the Gauss-Markov theorem under a strict frequentist model. So we must go venue-shopping and find what model is likely intended. An easy way to do this is to design synthetic data that is considered well-behaved under one model and not under the other.

**A source of examples**

Let’s use a deliberately naive empirical view of data. Suppose the entire possible universe of data is `X(i),Y(i),Z(i) i=1...k` for some `k` (`k` and `X(i),Y(i),Z(i)` all finite real vectors). Our chosen explicit probability model for generating the observed data `x(i),y(i)` and unobserved `e(i)` is the following. We pick a length-`n` sequence of integers `s(1),...,s(n)` where each `s(i)` is picked uniformly and independently from `1...k` and add a bit of unique noise. Our sample data is then (only `x(i),y(i)` are observed, `e(i)` is an unobserved notional quantity):

```
(x(i),y(i),e(i)) = (X(s(i)),Y(s(i))+t(i),Z(s(i))+t(i)) for i=1...n,
where t(i) is an independent normal variable with mean 0 and variance 1
```

This is similar to a standard statistical model (empirical re-sampling from a fixed set, and designed to be similar to a sampling distribution). `Z(i)` represents an idealized error term and `e(i)` represents a per-sample unobserved realization of `Z(i)`. It is a nice model because the `e(i)` are independently identically distributed (and so are the `x(i)` and `y(i)`, though obviously there can be dependencies between the `x`s, `y`s, and `e`s). This model can be thought of as “too nice” as it isn’t powerful enough to capture the full power of the Gauss-Markov theorem (it can’t express situations that are not independent and identically distributed). However it can concretely embody situations that do meet the Gauss-Markov conditions and be used to work clarifying examples.

**Good examples under the frequentist probability model**

Let’s see what conditions on `X(i),Y(i),Z(i) i=1...k` are needed to meet the Gauss-Markov pre-conditions assuming a frequentist probability model.

- The first one is easy: `E[e(i)] = 0` if and only if `sum_{j=1...k} Z(j) = 0`.
- When we have `E[e(i)]=0` the second condition (homoscedasticity as stated) simplifies to `V[e(i)] = E[(e(i) - E[e(i)])^2] = E[e(i)^2] = E[Z^2] + 1` which is independent of `i`.
- When we have `E[e(i)]=0` the third condition simplifies to `E[e(i) e(j)] = 0` for `i!=j`. This then follows immediately from our overly strong condition of the index selections `s(i)` being independent (giving us `E[e(i) e(j)] = E[e(i)] E[e(j)] = 0 for i!=j`).

So all we need is: `sum_{j=1...k} Z(j) = 0` and then the other conditions hold. This seems too easy, and is evidence that the frequentist probability model is not the model intended by Wikipedia. We will confirm this with a specific counter example later.
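These three checks are easy to confirm empirically. The R sketch below uses a small made-up universe `Z` that sums to zero (the values and the number of draws are arbitrary choices for illustration) and estimates the mean, variance, and cross-draw covariance of `e` under the resampling model.

```r
set.seed(5253)
Z <- c(-3, -1, 0, 1, 3)   # hypothetical idealized errors; sum(Z) is 0
k <- length(Z)
nDraws <- 200000

# Two independent draws e(i), e(j): resample an index, then add unit normal noise.
s1 <- sample.int(k, nDraws, replace = TRUE)
s2 <- sample.int(k, nDraws, replace = TRUE)
e1 <- Z[s1] + rnorm(nDraws)
e2 <- Z[s2] + rnorm(nDraws)

c(meanE = mean(e1),                 # should be near 0
  varE = var(e1),                   # should be near E[Z^2] + 1
  expectedVar = mean(Z^2) + 1,
  covE = cov(e1, e2))               # should be near 0
```

Nothing in this check looks at how `Z` relates to `x`, which is the loophole the later counter example exploits.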

**Good examples under the x-generative probability model**

Under the x-generative probability model (and this is *not* standard terminology) the Wikipedia conditions are more properly written conditionally:

```
E[e(i)|x(i)] = 0
V[e(i)|x(i)] = c
cov[e(i),e(j)|x(i),x(j)] = 0 for i!=j
```

Or more precisely: if the conditions had been written in their conditional form we wouldn’t have to contrive a phrase like “x-generative model” to ensure the correct interpretation. These conditions are strict. Checking or ensuring these properties is a problem when `x` is continuous and we have a joint description of how `x,y,e` are generated (instead of a hierarchical one). These conditions as stated are strong enough to support the Gauss-Markov theorem, but probably in fact stronger than the minimum or canonical conditions. But let’s see how they work.

To meet these conditions our `Z(i)` must pretty much be free of dependence on `x(i)` (even one snuck through the index `i`). This is somewhat unsatisfying as our overly simple modeling framework (producing `x,y,e` from `X,Y,Z`) combined with these strong conditions doesn’t really model much more than identical independence (so does not capture the full breadth of the Gauss-Markov theorem). The frequentist conditions are too lenient to work and the x-generative/conditioned conditions seem too strong (at least when combined with our simplistic source of examples).
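The difference between the two readings shows up as soon as we group by `x`. In the R sketch below the universe `X`, `Z` is made up for illustration: `Z` sums to zero (so the unconditional mean of `e` is zero), but `Z` is tied to `X` through the shared index, so the conditional means are not zero.

```r
set.seed(8675)
X <- 1:5
Z <- c(2, -1, -2, -1, 2)    # hypothetical structured errors; sum(Z) is 0

s <- sample.int(length(X), 100000, replace = TRUE)
x <- X[s]
e <- Z[s] + rnorm(length(s))

mean(e)                          # unconditional mean: near 0
condMean <- tapply(e, x, mean)   # conditional means E[e|x]: near Z, not 0
condVar  <- tapply(e, x, var)    # conditional variances V[e|x]: near 1
rbind(condMean, condVar)
```

So this universe passes the unconditioned check on `E[e(i)]` while failing `E[e(i)|x(i)] = 0`.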

**A good example**

The following R example (also available here) shows a data set generated under our framework where the Gauss-Markov theorem applies (under either probability model). In this case the true `y` is produced as an actual linear function of `x` plus iid (independent identically distributed) noise. This model meets the pre-conditions of the Gauss-Markov theorem (under both the frequentist and x-generative models). We observe that the empirical samples average out to the correct theoretical coefficients taken from the original universal population. All of the calculations are designed to match the quantities discussed in the Wikipedia derivations.

```
library(ggplot2)
workProblem <- function(dAll,nreps,name,sampleSize=10) {
  # least squares coefficients fit on the whole universe (the "actual" values)
  xAll <- matrix(data=c(dAll$x0,dAll$x1),ncol=2)
  cAll <- solve(t(xAll) %*% xAll) %*% t(xAll)
  beta <- as.numeric(cAll %*% dAll$y)
  betaSamples <- matrix(data=0,nrow=2,ncol=nreps)
  nrows <- dim(dAll)[[1]]
  for(i in 1:nreps) {
    # draw rows uniformly with replacement and add per-row unit normal noise
    dSample <- dAll[sample.int(nrows,sampleSize,replace=TRUE),]
    individualError <- rnorm(sampleSize)
    dSample$y <- dSample$y + individualError
    dSample$e <- dSample$z + individualError
    # least squares fit on the sample
    xSample <- matrix(data=c(dSample$x0,dSample$x1),ncol=2)
    cSample <- solve(t(xSample) %*% xSample) %*% t(xSample)
    betaS <- as.numeric(cSample %*% dSample$y)
    betaSamples[,i] <- betaS
  }
  # summarize: mean estimate per coefficient with +/- 2 standard error bounds
  d <- c()
  for(i in 1:(dim(betaSamples)[[1]])) {
    coef <- paste('beta',(i-1),sep='')
    mean <- mean(betaSamples[i,])
    dev <- sqrt(var(betaSamples[i,])/nreps)
    d <- rbind(d,data.frame(nsamples=nreps,model=name,coef=coef,
      actual=beta[i],est=mean,estP=mean+2*dev,estM=mean-2*dev))
  }
  d
}
repCounts <- as.integer(floor(10^(0.25*(4:24))))
print('good example')
## [1] "good example"
set.seed(2623496)
dGood <- data.frame(x0=1,x1=0:10)
dGood$y <- 3*dGood$x0 + 2*dGood$x1
dGood$z <- dGood$y - predict(lm(y~0+x0+x1,data=dGood))
print(dGood)
## x0 x1 y z
## 1 1 0 3 -9.326e-15
## 2 1 1 5 -7.994e-15
## 3 1 2 7 -7.105e-15
## 4 1 3 9 -5.329e-15
## 5 1 4 11 -5.329e-15
## 6 1 5 13 -3.553e-15
## 7 1 6 15 -1.776e-15
## 8 1 7 17 -3.553e-15
## 9 1 8 19 0.000e+00
## 10 1 9 21 0.000e+00
## 11 1 10 23 0.000e+00
print(summary(lm(y~0+x0+x1,data=dGood)))
## Warning: essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ 0 + x0 + x1, data = dGood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77e-15 -1.69e-15 -5.22e-16 4.48e-16 6.53e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x0 3.00e+00 1.58e-15 1.9e+15 <2e-16 ***
## x1 2.00e+00 2.67e-16 7.5e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.8e-15 on 9 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.47e+32 on 2 and 9 DF, p-value: <2e-16
print(workProblem(dGood,10,'good/works',10000))
## nsamples model coef actual est estP estM
## 1 10 good/works beta0 3 3.006 3.016 2.995
## 2 10 good/works beta1 2 1.999 2.001 1.997
pGood <- c()
set.seed(2623496)
for(reps in repCounts) {
pGood <- rbind(pGood,workProblem(dGood,reps,'goodData'))
}
ggplot(data=pGood,aes(x=nsamples)) +
  geom_line(aes(y=actual)) +
  geom_line(aes(y=est),linetype=2,color='blue') +
  geom_ribbon(aes(ymax=estP,ymin=estM),alpha=0.2,fill='blue') +
  facet_wrap(~coef,ncol=1,scales='free_y') + scale_x_log10() +
  theme(axis.title.y=element_blank())
```

Notice the code is using the “return data frames” principle. The derived graph shows what we expect from an unbiased low-variance estimate: convergence to the correct values as we increase number of repetitions.

**A bad example**

The following R example meets all of the *Wikipedia stated* conditions of the Gauss-Markov theorem under a frequentist probability model, but doesn’t even exhibit unbiased estimates on small samples, let alone minimal variance ones. It does produce correct estimates on large samples (so one could work with it), but we are not seeing unbiasedness (let alone low variance) on small samples. For this example: the ideal distribution and large samples are unbiased (but have some ugly structure), yet small samples appear biased.

This bad example is essentially given as: `y = x^2` and we haven’t made `x^2` available to the model (only `x`). So this data set doesn’t actually follow the assumed linear modeling structure. However, we can be sophists and claim the effect to model is `y = 10*x - 15 + e` (which is linear in the features we are making available) and the error term is in fact `e=x^2 - 10*x + 15 + individualError` (which does have an expected value of zero when `x` is sampled uniformly from the integers `0...10`).
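The arithmetic behind this claim is quick to check in R. The sketch below (variable names are ours, chosen for illustration) confirms that the structured part of the error sums to zero over `0...10`, and also that it is orthogonal to `x`, which is why a least squares fit on the full universe lands on `-15` and `10`.

```r
x <- 0:10
y <- x^2
e <- y - (10*x - 15)   # structured error term, before adding individualError

sum(e)                 # 0: the error has expected value zero under uniform sampling
sum(e * x)             # 0: the error is also orthogonal to x
coef(lm(y ~ x))        # intercept -15, slope 10 (up to rounding)
```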

This data set is designed to slip past the Gauss-Markov theorem pre-conditions under the frequentist interpretation. As we have shown all we need to do is check `sum_{k} Z(k)` is zero and the rest of the properties follow. In our case we have `sum_{k} Z(k) = sum_{x=0...10} (x^2 - 10*x + 15) = 0`. This data set does not slip past the Gauss-Markov theorem pre-conditions under the x-generative model as the obviously structured error term is what they are designed to prohibit/avoid. This sets us up for the following syllogism.

- This data set satisfies the Gauss-Markov theorem pre-conditions under the frequentist model.
- Our R simulation shows the data set doesn’t satisfy the conclusions of the Gauss-Markov theorem.
- We can then conclude the Gauss-Markov theorem pre-conditions can’t be based on the frequentist model.

We confirm this with the following R-simulation.

```
dBad <- data.frame(x0=1,x1=0:10)
dBad$y <- dBad$x1^2 # or y = -15 + 10*x1 with structured error
dBad$z <- dBad$y - predict(lm(y~0+x0+x1,data=dBad))
print('bad example')
## [1] "bad example"
print(dBad)
## x0 x1 y z
## 1 1 0 0 15
## 2 1 1 1 6
## 3 1 2 4 -1
## 4 1 3 9 -6
## 5 1 4 16 -9
## 6 1 5 25 -10
## 7 1 6 36 -9
## 8 1 7 49 -6
## 9 1 8 64 -1
## 10 1 9 81 6
## 11 1 10 100 15
print(summary(lm(y~0+x0+x1,data=dBad)))
##
## Call:
## lm(formula = y ~ 0 + x0 + x1, data = dBad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0 -7.5 -1.0 6.0 15.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x0 -15.000 5.508 -2.72 0.023 *
## x1 10.000 0.931 10.74 2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.76 on 9 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.959
## F-statistic: 128 on 2 and 9 DF, p-value: 2.42e-07
print(workProblem(dBad,10,'bad/works',10000))
## nsamples model coef actual est estP estM
## 1 10 bad/works beta0 -15 -14.92 -14.81 -15.023
## 2 10 bad/works beta1 10 9.99 10.01 9.971
print(sum(dBad$z*dBad$x0))
## [1] -7.816e-14
print(sum(dBad$z*dBad$x1))
## [1] -1.013e-13
pBad <- c()
set.seed(2623496)
for(reps in repCounts) {
pBad <- rbind(pBad,workProblem(dBad,reps,'badData'))
}
ggplot(data=pBad,aes(x=nsamples)) +
  geom_line(aes(y=actual)) +
  geom_line(aes(y=est),linetype=2,color='blue') +
  geom_ribbon(aes(ymax=estP,ymin=estM),alpha=0.2,fill='blue') +
  facet_wrap(~coef,ncol=1,scales='free_y') + scale_x_log10() +
  theme(axis.title.y=element_blank())
```

Notice even when we drive the number of repetitions high enough to collapse the error bars we still have one of the coefficient estimates routinely below its ideal value. This is what a biased estimation procedure looks like. Again, it isn’t strictly correct to say the problem is due to heteroscedasticity, as we are seeing bias (not just systematic changes in magnitude of variation).

The reason the average of small samples retains bias on this example is: least squares fitting is a non-linear function of the `x`s (it is only linear in the `y`s). Without an additional argument (such as the Gauss-Markov theorem) to appeal to there is no a priori reason to believe an average of non-linear estimates will converge to the original population values. However, we feel it is much easier to teach a conclusion like this from stronger assumptions such as independent identically distributed errors than from homoscedasticity. The gain in generality in basing inference on homoscedasticity is not really so large and the loss in clarity is expensive. The main downside of basing inference on independent identically distributed errors appears to be: you get accused of not knowing of the Gauss-Markov theorem.

**What is homoscedasticity/heteroscedasticity**

Heteroscedasticity is a general *undesirable* modeling situation where the variability of some of your variables changes from sub-population to sub-population. That is what the Wikipedia requirement is trying to get at with `V[e(i)]=c`. However as we move from informal text definitions to actual strict mathematics we have to precisely specify: what is varying with respect to what and which sub-populations do we consider identifiable?

Also be aware that while data with structured errors (the sign of errors being somewhat predictable from `x`s or even from omitted variables) can not be homoscedastic, it is not traditional to call such situations heteroscedastic (but to instead point out the structural error and say in the presence of such problems the question between homoscedastic/heteroscedastic does not apply).

We would also point out that B.S. Everitt’s “The Cambridge Dictionary of Statistics” 2nd edition does not have primary entries for homoscedastic or heteroscedastic. Our opinion is not that Everitt forgot them or did not know of them. But, likely Everitt found the criticism he would get for leaving these entries out of his dictionary would be less than the loss of clarity/conciseness that would come from including them (and the verbiage needed to respect their detailed historic definitions and conventions).

For our part: we have come to regret ever having used the term “heteroscedasticity” (which we have only attempted out of respect to our sources, which use the term). It is far simpler to introduce an ad-hoc term like *structural errors* and supply a precise definition and examples of what is meant in concise mathematical notation. What turns out to be complicated is: using standard statistical terminology which comes with a lot of conventions and historic linguistic baggage. Part of the problem is of course our own background is mathematics, not statistics. In mathematics term definitions tend to be revised to fit use and intended meaning, instead of being frozen to document priority (as is more common in sciences).

**Summary/conclusions**

Many probability/statistical write-ups fail to explicitly identify what probability model is actually underlying operators such as `E[]`, `V[]`, and `cov[]`. This is for brevity and pretty much the standard convention. Common probability models to consider include: frequentist (all parameters held constant and data regenerated), Bayesian (all observables held constant and probability statements are over distributions of unobserved quantities and parameters), and ad-hoc generative/conditional distributions (as we used here). The issue is: different probability models give different answers. Usually this is not a problem because by the same token: probability models encode so much about intent you can usually infer the right one from knowing intent.

Most common sampling questions use a frequentist model/interpretation (for example see Bayesian and Frequentist Approaches: Ask the Right Question). The issue is: under that rubric the statement there is a `c` such that `V[e(i)] = c` doesn’t carry a lot of content. What is probably meant/intended are strong conditional distribution statements like `E[e(i)|x(i)]=0` and `V[e(i)|x(i)]=c`. A quick proof analysis shows the derivations in the Wikipedia article are definitely pushing the `E[]` operator through `X`s as if the `X`s are constants independent of the sample/experiment. This is not correct in general (as our bad example showed), but is a legitimate step if all operators are conditioned on `X` (but again, that is a fairly strong condition).

Part of this is just a reminder that the Wikipedia is an encyclopedia, not a primary source. The other part is: don’t let statistical bullies force you away from clear thoughts and definitions.

For example: it is considered vulgar or ignorant to assume something as strong as independent identically distributed errors. The feeling is: the conclusion of the Gauss-Markov theorem gives facts about only the first two moments of a distribution, so the invoked pre-conditions should only use facts about the first two moments of any input distributions. But philosophically: assuming identical errors makes sense: errors we can’t tell apart in some sense *must* be treated as identical (as we can’t tell them apart). A data scientist if asked why they believe the residuals hidden in their data may be homoscedastic is more likely to appeal to some sort of assumed independent generative structure in their problem (which is itself not as weak or as general as homoscedasticity) than to point to an empirical test of homoscedasticity (which can itself be unreliable).

A lot tends to be going on in statistics papers (probabilities, interpretation, reasoning over counterfactuals, math, and more) so expect technical terminology (or even argot), implied conventions, and telegraphic writing. Correct comprehension often requires introducing and working your own examples.

]]>- Missing values (`NA` or blanks)
- Problematic numerical values (`Inf`, `NaN`, sentinel values like 999999999 or -1)
- Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.

**Missing Values; Missing Category Levels**

First, we’ll look at what to do when there are missing values or NAs in the data, and how to guard against category levels that don’t appear in the training data. Let’s make a small example data set that manifests these issues.

```
set.seed(9394092)
levels = c('a', 'b', 'c', 'd')
levelfreq = c(0.3, 0.3, 0.3, 0.1)
means = c(1, 6, 2, 7)
names(means) = levels
NArate = 1/30
X = sample(levels, 200, replace=TRUE, prob=levelfreq)
Y = rnorm(200) + means[X]
train = data.frame(x=X[1:150], y=Y[1:150], stringsAsFactors=FALSE)
test = data.frame(x=X[151:200], y=Y[151:200], stringsAsFactors=FALSE)
# remove a level from training
train = subset(train, x !='d')
# sprinkle in some NAs
ntrain = dim(train)[1] ; ntest = dim(test)[1]
train$x = ifelse(runif(ntrain) < NArate, NA, train$x)
test$x = ifelse(runif(ntest) < NArate, NA, test$x)
table(train$x)
##  a  b  c
## 40 44 42
sum(is.na(train$x))
## [1] 4
sum(is.na(test$x))
## [1] 2
```

This simulates a situation where a rare level failed to be collected in the training data. In addition, we’ve simulated a missing value mechanism. In this example, it’s a “faulty sensor” mechanism (missing values show up at random, as if a sensor were intermittently and randomly failing) – though it may also in general be a systematic mechanism, where the `NA` means something specific, like the measurement doesn't apply (say "most recent pregnancy date" for a male subject).

We can build a linear regression model for predicting `y` from `x`:

```
# build a model
model1 = lm("y~x", data=train)
train$pred = predict(model1, newdata=train) # this works
predict(model1, newdata=test) # this fails
## Error in model.frame.default(Terms, newdata, na.action = na.action,
##   xlev = object$xlevels) : factor x has new levels d
```

The model fails on the holdout data because the new data has a value of `x` which was not observed in the training data. You can always refuse to predict in such cases, of course, but in some situations even a not-so-good prediction may be better than no prediction at all. Note also that `lm` quietly omitted the rows where `x` was missing while training, and the resulting model will return `NA` as the predicted outcome in such cases. This is again perfectly reasonable, but not always what you want, especially in cases where a large fraction of the data has missing values.

Are there alternative ways to handle these issues? If `NA`s show up in the data, the conservative assumption is that they are missing systematically; in this situation (when `x` is a categorical value), we can then treat them as just another category value, for example by pretreating the variable to convert `NA` to "Unknown." When novel values show up in the test data (or when `NA`s appear in the holdout data, but not in the training data), the best assumption we can make is that the novel value is in fact one of the values that we have already observed, with the probability of it being any given value proportional to the training set frequencies.
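To make these two ideas concrete, here is a hand-rolled base R sketch. It is only an illustration of the idea, not the code inside our package: the helper name `makeIndicators()` and the tiny example vectors are made up for this note. `NA` becomes its own "Unknown" level, and a level never seen in training is spread across the known levels in proportion to their training frequencies.

```r
# Hypothetical helper (not the vtreat implementation): encode a categorical
# vector as indicator columns; novel levels get the training frequencies.
makeIndicators <- function(x, trainLevels, trainFreq) {
  x <- ifelse(is.na(x), "Unknown", as.character(x))
  ind <- matrix(0, nrow = length(x), ncol = length(trainLevels),
                dimnames = list(NULL, trainLevels))
  for (i in seq_along(x)) {
    if (x[i] %in% trainLevels) {
      ind[i, x[i]] <- 1                    # known level: ordinary indicator
    } else {
      ind[i, ] <- trainFreq[trainLevels]   # novel level: fractional indicators
    }
  }
  ind
}

# Made-up training column with one missing value
trainX <- c("a", "b", "b", NA, "c", "a", "b", "c")
trainClean <- ifelse(is.na(trainX), "Unknown", trainX)
freqTable <- table(trainClean) / length(trainClean)
trainLevels <- names(freqTable)
trainFreq <- setNames(as.numeric(freqTable), trainLevels)

# Test column: a known level, a novel level "d", and a missing value
testInd <- makeIndicators(c("a", "d", NA), trainLevels, trainFreq)
testInd
```

Every row of the result sums to one, so a model trained on the indicator columns can always produce a prediction, even for the novel level.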

We've implemented these data treatments, and others, in an R package called `vtreat`. The package is very much at the alpha stage, and is not yet available on CRAN; we'll explain how you can get the package later on in the post. For now, let's see how it works.

The first step is to use the training data to create a set of variable treatments, one for each variable of interest.

```
library(vtreat) # our library, not public; we'll show how to install later
treatments = designTreatmentsN(train, c("x"), "y")
```

The function `designTreatmentsN()` takes as input the data frame of training data, the list of input columns, and the (numerical) outcome column. There is a similar function `designTreatmentsC()` for binary classification problems. The output of the function is a list of variable treatment objects (of class `treatmentplan`), one per input variable.

```
treatments
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'->'x_lev_NA','x_lev_x.a','x_lev_x.b','x_lev_x.c')"
##
##
## $vars
## [1] "x_lev_NA"  "x_lev_x.a" "x_lev_x.b" "x_lev_x.c"
##
## $varScores
##  x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c
##    1.0310    0.6948    0.2439    0.8959
##
## $varMoves
##  x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c
##      TRUE      TRUE      TRUE      TRUE
##
## $outcomename
## [1] "y"
##
## $meanY
## [1] 3.246
##
## $ndat
## [1] 130
##
## attr(,"class")
## [1] "treatmentplan"
```

The `vars` field of a `treatmentplan` object gives the names of the new variables that were formed from the original variable: a categorical variable like `x` is converted to several indicator variables, one for each known level of `x` -- including `NA`, if it is observed in the training data. `varMoves` is TRUE if the new variable in question varies (that is, if it has more than one value in the training data). `meanY` is the base mean of the outcome variable (unconditioned on the inputs). `ndat` is the number of data points.

The field `varScores` is a rough indicator of variable importance, based on the PRESS statistic. The PRESS statistic of a model is the sum of the squared hold-one-out prediction errors: that is, the sum of `(y_i - f_i)^2`, where `y_i` is the outcome corresponding to the ith data point, and `f_i` is the prediction of the model built by using all the training data *except* the ith data point. We calculate the `varScore` of the jth input variable `x_j` to be the PRESS statistic of the one-dimensional linear regression model that uses only `x_j`, divided by the PRESS statistic of the unconditioned mean of `y`. A varScore of 0 means the model predicts perfectly. A varScore close to 1 means that the variable predicts only about as well as the global mean; a varScore above 1 means that the model predicts the outcome worse than the global mean. So the lower the varScore, the better. You can use `varScores` to prune uninformative variables, as we will show later.
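To make the definition concrete, here is a minimal sketch of the ratio described above, computed on simulated data. It follows the definition in the text rather than the package's internal code; the helper names `press_lm()` and `press_mean()` are hypothetical, and the sketch relies on the standard leverage identity for linear models (the hold-one-out residual equals the ordinary residual divided by one minus the hat value).

```r
# PRESS for a fitted linear model: sum of squared hold-one-out residuals,
# obtained without refitting via e_i / (1 - h_ii).
press_lm <- function(model) {
  sum((residuals(model) / (1 - hatvalues(model)))^2)
}

# PRESS for the unconditioned mean: predict each y_i by the mean of the others.
press_mean <- function(y) {
  n <- length(y)
  loo_mean <- (sum(y) - y) / (n - 1)
  sum((y - loo_mean)^2)
}

# Simulated example (assumed data, for illustration only).
set.seed(2014)
x_j <- rnorm(50)
y <- 2 * x_j + rnorm(50)

# Score of the single variable x_j: lower is better, near 1 is uninformative.
score <- press_lm(lm(y ~ x_j)) / press_mean(y)
score
```

Because `x_j` carries real signal here, the score comes out well below 1; replacing `x_j` with pure noise would push it toward (or slightly above) 1.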

Once you have created the treatment plans using `designTreatmentsN()`, you can treat the training and test data frames using the function `prepare()`. This creates new data frames that express the outcome in terms of the new transformed variables. `prepare()` takes as input a list of treatment plans and a data set to be treated. The optional argument `pruneLevel` lets you specify a threshold for `varScores`; variables with a varScore higher than `pruneLevel` will be eliminated. By default, `prepare()` will prune away any variables with a varScore greater than 0.99; we will use `pruneLevel=NULL` to force `prepare()` to create all possible variables.
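The pruning rule itself is simple enough to sketch directly. The snippet below is a hypothetical helper (`select_vars()` is not a `vtreat` function) applied to the `varScores` printed in the treatment plan earlier; it only illustrates the thresholding described above.

```r
# varScores copied from the treatment plan listing.
varScores <- c(x_lev_NA = 1.0310, x_lev_x.a = 0.6948,
               x_lev_x.b = 0.2439, x_lev_x.c = 0.8959)

# Hypothetical helper: keep variables whose score does not exceed
# pruneLevel; pruneLevel = NULL turns pruning off.
select_vars <- function(scores, pruneLevel = 0.99) {
  if (is.null(pruneLevel)) return(names(scores))
  names(scores)[scores <= pruneLevel]
}

select_vars(varScores)                     # default threshold drops x_lev_NA
select_vars(varScores, pruneLevel = NULL)  # keeps all four variables
```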

```
# pruneLevel=NULL turns pruning OFF
train.treat = prepare(treatments, train, pruneLevel=NULL)
test.treat = prepare(treatments, test, pruneLevel=NULL)

train.treat[1:4,]
##   x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c     y
## 1        0         0         1         0 7.037
## 2        0         0         0         1 1.209
## 3        0         0         0         1 2.819
## 4        0         0         0         1 2.099

subset(train.treat, is.na(train$x))  # similarly for test
##    x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c       y
## 12        1         0         0         0 -0.4593
## 48        1         0         0         0  6.4741
## 49        1         0         0         0  5.3387
## 81        1         0         0         0  2.2319
```

The listing above shows that instead of the training data frame `(x, y)`, we now have a training data frame with four `x` indicator variables, one for each of the known `x` values "a", "b", and "c", plus `NA`. According to the listing, the first four values for `x` in the training data were `c("b", "c", "c", "c")`. `NA`s are encoded as the variable `x_lev_NA`.

We can see how `prepare()` handles novel values in the test data:

```
#
# when we encounter a new variable value, we assign it all levels,
# proportional to training set frequencies
#
subset(test.treat, test$x=='d')
##   x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c     y
## 8  0.03077    0.3077    0.3385    0.3231 4.622
```

Looking back at the process by which we generated `y`, we can see in this case that the "d" level isn't actually a proportional combination of the other levels; still this is the best assumption in the absence of any other information. Furthermore, in the more common situation of multiple input variables, this assumption allows us to take advantage of information that is available through those other variables.

Now we can fit a model using the transformed variables:

```
# get the names of the x variables
vars = setdiff(colnames(train.treat), "y")
fmla = paste("y ~ ", paste(vars, collapse=" + "))
model2 = lm(fmla, data=train.treat)
summary(model2)
##
## Call:
## lm(formula = fmla, data = train.treat)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.856 -0.756 -0.026  0.782  3.078
##
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    2.103      0.168   12.54  < 2e-16 ***
## x_lev_NA       1.293      0.569    2.27  0.02461 *
## x_lev_x.a     -0.830      0.240   -3.46  0.00075 ***
## x_lev_x.b      4.014      0.234   17.13  < 2e-16 ***
## x_lev_x.c         NA         NA      NA       NA
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 126 degrees of freedom
## Multiple R-squared:  0.794, Adjusted R-squared:  0.789
## F-statistic:  162 on 3 and 126 DF,  p-value: <2e-16
```

The significance levels of the variables are consistent with the variable importance scores we observed in the treatment plan. The fact that the coefficient for one of the levels is reported as `NA` is to be expected; four levels imply three degrees of freedom (plus the intercept). The standard practice is to omit one level of a categorical variable as redundant. We don't do this in our treatment plan, as regularized models can actually benefit from having the extra level left in. You will get warnings about possibly misleading fits when applying the model; in this case, we know how the variables were constructed, and that there are no hidden degeneracies in the variables (at least none that we created), so we can disregard the warning.

```
# you get the warnings about rank-deficient fits
train.treat$pred = predict(model2, newdata=train.treat)
## Warning: prediction from a rank-deficient fit may be misleading
test.treat$pred = predict(model2, newdata=test.treat)  # works!
## Warning: prediction from a rank-deficient fit may be misleading

# no NAs
summary(train.treat$pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.27    1.27    2.10    3.25    6.12    6.12
summary(test.treat$pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.27    1.27    2.10    3.37    6.12    6.12

# note that this model gives the same answers on training data
# as the default model
sum(abs(train$pred - train.treat$pred), na.rm=TRUE)
## [1] 9.566e-13
```

The last command of the above listing confirms that on the training data, the model learned from the treated data is equivalent to the model learned on the original data. Now we can look at model accuracy.

```
rmse = function(y, pred) {
  se = (y-pred)^2
  sqrt(mean(se))
}

# model does well where it really has x values
with(subset(train, !is.na(x)), rmse(y, pred))
## [1] 0.973

# not too bad on NAs
with(train.treat, rmse(y,pred))
## [1] 1.07

# model generalizes well on levels it's observed
with(subset(test.treat, test$x != "d"), rmse(y,pred))
## [1] 1.08

# less well on novel values
with(test.treat, rmse(y,pred))
## [1] 1.272

subset(test.treat, test$x=='d')[,c("y", "pred")]
##       y  pred
## 8 4.622 3.246
```

As expected, the model does not perform as well on novel data values (`x` = "d"), but at least it returns a prediction without crashing. Furthermore, if the novel levels are rare (as we would expect), then predicting them poorly will not affect the overall performance of the model too much.

Let's try preparing the data with the default pruning parameters (`pruneLevel=0.99`):

```
train.treat = prepare(treatments, train)
test.treat = prepare(treatments, test)

# The x_lev_NA variable has been pruned away
train.treat[1:4,]
##   x_lev_x.a x_lev_x.b x_lev_x.c     y
## 1         0         1         0 7.037
## 2         0         0         1 1.209
## 3         0         0         1 2.819
## 4         0         0         1 2.099

# NAs are now encoded as (0,0,0)
subset(train.treat, is.na(train$x))
##    x_lev_x.a x_lev_x.b x_lev_x.c       y
## 12         0         0         0 -0.4593
## 48         0         0         0  6.4741
## 49         0         0         0  5.3387
## 81         0         0         0  2.2319

# d is now encoded as the relative frequencies of a, b, and c.
subset(test.treat, test$x=='d')
##   x_lev_x.a x_lev_x.b x_lev_x.c     y
## 8    0.3077    0.3385    0.3231 4.622
```

We no longer keep `NA` as a level, because it's not any more informative than the global mean; novel levels are still encoded as "all the known levels," proportionally weighted. If we use this data representation to model, we don't have a rank-deficient fit.

```
vars = setdiff(colnames(train.treat), "y")
fmla = paste("y ~ ", paste(vars, collapse=" + "))
model2 = lm(fmla, data=train.treat)
summary(model2)
##
## Call:
## lm(formula = fmla, data = train.treat)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.856 -0.756 -0.026  0.782  3.078
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.396      0.543    6.25  5.8e-09 ***
## x_lev_x.a     -2.123      0.570   -3.73  0.00029 ***
## x_lev_x.b      2.721      0.567    4.80  4.5e-06 ***
## x_lev_x.c     -1.293      0.569   -2.27  0.02461 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 126 degrees of freedom
## Multiple R-squared:  0.794, Adjusted R-squared:  0.789
## F-statistic:  162 on 3 and 126 DF,  p-value: <2e-16
```

The model performance is similar to that of the model that included `x_lev_NA`.

```
train.treat$pred = predict(model2, newdata=train.treat)
test.treat$pred = predict(model2, newdata=test.treat)

sum(abs(train$pred - train.treat$pred), na.rm=TRUE)
## [1] 6.297e-13
with(train.treat, rmse(y,pred))
## [1] 1.07
with(test.treat, rmse(y,pred))
## [1] 1.272
```

**Numerical variables and Categorical variables with many levels**

The above examples looked at data treatment for a simple categorical variable with a moderate number of levels, some possibly missing. There are two other cases to consider. First, we would like basic data treatment for numerical variables, to protect against bad values like `NA`, `NaN` or `Inf`.

Second, we'd like to gracefully manage categorical variables with a large number of possible levels, such as ZIP code, telephone area code, or even city or other geographical region. Such categorical variables can be problematic because they introduce computational or data size issues for some modeling algorithms. For example, the size of the design matrix when computing linear or logistic regression models grows as the square of the number of variables -- and a categorical variable with `N` levels is represented as `N-1` indicator variables. The `randomForest` implementation in R cannot handle categorical variables with more than 32 levels. Categoricals with a large number of levels are also a problem because it is more likely that some of the rarer levels will not appear in the training set, triggering the "novel level" problem on new data: if only a few of your customers come from Alaska or Rhode Island, then those states may not show up in your training set -- but they may show up when you deploy the model to your website.

There are often domain specific ways to handle categories with many levels. For example, a common trick with zip codes is to map them to a new variable whose value is related to zip code and relevant to the problem, such as average household income within that zip code. Obviously, this mapping won't be appropriate in all situations, so it's good to have an automatic procedure to fall back on.

Previously, we've discussed a technique that we call "impact coding" to manage this issue. We discuss this technique here and here; see also Chapter 6 of *Practical Data Science with R*. Impact coding converts a categorical variable `x_cat` into a numerical variable that corresponds to a one-variable Bayesian model for the outcome as a function of `x_cat`. The `vtreat` library implements impact coding as discussed in those posts, with a few improvements.

Let's build another simple example, to demonstrate impact coding and the treatment of numerical variables.

```
N = 100  # a variable with 100 levels
levels = paste('gp', 1:N, sep='')
fhi = c(0.15, 0.1, 0.1)  # the first three levels account for 35% of the data
fx = sum(fhi)/(N-length(fhi))
levelfreq = c(fhi, numeric(N-length(fhi))+fx)
means = sample.int(10, size=N, replace=TRUE)
names(means) = levels

X = sample(levels, 200, replace=TRUE, prob=levelfreq)
U = rnorm(200, mean=0.5)  # numeric variable
Y = rnorm(200) + means[X] + U

length(unique(X))  # the data set is missing levels
## [1] 68

train = data.frame(x=X[1:150], u = U[1:150], y=Y[1:150], stringsAsFactors=FALSE)
test = data.frame(x=X[151:200], u= U[151:200], y=Y[151:200], stringsAsFactors=FALSE)

# sprinkle a few NAs into u (for demonstration purposes)
train$u = ifelse(runif(150) < 0.01, NA, train$u)

length(setdiff(unique(test$x), unique(train$x)))  # and test has some levels train doesn't
## [1] 11
```

The `designTreatmentsN` function has two parameters that control when a categorical variable is impact coded. The parameter `minFraction` (default value: 0.02) controls what fraction of the time an indicator variable has to be "on" (that is, not zero) to be used (this is separate from the `pruneLevel` parameter in `prepare`). The purpose is to eliminate rare variables or rare levels. By default, we eliminate variables that are on less than 2% of the time.

When a categorical variable has a large number of levels, it's likely that many of them will be on less than 2% of the time. In that case, the corresponding indicator variables are eliminated, and all of those rare levels will encode to `c(0, 0, ...)`, in the way the `NA` level did in our second example above. Let's call the fraction of the data that gets encoded to zero due to rare levels the fraction of the data that we "lose". The parameter `maxMissing` (default value: 0.04) specifies what fraction of the data we are allowed to "lose" before automatically switching to an impact coded variable. By default, if the eliminated levels correspond to more than 4% of the data, then the treatment plan will switch to impact coding.

In the example above, three levels of the variable `x` account for 35% of the data, so all the other levels will account for roughly `(1-0.35)/97 = 0.0067` of the data each, or less than 1% of the mass each. So, all of those 97 levels would be eliminated, and we will "lose" 65% of the data if we keep the categorical representation! Therefore, the data treatment automatically converts `x` to an impact-coded variable.
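The arithmetic in the previous paragraph can be replayed in a few lines. This sketch uses the level frequencies as described in the text (three levels holding 35% of the data, the other 97 sharing the remaining 65%) and applies the two thresholds by hand; it illustrates the decision rule only and is not taken from the package source. The variable names other than `minFraction` and `maxMissing` are introduced here.

```r
N <- 100
fhi <- c(0.15, 0.1, 0.1)                    # the three common levels
n_rest <- N - length(fhi)                   # 97 remaining levels
levelfreq <- c(fhi, rep((1 - sum(fhi)) / n_rest, n_rest))

minFraction <- 0.02   # default described in the text
maxMissing <- 0.04    # default described in the text

rare <- levelfreq < minFraction             # levels whose indicators would be dropped
n_rare <- sum(rare)                         # how many levels are dropped
lost <- sum(levelfreq[rare])                # fraction of the data encoded as all zeros
use_impact_code <- lost > maxMissing        # switch to impact coding?

c(n_rare = n_rare, lost = lost)
use_impact_code
```

With these frequencies every one of the 97 rare levels falls under the 2% threshold, so about 65% of the data would be "lost", far above the 4% allowance.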

```
#
# create the treatment plan.
#
treatments = designTreatmentsN(train, c("x", "u"), "y")
treatments
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Scalable Impact Code'('x'->'x_catN')"
##
## $treatments[[2]]
## [1] "vtreat 'Scalable pass through'('u'->'u_clean')"
##
## $treatments[[3]]
## [1] "vtreat 'is.bad'('u'->'u_isBAD')"
##
##
## $vars
## [1] "x_catN"  "u_clean" "u_isBAD"
##
## $varScores
##  x_catN u_clean u_isBAD
##  0.1717  0.9183  1.0116
##
## $varMoves
##  x_catN u_clean u_isBAD
##    TRUE    TRUE    TRUE
##
## $outcomename
## [1] "y"
##
## $meanY
## [1] 5.493
##
## $ndat
## [1] 150
##
## attr(,"class")
## [1] "treatmentplan"
```

The variable `x_catN` is the impact-coded variable corresponding to `x`. If we refer to the mean of `y` conditioned on `x` as `y|x`, and `meanY` as the grand (unconditioned) mean of `y`, then `x_catN = y|x - meanY`. Note that `x_catN` has a low `varScore`, indicating that it is a good, informative variable.
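The formula `x_catN = y|x - meanY` can be written out by hand on a toy data set. This is a bare-bones sketch of the formula only; the post notes that the package's implementation includes improvements that are not reproduced here. The data frame `toy` and the helper `impact_code()` are hypothetical.

```r
# Hypothetical toy training data.
toy <- data.frame(
  x = c("gp1", "gp1", "gp2", "gp2", "gp2", "gp3"),
  y = c(1, 3, 5, 7, 9, 4),
  stringsAsFactors = FALSE
)

mean_y <- mean(toy$y)                              # grand mean of y
level_means <- sapply(split(toy$y, toy$x), mean)   # mean(y | x) per level
impact <- level_means - mean_y                     # impact code per level

# Look up the code for each value; levels never seen in training
# carry no information, so they map to 0 (that is, to the grand mean).
impact_code <- function(x) {
  code <- unname(impact[x])
  code[is.na(code)] <- 0
  code
}

impact_code(c("gp1", "gp2", "gp9"))
```

A level whose outcomes sit above the grand mean gets a positive code, one below gets a negative code, and an unseen level such as "gp9" gets zero.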

The variable `u_clean` is the numerical variable `u`, with all "bad" values (`NA`, `NaN`, `Inf`) converted to the mean of the "non-bad" `u` (we'll call this the "clean mean" of `u`). The variable `u_isBAD` is an indicator variable that is one whenever `u` is bad. If the bad values are due to a "faulty sensor" (that is, they occur at random), then converting to the clean mean value of `u` is the right thing to do. If the bad values are systematic, then `u_isBAD` can be used by the modeling algorithm to adjust for the systematic effect (assuming it survives the pruning, which in this case, it won't).
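As a rough illustration of the two derived columns, the following base R sketch builds a "clean" version and a "bad" indicator for a small made-up vector. The variable names mirror those in the text, but the code is an assumption about the behavior described, not the package's own implementation.

```r
# Hypothetical numeric input containing NA, Inf and NaN.
u <- c(1.5, NA, 0.5, Inf, NaN, 2.5)

# is.na() is TRUE for both NA and NaN; is.infinite() catches Inf and -Inf.
is_bad <- is.na(u) | is.infinite(u)

clean_mean <- mean(u[!is_bad])            # mean of the "non-bad" values
u_clean <- ifelse(is_bad, clean_mean, u)  # bad values replaced by the clean mean
u_isBAD <- as.numeric(is_bad)             # indicator: 1 where u was bad

data.frame(u, u_clean, u_isBAD)
```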

We can see how this works concretely by preparing the test and training sets.

```
train.treat = prepare(treatments, train)
test.treat = prepare(treatments, test)

train.treat[1:5,]  # isBAD column didn't survive
##     x_catN u_clean     y
## 1  0.04809  1.0749 5.328
## 2  1.37053 -0.4429 5.413
## 3 -2.32535  1.4380 2.372
## 4 -5.02863  0.6611 0.464
## 5 -1.04404  0.8327 4.449

# ------------------------
# "bad" u values map to the "clean mean" of u
# ------------------------
train.treat[is.na(train$u),]
##     x_catN u_clean     y
## 74   1.371  0.5014 5.269
## 133  3.053  0.5014 8.546

# compare to u_clean, above
mean(train$u, na.rm=TRUE)
## [1] 0.5014

#-----------------------
# confirm (x_catN | x = xlevel) is mean(y | x=xlevel) - mean(y)
# ------------------------
subset(train.treat, train$x==levels[1])[1:2,]
##    x_catN u_clean     y
## 3  -2.325  1.4380 2.372
## 15 -2.325  0.0661 2.381

# compare to x_catN, above
mean(subset(train, x==levels[1])$y) - mean(train$y)
## [1] -2.325

#-----------------------
# missing levels map to 0, which is equivalent to
# mapping them to all known levels proportional to frequency
#-----------------------
missingInTest = setdiff(unique(test$x), unique(train$x))
subset(test.treat, test$x %in% missingInTest)[1:2,]
##       x_catN u_clean     y
## 1  4.737e-16  1.3802 1.754
## 13 4.737e-16  0.9062 6.862
```

Finally, we use the treated data to model.

```
vars = setdiff(colnames(train.treat), "y")
fmla = paste("y ~ ", paste(vars, collapse=" + "))
model = lm(fmla, data=train.treat)
summary(model)
##
## Call:
## lm(formula = fmla, data = train.treat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.2077 -0.6131 -0.0113  0.5237  2.5923
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.1347     0.0800    64.2   <2e-16 ***
## x_catN        0.9846     0.0296    33.3   <2e-16 ***
## u_clean       0.7139     0.0760     9.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.862 on 147 degrees of freedom
## Multiple R-squared:  0.894, Adjusted R-squared:  0.892
## F-statistic:  617 on 2 and 147 DF,  p-value: <2e-16

train.treat$pred = predict(model, newdata=train.treat)
test.treat$pred = predict(model, newdata=test.treat)

with(train.treat, rmse(y,pred))
## [1] 0.8529
with(test.treat, rmse(y,pred))
## [1] 1.964

# evaluate only on the known levels
with(subset(test.treat, test$x %in% unique(train$x)), rmse(y, pred))
## [1] 1.138
```

As you can see, the model performs better on categories that it saw during training, but it still handles novel levels gracefully -- and remember, some modeling algorithms can't handle a large number of categories at all.

That describes the most basic data treatment procedures that our package implements. For binary classification and logistic regression problems, the package has another function, `designTreatmentsC()`, which creates treatment plans when the outcome is a binary class variable.

**Loading the vtreat package**

We have made `vtreat` available on github; remember, this is an alpha package, so it will be rough around the edges. To install the package, download the `vtreat` tar file (at this writing, `vtreat_0.2.tar.gz`), as shown in the figure below:

Once you've downloaded it, you can install it from the R command line, as you would any other package. If your R working directory is the same directory where you've downloaded the tar file, then the command looks like this:

install.packages('vtreat_0.2.tar.gz',repos=NULL,type='source')

Once it's installed, `library(vtreat)` will load the package. Type `help(vtreat)` to get a short description of how to use the package, along with some example code snippets.

`vtreat` has a few more features that we will cover in future posts, but this post has given you enough to get you started. Remember, automatic data treatment procedures are not a substitute for inspecting and exploring your data before modeling. However, once you've gotten a feel for the data, you will find that the procedures we have implemented are applicable to a wide variety of situations.

If you try the package, please do send along feedback, including any errors or bugs that you might discover.

For more on data treatment, see Chapter 4 of *Practical Data Science with R*.

Nina Zumel also examines aspects of the supernatural in literature and in folk culture at her blog, multoghost.wordpress.com. She writes about folklore, ghost stories, weird fiction, or anything else that strikes her fancy. Follow her on Twitter @multoghost.

]]>

An easy way to avoid fairly evaluating an analysis technique is to assert that the technique in question is unsound because it violates some important foundational axiom of sound analysis. This rapidly moves a discussion from a potentially difficult analysis to an easy debate. However, this (unfortunately common) behavior is mere gamesmanship (see Potter “The Theory and Practice of Gamesmanship (or the Art of Winning Games without Actually Cheating)”). But it is what you can encounter when presenting a technique from school “B” to members of school “A.”

For example: Bayesian parameter estimates can be considered inadmissible by frequentists because the estimates may be biased (see “Frequentist inference only seems easy” for an interesting example of the principle, and of a valid low-variance estimate that is necessarily biased). BDA3 page 94 provides an interesting situation with a deliberate omitted variable bias (a feature of the data). BDA3 goes on to demonstrate how silly it would be to apportion the blame for prediction bias to the inference technique used (ordinary linear regression), or to try to mechanically adjust for prediction bias without fixing the underlying omitted variable issue (by recruiting more variables/features). The example is important because, as we demonstrated in our earlier article, so-called unbiased techniques work by rejecting many (possibly good) biased estimates, and therefore can implicitly incorporate potentially domain-inappropriate bias corrections or adjustments. This example is relevant because it is easier to respond to such criticism when it is applied to a standard technique used on a simple artificial problem (versus defending a specialized technique on messy real data).

Axiomatic approaches to statistical inference tend to be very brittle in that it takes only a few simple rules to build a paradoxical or unsatisfiable system. For example: we described how even insisting on the single reasonable axiom of unbiasedness completely determines a family of statistical estimates (leaving absolutely no room to attempt to satisfy any additional independent conditions or axioms).

This sort of axiomatic brittleness is not unique to statistical inference. It is a common experience that small families of seemingly reasonable (and important) desiderata lead to inconsistent and unsatisfiable systems when converted to axioms. Examples include Arrow’s impossibility theorem (showing a certain reasonable combination of goals in voting systems is unachievable) and Brewer’s CAP theorem (showing a certain reasonable combination of distributed computing goals is mutually incompatible). So the reason a given analysis may not satisfy an obvious set of desirable axioms is often that no analysis satisfies the given set of axioms.

Let’s get back to the BDA3 example and work out how to criticize ordinary linear regression for having an undesirable bias. If linear regression can’t stand up to this sort of criticism, how can any other method be expected to? If we are merely looking at the words it is “obvious” that regression can’t be biased as this would contradict the Gauss-Markov theorem (that linear regression is the “best linear unbiased estimator” or BLUE). However, the word “bias” can have different meanings in different contexts: in particular *what* is biased with respect to *what*? Let’s refine the idea of bias and try to make ordinary linear regression look bad.

Consider the following simple problem. Suppose our data set is observations of pairs of mother’s and adult daughter’s heights. Suppose idealizations of these two random variables are generated by the following process:

- `c` (unobserved) is independently sampled from a normal distribution with a mean of 80 centimeters and a standard deviation of 5 centimeters (the shared or common component of height).
- `u` (unobserved) is independently sampled from a normal distribution with a mean of 80 centimeters and a standard deviation of 5 centimeters (the unique to mother portion of height).
- `v` (unobserved) is independently sampled from a normal distribution with a mean of 80 centimeters and a standard deviation of 5 centimeters (the unique to daughter portion of height).
- We then observe the two derived random variables: mother’s height `m=c+u`, and adult daughter’s height `d=c+v`.

The random variables `m` and `d` are normally distributed with equal means of 160 centimeters, equal variances, and a correlation of 0.5. As we said: we can think of the two random variables `m` and `d` as representing the heights of pairs of mothers and adult daughters. The correlation means tall mothers tend to have taller daughters (but the correlation being less than 1.0 means the mother’s height does not completely determine the daughter’s height). Obviously real heights are not normally distributed (as people do not have negative heights, and non-degenerate normal distributions have non-zero mass on negative values); but overall the normal distribution is a very good approximation of plausible heights.
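The stated means, variances, and correlation follow directly from the generative process. The short derivation below uses only the independence of `c`, `u`, and `v` and their common variance of 25 square centimeters (a standard deviation of 5 centimeters each):

$$
\begin{aligned}
E[m] &= E[c] + E[u] = 80 + 80 = 160, \qquad E[d] = E[c] + E[v] = 160 \\
\mathrm{Var}(m) &= \mathrm{Var}(c) + \mathrm{Var}(u) = 25 + 25 = 50, \qquad \mathrm{Var}(d) = \mathrm{Var}(c) + \mathrm{Var}(v) = 50 \\
\mathrm{Cov}(m, d) &= \mathrm{Cov}(c + u,\; c + v) = \mathrm{Var}(c) = 25 \\
\mathrm{Cor}(m, d) &= \frac{\mathrm{Cov}(m, d)}{\sqrt{\mathrm{Var}(m)\,\mathrm{Var}(d)}} = \frac{25}{50} = 0.5
\end{aligned}
$$

The only source of covariance is the shared component `c`, which is why exactly half of each observed variance is common to mother and daughter.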

This generative model represents a specialization of the example from BDA3 page 94 to specific distributions that clearly obey the properties claimed in BDA3. We are completely specifying the distributions to attempt to negate any (wrong) claim that there may not be distributions simultaneously having all of the claimed properties mentioned in the original BDA3 example. The interpretation (again from BDA3) of the two observed random variables `m` and `d` as pairs of mother/daughter heights is to give the data an obvious interpretation and help make obvious when our procedures become silly. At this point we have distributions exactly matching the claimed properties in BDA3 and very closely (but not exactly) matching the claimed interpretation as heights of pairs of mothers and their adult daughters.

Let’s move on to the analysis. The claim in BDA3 is that the posterior mean of `d` given `m` is: `E[d|m] = 160 + 0.5 (m-160)`. We could derive this through Bayes law and some calculus/algebra. But we get the exact same answer using ordinary linear regression (which tends to have a frequentist justification). In R:

```
n <- 10000
set.seed(4369306)
d <- data.frame(c=rnorm(n,mean=80,sd=5),
                u=rnorm(n,mean=80,sd=5),
                v=rnorm(n,mean=80,sd=5))
d$m <- d$c+d$u
d$d <- d$c+d$v
print(cor(d$m,d$d))
## [1] 0.4958206
print(lm(d~m,data=d))
##
## Call:
## lm(formula = d ~ m, data = d)
##
## Coefficients:
## (Intercept)            m
##     81.6638       0.4899
```

The recovered linear model is very close to the claimed theoretical conditioned expectation `E[d|m] = 160 + 0.5 (m-160) = 80 + 0.5 m`. So we can assume a good estimate of `d` can be learned from the data. To keep things neat let’s say our point-estimate for `d` is called `δ`, and `δ = 160 + 0.5 (m-160)`. As we see below `δ` is a plausible looking estimate:

```
library(ggplot2)
d$delta <- 160 + 0.5*(d$m-160)
ggplot(data=d,aes(x=delta,y=d)) +
  geom_density2d() + geom_point(alpha=0.1) +
  geom_smooth() +
  geom_segment(x=150,xend=170,y=150,yend=170,linetype=2) +
  coord_equal(ratio=1) + xlim(140,180) + ylim(140,180)
```

Actuals (`d`) plotted against estimate `δ`

Notice the dashed line `y=x` mostly coincides with the blue smoothing curve in the above graph; this is a visual confirmation that `E[d|δ] = δ`. This follows because we chose `δ` so that `δ = E[d|m]` (i.e. matching the regression estimate) and if we know `δ` then we (by simple linear substitution) also know `m`. So `E[d|δ] = E[d|m] = δ`. `E[d|δ] = δ` seems like a very nice property for the estimate `δ` to have. We can (partially) re-confirm it by fitting a linear model of `d` as a linear function of `δ`:

```
print(summary(lm(d~delta,data=d)))
## Call:
## lm(formula = d ~ delta, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -20.1532  -4.1458   0.0286   4.1461  23.5144
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.28598    2.74655   1.196    0.232
## delta        0.97972    0.01716  57.089   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 6.126 on 9998 degrees of freedom
## Multiple R-squared:  0.2458, Adjusted R-squared:  0.2458
## F-statistic:  3259 on 1 and 9998 DF,  p-value: < 2.2e-16
```

We see the expected slope near one and an intercept/dc-term statistically indistinguishable from zero. And we don’t really have much bad to say about this fit (beyond the R-squared of 0.2458, which is expected when correlation is known to be 0.5). For instance the residuals don’t formally appear structured (despite the obvious visible tilt of principal axes in the previous graph):

```
plot(lm(d~delta,data=d))
```

And now for the (intentionally overreaching) frequentist criticism. From BDA3 page 94 (variable names changed): “The posterior mean is *not*, however, an unbiased estimate of `d` in the sense of repeated sampling of `m` for a fixed `d`.” That is: the chosen estimate `δ` is not an unbiased estimate of a general fixed unknown value of `d` under repeated experiments where the observed variable `m` varies according to repeated draws from the joint distribution. This may sound complicated, but it is the standard frequentist definition of an unbiased estimator: for any given fixed unknown value of the item to be estimated under repeated experiments (with new, possibly different observed data) the value of the estimate should match the fixed unknown value in expectation. In other words: it isn’t considered enough for a single given estimate `δ` to capture the expected value of the unknown item `d` (to have `E[d|δ] = δ`, which we have confirmed), we must also have the whole *estimation procedure* be unbiased for arbitrary unknown `d` (that would be `E[δ|d] = d`, which we will show does not hold in general). To be clear BDA3 is not advocating this criticism, they are just citing it as a standard frequentist criterion often wrongly over-applied to methods designed with different objectives in mind. The punch-line is: the predictions from the method of ordinary linear regression fail this criticism, yet the method continues to stand.

Let’s confirm `E[δ|d] ≠ d` in general. To do this we need one more lemma: for a fixed (unknown) value of `d` we know the conditional expectation of the observable value of `m` is `E[m|d] = 160 + 0.5 (d-160)`. We can again get this by a Bayesian argument, or just by running the linear regression `lm(m~d,data=d)` and remembering that linear regression is a linear estimate of the conditional expectation. We are now ready to look at the expected value of our estimate `δ` conditioned on the unknown true value `d`: `E[δ|d]`. Plugging in what we know we get:

```
E[δ|d] = E[160 + 0.5 (m-160) | d]
       = 160 + 0.5 (E[m|d]-160)
       = 160 + 0.5 ((160 + 0.5 (d-160))-160)
       = 160 + 0.25 (d - 160)
```

And that is a problem. To satisfy frequentist unbiasedness we would need `E[δ|d] = d` for all `d`. And `160 + 0.25 (d - 160) = d` only if `d=160`. So for all but one possible value of the daughter’s height `d` the ordinary linear regression’s prediction procedure is considered biased in the frequentist sense. In fact we didn’t actually use the regression coefficients, we used the exact coefficients implied by the generative model that is actually building the examples. So we could even say: using the actual generative model to produce predictions is not unbiased in the frequentist sense.

This would seem to contradict our earlier regression check, but that is not the case. Consider the following regression and graph:

```
print(summary(lm(delta~d,data=d)))
## Call:
## lm(formula = delta ~ d, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4663 -2.0743 -0.0216 2.0890 12.7414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.198e+02 7.041e-01 170.20 <2e-16 ***
## d 2.509e-01 4.395e-03 57.09 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 3.1 on 9998 degrees of freedom
## Multiple R-squared: 0.2458, Adjusted R-squared: 0.2458
## F-statistic: 3259 on 1 and 9998 DF, p-value: < 2.2e-16
ggplot(data=d,aes(x=d,y=delta)) +
geom_density2d() + geom_point(alpha=0.1) +
geom_smooth() +
geom_segment(x=140,xend=180,y=140,yend=180,linetype=2) +
coord_equal(ratio=1) + xlim(140,180) + ylim(140,180)
```

Estimate `δ` plotted against actuals (`d`).

Notice that the slope (both from the regression and the graph) is now 0.25 and the dashed "y=x" line no longer agrees with the empirical smoothing curve. This closely agrees with our derived form for `E[δ|d]`. And this may expose one source of the confusion. The slope of the regression `d ~ δ` is 1.0, while the slope of the regression `δ ~ d` is 0.25. This violates a possible naive expectation/intuition that these two slopes should be reciprocal (which they need not be, as each has a different error model).
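The two slopes can be checked numerically. The following is a minimal Python sketch, assuming the same generative model as the R examples in this article (`c`, `u`, `v` independent normal with mean 80 and standard deviation 5); the seed, sample size, and variable names are illustrative choices, not the original code. It also shows why the slopes are not reciprocal: for least-squares fits the product of the two slopes equals the squared correlation, which is 1 only for perfectly correlated data.

```python
import numpy as np

rng = np.random.default_rng(4369306)   # arbitrary seed
n = 200_000                            # illustrative sample size

# Assumed generative model: shared part c, unique parts u and v.
c = rng.normal(80.0, 5.0, n)
u = rng.normal(80.0, 5.0, n)
v = rng.normal(80.0, 5.0, n)
m = c + u                              # mother's height
d = c + v                              # daughter's height
delta = 160.0 + 0.5 * (m - 160.0)      # prediction of d from m

# Least-squares slopes in both directions.
slope_delta_on_d = np.polyfit(d, delta, 1)[0]
slope_d_on_delta = np.polyfit(delta, d, 1)[0]
r = np.corrcoef(d, delta)[0, 1]

print("slope of delta ~ d:", round(slope_delta_on_d, 3))
print("slope of d ~ delta:", round(slope_d_on_delta, 3))
print("product of slopes :", round(slope_delta_on_d * slope_d_on_delta, 3))
print("squared correlation:", round(r ** 2, 3))
```

The slopes come out near 0.25 and 1.0, and their product matches the squared correlation rather than 1.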

Part of what is going on is an expected reversion to the mean effect. If we have a given `m` in hand then `δ = 160 + 0.5 (m-160)` is in fact a good estimate for `d` (given that we know only `m` and don't have a direct better estimate for `c`). What we don't have is the ability to guess what part of the heights to be estimated is from the shared process (`c`, which we can consider an omitted variable in this simple analysis) and what part is from the unique processes (`u` and `v`, and therefore not useful for prediction).

One concern: we have been taking all conditional expectations `E[|]` over the same data set (a nice single consistent probability model). This doesn't quite reproduce the frequentist set-up of `d` being fixed. However, if there was in fact no reversion to the mean on any `d`-slice then we would not have seen reversion to the mean in the aggregate. We can check the fixed-`d` case directly with a little math to produce a new fixed-`d` data set, or approximate it by censoring a larger data set down to a narrow interval of `d`. Here is such an example (showing the same effects we saw before):

```
n2 <- 10000000
set.seed(4369306)
d2 <- data.frame(c=rnorm(n2,mean=80,sd=5),
                 u=rnorm(n2,mean=80,sd=5),
                 v=rnorm(n2,mean=80,sd=5))
d2$m <- d2$c+d2$u
d2$d <- d2$c+d2$v
d2 <- subset(d2,d>=170.1 & d<170.2)
d2$delta <- 160 + 0.5*(d2$m-160)
print(dim(d2)[[1]])
## [1] 20042
print(mean(d2$d))
## [1] 170.1498
print(mean(d2$delta))
## [1] 162.5718
print(mean(160 + 0.25*(d2$d-160)))
## [1] 162.5374
```

And we pretty much see the exact reversion to the mean expected from our derivation.

Back to the Gauss-Markov theorem: in what sense can ordinary linear regression be considered unbiased? If you read carefully, the content of the Gauss-Markov theorem is that the estimates of unknown *parameters* (or betas) are unbiased. So in particular the estimate `lm(d~m,data=d)` should recover coefficients that are unbiased estimates of the coefficients in the expression `E[d|m] = 160 + 0.5 (m-160)` (that is, `80 + 0.5*m` when expanded). And that appears plausible, as we empirically estimated `d ~ 81.6638 + 0.4899*m` which is very close to the true values. The Gauss-Markov theorem says ordinary linear regression, given appropriate assumptions, gives us unbiased *estimates of models*. It does not say that evaluations of such a model are themselves unbiased (in the frequentist sense) *predictions of instances*. In fact, as we have seen, even evaluations of the exact true model do not always give unbiased (in the frequentist sense) predictions for individual instances. This is one reason that frequentist analysis has to take some care in treating unobservable parameters and unobserved future instances very differently (supporting the distinction between prediction and estimation, less of a concern in Bayesian analysis). This is also a good reminder that traditional statistics is much more interested in parameter estimation than in prediction of individual instances.

BDA3 goes on to exhibit (for the purpose of criticism) the mechanical derivation of a frequentist-sense unbiased linear estimator for `d`: `γ = 160 + 2 (m-160)`. It is true that `γ` satisfies the unbiased condition `E[γ|d] = d` for all `d`. But `γ` is clearly an unusable and ridiculous estimator that claims for every centimeter of height increase in the mother (`m`) we should expect two centimeters of height increase in the daughter. This is not an effect seen in the data (so not something a good estimator should claim) and is a much higher variance estimator than the common reasonable estimator `δ`. A point BDA3 is making is: applying "bias corrections" willy-nilly or restricting to only unbiased predictors is an ill-advised attempt at a mechanical fix to modeling bias. When the underlying issue is omitted variable bias (as it is in this example) the correct fix is to try and get better estimates of the hidden variables (in this case `c`) by introducing more explanatory variables (in this case perhaps obtaining some genetic and diet measurements for each mother/daughter pair).
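To put numbers on the variance comparison, here is a small Python simulation, again assuming the generative model of the R examples (`c`, `u`, `v` independent normal with mean 80 and standard deviation 5). The seed, sample size, and variable names are illustrative. It compares `δ` and `γ` on mean squared error and checks how each tracks `d`.

```python
import numpy as np

rng = np.random.default_rng(37956)     # arbitrary seed
n = 200_000                            # illustrative sample size

c = rng.normal(80.0, 5.0, n)
u = rng.normal(80.0, 5.0, n)
v = rng.normal(80.0, 5.0, n)
m = c + u
d = c + v

delta = 160.0 + 0.5 * (m - 160.0)      # the usual regression prediction
gamma = 160.0 + 2.0 * (m - 160.0)      # the frequentist-unbiased estimator

mse_delta = np.mean((delta - d) ** 2)
mse_gamma = np.mean((gamma - d) ** 2)

# Slope of each estimator regressed on the true d: a slope of 1 is what
# the condition E[estimate|d] = d demands.
slope_delta_on_d = np.polyfit(d, delta, 1)[0]
slope_gamma_on_d = np.polyfit(d, gamma, 1)[0]

print("mean squared error of delta:", round(mse_delta, 2))
print("mean squared error of gamma:", round(mse_gamma, 2))
print("slope of delta ~ d:", round(slope_delta_on_d, 3))
print("slope of gamma ~ d:", round(slope_gamma_on_d, 3))
```

Under this model the error `δ - d` has variance 37.5 and `γ - d` has variance 150, so the price of satisfying `E[γ|d] = d` is a fourfold increase in squared error.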

So: a deliberately far too broad application of a too stringent frequentist bias condition eliminated reasonable predictors leaving us with a bad one. The fact is: unbiasedness is a stronger condition than is commonly thought, and can limit your set of possible solutions very dramatically (as was also shown in our earlier example). Bias *is* bad (it can prevent you from improving results through aggregation), but you can't rely on mere mechanical procedures to eliminate it. *Correctly* controlling for bias may involve making additional measurements and introducing new variables. You also really need to be allowed to examine the domain specific utility of your procedures, and not be told a large number of them are a-priori inadmissible.

We took the term inadmissible from discussions about the James-Stein estimator. One of those results shows: "The ordinary decision rule for estimating the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk" (from Wikipedia: Stein's example). Though really what this shows is that insisting only on "admissible estimators" (a loaded term if there ever was one) collapses under its own weight (linear regression in many cases actually being a good method for prediction). So such criticisms of criticisms are already well known, but evidently not always sufficiently ready to hand.

Afterthought: the plots we made looked a lot like the cover of Freedman, Pisani, Purves "Statistics" 4th edition (which itself is likely a reminder that the line from regressing `y~x` is not the same as the principal axes). So in this example we see the regression line `y~x` is not necessarily the transpose or reciprocal of the regression line `x~y`, and neither of these is necessarily one of the principal axes of the scatterplot.

To follow up on this we produced some plots showing regression lines, smoothed curves, y=x, and principal axes all at once. The graphs are a bit too busy/confusing for the main part of the article itself, but nice to know how to produce (for use in debugging, and during data exploration). We have also changed the smoothing curve to green, to give it a chance to stand out from the other annotations.

```
# build some ggplot2() line segments representing principal axes
pcompAxes <- function(x,y) {
  axes <- list()
  means <- c(mean(x),mean(y))
  dp <- data.frame(x=x-means[1],y=y-means[2])
  p <- prcomp(~x+y,data=dp)
  for(j in 1:2) {
    s <- p$rotation[,j]
    step <- 3*p$sdev[j]
    a = means - step*s
    b = means + step*s
    axes[[length(axes)+1]] <-
      geom_segment(x=a[1],xend=b[1],y=a[2],yend=b[2],
                   color='blue',linetype=3)
  }
  axes
}
ggplot(data=d,aes(x=delta,y=d)) +
  geom_density2d() + geom_point(alpha=0.1) +
  geom_smooth(color='green') +
  geom_segment(x=140,xend=180,y=140,yend=180,linetype=2) +
  coord_equal(ratio=1) + xlim(140,180) + ylim(140,180) +
  pcompAxes(d$delta,d$d)
```

```
ggplot(data=d,aes(x=d,y=delta)) +
  geom_density2d() + geom_point(alpha=0.1) +
  geom_smooth(color='green') +
  geom_segment(x=140,xend=180,y=140,yend=180,linetype=2) +
  coord_equal(ratio=1) + xlim(140,180) + ylim(140,180) +
  pcompAxes(d$d,d$delta)
```

As a first example, consider the problem of trying to estimate the speed of light from a series of experiments.

In this situation the frequentist method quietly does some heavy philosophical lifting before you even start work. Under the frequentist interpretation since the speed of light is thought to have a single value it does not make sense to model it as having a prior distribution of possible values over any non-trivial range. To get the ability to infer, frequentist philosophy considers the act of measurement repeatable and introduces very subtle concepts such as *confidence intervals*. The frequentist statement that a series of experiments places the speed of light in vacuum at 300,000,000 meters a second plus or minus 1,000,000 meters a second with 95% confidence does not mean there is a 95% chance that the actual speed of light is in the interval 299,000,000 to 301,000,000 (the common incorrect recollection of what a confidence interval is). It means if the procedure that generated the interval were repeated on new data, then 95% of the time the speed of light would be in the interval produced: which may not be the interval we are looking at right now. Frequentist procedures are typically easy on the practitioner (all of the heavy philosophic work has already been done) and result in simple procedures and calculations (through years of optimization of practice).

Bayesian procedures on the other hand are philosophically much simpler, but require much more from the user (production and acceptance of priors). The Bayesian philosophy is: *given* a generative model, a complete prior distribution (detailed probabilities of the unknown value posited before looking at the current experimental data) of the quantity to be estimated, and observations: then inference is just a matter of calculating the complete posterior distribution of the quantity to be estimated (by correct application of Bayes’ Law). Supply a bad model or bad prior beliefs on possible values of the speed of light and you get bad results (and it is your fault, not the methodology’s fault). The Bayesian method seems to ask more, but you have to remember it is trying to supply more (complete posterior distribution, versus subjunctive confidence intervals).

In this article we are going to work a simple (but important) problem where (for once) the Bayesian calculations are in fact easier than the frequentist ones.

Consider estimating from observation the odds that a coin-flip comes out heads (as shown below).

Heads outcome.

The coin can also show a tails (as shown below).

Tails outcome.

This might be a fair coin, that when tossed properly can be argued to have heads/tails probabilities very close to 50/50. Or the heads/tails outcome could in fact be implemented by some other process with some other probability `p` of coming up heads. Suppose we flip the coin 100 times and record heads 54 times.

In this case the frequentist procedure is to generate a point-estimate of the unknown `p` as `pest = 54/100 = 0.54`. That is, we estimate `p` to be the relative frequency we actually empirically observed. Stop and consider: how do we know this is the right frequentist estimate? Beyond being told to use it, what principles lead us to this estimate? It may seem obvious in this case, but in probability mere obviousness often leads to contradictions and paradox. What criteria can be used to derive this estimate in a principled manner?

Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd Edition p. 92 states that frequentist estimates are designed to be *consistent* (as the sample size increases they converge to the unknown value), *efficient* (they tend to minimize loss or expected square-error), or even have *asymptotic unbiasedness* (the difference in the estimate from the true value converges to zero as the experiment size increases, even when re-scaled by the shrinking standard error of the estimate). Because some of the estimators we will work with are biased we are going to use expected square-error as our measure of error. This is the expected value of the square of the distance of our estimate from the *unknown true value*, and not the variance (which is the expected value of the square of the distance of the estimator from its own mean).

Frequentists also commonly insist on fully *unbiased* procedures (which is what we will discuss here). In this case an unbiased procedure is a function `f(nHeads,nFlips)` that given the sufficient statistics of the experiment (the number of heads and the total number of flips) returns an estimate for the unknown probability. The frequentist philosophy assumes the unknown probability `p` is fixed and the observed number of heads might vary as we repeat the coin-flip experiment again and again. To confirm a frequentist procedure to estimate `p` from 100 flips is unbiased, we must check that the entire family of possible estimates `f(0,100), f(1,100), ... f(100,100)` together represent a panel of estimates that are simultaneously unbiased no matter what the unknown true value of `p` is.

That is: the following bias check equation must hold for any `p` in the range `[0,1]`:
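Written out explicitly (a reconstruction from the matrix form `a f - b = 0` used later in this article, with `n` standing for `nFlips`), each check requires the expected value of the estimate to equal `p`:

```latex
\sum_{h=0}^{n} P(h \mid n, p)\, f(h,n) \;=\; p
\qquad \text{for all } p \in [0,1],
\qquad \text{where } P(h \mid n, p) = \binom{n}{h}\, p^{h} (1-p)^{n-h}.
```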

Equation family 1: Bias checks (one `f(h,n)` variable for every possible outcome `h`, one equation for every possible `p`).

Some combinatorics or probability theory tells us `P(h|n,p) = (n choose h) p^h (1-p)^(n-h)`. We can choose to treat the sequence `f(0,nFlips),f(1,nFlips), ... f(nFlips,nFlips)` either as a set of pre-made estimates (to be checked) or as a set of variables (to be solved for). It turns out there is a solution that satisfies all of the equations simultaneously: `f(h,n) = h/n`. This fact is just a matter of checking that the expected value of the number of heads is `p` times the number of flips. And this is the only unbiased solution. The set of check equations we can generate for various `p` has rank `nFlips+1` when we include check equations from at least `nFlips+1` different values of `p` (this follows as the check equations behave a lot like the moment curve). We will work a concrete example of the family 1 bias checks a bit later (which should make seeing the content of the checks a bit easier).
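Both claims can be checked numerically. The Python sketch below builds the check equations for several values of `p`, confirms that `f(h,n) = h/n` satisfies them, and confirms the rank is `nFlips+1` (so no other solution exists). The helper name `check_matrix`, the choice `n = 4`, and the grid of `p` values are illustrative assumptions.

```python
from math import comb
import numpy as np

def check_matrix(n, ps):
    """Row i holds the binomial probabilities P(h|n,p_i) for h = 0..n."""
    return np.array([[comb(n, h) * p**h * (1 - p)**(n - h)
                      for h in range(n + 1)] for p in ps])

n = 4
ps = np.linspace(0.1, 0.9, 9)       # 9 distinct values of p (more than n+1)
a = check_matrix(n, ps)
f = np.arange(n + 1) / n            # candidate estimates f(h,n) = h/n

max_bias = np.max(np.abs(a @ f - ps))
rank = np.linalg.matrix_rank(a)

print("largest |bias| over the checks:", max_bias)
print("rank of check matrix:", rank, "number of unknowns:", n + 1)
```

Because the rank equals the number of unknowns, the check equations have at most one solution, and `h/n` is it.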

The pre-packaged frequentist estimation procedure is easy: write down the empirically observed frequency as your estimate. But the derivation should now seem a bit scary (submit a panel of `nFlips+1` simultaneous estimates and confirm they simultaneously obey an uncountable family of bias check equations). And this is one of the merits of the frequentist methods: the hard derivational steps don’t have to be reapplied each time you encounter new data, so the end user may not need to know about them.

Let’s look at the same data using Bayesian methods. First we are required to supply prior beliefs on the possible values for `p`. Most typically we would operationally assume unknown `p` is beta distributed with shape parameters `(1/2,1/2)` (the Jeffreys prior) or shape parameters `(1,1)` (implementing classic Laplace smoothing). I’ll choose to use the Jeffreys prior, and in that case the posterior distribution (what we want to calculate) turns out to be a beta distribution with shape parameters `(54.5,46.5)`. Our complete posterior estimate of probable values of `p` is given by the R plot below:

```
library(ggplot2)
d <- data.frame(p=seq(0,1,0.01))
d$density <- dbeta(d$p,shape1=54.5,shape2=46.5)
ggplot(data=d) + geom_line(aes(x=p,y=density))
sum(d$p*d$density)/sum(d$density)
## [1] 0.539604
```

The posterior distribution of `p`.

And the common Bayesian method of obtaining an estimate of a summary statistic is to just compute the appropriate summary statistic from the estimated posterior distribution. So if we only want a point-estimate for `p` we can use the expected value `54.5/(54.5+46.5) = 0.539604` or the mode (maximum likelihood value) `(54.5-1)/(54.5+46.5-2) = 0.540404` of the posterior beta distribution. But having a complete graph of an estimate of the complete posterior distribution also allows a lot more. For example: from such a graph we can work out a Bayesian credible interval (which has a given chance of containing the unknown true value `p` assuming our generative modeling assumptions and priors were both correct). And this is one of the reasons Bayesians emphasize working with distributions (instead of point-estimates): even though they can require more work to derive and use, they retain more information.
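As a sketch of how such summaries can be read off the posterior, here is a grid approximation in Python using only NumPy. The grid resolution and the 95% level are illustrative choices, and the interval computed is the equal-tailed one, which is only one of several conventions for a credible interval.

```python
import numpy as np

a, b = 54.5, 46.5                          # posterior shape parameters from the text

# Evaluate the (unnormalized) beta density on an open grid to avoid log(0).
p = np.linspace(0.0, 1.0, 100001)[1:-1]
log_density = (a - 1) * np.log(p) + (b - 1) * np.log1p(-p)
w = np.exp(log_density - log_density.max())
w /= w.sum()                               # normalized grid weights

post_mean = float(np.sum(p * w))
post_mode = float(p[np.argmax(w)])

# Equal-tailed 95% credible interval from the cumulative weights.
cdf = np.cumsum(w)
lo = float(p[np.searchsorted(cdf, 0.025)])
hi = float(p[np.searchsorted(cdf, 0.975)])

print("posterior mean:", round(post_mean, 6))
print("posterior mode:", round(post_mode, 6))
print("95% equal-tailed credible interval:", (round(lo, 3), round(hi, 3)))
```

The mean and mode agree with the closed-form values quoted above; the interval is the extra information a point-estimate alone would not supply.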

Notice the complications of having to completely specify a prior distribution have not been hidden from us. The actual application of Bayes’ law (an explicit convolution or integral relating the prior distribution to the posterior through a data likelihood function) has (thankfully) been hidden by appealing to the theory of conjugate distributions. So the Bayes theory is hiding some pain from us, but significant pain is still leaking through.

And this is common: what is commonly called a frequentist analysis is often so quick you almost can’t describe the motivation, and the Bayesian analysis seems like more work. What we want to say is this is not always the case. If there is any significant hidden state, or constraints on the possible values, then the Bayesian calculation becomes in fact easier than a fully derived frequentist calculation. And that is what we will show in our next example. But first let’s cut down confusion by fixing detailed names for a few common inference methods:

- Empirical frequency estimate. This is just the procedure of using the empirically observed frequencies as your estimate. This is commonly thought of as “the frequentist estimate.” However, we are going to reserve the term “proper frequentist estimate” for an estimate that most addresses the common frequentist criticisms: bias and loss/square-error. We will also call the empirical frequency estimate the “prescriptive frequentist estimate” as it is a simple “do what you are told” style procedure.
- Proper frequentist estimate. As we said, we are going to use this term for the estimate that most addresses the common frequentist criticisms: bias and loss/square-error. We use the traditional frequentist framework: the unknown parameters to be estimated are assumed to be fixed, and probabilities are over variations in possible observations if our measurement procedures were to be repeated. We define this estimate as an unbiased estimate that minimizes expected loss/square-error for arbitrary *possible* values of the unknown parameters to be estimated. Often the bias check conditions are so restrictive that they completely determine the proper frequentist estimate *and* cause the proper frequentist estimate to agree with the empirical frequency estimate.
- Full generative Bayesian estimate. This is a complete estimate of the entire posterior distribution of values for the unknown parameters to be estimated. This is under the traditional Bayesian framework that the observations are fixed and the unknown parameters to be estimated take on values from a non-trivial prior distribution (that is a distribution that takes on more than one possible value). Under the (very strong) assumptions that we have the correct generative model and the correct prior distribution the estimated posterior is identical to how the unknown parameters are distributed conditioned on the known observations. Thus reasonable summaries built from the full generative Bayesian estimate should be good (without explicitly satisfying conditions such as unbiasedness or minimal loss/square-error). We are avoiding the usual distinction of objective versus subjective interpretation (Bayesian usually being considered subjective if we consider the required priors subjective beliefs).
- Bayes point-estimate. This is a less common procedure. A full generative Bayesian estimate is wrapped in a procedure that hides details of the generative model, priors and Bayes inference step. What is returned is a single summary of the detailed posterior distribution such as a mean (useful for producing low square-error estimates) or mode (useful for producing maximum likelihood estimates). For our examples the Bayes point-estimate will be a procedure that returns an estimated mean (or probability/rate) using the correct generative model and uniform priors (when there is a preferred problem parameterization, otherwise we suggest looking into invariant ideas like the Jeffreys prior).

Our points are going to be: the empirical frequency estimate is very easy, but is not always the proper frequentist estimate. The proper frequentist estimate can be itself cumbersome to derive, and therefore hard to think of as “always being easier than the Bayesian estimate.” And finally one should consider something like the Bayes point-estimate when one does not want to make a complete Bayesian analysis the central emphasis of a given project. We will illustrate these points with a simple (and natural) example.

Returning to our coin-flip problem. Suppose we introduce a five sided control die that is set once (and held fixed) before we start our experiments. Then suppose each experiment is a roll of a fair six-sided die and we observe “heads” if the number of pips on the six-sided die is greater than the number (1 through 5) shown on the control die (and otherwise “tails”). The process is strongly stationary in that the probability `p` is a single fixed value for the entire series of experiments. Our imagined apparatus is depicted below.

Our apparatus (the 5-sided die is simulated with a 10-sided die labeled 1 through 5 twice).

We assume that while we understand the mechanics of the generative process, we don’t see the details of the actual die rolls. We observe only the reported heads/tails outcomes (as shown below).

What is observed.

This may seem like a silly estimation game, but it succinctly models a number of important inference situations such as: estimating advertisement conversion rates, estimating health treatment success rates, and so on. We pick a simple formulation so that when we run into difficulties or complications it will be clear that they are essential difficulties (and not avoidable domain issues). Or: if your estimation procedures are not correct on this example, how can you expect them to be correct in more complicated real-world situations? Another good example of this kind of analysis is: Sean R. Eddy “What is Bayesian statistics” Nature Biotechnology, Vol. 22, No. 9, Sept. 2004, pp. 1177-1178. Eddy presented a clever inference problem comparing where pool balls hit a rail relative to a uniform random chalk mark on the rail. Eddy’s problem illustrates the issues of inference when there are important unobserved (or omitted) state variables. Our example is designed to allow further investigation of both Bayesian and Frequentist inference in the presence of constraints (not quite the same as complete priors).

We will consider two important ways the control die could be set: by a single roll before we start observations (essentially embodying the Bayesian generative assumptions), or by a manual selection by an assumed hostile agent (justifying the usual distribution-free frequentist minimax treatment of loss/square-error).

An adversary holding the control die at a chosen value.

Let’s start with the case where the control die is set before we start measurements by a fair (uniform) roll of the five sided die. Because the control die only has 5 possible states the unknown probability `p` has exactly 5 possible values. In this case we can write down all of the bias check equations for every possible outcome of a one coin-flip simulation. For only one flip observed there are only two possible outcomes: either we see one heads or one tails. So we have two possible outcomes (giving us two variables, as we get one estimate variable per sufficient outcome) and 5 check equations (one for each possible value of `p`). The complete bias check equations are represented by the matrix `a` and vector `b` shown below:

```
> print(freqSystem(6,1))
$a
                              prob. 0 heads prob. 1 heads
check for p=0.166666666666667     0.8333333     0.1666667
check for p=0.333333333333333     0.6666667     0.3333333
check for p=0.5                   0.5000000     0.5000000
check for p=0.666666666666667     0.3333333     0.6666667
check for p=0.833333333333333     0.1666667     0.8333333
$b
                                      p
check for p=0.166666666666667 0.1666667
check for p=0.333333333333333 0.3333333
check for p=0.5               0.5000000
check for p=0.666666666666667 0.6666667
check for p=0.833333333333333 0.8333333
```

The above is just the family 1 bias check equations for our particular problem. A vector of estimates `f` is unbiased if and only if `a f - b = 0` (i.e. it obeys the equation family 1 checks). When `a` is full rank (in this case when the number of variables is no more than the number of checks) the bias check equations completely determine the unique unbiased solution (more on this later). So even in this “discrete `p`” situation: for any number of flips less than 5, the bias conditions alone completely determine the unique unbiased estimate.
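The `freqSystem` source is not reproduced in this article, so the following Python function is a hypothetical stand-in (`bias_check_system` is an illustrative name, not the author's code). It builds the same kind of `a` and `b` for the die game and then solves the checks for the unbiased estimates by least squares.

```python
from math import comb
import numpy as np

def bias_check_system(n_sides, k_flips):
    """One row per possible p = j/n_sides (j = 1..n_sides-1),
    one column per possible number of heads h = 0..k_flips."""
    ps = np.array([j / n_sides for j in range(1, n_sides)])
    a = np.array([[comb(k_flips, h) * p**h * (1 - p)**(k_flips - h)
                   for h in range(k_flips + 1)] for p in ps])
    return a, ps    # b is simply the vector of possible p values

a, b = bias_check_system(6, 1)

# With 2 unknowns and 5 consistent checks, least squares recovers the
# unique vector f with a f = b.
f, _, rank, _ = np.linalg.lstsq(a, b, rcond=None)

print("shape of a:", a.shape, "rank:", rank)
print("unbiased estimates f:", np.round(f, 12))
print("largest |a f - b|:", np.max(np.abs(a @ f - b)))
```

For one flip the solution is the pair of estimates 0 (for tails) and 1 (for heads), which is exactly the empirical frequency estimate.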

What we are trying to show is that when we move away from the procedure “copy the observed frequency as your estimate” to the more foundational “pick an unbiased family of estimates with minimal expected square-error”, then frequentist reasoning appears a bit more complicated. Let’s continue with a frequentist analysis of this problem (this time in python instead of R, see here for the complete code).

The common “everything wrapped in a bow” prescriptive empirical frequency procedure is by far the easiest estimate:

```
import numpy

# Build the traditional frequentist empirical estimates of
# the expected value of the unknown quantity pWin
# for each possible observed outcome of number of wins
# seen in kFlips trials
# (nSides is not used by this estimate)
def empiricalMeansEstimates(nSides,kFlips):
    return numpy.array([ j/float(kFlips) for j in range(kFlips+1) ])
```

And if we load this code (and all of its pre-conditions) we get the following estimates of `p` if we observe one coin experiment:

```
>>> printEsts(empiricalMeansEstimates(6,1))
pest for 0 heads 0.0
pest for 1 heads 1.0
```

Using our bias check equations we can confirm this solution is indeed unbiased:

```
>>> sNK = freqSystem(6,1)
>>> printBiasChecks(matMulFlatten(sNK['a'], \
        empiricalMeansEstimates(6,1)) - flatten(sNK['b']))
bias for p=0.166666666667 0.0
bias for p=0.333333333333 0.0
bias for p=0.5 0.0
bias for p=0.666666666667 0.0
bias for p=0.833333333333 0.0
```

And has moderate loss/square-errors:

```
>>> printLosses(losses(6,empiricalMeansEstimates(6,1)))
exp. sq error for p= 0.166666666667 0.138888888889
exp. sq error for p= 0.333333333333 0.222222222222
exp. sq error for p= 0.5 0.25
exp. sq error for p= 0.666666666667 0.222222222222
exp. sq error for p= 0.833333333333 0.138888888889
```

But the solution is kind of icky. Remember, this result was completely determined by the unbiased check conditions. It says if we observe one coin experiment and see tails then the estimate for `p` is zero, if we see heads the estimate for `p` is one. Both of these estimates are well outside the range of possible values for `p`! Recall our heads/tails coin events are assigned “heads” if the number of pips on the 6-sided die exceeds the mark on the control die (which are the numbers 1 through 5). Thus `p` only takes on values in the range `1/6` (when the control is `5`) through `5/6` (when the control is `1`). In fact `p` is always going to be one of the values: `1/6`, `2/6`, `3/6`, `4/6`, or `5/6`. The frequentist analysis is failing to respect these known constraints (which are weaker than assuming actual priors).

We can try fixing this with a simple procedure such as Winsorising or knocking everything back into range. For example the estimate `[1/6,5/6]` is biased but has improved loss/square-error:

```
>>> w = [1/6.0,5/6.0]
>>> printBiasChecks(matMulFlatten(sNK['a'], w) - flatten(sNK['b']))
bias for p=0.166666666667 0.111111111111
bias for p=0.333333333333 0.0555555555556
bias for p=0.5 0.0
bias for p=0.666666666667 -0.0555555555556
bias for p=0.833333333333 -0.111111111111
>>> printLosses(losses(6,w))
exp. sq error for p=0.166666666667 0.0740740740741
exp. sq error for p=0.333333333333 0.101851851852
exp. sq error for p=0.5 0.111111111111
exp. sq error for p=0.666666666667 0.101851851852
exp. sq error for p=0.833333333333 0.0740740740741
```

There are other ideas for fixing estimates (such as shrinkage to reduce expected square-error, or quantization to improve likelihood). But the point is these are not baked into the traditional simple empirical frequency estimate. Once you start adding all of these features you may have a frequentist estimator that is as complicated as a Bayesian estimator is thought to be, and a frequentist estimator that is no longer considered pure with respect to traditional frequentist criticisms.
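The bias and expected square-error figures quoted above for the empirical and Winsorised estimates can be recomputed from first principles. This Python sketch covers the one-flip game only; the helpers `outcome_probs`, `bias`, and `sq_loss` are illustrative names, not the article's `printBiasChecks` or `losses` functions.

```python
import numpy as np

ps = np.arange(1, 6) / 6.0                  # possible values of p: 1/6 .. 5/6

def outcome_probs(p):
    """Probabilities of seeing 0 heads and 1 heads in a single flip."""
    return np.array([1.0 - p, p])

def bias(est):
    """E[estimate | p] - p for each possible p."""
    return np.array([outcome_probs(p) @ est - p for p in ps])

def sq_loss(est):
    """E[(estimate - p)^2 | p] for each possible p."""
    return np.array([outcome_probs(p) @ (est - p) ** 2 for p in ps])

empirical = np.array([0.0, 1.0])            # estimates for 0 heads, 1 heads
winsorised = np.array([1 / 6.0, 5 / 6.0])   # same, knocked back into range

for name, est in [("empirical", empirical), ("winsorised", winsorised)]:
    print(name, "bias:", np.round(bias(est), 4))
    print(name, "loss:", np.round(sq_loss(est), 4))
```

Passing any other two-element estimate vector to `bias` and `sq_loss` gives the same kind of report, which makes it easy to try shrinkage or quantization variants.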

Let’s switch to the Bayes analysis for the game where the 5-sided control die is set uniformly at random. A good Bayes point-estimate is easy to derive, as the appropriate priors for `p` are obvious (uniform on `1/6`, `2/6`, `3/6`, `4/6`, `5/6`). Our Bayes point-estimates for the expected value of `p` turn out to be:

```
>>> printEsts(bayesMeansEstimates(6,1))
pest for 0 heads 0.388888888889
pest for 1 heads 0.611111111111
```

Which means: for 1 tails we estimate `p=0.38888889` and for 1 heads we estimate `p=0.61111111`. Notice these estimates are strictly inside the range `[1/6,5/6]` (pulled in by `2/9` in both cases). Also notice because we have wrapped the Bayes estimate in code it appears no more complicated to the user than the empirical estimate (sure the code is larger than the empirical estimate, but that is exactly what an end user does not need to see). We have *intentionally* hidden from the user some important design choices (priors, the Bayes step convolution, use of a mean estimate instead of a mode). The estimator (see here or here) has wrapped up proposing a prior distribution, deriving the posterior distribution from the data likelihood equations (applying Bayes’ law), and then returning the expected value of the posterior as a single point-estimate. In addition to hiding the implementation details, we have refrained from (or at least delayed) educating the user out of their desire for a simple point-estimate. We have not insisted the user/consumer of the result learn to use the (superior) complete posterior distribution in place of mere point-estimates. For a Bayes estimate to be replacement compatible with a frequentist one we need (at least initially) to put it into the same format as the frequentist estimate it is competing with. This squanders a number of the advantages of the Bayes posterior, but as we will see the Bayes estimate still has lower expected square-error (is more efficient) than the frequentist one. So initially offering a Bayes estimate as a ready to go replacement for the frequentist estimate is of some value, and we don’t want to lose that value by initially requiring additional user training.
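What the wrapped estimator hides can be written in a few lines. The following Python sketch is an assumed reconstruction, not the article's `bayesMeansEstimates` source (the name `bayes_mean_estimates` is illustrative). It puts a uniform prior on the five possible values of `p`, weights each by the binomial likelihood of the observed count, and returns the posterior mean. Exact fractions keep the arithmetic transparent.

```python
from fractions import Fraction
from math import comb

def bayes_mean_estimates(n_sides, k_flips):
    """Posterior mean of p for each possible heads count h = 0..k_flips,
    under a uniform prior on p in {1/n_sides, ..., (n_sides-1)/n_sides}."""
    ps = [Fraction(j, n_sides) for j in range(1, n_sides)]
    ests = []
    for h in range(k_flips + 1):
        # Likelihood of h heads in k_flips for each candidate p. The uniform
        # prior is a constant factor and cancels in the ratio below.
        like = [comb(k_flips, h) * p**h * (1 - p)**(k_flips - h) for p in ps]
        ests.append(sum(p * l for p, l in zip(ps, like)) / sum(like))
    return ests

ests = bayes_mean_estimates(6, 1)
for h, e in enumerate(ests):
    print("pest for", h, "heads:", e, "=", float(e))
```

For one flip this gives 7/18 and 11/18, and the distance from 1/6 to 7/18 is the 2/9 pull-in noted above.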

Unfortunately this Bayes point-estimate solution is biased, as we confirm here:

```
>>> printBiasChecks(matMulFlatten(sNK['a'], \
        bayesMeansEstimates(6,1)) - flatten(sNK['b']))
bias for p=0.166666666667 0.259259259259
bias for p=0.333333333333 0.12962962963
bias for p=0.5 0.0
bias for p=0.666666666667 -0.12962962963
bias for p=0.833333333333 -0.259259259259
```

But, as we mentioned, our Bayes point-estimate has some advantages. Let’s also look at the expected loss each estimate would give for every possible value of the unknown probability `p`:

```
>>> printLosses(losses(6,bayesMeansEstimates(6,1)))
exp. sq error for p= 0.166666666667 0.0740740740741
exp. sq error for p= 0.333333333333 0.0277777777778
exp. sq error for p= 0.5 0.0123456790123
exp. sq error for p= 0.666666666667 0.0277777777778
exp. sq error for p= 0.833333333333 0.0740740740741
```

Notice that the Bayes estimate has smaller expected square-error (or in statistical parlance is a more efficient estimator) no matter what value `p` takes. The unbiasedness check conditions forced the frequentist estimate to be a high expected square-error estimator. This means demanding the estimator be strictly unbiased may not be a good trade-off (and the frequentist habit of deriding other estimators for “not being unbiased” may not always be justified). To be fair, bias can be a critical flaw if you intend to aggregate an estimate with other estimates later (as enough independent unbiased estimates can be averaged to reduce noise, which is not always true for biased estimates).
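The two printouts above can be reproduced with a short self-contained sketch. This is not the article's own code (that is linked at the end of the post); the function names are hypothetical, a uniform prior over the five allowed values of `p` is assumed, and `bayesMeansEstimates(6,1)` is read as "one flip, with `p` a multiple of `1/6`". That reading is an assumption, though it does reproduce the printed bias and loss values.

```python
from fractions import Fraction
from math import comb

# Allowed values of the unknown heads-probability p: 1/6, ..., 5/6.
P_VALUES = [Fraction(k, 6) for k in range(1, 6)]


def binom_pmf(n, h, p):
    """Probability of seeing h heads in n flips when the heads-probability is p."""
    return comb(n, h) * p**h * (1 - p)**(n - h)


def bayes_mean_estimates(n, p_values=P_VALUES):
    """Posterior mean of p for each head count 0..n (uniform prior assumed)."""
    ests = []
    for h in range(n + 1):
        # Likelihood of the observation under each candidate p (flat prior).
        weights = [binom_pmf(n, h, p) for p in p_values]
        ests.append(sum(p * w for p, w in zip(p_values, weights)) / sum(weights))
    return ests


def empirical_estimates(n):
    """The usual frequentist answer: the observed relative frequency h/n."""
    return [Fraction(h, n) for h in range(n + 1)]


def bias(ests, n, p):
    """Expected value of the estimate minus the true p."""
    return sum(binom_pmf(n, h, p) * ests[h] for h in range(n + 1)) - p


def expected_sq_error(ests, n, p):
    """Expected squared difference between the estimate and the true p."""
    return sum(binom_pmf(n, h, p) * (ests[h] - p) ** 2 for h in range(n + 1))


if __name__ == "__main__":
    n = 1
    bayes = bayes_mean_estimates(n)
    emp = empirical_estimates(n)
    for p in P_VALUES:
        print(f"p={float(p):.4f}"
              f"  Bayes bias={float(bias(bayes, n, p)):+.4f}"
              f"  Bayes loss={float(expected_sq_error(bayes, n, p)):.4f}"
              f"  empirical loss={float(expected_sq_error(emp, n, p)):.4f}")
```

Using exact fractions keeps the bias check free of floating-point noise: the empirical estimate has bias exactly zero for every `p`, while the Bayes posterior mean trades a nonzero bias for a smaller expected square-error.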

Let’s give the frequentist estimate another chance. For our discrete set of possible values of `p` (`1/6`, `2/6`, `3/6`, `4/6`, `5/6`), once the number of coin-flips is large enough the equation family 1 bias checks no longer completely determine the estimate. So it is no longer immediately obvious that the observed empirical frequency is minimal loss. In fact it is not, so we can no longer consider the canned empirical solution to be the unique optimal estimate. Note this differs from the case where `p` takes on many different values from a continuous interval, which is enough to ensure the bias check conditions completely determine a unique solution. Continuing with an example: if we observed 7 flips, an improved frequentist estimate (under the idea it is an unbiased point-estimate with minimal expected square-error) is as follows:

```
>>> printEsts(newSoln)
pest for 0 heads 0.0319031034157
pest for 1 heads 0.111845090806
pest for 2 heads 0.296666330987
pest for 3 heads 0.439170280769
pest for 4 heads 0.560830250198
pest for 5 heads 0.703332297349
pest for 6 heads 0.888156558984
pest for 7 heads 0.968095569167
```

To say we decrease loss we have to decide on a scalar definition of loss: be it maximum loss, total loss or some other criterion. This solution was chosen to decrease maximum loss (an idea compatible with frequentist philosophy) and was found through constrained optimization. Notice this solution is not the direct empirical relative frequency estimate. For example: in this estimate if you see seven tails in a row you estimate `p=0.0319031`, not `p=0` (though we still have `0.0319031 < 1/6`, which is an out-of-bounds estimate). This estimate is a pain to work out (the technique I used involved optimizing a move in directions orthogonal to the under-rank bias check conditions; perhaps some clever math would allow us to consider this solution obvious, but that is not the point). It is not important whether this new solution is actually optimal; what is important is that it is unbiased and has a smaller maximum loss (meaning the empirical estimate itself cannot be considered optimal in that sense). The fact that the unknown probability `p` can only be one of the values `1/6`, `2/6`, `3/6`, `4/6`, `5/6` has changed which unbiased estimate is in fact the minimal loss one (it added a new lower loss solution that would not be considered unbiased if `p` could choose from more possible values).
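The two claims about this solution (that it is unbiased for the five allowed values of `p`, and that its maximum loss is below that of the empirical frequency) can be checked directly from the printed numbers. The sketch below is a minimal check under that assumption; it copies the estimates from the printout above, the helper names are hypothetical, and it does not reproduce the constrained optimization that found the solution.

```python
from math import comb

# Estimates copied from the printout above (index = number of heads in 7 flips).
NEW_SOLN = [0.0319031034157, 0.111845090806, 0.296666330987, 0.439170280769,
            0.560830250198, 0.703332297349, 0.888156558984, 0.968095569167]
N_FLIPS = 7
P_VALUES = [k / 6 for k in range(1, 6)]  # allowed values of p


def binom_pmf(n, h, p):
    """Probability of h heads in n flips with heads-probability p."""
    return comb(n, h) * p**h * (1 - p)**(n - h)


def bias(ests, p):
    """Expected value of the estimate minus the true p."""
    n = len(ests) - 1
    return sum(binom_pmf(n, h, p) * ests[h] for h in range(n + 1)) - p


def expected_sq_error(ests, p):
    """Expected squared difference between the estimate and the true p."""
    n = len(ests) - 1
    return sum(binom_pmf(n, h, p) * (ests[h] - p) ** 2 for h in range(n + 1))


# The canned frequentist answer: observed relative frequency h/7.
empirical = [h / N_FLIPS for h in range(N_FLIPS + 1)]

max_abs_bias_new = max(abs(bias(NEW_SOLN, p)) for p in P_VALUES)
max_loss_new = max(expected_sq_error(NEW_SOLN, p) for p in P_VALUES)
max_loss_emp = max(expected_sq_error(empirical, p) for p in P_VALUES)

print("largest |bias| of new solution:", max_abs_bias_new)
print("max loss, new solution:        ", max_loss_new)
print("max loss, empirical frequency: ", max_loss_emp)
```

Note the comparison is on maximum loss only, which is the criterion the solution was chosen for; nothing here says the new solution has lower loss at every individual value of `p`.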

Depending on your application it can be the case that either the frequentist or the Bayesian estimate has better utility. But it is unusual for the frequentist estimate to be the harder one to calculate (as is the case here).

The Bayes solution in this case is:

```
>>> printEsts(bayesSoln)
pest for 0 heads 0.203065668302
pest for 1 heads 0.251405546037
pest for 2 heads 0.33603150662
pest for 3 heads 0.443861984801
pest for 4 heads 0.556138015199
pest for 5 heads 0.66396849338
pest for 6 heads 0.748594453963
pest for 7 heads 0.796934331698
```

This is still biased, but all values are in range and the losses are smaller than the frequentist losses for all possible values of `p` (again limited to: `1/6`, `2/6`, `3/6`, `4/6`, `5/6`).
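As a rough illustration of how little work the Bayes side needs here, the following sketch computes posterior means for 7 flips and compares their expected square-error with the frequentist solution printed earlier. It assumes a uniform prior over the five allowed values of `p` (an assumption that happens to reproduce the printed Bayes estimates); the function names are hypothetical and the frequentist numbers are simply copied from the earlier printout.

```python
from math import comb

N_FLIPS = 7
P_VALUES = [k / 6 for k in range(1, 6)]  # allowed values of p

# Frequentist estimates copied from the earlier printout (index = heads seen).
FREQ_SOLN = [0.0319031034157, 0.111845090806, 0.296666330987, 0.439170280769,
             0.560830250198, 0.703332297349, 0.888156558984, 0.968095569167]


def binom_pmf(n, h, p):
    """Probability of h heads in n flips with heads-probability p."""
    return comb(n, h) * p**h * (1 - p)**(n - h)


def posterior_mean(n, h, p_values):
    """Posterior mean of p after h heads in n flips (uniform prior assumed)."""
    weights = [binom_pmf(n, h, p) for p in p_values]
    return sum(p * w for p, w in zip(p_values, weights)) / sum(weights)


def expected_sq_error(ests, p):
    """Expected squared difference between the estimate and the true p."""
    n = len(ests) - 1
    return sum(binom_pmf(n, h, p) * (ests[h] - p) ** 2 for h in range(n + 1))


bayes_soln = [posterior_mean(N_FLIPS, h, P_VALUES) for h in range(N_FLIPS + 1)]

for h, est in enumerate(bayes_soln):
    print(f"pest for {h} heads {est:.12f}")
for p in P_VALUES:
    print(f"p={p:.4f}"
          f"  Bayes loss={expected_sq_error(bayes_soln, p):.5f}"
          f"  frequentist loss={expected_sq_error(FREQ_SOLN, p):.5f}")
```

Because each posterior mean is a weighted average of the allowed values of `p`, every estimate automatically lands between `1/6` and `5/6`, which is why the Bayes solution has no out-of-bounds estimates.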

To be fair the differences in loss/square-error are small (and shrinking rapidly as the number of observed flips goes up, so it is a small data problem). The point we want to make isn’t which estimate is better (that depends on how you are going to use the estimate, your domain, and your application), but the idea that Bayesian methods are not necessarily more painful than frequentist procedures. The Bayesian estimation procedure requires more from the user (the priors) and has an expensive and complicated convolution step to use the data to relate the priors to the posteriors (unless you are lucky enough to have something like the theory of conjugate distributions to hide this step). The frequentist estimation procedure seems to be as simple as “copy over your empirical observation as your estimate.” That is, unless you have significant hidden state, constraints or discreteness (not the same as having priors). When you actually have to justify the frequentist inference steps (versus just benefiting from them) you find you have to at least imagine submitting every possible inference you could make as a set of variables and picking a minimax solution optimizing expected square-error over the unknown quantities while staying in the linear flat of unbiased solutions (itself a complicated check).

Note that each style of analysis is correct on its own terms and is not always compatible with the assumptions of the other. This doesn’t give one camp a free pass to criticize the other.

My advice is: Bayesians need to do a better job of wrapping standard simple analyses (you shouldn’t have to learn and fire up Stan for this sort of thing), and we all need to be aware that *proper* frequentist inference is not always just the common simple procedure of copying over the empirical observations.

For full implementations/experiments (and results) click here for R and here for python.

]]>`data.matrix` when you mean `model.matrix`. If you do so you may lose (without noticing) a lot of your model’s explanatory power (due to poor encoding).

For some modeling tasks you end up having to prepare a special expanded data matrix before calling a given machine learning algorithm. For example the `randomForest` package advises:

For large data sets, especially those with large number of variables, calling randomForest via the formula interface is not advised: There may be too much overhead in handling the formula.

Which means you may want to prepare a matrix of exactly the values you want to use in your prediction (versus using the more common and convenient formula interface). As R supplies good tools for this, this is not a big problem. Unless you (accidentally) use `data.matrix` as shown below.

```
d <- data.frame(x=c('a','b','c'),y=c(1,2,3))
print(data.matrix(d))
##      x y
## [1,] 1 1
## [2,] 2 2
## [3,] 3 3
```

Notice in the above `x` is converted from its original factor levels `'a'`, `'b'` and `'c'` to the numeric quantities `1`, `2` and `3`. The problem is this introduces an unwanted order relation in the `x`-values. For example any linear model is going to be forced to treat the effect of `x='b'` as being between the modeled effects of `x='a'` and `x='c'` even if this is not an actual feature of the data. Now there are cases when you want ordinal constraints and there are ways (like GAMs) to learn non-monotone relations on numerics. But really what has happened is you have not used a rich enough encoding of this factor.

Usually what you want is `model.matrix`, which we demonstrate below:

```
print(model.matrix(~0+x+y,data=d))
##   xa xb xc y
## 1  1  0  0 1
## 2  0  1  0 2
## 3  0  0  1 3
## attr(,"assign")
## [1] 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
```

The `0+` notation is telling R to not add an “intercept” column (a column whose value is always `1`, a useful addition when doing linear regression). In this case it also had the side-effect of allowing us to see all three derived indicator variables (usually all but one are shown, more on this later).

What `model.matrix` has done is use the idea of indicator variables (implemented through `contrasts`) to re-encode the single string-valued variable `x` as a set of indicators. The three possible values (or levels) of `x` (`'a'`, `'b'` and `'c'`) are encoded as three new variables: `xa`, `xb` and `xc`. These new variables are related to the original `x` as follows:

| `x` | `xa` | `xb` | `xc` |
|---|---|---|---|
| `'a'` | `1` | `0` | `0` |
| `'b'` | `0` | `1` | `0` |
| `'c'` | `0` | `0` | `1` |

It is traditional to suppress one of the derived variables (in this case `xa`), yielding the following factor-as-a-set-of-indicators representation:

| `x` | `xb` | `xc` |
|---|---|---|
| `'a'` | `0` | `0` |
| `'b'` | `1` | `0` |
| `'c'` | `0` | `1` |

And this is what we see if we don’t add the `0+` notation to our `model.matrix` formula:

```
print(model.matrix(~x+y,data=d))
##   (Intercept) xb xc y
## 1           1  0  0 1
## 2           1  1  0 2
## 3           1  0  1 3
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
```

When you use the formula interface R performs this sort of conversion for you under the hood. This is why seemingly pure numeric models (like `lm()`) can use string-valued variables. R performing these conversions spares the analyst a lot of messy column bookkeeping. Once you get used to this automatic conversion you really come to miss it (such as in `scikit-learn`’s random forest implementation).
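For readers who have to do this bookkeeping by hand outside of R, the indicator re-encoding described above can be sketched in a few lines of plain Python. The helper `indicator_encode` is hypothetical (it is not part of R or `scikit-learn`); it only mimics the column layout shown in the `model.matrix` examples, with `drop_first=True` standing in for the suppression of `xa`.

```python
def indicator_encode(values, levels=None, drop_first=False, prefix="x"):
    """Re-encode a list of categorical values as 0/1 indicator columns.

    levels: ordered levels to encode (defaults to the sorted distinct values).
    drop_first: suppress the first level's indicator, mimicking the
        "all but one" layout shown in the tables above.
    Returns (column_names, rows), one row of indicators per input value.
    """
    if levels is None:
        levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    names = [f"{prefix}{lev}" for lev in kept]
    rows = [[1 if v == lev else 0 for lev in kept] for v in values]
    return names, rows


x = ["a", "b", "c"]

# All three indicators, as in the ~0+x+y example.
names_full, rows_full = indicator_encode(x)
print(names_full, rows_full)

# First level suppressed, as in the ~x+y example.
names_drop, rows_drop = indicator_encode(x, drop_first=True)
print(names_drop, rows_drop)
```

In the full encoding each row sums to one, so the suppressed column can always be recovered from the remaining ones.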

The traditional reason to suppress the `xa` variable is that there is a redundancy when we use all the indicator variables. Since the indicators always sum to one we can always infer one missing indicator by subtracting the sum of all the others from one. In fact this redundancy is a linear dependence among the indicators, and this can actually cause trouble for some naive implementations of linear regression (though we feel this is much better handled using L2 regularization ideas).

At some point you are forced to “production harden” your code and deal directly with level encodings yourself. The built-in R encoding scheme is not optimal for factors with large numbers of distinct levels, rare levels (whose fit coefficients won’t achieve statistical significance during training), missing values, or the appearance of new levels after training. It is a simple matter of feature engineering to deal with each of these situations.

Some recipes for working with problem factors include:

- Modeling Trick: Impact Coding of Categorical Variables with Many Levels (deals with factors with large numbers of levels).
- Section 4.1.1 of Practical Data Science with R (deals with missing values by introducing additional “masking variables”).

Novel values (factor levels not seen during training, but that show up during test or application) can be treated as missing values, or you can work out additional feature engineering ideas to improve model performance.
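One way to act on the "treat novel levels as missing" option is to fix the set of levels at training time and route anything else to a single extra indicator. The sketch below is only an illustration of that idea in plain Python; the function names and the `_missing_or_novel` column are hypothetical, and this is not the method of either recipe listed above.

```python
def fit_levels(train_values):
    """Record the levels seen during training (None marks a missing value)."""
    return sorted({v for v in train_values if v is not None})


def encode_with_novel(values, levels, prefix="x"):
    """Indicator-encode values against the fixed training levels.

    Adds one extra column, <prefix>_missing_or_novel, set to 1 when a value
    is missing or was never seen in training; in that case every level
    indicator in the row is 0.
    """
    names = [f"{prefix}{lev}" for lev in levels] + [f"{prefix}_missing_or_novel"]
    rows = []
    for v in values:
        known = v in levels
        rows.append([1 if v == lev else 0 for lev in levels] + [0 if known else 1])
    return names, rows


# Levels are learned once, from training data only.
levels = fit_levels(["a", "b", "c", None])

# At application time 'd' is a novel level and None is a missing value.
names, rows = encode_with_novel(["b", "d", None], levels)
print(names)
for row in rows:
    print(row)
```

Keeping the level list from training guarantees the encoded matrix has the same columns at training and application time, which is the bookkeeping that tends to break when new levels appear.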

]]>