- Missing values (`NA` or blanks)
- Problematic numerical values (`Inf`, `NaN`, sentinel values like 999999999 or -1)
- Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.

**Missing Values; Missing Category Levels**

First, we’ll look at what to do when there are missing values or NAs in the data, and how to guard against category levels that don’t appear in the training data. Let’s make a small example data set that manifests these issues.

    set.seed(9394092)

    levels = c('a', 'b', 'c', 'd')
    levelfreq = c(0.3, 0.3, 0.3, 0.1)
    means = c(1, 6, 2, 7)
    names(means) = levels
    NArate = 1/30

    X = sample(levels, 200, replace=TRUE, prob=levelfreq)
    Y = rnorm(200) + means[X]

    train = data.frame(x=X[1:150], y=Y[1:150], stringsAsFactors=FALSE)
    test = data.frame(x=X[151:200], y=Y[151:200], stringsAsFactors=FALSE)

    # remove a level from training
    train = subset(train, x !='d')

    # sprinkle in some NAs
    ntrain = dim(train)[1] ; ntest = dim(test)[1]
    train$x = ifelse(runif(ntrain) < NArate, NA, train$x)
    test$x = ifelse(runif(ntest) < NArate, NA, test$x)

    table(train$x)
    ##  a  b  c
    ## 40 44 42

    sum(is.na(train$x))
    ## [1] 4

    sum(is.na(test$x))
    ## [1] 2

This simulates a situation where a rare level failed to be collected in the training data. In addition, we’ve simulated a missing value mechanism. In this example, it’s a “faulty sensor” mechanism (missing values show up at random, as if a sensor were intermittently and randomly failing) – though it may also in general be a systematic mechanism, where the `NA` means something specific, like the measurement doesn't apply (say "most recent pregnancy date" for a male subject).

We can build a linear regression model for predicting `y` from `x`:

    # build a model
    model1 = lm("y~x", data=train)
    train$pred = predict(model1, newdata=train) # this works
    predict(model1, newdata=test) # this fails
    ## Error in model.frame.default(Terms, newdata, na.action = na.action,
    ##   xlev = object$xlevels) : factor x has new levels d

The model fails on the holdout data because the new data has a value of `x` which was not observed in the training data. You can always refuse to predict in such cases, of course, but in some situations even a not-so-good prediction may be better than no prediction at all. Note also that `lm` quietly omitted the rows where `x` was missing while training, and the resulting model will return `NA` as the predicted outcome in such cases. This is again perfectly reasonable, but not always what you want, especially in cases where a large fraction of the data has missing values.

Are there alternative ways to handle these issues? If `NA`s show up in the data, the conservative assumption is that they are missing systematically; in this situation (when `x` is a categorical value), we can then treat them as just another category value, for example by pretreating the variable to convert `NA` to "Unknown." When novel values show up in the test data (or when `NA`s appear in the holdout data, but not in the training data), the best assumption we can make is that the novel value is in fact one of the values that we have already observed, with the probability of its being any given value proportional to the training set frequencies.
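To make the two rules concrete, here is a minimal base R sketch of the encoding idea. The helper name `make_indicator_encoder` and the `x_lev_` column prefix are illustrative assumptions; this is not the package's code, only one way the described behavior could be written.

```r
# Hypothetical helper (not part of vtreat): learn an indicator encoding
# from training values, then apply it to new values.
make_indicator_encoder <- function(x_train) {
  # Treat NA as just another category level
  x_train <- ifelse(is.na(x_train), "NA", x_train)
  freqs <- table(x_train) / length(x_train)   # training-set level frequencies
  lvls <- names(freqs)
  function(x_new) {
    x_new <- ifelse(is.na(x_new), "NA", x_new)
    out <- matrix(0, nrow = length(x_new), ncol = length(lvls),
                  dimnames = list(NULL, paste0("x_lev_", lvls)))
    for (i in seq_along(x_new)) {
      if (x_new[i] %in% lvls) {
        out[i, match(x_new[i], lvls)] <- 1    # known level: a single indicator is on
      } else {
        out[i, ] <- as.numeric(freqs)         # novel level: spread by training frequency
      }
    }
    out
  }
}

encode <- make_indicator_encoder(c("a", "a", "b", "c", NA, "b", "a", "c"))
encode(c("a", NA, "d"))   # "d" was never seen in training
```

A known level turns on exactly one column, while the unseen level "d" gets a row of fractional weights that sum to one.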

We've implemented these data treatments, and others, in an R package called `vtreat`. The package is very much at the alpha stage, and is not yet available on CRAN; we'll explain how you can get the package later on in the post. For now, let's see how it works.

The first step is to use the training data to create a set of variable treatments, one for each variable of interest.

    library(vtreat) # our library, not public; we'll show how to install later
    treatments = designTreatmentsN(train, c("x"), "y")

The function `designTreatmentsN()` takes as input the data frame of training data, the list of input columns, and the (numerical) outcome column. There is a similar function `designTreatmentsC()` for binary classification problems. The output of the function is a list of variable treatment objects (of class `treatmentplan`), one per input variable.

    treatments
    ## $treatments
    ## $treatments[[1]]
    ## [1] "vtreat 'Categoric Indicators'('x'->'x_lev_NA','x_lev_x.a','x_lev_x.b','x_lev_x.c')"
    ##
    ##
    ## $vars
    ## [1] "x_lev_NA"  "x_lev_x.a" "x_lev_x.b" "x_lev_x.c"
    ##
    ## $varScores
    ##  x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c
    ##    1.0310    0.6948    0.2439    0.8959
    ##
    ## $varMoves
    ##  x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c
    ##      TRUE      TRUE      TRUE      TRUE
    ##
    ## $outcomename
    ## [1] "y"
    ##
    ## $meanY
    ## [1] 3.246
    ##
    ## $ndat
    ## [1] 130
    ##
    ## attr(,"class")
    ## [1] "treatmentplan"

The `vars` field of a `treatmentplan` object gives the names of the new variables that were formed from the original variable: a categorical variable like `x` is converted to several indicator variables, one for each known level of `x` -- including `NA`, if it is observed in the training data. `varMoves` is TRUE if the new variable in question varies (that is, if it has more than one value in the training data). `meanY` is the base mean of the outcome variable (unconditioned on the inputs). `ndat` is the number of data points.

The field `varScores` is a rough indicator of variable importance, based on the Press statistic. The Press statistic of a model is the sum of the squared prediction errors of all the hold-one-out models: that is, the sum of `(y_i - f_i)^2`, where `y_i` is the outcome corresponding to the ith data point, and `f_i` is the prediction of the model built by using all the training data *except* the ith data point. We calculate the `varScore` of the jth input variable `x_j` to be the Press statistic of the one-dimensional linear regression model that uses only `x_j`, divided by the Press statistic of the unconditioned mean of `y`. A varScore of 0 means the model predicts perfectly. A varScore close to one means that the variable predicts only about as well as the global mean; a varScore above 1 means that the model predicts outcome worse than the global mean. So the lower the varScore, the better. You can use `varScores` to prune uninformative variables, as we will show later.
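As an illustration of the ratio just described, the following base R sketch computes a Press-based score for a single variable. It relies on the standard identity that, for ordinary least squares, the hold-one-out residual equals the ordinary residual divided by one minus the leverage. The function names `press` and `var_score` are assumptions made for this example, and the package's own calculation may differ in detail.

```r
# Press statistic of a fitted lm, using the leave-one-out identity
# e_i / (1 - h_i), where h_i is the i-th hat (leverage) value.
press <- function(model) {
  sum((residuals(model) / (1 - hatvalues(model)))^2)
}

# Hypothetical score in the spirit of varScores (not vtreat's actual code):
# Press of the one-variable model divided by Press of the mean-only model.
var_score <- function(x, y) {
  press(lm(y ~ x)) / press(lm(y ~ 1))
}

set.seed(2014)
y <- rnorm(100)
x_good <- y + rnorm(100, sd = 0.1)   # strongly related to y
x_noise <- rnorm(100)                # unrelated to y

var_score(x_good, y)    # expected to be close to 0
var_score(x_noise, y)   # expected to be close to 1, possibly slightly above
```

Because the score is measured on held-out points, an uninformative variable can score above 1: fitting its slope adds noise without adding predictive power.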

Once you have created the treatment plans using `designTreatmentsN()`, you can treat the training and test data frames using the function `prepare()`. This creates new data frames that express the outcome in terms of the new transformed variables. `prepare()` takes as input a list of treatment plans and a data set to be treated. The optional argument `pruneLevel` lets you specify a threshold for `varScores`; variables with a varScore higher than `pruneLevel` will be eliminated. By default, `prepare()` will prune away any variables with a varScore greater than 0.99; we will use `pruneLevel=NULL` to force `prepare()` to create all possible variables.

    # pruneLevel=NULL turns pruning OFF
    train.treat = prepare(treatments, train, pruneLevel=NULL)
    test.treat = prepare(treatments, test, pruneLevel=NULL)

    train.treat[1:4,]
    ##   x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c     y
    ## 1        0         0         1         0 7.037
    ## 2        0         0         0         1 1.209
    ## 3        0         0         0         1 2.819
    ## 4        0         0         0         1 2.099

    subset(train.treat, is.na(train$x)) # similarly for test
    ##    x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c       y
    ## 12        1         0         0         0 -0.4593
    ## 48        1         0         0         0  6.4741
    ## 49        1         0         0         0  5.3387
    ## 81        1         0         0         0  2.2319

The listing above shows that instead of the training data frame `(x, y)`, we now have a training data frame with four `x` indicator variables, one for each of the known `x` values "a", "b", and "c" -- plus `NA`. According to the listing, the first four values for `x` in the training data were `c("b", "c", "c", "c")`. `NA`s are encoded as the variable `x_lev_NA`.

We can see how `prepare()` handles novel values in the test data:

    #
    # when we encounter a new variable value, we assign it all levels,
    # proportional to training set frequencies
    #
    subset(test.treat, test$x=='d')
    ##   x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c     y
    ## 8  0.03077    0.3077    0.3385    0.3231 4.622

Looking back at the process by which we generated `y`, we can see in this case that the "d" level isn't actually a proportional combination of the other levels; still this is the best assumption in the absence of any other information. Furthermore, in the more common situation of multiple input variables, this assumption allows us to take advantage of information that is available through those other variables.

Now we can fit a model using the transformed variables:

    # get the names of the x variables
    vars = setdiff(colnames(train.treat), "y")
    fmla = paste("y ~ ", paste(vars, collapse=" + "))
    model2 = lm(fmla, data=train.treat)
    summary(model2)
    ##
    ## Call:
    ## lm(formula = fmla, data = train.treat)
    ##
    ## Residuals:
    ##    Min     1Q Median     3Q    Max
    ## -3.856 -0.756 -0.026  0.782  3.078
    ##
    ## Coefficients: (1 not defined because of singularities)
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)    2.103      0.168   12.54  < 2e-16 ***
    ## x_lev_NA       1.293      0.569    2.27  0.02461 *
    ## x_lev_x.a     -0.830      0.240   -3.46  0.00075 ***
    ## x_lev_x.b      4.014      0.234   17.13  < 2e-16 ***
    ## x_lev_x.c         NA         NA      NA       NA
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 1.09 on 126 degrees of freedom
    ## Multiple R-squared:  0.794,  Adjusted R-squared:  0.789
    ## F-statistic:  162 on 3 and 126 DF,  p-value: <2e-16

The significance levels of the variables are consistent with the variable importance scores we observed in the treatment plan. The fact that one of the levels is NAd out is to be expected; four levels implies 3 degrees of freedom (plus the intercept). The standard practice is to omit one level of a categorical as redundant. We don't do this in our treatment plan, as regularized models can actually benefit from having the extra level left in. You will get warnings about possibly misleading fits when applying the model; in this case, we know how the variables were constructed, and that there are no hidden degeneracies in the variables (at least none that we created), so we can disregard the warning.
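The redundancy behind that `NA` coefficient is easy to check on a toy design matrix. The sketch below uses made-up data (three levels, one indicator each) purely as an illustration: when every level has its own indicator, the indicators sum to the intercept column, so the matrix loses one rank.

```r
# Toy example: a three-level categorical with one indicator per level.
x <- c("a", "b", "c", "a", "b", "c")
ind <- sapply(c("a", "b", "c"), function(l) as.numeric(x == l))
X <- cbind(intercept = 1, ind)

rowSums(ind)   # every row sums to 1, duplicating the intercept column
ncol(X)        # 4 columns
qr(X)$rank     # but only rank 3
```

`lm` resolves this by reporting `NA` for one of the redundant coefficients, which is what the summary above shows.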

    # you get the warnings about rank-deficient fits
    train.treat$pred = predict(model2, newdata=train.treat)
    ## Warning: prediction from a rank-deficient fit may be misleading
    test.treat$pred = predict(model2, newdata=test.treat) # works!
    ## Warning: prediction from a rank-deficient fit may be misleading

    # no NAs
    summary(train.treat$pred)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##    1.27    1.27    2.10    3.25    6.12    6.12
    summary(test.treat$pred)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##    1.27    1.27    2.10    3.37    6.12    6.12

    # note that this model gives the same answers on training data
    # as the default model
    sum(abs(train$pred - train.treat$pred), na.rm=TRUE)
    ## [1] 9.566e-13

The last command of the above listing confirms that on the training data, the model learned from the treated data is equivalent to the model learned on the original data. Now we can look at model accuracy.

    rmse = function(y, pred) {
      se = (y-pred)^2
      sqrt(mean(se))
    }

    # model does well where it really has x values
    with(subset(train, !is.na(x)), rmse(y, pred))
    ## [1] 0.973

    # not too bad on NAs
    with(train.treat, rmse(y,pred))
    ## [1] 1.07

    # model generalizes well on levels it's observed
    with(subset(test.treat, test$x != "d"), rmse(y,pred))
    ## [1] 1.08

    # less well on novel values
    with(test.treat, rmse(y,pred))
    ## [1] 1.272

    subset(test.treat, test$x=='d')[,c("y", "pred")]
    ##       y  pred
    ## 8 4.622 3.246

As expected, the model does not perform as well on novel data values (`x` = "d"), but at least it returns a prediction without crashing. Furthermore, if the novel levels are rare (as we would expect), then predicting them poorly will not affect the overall performance of the model too much.

Let's try preparing the data with the default pruning parameters (`pruneLevel=0.99`):

    train.treat = prepare(treatments, train)
    test.treat = prepare(treatments, test)

    # The x_lev_NA variable has been pruned away
    train.treat[1:4,]
    ##   x_lev_x.a x_lev_x.b x_lev_x.c     y
    ## 1         0         1         0 7.037
    ## 2         0         0         1 1.209
    ## 3         0         0         1 2.819
    ## 4         0         0         1 2.099

    # NAs are now encoded as (0,0,0)
    subset(train.treat, is.na(train$x))
    ##    x_lev_x.a x_lev_x.b x_lev_x.c       y
    ## 12         0         0         0 -0.4593
    ## 48         0         0         0  6.4741
    ## 49         0         0         0  5.3387
    ## 81         0         0         0  2.2319

    # d is now encoded as the relative frequencies of a, b, and c.
    subset(test.treat, test$x=='d')
    ##   x_lev_x.a x_lev_x.b x_lev_x.c     y
    ## 8    0.3077    0.3385    0.3231 4.622

We no longer keep `NA` as a level, because it's not any more informative than the global mean; novel levels are still encoded as "all the known levels," proportionally weighted. If we use this data representation to model, we don't have a rank-deficient fit.

    vars = setdiff(colnames(train.treat), "y")
    fmla = paste("y ~ ", paste(vars, collapse=" + "))
    model2 = lm(fmla, data=train.treat)
    summary(model2)
    ##
    ## Call:
    ## lm(formula = fmla, data = train.treat)
    ##
    ## Residuals:
    ##    Min     1Q Median     3Q    Max
    ## -3.856 -0.756 -0.026  0.782  3.078
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)    3.396      0.543    6.25  5.8e-09 ***
    ## x_lev_x.a     -2.123      0.570   -3.73  0.00029 ***
    ## x_lev_x.b      2.721      0.567    4.80  4.5e-06 ***
    ## x_lev_x.c     -1.293      0.569   -2.27  0.02461 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 1.09 on 126 degrees of freedom
    ## Multiple R-squared:  0.794,  Adjusted R-squared:  0.789
    ## F-statistic:  162 on 3 and 126 DF,  p-value: <2e-16

The model performance is similar to that of the model that included `x_lev_NA`.

    train.treat$pred = predict(model2, newdata=train.treat)
    test.treat$pred = predict(model2, newdata=test.treat)

    sum(abs(train$pred - train.treat$pred), na.rm=TRUE)
    ## [1] 6.297e-13

    with(train.treat, rmse(y,pred))
    ## [1] 1.07

    with(test.treat, rmse(y,pred))
    ## [1] 1.272

**Numerical variables and Categorical variables with many levels**

The above examples looked at data treatment for a simple categorical variable with a moderate number of levels, some possibly missing. There are two other cases to consider. First, we would like basic data treatment for numerical variables, to protect against bad values like `NA`, `NaN` or `Inf`.

Second, we'd like to gracefully manage categorical variables with a large number of possible levels, such as ZIP code, telephone area code, or even city or other geographical region. Such categorical variables can be problematic because they introduce computational or data size issues for some modeling algorithms. For example, the size of the design matrix when computing linear or logistic regression models grows as the square of the number of variables -- and a categorical variable with `N` levels is represented as `N-1` indicator variables. The `randomForest` implementation in R cannot handle categorical variables with more than 32 levels. Categoricals with a large number of levels are also a problem because it is more likely that some of the rarer levels will not appear in the training set, triggering the "novel level" problem on new data: if only a few of your customers come from Alaska or Rhode Island, then those states may not show up in your training set -- but they may show up when you deploy the model to your website.

There are often domain specific ways to handle categories with many levels. For example, a common trick with zip codes is to map them to a new variable whose value is related to zip code and relevant to the problem, such as average household income within that zip code. Obviously, this mapping won't be appropriate in all situations, so it's good to have an automatic procedure to fall back on.

Previously, we've discussed a technique that we call "impact coding" to manage this issue. We discuss this technique here and here; see also Chapter 6 of *Practical Data Science with R*. Impact coding converts a categorical variable `x_cat` into a numerical variable that corresponds to a one-variable Bayesian model for the outcome as a function of `x_cat`. The `vtreat` library implements impact coding as discussed in those posts, with a few improvements.

Let's build another simple example, to demonstrate impact coding and the treatment of numerical variables.

    N = 100 # a variable with 100 levels
    levels = paste('gp', 1:N, sep='')
    fhi = c(0.15, 0.1, 0.1) # the first three levels account for 35% of the data
    fx = sum(fhi)/(N-length(fhi))
    levelfreq = c(fhi, numeric(N-length(fhi))+fx)
    means = sample.int(10, size=N, replace=TRUE)
    names(means) = levels

    X = sample(levels, 200, replace=TRUE, prob=levelfreq)
    U = rnorm(200, mean=0.5) # numeric variable
    Y = rnorm(200) + means[X] + U

    length(unique(X)) # the data set is missing levels
    ## [1] 68

    train = data.frame(x=X[1:150], u = U[1:150], y=Y[1:150], stringsAsFactors=FALSE)
    test = data.frame(x=X[151:200], u= U[151:200], y=Y[151:200], stringsAsFactors=FALSE)

    # sprinkle a few NAs into u (for demonstration purposes)
    train$u = ifelse(runif(150) < 0.01, NA, train$u)

    length(setdiff(unique(test$x), unique(train$x))) # and test has some levels train doesn't
    ## [1] 11

The `designTreatmentsN` function has two parameters that control when a categorical variable is impact coded. The parameter `minFraction` (default value: 0.02) controls what fraction of the time an indicator variable has to be "on" (that is, not zero) to be used (this is separate from the `pruneLevel` parameter in `prepare`). The purpose is to eliminate rare variables or rare levels. By default, we eliminate variables that are on less than 2% of the time.

When a categorical variable has a large number of levels, it's likely that many of them will be on less than 2% of the time. In that case, the corresponding indicator variables are eliminated, and all of those rare levels will encode to `c(0, 0, ...)`, in the way the `NA` level did in our second example above. Let's call the fraction of the data that gets encoded to zero due to rare levels the fraction of the data that we "lose". The parameter `maxMissing` (default value: 0.04) specifies what fraction of the data we are allowed to "lose" before automatically switching to an impact coded variable. By default, if the eliminated levels correspond to more than 4% of the data, then the treatment plan will switch to impact coding.

In the example above, three levels of the variable `x` account for 35% of the data, so all the other levels will account for roughly `(1-0.35)/97 = 0.0067` of the data each, or less than 1% of the mass each. So, all of those 97 levels would be eliminated, and we will "lose" 65% of the data if we keep the categorical representation! Therefore, the data treatment automatically converts `x` to an impact-coded variable.
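The decision rule can be sketched in a few lines of base R. The function `needs_impact_coding` and its return value are hypothetical, written only to mirror the description of `minFraction` and `maxMissing` above; the package's internal logic may differ.

```r
# Hypothetical re-statement of the rule described above (not vtreat's API).
needs_impact_coding <- function(x, minFraction = 0.02, maxMissing = 0.04) {
  freqs <- as.numeric(table(x)) / length(x)  # fraction of rows at each level
  rare <- freqs < minFraction                # levels whose indicators would be dropped
  lost <- sum(freqs[rare])                   # fraction of rows that would encode to all zeros
  list(lost = lost, impact_code = lost > maxMissing)
}

# Made-up data: three common levels plus 97 rare ones.
x <- c(rep("gp1", 150), rep("gp2", 100), rep("gp3", 100),
       rep(paste0("gp", 4:100), each = 6))
needs_impact_coding(x)

# Four equally common levels: nothing is rare, so indicators are kept.
needs_impact_coding(rep(c("a", "b", "c", "d"), 25))
```

In the first call most of the rows belong to rare levels, so the sketch flags the variable for impact coding; in the second call no rows are lost.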

    #
    # create the treatment plan.
    #
    treatments = designTreatmentsN(train, c("x", "u"), "y")
    treatments
    ## $treatments
    ## $treatments[[1]]
    ## [1] "vtreat 'Scalable Impact Code'('x'->'x_catN')"
    ##
    ## $treatments[[2]]
    ## [1] "vtreat 'Scalable pass through'('u'->'u_clean')"
    ##
    ## $treatments[[3]]
    ## [1] "vtreat 'is.bad'('u'->'u_isBAD')"
    ##
    ##
    ## $vars
    ## [1] "x_catN"  "u_clean" "u_isBAD"
    ##
    ## $varScores
    ##  x_catN u_clean u_isBAD
    ##  0.1717  0.9183  1.0116
    ##
    ## $varMoves
    ##  x_catN u_clean u_isBAD
    ##    TRUE    TRUE    TRUE
    ##
    ## $outcomename
    ## [1] "y"
    ##
    ## $meanY
    ## [1] 5.493
    ##
    ## $ndat
    ## [1] 150
    ##
    ## attr(,"class")
    ## [1] "treatmentplan"

The variable `x_catN` is the impact-coded variable corresponding to `x`. If we refer to the mean of `y` conditioned on `x` as `y|x`, and `meanY` as the grand (unconditioned) mean of `y`, then `x_catN = y|x - meanY`. Note that `x_catN` has a low `varScore`, indicating that it is a good, informative variable.
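The formula `x_catN = y|x - meanY` can be written out directly. The sketch below is a simplified, hypothetical version (the helper name `make_impact_coder` is ours, and it omits whatever refinements the package applies); levels not seen in training are mapped to zero, that is, to the grand mean.

```r
# Hypothetical, simplified impact coder for a numeric outcome
# (vtreat's own implementation includes refinements not shown here).
make_impact_coder <- function(x_train, y_train) {
  mean_y <- mean(y_train)
  impact <- tapply(y_train, x_train, mean) - mean_y   # mean(y | x) - meanY per level
  function(x_new) {
    code <- as.numeric(impact[match(x_new, names(impact))])
    code[is.na(code)] <- 0     # novel or missing levels fall back to the grand mean
    code
  }
}

x <- c("a", "a", "b", "b", "c")
y <- c(1, 3, 6, 8, 2)          # grand mean is 4
coder <- make_impact_coder(x, y)
coder(c("a", "b", "z"))        # -2  3  0
```

Level "a" has conditional mean 2, so it codes to -2; "b" has conditional mean 7, so it codes to 3; the unseen level "z" codes to 0.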

The variable `u_clean` is the numerical variable `u`, with all "bad" values (`NA`, `NaN`, `Inf`) converted to the mean of the "non-bad" `u` (we'll call this the "clean mean" of `u`). The variable `u_isBAD` is an indicator variable that is one whenever `u` is bad. If the bad values are due to a "faulty sensor" (that is, they occur at random), then converting to the clean mean value of `u` is the right thing to do. If the bad values are systematic, then `u_isBAD` can be used by the modeling algorithm to adjust for the systematic effect (assuming it survives the pruning, which in this case, it won't).
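A base R sketch of this numeric treatment follows. The helper name `make_numeric_treatment` is an assumption for illustration; the output column names are borrowed from the treatment plan above.

```r
# Hypothetical sketch of the numeric treatment described above.
make_numeric_treatment <- function(u_train) {
  clean_mean <- mean(u_train[is.finite(u_train)])   # mean of the non-bad training values
  function(u_new) {
    bad <- !is.finite(u_new)                        # TRUE for NA, NaN, Inf and -Inf
    data.frame(u_clean = ifelse(bad, clean_mean, u_new),
               u_isBAD = as.numeric(bad))
  }
}

treat_u <- make_numeric_treatment(c(1, 2, NA, 3, Inf))   # clean mean is 2
treat_u(c(5, NA, NaN, -Inf))
```

Note that the clean mean is learned from the training data only and then reused on new data, so the test set never influences the replacement value.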

We can see how this works concretely by preparing the test and training sets.

    train.treat = prepare(treatments, train)
    test.treat = prepare(treatments, test)

    train.treat[1:5,] # isBAD column didn't survive
    ##     x_catN u_clean     y
    ## 1  0.04809  1.0749 5.328
    ## 2  1.37053 -0.4429 5.413
    ## 3 -2.32535  1.4380 2.372
    ## 4 -5.02863  0.6611 0.464
    ## 5 -1.04404  0.8327 4.449

    # ------------------------
    # "bad" u values map to the "clean mean" of u
    # ------------------------
    train.treat[is.na(train$u),]
    ##     x_catN u_clean     y
    ## 74   1.371  0.5014 5.269
    ## 133  3.053  0.5014 8.546

    # compare to u_clean, above
    mean(train$u, na.rm=TRUE)
    ## [1] 0.5014

    #-----------------------
    # confirm (x_catN | x = xlevel) is mean(y | x=xlevel) - mean(y)
    # ------------------------
    subset(train.treat, train$x==levels[1])[1:2,]
    ##    x_catN u_clean     y
    ## 3  -2.325  1.4380 2.372
    ## 15 -2.325  0.0661 2.381

    # compare to x_catN, above
    mean(subset(train, x==levels[1])$y) - mean(train$y)
    ## [1] -2.325

    #-----------------------
    # missing levels map to 0, which is equivalent to
    # mapping them to all known levels proportional to frequency
    #-----------------------
    missingInTest = setdiff(unique(test$x), unique(train$x))
    subset(test.treat, test$x %in% missingInTest)[1:2,]
    ##       x_catN u_clean     y
    ## 1  4.737e-16  1.3802 1.754
    ## 13 4.737e-16  0.9062 6.862

Finally, we use the treated data to model.

    vars = setdiff(colnames(train.treat), "y")
    fmla = paste("y ~ ", paste(vars, collapse=" + "))
    model = lm(fmla, data=train.treat)
    summary(model)
    ##
    ## Call:
    ## lm(formula = fmla, data = train.treat)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -2.2077 -0.6131 -0.0113  0.5237  2.5923
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   5.1347     0.0800    64.2   <2e-16 ***
    ## x_catN        0.9846     0.0296    33.3   <2e-16 ***
    ## u_clean       0.7139     0.0760     9.4   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.862 on 147 degrees of freedom
    ## Multiple R-squared:  0.894,  Adjusted R-squared:  0.892
    ## F-statistic:  617 on 2 and 147 DF,  p-value: <2e-16

    train.treat$pred = predict(model, newdata=train.treat)
    test.treat$pred = predict(model, newdata=test.treat)

    with(train.treat, rmse(y,pred))
    ## [1] 0.8529

    with(test.treat, rmse(y,pred))
    ## [1] 1.964

    # evaluate only on the known levels
    with(subset(test.treat, test$x %in% unique(train$x)), rmse(y, pred))
    ## [1] 1.138

As you can see, the model performs better on categories that it saw during training, but it still handles novel levels gracefully -- and remember, some modeling algorithms can't handle a large number of categories at all.

That describes the most basic data treatment procedures that our package implements. For binary classification and logistic regression problems, the package has another function, `designTreatmentsC()`, which creates treatment plans when the outcome is a binary class variable.

**Loading the vtreat package**

We have made `vtreat` available on github; remember, this is an alpha package, so it will be rough around the edges. To install the package, download the `vtreat` tar file (at this writing, `vtreat_0.2.tar.gz`), as shown in the figure below:

Once you've downloaded it, you can install it from the R command line, as you would any other package. If your R working directory is the same directory where you've downloaded the tar file, then the command looks like this:

    install.packages('vtreat_0.2.tar.gz',repos=NULL,type='source')

Once it's installed, `library(vtreat)` will load the package. Type `help(vtreat)` to get a short description of how to use the package, along with some example code snippets.

`vtreat` has a few more features that we will cover in future posts, but this post has given you enough to get you started. Remember, automatic data treatment procedures are not a substitute for inspecting and exploring your data before modeling. However, once you've gotten a feel for the data, you will find that the procedures we have implemented are applicable to a wide variety of situations.

If you try the package, please do send along feedback, including any errors or bugs that you might discover.

For more on data treatment, see Chapter 4 of *Practical Data Science with R*.

Nina Zumel also examines aspects of the supernatural in literature and in folk culture at her blog, multoghost.wordpress.com. She writes about folklore, ghost stories, weird fiction, or anything else that strikes her fancy. Follow her on Twitter @multoghost.


An easy way to avoid fairly evaluating an analysis technique is to assert that the technique in question is unsound because it violates some important foundational axiom of sound analysis. This rapidly moves a discussion from a potentially difficult analysis to an easy debate. However, this (unfortunately common) behavior is mere gamesmanship (see Potter “The Theory and Practice of Gamesmanship (or the Art of Winning Games without Actually Cheating)”). But it is what you can encounter when presenting a technique from school “B” to members of school “A.” For example: Bayesian parameter estimates can be considered inadmissible by frequentists because the estimates may be biased (see Frequentist inference only seems easy for an interesting example of the principle, and of a valid low-variance estimate that is necessarily biased). BDA3 page 94 provides an interesting situation with a deliberate omitted variable bias (a feature of the data). BDA3 goes on to demonstrate how silly it would be to apportion the blame for prediction bias to the inference technique used (ordinary linear regression), or to try to mechanically adjust for prediction bias without fixing the underlying omitted variable issue (by recruiting more variables/features). The example is important because, as we demonstrated in our earlier article, so-called unbiased techniques work by rejecting many (possibly good) biased estimates, and therefore can implicitly incorporate potentially domain-inappropriate bias corrections or adjustments. This example is relevant because it is easier to respond to such criticism when it is applied to a standard technique used on a simple artificial problem (versus defending a specialized technique on messy real data).

Axiomatic approaches to statistical inference tend to be very brittle in that it takes only a few simple rules to build a paradoxical or unsatisfiable system. For example: we described how even insisting on the single reasonable axiom of unbiasedness completely determines a family of statistical estimates (leaving absolutely no room to attempt to satisfy any additional independent conditions or axioms).

This sort of axiomatic brittleness is not unique to statistical inference. It is a common experience that small families of seemingly reasonable (and important) desiderata lead to inconsistent and unsatisfiable systems when converted to axioms. Examples include Arrow’s impossibility theorem (showing a certain reasonable combination of goals in voting systems is unachievable) and Brewer’s CAP theorem (showing a certain reasonable combination of distributed computing goals are mutually incompatible). So the reason a given analysis may not satisfy an obvious set of desirable axioms is often that no analysis satisfies the given set of axioms.

Let’s get back to the BDA3 example and work out how to criticize ordinary linear regression for having an undesirable bias. If linear regression can’t stand up to this sort of criticism, how can any other method be expected to face the same? If we are merely looking at the words it is “obvious” that regression can’t be biased as this would contradict the Gauss-Markov theorem (that linear regression is the “best linear unbiased estimator” or BLUE). However, the word “bias” can have different meanings in different contexts: in particular *what* is biased with respect to *what*? Let’s refine the idea of bias and try to make ordinary linear regression look bad.

Consider the following simple problem. Suppose our data set is observations of pairs of mother’s and adult daughter’s heights. Suppose idealizations of these two random variables are generated by the following process:

- `c` (unobserved) is independently sampled from a normal distribution with a mean of 80 centimeters and a standard deviation of 5 centimeters (the shared or common component of height).
- `u` (unobserved) is independently sampled from a normal distribution with a mean of 80 centimeters and a standard deviation of 5 centimeters (the unique to mother portion of height).
- `v` (unobserved) is independently sampled from a normal distribution with a mean of 80 centimeters and a standard deviation of 5 centimeters (the unique to daughter portion of height).
- We then observe the two derived random variables: mother’s height `m=c+u`, and adult daughter’s height `d=c+v`.

The random variables `m` and `d` are normally distributed with equal means of 160 centimeters, equal variances, and a correlation of 0.5. As we said: we can think of the two random variables `m` and `d` as representing the heights of pairs of mothers and adult daughters. The correlation means tall mothers tend to have taller daughters (but the correlation being less than 1.0 means the mother’s height does not completely determine the daughter’s height). Obviously real heights are not normally distributed (as people do not have negative heights, and non-degenerate normal distributions have non-zero mass on negative values); but overall the normal distribution is a very good approximation of plausible heights.
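The stated correlation can be checked directly from the generative process, using only the independence of `c`, `u`, and `v` and their common variance of 5² = 25:

```latex
\begin{aligned}
\operatorname{Var}(m) &= \operatorname{Var}(c) + \operatorname{Var}(u) = 25 + 25 = 50,
  \qquad \operatorname{Var}(d) = \operatorname{Var}(c) + \operatorname{Var}(v) = 50,\\
\operatorname{Cov}(m, d) &= \operatorname{Cov}(c + u,\; c + v) = \operatorname{Var}(c) = 25,\\
\operatorname{cor}(m, d) &= \frac{\operatorname{Cov}(m, d)}{\sqrt{\operatorname{Var}(m)\,\operatorname{Var}(d)}}
  = \frac{25}{50} = 0.5 .
\end{aligned}
```

The cross terms vanish because the three components are independent, so only the shared component `c` contributes to the covariance.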

This generative model represents a specialization of the example from BDA3 page 94 to specific distributions that clearly obey the properties claimed in BDA3. We are completely specifying the distributions to attempt to negate any (wrong) claim that there may not be distributions simultaneously having all of the claimed properties mentioned in the original BDA3 example. The interpretation (again from BDA3) of the two observed random variables `m` and `d` as pairs of mother/daughter heights is to give the data an obvious interpretation and help make obvious when our procedures become silly. At this point we have distributions exactly matching the claimed properties in BDA3 and very closely (but not exactly) matching the claimed interpretation as heights of pairs of mothers and their adult daughters.

Let’s move on to the analysis. The claim in BDA3 is that the posterior mean of `d` given `m` is: `E[d|m] = 160 + 0.5 (m-160)`. We could derive this through Bayes law and some calculus/algebra. But we get the exact same answer using ordinary linear regression (which tends to have a frequentist justification). In R:

```
n <- 10000
set.seed(4369306)
d <- data.frame(c=rnorm(n,mean=80,sd=5),
   u=rnorm(n,mean=80,sd=5),
   v=rnorm(n,mean=80,sd=5))
d$m <- d$c+d$u
d$d <- d$c+d$v
print(cor(d$m,d$d))
## [1] 0.4958206
print(lm(d~m,data=d))
##
## Call:
## lm(formula = d ~ m, data = d)
##
## Coefficients:
## (Intercept)            m
##     81.6638       0.4899
```

The recovered linear model is very close to the claimed theoretical conditioned expectation `E[d|m] = 160 + 0.5 (m-160) = 80 + 0.5 m`. So we can assume a good estimate of `d` can be learned from the data. To keep things neat let’s say our point-estimate for `d` is called `δ`, and `δ = 160 + 0.5 (m-160)`. As we see below `δ` is a plausible looking estimate:

```
library(ggplot2)
d$delta <- 160 + 0.5*(d$m-160)
ggplot(data=d,aes(x=delta,y=d)) +
  geom_density2d() + geom_point(alpha=0.1) +
  geom_smooth() +
  geom_segment(x=150,xend=170,y=150,yend=170,linetype=2) +
  coord_equal(ratio=1) + xlim(140,180) + ylim(140,180)
```

Actuals (`d`) plotted against estimate `δ`

Notice the dashed line `y=x` mostly coincides with the blue smoothing curve in the above graph; this is a visual confirmation that `E[d|δ] = δ`. This follows because we chose `δ` so that `δ = E[d|m]` (i.e. matching the regression estimate) and if we know `δ` then we (by simple linear substitution) also know `m`. So `E[d|δ] = E[d|m] = δ`. `E[d|δ] = δ` seems like a very nice property for the estimate `δ` to have. We can (partially) re-confirm it by fitting a linear model of `d` as a linear function of `δ`:

```
print(summary(lm(d~delta,data=d)))
## Call:
## lm(formula = d ~ delta, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -20.1532  -4.1458   0.0286   4.1461  23.5144
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.28598    2.74655   1.196    0.232
## delta        0.97972    0.01716  57.089   <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 6.126 on 9998 degrees of freedom
## Multiple R-squared: 0.2458, Adjusted R-squared: 0.2458
## F-statistic: 3259 on 1 and 9998 DF, p-value: < 2.2e-16
```

We see the expected slope near one and an intercept/dc-term statistically indistinguishable from zero. And we don’t really have much bad to say about this fit (beyond the R-squared of 0.2458, which is expected when correlation is known to be 0.5). For instance the residuals don’t formally appear structured (despite the obvious visible tilt of principal axes in the previous graph):

```
plot(lm(d~delta,data=d))
```

And now for the (intentionally overreaching) frequentist criticism. From BDA3 page 94 (variable names changed): “The posterior mean is *not*, however, an unbiased estimate of `d` in the sense of repeated sampling of `m` for a fixed `d`.” That is: the chosen estimate `δ` is not an unbiased estimate of a general fixed unknown value of `d` under repeated experiments where the observed variable `m` varies according to repeated draws from the joint distribution. This may sound complicated, but it is the standard frequentist definition of an unbiased estimator: for any given fixed unknown value of the item to be estimated under repeated experiments (with new, possibly different observed data) the value of the estimate should match the fixed unknown value in expectation. In other words: it isn’t considered enough for a single given estimate `δ` to capture the expected value of the unknown item `d` (to have `E[d|δ] = δ`, which we have confirmed), we must also have the whole *estimation procedure* be unbiased for arbitrary unknown `d` (that would be `E[δ|d] = d`, which we will show does not hold in general). To be clear BDA3 is not advocating this criticism, they are just citing it as a standard frequentist criterion often wrongly over-applied to methods designed with different objectives in mind. The punch-line is: the predictions from the method of ordinary linear regression fail this criticism, yet the method continues to stand.

Let’s confirm `E[δ|d] ≠ d` in general. To do this we need one more lemma: for a fixed (unknown) value of `d` we know the conditional expectation of the observable value of `m` is `E[m|d] = 160 + 0.5 (d-160)`. We can again get this by a Bayesian argument, or just by running the linear regression `lm(m~d,data=d)` and remembering that linear regression is a linear estimate of the conditional expectation. We are now ready to look at the expected value of our estimate `δ` conditioned on the unknown true value `d`: `E[δ|d]`. Plugging in what we know we get:

```
E[δ|d] = E[160 + 0.5 (m-160) | d]
       = 160 + 0.5 (E[m|d]-160)
       = 160 + 0.5 ((160 + 0.5 (d-160))-160)
       = 160 + 0.25 (d - 160)
```

And that is a problem. To satisfy frequentist unbiasedness we would need `E[δ|d] = d` for all `d`. And `160 + 0.25 (d - 160) = d` only if `d=160`. So for all but one possible value of the daughter’s height `d` the ordinary linear regression’s prediction procedure is considered biased in the frequentist sense. In fact we didn’t actually use the regression coefficients, we used the exact coefficients implied by the generative model that is actually building the examples. So we could even say: using the actual generative model to produce predictions is not unbiased in the frequentist sense.

This would seem to contradict our earlier regression check, but that is not the case. Consider the following regression and graph:

```
print(summary(lm(delta~d,data=d)))
## Call:
## lm(formula = delta ~ d, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.4663  -2.0743  -0.0216   2.0890  12.7414
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.198e+02  7.041e-01  170.20   <2e-16 ***
## d           2.509e-01  4.395e-03   57.09   <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 3.1 on 9998 degrees of freedom
## Multiple R-squared: 0.2458, Adjusted R-squared: 0.2458
## F-statistic: 3259 on 1 and 9998 DF, p-value: < 2.2e-16

ggplot(data=d,aes(x=d,y=delta)) +
  geom_density2d() + geom_point(alpha=0.1) +
  geom_smooth() +
  geom_segment(x=140,xend=180,y=140,yend=180,linetype=2) +
  coord_equal(ratio=1) + xlim(140,180) + ylim(140,180)
```

Estimate `δ` plotted against actuals (`d`)

Notice that the slope (both from the regression and the graph) is now 0.25 and the dashed "y=x" line no longer agrees with the empirical smoothing curve. This closely agrees with our derived form for `E[δ|d]`. And this may expose one source of the confusion. The slope of the regression `d ~ δ` is 1.0, while the slope of the regression `δ ~ d` is 0.25. This violates a possible naive expectation/intuition that these two slopes should be reciprocal (which they need not be, as each has a different error model).
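Both slopes can also be read off the construction used in the R code above (`m = c + u`, `d = c + v`, with independent `c`, `u`, `v` each having standard deviation 5). The following is a short worked check, assuming only the standard fact that the least-squares slope of `y` on `x` is `Cov(x,y)/Var(x)`:

```latex
\begin{aligned}
\mathrm{Var}(m) &= \mathrm{Var}(d) = 25 + 25 = 50, \qquad \mathrm{Cov}(m,d) = \mathrm{Var}(c) = 25 \\
\delta &= 160 + 0.5\,(m-160) \;\Rightarrow\; \mathrm{Var}(\delta) = 0.25 \cdot 50 = 12.5, \quad \mathrm{Cov}(\delta,d) = 0.5 \cdot 25 = 12.5 \\
\text{slope of } d \sim \delta &= \frac{\mathrm{Cov}(\delta,d)}{\mathrm{Var}(\delta)} = \frac{12.5}{12.5} = 1 \\
\text{slope of } \delta \sim d &= \frac{\mathrm{Cov}(\delta,d)}{\mathrm{Var}(d)} = \frac{12.5}{50} = 0.25
\end{aligned}
```

The product of the two slopes is the squared correlation (`0.5^2 = 0.25`), so the slopes are reciprocal only when the correlation is exactly plus or minus one.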

Part of what is going on is an expected reversion to the mean effect. If we have a given `m` in hand then `δ = 160 + 0.5 (m-160)` is in fact a good estimate for `d` (given that we know only `m` and don't have a direct better estimate for `c`). What we don't have is the ability to guess what part of the heights to be estimated is from the shared process (`c`, which we can consider an omitted variable in this simple analysis) and what part is from the unique processes (`u` and `v`, and therefore not useful for prediction).

One concern: we have been taking all conditional expectations `E[|]` over the same data set (a nice single consistent probability model). This doesn't quite reproduce the frequentist set-up of `d` being fixed. However, if there was in fact no reversion to the mean on any `d`-slice then we would not have seen reversion to the mean in the aggregate. We can check the fixed-`d` case directly with a little math to produce a new fixed-`d` data set, or approximate it by censoring a larger data set down to a narrow interval of `d`. Here is such an example (showing the same effects we saw before):

```
n2 <- 10000000
set.seed(4369306)
d2 <- data.frame(c=rnorm(n2,mean=80,sd=5),
                 u=rnorm(n2,mean=80,sd=5),
                 v=rnorm(n2,mean=80,sd=5))
d2$m <- d2$c+d2$u
d2$d <- d2$c+d2$v
d2 <- subset(d2,d>=170.1 & d<170.2)
d2$delta <- 160 + 0.5*(d2$m-160)
print(dim(d2)[[1]])
## [1] 20042
print(mean(d2$d))
## [1] 170.1498
print(mean(d2$delta))
## [1] 162.5718
print(mean(160 + 0.25*(d2$d-160)))
## [1] 162.5374
```

And we pretty much see the exact reversion to the mean expected from our derivation.

Back to the Gauss-Markov theorem: in what sense can ordinary linear regression be considered unbiased? It turns out if you read carefully the content of the Gauss-Markov theorem is that the estimates of unknown *parameters* (or betas) are unbiased. So in particular the estimate `lm(d~m,data=d)` should recover coefficients that are unbiased estimates of the coefficients in the expression `E[d|m] = 160 + 0.5 (m-160)`. And that appears plausible, as we empirically estimated `d ~ 81.6638 + 0.4899*m` which is very close to the true values. The Gauss-Markov theorem says ordinary linear regression, given appropriate assumptions, gives us unbiased *estimates of models*. It does not say that evaluations of such a model are themselves unbiased (in the frequentist sense) *predictions of instances*. In fact, as we have seen, even evaluations of the exact true model do not always give unbiased (in the frequentist sense) predictions for individual instances. This is one reason that frequentist analysis has to take some care in treating unobservable parameters and unobserved future instances very differently (supporting the distinction between prediction and estimation, less of a concern in Bayesian analysis). This also is a good reminder of the fact that traditional statistics is much more interested in parameter estimation than in prediction of individual instances.

BDA3 goes on to exhibit (for the purpose of criticism) the mechanical derivation of a frequentist-sense unbiased linear estimator for `d`: `γ = 160 + 2 (m-160)`. It is true that `γ` satisfies the unbiased condition `E[γ|d] = d` for all `d`. But `γ` is clearly an unusable and ridiculous estimator that claims for every centimeter of height increase in the mother (`m`) we should expect two centimeters of height increase in the daughter. This is not an effect seen in the data (so not something a good estimator should claim) and is a much higher variance estimator than the common reasonable estimator `δ`. A point BDA3 is making is: applying "bias corrections" willy-nilly or restricting to only unbiased predictors is an ill-advised attempt at a mechanical fix to modeling bias. When the underlying issue is omitted variable bias (as it is in this example) the correct fix is to try and get better estimates of the hidden variables (in this case `c`) by introducing more explanatory variables (in this case perhaps obtaining some genetic and diet measurements for each mother/daughter pair).
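To make the cost of insisting on `γ` concrete, here is a minimal simulation sketch. It is written in Python with only the standard library (rather than the R used above), and the function name `simulate_sq_errors` is our own illustrative choice. It draws mother/daughter pairs from the same `c`, `u`, `v` construction and compares the mean square error of `δ` and `γ` as predictors of `d`. Under that construction the expected square errors work out to 37.5 for `δ` and 150 for `γ`.

```python
import random

def simulate_sq_errors(n=100_000, seed=4369306):
    """Mean square error of delta and gamma as predictors of d (illustrative sketch)."""
    rng = random.Random(seed)
    sq_err_delta = 0.0
    sq_err_gamma = 0.0
    for _ in range(n):
        c = rng.gauss(80, 5)               # shared (omitted) component
        m = c + rng.gauss(80, 5)           # mother's height
        d = c + rng.gauss(80, 5)           # daughter's height
        delta = 160 + 0.5 * (m - 160)      # regression-style estimate
        gamma = 160 + 2.0 * (m - 160)      # frequentist-sense unbiased estimate
        sq_err_delta += (delta - d) ** 2
        sq_err_gamma += (gamma - d) ** 2
    return sq_err_delta / n, sq_err_gamma / n

mse_delta, mse_gamma = simulate_sq_errors()
print(mse_delta, mse_gamma)
```

The factor of roughly four in square error is the price paid for forcing `E[γ|d] = d` without adding any new information about `c`.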

So: a deliberately far too broad application of a too stringent frequentist bias condition eliminated reasonable predictors leaving us with a bad one. The fact is: unbiasedness is a stronger condition than is commonly thought, and can limit your set of possible solutions very dramatically (as was also shown in our earlier example). Bias *is* bad (it can prevent you from improving results through aggregation), but you can't rely on mere mechanical procedures to eliminate it. *Correctly* controlling for bias may involve making additional measurements and introducing new variables. You also really need to be allowed to examine the domain specific utility of your procedures, and not be told a large number of them are a-priori inadmissible.

We took the term inadmissible from discussions about the James-Stein estimator. One of those results shows: "The ordinary decision rule for estimating the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk" (from Wikipedia: Stein's example). Though really what this shows is that insisting only on "admissible estimators" (a loaded term if there ever was one) collapses under its own weight (linear regression in many cases actually being a good method for prediction). So such criticisms of criticisms are already well known, but evidently not always sufficiently ready to hand.

Afterthought: the plots we made looked a lot like the cover of Freedman, Pisani, Purves "Statistics" 4th edition (which itself is likely a reminder that the line from regressing `y~x` is not the same as the principal axes). So in this example we see the regression line `y~x` is not necessarily the transpose or reciprocal of the regression line `x~y`, and neither of these is necessarily one of the principal axes of the scatterplot.

To follow up on this we produced some plots showing regression lines, smoothed curves, y=x, and principal axes all at once. The graphs are a bit too busy/confusing for the main part of the article itself, but nice to know how to produce (for use debugging, and during data exploration). We have also changed the smoothing curve to green, to give it a chance to stand out from the other annotations.

```
# build some ggplot2() line segments representing principal axes
pcompAxes <- function(x,y) {
  axes <- list()
  means <- c(mean(x),mean(y))
  dp <- data.frame(x=x-means[1],y=y-means[2])
  p <- prcomp(~x+y,data=dp)
  for(j in 1:2) {
    s <- p$rotation[,j]
    step <- 3*p$sdev[j]
    a = means - step*s
    b = means + step*s
    axes[[length(axes)+1]] <-
      geom_segment(x=a[1],xend=b[1],y=a[2],yend=b[2],
                   color='blue',linetype=3)
  }
  axes
}
ggplot(data=d,aes(x=delta,y=d)) +
  geom_density2d() + geom_point(alpha=0.1) +
  geom_smooth(color='green') +
  geom_segment(x=140,xend=180,y=140,yend=180,linetype=2) +
  coord_equal(ratio=1) + xlim(140,180) + ylim(140,180) +
  pcompAxes(d$delta,d$d)
```

```
ggplot(data=d,aes(x=d,y=delta)) +
  geom_density2d() + geom_point(alpha=0.1) +
  geom_smooth(color='green') +
  geom_segment(x=140,xend=180,y=140,yend=180,linetype=2) +
  coord_equal(ratio=1) + xlim(140,180) + ylim(140,180) +
  pcompAxes(d$d,d$delta)
```

As a first example, consider the problem of trying to estimate the speed of light from a series of experiments.

In this situation the frequentist method quietly does some heavy philosophical lifting before you even start work. Under the frequentist interpretation since the speed of light is thought to have a single value it does not make sense to model it as having a prior distribution of possible values over any non-trivial range. To get the ability to infer, frequentist philosophy considers the act of measurement repeatable and introduces very subtle concepts such as *confidence intervals*. The frequentist statement that a series of experiments places the speed of light in vacuum at 300,000,000 meters a second plus or minus 1,000,000 meters a second with 95% confidence does not mean there is a 95% chance that the actual speed of light is in the interval 299,000,000 to 301,000,000 (the common incorrect recollection of what a confidence interval is). It means if the procedure that generated the interval were repeated on new data, then 95% of the time the speed of light would be in the interval produced: which may not be the interval we are looking at right now. Frequentist procedures are typically easy on the practitioner (all of the heavy philosophic work has already been done) and result in simple procedures and calculations (through years of optimization of practice).
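The "repeat the procedure" reading of a confidence interval can be illustrated with a small simulation. This is a sketch in Python (standard library only) on a made-up measurement problem, not the speed-of-light data: the true value, noise level, sample size, and the function name `coverage` are all assumptions chosen for illustration. Each repetition builds a known-sigma 95% interval around a fresh sample mean, and we count how often the fixed true value lands inside.

```python
import random

def coverage(true_value=10.0, sigma=2.0, n=25, reps=4000, seed=1):
    """Fraction of repeated experiments whose 95% interval contains true_value."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n ** 0.5    # known-sigma normal interval
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_value, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        # the interval moves from experiment to experiment; the truth stays put
        if abs(sample_mean - true_value) <= half_width:
            hits += 1
    return hits / reps

rate = coverage()
print(rate)
```

The 95% figure describes the long-run hit rate of the interval-producing procedure; any single interval either contains the true value or it does not.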

Bayesian procedures on the other hand are philosophically much simpler, but require much more from the user (production and acceptance of priors). The Bayesian philosophy is: *given* a generative model, a complete prior distribution (detailed probabilities of the unknown value posited before looking at the current experimental data) of the quantity to be estimated, and observations: then inference is just a matter of calculating the complete posterior distribution of the quantity to be estimated (by correct application of Bayes’ Law). Supply a bad model or bad prior beliefs on possible values of the speed of light and you get bad results (and it is your fault, not the methodology’s fault). The Bayesian method seems to ask more, but you have to remember it is trying to supply more (complete posterior distribution, versus subjunctive confidence intervals).

In this article we are going to work a simple (but important) problem where (for once) the Bayesian calculations are in fact easier than the frequentist ones.

Consider estimating from observation the odds that a coin-flip comes out heads (as shown below).

Heads outcome.

The coin can also show a tails (as shown below).

Tails outcome.

This might be a fair coin, that when tossed properly can be argued to have heads/tails probabilities very close to 50/50. Or the heads/tails outcome could in fact be implemented by some other process with some other probability `p` of coming up heads. Suppose we flip the coin 100 times and record heads 54 times.

In this case the frequentist procedure is to generate a point-estimate of the unknown `p` as `pest = 54/100 = 0.54`. That is we estimate `p` to be the relative frequency we actually empirically observed. Stop and consider: how do we know this is the right frequentist estimate? Beyond being told to use it, what principles lead us to this estimate? It may seem obvious in this case, but in probability mere obviousness often leads to contradictions and paradox. What criteria can be used to derive this estimate in a principled manner?

Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd Edition p. 92 states that frequentist estimates are designed to be *consistent* (as the sample size increases they converge to the unknown value), *efficient* (they tend to minimize loss or expected square-error), or even have *asymptotic unbiasedness* (the difference in the estimate from the true value converges to zero as the experiment size increases, even when re-scaled by the shrinking standard error of the estimate). Because some of the estimators we will work with are biased we are going to use expected square-error as our measure of error. This is the expected value of the square of the distance of our estimate from the *unknown true value*, and not the variance (which is the expected value of the square of the distance of the estimator from its own mean).

Frequentists also commonly insist on fully *unbiased* procedures (which is what we will discuss here). In this case an unbiased procedure is a function `f(nHeads,nFlips)` that given the sufficient statistics of the experiment (the number of heads and the total number of flips) returns an estimate for the unknown probability. The frequentist philosophy assumes the unknown probability `p` is fixed and the observed number of heads might vary as we repeat the coin-flip experiment again and again. To confirm a frequentist procedure to estimate `p` from 100 flips is unbiased, we must check that the entire family of possible estimates `f(0,100), f(1,100), ... f(100,100)` together represent a panel of estimates that are simultaneously unbiased no matter what the unknown true value of `p` is.

That is: the following bias check equation must hold for any `p` in the range `[0,1]`:

`sum_{h=0}^{n} P(h|n,p) f(h,n) = p`

Equation family 1: Bias checks (one `f(h,n)` variable for every possible outcome `h`, one equation for every possible `p`).

Some combinatorics or probability theory tells us `P(h|n,p) = (n choose h) p^h (1-p)^(n-h)`. We can choose to treat the sequence `f(0,nFlips),f(1,nFlips), ... f(nFlips,nFlips)` either as a set of pre-made estimates (to be checked) or as a set of variables (to be solved for). It turns out there is a solution that satisfies all of the equations simultaneously: `f(h,n) = h/n`. This fact is just a matter of checking that the expected value of the number of heads is `p` times the number of flips. And this is the only unbiased solution. The set of check equations we can generate for various `p` has rank `nFlips+1` (when we include check equations from at least `nFlips+1` different values of `p`, this follows as the check equations behave a lot like the moment curve). We will work a concrete example of the family 1 bias checks a bit later (which should make seeing the content of the checks a bit easier).
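As a quick sanity check of the claim that `f(h,n) = h/n` passes the bias checks, here is a minimal Python sketch using exact rational arithmetic. The helper names `bias`, `empirical`, and `smoothed` are ours, and only a finite grid of `p` values is tested, so this is a spot check rather than a proof. For contrast it also evaluates a Laplace-smoothed estimate `(h+1)/(n+2)`, which does not pass.

```python
from fractions import Fraction
from math import comb

def bias(f, n, p):
    """Expected value of the estimate f(h, n) minus p, with h ~ Binomial(n, p)."""
    expected = sum(comb(n, h) * p**h * (1 - p)**(n - h) * f(h, n)
                   for h in range(n + 1))
    return expected - p

def empirical(h, n):
    return Fraction(h, n)            # observed frequency

def smoothed(h, n):
    return Fraction(h + 1, n + 2)    # Laplace-smoothed frequency, for contrast

n_flips = 100
test_ps = [Fraction(k, 10) for k in range(11)]   # p = 0, 0.1, ..., 1
empirical_biases = [bias(empirical, n_flips, p) for p in test_ps]
smoothed_biases = [bias(smoothed, n_flips, p) for p in test_ps]

print(all(b == 0 for b in empirical_biases))
print([float(b) for b in smoothed_biases])
```

The smoothed estimate is biased toward 1/2 everywhere except at `p = 1/2`, which is why it fails the family 1 checks even though it is often a perfectly reasonable estimate.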

The pre-packaged frequentist estimation procedure is easy: write down the empirically observed frequency as your estimate. But the derivation should now seem a bit scary (submit a panel of `nFlips+1` simultaneous estimates and confirm they simultaneously obey an uncountable family of bias check equations). And this is one of the merits of the frequentist methods: the hard derivational steps don’t have to be reapplied each time you encounter new data, so the end user may not need to know about them.

Let’s look at the same data using Bayesian methods. First we are required to supply prior beliefs on the possible values for `p`. Most typically we would operationally assume unknown `p` is beta distributed with shape parameters `(1/2,1/2)` (the Jeffreys prior) or shape parameters `(1,1)` (implementing classic Laplace smoothing). I’ll choose to use the Jeffreys prior, and in that case the posterior distribution (what we want to calculate) turns out to be a beta distribution with shape parameters `(54.5,46.5)`. Our complete posterior estimate of probable values of `p` is given by the R plot below:

```
library(ggplot2)
d <- data.frame(p=seq(0,1,0.01))
d$density <- dbeta(d$p,shape1=54.5,shape2=46.5)
ggplot(data=d) + geom_line(aes(x=p,y=density))
sum(d$p*d$density)/sum(d$density)
## [1] 0.539604
```

The posterior distribution of `p`.

And the common Bayesian method of obtaining an estimate of a summary statistic is to just compute the appropriate summary statistic from the estimated posterior distribution. So if we only want a point-estimate for `p` we can use the expected value `54.5/(54.5+46.5) = 0.539604` or the mode (maximum a posteriori value) `(54.5-1)/(54.5+46.5-2) = 0.540404` of the posterior beta distribution. But having a complete graph of an estimate of the complete posterior distribution also allows a lot more. For example: from such a graph we can work out a Bayesian credible interval (which has a given chance of containing the unknown true value `p` assuming our generative modeling assumptions and priors were both correct). And this is one of the reasons Bayesians emphasize working with distributions (instead of point-estimates): even though they can require more work to derive and use, they retain more information.

Notice the complications of having to completely specify a prior distribution have not been hidden from us. The actual application of Bayes’ law (an explicit convolution or integral relating the prior distribution to the posterior through a data likelihood function) has (thankfully) been hidden by appealing to the theory of conjugate distributions. So the Bayes theory is hiding some pain from us, but significant pain is still leaking through.

And this is common: what is commonly called a frequentist analysis is often so quick you almost can’t describe the motivation, and the Bayesian analysis seems like more work. What we want to say is this is not always the case. If there is any significant hidden state, or constraints on the possible values, then the Bayesian calculation becomes in fact easier than a fully derived frequentist calculation. And that is what we will show in our next example. But first let’s cut down confusion by fixing detailed names for a few common inference methods:

- Empirical frequency estimate. This is just the procedure of using the empirically observed frequencies as your estimate. This is commonly thought of as “the frequentist estimate.” However, we are going to reserve the term “proper frequentist estimate” for an estimate that most addresses the common frequentist criticisms: bias and loss/square-error. We will also call the empirical frequency estimate the “prescriptive frequentist estimate” as it is a simple “do what you are told” style procedure.
- Proper frequentist estimate. As we said, we are going to use this term for the estimate that most addresses the common frequentist criticisms: bias and loss/square-error. We use the traditional frequentist framework: the unknown parameters to be estimated are assumed to be fixed, and probabilities are over variations in possible observations if our measurement procedures were to be repeated. We define this estimate as an unbiased estimate that minimizes expected loss/square-error for arbitrary *possible* values of the unknown parameters to be estimated. Often the bias check conditions are so restrictive that they completely determine the proper frequentist estimate *and* cause the proper frequentist estimate to agree with the empirical frequency estimate.
- Full generative Bayesian estimate. This is a complete estimate of the entire posterior distribution of values for the unknown parameters to be estimated. This is under the traditional Bayesian framework that the observations are fixed and the unknown parameters to be estimated take on values from a non-trivial prior distribution (that is a distribution that takes on more than one possible value). Under the (very strong) assumptions that we have the correct generative model and the correct prior distribution the estimated posterior is identical to how the unknown parameters are distributed conditioned on the known observations. Thus reasonable summaries built from the full generative Bayesian estimate should be good (without explicitly satisfying conditions such as unbiasedness or minimal loss/square-error). We are avoiding the usual distinction of objective versus subjective interpretation (Bayesian usually being considered subjective if we consider the required priors subjective beliefs).
- Bayes point-estimate. This is a less common procedure. A full generative Bayesian estimate is wrapped in a procedure that hides details of the generative model, priors and Bayes inference step. What is returned is a single summary of the detailed posterior distribution such as a mean (useful for producing low square-error estimates) or mode (useful for producing maximum likelihood estimates). For our examples the Bayes point-estimate will be a procedure that returns an estimated mean (or probability/rate) using the correct generative model and uniform priors (when there is a preferred problem parameterization, otherwise we suggest looking into invariant ideas like the Jeffreys prior).

Our points are going to be: the empirical frequency estimate is very easy, but is not always the proper frequentist estimate. The proper frequentist estimate can be itself cumbersome to derive, and therefore hard to think of as “always being easier than the Bayesian estimate.” And finally one should consider something like the Bayes point-estimate when one does not want to make a complete Bayesian analysis the central emphasis of a given project. We will illustrate these points with a simple (and natural) example.

Returning to our coin-flip problem. Suppose we introduce a five sided control die that is set once (and held fixed) before we start our experiments. Then suppose each experiment is a roll of a fair six-sided die and we observe “heads” if the number of pips on the six-sided die is greater than the number (1 through 5) shown on the control die (and otherwise “tails”). The process is strongly stationary in that the probability `p` is a single fixed value for the entire series of experiments. Our imagined apparatus is depicted below.

Our apparatus (the 5-sided die is simulated with a 10-sided die labeled 1 through 5 twice).

We assume that we understand the mechanics of the generation process, but that we don’t see the details of the actual die rolls. We observe only the reported heads/tails outcomes (as shown below).

What is observed.

This may seem like a silly estimation game, but it succinctly models a number of important inference situations such as: estimating advertisement conversion rates, estimating health treatment success rates, and so on. We pick a simple formulation so that when we run into difficulties or complications it will be clear that they are essential difficulties (and not avoidable domain issues). Or: if your estimation procedures are not correct on this example, how can you expect them to be correct in more complicated real-world situations? Another good example of this kind of analysis is: Sean R. Eddy “What is Bayesian statistics” Nature Biotechnology, Vol. 22, No. 9, Sept. 2004, pp. 1177-1178. Eddy presented a clever inference problem comparing where pool balls hit a rail relative to a uniform random chalk mark on the rail. Eddy’s problem illustrates the issues of inference when there are important unobserved (or omitted) state variables. Our example is designed to allow further investigation of both Bayesian and frequentist inference in the presence of constraints (not quite the same as complete priors).

We will consider two important ways the control die could be set: by a single roll before we start observations (essentially embodying the Bayesian generative assumptions), or by a manual selection by an assumed hostile agent (justifying the usual distribution-free frequentist minimax treatment of loss/square-error).

An adversary holding the control die at a chosen value.

Let’s start with the case where the control die is set before we start measurements by a fair (uniform) roll of the five sided die. Because the control die only has 5 possible states the unknown probability `p` has exactly 5 possible values. In this case we can write down all of the bias check equations for every possible outcome of a one coin-flip simulation. For only one flip observed there are only two possible outcomes: either we see one heads or one tails. So we have two possible outcomes (giving us two variables, as we get one estimate variable per sufficient outcome) and 5 check equations (one for each possible value of `p`). The complete bias check equations are represented by the matrix `a` and vector `b` shown below:

```
> print(freqSystem(6,1))
$a
                              prob. 0 heads prob. 1 heads
check for p=0.166666666666667     0.8333333     0.1666667
check for p=0.333333333333333     0.6666667     0.3333333
check for p=0.5                   0.5000000     0.5000000
check for p=0.666666666666667     0.3333333     0.6666667
check for p=0.833333333333333     0.1666667     0.8333333
$b
                                      p
check for p=0.166666666666667 0.1666667
check for p=0.333333333333333 0.3333333
check for p=0.5               0.5000000
check for p=0.666666666666667 0.6666667
check for p=0.833333333333333 0.8333333
```

The above is just the family 1 bias check equations for our particular problem. A vector of estimates `f` is unbiased if and only if `a f - b = 0` (i.e. it obeys the equation family 1 checks). When `a` is full rank (in this case when the number of variables is no more than the number of checks) the bias check equations completely determine the unique unbiased solution (more on this later). So even in this “discrete `p`” situation: for any number of flips less than 5, the bias conditions alone completely determine the unique unbiased estimate.
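To see the "completely determine" claim in action for the one-flip case, here is a small Python sketch using exact fractions. It is not the article's `freqSystem` implementation: `bias_check_system` is a hypothetical stand-in that rebuilds the same `a` and `b` from the description above. It solves the first two check equations for the two unknown estimates and then confirms the remaining three checks hold as well.

```python
from fractions import Fraction

def bias_check_system(n_sides=6):
    """One-flip bias checks: one row per possible p, columns P(0 heads), P(1 heads)."""
    ps = [Fraction(k, n_sides) for k in range(1, n_sides)]   # 1/6 .. 5/6
    a = [[1 - p, p] for p in ps]
    b = list(ps)
    return a, b

a, b = bias_check_system()

# Solve the first two checks for (f0, f1) by Cramer's rule.
(a00, a01), (a10, a11) = a[0], a[1]
det = a00 * a11 - a01 * a10
f0 = (b[0] * a11 - a01 * b[1]) / det    # estimate reported after seeing tails
f1 = (a00 * b[1] - b[0] * a10) / det    # estimate reported after seeing heads

# Every check, including the three not used in the solve, should have zero residual.
residuals = [row[0] * f0 + row[1] * f1 - rhs for row, rhs in zip(a, b)]
print(f0, f1, all(r == 0 for r in residuals))
```

Two equations already pin down both estimates, so the other three checks can only agree or make the system infeasible; here they agree.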

What we are trying to show is that when we move away from the procedure “copy the observed frequency as your estimate” to the more foundational “pick an unbiased family of estimates with minimal expected square-error”, then frequentist reasoning appears a bit more complicated. Let’s continue with a frequentist analysis of this problem (this time in python instead of R, see here for the complete code).

The common “everything wrapped in a bow” prescriptive empirical frequency procedure is by far the easiest estimate:

```
import numpy

# Build the traditional frequentist empirical estimates of
# the expected value of the unknown quantity pWin
# for each possible observed outcome of number of wins
# seen in kFlips trials
def empiricalMeansEstimates(nSides,kFlips):
    return numpy.array([ j/float(kFlips) for j in range(kFlips+1) ])
```

And if we load this code (and all of its pre-conditions) we get the following estimates of `p` if we observe one coin experiment:

```
>>> printEsts(empiricalMeansEstimates(6,1))
pest for 0 heads 0.0
pest for 1 heads 1.0
```

Using our bias check equations we can confirm this solution is indeed unbiased:

```
>>> sNK = freqSystem(6,1)
>>> printBiasChecks(matMulFlatten(sNK['a'], \
...     empiricalMeansEstimates(6,1)) - flatten(sNK['b']))
bias for p=0.166666666667 0.0
bias for p=0.333333333333 0.0
bias for p=0.5 0.0
bias for p=0.666666666667 0.0
bias for p=0.833333333333 0.0
```

And has moderate loss/square-errors:

```
>>> printLosses(losses(6,empiricalMeansEstimates(6,1)))
exp. sq error for p= 0.166666666667 0.138888888889
exp. sq error for p= 0.333333333333 0.222222222222
exp. sq error for p= 0.5 0.25
exp. sq error for p= 0.666666666667 0.222222222222
exp. sq error for p= 0.833333333333 0.138888888889
```

But the solution is kind of icky. Remember, this result was completely determined by the unbiased check conditions. It says if we observe one coin experiment and see tails then the estimate for `p` is zero, if we see heads the estimate for `p` is one. Both of these estimates are well outside the range of possible values for `p`! Recall our heads/tails coin events are assigned “heads” if the number of pips on the 6-sided die exceeds the mark on the control die (which are the numbers 1 through 5). Thus `p` only takes on values in the range `1/6` (when the control is `5`) through `5/6` (when the control is `1`). In fact `p` is always going to be one of the values: `1/6`, `2/6`, `3/6`, `4/6`, or `5/6`. The frequentist analysis is failing to respect these known constraints (which are weaker than assuming actual priors).

We can try fixing this with a simple procedure such as Winsorising or knocking everything back into range. For example the estimate `[1/6,5/6]` is biased but has improved loss/square-error:

```
>>> w = [1/6.0,5/6.0]
>>> printBiasChecks(matMulFlatten(sNK['a'], w) - flatten(sNK['b']))
bias for p=0.166666666667 0.111111111111
bias for p=0.333333333333 0.0555555555556
bias for p=0.5 0.0
bias for p=0.666666666667 -0.0555555555556
bias for p=0.833333333333 -0.111111111111
>>> printLosses(losses(6,w))
exp. sq error for p=0.166666666667 0.0740740740741
exp. sq error for p=0.333333333333 0.101851851852
exp. sq error for p=0.5 0.111111111111
exp. sq error for p=0.666666666667 0.101851851852
exp. sq error for p=0.833333333333 0.0740740740741
```

There are other ideas for fixing estimates (such as shrinkage to reduce expected square-error, or quantization to improve likelihood). But the point is these are not baked into the traditional simple empirical frequency estimate. Once you start adding all of these features you may have a frequentist estimator that is as complicated as a Bayesian estimator is thought to be, and a frequentist estimator that is no longer considered pure with respect to traditional frequentist criticisms.
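To make the “knock everything back into range” idea concrete, here is a small sketch (the helper names `clip_estimates` and `expected_sq_errors` are hypothetical, not part of the article's code, and we assume a binomial model for the flips). It clips the one-flip empirical estimate into `[1/6,5/6]` and compares expected square-errors before and after:

```python
from fractions import Fraction
from math import comb

# The five possible win probabilities in the dice game: 1/6 ... 5/6.
P_VALUES = [Fraction(i, 6) for i in range(1, 6)]

def clip_estimates(estimates, lo=Fraction(1, 6), hi=Fraction(5, 6)):
    """Knock every estimate back into the known range [lo, hi]."""
    return [min(max(e, lo), hi) for e in estimates]

def expected_sq_errors(estimates):
    """Expected square error E[(estimate - p)^2 | p] for each possible p."""
    k = len(estimates) - 1
    return [
        sum(comb(k, h) * p**h * (1 - p)**(k - h) * (estimates[h] - p)**2
            for h in range(k + 1))
        for p in P_VALUES
    ]

empirical = [Fraction(0), Fraction(1)]   # one flip: tails -> 0, heads -> 1
clipped = clip_estimates(empirical)      # tails -> 1/6, heads -> 5/6

for p, before, after in zip(P_VALUES,
                            expected_sq_errors(empirical),
                            expected_sq_errors(clipped)):
    print(f"p={float(p):.4f} empirical={float(before):.4f} clipped={float(after):.4f}")
```

Clipping is only one of many possible repairs; the point of the sketch is that it is an extra step bolted onto the empirical estimate, not something the empirical estimate gives you for free.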

Let’s switch to the Bayes analysis for the game where the 5-sided control die is set uniformly at random. A good Bayes point-estimate is easy to derive, as the appropriate priors for `p` are obvious (uniform on `1/6`, `2/6`, `3/6`, `4/6`, `5/6`). Our Bayes point-estimates for the expected value of `p` turn out to be:

```
>>> printEsts(bayesMeansEstimates(6,1))
pest for 0 heads 0.388888888889
pest for 1 heads 0.611111111111
```

Which means: for 1 tails we estimate `p=0.38888889` and for 1 heads we estimate `p=0.61111111`. Notice these estimates are strictly inside the range `[1/6,5/6]` (pulled in by `2/9` in both cases). Also notice because we have wrapped the Bayes estimate in code it appears no more complicated to the user than the empirical estimate (sure the code is larger than the empirical estimate, but that is exactly what an end user does not need to see). We have *intentionally* hidden from the user some important design choices (priors, the Bayes step convolution, use of a mean estimate instead of a mode). The estimator (see here or here) has wrapped up proposing a prior distribution, deriving the posterior distribution from the data likelihood equations (applying Bayes law), and then returning the expected value of the posterior as a single point-estimate. In addition to hiding the implementation details, we have refrained from (or at least delayed) educating the user out of their desire for a simple point-estimate. We have not insisted the user/consumer of the result learn to use the (superior) complete posterior distribution instead of mere point-estimates. For a Bayes estimate to be replacement compatible with a frequentist one we need to (at least initially) put it into the same format as the frequentist estimate it is competing with. This squanders a number of the advantages of the Bayes posterior, but as we will see the Bayes estimate still has lower expected square-error (is more efficient) than the frequentist one. So initially offering a Bayes estimate as a ready to go replacement for the frequentist estimate is of some value, and we don’t want to lose that value by initially requiring additional user training.
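The three wrapped-up steps (propose a prior, apply Bayes law, return the posterior mean) can be sketched in a few lines. This is an illustrative re-derivation, not the article's `bayesMeansEstimates` implementation; the function name `bayes_mean_estimates` and its signature are assumptions made for the example:

```python
from fractions import Fraction

# The five possible win probabilities in the dice game: 1/6 ... 5/6.
P_VALUES = [Fraction(i, 6) for i in range(1, 6)]

def bayes_mean_estimates(k_flips, prior=None):
    """Posterior mean of p for each possible number of heads in k_flips flips."""
    if prior is None:
        # Uniform prior over the five possible values of p.
        prior = [Fraction(1, len(P_VALUES))] * len(P_VALUES)
    estimates = []
    for h in range(k_flips + 1):
        # Prior times likelihood; the binomial coefficient is the same for
        # every p, so it cancels when we normalize.
        weights = [w * p**h * (1 - p)**(k_flips - h)
                   for w, p in zip(prior, P_VALUES)]
        total = sum(weights)
        # Expected value of p under the (normalized) posterior.
        estimates.append(sum(w * p for w, p in zip(weights, P_VALUES)) / total)
    return estimates

print([float(e) for e in bayes_mean_estimates(1)])
```

For one flip the exact posterior means work out to `7/18` and `11/18`, which agree with the decimal values quoted above. Passing a different `prior` shows how much of the answer is a design choice rather than data.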

Unfortunately this Bayes point-estimate solution is biased, as we confirm here:

```
>>> printBiasChecks(matMulFlatten(sNK['a'], \
...     bayesMeansEstimates(6,1)) - flatten(sNK['b']))
bias for p=0.166666666667 0.259259259259
bias for p=0.333333333333 0.12962962963
bias for p=0.5 0.0
bias for p=0.666666666667 -0.12962962963
bias for p=0.833333333333 -0.259259259259
```

But, as we mentioned, our Bayes point-estimate has some advantages. Let’s also look at the expected loss each estimate would give for every possible value of the unknown probability `p`:

```
>>> printLosses(losses(6,bayesMeansEstimates(6,1)))
exp. sq error for p= 0.166666666667 0.0740740740741
exp. sq error for p= 0.333333333333 0.0277777777778
exp. sq error for p= 0.5 0.0123456790123
exp. sq error for p= 0.666666666667 0.0277777777778
exp. sq error for p= 0.833333333333 0.0740740740741
```

Notice that the Bayes estimate has smaller expected square-error (or in statistical parlance is a more efficient estimator) no matter what value `p` takes. The unbiased check conditions forced the frequentist estimate to a high expected square-error estimator. This means demanding the estimator be strictly unbiased may not be a good trade-off (and the frequentist habit of deriding other estimators for “not being unbiased” may not always be justified). To be fair bias can be a critical flaw if you intend to aggregate it with other estimators later (as enough independent unbiased estimates can be averaged to reduce noise, which is not always true for biased estimators).
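One way to see the trade-off is the standard decomposition of expected square-error into variance plus squared bias. The sketch below is illustrative only: `decompose` is a made-up helper, and the values `7/18` and `11/18` are the exact fractions corresponding to the one-flip Bayes estimates printed earlier.

```python
from fractions import Fraction
from math import comb

# The five possible win probabilities in the dice game: 1/6 ... 5/6.
P_VALUES = [Fraction(i, 6) for i in range(1, 6)]

def decompose(estimates):
    """Return a (bias, variance, expected square error) triple for each p."""
    k = len(estimates) - 1
    rows = []
    for p in P_VALUES:
        # Probability of seeing h heads in k flips when the win rate is p.
        probs = [comb(k, h) * p**h * (1 - p)**(k - h) for h in range(k + 1)]
        mean = sum(q * e for q, e in zip(probs, estimates))
        variance = sum(q * (e - mean)**2 for q, e in zip(probs, estimates))
        mse = sum(q * (e - p)**2 for q, e in zip(probs, estimates))
        rows.append((mean - p, variance, mse))
    return rows

empirical = [Fraction(0), Fraction(1)]
bayes = [Fraction(7, 18), Fraction(11, 18)]

for p, (eb, ev, em), (bb, bv, bm) in zip(P_VALUES, decompose(empirical), decompose(bayes)):
    print(f"p={float(p):.3f} empirical: bias={float(eb):.3f} var={float(ev):.3f} | "
          f"bayes: bias={float(bb):.3f} var={float(bv):.3f}")
```

The empirical estimate spends all of its error on variance; the Bayes estimate accepts some bias in exchange for a much smaller variance.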

Let’s give the frequentist estimate another chance. For our discrete set of possible values of `p` (`1/6`, `2/6`, `3/6`, `4/6`, `5/6`) once the number of coin-flips is large enough the equation family 1 bias checks no longer completely determine the estimate. So it is no longer immediately obvious that the observed empirical frequency is minimal loss. In fact it is not, so we can no longer consider the canned empirical solution to be the unique optimal estimate. Note this differs from the case where `p` takes on many different values from a continuous interval, which is enough to ensure the bias check conditions completely determine a unique solution. Continuing with an example: if we observed 7 flips an improved frequentist estimate (under the idea it is an unbiased point-estimate with minimal expected square-error) is as follows:

```
>>> printEsts(newSoln)
pest for 0 heads 0.0319031034157
pest for 1 heads 0.111845090806
pest for 2 heads 0.296666330987
pest for 3 heads 0.439170280769
pest for 4 heads 0.560830250198
pest for 5 heads 0.703332297349
pest for 6 heads 0.888156558984
pest for 7 heads 0.968095569167
```

To say we decrease loss we have to decide on a scalar definition of loss: be it maximum loss, total loss or some other criterion. This solution was chosen to decrease maximum loss (an idea compatible with frequentist philosophy) and was found through constrained optimization. Notice this solution is not the direct empirical relative frequency estimate. For example: in this estimate if you see seven tails in a row you estimate `p=0.0319031` not `p=0` (though we still have `0.0319031 < 1/6` which is an out of bounds estimate). This estimate is a pain to work out (the technique I used involved optimizing a move in directions orthogonal to the under-rank bias check conditions; perhaps some clever math would allow us to consider this solution obvious, but that is not the point). It is not important if this new solution is actually optimal, what is important is it is unbiased and has a smaller maximum loss (meaning the empirical estimate itself can not be considered optimal in that sense). The fact that the unknown probability `p` can only be one of the values `1/6`, `2/6`, `3/6`, `4/6`, `5/6` has changed which unbiased estimate is in fact the minimal loss one (added a new lower loss solution that would not be considered unbiased if `p` could choose from more possible values).
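The “under-rank” remark can be checked directly: with five possible values of `p` there are only five bias check equations, so once there are more than five unknown estimates some freedom is left over. The sketch below is not the optimization used in the article; it only counts the leftover degrees of freedom, using hypothetical helpers `check_matrix`, `rank` and `free_dimensions` and exact rational arithmetic:

```python
from fractions import Fraction
from math import comb

# The five possible win probabilities in the dice game: 1/6 ... 5/6.
P_VALUES = [Fraction(i, 6) for i in range(1, 6)]

def check_matrix(k_flips):
    """One row per possible p; column h holds P[h heads in k_flips flips | p]."""
    return [[comb(k_flips, h) * p**h * (1 - p)**(k_flips - h)
             for h in range(k_flips + 1)]
            for p in P_VALUES]

def rank(matrix):
    """Exact matrix rank by Gaussian elimination over the rationals."""
    rows = [list(r) for r in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def free_dimensions(k_flips):
    """Number of directions an unbiased estimate can move and stay unbiased."""
    return (k_flips + 1) - rank(check_matrix(k_flips))

for k in (1, 4, 5, 7):
    print(k, "flips:", free_dimensions(k), "free dimension(s)")
```

Zero free dimensions means the bias checks pin down the estimate completely; a positive count is the room an optimizer has to trade the empirical estimate for a lower-loss unbiased one.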

Depending on your application it can be the case that either the frequentist or the Bayesian estimate has better utility. But it is unusual for the frequentist estimate to be the harder one to calculate (as is the case here).

The Bayes solution in this case is:

```
>>> printEsts(bayesSoln)
pest for 0 heads 0.203065668302
pest for 1 heads 0.251405546037
pest for 2 heads 0.33603150662
pest for 3 heads 0.443861984801
pest for 4 heads 0.556138015199
pest for 5 heads 0.66396849338
pest for 6 heads 0.748594453963
pest for 7 heads 0.796934331698
```

This is still biased, but all values are in range and the losses are smaller than the frequentist losses for all possible values of `p` (again limited to: `1/6`, `2/6`, `3/6`, `4/6`, `5/6`).

To be fair the differences in loss/square-error are small (and shrinking rapidly as the number of observed flips goes up, so it is a small data problem). The point we want to make isn’t which estimate is better (that depends on how you are going to use the estimate, your domain, and your application), but the idea that: Bayesian methods are not necessarily more painful than frequentist procedures. The Bayesian estimation procedure requires more from the user (the priors) and has an expensive and complicated convolution step to use the data to relate the priors to the posteriors (unless you are lucky enough to have something like the theory of conjugate distributions to hide this step). The frequentist estimation procedure seems to be as simple as “copy over your empirical observation as your estimate.” That is unless you have significant hidden state, constraints or discreteness (not the same as having priors). When you actually have to justify the frequentist inference steps (versus just benefiting from them) you find you have to at least imagine submitting every possible inference you could make as a set of variables and picking a minimax solution optimizing expected square-error over the unknown quantities while staying in the linear flat of unbiased solutions (itself a complicated check).

Note that each style analysis is correct on its own terms and is not always compatible with the assumptions of the other. This doesn’t give one camp a free-card to criticize the other.

My advice is: Bayesians need to do a better job of wrapping standard simple analyses (you shouldn’t have to learn and fire up Stan for this sort of thing), and we all need to be aware that *proper* frequentist inference is not always just the common simple procedure of copying over the empirical observations.

For full implementations/experiments (and results) click here for R and here for python.

`data.matrix` when you mean `model.matrix`. If you do so you may lose (without noticing) a lot of your model’s explanatory power (due to poor encoding).

For some modeling tasks you end up having to prepare a special expanded data matrix before calling a given machine learning algorithm. For example the `randomForest` package advises:

For large data sets, especially those with large number of variables, calling randomForest via the formula interface is not advised: There may be too much overhead in handling the formula.

Which means you may want to prepare a matrix of exactly the values you want to use in your prediction (versus using the more common and convenient formula interface). As R supplies good tools for this, this is not a big problem. Unless you (accidentally) use `data.matrix` as shown below.

```
d <- data.frame(x=c('a','b','c'),y=c(1,2,3))
print(data.matrix(d))
##      x y
## [1,] 1 1
## [2,] 2 2
## [3,] 3 3
```

Notice in the above `x` is converted from its original factor levels `'a'`, `'b'` and `'c'` to the numeric quantities `1`, `2` and `3`. The problem is this introduces an unwanted order relation in the `x`-values. For example any linear model is going to be forced to treat the effect of `x='b'` as being between the modeled effects of `x='a'` and `x='c'` even if this is not an actual feature of the data. Now there are cases when you want ordinal constraints and there are ways (like GAMs) to learn non-monotone relations on numerics. But really what has happened is you have not used a rich enough encoding of this factor.

Usually what you want is `model.matrix` which we demonstrate below:

```
print(model.matrix(~0+x+y,data=d))
##   xa xb xc y
## 1  1  0  0 1
## 2  0  1  0 2
## 3  0  0  1 3
## attr(,"assign")
## [1] 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
```

The `0+` notation is telling R to not add an “intercept” column (a column whose value is always `1`, a useful addition when doing linear regression). In this case it also had the side-effect of allowing us to see all three derived indicator variables (usually all but one are shown, more on this later).

What `model.matrix` has done is used the idea of indicator variables (implemented through `contrasts`) to re-encode the single string-valued variable `x` as a set of indicators. The three possible values (or levels) of `x` (`'a'`, `'b'` and `'c'`) are encoded as three new variables: `xa`, `xb` and `xc`. These new variables are related to the original `x` as follows:

| `x` | `xa` | `xb` | `xc` |
| --- | --- | --- | --- |
| `'a'` | `1` | `0` | `0` |
| `'b'` | `0` | `1` | `0` |
| `'c'` | `0` | `0` | `1` |

It is traditional to suppress one of the derived variables (in this case `xa`) yielding the following factor as a set of indicators representation:

| `x` | `xb` | `xc` |
| --- | --- | --- |
| `'a'` | `0` | `0` |
| `'b'` | `1` | `0` |
| `'c'` | `0` | `1` |

And this is what we see if we don’t add the `0+` notation to our `model.matrix` formula:

```
print(model.matrix(~x+y,data=d))
##   (Intercept) xb xc y
## 1           1  0  0 1
## 2           1  1  0 2
## 3           1  0  1 3
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
```

When you use the formula interface R performs these sorts of conversions for you under the hood. This is why seemingly pure numeric models (like `lm()`) can use string-valued variables. R performing these conversions spares the analyst a lot of messy column bookkeeping. Once you get used to this automatic conversion you really come to miss it (such as in `scikit-learn`’s random forest implementation).

The traditional reason to suppress the `xa` variable is there is a redundancy when we use all the indicator variables. Since the indicators always sum to one we can always infer one missing indicator by subtracting the sum of all the others from one. In fact this redundancy is a linear dependence among the indicators, and this can actually cause trouble for some naive implementations of linear regression (though we feel this is much better handled using L2 regularization ideas).
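For readers working in an environment that does not do this conversion for them, here is a hand-rolled sketch of the indicator encoding and of the redundancy just described. It is written in plain Python rather than R; `indicator_encode` is a hypothetical helper, not a library function:

```python
def indicator_encode(values, levels, drop_first=False):
    """Encode each value as a row of 0/1 indicators, one column per level.

    With drop_first=True the first level is suppressed, mirroring the
    usual treatment-contrast convention.
    """
    kept = levels[1:] if drop_first else levels
    return [[1 if v == level else 0 for level in kept] for v in values]

x = ['a', 'b', 'c']
levels = ['a', 'b', 'c']

full = indicator_encode(x, levels)                      # columns xa, xb, xc
reduced = indicator_encode(x, levels, drop_first=True)  # columns xb, xc

# Every full row sums to one, so the suppressed column carries no new
# information: it can be rebuilt from the columns that were kept.
recovered_xa = [1 - sum(row) for row in reduced]

print(full)
print(reduced)
print(recovered_xa)
```

The `recovered_xa` line is the linear dependence in miniature: one minus the sum of the kept indicators gives back the dropped one.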

At some point you are forced to “production harden” your code and deal directly with level encodings yourself. The built-in R encoding scheme is not optimal for factors with large numbers of distinct levels, rare levels (whose fit coefficients won’t achieve statistical significance during training), missing values, or the appearance of new levels after training. It is a simple matter of feature engineering to deal with each of these situations.

Some recipes for working with problem factors include:

- Modeling Trick: Impact Coding of Categorical Variables with Many Levels (deals with factors with large numbers of levels).
- Section 4.1.1 of Practical Data Science with R (deals with missing variables by introducing additional “masking variables”).

Novel values (factor levels not seen during training, but that show up during test or application) can be treated as missing values, or you can work out additional feature engineering ideas to improve model performance.
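A minimal sketch of the “treat novel values as missing” idea, in plain Python (both helper names are hypothetical, and using a single extra flag column is just one possible design): learn the levels from training data, then send anything missing or unseen to that extra indicator.

```python
def fit_levels(training_values):
    """Collect the sorted set of non-missing levels seen during training."""
    return sorted({v for v in training_values if v is not None})

def encode_with_fallback(values, levels):
    """One indicator per known level, plus a final column that flags
    missing values and levels not seen during training."""
    rows = []
    for v in values:
        known = v in levels
        rows.append([1 if v == level else 0 for level in levels]
                    + [0 if known else 1])
    return rows

train_x = ['a', 'b', None, 'c', 'a']
levels = fit_levels(train_x)

# 'd' never appeared in training and None is a missing value;
# both land in the final fallback column instead of causing an error.
print(encode_with_fallback(['a', 'd', None], levels))
```

Because the fallback column exists at training time too (it is set for the training `None`), a model fit on this encoding already has a coefficient ready for whatever shows up later.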

`data.frame`s. That may seem needlessly heavy-weight, but it has a lot of down-stream advantages.

The usual mental model of R’s basic types starts with the scalar/atomic types like double precision numbers. R doesn’t actually routinely expose such a type to users, as what we think of as numbers in R are actually length one arrays or vectors. So you can easily write functions like the following:

```
typical <- function(x) { mean(x) }
print(typical(c(1,2,3,4)))
## [1] 2.5
```

You eventually evolve to wanting functions that return more than one result and the standard R solution to this is to use a named list:

```
typical <- function(x) { list(mean=mean(x),median=median(x)) }
print(typical(c(1,2,3,4)))
## $mean
## [1] 2.5
##
## $median
## [1] 2.5
```

Consider, however, returning a `data.frame` instead of a list:

```
typical <- function(x) { data.frame(mean=mean(x),median=median(x)) }
print(typical(c(1,2,3,4)))
##   mean median
## 1  2.5    2.5
```

What this allows is convenient for-loop free batch code using `plyr`’s `adply()` function:

```
library(plyr)
d <- list(x=c(1,2,3,4),y=c(5,6,700))
print(adply(d,1,typical))
##   X1  mean median
## 1  x   2.5    2.5
## 2  y 237.0    6.0
```

You get convenient for-loop free code that collects all of your results into a single result `data.frame`. You also get real flexibility in that your underlying function can (in addition to returning multiple columns) safely return multiple (or even varying numbers of) rows. We don’t use this extra power in this small example.

We did need to handle multiple rows when generating run-timings of the `step()` function applied to a `lm()` model. The `microbenchmark` suite runs an expression many times to get a distribution of run times (run times are notoriously unstable, so you should always report a distribution or summary of distribution of them). We ended up building a function called `timeStep()` which timed a step-wise regression of a given size. The `data.frame` wrapping allowed us to easily collect and organize the many repetitions applied at many different problem sizes in a single call to `adply`:

```
timeStep <- function(n) {
  dTraini <- adply(1:(n/dim(dTrainB)[[1]]),1,function(x) dTrainB)
  modeli <- lm(y~xN+xC,data=dTraini)
  data.frame(n=n,stepTime=microbenchmark(step(modeli,trace=0))$time)
}
plotFrameStep <- adply(seq(1000,10000,1000),1,timeStep)
```

(See here for the actual code this extract came from, and here for the result.)

This is much more succinct than the original for-loop solution (which requires a lot of needless packing and then unpacking) or the per-column sapply solution (which depends on the underlying timing returning only one row and one column; which should be thought of not as natural, but as a very limited special case). With the richer `data.frame` data structure you are not forced to organize your computation as an explicit sequence over rows or an explicit sequence over columns. You can treat things as abstract batches where intermediate functions don’t need complete details on row or column structures (making them more reusable).

In many cases data-frame returning functions allow more powerful code as they allow multiple return values (the columns) and multiple/varying return instances (the rows). Adding such functions to your design toolbox allows for better code with better designed separation of concerns between code components. Also it sets things up in a very `plyr` friendly format.

Note: Nina Zumel pointed out that some complex structures (like complete models) can not always be safely returned in `data.frame`s, so you would need to use lists in that case.

An interesting example of this is `POSIXlt`. Compare `print(class(as.POSIXlt(Sys.time())))`, `print(class(data.frame(t=as.POSIXlt(Sys.time()))$t))`, and `d <- data.frame(t=0); d$t <- as.POSIXlt(Sys.time()); print(class(d$t))`.

Some other stuff reads differently after this though.

For example I finally got around to skimming Li, K-C. (1991) “Sliced Inverse Regression for Dimension Reduction”, Journal of the American Statistical Association, 86, 316–327. The problem formulation in this paper is very clever: suppose `y` isn’t just a linear function of the `x`, but a linear function of an unknown low rank linear image of them. In this case how do you efficiently infer? This is a clear statement of an idea that can be used to move a lot of current “wide data” (lots of variables) heuristics onto solid ground. Very very roughly the analysis method involves working with something the authors call “the inverse regression curve” which is defined as `E[x|y] - E[x]` (`y` being an outcome variable and `x` being an instrumental variable).

Now I have a bit of math background, so I am familiar with the idea that “inverse”, “reverse”, or “co” is a way to sex things up. Too many people have written about homologies? Then write about co-homologies! Another example: Avis, Fukuda, “Reverse Search for Enumeration”, Discrete and Applied Mathematics, 1993, volume 65, pp. 21-46 (which actually is a quite good result and paper). There are technical distinctions, but perhaps you should check your arrows if you are sprinkling your titles with “inverse”, “reverse”, or “co” (weak attempt at a category theory joke).

If we treat these inverse regression curves very loosely (as we do in other writings about regression), we can try to find a pre-existing common idea or procedure that it may at least be similar to.

Suppose we are in the special case where `x` and `y` are both indicator variables that each take the value `1` when their respective conditions are met and are `0` otherwise. So instead of working with `E[x]` and `E[x|y]` we work with the related quantities `P[x=TRUE]` and `P[x=TRUE|y=TRUE]` (actually `E[x|y]` is encoding information about both `P[x=TRUE|y=TRUE]` and `P[x=TRUE|y=FALSE]`, but let us allow this further specialization for convenience of notation).

In Chapter 6 of Practical Data Science with R we suggest re-encoding variables as their log change in likelihood: `log(P[y=TRUE|x=TRUE]/P[y=TRUE])`. Now PDSwR was written long after inverse regression was invented, so we are in no way claiming priority. But we certainly are not the first people to use a log likelihood ratio. Let’s work with this quantity a bit:

By Bayes’ law `P[y=TRUE|x=TRUE] = P[x=TRUE|y=TRUE] P[y=TRUE] / P[x=TRUE]`. So `log(P[y=TRUE|x=TRUE]/P[y=TRUE]) = log(P[x=TRUE|y=TRUE]/P[x=TRUE])` which is in turn equal to `log(P[x=TRUE|y=TRUE]) - log(P[x=TRUE])`.

So we have `log(P[y=TRUE|x=TRUE]/P[y=TRUE]) = log(P[x=TRUE|y=TRUE]) - log(P[x=TRUE])`. The term on the left is the change in log likelihood of `y=TRUE` given `x=TRUE` (which we argue is a very useful and natural quantity to work with). The term on the right is in the same form as the inverse regression curve except we are writing `log(P[])` instead of `E[]`. So I would argue by analogy that the quantities `E[x|y]-E[x]` and `E[y|x]-E[y]` (modulo some centering and scaling monkey business) are likely of similar utility in a regression (by analogy to Bayes’ law). You can likely do whatever dimension reduction you want on either (though I prefer the `E[y|x]-E[y]` forward form as it seems more natural and is scale-invariant with respect to `x`).
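The identity is easy to check numerically. The sketch below uses a made-up joint distribution over two indicator variables (the numbers are purely illustrative, not data from any source) and confirms the forward and inverse forms agree:

```python
from math import log, isclose

# Hypothetical joint distribution P[x, y] over two indicator variables.
joint = {
    (True, True): 0.30,
    (True, False): 0.10,
    (False, True): 0.20,
    (False, False): 0.40,
}

p_x = sum(p for (x, _), p in joint.items() if x)   # P[x=TRUE]
p_y = sum(p for (_, y), p in joint.items() if y)   # P[y=TRUE]
p_xy = joint[(True, True)]                         # P[x=TRUE, y=TRUE]

# Forward form: log(P[y=TRUE|x=TRUE] / P[y=TRUE])
forward = log((p_xy / p_x) / p_y)

# Inverse form: log(P[x=TRUE|y=TRUE]) - log(P[x=TRUE])
inverse = log(p_xy / p_y) - log(p_x)

print(forward, inverse)
```

Both expressions reduce to the log of `P[x=TRUE,y=TRUE] / (P[x=TRUE] P[y=TRUE])`, which is why neither direction has a monopoly on the information.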

Maybe the original paper needs the original quantity for some of the later steps, but that requires a much more thorough reading.

We admit, it isn’t the only book we own. Some relevant books from the Win-Vector LLC company library include:

And a few more from our digital bookshelf:

“Practical Data Science with R” stands out in that it:

- Concentrates on the process of data science (working with teams and tools to deploy predictive models into production).
- Spends more time on how to acquire and load non-trivial data sets (including working with SQL, CSV files, and Excel).
- Spends more time on data treatment (which allows standard modeling methods to be used in new and powerful ways).
- Deals with real world issues such as setting expectations and producing presentations (not strictly a part of machine learning, but very much a part of data science).
- Includes free code and data to reproduce almost every analysis and graph in the book (and there are a lot of them).
- Many data scientists say “you spend 90% of your time preparing your data for analysis.” Our book actually spends time explaining these steps.
- Prepares you to use many of these other books.

Work through “Practical Data Science with R” and you will learn a lot about the practice of data science.

Why do we need a book on data science? Some ask “is data science just a fad?” and “we have had statistics for hundreds of years, so why do we need data science?” Obviously the term “data science” is on the high portion of its hype cycle. But data science is a real and important discipline. One way it differs from statistics (which itself is an important tool needed by data scientists) is: data science involves a lot more programming, a lot more work on data architecture, a lot more tools, and a lot more domain/client empathy. Statisticians already do a lot of programming, but data scientists can end up doing even more. I would say one of the assumptions of data science is: *there is a client* (either real or imagined) that the data scientist is working for (similar to the customer role in agile development). Data scientists also tend to use a large number of tools (you can start with R, but depending on your client needs you may need to eventually work with many more tools). We feel that there is a significant gap in the teaching of the gestalt of data science that “Practical Data Science with R” fills.

The methods “Practical Data Science with R” teaches are entirely based on free and open source software (R, RStudio, SQuirreL SQL, H2 DB, and others) and are cross platform (running on OSX, Linux, and Windows). So once you buy the book, you are ready to start work on significant projects.

If you feel “Practical Data Science with R” doesn’t go deep enough on foundational topics (such as R itself, statistics or SQL) we suggest consulting one or more of the following in parallel:

- Kabacoff “R in Action” 2nd edition (our current favorite book about R and statistics).
- Freedman, Pisani, Purves “Statistics” 4th edition (good writing on statistics).
- Celko “SQL for Smarties” 4th edition (clear writing about advanced query techniques, learn SQL before you try big data tools such as Hive).

“Practical Data Science with R” emphasizes the business questions (such as determining what type of score is actually useful for your client) and assumes machine learning is something you can delegate to ready-made algorithms (which is the main reason to use R). If you want to move on to machine learning algorithm design and analysis try:

- Hastie, Tibshirani, Friedman “The Elements of Statistical Learning” 2nd edition: *the* book on analyzing machine learning algorithms.
- James, Witten, Hastie, Tibshirani “An Introduction to Statistical Learning: with Applications in R”: an R example oriented introduction to statistical machine learning.
- Kuhn, Johnson “Applied Predictive Modeling” Theory (and worked examples in R) of building and tuning predictive analytic models.

If you want interesting descriptions of data science (something to share with your boss or colleagues) we suggest checking out:

- Provost, Fawcett “Data Science for Business” (a description of data science for “people who will be working with data scientists”).
- O’Neil, Schutt “Doing Data Science” (guest presentations from a data science class bound together as a set of essays).

Good books, in the mind of a good reader, amplify each other (not detract from each other). The fact that Celko is an excellent book on SQL doesn’t lessen Hastie/Tibshirani/Friedman’s authoritativeness on statistical machine learning. Yet these are all topics that are relevant to data science.

All of that being said: we think “Practical Data Science with R” is one of the best introductions to data science. “Practical Data Science with R” attempts to convey the actual process of data science through worked examples (that may include programming, SQL, machine learning, and presenting to clients). The data scientist may not equally enjoy all of the sub-steps and sub-specialties, but is expected (by discerning clients) to do (or delegate) them all.

If you want to try your hand at a data science project we strongly recommend “Practical Data Science with R.” Available from our publisher, Amazon.com, and other booksellers.

Feel free to visit here to freely inspect “Practical Data Science with R”’s:

- Table of Contents
- Foreword
- Preface
- About this book
- Chapter 3
- Chapter 8
- Index
- help forum
- example data
- code

And some excerpts from Amazon reviews:

“This is the book that I wish was available when I was first learning Data Science.”

J. Fister

Paulo Nuin Suano

“Thankfully, this book is a welcome bridge.”

David M. Steier

`glm` models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.

As many R users know (but often forget), a `glm` model object carries a copy of its training data by default. You can use the settings `y=FALSE` and `model=FALSE` to turn this off.

```
set.seed(2325235)

# Set up a synthetic classification problem of a given size
# and two variables: one numeric, one categorical
# (two levels).
synthFrame = function(nrows) {
  d = data.frame(xN=rnorm(nrows),
                 xC=sample(c('a','b'),size=nrows,replace=TRUE))
  d$y = (d$xN + ifelse(d$xC=='a',0.2,-0.2) + rnorm(nrows))>0.5
  d
}

# first show that model=F and y=F help reduce model size

dTrain = synthFrame(1000)
model1 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'))
model2 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'),
             y=FALSE)
model3 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'),
             y=FALSE, model=FALSE)

#
# Estimate the object's size as the size of its serialization
#
length(serialize(model1, NULL))
# [1] 225251
length(serialize(model2, NULL))
# [1] 206341
length(serialize(model3, NULL))
# [1] 189562

dTest = synthFrame(100)
p1 = predict(model1, newdata=dTest, type='response')
p2 = predict(model2, newdata=dTest, type='response')
p3 = predict(model3, newdata=dTest, type='response')
sum(abs(p1-p2))
# [1] 0
sum(abs(p1-p3))
# [1] 0
```

So we see (as expected) that removing the training data from the model decreases the size of the model (as estimated by the size of its serialization), without affecting the model’s predictions. What happens when you increase the training data size? The size of the model (with `y=FALSE` and `model=FALSE`) *should not* grow.

```
ndata = seq(from=0, to=100000, by=5000)[-1]

#
# A function to estimate the size of a model for
# our synthetic problem, with a training set of size n
#
getModelSize = function(n) {
  data = synthFrame(n)
  model = glm(y~xN+xC,data=data,family=binomial(link='logit'),
              y=FALSE, model=FALSE)
  length(serialize(model, NULL))
}

size1 = sapply(ndata, FUN=getModelSize)

library(ggplot2)

ggplot(data.frame(n=ndata, modelsize=size1), aes(x=n, y=modelsize)) +
  geom_point() + geom_line()
```

Lo and behold, we see that the model size *still* grows linearly in the size of the training data! The model objects are still holding something that is proportional to the size of the training data. Where?

We can use our serialization trick to find the size of the individual components of a model:

```r
breakItDown = function(mod) {
  sapply(mod, FUN=function(x){length(serialize(x, NULL))}, simplify=TRUE)
}
```

Now let’s compare two models trained with datasets of different sizes (one ten times the size of the other).

```r
mod1 = glm(y~xN+xC,data=synthFrame(1000),
           family=binomial(link='logit'), y=FALSE, model=FALSE)
c1 = breakItDown(mod1)

mod2 = glm(y~xN+xC,data=synthFrame(10000),
           family=binomial(link='logit'), y=FALSE, model=FALSE)
c2 = breakItDown(mod2)

# For pretty-printing a vector to a vertical blog-friendly format:
# return a string of vector formatted as a column with names
# use cat to echo the value
vfmtN = function(v) {
  width = max(sapply(names(v),nchar))
  paste(
    sapply(seq_along(v),function(i) {
      paste(format(names(v)[i], width=width),
            format(v[[i]]))
    }),
    collapse='\n')
}

cat(vfmtN(c1))
# coefficients      119
# residuals         18948
# fitted.values     18948
# effects           16071
# R                 261
# rank              26
# qr                35261
# family            25160
# linear.predictors 18948
# deviance          30
# aic               30
# null.deviance     30
# iter              26
# weights           18948
# prior.weights     18948
# df.residual       26
# df.null           26
# converged         26
# boundary          26
# call              373
# formula           193
# terms             836
# data              16278
# offset            18
# control           140
# method            37
# contrasts         96
# xlevels           91

cat(vfmtN(c2))
# coefficients      119
# residuals         198949
# fitted.values     198949
# effects           160071
# R                 261
# rank              26
# qr                359262
# family            25160
# linear.predictors 198949
# deviance          30
# aic               30
# null.deviance     30
# iter              26
# weights           198949
# prior.weights     198949
# df.residual       26
# df.null           26
# converged         26
# boundary          26
# call              373
# formula           193
# terms             836
# data              160278
# offset            18
# control           140
# method            37
# contrasts         96
# xlevels           91
```

Look carefully, and you will see that certain objects in the `glm` model are large, and growing with data size.

```r
r = c2/c1
cat(vfmtN(r))
# coefficients      1
# residuals         10.49974
# fitted.values     10.49974
# effects           9.960239
# R                 1
# rank              1
# qr                10.18865
# family            1
# linear.predictors 10.49974
# deviance          1
# aic               1
# null.deviance     1
# iter              1
# weights           10.49974
# prior.weights     10.49974
# df.residual       1
# df.null           1
# converged         1
# boundary          1
# call              1
# formula           1
# terms             1
# data              9.846296
# offset            1
# control           1
# method            1
# contrasts         1
# xlevels           1

cat(vfmtN(r[r>1]))
# residuals         10.49974
# fitted.values     10.49974
# effects           9.960239
# qr                10.18865
# linear.predictors 10.49974
# weights           10.49974
# prior.weights     10.49974
# data              9.846296
```

Now strictly speaking, all you need to know to apply a glm model are the coefficients of the model and the appropriate link function. All the other things the `glm` model object carries around are for the purpose of characterizing the model. An example would be calculating coefficient significances (and really, for most purposes, one could just calculate the quantities one wants to know, save those, and throw the data away; but we’re here to discuss R as it is, not as it should be). Once you’ve examined a model and decided that it’s satisfactory, all you probably want to do is predict. So let’s try trimming all those large objects away.
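To make the "coefficients plus link function" claim concrete, here is a minimal sketch (not part of the original analysis) that reproduces `predict(..., type='response')` by hand for a logistic regression. The names `d`, `newd`, `X`, `eta`, `pManual`, and `pPredict` are made up for illustration, and the data is a small synthetic set in the spirit of `synthFrame()`.

```r
set.seed(1)

# Small synthetic training set: one numeric and one two-level
# categorical variable (illustrative only).
d <- data.frame(xN = rnorm(200),
                xC = sample(c('a', 'b'), size = 200, replace = TRUE))
d$y <- (d$xN + ifelse(d$xC == 'a', 0.2, -0.2) + rnorm(200)) > 0.5
model <- glm(y ~ xN + xC, data = d, family = binomial(link = 'logit'))

newd <- data.frame(xN = rnorm(20),
                   xC = sample(c('a', 'b'), size = 20, replace = TRUE))

# Build the design matrix for the new data, reusing the factor
# levels seen in training so the dummy columns line up with
# the coefficient names.
X <- model.matrix(~ xN + xC, data = newd, xlev = model$xlevels)

# Linear predictor, then the inverse of the logit link.
eta <- drop(X %*% coef(model))
pManual <- 1 / (1 + exp(-eta))

pPredict <- predict(model, newdata = newd, type = 'response')
max(abs(pManual - pPredict))  # agreement up to floating-point error
```

Note that even this hand-rolled version needs a little more than the raw coefficients: it also uses the training-time factor levels (`model$xlevels`) to encode the categorical variable consistently.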

```r
cleanModel1 = function(cm) {
  # just in case we forgot to set
  # y=FALSE and model=FALSE
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()
  cm
}

cm1 = cleanModel1(mod1)
cm2 = cleanModel1(mod2)

dTest = synthFrame(100)
p1 = predict(cm1, newdata=dTest, type='response') # FAILS
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
# Rank zero or should not have used lm(.., qr=FALSE).
```

Oops. We can’t null out the `qr` member of the model object if we want to predict. Incidentally, this is related to the observation that if you try to call `lm(...., y=FALSE, model=FALSE, qr=FALSE)`, the result is a model object that fails to either predict or summarize. Don’t ask me why `qr=FALSE` is even an option. But back to the glm. What’s in the model’s `qr` field?

```r
breakItDown(mod1$qr)
#    qr  rank qraux pivot   tol
# 35042    26    46    34    30

breakItDown(mod2$qr)
#     qr   rank  qraux  pivot    tol
# 359043     26     46     34     30
```

It turns out that we don’t actually need the model’s `qr$qr` to predict, so let’s trim just that away:

```r
cleanModel2 = function(cm) {
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()
  cm
}

# More reduction in model size
length(serialize(mod2, NULL))
# [1] 1701600
cm2 = cleanModel2(mod2)
length(serialize(cm2, NULL))
# [1] 27584

# And prediction works, too
resp.full = predict(mod2, newdata=dTest, type="response")
resp.cm = predict(cm2, newdata=dTest, type="response")
sum(abs(resp.full-resp.cm))
# [1] 0
```

Are we done?

```r
getModelSize = function(n) {
  data = synthFrame(n)
  model = cleanModel2(glm(y~xN+xC,data=data,
                          family=binomial(link='logit'),
                          y=FALSE, model=FALSE))
  length(serialize(model, NULL))
}

size2 = sapply(ndata, FUN=getModelSize)

ggplot(data.frame(n=ndata, modelsize=size2), aes(x=n, y=modelsize)) +
  geom_point() + geom_line()
```

The models are substantially smaller than when we started, but they *still* grow with training data size.

A rough explanation for this is that `glm` hides pointers to the environment, and to things from the environment, deep in many places. We didn’t notice this when we built models in the global environment because all those pointers pointed to the same things, so even though the models are much bigger than they need to be, they are all “too big” by the same amount, and hence don’t appear to grow as the training data grows. But when you build the models in a function (as we did in `getModelSize()`), you get more transient environments whose size is proportional to the size of the training data, and so model size grows with training data size. This isn’t going to seem clear, because it depends on a lot of complicated implementation details (for a taste of how complicated it can get, see here).
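The environment effect is easier to see in isolation. The sketch below is an illustration of the general mechanism rather than a trace of what `glm` does internally: a formula created inside a function keeps a reference to that function's environment, so serializing the formula drags along any large local variables. The helper names `makeFormula` and `serialSize` are hypothetical.

```r
# A formula created inside a function remembers the environment it
# was created in, including local variables it never mentions.
makeFormula <- function(n) {
  big <- rnorm(n)  # local data, not referenced by the formula
  y ~ x
}

serialSize <- function(obj) length(serialize(obj, NULL))

fSmall <- makeFormula(10)
sizeSmall <- serialSize(fSmall)

fBig <- makeFormula(100000)
sizeBig <- serialSize(fBig)   # far larger: `big` is serialized too

c(small = sizeSmall, big = sizeBig)

# Dropping the environment reference removes the hidden payload,
# after which the two formulas serialize to the same size.
attr(fSmall, ".Environment") <- NULL
attr(fBig, ".Environment") <- NULL
strippedEqual <- serialSize(fSmall) == serialSize(fBig)
strippedEqual
```

This is the same kind of reference that the `.Environment` attributes on a model's `terms` and `formula` carry.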

After much trial and error, this is the set of fields and attributes of the model that we found were growing with data size, and that we could eliminate without breaking `predict()`.

```r
stripGlmLR = function(cm) {
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()

  cm$family$variance = c()
  cm$family$dev.resids = c()
  cm$family$aic = c()
  cm$family$validmu = c()
  cm$family$simulate = c()
  attr(cm$terms,".Environment") = c()
  attr(cm$formula,".Environment") = c()

  cm
}

getModelSize = function(n) {
  data = synthFrame(n)
  model = stripGlmLR(glm(y~xN+xC,data=data,
                         family=binomial(link='logit'),
                         y=FALSE, model=FALSE))
  length(serialize(model, NULL))
}

size3 = sapply(ndata, FUN=getModelSize)

ggplot(data.frame(n=ndata, modelsize=size3), aes(x=n, y=modelsize)) +
  geom_point() + geom_line()
```

Yahoo! It worked! The models are constant size with respect to training data size. And prediction works.

```r
cm2 = stripGlmLR(mod2)
resp.full = predict(mod2, newdata=dTest, type="response")
resp.cm = predict(cm2, newdata=dTest, type="response")
sum(abs(resp.full-resp.cm))
# [1] 0
```

Comparing the size of the final stripped-down models (in variable `size3` in the demonstration code) to the originals (`size1`), we find that the final model is 0.3% the size of the original model for small (n=5000) training data sets, and 0.015% the size of the original model for “large” (n=100,000) data sets. That’s a heckuva savings. We probably haven’t gotten rid of all the unnecessary baggage, but at least we seem to have stopped the growth. This was enough trimming to accomplish our task for the client (producing working models that stored and loaded quickly), so we stopped.
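Since the goal was models that store and load quickly, a save-and-reload check is a natural last step. The following is a minimal sketch under stated assumptions: `trimRows` is a hypothetical, deliberately reduced trimming function that removes only the per-row components, and the data, file names, and sizes are illustrative.

```r
set.seed(3)

# Illustrative training data and model
d <- data.frame(xN = rnorm(20000))
d$y <- (d$xN + rnorm(20000)) > 0.5
model <- glm(y ~ xN, data = d, family = binomial(link = 'logit'),
             y = FALSE, model = FALSE)

# Reduced trimming: drop only components that hold one value
# (or one row) per training example.
trimRows <- function(cm) {
  cm$y <- NULL
  cm$model <- NULL
  cm$residuals <- NULL
  cm$fitted.values <- NULL
  cm$effects <- NULL
  cm$qr$qr <- NULL
  cm$linear.predictors <- NULL
  cm$weights <- NULL
  cm$prior.weights <- NULL
  cm$data <- NULL
  cm
}

fullFile <- tempfile(fileext = ".rds")
smallFile <- tempfile(fileext = ".rds")
saveRDS(model, fullFile)
saveRDS(trimRows(model), smallFile)

# Compare on-disk sizes (in bytes)
c(full = file.size(fullFile), trimmed = file.size(smallFile))

# Reload the trimmed model and confirm it predicts like the original
reloaded <- readRDS(smallFile)
newd <- data.frame(xN = rnorm(50))
maxDiff <- max(abs(predict(model, newdata = newd, type = 'response') -
                   predict(reloaded, newdata = newd, type = 'response')))
maxDiff
```

Checking predictions against the untrimmed model before discarding it is cheap insurance, because a field that is safe to drop in one R version is not guaranteed to be safe in another.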

One point and one caveat. You can null out `model$family` entirely; the `predict` function will still return its default value, the link value (that is, `predict(model, newdata=data)` will work). However, `predict(model, newdata=data, type='response')` will fail. You can still recover the response by passing the link value through the inverse link function: in the case of logistic regression, this is the sigmoid function, `sigmoid(x) = 1/(1 + exp(-x))`. I didn’t test `type='terms'`.
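Here is a small sketch of that recovery step, assuming a logistic regression fit on illustrative synthetic data. The names `noFamily`, `linkVals`, and `respManual` are made up for the example.

```r
set.seed(2)

# Illustrative data and model
d <- data.frame(xN = rnorm(500))
d$y <- (d$xN + rnorm(500)) > 0.5
model <- glm(y ~ xN, data = d, family = binomial(link = 'logit'))
newd <- data.frame(xN = rnorm(25))

# Reference answer from the intact model
respFull <- predict(model, newdata = newd, type = 'response')

# Remove the family object entirely
noFamily <- model
noFamily$family <- NULL

# The default predict() returns values on the link (log-odds) scale
linkVals <- predict(noFamily, newdata = newd)

# Apply the inverse link by hand
sigmoid <- function(x) 1 / (1 + exp(-x))
respManual <- sigmoid(linkVals)
max(abs(respFull - respManual))  # agreement up to floating-point error

# Asking for the response directly is expected to fail now
failed <- inherits(
  try(predict(noFamily, newdata = newd, type = 'response'), silent = TRUE),
  "try-error")
failed
```

For other families the same idea applies, but the inverse link differs (for example, `exp` for a log link), so it has to be recorded somewhere alongside the stripped model.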

The caveat: many of the other things besides predict that you might like to do with a glm model will fail on the stripped-down version: in particular `summary()`, `anova()` and `step()`. So any characterization that you want to do on a candidate model should be done before trimming down the fat. Once you have decided on a satisfactory model, you can strip it down and save it for use in future predictions.

- You can trim `lm` and `gam` models in a similar way, too. The exact fields to trim are a bit different. We will leave this as an exercise for the reader.
- We are aware of the `biglm` package (and its `bigglm` function) for fitting generalized linear models to big data. We didn’t test it, but I would imagine that it doesn’t have this problem. Note, though, that the problem here isn’t the size of the training data *per se* (which is only of moderate size); it’s the inordinate size of the resulting model.
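As a starting point for the `lm` exercise, the same serialization trick can be used to discover which components of an `lm` object grow with the training data. This is only a sketch of the discovery step on illustrative synthetic data; the names `makeData`, `componentSizes`, and `grown` are hypothetical, and it does not claim which fields are safe to remove.

```r
set.seed(4)

# Illustrative regression data of a given size
makeData <- function(n) {
  d <- data.frame(x = rnorm(n))
  d$y <- 2 * d$x + rnorm(n)
  d
}

# Serialized size of each top-level component of a model object
componentSizes <- function(mod) {
  sapply(mod, function(x) length(serialize(x, NULL)))
}

lmSmall <- lm(y ~ x, data = makeData(1000), model = FALSE)
lmLarge <- lm(y ~ x, data = makeData(10000), model = FALSE)

# Components whose size ratio exceeds 1 grow with the data
ratio <- componentSizes(lmLarge) / componentSizes(lmSmall)
grown <- names(ratio)[ratio > 1]
grown
```

Each candidate in `grown` still has to be tested individually against `predict()`, exactly as was done for `glm` above.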

Books eligible for this great discount:

- Practical Probabilistic Programming
- Practical Data Science with R
- Machine Learning in Action
- Real-World Machine Learning
- R in Action, Second Edition
- Mahout in Action
- Hadoop in Practice, Second Edition
- Linked Data
- Giraph in Action
- Scala in Action
- Functional Programming in Scala
- SBT in Action
- Akka in Action
- Taming Text
- Neo4j in Action
- Making Sense of NoSQL
- Big Data

Edit: we are going to try to keep the current best deals on the book at the bottom of the Practical Data Science with R page, so look there for updates. The book is also always available at Amazon.com, so you may want to check what the discount is there.