We’ll continue the example from the previous posts in the series: predicting income from demographic variables (sex, age, employment, education). For our example, the data is from the 2016 US Census American Community Survey (ACS) Public Use Microdata Sample (PUMS). More information about the data can be found here. First, we’ll get the training and test data, and show how the expected income varies across different groupings (by sex, by employment, and by education):
library(zeallot)
library(wrapr)
incomedata <- readRDS("incomedata.rds")
c(test, train) %<-% split(incomedata, incomedata$gp)
# get the rollups (mean) by grouping variable
show_conditional_means <- function(d, outcome = "income") {
  cols <- qc(sex, employment, education)
  lapply(
    cols := cols,
    function(colname) {
      aggregate(d[, outcome, drop = FALSE],
                d[, colname, drop = FALSE],
                FUN = mean)
    })
}
display_tables <- function(tlist) {
  for(vi in tlist) {
    print(knitr::kable(vi))
  }
}
display_tables(
  show_conditional_means(train))
sex | income |
---|---|
Male | 55755.51 |
Female | 47718.52 |
employment | income |
---|---|
Employee of a private for profit | 51620.39 |
Federal government employee | 64250.09 |
Local government employee | 54740.93 |
Private not-for-profit employee | 53106.41 |
Self employed incorporated | 66100.07 |
Self employed not incorporated | 41346.47 |
State government employee | 53977.20 |
education | income |
---|---|
no high school diploma | 31883.18 |
Regular high school diploma | 38052.13 |
GED or alternative credential | 37273.30 |
some college credit, no degree | 42991.09 |
Associate’s degree | 47759.61 |
Bachelor’s degree | 65668.51 |
Master’s degree | 79225.87 |
Professional degree | 97772.60 |
Doctorate degree | 91214.55 |
For this post, we’ll train a random forest model to predict income.
library(randomForest)
model_rf_1stage <- randomForest(income ~ age+sex+employment+education,
                                data=train)
train$pred_rf_raw <- predict(model_rf_1stage, newdata=train, type="response")
# doesn't roll up
display_tables(
  show_conditional_means(train,
                         qc(income, pred_rf_raw)))
sex | income | pred_rf_raw |
---|---|---|
Male | 55755.51 | 55292.47 |
Female | 47718.52 | 48373.40 |
employment | income | pred_rf_raw |
---|---|---|
Employee of a private for profit | 51620.39 | 51291.36 |
Federal government employee | 64250.09 | 61167.31 |
Local government employee | 54740.93 | 55425.30 |
Private not-for-profit employee | 53106.41 | 54174.31 |
Self employed incorporated | 66100.07 | 63714.20 |
Self employed not incorporated | 41346.47 | 46415.34 |
State government employee | 53977.20 | 55599.89 |
education | income | pred_rf_raw |
---|---|---|
no high school diploma | 31883.18 | 41673.91 |
Regular high school diploma | 38052.13 | 42491.11 |
GED or alternative credential | 37273.30 | 43037.49 |
some college credit, no degree | 42991.09 | 44547.89 |
Associate’s degree | 47759.61 | 46815.79 |
Bachelor’s degree | 65668.51 | 63474.48 |
Master’s degree | 79225.87 | 69953.53 |
Professional degree | 97772.60 | 76861.44 |
Doctorate degree | 91214.55 | 75940.24 |
As we observed before, the random forest model predictions do not match the true rollups, even on the training data.
Suppose that we wish to make individual predictions of subjects’ incomes, and estimate mean income as a function of employment type. An ad-hoc way to do this is to adjust the predictions from the random forest, depending on subjects’ employment type, so that the resulting polished model is calibrated with respect to employment. Since linear models are calibrated, we might try fitting a linear model to the random forest model’s predictions, along with employment.
(Of course, we could use a Poisson model as well, but for this example we’ll just use a simple linear model for the polishing step).
One caution: we shouldn’t use the same data to fit both the random forest model and the polishing model; doing so leads to nested-model bias, a potential source of overfit. Either we split the training data into two sets, one to train the random forest model and another to train the polishing model, or we use cross-validation to simulate having two sets. The cross-validation procedure is the same one used when stacking multiple models; in fact, you can think of this polishing procedure as a form of stacking, where some of the sub-learners are simply single variables.
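The first option, a simple two-way split, can be sketched as follows. This is a minimal illustration on synthetic data with illustrative names (`stage1`, `stage2`, and the toy intercept-only polish are not from the post itself, which uses cross-validation instead):

```r
# A minimal sketch of the two-set alternative: fit the base model and the
# polishing model on disjoint halves of the training data.
# (Synthetic data; names and the toy polish step are illustrative only.)
set.seed(2020)
n <- 100
d <- data.frame(x = runif(n))
d$y <- 5 * d$x + rnorm(n)

in_stage1 <- runif(n) < 0.5
stage1 <- d[in_stage1, , drop = FALSE]   # train the base model here
stage2 <- d[!in_stage1, , drop = FALSE]  # train the polishing model here

base_model <- lm(y ~ x, data = stage1)
stage2$pred <- predict(base_model, newdata = stage2)
# toy polishing step: adjust for any systematic offset of the base model
polish <- lm(y - pred ~ 1, data = stage2)
```

The cost of this approach is that each stage sees only part of the data, which is one reason to prefer the cross-validation procedure used below.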
Let’s use 5-fold cross-validation to "stack" the random forest model and the employment variable. We’ll use vtreat to create the cross-validation plan.
set.seed(2426355)
# build a schedule for 5-way crossval
crossplan <- vtreat::kWayCrossValidation(nrow(train), 5)
The crossplan is a list of five splits of the data (described by row indices); each split is itself a list of two disjoint index vectors: split$train and split$app. For each fold, we want to train a model using train[split$train, , drop=FALSE] and then apply that model to train[split$app, , drop=FALSE].
train$pred_uncal <- 0
# use cross-validation to get uncalibrated predictions
for(split in crossplan) {
  model_rf_2stage <- randomForest(income ~ age+sex+employment+education,
                                  data=train[split$train, , drop=FALSE])
  predi <- predict(model_rf_2stage,
                   newdata=train[split$app, , drop=FALSE],
                   type="response")
  train$pred_uncal[split$app] <- predi
}
The vector train$pred_uncal is now a vector of random forest predictions on the training data; every prediction is made by a model that was not trained on the datum in question.
Now we can use these random forest predictions to train the linear polishing model.
# learn a polish/calibration for employment
rf_polish <- lm(income - pred_uncal ~ employment,
                data=train)
# get rid of pred_uncal, as it's no longer needed
train$pred_uncal <- NULL
Now, take the predictions from the original random forest model (the one trained on all the data, earlier), and polish them with the polishing model.
# get the predictions from the original random forest model
train$pred_rf_raw <- predict(model_rf_1stage, newdata=train, type="response")
# polish the predictions so that employment rolls up correctly
train$pred_cal <-
  train$pred_rf_raw +
  predict(rf_polish, newdata=train, type="response")
# see how close the rollups get to ground truth
rollups <- show_conditional_means(train,
                                  qc(income, pred_cal, pred_rf_raw))
display_tables(rollups)
sex | income | pred_cal | pred_rf_raw |
---|---|---|---|
Male | 55755.51 | 55343.35 | 55292.47 |
Female | 47718.52 | 48296.93 | 48373.40 |
employment | income | pred_cal | pred_rf_raw |
---|---|---|---|
Employee of a private for profit | 51620.39 | 51640.44 | 51291.36 |
Federal government employee | 64250.09 | 64036.19 | 61167.31 |
Local government employee | 54740.93 | 54739.80 | 55425.30 |
Private not-for-profit employee | 53106.41 | 53075.76 | 54174.31 |
Self employed incorporated | 66100.07 | 66078.76 | 63714.20 |
Self employed not incorporated | 41346.47 | 41341.37 | 46415.34 |
State government employee | 53977.20 | 53946.07 | 55599.89 |
education | income | pred_cal | pred_rf_raw |
---|---|---|---|
no high school diploma | 31883.18 | 41526.88 | 41673.91 |
Regular high school diploma | 38052.13 | 42572.57 | 42491.11 |
GED or alternative credential | 37273.30 | 43104.09 | 43037.49 |
some college credit, no degree | 42991.09 | 44624.38 | 44547.89 |
Associate’s degree | 47759.61 | 46848.84 | 46815.79 |
Bachelor’s degree | 65668.51 | 63468.93 | 63474.48 |
Master’s degree | 79225.87 | 69757.13 | 69953.53 |
Professional degree | 97772.60 | 76636.17 | 76861.44 |
Doctorate degree | 91214.55 | 75697.59 | 75940.24 |
Note that the rolled up predictions from the polished model almost match the true rollups for employment, but not for the other grouping variables (sex and education). To see this better, let’s look at the total absolute error of the estimated rollups.
err_mag <- function(x, y) {
  sum(abs(y-x))
}
preds <- qc(pred_rf_raw, pred_cal)
# total absolute rollup error for each prediction column
errframes <- lapply(rollups,
                    function(df) {
                      lapply(df[, preds],
                             function(p)
                               err_mag(p, df$income)) %.>%
                        as.data.frame(.)
                    })
# same calculation, labeling each table with its grouping variable
errframes <- lapply(rollups,
                    function(df) {
                      gp <- names(df)[[1]]
                      errs <- lapply(df[, preds],
                                     function(p)
                                       err_mag(p, df$income))
                      as.data.frame(c(grouping=gp, errs))
                    })
display_tables(errframes)
display_tables(errframes)
grouping | pred_rf_raw | pred_cal |
---|---|---|
sex | 1117.927 | 990.5685 |
grouping | pred_rf_raw | pred_cal |
---|---|---|
employment | 14241.51 | 323.2577 |
grouping | pred_rf_raw | pred_cal |
---|---|---|
education | 70146.37 | 70860.7 |
We can reduce the rollup errors substantially for the variables that the polishing model was exposed to. For variables that the polishing model was not exposed to, there is no improvement; in many cases, those estimated rollups will likely be worse.
Let’s see the performance of the polished model on test data.
# get the predictions from the original random forest model
test$pred_rf_raw <- predict(model_rf_1stage, newdata=test, type="response")
# polish the predictions so that employment rolls up correctly
test$pred_cal <-
  test$pred_rf_raw +
  predict(rf_polish, newdata=test, type="response")
# compare the rollups on employment
preds <- qc(pred_rf_raw, pred_cal)
employment_rollup <-
  show_conditional_means(test,
                         c("income", preds))$employment
knitr::kable(employment_rollup)
employment | income | pred_rf_raw | pred_cal |
---|---|---|---|
Employee of a private for profit | 50717.96 | 51064.25 | 51413.32 |
Federal government employee | 66268.05 | 61401.94 | 64270.82 |
Local government employee | 52565.89 | 54878.96 | 54193.47 |
Private not-for-profit employee | 52887.52 | 54011.64 | 52913.09 |
Self employed incorporated | 67744.61 | 63664.51 | 66029.07 |
Self employed not incorporated | 41417.25 | 46215.42 | 41141.44 |
State government employee | 51314.92 | 55395.96 | 53742.14 |
# see how close the rollups get to ground truth for employment
lapply(employment_rollup[, preds],
       function(p) err_mag(p, employment_rollup$income)) %.>%
  as.data.frame(.) %.>%
  knitr::kable(.)
pred_rf_raw | pred_cal |
---|---|
21608.9 | 8764.302 |
The polished model estimates rollups with respect to employment better than the uncalibrated random forest model. Its performance on individual predictions (as measured by root mean squared error) is about the same.
# predictions on individuals
rmse <- function(x, y) {
  sqrt(mean((y-x)^2))
}
lapply(test[, preds],
       function(p) rmse(p, test$income)) %.>%
  as.data.frame(.) %.>%
  knitr::kable(.)
pred_rf_raw | pred_cal |
---|---|
31780.39 | 31745.12 |
We’ve demonstrated a procedure that mitigates bias issues with ensemble models, or any other uncalibrated model. This potentially allows the data scientist to balance the requirement for highly accurate predictions on individuals with the need to correctly estimate specific aggregate quantities.
This method is ad-hoc, and may be somewhat brittle. In addition, it requires that the data scientist knows ahead of time which rollups will be desired in the future. However, if you find yourself in a situation where you must balance accurate individual prediction with accurate aggregate estimates, this may be a useful trick to have in your data science toolbox.
Jelmer Ypma has pointed out to us that for the special case of loglinear models (that is, a linear model for log(y)), there are other techniques for mitigating bias in predictions on y. More information on these methods can be found in Chapter 6.4 of Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge (2014).
These methods explicitly assume that y is lognormally distributed (an assumption that is often valid for monetary amounts), and try to estimate the true standard deviation of log(y) in order to adjust the estimates of y. They do not completely eliminate the bias, because this true standard deviation is unknown, but they do reduce it, while making predictions on individuals with RMSE performance competitive with that of linear or (quasi)Poisson models fit directly to y. However, they do not give the improvements in relative error that the naive adjustment we showed in the first article of this series will give.
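To make the basic adjustment concrete, here is a minimal sketch on synthetic data (this is the standard lognormal correction, not necessarily the exact estimators Wooldridge discusses): if log(y) is normal with variance sigma^2 around the linear predictor, then E[y] picks up a factor of exp(sigma^2/2), which we estimate from the residuals.

```r
# Sketch of the lognormal bias adjustment described above (synthetic data).
# If log(y) = Xb + e with e ~ N(0, sigma^2), then
# E[y | X] = exp(Xb + sigma^2/2), so the naive prediction exp(Xb)
# is biased low by a factor of exp(sigma^2/2).
set.seed(25)
d <- data.frame(x = runif(100))
d$y <- exp(1 + 2 * d$x + rnorm(100, sd = 0.5))

model_log <- lm(log(y) ~ x, data = d)
sigma2_hat <- summary(model_log)$sigma^2    # estimated residual variance

pred_naive    <- exp(predict(model_log, d))        # biased low
pred_adjusted <- pred_naive * exp(sigma2_hat / 2)  # bias-reduced
```

Because sigma^2 is only estimated, some bias remains, which is the limitation noted above.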
In doing that, I ran into one more avoidable but strange issue in using xgboost: when run for a small number of rounds, xgboost at first appears not to get the unconditional average or grand average right (let alone the conditional averages Nina was working with)!
Let’s take a look at that by running a trivial example in R: a data set with only one explanatory variable “x”, used to model a single response variable “y”.
library(xgboost)
library(wrapr)
d <- data.frame(
  x = c(1 , 1 , 1 , 1 , 1 ),
  y = c(10, 10, 10, 10, 10))
knitr::kable(d)
x | y |
---|---|
1 | 10 |
1 | 10 |
1 | 10 |
1 | 10 |
1 | 10 |
Now let’s do a bad job modeling y as a function of x.
model_xgb_bad <- xgboost(
  data = as.matrix(d[, "x", drop = FALSE]),
  label = d$y,
  nrounds = 2, # part of the problem
  verbose = 0,
  params = list(
    objective = "reg:linear"
    #, base_score = mean(d$y) # would fix issue
    #, eta = 1 # would nearly fix issue
  ))
d$pred_xgboost_bad <- predict(
  model_xgb_bad,
  newdata = as.matrix(d[, "x", drop = FALSE]))
aggregate(cbind(y, pred_xgboost_bad) ~ 1,
          data = d,
          FUN = mean) %.>%
  knitr::kable(.)
y | pred_xgboost_bad |
---|---|
10 | 4.65625 |
The above is odd: the training value of y has a mean of 10, yet the predictions average to the very different value 4.66.
The issue is hidden in the default value of the “learning rate” eta. In gradient boosting we fit sub-models (in this case regression trees), and then use a linear combination of the sub-models’ predictions as our overall model. However, eta defaults to 0.3 (ref), which roughly means each sub-model is used to move about 30% of the way from the current estimate to the suggested next estimate. Thus, with a small number of trees, the model deliberately can’t match the unconditional average, as it hasn’t been allowed to make full use of the sub-model estimates.
The low learning rate is thought to reduce overfit driven by depending too much on any one sub-learner. The issue goes away as we build larger models with more rounds, since a systematic error (such as getting the grand mean wrong) is quickly corrected as each sub-learner suggests related adjustments. This is part of the idea of boosting: some of the generalization performance comes from smoothing over behaviors unique to one sub-learner and concentrating on behaviors that aggregate across sub-learners (which may be important features of the problem). This idea can’t fight systematic model bias (errors that recur again and again), but it does help with some model variance issues.
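The arithmetic of the undershoot can be sketched with a constant target. This simplified recursion ignores xgboost's additional per-tree regularization (which shrinks predictions a bit further), so the exact values differ from the run above:

```r
# With a constant target and squared loss, each boosting round moves the
# prediction a fraction eta of the remaining gap between the current
# estimate and the mean, so after k rounds:
#   pred = base + (mean - base) * (1 - (1 - eta)^k)
base <- 0.5  # xgboost's default base_score
mu   <- 10   # the grand mean of y in our example
eta  <- 0.3
pred_after <- function(k) base + (mu - base) * (1 - (1 - eta)^k)
pred_after(2)   # 5.345: far from 10 after only 2 rounds
pred_after(50)  # essentially 10
```

This is why the problem disappears once the number of rounds is large enough.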
We can fix this by running xgboost closer to how we would see it run in production (which was in fact how Nina ran it in the first place!): run for a larger number of rounds, and determine the number of rounds by cross-validation.
cvobj <- xgb.cv(params = list(objective = "reg:linear"),
                as.matrix(d[, "x", drop = FALSE]),
                label = d$y,
                verbose = 0,
                nfold = 5,
                nrounds = 50)
evallog <- cvobj$evaluation_log
( ntrees <- which.min(evallog$test_rmse_mean) )
## [1] 50
model_xgb_good <- xgboost(
  data = as.matrix(d[, "x", drop = FALSE]),
  label = d$y,
  nrounds = ntrees,
  verbose = 0,
  params = list(
    objective = "reg:linear"
  ))
d$pred_xgboost_good <- predict(
  model_xgb_good,
  newdata = as.matrix(d[, "x", drop = FALSE]))
aggregate(cbind(y, pred_xgboost_good) ~ 1,
          data = d,
          FUN = mean) %.>%
  knitr::kable(.)
y | pred_xgboost_good |
---|---|
10 | 9.999994 |
Or we can fix this by returning to the documentation, and noticing the somewhat odd parameter “base_score”.
model_xgb_base <- xgboost(
  data = as.matrix(d[, "x", drop = FALSE]),
  label = d$y,
  nrounds = 1,
  verbose = 0,
  params = list(
    objective = "reg:linear",
    base_score = mean(d$y)
  ))
d$pred_xgboost_base <- predict(
  model_xgb_base,
  newdata = as.matrix(d[, "x", drop = FALSE]))
aggregate(cbind(y, pred_xgboost_base) ~ 1,
          data = d,
          FUN = mean) %.>%
  knitr::kable(.)
y | pred_xgboost_base |
---|---|
10 | 10 |
base_score is documented as:
base_score [default=0.5]
- The initial prediction score of all instances, global bias
- For sufficient number of iterations, changing this value will not have too much effect.
Frankly, this parameter (and its default value) violates the principle of least astonishment. Most users coming to xgboost-regression from other forms of regression would expect the grand average to be quickly modeled, not something the user has to specify (especially if there is an explicit constant column in the list of explanatory variables). It is a somewhat minor “footgun”, but a needless footgun all the same.
We (at Win-Vector LLC) think this is an issue, as we always teach: try a method on simple problems you know the answer to before trying it on large or complex problems you don’t have a solution for. We think one should build up intuition and confidence about a method by seeing how it works on small, simple problems (even if its forte is large, complex problems). The mathematical principle is: concepts that are correct are correct in the extremes. It is better not to have a problem in the first place than to have a problem with a remedy at hand.
Why does this issue live on? Because, as the documentation says, it rarely matters in practice. However, it may be a good practice to try setting base_score = mean(d$y) (especially if your model is having problems and you are seeing a small number of trees in your xgboost model).
Should one always set base_score? That isn’t clear. The initial wrong setting of base_score also biases the number of trees fit in cross-validation upward, which other xgboost parameters may be tuned with respect to or counting on. In careful work (such as our book) we do set the base_score. In practical terms it often does not make a difference (as we saw above), so over-emphasizing this parameter can also give the student a strange impression of how boosting works.

However, when making predictions on individuals, a biased model may be preferable; biased models may be more accurate, or make predictions with lower relative error than an unbiased model. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. This means they may not always represent the best bias/variance trade-off.
We’ll continue the example from the previous post: predicting income from demographic variables (sex, age, employment, education). For our example, the data is from the 2016 US Census American Community Survey (ACS) Public Use Microdata Sample (PUMS). More information about the data can be found here. First, we’ll get the training and test data, and show how the expected income varies across different groupings (by sex, by employment, and by education):
library(zeallot)
library(wrapr)
location <- "https://github.com/WinVector/PDSwR2/raw/master/PUMS/incomedata.rds"
incomedata <- readRDS(url(location))
c(test, train) %<-% split(incomedata, incomedata$gp)
# A convenience function to calculate and display
# the conditional expected incomes
show_conditional_means <- function(d, outcome = "income") {
  cols <- qc(sex, employment, education)
  lapply(
    cols := cols,
    function(colname) {
      aggregate(d[, outcome, drop = FALSE],
                d[, colname, drop = FALSE],
                FUN = mean)
    })
}
display_tables <- function(tlist) {
  for(vi in tlist) {
    print(knitr::kable(vi))
  }
}
display_tables(
  show_conditional_means(train))
sex | income |
---|---|
Male | 55755.51 |
Female | 47718.52 |
employment | income |
---|---|
Employee of a private for profit | 51620.39 |
Federal government employee | 64250.09 |
Local government employee | 54740.93 |
Private not-for-profit employee | 53106.41 |
Self employed incorporated | 66100.07 |
Self employed not incorporated | 41346.47 |
State government employee | 53977.20 |
education | income |
---|---|
no high school diploma | 31883.18 |
Regular high school diploma | 38052.13 |
GED or alternative credential | 37273.30 |
some college credit, no degree | 42991.09 |
Associate’s degree | 47759.61 |
Bachelor’s degree | 65668.51 |
Master’s degree | 79225.87 |
Professional degree | 97772.60 |
Doctorate degree | 91214.55 |
We’ll fit three models to the data: two tree ensemble models (random forest and gradient boosting) and, for comparison, one (quasi)Poisson model, which is calibrated.
library(randomForest)
library(xgboost)
# Quasipoisson model
model_pincome <- glm(income ~ age+sex+employment+education,
                     data=train,
                     family=quasipoisson)
# random forest model
model_rf_1stage <- randomForest(income ~ age+sex+employment+education,
                                data=train)
# gradient boosting model
# build the model.matrix for the training data
train_mm <- model.matrix(income ~ age+sex+employment+education,
                         train)
cvobj <- xgb.cv(params = list(objective="reg:linear"),
                train_mm,
                label = train$income,
                verbose = 0,
                nfold = 5,
                nrounds = 50)
evallog <- cvobj$evaluation_log
ntrees <- which.min(evallog$test_rmse_mean)
model_xgb <- xgboost(train_mm,
                     label = train$income,
                     verbose = 0, nrounds = ntrees)
#
# make the predictions on training data
#
train <- transform(train,
                   pred_pois = predict(model_pincome,
                                       train, type="response"),
                   pred_rf_raw = predict(model_rf_1stage,
                                         newdata=train, type="response"),
                   pred_xgb = predict(model_xgb, train_mm))
First, we’ll compare the rollups of the predictions to the actual rollups.
outcomecols <- qc(income, pred_pois, pred_rf_raw, pred_xgb)
rollups <- show_conditional_means(train, outcomecols)
display_tables(rollups)
sex | income | pred_pois | pred_rf_raw | pred_xgb |
---|---|---|---|---|
Male | 55755.51 | 55755.51 | 55261.04 | 54203.70 |
Female | 47718.52 | 47718.52 | 48405.71 | 47326.59 |
employment | income | pred_pois | pred_rf_raw | pred_xgb |
---|---|---|---|---|
Employee of a private for profit | 51620.39 | 51620.39 | 51276.95 | 50294.95 |
Federal government employee | 64250.09 | 64250.09 | 61623.87 | 60596.12 |
Local government employee | 54740.93 | 54740.93 | 55464.36 | 54121.91 |
Private not-for-profit employee | 53106.41 | 53106.41 | 54135.75 | 53417.86 |
Self employed incorporated | 66100.07 | 66100.07 | 63840.91 | 63391.52 |
Self employed not incorporated | 41346.47 | 41346.47 | 46257.98 | 42578.69 |
State government employee | 53977.20 | 53977.20 | 55530.86 | 54752.98 |
education | income | pred_pois | pred_rf_raw | pred_xgb |
---|---|---|---|---|
no high school diploma | 31883.18 | 31883.18 | 40599.88 | 40287.54 |
Regular high school diploma | 38052.13 | 38052.13 | 41864.18 | 36245.78 |
GED or alternative credential | 37273.30 | 37273.30 | 42316.76 | 37654.63 |
some college credit, no degree | 42991.09 | 42991.09 | 44303.14 | 41259.59 |
Associate’s degree | 47759.61 | 47759.61 | 46831.13 | 44995.83 |
Bachelor’s degree | 65668.51 | 65668.51 | 64131.61 | 64043.09 |
Master’s degree | 79225.87 | 79225.87 | 70762.24 | 77177.23 |
Professional degree | 97772.60 | 97772.60 | 77940.16 | 93507.90 |
Doctorate degree | 91214.55 | 91214.55 | 76972.02 | 86496.11 |
Note that the rollups of the predictions from the two ensemble models don’t match the true rollups, even on the training data; unlike the Poisson model, the random forest and gradient boosting models are uncalibrated.
Let’s see the performance of the models on test data.
# build the model.matrix for the test data
test_mm <- model.matrix(income ~ age+sex+employment+education,
                        test)
test <- transform(test,
                  pred_pois = predict(model_pincome,
                                      test, type="response"),
                  pred_rf_raw = predict(model_rf_1stage,
                                        newdata=test, type="response"),
                  pred_xgb = predict(model_xgb, test_mm))
outcomecols <- qc(income, pred_pois, pred_rf_raw, pred_xgb)
rollups <- show_conditional_means(test, outcomecols)
display_tables(rollups)
sex | income | pred_pois | pred_rf_raw | pred_xgb |
---|---|---|---|---|
Male | 55408.95 | 55899.83 | 55210.82 | 54236.94 |
Female | 46261.99 | 47111.01 | 47950.01 | 46705.38 |
employment | income | pred_pois | pred_rf_raw | pred_xgb |
---|---|---|---|---|
Employee of a private for profit | 50717.96 | 51362.44 | 51040.56 | 49995.11 |
Federal government employee | 66268.05 | 64881.32 | 61974.36 | 61574.06 |
Local government employee | 52565.89 | 54119.83 | 54901.92 | 53703.97 |
Private not-for-profit employee | 52887.52 | 53259.07 | 53987.90 | 53441.93 |
Self employed incorporated | 67744.61 | 66096.20 | 63790.77 | 63100.00 |
Self employed not incorporated | 41417.25 | 41507.17 | 46086.63 | 42296.44 |
State government employee | 51314.92 | 53973.39 | 55262.11 | 54374.54 |
education | income | pred_pois | pred_rf_raw | pred_xgb |
---|---|---|---|---|
no high school diploma | 29903.70 | 31783.60 | 40469.94 | 40169.21 |
Regular high school diploma | 36979.33 | 37746.81 | 41648.80 | 35989.04 |
GED or alternative credential | 39636.86 | 37177.50 | 42620.37 | 38180.68 |
some college credit, no degree | 43490.42 | 43270.86 | 44449.98 | 41538.77 |
Associate’s degree | 48384.19 | 47234.56 | 46309.68 | 44383.58 |
Bachelor’s degree | 65268.96 | 66141.27 | 64387.44 | 64320.68 |
Master’s degree | 77180.40 | 79594.17 | 70804.81 | 77491.04 |
Professional degree | 94976.75 | 99009.56 | 78713.55 | 94974.29 |
Doctorate degree | 87535.83 | 91742.54 | 76517.41 | 86141.53 |
# see how close the rollups get to ground truth for employment
err_mag <- function(x, y) {
  delta <- y - x
  sqrt(sum(delta^2))
}
employment <- rollups$employment
lapply(employment[, qc(pred_pois, pred_rf_raw, pred_xgb)],
       function(p) err_mag(p, employment$income)) %.>%
  as.data.frame(.) %.>%
  knitr::kable(.)
pred_pois | pred_rf_raw | pred_xgb |
---|---|---|
3831.967 | 8844.436 | 7474.311 |
The calibrated Poisson model gives better estimates of the income rollups with respect to employment than either of the ensemble models, despite the fact that all the models have similar root mean squared error when making individual predictions.
# predictions on individuals
rmse <- function(x, y) {
  sqrt(mean((y-x)^2))
}
lapply(test[, qc(pred_pois, pred_rf_raw, pred_xgb)],
       function(p) rmse(p, test$income)) %.>%
  as.data.frame(.) %.>%
  knitr::kable(.)
pred_pois | pred_rf_raw | pred_xgb |
---|---|---|
31341.14 | 31688.77 | 31299.37 |
In this example, the input variables were simply not informative enough, so the tree ensemble models performed equivalently to the Poisson model for predicting income. With more informative (and nonlinear) input variables, one can expect that ensemble models will outperform linear or generalized linear models, in terms of predictions on individuals. However, even these more accurate ensemble models can be biased, so they are not guaranteed to estimate important aggregates (grouped sums or conditional means) correctly.
In the next note, we’ll propose a polishing step on uncalibrated models that mitigates this bias, potentially enabling models that are both highly accurate on individuals, while estimating certain aggregates correctly.
In the book, we model log10(income) rather than regressing directly against income.
One obvious reason for not regressing directly against income is that (in our example) income is restricted to be non-negative, a constraint that linear regression can’t enforce. Other reasons include the wide distribution of values and the relative or multiplicative structure of errors on outcomes. A common practice in this situation is to use Poisson regression, or generalized linear regression with a log link function. Like all generalized linear regressions, Poisson regression is unbiased and calibrated: it preserves the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
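The calibration property is easy to check directly. A minimal sketch on synthetic data (the data and names here are illustrative, not the census data used below): a (quasi)Poisson GLM's training predictions average to the mean of the training outcome.

```r
# Minimal check that a (quasi)Poisson GLM (log link, with intercept) is
# calibrated: its training predictions average to the mean outcome.
# (Synthetic positive-valued data; names are illustrative.)
set.seed(142)
d <- data.frame(x = rnorm(50))
d$y <- exp(0.5 + d$x) + runif(50)

m <- glm(y ~ x, data = d, family = quasipoisson)
mean(d$y)                                # mean of the outcome
mean(predict(m, d, type = "response"))   # matches, up to fitting tolerance
```

The same check applied per group (rather than overall) is what the rollup tables below show.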
Regressing against the log of the outcome will not be calibrated; however, it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations where differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be expressed as a percentage of income, not in absolute dollar increments.
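One simple way to quantify relative error (an illustrative choice here; the metric used later in the post may differ) is root mean squared relative error:

```r
# Root mean squared relative error: errors are measured as fractions of
# the true values, so a $5,000 miss on a $100,000 income counts the same
# as a $500 miss on a $10,000 income.
rmsre <- function(pred, y) {
  sqrt(mean(((y - pred) / y)^2))
}
rmsre(c(105000, 55000), c(100000, 50000))  # 5% and 10% errors: about 0.079
```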
Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.
As we did in the book, we’ll use data from the 2016 US Census American Community Survey (ACS) Public Use Microdata Sample (PUMS) for our example. More information about the data can be found here. First, we’ll get the training and test data, and show how the expected income varies across different groupings (by sex, by employment, and by education):
library(zeallot)
library(wrapr)
location <- "https://github.com/WinVector/PDSwR2/raw/master/PUMS/incomedata.rds"
incomedata <- readRDS(url(location))
c(test, train) %<-% split(incomedata, incomedata$gp)
# A convenience function to calculate and display
# the conditional expected incomes
show_conditional_means <- function(d, outcome = "income") {
  cols <- qc(sex, employment, education)
  lapply(
    cols := cols,
    function(colname) {
      aggregate(d[, outcome, drop = FALSE],
                d[, colname, drop = FALSE],
                FUN = mean)
    })
}
display_tables <- function(tlist) {
  for(vi in tlist) {
    print(knitr::kable(vi))
  }
}
display_tables(show_conditional_means(train))
sex | income |
---|---|
Male | 55755.51 |
Female | 47718.52 |
employment | income |
---|---|
Employee of a private for profit | 51620.39 |
Federal government employee | 64250.09 |
Local government employee | 54740.93 |
Private not-for-profit employee | 53106.41 |
Self employed incorporated | 66100.07 |
Self employed not incorporated | 41346.47 |
State government employee | 53977.20 |
education | income |
---|---|
no high school diploma | 31883.18 |
Regular high school diploma | 38052.13 |
GED or alternative credential | 37273.30 |
some college credit, no degree | 42991.09 |
Associate’s degree | 47759.61 |
Bachelor’s degree | 65668.51 |
Master’s degree | 79225.87 |
Professional degree | 97772.60 |
Doctorate degree | 91214.55 |
Now we’ll model income as a function of age, sex, employment, and education three different ways:
# linear model for income
model_income <- lm(income ~ age+sex+employment+education,
                   data=train)
# linear model for log10(income)
model_logincome <- lm(log10(income) ~ age+sex+employment+education,
                      data=train)
# Quasipoisson model for income
model_pincome <- glm(income ~ age+sex+employment+education,
                     data=train,
                     family=quasipoisson)
Note that we are fitting a quasipoisson model for income; strictly speaking, a Poisson model assumes that the mean and the variance of the data are equal, which is not true in general. A quasipoisson model relaxes this restriction on the variance. We’ll still refer to it as a Poisson model for brevity.
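The overdispersion is easy to see on income-like data. A quick synthetic illustration (lognormal values standing in for incomes):

```r
# Income-like (lognormal) values: the variance far exceeds the mean,
# violating the strict Poisson assumption that the two are equal.
set.seed(9)
y <- exp(rnorm(1000, mean = 10, sd = 0.7))  # synthetic income-like values
mean(y)  # on the order of tens of thousands
var(y)   # orders of magnitude larger than the mean
```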
Now we can use all three models to predict income for the training data.
train <- transform(train,
                   pred_lm = predict(model_income, train),
                   pred_lmlog = 10^predict(model_logincome, train),
                   pred_pois = predict(model_pincome,
                                       train, type="response"))
knitr::kable(
  summary(train[, qc(income, pred_lm, pred_pois, pred_lmlog)]))
income | pred_lm | pred_pois | pred_lmlog | |
---|---|---|---|---|
Min. : 1200 | Min. : -4682 | Min. : 15704 | Min. : 11977 | |
1st Qu.: 26700 | 1st Qu.: 36877 | 1st Qu.: 36480 | 1st Qu.: 30546 | |
Median : 41200 | Median : 50180 | Median : 47450 | Median : 40281 | |
Mean : 52373 | Mean : 52373 | Mean : 52373 | Mean : 44478 | |
3rd Qu.: 66000 | 3rd Qu.: 65962 | 3rd Qu.: 63669 | 3rd Qu.: 54397 | |
Max. :250000 | Max. :125969 | Max. :159583 | Max. :129216 |
Note that even though all actual incomes were positive, the linear model (model_income) sometimes predicted negative income.
Now let’s compare how the predicted incomes roll up.
display_tables(
  show_conditional_means(train,
                         qc(income, pred_lm, pred_pois, pred_lmlog)))
sex | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|
Male | 55755.51 | 55755.51 | 55755.51 | 47081.99 |
Female | 47718.52 | 47718.52 | 47718.52 | 40895.21 |
employment | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|
Employee of a private for profit | 51620.39 | 51620.39 | 51620.39 | 43169.85 |
Federal government employee | 64250.09 | 64250.09 | 64250.09 | 58542.64 |
Local government employee | 54740.93 | 54740.93 | 54740.93 | 49988.61 |
Private not-for-profit employee | 53106.41 | 53106.41 | 53106.41 | 47475.45 |
Self employed incorporated | 66100.07 | 66100.07 | 66100.07 | 53189.40 |
Self employed not incorporated | 41346.47 | 41346.47 | 41346.47 | 31151.47 |
State government employee | 53977.20 | 53977.20 | 53977.20 | 50023.27 |
education | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|
no high school diploma | 31883.18 | 31883.18 | 31883.18 | 26978.21 |
Regular high school diploma | 38052.13 | 38052.13 | 38052.13 | 32437.46 |
GED or alternative credential | 37273.30 | 37273.30 | 37273.30 | 30816.91 |
some college credit, no degree | 42991.09 | 42991.09 | 42991.09 | 36184.14 |
Associate’s degree | 47759.61 | 47759.61 | 47759.61 | 40585.89 |
Bachelor’s degree | 65668.51 | 65668.51 | 65668.51 | 55130.77 |
Master’s degree | 79225.87 | 79225.87 | 79225.87 | 69437.91 |
Professional degree | 97772.60 | 97772.60 | 97772.60 | 81612.18 |
Doctorate degree | 91214.55 | 91214.55 | 91214.55 | 80679.19 |
The rollups of the predictions for the linear and Poisson models (model_income and model_pincome) match the rollups of the training data. The predictions from model_logincome roll up too low. In fact, by Jensen's inequality, one can show that a linear model fit to log-income will always be biased low (underpredict) when used to estimate expected income. This means that if one of the intended uses of the model is to estimate aggregates (grouped sums, conditional means), then a calibrated model such as a linear or Poisson model is more appropriate.
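The direction of the bias is easy to see in miniature: exponentiating the mean of the logs gives a geometric mean, which for non-constant positive data always falls below the arithmetic mean (this is the arithmetic-geometric mean inequality):

```r
# A model of log10(income) estimates (roughly) the geometric mean;
# rollups need the arithmetic mean, which is always at least as large.
y <- c(20000, 40000, 80000, 160000)
arithmetic_mean <- mean(y)            # 75000
geometric_mean  <- 10^mean(log10(y))  # about 56569
arithmetic_mean - geometric_mean      # the systematic under-prediction
```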
If the primary purpose of the model is predictions on individuals, then biased models may still be acceptable, or even preferable. When predicting income, it’s often the case that you want to express uncertainty in relative terms: that is, predict income to within 5%, rather than predict income to within $50. So let’s see how each of the models performs in terms of relative error (on the training data):
rel_err <- function(x, y) {
mean(abs(y-x)/y)
}
lapply(train[, qc(pred_lm, pred_lmlog, pred_pois)],
function(p) rel_err(p, train$income)) %.>%
as.data.frame(.) %.>%
knitr::kable(.)
pred_lm | pred_lmlog | pred_pois |
---|---|---|
0.74858 | 0.615897 | 0.7437119 |
model_logincome has a lower average relative error on estimated income than either of the models fit directly to income. It's not a great relative error, but that's because our set of input variables isn't informative enough. We can also compare the models' performances in terms of root mean squared error (an absolute difference):
rmse <- function(x, y) {
sqrt(mean((y-x)^2))
}
lapply(train[, qc(pred_lm, pred_lmlog, pred_pois)],
function(p) rmse(p, train$income)) %.>%
as.data.frame(.) %.>%
knitr::kable(.)
pred_lm | pred_lmlog | pred_pois |
---|---|---|
31625.35 | 32616.57 | 31395.38 |
The models that are fit directly to income have lower RMSE than model_logincome, but not dramatically so. In other words, model_logincome seems to improve relative error, at the cost of a slightly larger RMSE.
The real test of the three models for income is how they perform on data not used to train the models. First, we’ll compare the rollups.
test <- transform(test,
pred_lm = predict(model_income, test),
pred_lmlog = 10^predict(model_logincome, test),
pred_pois = predict(model_pincome,
test, type="response"))
rollups <- show_conditional_means(test,
qc(income, pred_lm,
pred_pois,pred_lmlog))
display_tables(rollups)
sex | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|
Male | 55408.95 | 55903.57 | 55899.83 | 47173.10 |
Female | 46261.99 | 46876.96 | 47111.01 | 40361.71 |
employment | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|
Employee of a private for profit | 50717.96 | 51314.69 | 51362.44 | 42947.99 |
Federal government employee | 66268.05 | 64635.60 | 64881.32 | 58993.59 |
Local government employee | 52565.89 | 53730.43 | 54119.83 | 49450.23 |
Private not-for-profit employee | 52887.52 | 52830.80 | 53259.07 | 47642.49 |
Self employed incorporated | 67744.61 | 65538.68 | 66096.20 | 53189.42 |
Self employed not incorporated | 41417.25 | 41671.41 | 41507.17 | 31265.77 |
State government employee | 51314.92 | 54106.89 | 53973.39 | 50029.83 |
education | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|
no high school diploma | 29903.70 | 31738.07 | 31783.60 | 26923.95 |
Regular high school diploma | 36979.33 | 37538.76 | 37746.81 | 32162.33 |
GED or alternative credential | 39636.86 | 37336.08 | 37177.50 | 30666.80 |
some college credit, no degree | 43490.42 | 43199.50 | 43270.86 | 36421.74 |
Associate’s degree | 48384.19 | 47167.06 | 47234.56 | 40140.43 |
Bachelor’s degree | 65268.96 | 66077.47 | 66141.27 | 55535.11 |
Master’s degree | 77180.40 | 79521.83 | 79594.17 | 69750.68 |
Professional degree | 94976.75 | 98649.58 | 99009.56 | 82575.73 |
Doctorate degree | 87535.83 | 91403.52 | 91742.54 | 81524.25 |
# see how close the rollups get to ground truth for employment
err_mag <- function(x, y) {
delta <- y - x
sqrt(sum(delta^2))
}
employment <- rollups$employment
lapply(employment[, qc(pred_lm, pred_pois, pred_lmlog)],
function(p) err_mag(p, employment$income)) %.>%
as.data.frame(.) %.>%
knitr::kable(.)
pred_lm | pred_pois | pred_lmlog |
---|---|---|
4135.96 | 3831.967 | 21611.7 |
None of the models reproduced the true rollups perfectly. Just looking at the employment rollup, you can see that the rollups from model_income and model_pincome are usually fairly close to the actual rollups, while the rollups from model_logincome are off — and consistently under. The pattern holds for the rollups by sex and education as well.
Let’s compare the models on individual predictions.
# relative error
lapply(test[, qc(pred_lm, pred_lmlog, pred_pois)],
function(p) rel_err(p, test$income)) %.>%
as.data.frame(.) %.>%
knitr::kable(.)
pred_lm | pred_lmlog | pred_pois |
---|---|---|
0.7508259 | 0.6222302 | 0.7543232 |
# root mean square error
lapply(test[, qc(pred_lm, pred_lmlog, pred_pois)],
function(p) rmse(p, test$income)) %.>%
as.data.frame(.) %.>%
knitr::kable(.)
pred_lm | pred_lmlog | pred_pois |
---|---|---|
31589.5 | 32389.97 | 31341.14 |
Again, model_logincome returns predictions with the lowest relative error, but a slightly higher RMSE than the other two models.
In this note we have shown the consequences of different modeling decisions, in particular the trade-off between bias and relative error. Notice that transforming the outcome and using a link function have different advantages. Which procedure you use depends on what is most important to your application: correctly estimating summary statistics, minimizing relative error, or minimizing squared error.
In our next article, we will show that common tree models are also non-calibrated, which means that despite their high accuracy on individual predictions, they do not correctly estimate summary statistics in an unbiased way. Later, we will address how to mitigate this issue.
Postscript
Our thanks to Jelmer Ypma for pointing us to references on corrections for loglinear models; these corrections reduce the bias and RMSE of estimates of y that are based on predictions from a linear model for log(y). More information can be found in Chapter 6.4 of Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge (2014).
We want to fit a linear model where the names of the data columns carrying the outcome to predict (y), the explanatory variables (x1, x2), and per-example row weights (wt) are given to us as string values in variables.
Let's start with our example data and parameters. The point is: we are assuming the data and parameters come to us as arguments, and are not known at the time of writing the script or program.
# our inputs
d <- data.frame(
x1 = c(1, 2, 3, 4),
x2 = c(0, 0, 1, 1),
y = c(3, 3, 4, -5),
wt = c(1, 2, 1, 1))
knitr::kable(d)
x1 | x2 | y | wt |
---|---|---|---|
1 | 0 | 3 | 1 |
2 | 0 | 3 | 2 |
3 | 1 | 4 | 1 |
4 | 1 | -5 | 1 |
outcome_name <- "y"
explanatory_vars <- c("x1", "x2")
name_for_weight_column <- "wt"
For everything except the weights this is easy, as the linear regression function lm() is willing to take a string in its first argument, formula (and there are also tools for building up formula objects).
# start our generic solution
formula_str <- paste(
outcome_name,
"~",
paste(explanatory_vars, collapse = " + "))
print(formula_str)
## [1] "y ~ x1 + x2"
model <- lm(formula_str,
data = d)
print(model)
##
## Call:
## lm(formula = formula_str, data = d)
##
## Coefficients:
## (Intercept) x1 x2
## 9.75 -4.50 5.50
format(model$terms)
## [1] "y ~ x1 + x2"
However, once we try to add weights we have problems.
lm(formula_str,
data = d,
weights = name_for_weight_column)
## Error in model.frame.default(formula = formula_str, data = d, weights = name_for_weight_column, : variable lengths differ (found for '(weights)')
This is a bit disappointing, as much of the point of working in R is being able to write parameterized scripts and programs over the R functions. So we really want to be able to take names of columns from an external source.
The reason is the following (taken from help(lm)):
All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
This means the weights argument is not treated as a value; instead, the name typed in is captured through "non-standard evaluation" (NSE). The data frame and the formula environment are searched for a column or value named "name_for_weight_column", not for one named "wt".
This is a big hindrance to using lm() programmatically. The NSE notation style can be convenient when working interactively, but is often a burden when programming.
However, R has some ways out of the problem. The first solution is bquote(), which is part of R itself (in the base package).
eval(bquote(
lm(
formula_str,
data = d,
weights = .(as.name(name_for_weight_column)))
))
##
## Call:
## lm(formula = formula_str, data = d, weights = wt)
##
## Coefficients:
## (Intercept) x1 x2
## 9.429 -3.857 3.571
In the above, the .() notation indicates to replace the .()-expression with its evaluated value before evaluating the rest of the expression. This is a substitution principle based on escaping notation (also called quasiquoting).
Another solution is the let() function from the wrapr package (a user extension package, not part of R itself).
wrapr::let(
c(NAME_FOR_WEIGHT_COLUMN = name_for_weight_column),
lm(
formula_str,
data = d,
weights = NAME_FOR_WEIGHT_COLUMN)
)
##
## Call:
## lm(formula = formula_str, data = d, weights = wt)
##
## Coefficients:
## (Intercept) x1 x2
## 9.429 -3.857 3.571
In the above, the left-hand sides of the named vector are symbols to be replaced, and the right-hand sides refer to values to replace them with. The specification c(NAME_FOR_WEIGHT_COLUMN = name_for_weight_column) means to replace NAME_FOR_WEIGHT_COLUMN with wt (the value referred to by name_for_weight_column). This is a substitution principle based on named substitution targets.
Another solution is the !! notation from the rlang package (a user extension package, not part of R itself).
rlang::eval_tidy(rlang::quo(
lm(
formula_str,
data = d,
weights = !!as.name(name_for_weight_column))
))
##
## Call:
## lm(formula = formula_str, data = d, weights = wt)
##
## Coefficients:
## (Intercept) x1 x2
## 9.429 -3.857 3.571
In the above, the !! notation indicates to replace the !!-expression with its evaluated value before evaluating the rest of the expression. This is again a substitution principle based on escaping notation (quasiquoting).
Note that the argument types and/or the internals of lm() do not currently appear to allow the use of the newer rlang double curly brace notation (a notation that can replace !!rlang::enquo(), and possibly other forms).
rlang::eval_tidy(rlang::quo(
lm(
formula_str,
data = d,
weights = {{name_for_weight_column}})
))
## Error in model.frame.default(formula = formula_str, data = d, weights = ~"wt", : invalid type (language) for variable '(weights)'
We only mention the "{{}}" notation because rlang-familiar readers are likely to wonder about using it.
All of the above solutions are using meta-programming tools. These are tools that let programs treat programs as data. As such they are very powerful. In fact they are big hammers for such a simple problem as specifying a column name that is already stored as data. However, the above tools should not be blamed for the awkwardness of having a need for them.
The need for these meta-programming tools comes from the over-reliance on NSE interfaces. Or more precisely: the lack of a value oriented interface (possibly in addition to the NSE interface). NSE designs emphasize specifying things in program text instead of in data. This violation of “The Rule of Representation: Fold knowledge into data so program logic can be stupid and robust” is what causes the problems.
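One value-oriented escape hatch deserves mention: build the call with do.call(), so that every argument (including the weights) is an already-evaluated value by the time lm() sees it. A sketch, restating the example inputs so it stands alone:

```r
# Side-step NSE by evaluating all arguments before the call is built.
d <- data.frame(
  x1 = c(1, 2, 3, 4),
  x2 = c(0, 0, 1, 1),
  y  = c(3, 3, 4, -5),
  wt = c(1, 2, 1, 1))
formula_str <- "y ~ x1 + x2"
name_for_weight_column <- "wt"

model <- do.call(
  lm,
  list(formula = as.formula(formula_str),
       data = d,
       weights = d[[name_for_weight_column]]))  # a plain numeric vector
coef(model)
```

The printed call is less readable than the bquote()/let() versions, but the fit is the same weighted regression.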
Our task is to take a linear model in one system and stand up a replica of it in another system. We take the phrase "replicating a model" with an eye to the finance term "replicating portfolio." This is different from the sense of the word in "reproducible research" (which pertains more to the ability to re-run work reliably).
This technique can be useful in moving an existing model into a large data system such as Apache Spark or Google BigQuery.
Let’s work an example in R. Suppose we are working with a linear regression model and from our donor system we have extracted the following representation of the model as “intercept” and “betas”.
intercept <- 3
betas <- c(weight = 2, height = 4)
Our goal is to build a linear regression model that has the above coefficients. The way we are going to do this is by building our own synthetic data set such that the regression fit through this data set yields these coefficients.
To do this we will build one data row for the intercept and one for each of the coefficients. Our data set will represent a dependent variable “y” based on the input variables “weight” and “height”.
n_vars <- length(betas)
d <- as.data.frame(matrix(0,
nrow = n_vars + 1,
ncol = n_vars + 1))
colnames(d) <- c(names(betas), "y")
rownames(d) <- c("intercept", names(betas))
d["intercept", "y"] = intercept
This gives us the following partially filled out data.frame.
knitr::kable(d)
weight | height | y | |
---|---|---|---|
intercept | 0 | 0 | 3 |
weight | 0 | 0 | 0 |
height | 0 | 0 | 0 |
We then fill in each beta row ni with a 1 in the corresponding beta-column and "y" equal to intercept + beta[[ni]]. That is, each such row asserts y = intercept + beta[[ni]].
for(ni in names(betas)) {
d[ni, ni] <- 1
d[ni, "y"] <- intercept + betas[[ni]]
}
This gives us the following finished synthetic data set.
knitr::kable(d)
weight | height | y | |
---|---|---|---|
intercept | 0 | 0 | 3 |
weight | 1 | 0 | 5 |
height | 0 | 1 | 7 |
The idea is: a linear model with our original intercept and coefficients equal to our betas is the unique linear model that matches the y-values we have specified. Let’s confirm this.
model <- lm(y ~ weight + height,
data = d)
print(coefficients(model))
## (Intercept) weight height
## 3 2 4
Notice this model has exactly the coefficients we started with. We have solved our linear model replication problem. For a follow-up problem: suppose we wanted a logistic regression model with the given coefficients. How would we go about that?
Our idea for that is as follows. First we need a data frame that has twice as many rows as our original did.
d2_0 <- d
rownames(d2_0) <- paste0(rownames(d2_0), "_0")
d2_0$y <- 0
d2_1 <- d
rownames(d2_1) <- paste0(rownames(d2_1), "_1")
d2_1$y <- 1
d2 <- rbind(d2_0, d2_1)
This gives us the following data set.
knitr::kable(d2)
weight | height | y | |
---|---|---|---|
intercept_0 | 0 | 0 | 0 |
weight_0 | 1 | 0 | 0 |
height_0 | 0 | 1 | 0 |
intercept_1 | 0 | 0 | 1 |
weight_1 | 1 | 0 | 1 |
height_1 | 0 | 1 | 1 |
Now we are going to use weights to encode a relation in the data that forces a logistic regression to replicate the coefficients we want. The idea is that each row should be in our training set with a weight (or mass) proportional to the logistic predicted probability for rows with y==1, and one minus the logistic predicted probability for rows with y==0.
What we are saying is: to replicate the logistic model we need evaluations of that model for these rows. Let's calculate those for our desired coefficients as follows.
d2$py <- intercept
for(ni in names(betas)) {
d2$py <- d2$py + betas[[ni]]*d2[[ni]]
}
d2$py <- 1/(1+exp(-d2$py))
The last step is the traditional sigmoid transform that translates from link space to probabilities for the logistic model. We said we wanted 1-py in the rows where y is zero, so we make that adjustment.
d2$py[d2$y==0] <- 1 - d2$py[d2$y==0]
This gives us the following data set.
knitr::kable(d2)
weight | height | y | py | |
---|---|---|---|---|
intercept_0 | 0 | 0 | 0 | 0.0474259 |
weight_0 | 1 | 0 | 0 | 0.0066929 |
height_0 | 0 | 1 | 0 | 0.0009111 |
intercept_1 | 0 | 0 | 1 | 0.9525741 |
weight_1 | 1 | 0 | 1 | 0.9933071 |
height_1 | 0 | 1 | 1 | 0.9990889 |
If we fit a logistic regression model on this data with each row weighted by the py column, we recover the desired coefficients, as we see below. The idea is: the logistic model describes the distribution of the outcomes instead of the values themselves.
model2 <- glm(y ~ weight + height,
data = d2,
weights = d2$py,
family = binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
print(coefficients(model2))
## (Intercept) weight height
## 3 2 4
print(class(model2))
## [1] "glm" "lm"
model2 is now a logistic generalized linear model with the desired coefficients.
And that is how you replicate a model when you have access to the original model coefficients, but don’t have training data.
data.table has a particularly concise "by" grouping notation when combined with the ":=" notation.
Let’s take a look at this powerful notation.
First, let’s build an example data.frame
.
d <- wrapr::build_frame(
   "group" , "value" |
   "a"     , 1L      |
   "a"     , 2L      |
   "b"     , 3L      |
   "b"     , 4L      )

knitr::kable(d)
group | value |
---|---|
a | 1 |
a | 2 |
b | 3 |
b | 4 |
The data is some sort of value with a grouping column telling us which rows are related.
With the data.table ":=, by" notation we can add the per-group totals into each row of the data as follows (the extra [] at the end is just the command to also print the results in addition to adding the column in place).
library("data.table")

dt <- data.table::as.data.table(d)
dt[, group_sum := sum(value), by = "group"][]

#    group value group_sum
# 1:     a     1         3
# 2:     a     2         3
# 3:     b     3         7
# 4:     b     4         7
The "by" signals we are doing a per-group calculation, and the ":=" signals to land the results in the original data.table. This sort of window function is incredibly useful in computing things such as what fraction of a group's mass is in each row. For example:
# build a fresh copy, as the last command altered dt in place
dt <- data.table::as.data.table(d)
dt[, fraction := value/sum(value), by = "group"][]

#    group value  fraction
# 1:     a     1 0.3333333
# 2:     a     2 0.6666667
# 3:     b     3 0.4285714
# 4:     b     4 0.5714286
In base R (or in a more purely relational data system) the obvious way to solve this requires two steps: computing the per-group summaries and then joining them back into the original table rows. This can be done as follows.
sums <- tapply(d$value, d$group, sum)
d$fraction <- d$value/sums[d$group]
print(d)

#   group value  fraction
# 1     a     1 0.3333333
# 2     a     2 0.6666667
# 3     b     3 0.4285714
# 4     b     4 0.5714286
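Base R can also do the join-back in one step: ave() computes a per-group aggregate and broadcasts it back to the original rows.

```r
# ave() returns the per-group sum aligned to the original rows,
# so no explicit join step is needed.
d <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 4))
d$fraction <- d$value / ave(d$value, d$group, FUN = sum)
print(d)
#   group value  fraction
# 1     a     1 0.3333333
# 2     a     2 0.6666667
# 3     b     3 0.4285714
# 4     b     4 0.5714286
```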
We called the transform a “window function”, as that is the name that SQL uses for the concept. The SQL code to perform this calculation would look like the following.
SELECT
  group,
  value,
  value/sum(value) OVER ( PARTITION BY group ) AS fraction
FROM
  d
And the popular package dplyr uses the following notation for the same problem.
d %>%
  group_by(group) %>%
  mutate(fraction = value/sum(value)) %>%
  ungroup()
And, as always, let’s end with some timings. For a 1000000 row table with 10 additional irrelevant columns, and group ids picked uniformly from 100000 symbols: we see the various solutions take the following times to complete the task.
## solution            milliseconds
## datatable_soln               384
## base_R_lookup_soln          1476
## dplyr_soln                  3988
All packages are the current CRAN releases as of 2019-06-29. Timings are sensitive to the number of rows, columns, and groups. Note the data.table time includes the time to convert to the data.table class. Details on the timings can be found here.
The dtplyr package translates dplyr pipelines into data.table commands, to take advantage of data.table's superior performance. Obviously, if one wants to use data.table it is best to learn data.table directly. But if we want code that can run in multiple places, a translation layer may be in order.
In this note we look at how this translation is commonly done.
The dtplyr developers recently announced they are making changes to dtplyr to support two operation modes:

Note that there are two ways to use dtplyr:
- Eagerly [WIP]. When you use a dplyr verb directly on a data.table object, it eagerly converts the dplyr code to data.table code, runs it, and returns a new data.table. This is not very efficient because it can't take advantage of many of data.table's best features.
- Lazily. In this form, triggered by using lazy_dt(), no computation is performed until you explicitly request it with as.data.table(), as.data.frame() or as_tibble(). This allows dtplyr to inspect the full sequence of operations to figure out the best translation.

(reference; the page has since been deleted)
This is a bit confusing, but we can unroll it a bit. dplyr (and later dtplyr) has always eagerly converted dplyr pipelines into data.table realizations; that has been dplyr's strategy since the first released version of dplyr (version 0.1.1, 2014-01-29).
Lazy evaluation is the better way to call data.table. Our own rqdatatable package has been calling data.table this way for over a year (ref). It is very odd that dplyr didn't use this good strategy for the data.table adaptor, as it is the strategy dplyr uses in its SQL adaptor.
Let's take a look at the current published version of dtplyr (0.0.3) and how its eager evaluation works. Consider the following four trivial functions, each of which adds one to a data.frame column multiple times.
base_r_fn <- function(df) {
dt <- df
for(i in seq_len(nstep)) {
dt$x1 <- dt$x1 + 1
}
dt
}
dplyr_fn <- function(df) {
dt <- df
for(i in seq_len(nstep)) {
dt <- mutate(dt, x1 = x1 + 1)
}
dt
}
dtplyr_fn <- function(df) {
dt <- as.data.table(df)
for(i in seq_len(nstep)) {
dt <- mutate(dt, x1 = x1 + 1)
}
dt
}
data.table_fn <- function(df) {
dt <- as.data.table(df)
for(i in seq_len(nstep)) {
dt[, x1 := x1 + 1]
}
dt[]
}
base_r_fn() is idiomatic R code, dplyr_fn() is idiomatic dplyr code, dtplyr_fn() is idiomatic dplyr code operating over a data.table object (hence using dtplyr), and data.table_fn() is idiomatic data.table code.
When we time running all of these functions operating on a 100000 row by 100 column data frame for 1000 steps we see each of them takes the following time to complete the task on average:
   method      mean_seconds
1: base_r         0.8367011
2: data.table     1.5592681
3: dplyr          2.6420171
4: dtplyr       151.0217646
The "eager" dtplyr system is about 100 times slower than data.table. This trivial task is one of the few times that data.table isn't by far the fastest implementation (in tasks involving grouped summaries, joins, and other non-trivial operations, data.table typically has a large performance advantage, ref).
Here is the same data presented graphically.
This is why we don't consider "eager" the proper way to call data.table; it artificially makes data.table appear slow. This is the negative impression of data.table that the dplyr/dtplyr adaptors have been falsely giving dplyr users for the last five years. dplyr users either felt they were getting the performance of data.table through dplyr (if they didn't check timings), or got a (false) negative impression of data.table (if they did check timings).
Details of the timings can be found here.
As we have said: the "don't force so many extra copies" methodology has been in rqdatatable for quite some time, and in fact works well. Some timings on a similar problem are shared here. Notice the two rqdatatable timings have some translation overhead. This is why using base R or data.table directly is, in general, going to be a superior methodology.
Let's imagine you are testing a new weight loss program and comparing it to some existing weight loss regimen. You want to run an experiment to determine if the new program is more effective than the old one. You'll put a control group on the old plan and a treatment group on the new plan, and after three months, you'll measure how much weight the subjects lost, and see which plan does better on average.
The question is: how many subjects do you need to run a good experiment?
In order to answer that question, you generally need to decide several things: the error rates you are willing to tolerate, and the minimum effect size you want to detect.
Let's say that you are willing to tolerate a probability of either a false positive or a false negative of 1%. This means that you want a p-value threshold of 0.01, and a power of 0.99.
p_value <- 0.01
power <- 0.99
You also decide that you will call the new program “better” if the average weight loss at the end of the trial is two pounds or more greater on the new program than on the old.
Of course, not everyone on a program is going to lose exactly the same amount of weight; in fact, weight loss generally varies widely. Let's say you know from experience that on average, people on the old program lose about 10 pounds after three months, and the standard deviation on the weight loss is about five pounds. This relatively wide standard deviation will make it harder to detect an average difference of two pounds than if the standard deviation were (miraculously) about one pound.
Cohen’s d is a measure of effect size for the difference of two means that takes the variance of the population into account. It’s defined as
d = | μ_{1} – μ_{2} | / σ_{pooled}
where σ_{pooled} is the pooled standard deviation over both cohorts.
σ_{pooled} = √( ( σ_{1}^{2} + σ_{2}^{2})/2 )
Note that this formula assumes both cohorts are the same size.
The use of Cohen’s d for experimental design also assumes that the true standard deviation of the two treatment groups is about the same; only the mean differs.
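Those formulas translate directly into a few lines of R (a sketch; it assumes equal-sized cohorts, matching the pooled-standard-deviation formula above):

```r
# Cohen's d for two equal-sized cohorts, straight from the definition.
cohens_d <- function(x1, x2) {
  sigma_pooled <- sqrt((var(x1) + var(x2)) / 2)
  abs(mean(x1) - mean(x2)) / sigma_pooled
}

# example: simulated cohorts with true means 10 and 12 and sd 5,
# so the true d is 2/5 = 0.4
set.seed(3525)
old_plan <- rnorm(1000, mean = 10, sd = 5)
new_plan <- rnorm(1000, mean = 12, sd = 5)
cohens_d(old_plan, new_plan)
```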
Cohen suggested as a rule of thumb that d = 0.2 is a small effect, d = 0.5 is medium, and d = 0.8 is large (Wikipedia has a more detailed rule-of-thumb table). Here’s what those values of d (plus one more) look like:
But of course, we’d like to map d to meaningful physical units. If we assume that subjects on the new program will have about the same variance in weight loss as people on the old program, we can estimate the minimum d that we’d like to detect.
# in pounds
sigma <- 5
delta_mu <- 2
(d_minimum <- delta_mu/sigma)
## [1] 0.4
Now we can estimate how many subjects we need per group, using the pwr.t.test() function from the pwr package in R. This function has four primary arguments: the per-group sample size n, the effect size d, the significance level sig.level, and the power. Given any three (with the fourth set to NULL), pwr.t.test() estimates the value of the fourth. We'll do a two-sided, two-sample test (compare two populations and check if the means are different).
library(pwr)
n_per_cohort <- NULL # what we want to know
effect_size <- d_minimum
p_value <- 0.01
power <- 0.99
type <- "two.sample" # default value
alternative <- "two.sided" # default value
## given any three you can estimate the fourth
(estimate <- pwr.t.test(n = n_per_cohort,
d = effect_size,
sig.level = p_value,
power = power,
type = type,
alternative = alternative))
##
## Two-sample t test power calculation
##
## n = 302.0564
## d = 0.4
## sig.level = 0.01
## power = 0.99
## alternative = two.sided
##
## NOTE: n is number in *each* group
We estimate that you will need about 303 subjects in each cohort (or 606 subjects total) to reliably detect a difference of two pounds average weight loss between the two groups of subjects (where “reliably” means 1% chance of a false positive and 1% chance of a false negative).
If this seems like too large an experiment, then you may need to relax your error bounds or test for a larger effect size, whichever is more appropriate to your situation. If you are willing to reduce the power of your test (but leave the significance level where it is), plotting the output of pwr.t.test() will give you a helpful power analysis graph.
# you could just plot it, but I want the data
eplot <- plot(estimate)
powanal <- eplot$data
# a function that returns the approx sample size for a given power
panal_fun <- approxfun(powanal$power, powanal$sample_sizes)
powers <- c(0.9, 0.95)
nframe <- data.frame(power=powers,
sample_sizes=ceiling(panal_fun(powers)))
nframe <- transform(nframe,
label = paste("approx n =", sample_sizes))
library(ggplot2)
ggplot(powanal, aes(x=sample_sizes, y=power)) +
geom_point() + geom_line(size = 0.1, color="red") +
geom_hline(yintercept = nframe$power, linetype=3, color="darkblue") +
geom_vline(xintercept = nframe$sample_sizes, linetype=3, color="darkblue") +
geom_text(data=nframe, aes(x = 300, y = power, label=label)) +
ggtitle("Two-sample t test power calculation",
subtitle = "n is sample size per group")
If you want to detect an effect size of at least two pounds to a significance of 0.01 and a power of 0.95, you need 226 subjects per group, or 452 subjects total. If you can settle for a test with a power of 0.9, then you only need 188 subjects per group, or 376 subjects total.
What often happens in real life is that the experimenter only has access to as many subjects as they can gather; they then hope that the set is large enough to detect any useful difference. You can use pwr.t.test() to estimate the efficacy of your experiment in this situation, as well.
Suppose you are only able to gather 200 subjects (100 subjects for each diet plan). How big a difference can your experiment detect?
n <- 100
(estimate <- pwr.t.test(n = n,
d = NULL,
sig.level = p_value,
power = power,
type = type,
alternative = alternative))
##
## Two-sample t test power calculation
##
## n = 100
## d = 0.6991277
## sig.level = 0.01
## power = 0.99
## alternative = two.sided
##
## NOTE: n is number in *each* group
# how large an effect in pounds?
(delta_mu <- sigma * estimate$d)
## [1] 3.495639
An experiment of this size, with a desired significance of 0.01 and a desired power of 0.99, can only reliably detect an effect size of about 3.5 pounds or larger. Lowering the test power to 0.8 will lower the minimum detectable effect size to about 2.5 pounds; this means that even if the new diet truly improves average weight loss by about 2.5 pounds over three months, your experiment will have a 20% chance of failing to detect it.
Assuming that you want to keep your error bounds as they are, and you can gather a suitable number of subjects at will, what is the minimum detectable effect size for a given sample size? You can estimate that using pwr.t.test(), too.
sample_sizes <- seq(from = 50, to = 500, by = 50)
# a function to get the effect size as d
get_effect_size <- function(n, sig_level = 0.01, power = 0.99) {
test_results <- pwr.t.test(n, NULL, sig_level, power)
test_results$d
}
# estimate d for a range of sample sizes
dvec = vapply(sample_sizes,
function(n) {get_effect_size(n)},
numeric(1))
# convert dvec into approximate difference in pounds
# assuming both populations have the same sigma
sigma <- 5
diff_frame <- data.frame(n = sample_sizes,
delta_lbs = dvec * sigma)
# what's the approximate sample size (per cohort) to detect a difference of 3 pounds?
(n_3 <- ceiling(approx(diff_frame$delta_lbs, diff_frame$n, 3)$y))
## [1] 139
nframe <- data.frame(n = n_3,
delta_lbs = 3,
label = paste("approx n =", n_3))
ggplot(diff_frame, aes(x=n, y=delta_lbs)) +
geom_point() + geom_line(size = 0.1, color="red") +
geom_hline(yintercept = 3, color="darkblue", linetype=3) +
geom_vline(xintercept = n_3, color="darkblue", linetype=3) +
geom_text(data = nframe, aes(y = 1.5, label=label)) +
ylab("Minimum detectable effect size (pounds)") +
ggtitle("Estimated minimum detectable effect size (pounds)")
To detect an average weight loss difference of three pounds to a significance of 0.01 and a power of 0.99, you will need about 139 subjects per group, or 278 total.
A general rule of thumb is that, for a given set of error bounds, if you want to halve the effect size you can detect (go from detecting a difference of three pounds to 1.5 pounds), you need four times the data (from 139 subjects per group to 556 subjects per group). This follows because the detectable effect size d scales roughly as 1/sqrt(n).
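We can sanity-check this rule of thumb directly (a sketch, assuming the `pwr` package and the same error bounds as above):

```r
library(pwr)

# detectable effect size at n = 139 per group vs. n = 4 * 139 = 556 per group
d_139 <- pwr.t.test(n = 139, sig.level = 0.01, power = 0.99)$d
d_556 <- pwr.t.test(n = 556, sig.level = 0.01, power = 0.99)$d

# quadrupling the per-group sample size halves the detectable effect size
d_139 / d_556  # about 2
```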
This analysis assumes that both populations have about the same standard deviation of 5 pounds. If you suspect that the new weight loss program might have higher variance than the old one, then you should assume a higher pooled standard deviation, which means a smaller d, which means you will need more subjects.
For testing the difference in rates of two processes, use Cohen's h. The function ES.h() calculates the appropriate h for a desired difference in rates, and the functions pwr.p.test(), pwr.2p.test(), and pwr.2p2n.test() help you estimate appropriate experiment sizes in different situations.
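For example (a sketch, with made-up rates): suppose the current process converts at 5% and you hope the new one converts at 6%. ES.h() converts that difference into an effect size h, and pwr.2p.test() estimates the required per-group sample size:

```r
library(pwr)

# hypothetical rates: 5% baseline vs. a hoped-for 6%
h <- ES.h(0.06, 0.05)

# per-group sample size at sig.level 0.01 and power 0.8
pwr.2p.test(h = h, sig.level = 0.01, power = 0.8)
```

Note that small differences in rates correspond to small h, so they require very large samples (several thousand subjects per group in this case).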
Our overall point is that proper experiments need to have stated goals and documented plans before being executed. Power calculators help you properly design the experiment size.
For more information, Chapter 10 of Kabacoff’s R in Action, 2nd Edition (Manning, 2015) is a useful reference.
In this note we show that finding a minimal set of columns that form a primary key in a database is also NP-hard.
Problem: Minimum Cardinality Primary Key
Instance: Vectors x_{1} through x_{m}, elements of {0,1}^{n}, and a positive integer k.
Question: Is there a “primary key” of size no more than k? That is: is there a subset P of {1, …, n} with |P| ≤ k such that for any integers a, b with 1 ≤ a < b ≤ m we can find a j in P such that x_{a}(j) does not equal x_{b}(j) (i.e. x_{a} and x_{b} differ at some index named in P, and hence can be distinguished or “told apart”)?
Now the standard reference on NP-hardness (Garey and Johnson, Computers and Intractability, Freeman, 1979) does have some NP-hard database examples (such as SR26 Minimum Cardinality Key). However the stated formulations are a bit hard to decode, so we will relate the above problem directly to a more accessible problem: SP8 Hitting Set.
Problem: SP8 Hitting Set
Instance: Collection C of subsets of a finite set S, positive integer K ≤ |S|.
Question: Is there a subset S^{‘} contained in S with |S^{‘}| ≤ K such that S^{‘} contains at least one element from each subset in C?
The idea is: SP8 is thought to be difficult to solve, so if we show how Minimum Cardinality Primary Key could be used to easily solve SP8 this is then evidence Minimum Cardinality Primary Key is also difficult to solve.
So suppose we have an arbitrary instance of SP8 in front of us. Without loss of generality assume S = {1, …, n}, C = {C_{1}, …, C_{m}}, and all of the C_{i} are non-empty and distinct.
We build an instance of the Minimum Cardinality Primary Key problem by defining a table with columns named s_{1} through s_{n} plus d_{1} through d_{m}.
Now we define the rows of our table (all entries not mentioned are zero):
r_{0}: the all-zeros row.
z_{i} (for i = 1, …, m): a 1 in column d_{i}, zeros elsewhere.
x_{i} (for i = 1, …, m): a 1 in column d_{i}, plus a 1 in column s_{j} for each j in C_{i}.
Now let’s look at what sets of columns form primary keys for the collection of rows r_{0}, z_{i}, x_{i}.
We must have every d_{i} in P, as d_{i} is the unique index at which z_{i} and r_{0} differ. Also, for each i we must have some j in C_{i} with s_{j} in P, as otherwise we could not tell z_{i} from x_{i} (they differ only at the columns s_{j} with j in C_{i}).
This lets us confirm that for any good primary key set P, the set S^{‘} = {j | s_{j} in P} is a good hitting set for the SP8 problem. Conversely, for any hitting set S^{‘}, the set P = {s_{j} | j in S^{‘}} union {d_{1}, …, d_{m}} is a good solution for the Minimum Cardinality Primary Key problem (the d_{i} let us distinguish r_{0} from each z_{i}, the z_{i} from each other, r_{0} from each x_{i}, and the x_{i} from each other; the set hitting property lets us distinguish each z_{i} from the corresponding x_{i}, completing the unique keying of rows by the chosen column set). And the solution sizes always satisfy |P| = |S^{‘}| + m.
So: if we had a method to solve arbitrary instances of the Minimum Cardinality Primary Key problem, we could then use it to solve arbitrary instances of the SP8 Hitting Set Problem. We would just re-encode the SP8 problem as described above, solve the Minimum Cardinality Primary Key problem, and use the strong correspondence between solutions to these two problems to map the solution back to the SP8 problem. Thus the Minimum Cardinality Primary Key problem is itself NP-hard.
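To make the reduction concrete, here is a small worked instance in R (a sketch: the SP8 instance S = {1, 2, 3}, C = {{1, 2}, {2, 3}} is a made-up example, and the rows are encoded per the properties used above — r_0 all zeros, z_i a single 1 at d_i, x_i equal to z_i plus 1s at the s_j for j in C_i):

```r
# hypothetical SP8 instance: S = {1, 2, 3}, C = { {1,2}, {2,3} }
n <- 3
C <- list(c(1, 2), c(2, 3))
m <- length(C)

# reduction table: columns s1..s3, then d1..d2
tab <- matrix(0, nrow = 1 + 2 * m, ncol = n + m,
              dimnames = list(NULL, c(paste0("s", 1:n), paste0("d", 1:m))))
# row 1 is r0 (all zeros); the z_i and x_i rows follow
for (i in seq_len(m)) {
  tab[1 + i, n + i] <- 1        # z_i: a single 1 at d_i
  tab[1 + m + i, n + i] <- 1    # x_i: a 1 at d_i ...
  tab[1 + m + i, C[[i]]] <- 1   # ... plus 1s at s_j for j in C_i
}

# S' = {2} hits both sets in C, so {s2, d1, d2} should be a primary key:
anyDuplicated(tab[, c("s2", "d1", "d2")]) == 0  # TRUE: rows unique

# dropping s2 breaks the key (z_1 and x_1 collide on d1, d2 alone):
anyDuplicated(tab[, c("d1", "d2")]) == 0        # FALSE
```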
What made the problem hard is, as is quite common, the solution size constraint. Without that constraint the problem is trivial: the set of all columns either forms a primary key or it does not, and that is a simple calculation to check. As with the variable pruning problem, we can even try step-wise deleting columns to explore subsets of columns that are still primary keys for the table, moving us to a non-redundant key set (though possibly not one of minimal size).
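That check, and the step-wise pruning, can be sketched in a few lines of R (the function names and the tiny example table are our own, for illustration):

```r
# check whether a set of columns uniquely identifies every row
is_primary_key <- function(d, cols) {
  nrow(unique(d[, cols, drop = FALSE])) == nrow(d)
}

# greedily drop columns that are not needed to keep the rows unique
prune_key <- function(d, cols) {
  for (col in cols) {
    candidate <- setdiff(cols, col)
    if (length(candidate) > 0 && is_primary_key(d, candidate)) {
      cols <- candidate
    }
  }
  cols
}

d <- data.frame(a = c(1, 1, 0), b = c(0, 1, 1), c = c(1, 0, 1))
is_primary_key(d, "a")            # FALSE: rows 1 and 2 collide on a alone
is_primary_key(d, c("a", "b"))    # TRUE
prune_key(d, c("a", "b", "c"))    # a non-redundant (not necessarily minimal) key
```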