In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more accurate predictions on individuals. In this article, we’ll demonstrate one ad-hoc method for calibrating an uncalibrated model with respect to specific grouping variables. This "polishing step" potentially returns a model that estimates certain rollups in an unbiased way, while retaining good performance on individual predictions.
In our previous article, we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
However, when making predictions on individuals, a biased model may be preferable; biased models may be more accurate, or make predictions with lower relative error than an unbiased model. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. This means they may not always represent the best bias/variance trade-off.
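To see this concretely, one can check calibration directly by comparing grouped means of the predictions to grouped means of the outcome. A minimal sketch using the randomForest package on synthetic data (the data and all names here are our own, purely for illustration):

```r
library(randomForest)

set.seed(2018)
d <- data.frame(
  x = rnorm(1000),
  g = factor(sample(c("a", "b"), 1000, replace = TRUE)))
d$y <- ifelse(d$g == "a", 5, 10) + d$x + rnorm(1000)

model <- randomForest(y ~ x + g, data = d)
d$pred <- predict(model, newdata = d)

# a calibrated model would reproduce these rollups:
# mean(pred) would match mean(y) within each group
rollup <- aggregate(cbind(y, pred) ~ g, data = d, FUN = mean)
rollup
```

For an uncalibrated model, the per-group means of `pred` and `y` will typically not agree, even when the individual predictions are quite accurate.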
In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income.
One obvious reason for not regressing directly against income is that (in our example) income is restricted to be non-negative, a constraint that linear regression can’t enforce. Other reasons include the wide distribution of values and the relative or multiplicative structure of errors on outcomes. A common practice in this situation is to use Poisson regression, or generalized linear regression with a log-link function. Like all generalized linear regressions, Poisson regression is unbiased and calibrated: it preserves the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
Regressing against the log of the outcome will not be calibrated; however it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be in terms of percentage of income, not in absolute dollar increments.
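To make the contrast concrete, here is a minimal sketch on synthetic data (the data and variable names are our own; we use quasipoisson so the log-link fit accepts a continuous outcome):

```r
set.seed(2018)
d <- data.frame(x = rnorm(100))
d$income <- exp(10 + d$x + rnorm(100))  # strictly positive, multiplicative errors

# Poisson-style regression with a log link: calibrated,
# the mean of the predictions matches the mean of the outcome
model_pois <- glm(income ~ x, data = d,
                  family = quasipoisson(link = "log"))
mean(fitted(model_pois)) - mean(d$income)  # essentially zero

# regressing against log(income): biased on the dollar scale
model_log <- lm(log(income) ~ x, data = d)
mean(exp(predict(model_log))) - mean(d$income)  # not zero
```

The calibration of the log-link fit follows from its score equations: at the solution, the sum of the fitted values equals the sum of the outcomes.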
Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.
In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments.
Estimating sample size
Let’s imagine you are testing a new weight loss program and comparing it to some existing weight loss regimen. You want to run an experiment to determine if the new program is more effective than the old one. You’ll put a control group on the old plan, and a treatment group on the new plan, and after three months, you’ll measure how much weight the subjects lost, and see which plan does better on average.
The question is: how many subjects do you need to run a good experiment?
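For a rough sense of scale, base R’s power.t.test() can answer this once you commit to an assumed effect size; here we assume a “medium” effect (Cohen’s d = 0.5), which is our choice for illustration, not a value from the post:

```r
# per-group sample size for 80% power at the 0.05 significance level,
# assuming the true difference is half a standard deviation (d = 0.5)
res <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
res
```

Under these assumptions the calculation calls for about 64 subjects per group; a smaller assumed effect size requires substantially more subjects.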
We have two new chapters of Practical Data Science with R, Second Edition online and available for review!
The newly available chapters cover:
Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.
Choosing and Evaluating Models – The chapter starts with exploring machine learning approaches and then moves to studying key model evaluation topics like mapping business problems to machine learning tasks, evaluating model quality, and how to explain model predictions.
If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.
For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.
If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form:
model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
(smod_lm <- summary(model_lm))
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:  0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
and also in the summary object itself:
c(R2 = smod_lm$r.squared, F = smod_lm$fstatistic)
##        R2   F.value 
## 0.7599546 468.5501535
Note, though, that while the summary reports the model’s significance, it does not carry it as a specific summary() object item.
sigr::wrapFTest() is a convenient way to extract the model’s R-squared and F statistic and simultaneously calculate the model significance, as is required by many scientific publications.
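For example (refitting the model above so the snippet stands alone):

```r
library(sigr)

# same model as before: predict petal length from sepal length
model_lm <- lm(Petal.Length ~ Sepal.Length, data = iris)

# wrap the F-test: R-squared, F statistic, and significance together
ft <- wrapFTest(model_lm)
ft
```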
sigr is even more helpful for logistic regression, via glm(), which reports neither the model’s chi-squared statistic nor its significance.
iris$isVersicolor <- iris$Species == "versicolor"
model_glm <- glm(
  isVersicolor ~ Sepal.Length + Sepal.Width,
  data = iris,
  family = binomial)
(smod_glm <- summary(model_glm))
## 
## Call:
## glm(formula = isVersicolor ~ Sepal.Length + Sepal.Width, family = binomial, 
##     data = iris)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9769  -0.8176  -0.4298   0.8855   2.0855  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    8.0928     2.3893   3.387 0.000707 ***
## Sepal.Length   0.1294     0.2470   0.524 0.600247    
## Sepal.Width   -3.2128     0.6385  -5.032 4.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.95  on 149  degrees of freedom
## Residual deviance: 151.65  on 147  degrees of freedom
## AIC: 157.65
## 
## Number of Fisher Scoring iterations: 5
To get the significance of a logistic regression model, call sigr::wrapChiSqTest():

library(sigr)
(chi2Test <- wrapChiSqTest(model_glm))
## "Chi-Square Test summary: pseudo-R2=0.21 (X2(2,N=150)=39, p<1e-05)."
Notice that the fit summary also reports a pseudo-R-squared. You can extract the values directly off the sigr object, as well:
str(chi2Test)
## List of 10
##  $ test          : chr "Chi-Square test"
##  $ df.null       : int 149
##  $ df.residual   : int 147
##  $ null.deviance : num 191
##  $ deviance      : num 152
##  $ pseudoR2      : num 0.206
##  $ pValue        : num 2.92e-09
##  $ sig           : num 2.92e-09
##  $ delta_deviance: num 39.3
##  $ delta_df      : int 2
##  - attr(*, "class")= chr [1:2] "sigr_chisqtest" "sigr_statistic"
And of course you can render the sigr object into one of several formats (LaTeX, HTML, markdown, and ASCII) for direct inclusion in a report or publication.
render(chi2Test, format = "html")
Chi-Square Test summary: pseudo-R2=0.21 (χ2(2,N=150)=39, p<1e-05).
By the way, if you are interested, we give the explicit formula for calculating the significance of a logistic regression model in Practical Data Science with R.
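That calculation amounts to a chi-squared test on the drop in deviance; a base-R sketch (refitting the model so the snippet stands alone):

```r
iris$isVersicolor <- iris$Species == "versicolor"
model_glm <- glm(isVersicolor ~ Sepal.Length + Sepal.Width,
                 data = iris, family = binomial)

# the drop in deviance from null to fitted model is chi-squared
# distributed, with degrees of freedom equal to the number of
# coefficients added to the model
delta_deviance <- model_glm$null.deviance - model_glm$deviance
delta_df <- model_glm$df.null - model_glm$df.residual
pchisq(delta_deviance, delta_df, lower.tail = FALSE)
## approximately 2.92e-09, agreeing with sigr's reported pValue
```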
In my previous post, I showed how to use the cdata package along with ggplot2’s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot?
A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base plot, you would use the pairs() function. Here is the base version of the pairs plot of the iris data:
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species",
      pch = 21,
      bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])
There are other ways to do this, too:
# not run
library(ggplot2)
library(GGally)
ggpairs(iris, columns = 1:4, aes(color = Species)) +
  ggtitle("Anderson's Iris Data -- 3 species")

library(lattice)
splom(iris[1:4], groups = iris$Species,
      main = "Anderson's Iris Data -- 3 species")
But I wanted to see if cdata was up to the task. So here we go….
In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the iris data, like so:
I wanted a plot for petal dimensions and sepal dimensions, but I also felt that two plots took up too much space. So, I thought, why not make a faceted graph that shows both:
Except — which columns do I plot and what do I facet on?
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
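One way to answer this is to reshape the data so there is one plottable pair of columns plus a facet label. A sketch with cdata’s rowrecs_to_blocks() (the control-table layout and the names flower_part, Length, and Width are our own choices):

```r
library(cdata)
library(ggplot2)

# control table: one row per facet; Length and Width become the new
# plotting columns, filled from the named iris columns
controlTable <- data.frame(
  flower_part = c("Petal", "Sepal"),
  Length = c("Petal.Length", "Sepal.Length"),
  Width = c("Petal.Width", "Sepal.Width"),
  stringsAsFactors = FALSE)

# move iris from one row per flower to one row per (flower, part)
iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = "Species")

ggplot(iris_aug, aes(x = Length, y = Width, color = Species)) +
  geom_point() +
  facet_wrap(~flower_part, scales = "free") +
  ggtitle("Iris dimensions")
```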
We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R!
In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling in R.
We will use the lme4 package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The lme4 documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.
The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called random effects, a term that refers to the randomness in the probability model for the group-level coefficients….
The term fixed effects is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…
– Gelman and Hill 2007, Chapter 11.4
We will also restrict ourselves to the case that vtreat considers: partially pooled estimates of conditional group expectations, with no other predictors considered.
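As a preview of that case, here is a minimal sketch of partially pooled group means with lme4 (synthetic data; all names are our own):

```r
library(lme4)

set.seed(2018)
d <- data.frame(g = sample(letters[1:5], 200, replace = TRUE))
d$y <- ifelse(d$g == "a", 1, 0) + rnorm(200)

# a grand mean plus a per-group intercept; the per-group estimates
# are shrunk ("partially pooled") toward the grand mean
model <- lmer(y ~ 1 + (1 | g), data = d)
coef(model)$g  # partially pooled estimates of the group means
```

Compare these estimates to the raw group means (`aggregate(y ~ g, data = d, FUN = mean)`): the partially pooled values sit between each group’s raw mean and the overall mean, with more shrinkage for smaller groups.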