In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against
log10(income) rather than directly against income.
One obvious reason for not regressing directly against income is that (in our example) income is restricted to be non-negative, a restraint that linear regression can’t enforce. Other reasons include the wide distribution of values and the relative or multiplicative structure of errors on outcomes. A common practice in this situation is to use Poisson regression, or generalized linear regression with a log-link function. Like all generalized linear regressions, Poisson regression is unbiased and calibrated: it preserves the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
Regressing against the log of the outcome will not be calibrated; however it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be in terms of percentage of income, not in absolute dollar increments.
Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.
Continue reading Link Functions versus Data Transforms
Here is an absolutely horrible way to confuse yourself and get an inflated reported
R-squared on a simple linear regression model in
We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here. Continue reading An easy way to accidentally inflate reported R-squared in linear regression models
Suppose we have the task of predicting an outcome
y given a number of variables
v1,..,vk. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve explain-ability, even reduce over-fit, and improve the quality of the resulting model.
For some informative discussion on such issues please see the following:
In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.
Continue reading Variables can synergize, even in a linear model
In most of our data science teaching (including our book Practical Data Science with R) we emphasize the deliberately easy problem of “exchangeable prediction.” We define exchangeable prediction as: given a series of observations with two distinguished classes of variables/observations denoted “x”s (denoting control variables, independent variables, experimental variables, or predictor variables) and “y” (denoting an outcome variable, or dependent variable) then:
- Estimate an approximate functional relation
y ~ f(x).
- Apply that relation to new instances where
x is known and
y is not yet known.
An example of this would be to use measured characteristics of online shoppers to predict if they will purchase in the next month. Data more than a month old gives us a training set where both
y are known. Newer shoppers give us examples where only
x is currently known and it would presumably be of some value to estimate
y or estimate the probability of different
y values. The problem is philosophically “easy” in the sense we are not attempting inference (estimating unknown parameters that are not later exposed to us) and we are not extrapolating (making predictions about situations that are out of the range of our training data). All we are doing is essentially generalizing memorization: if somebody who shares characteristics of recent buyers shows up, predict they are likely to buy. We repeat: we are not forecasting or “predicting the future” as we are not modeling how many high-value prospects will show up, just assigning scores to the prospects that do show up.
The reliability of such a scheme rests on the concept of exchangeability. If the future individuals we are asked to score are exchangeable with those we had access to during model construction then we expect to be able to make useful predictions. How we construct the model (and how to ensure we indeed find a good one) is the core of machine learning. We can bring in any big name machine learning method (deep learning, support vector machines, random forests, decision trees, regression, nearest neighbors, conditional random fields, and so-on) but the legitimacy of the technique pretty much stands on some variation of the idea of exchangeability.
One effect antithetical to exchangeability is “concept drift.” Concept drift is when the meanings and distributions of variables or relations between variables changes over time. Concept drift is a killer: if the relations available to you during training are thought not to hold during later application then you should not expect to build a useful model. This one of the hard lessons that statistics tries so hard to quantify and teach.
We know that you should always prefer fixing your experimental design over trying a mechanical correction (which can go wrong). And there are no doubt “name brand” procedures for dealing with concept drift. However, data science and machine learning practitioners are at heart tinkerers. We ask: can we (to a limited extent) attempt to directly correct for concept drift? This article demonstrates a simple correction applied to a deliberately simple artificial example.
Image: Wikipedia: Elgin watchmaker
Continue reading Can we try to make an adjustment?
What is the Gauss-Markov theorem?
From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:
A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.
This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.
Suppose you are building a model of the form:
y(i) = B . x(i) + e(i)
B is a vector (to be inferred),
i is an index that runs over the available data (say
x(i) is a per-example vector of features, and
y(i) is the scalar quantity to be modeled. Only
y(i) are observed. The
e(i) term is the un-modeled component of
y(i) and you typically hope that the
e(i) can be thought of unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak/strong assumptions you put on the
e(i) (and other quantities) depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of
B under weak assumptions.
How to interpret the theorem
The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the
e(i) and without distributional assumptions about the
x(i). However, if you are using Bayesian methods or generative models for predictions you may want to use additional stronger conditions (perhaps even normality of errors and even distributional assumptions on the
We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.
Continue reading Reading the Gauss-Markov theorem
What is meant by regression modeling?
Linear Regression is one of the most common statistical modeling techniques. It is very powerful, important, and (at first glance) easy to teach. However, because it is such a broad topic it can be a minefield for teaching and discussion. It is common for angry experts to accuse writers of carelessness, ignorance, malice and stupidity. If the type of regression the expert reader is expecting doesn’t match the one the writer is discussing then the writer is assumed to be ill-informed. The writer is especially vulnerable to experts when writing for non-experts. In such writing the expert finds nothing new (as they already know the topic) and is free to criticize any accommodation or adaption made for the intended non-expert audience. We argue that many of the corrections are not so much evidence of wrong ideas but more due a lack of empathy for the necessary informality necessary in concise writing. You can only define so much in a given space, and once you write too much you confuse and intimidate a beginning audience. Continue reading What is meant by regression modeling?
When finishing Worry about correctness and repeatability, not p-values I got to thinking a bit more about what can you actually check when reading a paper, especially when you don’t have access to the raw data. Some of the fellow scientists I admire most have a knack for back of the envelope calculations and dimensional analysis style calculations. They could always read a few facts off a presentation that the presenter may not have meant to share. There is a joy in calculation and figuring, so I decided it would be a fun challenge to see if you could check any of the claims of “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et. al. BMJ 2008;337:a439 from just the summary tables supplied in the paper itself. Continue reading Checking claims in published statistics papers
The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.
— William Cleveland, The Elements of Graphing Data, Chapter 2
In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.
I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.
Continue reading My Favorite Graphs
What is R2? In the context of predictive models (usually linear regression), where y is the true outcome, and f is the model’s prediction, the definition that I see most often is:
In words, R2 is a measure of how much of the variance in y is explained by the model, f.
Under “general conditions”, as Wikipedia says, R2 is also the square of the correlation (correlation written as a “p” or “rho”) between the actual and predicted outcomes:
I prefer the “squared correlation” definition, as it gets more directly at what is usually my primary concern: prediction. If R2 is close to one, then the model’s predictions mirror true outcome, tightly. If R2 is low, then either the model does not mirror true outcome, or it only mirrors it loosely: a “cloud” that — hopefully — is oriented in the right direction. Of course, looking at the graph always helps:
The question we will address here is : how do you get from R2 to correlation?
Continue reading Correlation and R-Squared