An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure which variables are the true causes or predictors of the phenomenon you are interested in, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures and, more importantly, they can lead to overfit models that generalize poorly on new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data even when that data has no signal. We showed some examples of this previously in our “Bad Bayes” blog post.
In this latest “Statistics as it should be” article, we will look at a heuristic to help determine which of your input variables have signal.
Thought experiment: What does a pure noise variable look like?
To recognize if a variable is predictive of the outcome you are interested in, it helps to know what it looks like when a variable is completely independent of the outcome. Suppose you have a variable x that you believe predicts an outcome y — in our examples, y will be a binary outcome, TRUE or FALSE, but the idea works for general x and y.
x1 y1
x2 y2
x3 y3
…
We can make y independent of x by scrambling (or permuting) the order of x and y relative to each other:
x1 y4
x2 y7
x3 y1
…
Now fit a one variable model (say a logistic regression model) to this new no-signal data set, and compare the model’s training performance (say its deviance) to the training performance of a logistic regression model fit to the true data (x, y). If x truly has signal (as you hope), then the model on the unpermuted data should perform much better (have lower deviance) than the model fit to the no-signal (permuted) data. If x actually has no signal, then both models should look about the same.
Now do the permutation step over and over again. You will have built up a distribution of models built on no-signal data sets that “look like” the original data — that is, the data sets are the same size as the original data, and they have the same distributions of x values and of y values as your original data, just not in the same pairings. You also have the distribution of how they perform. If x truly has signal, then the model built on the real (x, y) should perform much better than the other models — its deviance should be to the left of (lower than) the “hump” of deviances of the other models. If x does not predict y, then its deviance will probably sit somewhere within the range of deviances of the other models.
Let’s see what this looks like in R.
library(ggplot2)

# return the deviance scores of models fit on permuted copies of the data
permutation_test = function(dataf, ycol, nperm) {
  nrows = dim(dataf)[1]
  y = dataf[[ycol]]
  X = dataf[, setdiff(colnames(dataf), ycol), drop=FALSE]
  varnames = colnames(X)
  fmla = paste("y", paste(varnames, collapse=" + "), sep=" ~ ")
  deviances <- numeric(nperm)
  for(i in seq_len(nperm)) {
    # random order of rows
    ord = sample.int(nrows, size=nrows, replace=FALSE)
    model = glm(fmla, data=cbind(y=y[ord], X),
                family=binomial(link="logit"))
    # print(summary(model))
    deviances[[i]] = model$deviance
  }
  deviances
}

score_variable = function(dframe, ycol, var, nperm, title='') {
  df = data.frame(y=dframe[[ycol]], x=dframe[[var]])
  mod = glm("y~x", data=df, family=binomial(link="logit"))
  vdev = mod$deviance
  vperm = permutation_test(df, "y", nperm)
  # count how many permuted-data deviances are as small as (or smaller than) vdev
  num = sum(vperm <= vdev)
  vscore = num/nperm
  print(ggplot(data.frame(nullperm=vperm), aes(x=nullperm)) +
          geom_density() +
          geom_vline(xintercept=vdev, color='red') +
          ggtitle(paste(title, "left tail area ~", vscore)))
}
Now we build a small data set with one predictive variable and one noise variable, and compare the performance of the two variables.
set.seed(3266)
N = 1000
s1 = rnorm(N)
n1 = rnorm(N)
y = 2*s1 + rnorm(N)
dframe = data.frame(y=y>0, s1=s1, n1=n1)
nperm = 500

# First, the model on the signaling variable
score_variable(dframe, "y", "s1", nperm, title='Signal variable deviance,')
The one-variable model built from the variable with signal has a deviance far smaller than its companion no-signal data sets.
score_variable(dframe, "y", "n1", nperm, title='Noise variable deviance,')
The one-variable model built from the no-signal variable has a deviance that sits right in the middle of the distribution of no-signal data sets: about 36% of the permuted data sets produced models with training deviances lower than the model on the real data. So we have plausible evidence that this variable does not provide any signal about the outcome, and therefore isn't useful.
This is what "no signal" looks like: almost usable.
This permutation test technique can be used not just for deviance, but for any metric, like accuracy, precision or recall (for classifiers), or squared error (for regression on continuous-valued outcomes). You can even, in principle, use permutation tests to evaluate whether an entire model -- not just a single variable -- is extracting useful signal from the data. The idea is the same: if your model's accuracy, or variance, or whatever, falls within the distribution of the performance of models built on permuted (no-signal) data, then the original model is not extracting meaningful, generalizable concepts from the data. The permutation test, in this situation, is directly measuring if your model is statistically significantly different from a family of uninformative models. For shorthand, we will call this "the significance" of the model. Note that we want the significance value (the area under the left tail in the figures above) to be small.
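As a rough illustration of this whole-model version of the idea, here is a sketch on simulated data. The data set, the choice of training accuracy as the metric, and the permutation count are our own illustrative choices, not taken from the examples above:

```r
# Permutation test of a whole model, using training accuracy as the metric.
set.seed(52)
N <- 200
d <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
d$y <- (d$x1 + rnorm(N)) > 0   # only x1 carries signal

accuracy <- function(y, pred) mean(y == (pred > 0.5))

model <- glm(y ~ x1 + x2, data = d, family = binomial(link = "logit"))
true_acc <- accuracy(d$y, predict(model, type = "response"))

# Accuracies of the same modeling procedure on permuted (no-signal) outcomes
perm_acc <- replicate(200, {
  yperm <- sample(d$y)
  m <- glm(yperm ~ x1 + x2, data = d, family = binomial(link = "logit"))
  accuracy(yperm, predict(m, type = "response"))
})

# Fraction of no-signal models that did at least as well as the real one;
# a small value means the model is extracting real signal.
sig <- mean(perm_acc >= true_acc)
print(sig)
```

Because x1 genuinely drives the outcome, the real model's training accuracy should sit well above the permuted distribution, giving a significance near zero.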
There is a caveat here: this technique won't work on modeling algorithms that memorize their training data, like random forest (Kohavi noticed the same problem with cross-validation and bootstrap). Random forests regularly fit to perfect accuracy on training data, no matter what, so there isn't a meaningful distribution to compare against (although one can evaluate a random forest model with a permutation test of deviance). In any case, one doesn't usually have to worry about calculating the significance of a full model in a data science (data-rich) situation: if you want to determine if a full model is overfitting its training data, it's better to do that with hold-out data. So we will stick to discussing variable evaluation.
From thought experiment to practical suggestion: chi-squared and F tests
When you have very many variables, permutation tests to check which of the variables have signal can get computationally intensive. Fortunately, there are "closed-form" statistics you can use to estimate the significance of your variables (or to be precise, the significance of the one-variable models built from your variables). Let's stick to the example of predicting a binary outcome using logistic regression. You can determine the significance of a logistic regression model by looking at the difference between the model's deviance on the training data and the null deviance (the deviance of the best constant model: the mean of y). In R parlance, if your glm model is called model, then you want to look at delta_deviance = model$null.deviance - model$deviance. If there is no signal in the data, then this quantity is distributed as a chi-squared distribution with degrees of freedom equal to the difference of the degrees of freedom of the two models (delta_deviance will have one degree of freedom for every numerical input to the model, and k-1 degrees of freedom for every k-level categorical variable). The area under the right tail of this chi-squared distribution is the probability that no-signal data would produce a delta_deviance as large as what you observed. This is the significance value of model.
# get the significance of a glm model
get_significance = function(model) {
  delta_deviance = model$null.deviance - model$deviance
  df = model$df.null - model$df.residual
  pchisq(delta_deviance, df, lower.tail=FALSE)
}
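Applied to the simulated s1/n1 data from earlier (get_significance is repeated here so the snippet is self-contained):

```r
# Chi-squared significance of one-variable logistic models,
# on the signal variable s1 and the noise variable n1 from the earlier example.
get_significance <- function(model) {
  delta_deviance <- model$null.deviance - model$deviance
  df <- model$df.null - model$df.residual
  pchisq(delta_deviance, df, lower.tail = FALSE)
}

set.seed(3266)
N <- 1000
s1 <- rnorm(N)
n1 <- rnorm(N)
y <- (2 * s1 + rnorm(N)) > 0

sig_s1 <- get_significance(glm(y ~ s1, family = binomial(link = "logit")))
sig_n1 <- get_significance(glm(y ~ n1, family = binomial(link = "logit")))

print(sig_s1)  # vanishingly small: s1 clearly has signal
print(sig_n1)  # well above any reasonable threshold: consistent with noise
```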
For the example that we used above, the signal variable had a chi-squared significance of 1.9e-163, compared to 0 estimated from the permutation test; the no-signal variable had a chi-squared significance of 0.386, compared to 0.356 estimated from the permutation test.
For linear regression, there is a similar statistic, called the F-statistic, to determine model significance; in R, both the F-statistic and the corresponding model significance (called its p-value) are given in the summary of a linear regression (lm()) model.
So we can heuristically determine which variables have signal for a classification model by looking for variables with small significance value (as estimated by the chi-squared distribution), and for a regression model by looking for variables with small significance value (as estimated by the F distribution). We define "small" by picking a threshold, and accepting variables whose significance value is smaller than that threshold. Most modeling algorithms can handle the presence of a few noise variables, so it's better to pick a somewhat high threshold to err on the side of accepting useless variables, rather than losing useful ones.
Let's look at a small example. We'll generate a dataframe of 1000 rows, with ten input variables: five with signal (g-variables) and five without (n-variables). The variables are either continuous-valued (gn_x or nn_x) variables with expected mean zero and unit standard deviation; or categorical variables with three levels (gc_x$a, gc_x$b, and so on), uniformly distributed. The g-variables are additively related to the outcome, with random coefficients. The outcome y is again binary. We'll use a threshold of 0.05. The code to generate the data set and score the variables can be found here.
The graph below shows the variable scores (significances). The threshold is shown as the dashed red line; the variables that fell below the threshold are shown in green.
## [1] "Variables selected:"
##    var       scores
## 2 gn_2 3.909215e-99
## 3 gn_3 3.561903e-09
## 4 gc_1 3.906063e-06
## 5 gc_2 4.325259e-06
## 8 nn_3 1.459305e-02
## [1] "True coefficients of signal variables"
print(coefs)
## $gn_1
## [1] 0.03406483
##
## $gn_2
## [1] -1.457936
##
## $gn_3
## [1] 0.4138757
##
## $gc_1
## $gc_1$a
## [1] -0.504306
##
## $gc_1$b
## [1] -0.2303952
##
## $gc_1$c
## [1] -1.174718
##
##
## $gc_2
## $gc_2$a
## [1] -0.2770224
##
## $gc_2$b
## [1] -0.1947211
##
## $gc_2$c
## [1] 0.2442891
In this example we picked four of the g-variables and one of the n-variables. As you can see from the coefficients above, the g-variable that we missed, gn_1, had a very small coefficient magnitude, far smaller than the unit-magnitude noise in the outcome, so its signal was weak.
Picking the Threshold
We'll use a larger example to illustrate picking the threshold. This data set has five signal variables and 2000 noise variables, with 2500 rows. We'll consider three thresholds: 0.01 (or 1/100), 0.025 (or 1/40), and 0.05 (or 1/20). This time, all the g-variables have appreciable coefficients except gn_2.
## $gn_1
## [1] 0.8907755
##
## $gn_2
## [1] 0.09630513
##
## $gn_3
## [1] 1.072262
##
## $gc_1
## $gc_1$a
## [1] 1.371897
##
## $gc_1$b
## [1] -0.8606408
##
## $gc_1$c
## [1] -0.4705882
##
##
## $gc_2
## $gc_2$a
## [1] -2.172888
##
## $gc_2$b
## [1] -0.9766713
##
## $gc_2$c
## [1] 0.9012387
Here are the counts of variables selected by each threshold, along with the variables selected by the strictest threshold, 0.01:
## [1] "Variables selected, threshold = 0.01"
##         var        scores
## 1      gn_1  4.744091e-40
## 3      gn_3  4.426174e-72
## 4      gc_1  3.754517e-64
## 5      gc_2 2.838399e-117
## 95    nn_90  7.278843e-03
## 353  nn_348  2.702906e-03
## 470  nn_465  3.625489e-03
## 617  nn_612  8.157911e-03
## 833  nn_828  7.904690e-03
## 1006   nc_1  2.983575e-03
## 1265 nc_260  6.512229e-03
## 1290 nc_285  9.559572e-03
## 1370 nc_365  5.091275e-03
## 1490 nc_485  7.549149e-03
## 1606 nc_601  2.689314e-03
## 1650 nc_645  5.622456e-03
## 1912 nc_907  9.193532e-03
The threshold you select indicates how much error you are willing to tolerate -- you can think of it as the false positive rate. A threshold of 0.01 is a 1 in 100 false positive rate, so you would expect from 2000 noise variables to select around 20. In this case, we got lucky and selected only 13. A threshold of 0.025 is 1 in 40, so from 2000 noise variables we'd expect around 50 false positives; we picked 42. A threshold of 0.05 is 1 in 20, so from 2000 noise variables we expect about 100 false positives; in this case, we got 98.
This indicates that when you are winnowing down a very large number of variables, if you expect that most of them are noise, then you want to use a stricter threshold.
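The expected false-positive counts above are just binomial means; for k pure-noise variables and threshold t, you expect about k*t selections, with a standard deviation of sqrt(k*t*(1-t)). A quick check (the counts match the discussion above):

```r
# Expected number of noise variables passing each threshold,
# for k = 2000 pure-noise variables.
k <- 2000
for (t in c(0.01, 0.025, 0.05)) {
  expected <- k * t
  sdev <- sqrt(k * t * (1 - t))
  cat(sprintf("threshold %5.3f: expect about %3.0f +/- %.1f false positives\n",
              t, expected, sdev))
}
```

The observed counts (13, 42, and 98) are all within about two standard deviations of these expectations.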
We can also try fitting a random forest model to this example, once with all the variables, and once with the 17 variables selected by the threshold 0.01. Both models get perfect performance on training. The full model got an AUC of 0.87 on holdout data; the reduced model got an AUC of 0.93. This is clearly a somewhat labored example -- you don't want to fit a model with almost as many columns as there are rows of data -- but it does demonstrate that variable filtering can improve the model. You can see the code for the example here (bottom of the page).
Some Additional Points
Use the model significance to threshold the variables, not to sort them. This follows from point 1. Just because variable 1 has a smaller significance value than variable 2 does not mean that variable 1 is more useful than variable 2.
This heuristic is based on logistic and linear models, but it isn't restricted to logistic/linear regression modeling. Once you've selected the variables, you can use any modeling procedure to fit a model: random forest, gradient boosting, SVM, whatever you like. The primary assumption of this heuristic is that a useful variable has a signal that could be detected by a linear or logistic model, which seems like a reasonable supposition.
Because you are scoring each variable by treating it as an input to a one variable linear/logistic model, you need to take care when handling variables that are in reality the output of a submodel for the outcome. An example is using impact coding to manage categorical variables with a very large number of levels. Impact coding builds a Bayesian model for the outcome from a single categorical variable, then uses the output of that submodel as a numerical input to the larger model, instead of the original variable. Because the impact-coded variable is now numerical, it will appear to have one degree of freedom. This is true only if the impact-coded model was built from data distinct from the data you are using to score the variables. If you use the same data to build the impact coding as you are using to score the variables, then the impact coding can potentially memorize the scoring data; this means the impact-coded model in reality has more than one degree of freedom (k-1, where k is the number of levels of the original variable). If you build the impact coding without looking at the scoring data, then it can't memorize the scoring data, so it's safe to assume the impact-coded variable has only one degree of freedom.
From this it follows that if you are going to impact-code, or otherwise build submodels that also predict the outcome from your variables, the submodels should be fit from a separate calibration dataset, not the training set that you will use to score the variables and to fit the full model. This is a good idea not just because of the variable scoring, but because using the same data to fit the submodels and the primary model can introduce undesirable bias into the primary model fitting. See the note at the end of this post (and this post) for more discussion.
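Here is a minimal sketch of the calibration-split idea. The impact code below is a simple per-level logit difference; it is illustrative only (not vtreat's actual implementation), and all data and variable names are made up:

```r
set.seed(25)
N <- 1000
d <- data.frame(cat = sample(letters[1:10], N, replace = TRUE),
                stringsAsFactors = FALSE)
d$y <- (rnorm(N) + ifelse(d$cat %in% c("a", "b"), 1, 0)) > 0

# Split the data: the calibration set builds the impact code,
# the training set consumes it as a numeric variable.
is_cal <- seq_len(N) <= N / 2
cal <- d[is_cal, ]
train <- d[!is_cal, ]

# Impact code: per-level outcome logit, relative to the grand outcome logit,
# estimated only on calibration data.
logit <- function(p) log(p / (1 - p))
grand <- mean(cal$y)
level_rates <- tapply(cal$y, cal$cat, mean)
impact <- logit(level_rates) - logit(grand)

# Apply to training data; levels unseen in calibration fall back to 0.
train$cat_impact <- impact[as.character(train$cat)]
train$cat_impact[is.na(train$cat_impact)] <- 0

# The impact-coded variable now enters the model as one numeric input.
model <- glm(y ~ cat_impact, data = train, family = binomial(link = "logit"))
```

Because the coding was fit only on the calibration half, it cannot memorize the training rows, and treating cat_impact as a one-degree-of-freedom input is defensible.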
A More Realistic Example
In the above examples, we had good variables with reasonably strong signals and noise variables with no signal at all. In reality, there will additionally be variables with signal so weak as to be useless -- but signal nonetheless. Depending on the threshold you pick, such variables can still have acceptably small significance values. Is this heuristic still useful in those situations?
We tried this approach on the 2009 KDD Cup dataset: data from about 50,000 credit card accounts. Our goal is to predict churn (account cancellation). The raw data consists of 234 anonymized inputs, both numerical and categorical. Many of the variables are sparsely populated, and there were a few categorical variables with a large number of levels. We used our vtreat package to clean the data, in particular to deal with NAs and large categoricals. This inflated the number of input columns to 448, all numerical and all clean. As recommended in point 4 above, we split the data into three sets: one for data treatment (vtreat), one for model fitting, and one for test. You can see the code for this experiment here.
For comparison, here is the performance (ROC curve and AUC) of a logistic regression model built on all 448 variables (AUC=0.69 on test compared to 0.74 on training):
And here is the performance of a gradient boosting model also built on all 448 variables (AUC=0.71 on test compared to 0.72 on training):
Here's the distribution of variable scores, on a logarithmic (base 10) scale:
The shape of the distribution suggests a mixture of different types of variables, with varying signal strengths. The rightmost population (score greater than 0.05 or so) likely has no signal; the next few populations potentially have low signal. The shape of the graph suggests that about 3e-5 is a natural cutoff; we loosened this a bit and used a threshold of 10e-4. This reduced the number of variables down to 87. Here are a few of them:
## [1] "Var6_isBAD"
## [2] "Var7_clean"
## [3] "Var7_isBAD"
## [4] "Var13_clean"
## [5] "Var13_isBAD"
## [6] "Var21_isBAD"
## [7] "Var22_isBAD"
## [8] "Var25_isBAD"
## [9] "Var28_isBAD"
## [10] "Var35_isBAD"
## [11] "Var38_isBAD"
## [12] "Var44_isBAD"
## [13] "Var65_clean"
## [14] "Var65_isBAD"
## [15] "Var72_clean"
## [16] "Var73_clean"
...
## [68] "Var218_lev_x.cJvF"
## [69] "Var218_lev_x.UYBR"
## [70] "Var218_catB"
## [71] "Var221_lev_x.d0EEeJi"
## [72] "Var221_lev_x.oslk"
The _clean variables are numeric (with bad values like NA and Inf converted to zeros); the isBAD variables are indicator variables created by vtreat to mark unpopulated or otherwise NA fields in the corresponding variable. The catB variables are impact-coded, and the _lev_ variables are indicator variables for specific levels of the corresponding categorical variable.

What is interesting is the number of isBAD variables (without their corresponding clean component) in the final set. This indicates that for these variables the signal is contained in whether or not the field was populated, rather than in the actual value.
Here's logistic regression's performance on the reduced variable set (AUC=0.71 on test vs. 0.73 in training):
And gradient boosting (AUC=0.71 in both training and test).
Results
Winnowing down the variables didn't improve model performance much for logistic regression, and not at all for gradient boosting, which suggests that the gradient boosting algorithm does a pretty good job of variable selection on its own. However, the variable filtering reduced the run time for gradient boosting by almost a factor of five (from 7 seconds to 1.5), and that in itself is of value.
Takeaways
We've demonstrated a heuristic for determining whether or not an input variable has signal. We derived our heuristic by empirical exploration (the permutation test), and then noticed that there is an existing standard statistical test (the chi-squared test for logistic regression model significance) that gives us the measure that we want. This is a good general practice: pick your statistical test based on what you want to measure, or what you are trying to defend against, and only then settle on a procedure.
You can find the code for these examples as R markdown, along with html of the runs, here.
I know, an unapologetic “apologia” opens me and my work to more criticism (or “two for flinching”). But this is what it can be like for a non-statistician to work with or in front of some statisticians (in fact many statisticians, but none of the big ones). I suspect, from speaking with other data scientists, it is not just me.
I apologize now for displaying a thin skin.
I have made errors when speaking and writing about statistics. This shouldn’t come as a shock as there isn’t anybody who hasn’t made errors. I have no doubt made more errors in writing this article.
But what has been a bit disturbing is that many times, instead of being merely corrected (when I am wrong) or asked for clarification (when I may be right), I am publicly accused of being stupid, ignorant, or willfully disseminating falsehoods. I find this disappointing (and I come from a field, theoretical computer science, where if they get excited during your whiteboard presentation they may grab the marker out of your hand to contribute). It would seem graduate schools are not finishing schools.
I’ll admit, I am not a statistician. Some statisticians take that to mean I am ignorant and uneducated in issues of probability. In fact I am very interested in the theory of probability, and come to probability through a fairly long path. I was trained in mathematics and theoretical computer science before becoming a data science carpetbagger. Probability-relevant topics I have studied include:
I am not trying to say I know a lot. What I am trying to say is: I probably know enough to understand the basis of a statistical concept. If you try to correct me politely, I may be capable of understanding your point and learn from you. My terminology may differ from statistical canon, as I may have learned a common concept in a different field (where it may in fact have a different common name).
I am going to confess a few of my sins. And as I am writing to a technical audience, I will allow myself some technical examples.
I wrote sum_{i=1...n} (x[i]-mean(x))^2/n for variance. Well, not quite. It was in R and I wrote (on page 155 of Practical Data Science with R):
d <- data.frame(y=c(2,2,3,3,3))
m <- lm(y~1,data=d)
df <- nrow(d)-length(coefficients(m))
sqrt(sum((residuals(m)^2)/df))
Superficially, there is no “-1” in there. The reason I wrote this code is I was trying to teach the exact calculation needed to reproduce the “Residual standard error” line found in R’s summary(m) (one of the goals of Practical Data Science with R was to de-mystify summary.lm() by documenting and showing how to calculate every element of summary.lm()).
Now there are a few points in my defense.

Reasonable sources differ on whether to divide by n or n-1. To explain summary(m) I would have to pick the one that matches the existing summary(m).

If n is the number of rows in our data frame, then the quantity we are dividing by is in fact n-1 (as we have length(coefficients(m))==1). I am guessing those correcting me divide by n-1 because they have been told “population variance: divide by n; sample variance: divide by n-1” and refuse to accept that Bessel’s correction can be arrived at as an attempt to correct for the number of modeling parameters.

Or, on page 171 (actually by my co-author): “Null model has (number of data points – 1) degrees of freedom.” This is from code reproducing the elements of summary.glm() (actually my co-author went a bit further to include the chi-squared statistic, which is one standard significance-of-fit statistic for summary.glm() – which doesn’t supply any such statistic in the default implementation).
Yet, both of us have received feedback on these sections saying that we have no idea how to compute a sample variance.
Now I am not saying we are perfect. There may (or may not) be places we have written “S/n” instead of “S/(n-1)”. What I am saying is: even this wouldn’t indicate we didn’t understand the distinction. But when corrected I am never accused of mere carelessness (which I would in fact like to apologize for), but of willful ignorance. The aggressive correction is almost always “clearly you don’t know you need to divide by n-1” instead of “wouldn’t dividing by n-1 be better?”
In practice things are not as simple as always mechanically applying the one true method. We are always searching for better estimates (less bias, lower variance, weaker assumptions, or easier calculation). I guess if you are used to teaching statistics to out-of-major beginning students you develop a strongly binary and didactic feeling of right (the procedure you actually taught in the course) and wrong (anything else the students write, often in fact gibberish). I am not trying to be inclusive or think relatively here; it is just that correcting people is a bit harder than one would think (so it pays to be polite when attempting it).
For example: is Sn = sum_{i=1...n} (x[i]-mean(x))^2/n in fact a wrong estimate of variance? It clearly is good enough as n gets large. I would also argue it is not in fact wrong. Oh, it is biased (tends to be too low on average), and Sn1 = sum_{i=1...n} (x[i]-mean(x))^2/(n-1) is unbiased. But bias is not the only concern we might be trading off. We know Sn is a maximum likelihood estimate (desirable for its own reasons) and also a lower variance estimate (or more statistically efficient) than Sn1.
[ By Cochran’s theorem we know that Sn is distributed as S chi-sq(n-1)/n and Sn1 as S chi-sq(n-1)/(n-1) (where S is the unknown true variance). But the chi-sq(n-1) distribution has mean n-1 and variance 2n-2. So Sn1-S is mean 0, variance 2 S^2/(n-1), and Sn-S is mean -S/n, variance (2n-2) S^2/n^2. Thus in moving from Sn to Sn1 we removed S/n units of bias in exchange for roughly 4 S^2/n^2 units of additional variance. For large enough n this is not an obvious good trade. ]
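The trade is easy to check by simulation (a sketch; the sample size n=10 and the repetition count are arbitrary choices):

```r
# Compare the divide-by-n estimate (Sn) and the divide-by-(n-1) estimate (Sn1)
# of the variance of N(0, 1) data, whose true variance is 1.
set.seed(1234)
n <- 10
reps <- 20000
samples <- matrix(rnorm(n * reps), nrow = reps)

ss <- apply(samples, 1, function(x) sum((x - mean(x))^2))
Sn <- ss / n
Sn1 <- ss / (n - 1)

print(mean(Sn))   # about (n-1)/n = 0.9: biased low
print(mean(Sn1))  # about 1: unbiased
print(var(Sn))    # smaller than var(Sn1): the price paid for removing bias
print(var(Sn1))
```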
In writing about linear regression I used the word “heteroscedastic” wrong. The issue is you are heteroscedastic “if there are sub-populations that have different variabilities from others” (Wikipedia, heteroscedasticity). Based on my readings of the history of probability (the problems von Mises ran into with controlling selections of sub-populations, and the importance of exchangeability to de Finetti’s formulation of probability) I would propose altering the definition to: “computationally identifiable sub-populations (either through explicit or omitted variables).”
However, I (in error) wrote in a footnote: “Heteroscedastic errors are errors whose magnitude is correlated with the quantity to be predicted.”
There are in fact at least four things wrong with what I wrote. For one: if y[i] = b.x[i] + e[i], then of course the error terms are correlated with the ys, as they are in there! What I was thinking is: we have a problem if the error e is correlated with the signal or systematic portion of y: b.x, or the xs. For another: correlation does not capture all forms of dependence. For the (x,y) pairs {(-1,2),(0,1),(1,2)} we can write y as a function of x, but there is no linear correlation.

I truly wish I could condense good statistical advice down as efficiently as problems pile up (when you don’t have space for a lot of caveats, and worked examples to indicate intent and meaning). I’ve tried to address this in errata, but still regret my error. I also regret bringing the issue up at all, given I didn’t have enough space to lay out enough context and caveats. Really what I wanted to discuss is when it is safe to apply linear regression (the topic of our next section).
I said you need to check that the residuals are normal when using linear regression (and hopefully I never accidentally wrote you need the dependent or independent variables to be normal, as that is way too restrictive). This is open to criticism.
However, there is a bit of gamesmanship going on here: you can pretty much criticize any position one takes on normality of errors.
The actual case is there is a lot of disagreement on what are the convenient assumptions needed for reliable linear regression (see here, here, here, and here). You can find well regarded statisticians on just about any side of this: depending on the context (are they modeling, teaching, or proving theorems). Just don’t have an opinion either way if you are a non-statistician.
The minimal assumptions do not in fact seem to include normality of residuals, xs, or ys. However, minimal may not be the same as convenient in all contexts. For example: in teaching you might invoke normality to give a concrete property and simpler special-case proofs (which then go on to be mis-quoted out of context as general-case necessities). Or: if for domain-specific reasons you know the errors should be normal (such as expected measurement errors when estimating the orbits of celestial bodies), then it pays to check if the residuals are normal (as, if the residuals are not normal, they must include things other than the known-to-be-normal errors).
There is also the issue of what do you mean by “linear regression”? If you are just going to fit the model and use it without looking at the diagnostics, you may need fewer assumptions than somebody who (wisely) decides to look at the diagnostics. What additional assumptions you need depend on what tests you run. For example if you run an F-test on the goodness of fit you are then “sensitive to non-normality“, but this is non-normality of the sampling distribution (which can itself be normal even with non-normal residuals under fairly mild assumptions, if you have enough data). However, some derivations of the F-test assume normality of errors (the strong assumption we would rather not have to make; for example result 7.2.1 on page 220 here). Similarly the t-tests on coefficient significance have some distributional assumptions- which can either be satisfied by strong assumptions on the residuals, or weak assumptions on the residuals and a lot of data (making the sampling distribution well behaved).
Frankly, a reason it is hard to state reasonable assumptions for regression is there doesn’t seem to be a community agreed upon good primary source for what are considered the strongest derivations (hence with weakest assumptions) of the F-test (and for the chi-square or log-likelihood tests for logistic regression). A clear sign of what the community considers the best proof method would be a great boon (for instance in mathematics collecting proofs by style is considered very important as it hints what holds as we vary assumptions and domain; example Proofs of Fermat’s little theorem). Most sources either neglect the F-test, merely use the F-test, derive it using strong assumptions of normality, derive it in a non-regression setting (single variable contingency tables, or single variable ANOVA), or derive it only for single regression (which doesn’t show how the number of parameters adjusts the degrees of freedom). With the right reference, identifying the right conditions is a simple matter of proof analysis. I have found derivations of the F-test (including its ANOVA history), but if they are all considered “bad” any conclusions derived from them would be hard to defend. So I’ll say it: I don’t know a good reference for the derivation of the F-test (especially one that answers if residual normality is considered a standard requirement and one that deals directly with multiple regression), and I would appreciate a recommendation.
We are going to keep writing and keep teaching on statistical topics. We choose and arrange topics to reflect what we have found to be the most important issues in practice. We do try to make everything correct, but the order we address things is determined by historic project impact and not traditional teaching order. Just because we haven’t addressed a topic yet doesn’t mean we don’t know about it, it is just that something more urgent may have cut in front. Because of this we feel our work (book, course, blog) are some of the best ways to build data science mastery.
In 2009 Hal Varian, chief economist at Google, famously said:
“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.”
(NYT 2009).
There is some concern that computer science style data science is stealing some of the thunder. Mostly I agree with Professor Matloff: statistics has a lot to offer that is needlessly being missed. Though I also wonder if we are both encroaching on a field that has been historically owned by operations research.
I myself am considered a relatively statistically friendly data scientist who often criticizes data science in statistical terms (for example my talk at the recent Data Science Summit: “Statistics in the age of data science, issues you can and can not ignore.”). I repeat: for all my faults I am one of the data scientists who has a friendlier (better than, say, median) relationship with statisticians.
That being said: I’ll weigh in on the “why is no statistical vision triumphant?” Data science is rising not just because of data engineering. It is also rising because the business environment is largely collaborative and requires effective consulting skills and behaviors. Collaborating in a positive way with others (valuing their work, even while improving their working style) is decisive. Live and teach that and you win. Yes, I am saying software engineering is somewhat socialized (has long dealt with issues of training, project management, agile practices, working with uncertainty, recording history, and remote collaboration). I am also saying: for any of us (including myself) self improvement and collaboration are always quicker ways to get ahead than trying to pull down others (as there are too many others for tearing down others to be an effective strategy).
I want more statisticians on my data science teams. We can even call it a statistical applications team, as long as that doesn’t get me kicked off the team.
“I was wrong” Sisters of Mercy
So I was wrong, I was wrong to ever doubt, I can get along without
In this latest “R as it is” (again in collaboration with our friends at Revolution Analytics) we will quickly become expert at efficiently accumulating results in R.
A number of applications (most notably simulation) require the incremental accumulation of results prior to processing. For our example, suppose we want to collect rows of data one by one into a data frame. Take the mkRow
function below as a simple example source that yields a row of data each time we call it.
mkRow <- function(nCol) {
x <- as.list(rnorm(nCol))
# make row mixed types by changing first column to string
x[[1]] <- ifelse(x[[1]]>0,'pos','neg')
names(x) <- paste('x',seq_len(nCol),sep='.')
x
}
The obvious “for
-loop” solution is to collect or accumulate many rows into a data frame by repeated application of rbind
. This looks like the following function.
mkFrameForLoop <- function(nRow,nCol) {
d <- c()
for(i in seq_len(nRow)) {
ri <- mkRow(nCol)
di <- data.frame(ri,
stringsAsFactors=FALSE)
d <- rbind(d,di)
}
d
}
This would be the solution most familiar to many non-R programmers. The problem is: in R the above code is incredibly slow.
In R, most common objects are immutable and cannot be changed in place. So when you write an assignment like “d <- rbind(d,di)” you are usually not actually adding a row to an existing data frame, but constructing a new data frame that has one additional row. The new data frame then replaces the old one in your current execution environment (R execution environments are mutable, which is how such changes are implemented). This means that to accumulate or add n rows incrementally to a data frame, as in mkFrameForLoop, we actually build n different data frames of sizes 1, 2, ..., n. Since each row of each intermediate data frame gets copied (in R, data frame columns can potentially be shared, but rows cannot), we pay the cost of processing n*(n+1)/2 rows of data. So no matter how expensive creating each row is, for large enough n the time wasted re-allocating rows (again and again) during the repeated rbinds eventually dominates the calculation time; for large enough n you spend most of your time in the repeated rbind steps.
To repeat: it isn’t just that accumulating rows one by one is “a bit less efficient than the right way for R.” Accumulating rows one by one becomes arbitrarily slower than the right way (which only needs to manipulate n rows to collect n rows into a single data frame) as n gets large. Note: it isn’t that beginning R programmers don’t know what they are doing; they are designing to the reasonable expectation that data frames are row-oriented and R objects are mutable. In fact R data frames are column-oriented and R structures are largely immutable (despite the syntax appearing to signal the opposite), so the optimal design is not what one might expect.
Given this, how does anyone ever get real work done in R? The answer is: by avoiding the incremental for-loop accumulation seen in mkFrameForLoop. The most elegant way to avoid the problem is to use R’s lapply (or list apply) function as shown below:
mkFrameList <- function(nRow,nCol) {
d <- lapply(seq_len(nRow),function(i) {
ri <- mkRow(nCol)
data.frame(ri,
stringsAsFactors=FALSE)
})
do.call(rbind,d)
}
What we did is take the contents of the for-loop body and wrap them in a function. This function is passed to lapply, which creates a list of rows. We then batch-apply rbind to these rows using do.call. It isn’t that the for-loop is slow (as many R users mistakenly believe); it is the incremental collection of results into a data frame that is slow, and that is one of the steps the lapply method avoids. While you may prefer lapply to for-loops on stylistic grounds, it is important to understand when lapply is in fact quantitatively better than a for-loop (and to know when a for-loop is acceptable). In fact a for-loop with a better binder such as data.table::rbindlist is among the fastest variations we have seen (as suggested by Arun Srinivasan in the comments below; another top contender is the file-based split-apply-combine method suggested in comments by David Hood, an idea also seen in map-reduce).
If you don’t want to learn about lapply
you can write fast code by collecting the rows in a list as below.
mkFrameForList <- function(nRow,nCol) {
d <- as.list(seq_len(nRow))
for(i in seq_len(nRow)) {
ri <- mkRow(nCol)
di <- data.frame(ri,
stringsAsFactors=FALSE)
d[[i]] <- di
}
do.call(rbind,d)
}
The above code still uses a familiar for-loop notation and is in fact fast. Below is a comparison of the time (in milliseconds) for each of the above methods to assemble data frames of various sizes. The quadratic cost of the first method is seen in the slight upward curvature of its smoothing line. Again, to make this method truly fast, replace do.call(rbind,d) with data.table::rbindlist(d) (examples here).
Execution time (ms) for collecting a number of rows (x-axis) for each of the three methods discussed. Slowest is the incremental for-loop accumulation.
The reason mkFrameForList is tolerable is that in some situations R can avoid creating new objects and instead manipulate data in place. In this case the list “d” is not re-created each time we add an additional element, but is mutated or changed in place.
(edit) The common advice is to prefer in-place edits. We tried that, but it wasn’t until we threw out the data frame class attribute (after getting feedback in the comments below) that we got really fast code. The code and latest run are below (but definitely check out the comments following this article for the reasoning chain).
mkFrameInPlace <- function(nRow,nCol,classHack=TRUE) {
r1 <- mkRow(nCol)
d <- data.frame(r1,
stringsAsFactors=FALSE)
if(nRow>1) {
d <- d[rep.int(1,nRow),]
if(classHack) {
# lose data.frame class for a while
# changes what S3 methods implement
# assignment.
d <- as.list(d)
}
for(i in seq.int(2,nRow,1)) {
ri <- mkRow(nCol)
for(j in seq_len(nCol)) {
d[[j]][i] <- ri[[j]]
}
}
}
if(classHack) {
d <- data.frame(d,stringsAsFactors=FALSE)
}
d
}
Note that the in-place list of vectors method is faster than any of lapply/do.call(rbind)
, dplyr::bind_rows/replicate
, or plyr::ldply
. This is despite having nested for-loops (one for rows, one for columns; though this is also why methods of this type can speed up even more if we use compiler::cmpfun
). At this point you should see: it isn’t the for-loops that are the problem, it is any sort of incremental allocation, re-allocation, and checking.
At this point we are avoiding both the complexity waste (running an algorithm that takes time proportional to the square of the number of rows) and avoiding a lot of linear waste (re-allocation, type-checking, and name matching).
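The preallocate-then-fill pattern itself is the same in any language; a minimal Python sketch of the idea (illustrative only, since Python lists are already mutable):

```python
# Allocate the full container once, then write each slot in place,
# mirroring the d[rep.int(1,nRow),] preallocation step in mkFrameInPlace.
def fill_preallocated(n):
    out = [None] * n     # one allocation up front
    for i in range(n):
        out[i] = i * i   # in-place write; no per-step reallocation
    return out

assert fill_preallocated(5) == [0, 1, 4, 9, 16]
```

The point is that the total work is proportional to n, not n*(n+1)/2.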
However, any in-place change (without which the above code would again be unacceptably slow) depends critically on the list value associated with “d
” having very limited visibility. Even copying this value to another variable or passing it to another function can break the visibility heuristic and cause arbitrarily expensive object copying.
The fragility of the visibility heuristic is best illustrated with an even simpler example.
Consider the following code that returns a vector of the squares of the first n
positive integers.
computeSquares <- function(n,messUpVisibility) {
# pre-allocate v
# (doesn't actually help!)
v <- 1:n
if(messUpVisibility) {
vLast <- v
}
# print details of v
.Internal(inspect(v))
for(i in 1:n) {
v[[i]] <- i^2
if(messUpVisibility) {
vLast <- v
}
# print details of v
.Internal(inspect(v))
}
v
}
Now of course part of the grace of R is we never would have to write such a function. We could do this very fast using vector notation such as seq_len(n)^2
. But let us work with this notional example.
Below is the result of running computeSquares(5,FALSE)
. In particular look at the lines printed by the .Internal(inspect(v))
statements and at the first field of these lines (which is the address of the value “v
” refers to).
computeSquares(5,FALSE)
## @7fdf0f2b07b8 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,3,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,9,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,9,16,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,9,16,25
## [1] 1 4 9 16 25
Notice that the address v
refers to changes only once (when the value type changes from integer to real). After the one change the address remains constant (@7fdf0e2ba740
) and the code runs fast as each pass of the for-loop alters a single position in the value referred to by v
without any object copying.
Now look what happens if we re-run with messUpVisibility
:
computeSquares(5,TRUE)
## @7fdf0ec410e0 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0d9718e0 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0d971bb8 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,3,4,5
## @7fdf0d971c88 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,9,4,5
## @7fdf0d978608 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,9,16,5
## @7fdf0d9788e0 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,9,16,25
## [1] 1 4 9 16 25
Setting messUpVisibility
causes the value referenced by “v
” to also be referenced by a new variable named “vLast
“. Evidently this small change is enough to break the visibility heuristic as we see the address of the value “v
” refers to changes after each update, meaning we have triggered a lot of expensive object copying. So we should consider the earlier for-loop code a bit fragile, as small changes in object visibility can greatly change performance.
The thing to remember is: for the most part R objects are immutable. So code that appears to alter R objects is often actually simulating mutation by expensive copies. This is the concrete reason that functional transforms (like lapply
) should be preferred to incremental or imperative appends.
R code for all examples in this article can be found here (this includes methods like pre-reserving space, and the original vector experiments that originally indicated the object mutation effect).
Usually when I am working with text, my goals are a bit loftier than messing with individual characters. However, it is a step you have to get right. So you would like working correctly with arbitrary characters to be as easy as possible, as any problems here are mere distractions from your actual goals.
The other day I thought it would be nice to get a list of all the article titles and URLs from the Win-Vector blog. It is a low-ambition task that should be easy to do. At some point I thought: I’ll just scan the XML export file; it has all of the information in a structured form. And the obvious Python program to do this fails out with:
xml.etree.ElementTree.ParseError:
not well-formed (invalid token): line 27758, column 487
Why is that? The reason is WordPress wrote a document with a suffix “.xml” and a header directive of “<?xml version="1.0" encoding="UTF-8" ?>
” that is not in fact valid utf-8 encoded XML. Oh, it looks like modern XML (bloated beyond belief and full of complicated namespaces referring to URIs that get mis-used as concrete URLs). But unless your reader is bug-for-bug compatible with the one WordPress uses, you can’t read the file. Heck, I am not even sure WordPress can read the file back in; I’ve never tried it to confirm. This is the world you get with “fit to finish” or code written in the expectation of downstream fixes due to mis-readings of Postel’s law.
So the encoding is not in fact XML over utf-8, but some variation of wtf-8. Clearly something downstream can’t handle some character or character encoding. We would like to at least process what we have (and not abort or truncate).
Luckily there is a Python library called unidecode which will let us map exotic (at least to Americans) characters to Latin analogues (allowing us to render Erdős as Erdos instead of the even worse Erds). The Python3 code is here:
# Python3, read wv.xml (Wordpres export)
# write to stdout title/url lines
import random
import codecs
import unidecode
import xml.etree.ElementTree
import string
# read WordPress export
with codecs.open('wv.xml', 'r', 'utf-8') as f:
dat = f.read()
# WordPress export full of bad characters
dat = unidecode.unidecode(dat)
dat = ''.join([str(char) for char in dat if char in string.printable])
namespaces = {'wp': "http://wordpress.org/export/1.2/"}
root = xml.etree.ElementTree.fromstring(dat.encode('utf-8'))
items = [ item.find('title').text + " " + item.find('link').text \
          for item in root.iter('item') \
          if item.find('wp:post_type',namespaces).text=='post' ]
random.shuffle(items)
for item in items:
    print(item)
It only took two extra lines to work around the parse problem (the unidecode.unidecode()
followed by the filter down to string.printable
). But such a simple work around depends on not actually having to represent your data in a completely faithful and reversible manner (often the case for analysis, hence my strong TSV proposal; but almost never the case in storage and presentation).
Also, it takes a bit of searching to find even this work-around, and it is distracting to have to worry about such things when you are in the middle of doing something else. The fix was only quick because we used a pragmatic language like Python, where somebody supplied a library to demote characters to something usable (not exactly ideologically pure). Imagine having to find which framework in Java (as mere libraries tend to be beneath Java architects) might actually supply a function simply performing a useful task.
How did text get so complicated? There are some essential difficulties, but many of the problems are inessential stumbling blocks due to architecting without use cases, and the usual consequence of committees.
A good design works from a number of explicitly stated use cases. Historically strings have been used to: store and transmit data, search (for example with regular expressions), sort or collate, compare for equivalence, and transform (such as case conversion).
Unicode/UTF tends to be amazingly weak at all of these. Search by regular expressions is notoriously weak over Unicode (try to even define a subset of Unicode that is an alphabet, versus other graphemes). With so many alternative ways to represent things you can forget human-readable collating/sorting, or normal forms strong enough to support any reasonable notion of comparison/equivalence. It appears to be a research question (or even a political science question) whether you can reliably convert a Unicode string to upper case.
I accept: a concept of characters and character encoding rich enough to support non-latin languages is going to be more effort than ASCII was. Manipulating strings may no longer be as simple as working over individual bytes.
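Python's standard unicodedata module makes the one-character/many-encodings problem concrete: the precomposed and combining forms of 'é' are different code point sequences until you normalize both sides.

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as one code point (LATIN SMALL LETTER E WITH ACUTE)
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT
assert precomposed != combining                   # naive comparison fails
assert (len(precomposed), len(combining)) == (1, 2)
# Equality testing requires agreeing on a normal form first:
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```

So even "are these two strings equal?" already requires a normalization pass.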
We would expect the standards to come with useful advice and reference implementations.
But what actually happens is:
Perhaps if we could break Unicode’s back with enough complexity it would die and something else could evolve to occupy the niche. My own trollish proposal would be along the following lines.
Pile on one bad idea too many and make string comparison as hard as a general instance of the Post correspondence problem and thus undecidable. Unicode/utf-8 is not there yet (due to unambiguity, reading direction, and bounded length), but I keep hoping.
The idea is that many characters in Unicode have more than one equivalent representation (and these can have different lengths; examples include combining characters versus precomposed characters). So, roughly, checking whether two sequences of code points represent the same string becomes the following grouping problem:
For a sequence of integers 1…t define an “ordered sequence partition” P as a sequence of contiguous sets of integers (P1,…,Ph) such that: each Pi is a non-empty set of consecutive integers, the Pi are disjoint, their union is {1,…,t}, and every element of Pi is less than every element of Pi+1 (for i=1,…,h-1).
For two sequences of code points a1,a2,…,am and b1,b2,…,bn checking “string equivalence” is therefore checking whether the sequences of integers 1…m and 1…n can be ordered-sequence-partitioned into A=(A1,…,Au) and B=(B1,…,Bu) such that: for i=1…u the sequences of code points a_{Ai} and b_{Bi} are all valid and equivalent Unicode characters.
What stops this from encoding generally hard problems is the lack of ambiguity in the code-point-to-character dictionaries, which ensures there is only one partition of each code point sequence in which every group forms a valid character. Thus we can find the unique partition of each code point sequence with a left-to-right read, and then we just check whether the partitions match.
So all we have to do to successfully encode hard problems is trick the standards committee into introducing a sufficient number of ambiguous groupings (things like code-points “a1 a2 a3 a4” such that both “(a1 a2 a3)” and “(a1 a2)” are valid groupings). This will kill the obvious comparison algorithms, and with some luck we get a code dictionary that will allow us to encode NP-hard problems as string equivalence.
To get undecidable problems we just have to trick the committee into introducing a bad idea I’ll call “fix insertions.” We will say “a1 a2 a3 a4” can be grouped into “(a1 x1 a2) (a3 a4 x2 x3)” by the insertion of the implied or “fix” code-points x1, x2, x3. Then, with some luck, we could build a code dictionary that could encode general instances of Post correspondence problems and make Unicode string comparison Turing complete (and thus undecidable).
So I think all we need is some clever design (to actually get a dangerously expressive encoding, not just the suspicion there is one) and to stand up a stooge linguist or anthropologist to claim a few additional “harmless lexical equivalences and elisions” (such as leaving vowels out of written script) are needed to faithfully represent a few more languages.
Okay, the last section was a joke (and not even a good joke). Let’s look at what I would really want if text encoding were still on the table.
Unicode is attempting to put everything in one container and thus has become essentially a multimedia format (like HTML). There is no “Unicode light” where you only need to solve the processing problems of one or two languages to get your work done. Unicode is all or nothing, you have to be able to represent everything to represent anything. Frankly I’d like to see a more modular approach where nesting and containment are separate from string/character encoding. A text could be represented as a container of multiple string segments where each segment is encoded in a single named limited capability codebook. Things like including a true Hungarian name in english text would be done at the container level, and not at the string/character level.
We have to, as computer scientists, show more discipline in what we do not allow into standards and designs. As Edsger W. Dijkstra wrote, we must:
… educate a generation of programmers with a much lower threshold for their tolerance of complexity …
Complexity tends to synergize multiplicatively (not merely additively). And real world systems already have enough essential difficulty and complexity, so we can not afford a lot more unnecessary extra complexity.
Illustration: Boris Artzybasheff
photo: James Vaughan, some rights reserved
The Example Problem
Recall that you have a mobile app with both free (A) and paid (B) actions; if a customer’s tasks involve too many paid actions, they will abandon the app. Your goal is to detect when a customer is in a state where they are likely to abandon, and offer them (perhaps through an in-app ad) a more economical alternative, for example a “Pro User” subscription that allows them to do what they are currently doing at a lower rate. You don’t want to be too aggressive about showing customers this ad, because showing it to someone who doesn’t need the subscription service is likely to antagonize them (and convince them to stop using your app).
You want to build a model that predicts whether a customer will abandon the app (“exit”) within seven days. Your training set is a set of 648 customers who were present on a specific reference day (“day 0”); their activity on day 0 and the ten days previous to that (days 1 through 10), and how many days until each customer exited (Inf
for customers who never exit), counting from day 0. For each day, you constructed all possible windows within those ten days, and counted the relative rates of A events and B events in each window. This gives you 132 features per row. You also have a hold-out set of 660 customers, with the same structure. You can download the wide data set used for these examples as an .rData
file here. The explanation of the variable names is in the previous post in this series.
In the previous installment, we built a regularized (ridge) logistic regression model over all 132 features. This model didn’t perform too badly, but in general there is more danger of overfitting when working with very wide data sets; in addition, it is quite expensive to analyze a large number of variables with standard implementations of logistic regression. In this installment, we will look for potentially more robust and less expensive ways of analyzing this data.
The Ideal Case
Ideally you would know some appropriate window lengths, from understanding of the domain. For instance, if you knew that the trend towards abandonment manifested itself over the course of a month, then weekly or twice-a-week aggregations might be all you need. But perhaps you aren’t entirely sure what the appropriate aggregation windows are. Is there any way of teasing them out?
Greedy Forward Stepwise Regression
One way to find the best features is to pick them one at a time: find the one-variable model that optimizes some model quality function, then add another variable that, combined with the first, again optimizes model quality, and so on, until the model “stops improving.” I’ve put that in quotes because in general, one stops when the incremental improvement is smaller than some threshold. Because many standard model quality metrics, like R-squared, squared error, or deviance, tend to improve as the number of parameters increases (potentially leading to bias and overfit), standard stepwise regression uses criteria like the AIC or BIC, which attempt to compensate for the complexity of the model. Here (for pedagogical purposes) we will step by hand rather than use R’s step()
function; we will simply minimize deviance and use an ad-hoc procedure for picking an appropriate number of variables.
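The loop itself is easy to state abstractly. Here is a minimal sketch (in Python, with a hypothetical score() function standing in for model deviance; lower is better) of greedy forward selection:

```python
# Greedy forward selection: repeatedly add the candidate variable that
# most reduces score(selected); stop when relative improvement is small.
def forward_select(all_vars, score, min_improve=1e-6):
    selected = []
    best = score(selected)
    while True:
        candidates = [v for v in all_vars if v not in selected]
        if not candidates:
            break
        new_best, new_var = min((score(selected + [v]), v) for v in candidates)
        if 1 - new_best / best <= min_improve:
            break            # model "stopped improving"
        selected.append(new_var)
        best = new_best
    return selected

# Toy score: variable 'a' explains a lot, 'b' a little, 'c' nothing;
# each added variable also pays a small complexity penalty.
value = {'a': 5.0, 'b': 3.0, 'c': 0.0}
score = lambda s: 10.0 - sum(value[v] for v in s) + 0.1 * len(s)
assert forward_select(['a', 'b', 'c'], score) == ['a', 'b']
```

The R implementation below follows the same shape, with the ridge-regression deviance playing the role of score().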
As before, we’ll use L2-regularized logistic regression as the base model.
library(glmnet)

# stepwise ridge regression: add one more variable
# to existing model
#
# xframe: data frame of independent variables
# y: vector of dependent variable
# current_vars: variables in the current model
# current_dev: deviance of current model
# candidate_vars: variables to be potentially added
#                 to model
# Returns:
#   new set of current_vars
#   new current_dev
#   improvement from previous model
add_var = function(xframe, y, current_vars, current_dev,
                   candidate_vars) {
  best_dev = current_dev
  newvar = NULL
  for(var in candidate_vars) {
    active = c(current_vars, var)
    xf = xframe[,active]
    if(length(active) > 1) {
      model = glmnet(as.matrix(xf), y, alpha=0, lambda=0.001,
                     family="binomial")
    } else {
      # glmnet requires > 1 variable
      model = glm.fit(xframe[,active], y,
                      family=binomial(link="logit"))
    }
    moddev = deviance(model)
    if(moddev < best_dev) {
      newvar = var
      best_dev = moddev
    }
  }
  improvement = 1 - (best_dev/current_dev)
  list(current_vars = c(current_vars, newvar),
       current_dev = best_dev,
       improvement = improvement)
}

# stepwise ridge regression: entire loop
#
# data: training data frame
# vars: variables to consider
# yVar: name of dependent variable
# min_improve: terminate when model
#              improvement is less than this
#
# returns final set of variables,
# along with improvements and deviances
stepwise_ridge = function(data, vars, yVar, min_improve=1e-6) {
  current_vars = c()
  candidate_vars = vars
  devs = numeric(length(vars))
  improvement = numeric(length(vars))
  current_dev = null_deviance(data[[yVar]])
  do_continue = TRUE
  while(do_continue) {
    iter = add_var(data, data[[yVar]], current_vars, current_dev,
                   candidate_vars)
    current_vars = iter$current_vars
    current_dev = iter$current_dev
    count = length(current_vars)
    devs[count] = current_dev
    improvement[count] = iter$improvement
    candidate_vars = setdiff(vars, current_vars)
    do_continue = (length(candidate_vars) > 0) &&
      (iter$improvement > min_improve)
  }
  list(current_vars = current_vars,
       deviances = devs,
       improvement = improvement)
}

# load vars (names of vars), yVar (name of y column),
# dTrainS, dTestS
load("wideData.rData")

# number of candidate variables
length(vars)
## [1] 132

# fix the Infs in the training data
# shouldn't be many of them
isInf = dTrainS$daysToX == Inf
maxfinite = max(dTrainS$daysToX[!isInf])
dTrainS$daysToX[isInf] = maxfinite

# null deviance:
# the deviance of the mean value
# of the y variable
null_deviance(dTrainS[[yVar]])
## [1] 892.3776

# model using all variables
allvar_model = ridge_model(dTrainS[,vars], dTrainS[[yVar]])

# the deviance of the model
# with all variables
deviance(allvar_model)
## [1] 722.1471

# greedy forward stepwise regression
modelparams = stepwise_ridge(dTrainS, vars, yVar)
current_vars = modelparams$current_vars
devs = modelparams$deviances
improvement = modelparams$improvement

# number of variables selected
length(current_vars)
## [1] 27

final_model = ridge_model(dTrainS[,current_vars], dTrainS[[yVar]])
final_model$deviance
## [1] 722.1666

current_vars[1:7]
## "B_1_1" "A_0_0" "B_6_0"
## "B_7_2" "B_9_3" "A_1_0" "A_5_5"
We can reduce the number of variables from 132 to 27 without substantially increasing the training deviance (recall that large deviance is bad).
If we look at the first few selected variables, we see that the model looks at the rate of B events occurring “yesterday” (B_1_1
) and compares it with the rate of B events over sliding windows of 6-7 days from today, yesterday, and the day before yesterday. It also looks at the rate of A events from today and yesterday (and 5 days ago). Recall that in this simulated data a customer’s rates of A and B actions stay constant until they switch to the “at-risk” state, at which time their rate of B actions increases to a new constant (see the previous installment) — in other words, there is an edge after which the customer’s B rate is notably higher. Given that knowledge (which we of course wouldn’t have in a real data situation), comparing the current B rate with running averages from the last few days makes sense.
So by simply stepping through the variables that we generated through naive sessionization, we can reduce the number of features to a more tractable number. In fact, we suspect that we can decrease the number of variables even more. Let’s look at how deviance changed as we added variables.
The top plot is deviance as a function of the number of variables; the bottom plot is the improvement from the previous model (kind of the “derivative of the deviance”). After about ten variables, the model improvement leveled off. It’s a folk theorem that, when looking at graphs like these (model quality as a function of a parameter), the optimal value for the parameter occurs at the “elbow” of the model quality graph, or alternatively at either the maximum or the elbow of the improvement graph. Which point is the elbow of this deviance graph is a fuzzy question; the improvement graph is easier to read. The maximum is at 2 variables; the elbow is at 4. There’s an argument to be made for 6 variables, too, so let’s look at all of these models, this time on hold-out data.
# more reduced models
final2_model = ridge_model(dTrainS[,current_vars[1:2]], dTrainS[[yVar]])
final4_model = ridge_model(dTrainS[,current_vars[1:4]], dTrainS[[yVar]])
final6_model = ridge_model(dTrainS[,current_vars[1:6]], dTrainS[[yVar]])

# Compare all the (non-trivial) models on holdout data
# See https://github.com/WinVector/SessionExample/blob/master/NarrowChurnModel.Rmd
# for the evaluate() function code
rbind(evaluate(allvar_model, dTestS, dTestS[[yVar]],
               "all variables"),
      evaluate(final_model, dTestS, dTestS[[yVar]],
               "stepwise run"),
      evaluate(final2_model, dTestS, dTestS[[yVar]],
               "best 2 variables"),
      evaluate(final4_model, dTestS, dTestS[[yVar]],
               "best 4 variables"),
      evaluate(final6_model, dTestS, dTestS[[yVar]],
               "best 6 variables"))

##              label deviance    recall precision  accuracy
## 1    all variables 756.6919 0.7752809 0.7360000 0.7287879
## 2     stepwise run 755.8788 0.7752809 0.7360000 0.7287879
## 3 best 2 variables 769.7035 0.7696629 0.7080103 0.7045455
## 4 best 4 variables 743.1230 0.7921348 0.7540107 0.7484848
## 5 best 6 variables 743.2160 0.7977528 0.7533156 0.7500000
The four-variable model dominates all the others on the hold-out data on deviance and precision, and isn’t too far behind the six-variable model on recall and accuracy. This indicates that the model using all the variables was slightly overfitting, as was even the model with 27 variables. For domain reasons, you still might prefer to use the six-variable model: I would feel more comfortable using three running-average measurements instead of two, and I like having more A rate information in the model. The performance difference between the two models is slight, and 6 variables is still far, far fewer than 132.
Note that we could also have used n-fold cross validation to select the best number of variables.
Discussion
This approach isn’t perfect. You still have to generate all the naive sessionization features, and you still have to run through them all, multiple times. However, if M is the number of naive sessionization features, and M is large, then fitting M*k small logistic regression models (where k < M) may still be less expensive than fitting one logistic regression model of size M. Also, if M is so large that you have trouble fitting it in memory (it can happen), you can simply generate each feature on the fly, as needed.
If you really want to, you can cut down the computation a little by not fitting a model to all the current variables at every step; instead, you can freeze the previous model and use its predictions as an offset to the next model (via the offset parameter). This means you are only fitting a single-variable model at every iteration. If you do this, it’s a good idea to do one last polishing step at the end, refitting all the selected variables at once.
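The offset mechanism is just the standard `offset` argument of R's `glm()`. Here is a minimal sketch with made-up data; the column `prev_score` is a hypothetical stand-in for the frozen model's link-scale predictions:

```r
# Sketch of one iteration of offset-based forward selection.
# 'prev_score' stands in for the frozen model's link-scale predictions;
# the data and names are illustrative, not from the post.
d <- data.frame(
  x = c(0.1, 0.5, 0.9, 1.3, 0.2, 0.8),
  y = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE),
  prev_score = c(-0.5, 0.0, 0.4, 0.8, -0.2, 0.3)
)
# Fit only the candidate variable; the frozen model enters as an offset,
# so its coefficients are not re-estimated.
fit <- glm(y ~ x, data = d, family = binomial, offset = prev_score)
# The residual deviance of 'fit' measures what the candidate variable
# adds on top of the frozen model.
fit$deviance
```

The smaller this residual deviance, the more the candidate variable improves on the frozen model, which is exactly the selection criterion described above.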
You can interpret freezing the previous model and using it as an offset to the next model as minimizing the “residual deviance” at every iteration. If this sounds familiar, it should: incrementally building up a model by minimizing residual deviance and iterating is one of the basic ideas behind gradient boosting, though the details are different, and gradient boosting usually boosts trees rather than single-variable models. So rather than trying the procedure I just described, why don’t we just try gradient boosting?
Gradient Boosting with Additive Models
We’ll use the gbm() function from the gbm package, with interaction.depth=1, since we didn’t use interactions in our logistic regression models.
library(gbm)

# wrapper function for prediction
gbm_predict_function = function(model, nTrees) {
  force(model)
  function(xframe) {
    predict(model, newdata=xframe, type='response', n.trees=nTrees)
  }
}

# wrapper function for fitting.
# Returns: a prediction function
#          variable influences
gbm_model = function(dframe, formula, weights=NULL) {
  if(is.null(weights)) {
    nrows = dim(dframe)[1]
    weights = numeric(nrows)+1  # all 1
  }
  # (the original listing is truncated here; the fit and tree-count
  # selection below are a reconstruction of what gbm_model() must do)
  modelGBM = gbm(formula, data=dframe, weights=weights,
                 distribution='bernoulli',
                 n.trees=500, interaction.depth=1, cv.folds=5)
  nTrees = gbm.perf(modelGBM, method='cv')  # cross-validated tree count
  list(predfun = gbm_predict_function(modelGBM, nTrees),
       influences = summary(modelGBM, n.trees=nTrees, plotit=FALSE))
}

# (reconstruction) fit the model on the training data
modelGBM = gbm_model(dTrainS[, c(vars, yVar)],
                     as.formula(paste(yVar, '~ .')))
The function gbm.perf() (as we’ve called it in gbm_model()) uses cross-validation to pick the optimal number of boosting iterations (trees):
The black curve shows model deviance on training data as a function of the number of iterations; the green curve shows model deviance on holdout. The algorithm selects the point where the holdout deviance begins to increase again: in this case, 83 trees. Since we have set the interaction depth to 1, this is essentially the number of variables in the model.
We can compare the resulting model to our (reduced) stepwise model.
# compare to the best ridge model
bestridge_model = final6_model
bestn = 6

rbind(evaluate(bestridge_model, dTestS, dTestS[[yVar]], "best stepwise model"),
      evaluate(modelGBM, dTestS, dTestS[[yVar]], "gbm model, interaction=1"))

##                      label deviance    recall precision  accuracy
## 1      best stepwise model 743.2160 0.7977528 0.7533156 0.7500000
## 2 gbm model, interaction=1 728.4487 0.7387640 0.7758112 0.7439394
The gradient boosting model has lower deviance on hold-out, so it’s predicting probabilities better. It’s also more precise, but has lower recall. Unfortunately, if you want to use the gbm model, you still have to use all the features as input, so you lose the variable reduction advantage, not only during model application, but during model fitting — this matters if you can’t get all the features into memory at once.
The summary of a gbm model returns the variable influences, which we can use as proxies for variable importance. So you can try the “elbow” trick on the graph of influence versus number of variables, then refit a model using only those variables. I won’t show the graph here, but I decided on 7 variables, not only because it appeared to be an elbow on the influence graph, but also because 7 is nearly the same number of variables as we used in our reduced logistic regression model. The resulting variables are different from those the stepwise procedure selected:
##         var   rel.inf
## B_3_0 B_3_0 25.610258
## B_2_0 B_2_0 16.911399
## B_4_0 B_4_0 14.006369
## A_2_0 A_2_0 12.537114
## A_0_0 A_0_0 11.648087
## A_1_0 A_1_0 11.206434
## B_1_0 B_1_0  8.080339
The performance of the reduced gbm model is similar to that of the full model, and also similar to the performance of the reduced logistic regression model. Again the gradient boosting models have better deviance, but inferior recall.
##                               label deviance    recall precision  accuracy
##                 best stepwise model 743.2160 0.7977528 0.7533156 0.7500000
##            gbm model, interaction=1 728.4487 0.7387640 0.7758112 0.7439394
##   gbm model with best gbm variables 720.9506 0.7303371 0.7784431 0.7424242
Recall that in addition to accurate classification, you want the model to identify about-to-exit customers early enough for you to intervene with them. So you also want to compare the reduced stepwise and gradient boosted models for the timeliness of their predictions. Here, we show the distribution of days to exit for all customers who exited within 7 days in the hold out set (shown as the green bars), along with how many of those customers each model identified (shown as the points with stems).
Both models did a good job identifying customers who will exit “today” or “tomorrow” (perhaps too soon for you to intervene with them), but the stepwise regression model did a little better at early identification of customers who will exit in three to seven days.
Takeaways
You can download the wide sessionized data sets that we used in this post here.
You can download an R markdown script showing all the steps we did in this post (and more) here.
Next:
We will continue to explain important steps in sessionization.
One notable exception is log data. Log data is a very thin data form where different facts about different individuals are written across many different rows. Converting log data into a ready-for-analysis form is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time and of picking a business-appropriate goal when evaluating predictive models.
For this article we are going to assume that we have sessionized our data by picking a concrete near-term goal (predicting cancellation of account or “exit” within the next 7 days) and that we have already selected variables for analysis (a number of time-lagged windows of recent log events of various types). We will use a simple model without variable selection as our first example. We will use these results to show how you examine and evaluate these types of models. In later articles we will discuss how you sessionize, how you choose examples, variable selection, and other key topics.
The Setup
One lesson of survival analysis is that it is a lot more practical to model the hazard function (the fraction of accounts terminating at a given date, conditioned on the account being active just prior to the date) than to directly model account lifetime or account survival. Knowing to re-state your question in terms of hazard is a big step (as is figuring out how to sessionize your data, how to define positive and negative instances, how to select variables, and how to evaluate a model). Let’s set up our example modeling situation.
Suppose you have a mobile app with both free (A) and paid (B) actions; if a customer’s tasks involve too many paid actions, they will abandon the app. Your goal is to detect when a customer is in a state when they are likely to abandon, and offer them (perhaps through an in-app ad) a more economical alternative, for example a “Pro User” subscription that allows them to do what they are currently doing at a lower rate. You don’t want to be too aggressive about showing customers this ad, because showing it to someone who doesn’t need the subscription service is likely to antagonize them (and convince them to stop using your app).
Suppose the idealized data is collected in a log-style form, like the following:
   dayIndex accountId eventType
1       101  act10000         A
2       101  act10000         A
3       101  act10000         A
4       101  act10000         A
5       101  act10000         A
6       101  act10000         A
7       101  act10000         A
8       101  act10000         A
9       101  act10000         A
10      101  act10003         B
11      101  act10003         A
12      101  act10003         A
13      101  act10003         A
14      101  act10003         A
15      101  act10003         A
16      101  act10003         A
17      101  act10003         A
18      101  act10012         B
For every customer, on every day (dayIndex, which we can think of as the date), we’ve recorded each action, and whether it’s A or B. In realistic data you’d likely have more information, for example exactly what the actions were, perhaps how much the customer paid per B action, and other details about customer history or demographics. But this simple case is enough for our discussion.
Even just analyzing data of this type raises several issues:
Ragged vs. uniform use of time when generating training examples
There are two ways to collect customers to use in the training set:
(1) pick a specific date, say one month ago, select a subset of your customer set from that day, and use those customers’ historical data (say, the last few months’ activity for those customers) as the training set. We’ll call this a uniform time training set.
(2) select a subset from the set of all your customers over all time (including some who may not currently be customers), and use their historical data as the training set. We’ll call this a ragged time training set.
The first method has the advantage that the training set exactly reflects how the model will be applied in practice: on a set of customers all on the same date. However, it limits the size of your training set, and if abandonment is very rare, then it limits the number of positive examples available for the modeling algorithm to learn from. The second method potentially allows you to build a larger training set (with more positive examples), but it has a number of pitfalls:
One corollary of these pitfalls is that even if you use a uniform training set, you should be prepared to retrain or otherwise update the model at a reasonable frequency, to account for concept drift.
You could consider using several uniform time sets: pick a date from last month, one from the month before, and so on. If the abandonment process changes slowly enough, this alleviates the data scarcity issue without affecting the prevalence of positive examples. You may still have issues with time trends in the variables, and you will have duplicated data: many customers from a month ago were also customers two months ago, and so can show up in the data twice. Depending on the domain and your goal, this may or may not matter. Also, you need to be careful that the same customer does not end up in both the training and test sets (see our article on structured test/train splits).
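A uniform time training set can be assembled from log data along these lines. This is a sketch on toy data; `uniform_training_logs` is a hypothetical helper, not code from the post:

```r
# Sketch: build a uniform time training set from log data
# (column names follow the log format used in this article;
#  the helper function is hypothetical).
logs <- data.frame(
  dayIndex  = c(98, 99, 100, 100, 99, 100, 95),
  accountId = c("a1", "a1", "a1", "a2", "a2", "a3", "a4"),
  eventType = c("A", "B", "A", "A", "B", "B", "A")
)
uniform_training_logs <- function(logs, refDay, historyDays) {
  # customers active on the reference day
  active <- unique(logs$accountId[logs$dayIndex == refDay])
  # their trailing history window
  subset(logs, accountId %in% active &
               dayIndex > refDay - historyDays & dayIndex <= refDay)
}
train_logs <- uniform_training_logs(logs, refDay = 100, historyDays = 10)
sort(unique(train_logs$accountId))  # "a1" "a2" "a3"; "a4" was not active on day 100
```

Every selected customer's history is anchored to the same reference day, which is what makes the training set "uniform time."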
Defining Positive Examples
What do you consider a positive example? A customer who will leave tomorrow, within the next week, or within the next year? Predicting abandonment from long range data is nice, but it’s also a noisier problem; someone who will leave a year from now probably looks today a lot like someone who won’t leave in a year. If minimizing false positives is a subgoal (as it is in our example problem), then you might not want to attempt predicting long-range. Hopefully the signals will be stronger the closer a customer gets to abandoning, but you also want to catch them while you still have time to do something about it.
Picking the Features
In this example, you suspect that customers abandon your app when they start to access paid features at too high a rate. But what’s too high a rate? Is that measured in absolute terms, or relative to their total app usage? And what’s the proper measurement window? You want to measure their usage rates over a window that’s not too noisy, but still detects relevant patterns in time for the information to be useful.
The Data
For this artificial example, we created a population of customers who initially begin in a “safe” state in which they generate events via two Poisson processes, with A events generated at ten times the rate of B events. Customers also have a 10% chance every day of switching to an “at risk” state, in which they begin to generate B events at five times the rate that they did in the “safe” state (they also generate A events at a reduced rate, so that their total activity rate stays constant). Once they are in the “at risk” state, they have a 20% chance each day of exiting (abandoning the app — recorded as state X).
To build a data set, we start with an initial customer population of 1500, let the simulation run for 100 days to “warm up” the population and get rid of boundary conditions, then collect data for 100 more days to form the data set. We also generate new customers every day via a Poisson process with an intensity of 100 customers per day. The expected time for a customer to go into “at risk” is ten days; once they are in the “at risk” state, they stay another five days (in expectation), giving an expected lifetime of fifteen days (of course in reality you wouldn’t know about the internal state changes of your customers). Note that by the way we’ve constructed the population, the lifetime process is in fact stationary and memoryless.
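The per-customer state process described above can be sketched as follows. This is a simplified illustration of the transition rates only (10% per day safe to at-risk, 20% per day at-risk to exit); it does not generate the A/B events themselves:

```r
# Sketch of one customer's daily state process, using the rates stated above:
# 10% daily chance of safe -> at risk; 20% daily chance of at risk -> exit (X).
simulate_customer_states <- function(maxDays = 200) {
  state <- "safe"
  states <- character(0)
  for (day in seq_len(maxDays)) {
    if (state == "safe" && runif(1) < 0.10) state <- "risk"
    else if (state == "risk" && runif(1) < 0.20) state <- "X"
    states <- c(states, state)
    if (state == "X") break   # exit is absorbing
  }
  states
}
set.seed(25)
states <- simulate_customer_states()
# expected lifetime is 1/0.1 + 1/0.2 = 15 days, though any one draw varies
table(states)
```

Because both transitions are constant-probability daily coin flips, the time in each state is geometric, which is why the lifetime process is stationary and memoryless as noted above.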
This is obviously much cleaner data than you would have in real life, but it’s enough to let us walk through the analysis process.
The Data Treatment
We chose a uniform time training set: a set of customers present on a reference day (“day 0”) and the ten days previous to that (days 1 through 10), and recorded how many days until each customer exited (Inf for customers who never exit), counting from day 0. The hold-out set has the same structure. We defined positive examples as those customers who would exit within seven days of day 0. Rather than guessing the appropriate sessionizing window length ahead of time, we constructed all possible windows within those ten days, and counted the relative rates of A events and B events in each window. This gave us data sets of approximately 650 rows (648 for training, 660 for hold-out) and 132 features; one row per customer, one feature per window. We’ll discuss how we created the wide data sets from the “skinny” log data in a future post; you can download the wide data set we used as an .rData file here.
The resulting data has the following columns:
colnames(dTrainS)

  [1] "accountId" "A_0_0"     "A_1_0"
  [4] "A_1_1"     "A_10_0"    "A_10_1"
  [7] "A_10_10"   "A_10_2"    "A_10_3"
 [10] "A_10_4"    "A_10_5"    "A_10_6"
 [13] "A_10_7"    "A_10_8"    "A_10_9"
 [16] "A_2_0"     "A_2_1"     "A_2_2"
 [19] "A_3_0"     "A_3_1"     "A_3_2"
 [22] "A_3_3"     "A_4_0"     "A_4_1"
 [25] "A_4_2"     "A_4_3"     "A_4_4"
 [28] "A_5_0"     "A_5_1"     "A_5_2"
 [31] "A_5_3"     "A_5_4"     "A_5_5"
 [34] "A_6_0"     "A_6_1"     "A_6_2"
 [37] "A_6_3"     "A_6_4"     "A_6_5"
 [40] "A_6_6"     "A_7_0"     "A_7_1"
 [43] "A_7_2"     "A_7_3"     "A_7_4"
 [46] "A_7_5"     "A_7_6"     "A_7_7"
 [49] "A_8_0"     "A_8_1"     "A_8_2"
 [52] "A_8_3"     "A_8_4"     "A_8_5"
 [55] "A_8_6"     "A_8_7"     "A_8_8"
 [58] "A_9_0"     "A_9_1"     "A_9_2"
 [61] "A_9_3"     "A_9_4"     "A_9_5"
 [64] "A_9_6"     "A_9_7"     "A_9_8"
 [67] "A_9_9"     "B_0_0"     "B_1_0"
 [70] "B_1_1"     "B_10_0"    "B_10_1"
 [73] "B_10_10"   "B_10_2"    "B_10_3"
 [76] "B_10_4"    "B_10_5"    "B_10_6"
 [79] "B_10_7"    "B_10_8"    "B_10_9"
 [82] "B_2_0"     "B_2_1"     "B_2_2"
 [85] "B_3_0"     "B_3_1"     "B_3_2"
 [88] "B_3_3"     "B_4_0"     "B_4_1"
 [91] "B_4_2"     "B_4_3"     "B_4_4"
 [94] "B_5_0"     "B_5_1"     "B_5_2"
 [97] "B_5_3"     "B_5_4"     "B_5_5"
[100] "B_6_0"     "B_6_1"     "B_6_2"
[103] "B_6_3"     "B_6_4"     "B_6_5"
[106] "B_6_6"     "B_7_0"     "B_7_1"
[109] "B_7_2"     "B_7_3"     "B_7_4"
[112] "B_7_5"     "B_7_6"     "B_7_7"
[115] "B_8_0"     "B_8_1"     "B_8_2"
[118] "B_8_3"     "B_8_4"     "B_8_5"
[121] "B_8_6"     "B_8_7"     "B_8_8"
[124] "B_9_0"     "B_9_1"     "B_9_2"
[127] "B_9_3"     "B_9_4"     "B_9_5"
[130] "B_9_6"     "B_9_7"     "B_9_8"
[133] "B_9_9"     "daysToX"   "defaultsSoon"
The feature columns are labeled by type of event (A or B), the first day of the window, and the last day of the window: so A_0_0 means “fraction of events that were A events today (day 0)”, and B_8_5 means “fraction of events that were B events from eight days back to five days back” (a window of length 4), and so on. The column daysToX is the number of days until the customer exits; defaultsSoon is true if daysToX <= 7.
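A single window feature can be computed along these lines. This is a sketch; `frac_A` is a hypothetical helper matching the A_&lt;firstday&gt;_&lt;lastday&gt; naming, and the toy events are made up:

```r
# Sketch: compute windowed A-fraction features for one customer,
# matching the A_<firstday>_<lastday> naming used above
# (the helper and the toy data are illustrative).
frac_A <- function(dayOffset, eventType, first, last) {
  # days-back offsets: 'first' is the further-back edge, so first >= last
  inWin <- dayOffset >= last & dayOffset <= first
  sum(eventType[inWin] == "A") / sum(inWin)
}
# one customer's recent events, as days-back offsets from day 0
dayOffset <- c(0, 0, 1, 1, 2, 3, 3)
eventType <- c("A", "B", "A", "A", "B", "A", "B")
frac_A(dayOffset, eventType, first = 0, last = 0)  # A_0_0: 1 of 2 events
frac_A(dayOffset, eventType, first = 3, last = 1)  # A_3_1: 3 of 5 events
```

Sweeping `first` and `last` over all pairs within the ten-day history produces the full set of window features, one column per (event type, window) pair.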
This naive sessionizing can quickly generate very wide data sets, especially if: there are more than two classes of events; if we want to consider wider windows; or if we have several types of log measurements that we want to aggregate and sessionize. You can imagine situations where you generate more features than you have datums (customers) in the training set. In future posts we will look at alternative approaches.
Modeling
Principled feature selection (or even better, principled feature generation) before modeling is a good idea, but for now let's just feed the sessionized data into regularized (ridge) logistic regression and see how well it can predict soon-to-exit customers.
library(glmnet)

# loads vars (names of vars), yVar (name of y column),
# dTrainS, dTestS
load("wideData.rData")

# assuming the xframe is entirely numeric
# if there are factor variables, use model.matrix
ridge_model = function(xframe, y, family="binomial") {
  model = glmnet(as.matrix(xframe), y,
                 alpha=0, lambda=0.001, family=family)
  list(coef = coef(model),
       deviance = deviance(model),
       predfun = ridge_predict_function(model))
}

# assuming xframe is entirely numeric
ridge_predict_function = function(model) {
  # to get around the 'unfulfilled promise' leak. blech.
  force(model)
  function(xframe) {
    as.numeric(predict(model, newx=as.matrix(xframe), type="response"))
  }
}

model = ridge_model(dTrainS[,vars], dTrainS[[yVar]])
testpred = model$predfun(dTestS[,vars])
dTestS$pred = testpred
Evaluating the Model
You can plot the distribution of model scores on the holdout data as a function of class label:
The model mostly separates about-to-exit customers from the others, although far from perfectly (the AUC of this model is 0.78). To evaluate whether this model is good enough, you should take into account how the output of the model is to be used. You can use the model as a classifier, by picking a threshold score (say 0.5) to sort the customers into "about to exit" and not. In this case, look at the confusion matrix:
dTestS$predictedToLeave = dTestS$pred > 0.5

# confusion matrix
cmat = table(pred=dTestS$predictedToLeave, actual=dTestS[[yVar]])
cmat

##        actual
## pred    FALSE TRUE
##   FALSE   205   80
##   TRUE     99  276

recall = cmat[2,2]/sum(cmat[,2])
recall

## [1] 0.7752809

precision = cmat[2,2]/sum(cmat[2,])
precision

## [1] 0.736
The model found 78% of the about-to-exit customers in the holdout set; of the customers identified as about-to-exit, about 74% actually did exit within seven days (so about 26% of the flagged customers were false alarms).
Alternatively you could use the model to prioritize your customers with respect to who should see in-app ads that encourage them to consider a subscription service. The improvement you can get by using the model score to prioritize ad placement is summarized in the gain curve:
If you sort your customers by model score (decreasing), then the blue curve shows what fraction of about-to-leave customers you will reach, as a fraction of the number of customers you target based on the model's recommendations; the green curve shows the best you can do on this population of customers, and the diagonal line shows what fraction of about-to-leave customers you reach if you target at random. As shown on the graph, if you target the 20% highest-risk customers (as scored by the model), you will reach 30% of your about-to-leave customers. This is an improvement over the 20% you would expect to hit at random; the best you could possibly do targeting only 20% of your customers is about 37% of the about-to-leaves.
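The gain-curve reading can be reproduced on toy data. This is a sketch; `gain_at` is a hypothetical helper and the scores below are made up, not the model's:

```r
# Sketch: fraction of positives captured in the top 'frac' of model scores,
# i.e. one point on the gain curve (toy scores, not the post's model).
gain_at <- function(score, y, frac) {
  ord <- order(score, decreasing = TRUE)          # sort by score, descending
  k <- ceiling(frac * length(score))              # size of the targeted group
  sum(y[ord][seq_len(k)]) / sum(y)                # positives captured
}
score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
y     <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
gain_at(score, y, 0.2)  # top 2 of 10 customers capture 2 of 4 positives: 0.5
```

Evaluating `gain_at` over a grid of fractions traces out the blue curve; replacing the model score with the true outcome traces the green "best possible" curve.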
The confusion matrix and the gain curve help you to pick a trade-off between targeting in-app ads to try to retain at-risk customers, without antagonizing customers who are not at risk by showing too many of them an irrelevant ad.
Evaluating Utility
The distribution of days until exit by class label confirms that "risky" (according to the model) customers do in general exit sooner:
But you also want to double-check that the model identifies abandoning customers soon enough. Once the model has identified someone as being at risk, how long do you have to intervene?
# make daysToX finite. The idea is that the live-forevers should be rare
isinf = dTestS$daysToX == Inf
maxval = max(dTestS$daysToX[!isinf])
dTestS$daysToX = with(dTestS, ifelse(daysToX==Inf, maxval, daysToX))

# how long on average until flagged customers leave?
posmean = mean(dTestS[dTestS$predictedToLeave, "daysToX"])
posmean

## [1] 5.693333

# how many days until true positives (customers flagged as leaving
# who really do leave) leave?
tpfilter = dTestS$predictedToLeave & dTestS[[yVar]]
trueposmean = mean(dTestS[tpfilter, "daysToX"])
trueposmean

## [1] 2.507246
Ideally, you'd like the above distribution to be skewed to the right: that is, you want the model to identify at-risk customers as early as possible. You probably can't intervene in time to save customers who are leaving today (day 0) or tomorrow (you can think of these customers as recall errors from "yesterday's" application of the model). Fortunately, on average this model catches at-risk customers a few days before they leave, giving you time to put the appropriate in-app ad in front of them. Once you put this model into operation, you will further want to monitor the flagged customers, to see if your intervention is effective.
Conclusion
For sessionized problems the easiest way to make a “best classifier” is to cheat the customer and try only to predict events right before they happen. This allows your model to use small windows of near-term data and look artificially good. In practice you need to negotiate with your customer how far out a prediction is useful for the customer and build a model with training data oriented towards that goal. Even then you must re-inspect such a model, as even a properly trained near-term event model will have a significant (and low-utility) component given by events that are essentially happening immediately. These “immediate events” are technically correct predictions (so they don’t penalize precision and recall statistics), but are also typically of low business utility as they don’t give the business time for a meaningful intervention.
Next:
As mentioned above, you would prefer to have a principled variable selection technique. This will be the topic of our next article in this series.
The R markdown script describing our analysis is here. The plots are generated using our own in-progress visualization package, WVPlots
. You can find the source code for the package on GitHub, here.
The plot of aggregated door traffic log data shown at the top of the post uses data from Ihler, Hutchins and Smyth, "Adaptive event detection with time-varying Poisson processes", Proceedings of the 12th ACM SIGKDD Conference (KDD-06), August 2006. The data can be downloaded from the UCI Machine Learning Repository, here.
For this article we are assigning two different advertising messages to our potential customers. The first message, called “A”, we have been using a long time, and we have a very good estimate of the rate at which it generates sales (we are going to assume all sales are for exactly $1, so all we are trying to estimate are rates or probabilities). We have a new proposed advertising message, called “B”, and we wish to know: does B convert traffic to sales at a higher rate than A?
We are assuming:

- We know the conversion rate of the A-message exactly.
- We know in advance how long we are going to run the business (the total number of events).
- We value only gross revenue, with every sale worth exactly $1.
As we wrote in our previous article: in practice you usually do not know the answers to the above questions. There is always uncertainty in the value of the A-group; you never know how long you are going to run the business (in terms of events or in terms of time, and you would also want to time-discount any far-future revenue); and you often value things other than revenue (such as knowing whether B is greater than A, or even maximizing risk-adjusted returns instead of gross returns). This represents a severe idealization of the A/B testing problem, one that will let us solve the problem exactly using fairly simple R code. The solution comes from the theory of binomial option pricing (which is in turn related to Pascal’s triangle).
For this “statistics as it should be” (in partnership with Revolution Analytics) article let us work the problem (using R) pretending things are this simple.
Abstractly we have two streams of events (“A” events and “B” events). Each event returns a success or a failure (say valued at $1 and $0, respectively), and we want to maximize our overall success rate. The special feature of this problem formulation is that we assume we know how long we are going to run the business: there is an n such that the total number of events routed to A (call this a) plus the total number of events routed to B (call this b) satisfies a+b=n.
To make things simple assume:
The usual method of running an A/B test is to fix some parameters (prior distribution of expected values, acceptable error rate, acceptable range of error in dollars per event) and then design an experiment that estimates which of A or B is the more valuable event stream. After the experiment is over, you work only with whichever of A or B you have determined to be the better event stream. You essentially divide your work into an experimentation phase followed by an exploitation phase.
Suppose, instead of deriving formal statistical estimates, we solved the problem using ideas from operations research and asked for an adaptive strategy that directly maximizes the expected return rate. What would that even look like? It turns out you get a sensing procedure that routes all of its experiments to B for a while and then, depending on observed returns, may switch over to working only with A. This again looks like a sensing phase followed by an exploitation phase, but the exact behavior is determined by the algorithm interacting with experiment returns and is not something specified by the user. Let’s make things concrete by working a very specific example.
For the sake of argument: suppose we are willing to work with exactly four events ever, A’s conversion rate is exactly 1/2, and we are going to use what I am calling “naive priors” on B’s success rate. The entire task is to pick whether to work next with an event from A or from B. One strategy is a fill-in of the following table:
Number of B-trials run | Number of B-successes seen | Decision to go to A or B next
---|---|---
0 | 0 | ?
1 | 0 | ?
1 | 1 | ?
2 | 0 | ?
2 | 1 | ?
2 | 2 | ?
3 | 0 | ?
3 | 1 | ?
3 | 2 | ?
3 | 3 | ?
Notice we have not recorded the number of times we have tried the A-event. Because we are assuming we know the exact expected value of A (in this case 1/2) there is an optimal strategy that never tries an A-event until the strategy decides to give up on B. So we only need to record how many B’s we have tried, how many successes we have seen from B, and if we are willing to go on with B. Remember we have exactly four events to route to A and B in combined total, and this is why we don’t need to know what decision to make after the fourth trial.
We can present the decision process more graphically as the following directed graph:
Each row of the decision table is represented as a node in the graph. Each node contains the following summary information: step (the number of B-trials run so far), bwins (the number of B-successes seen), pbEst (the estimated B success rate), and valueB (the expected value of continuing with B).
The first two values (step,bwins) are essentially the keys that identify the node and the other fields (known or unknown) are derived values. In our example directed graph we have written down everything that is easy to derive (pbEst), but still don’t know the thing we want: valueB (or equivalently whether to try B one more time at each node).
It turns out there is an easy way to fill in all of the unknown valueB answers in this diagram. The idea is called dynamic programming, and this application of it is inspired by something called the binomial options pricing model. But the idea is so simple yet powerful that we can actually just derive it directly for our problem.
Consider the leaf-nodes of the directed graph (the nodes with no exiting edges, representing the state of the world before our last decision). For these nodes we do have an estimate of valueB: pbEst! We can fill in this estimate to get the following refined diagram:
For the final four nodes we know whether to try B again (the open green nodes) or whether to give up on B and switch to A (the shaded red nodes). The decision is based on our stated goal: maximizing expected value. On our last play we should go with B only if its estimated expected value is higher than the known expected value of A. Using the observed frequency of B-successes as our estimate of the probability of B (or expected value of B) may seem slightly bold in this context, but it is the standard way to infer (we can justify this either through Frequentist arguments or by Bayesian arguments using an appropriate beta prior distribution).
So we now know how to schedule the fourth and final stage. That wouldn’t seem to help us much, as the first decision (the top row, or root node) is what we need first, and it still has a “?” for valueB. But look at the three nodes in the third stage. We can now estimate their value using known values from the fourth stage.
Define:

- pbEst[step=n,bwins=w]: the estimated probability that a B-event succeeds, given we have seen w successes in n B-trials.
- valueB[step=n,bwins=w]: the expected number of future successes if we play B next and then continue optimally.
- value[step=n,bwins=w]: the value of the node under optimal play, i.e. max(valueA, valueB), where valueA is the known A-rate times the number of plays remaining.
The formula for valuing any non-leaf node in our diagram is:
valueB[step=n,bwins=w] =
( pbEst[step=n,bwins=w] * (1+value[step=n+1,bwins=w+1]) ) +
( (1-pbEst[step=n,bwins=w]) * value[step=n+1,bwins=w] )
So if we know all of pbEst[step=n,bwins=w], value[step=n+1,bwins=w+1], and value[step=n+1,bwins=w], then we know valueB[step=n,bwins=w]. This just says the valueB of a node is the expected value of playing B once (a bonus of 1 if B succeeds) plus the expected value of the node we end up at.
For example we can calculate valueB of the “step 2 bwins 1” as:
valueB[step=2,bwins=1] =
( pbEst[step=2,bwins=1] * (1+value[step=3,bwins=2]) ) +
( (1-pbEst[step=2,bwins=1]) * value[step=3,bwins=1] )
=
( 0.5 * (1+0.67) ) +
( (1-0.5) * 0.5 )
= 1.085
All this is just done by reading quantities off the current diagram. We can do this for all of the nodes in the third row yielding the following revision of the diagram.
In the above diagram we have rendered nodes we consider unreachable (nodes we would never go to when following the optimal strategy) with dashed lines. We now have enough information in the diagram to use the equation to fill in the second row:
And finally we fill in the first row (or root node) and have a complete copy of the optimal strategy.
We can copy this from the diagram back to our original strategy table by writing “Choose A” or “Choose B” depending on whether valueA ≥ valueB for the node corresponding to the line in the table.
Number of B-trials run | Number of B-successes seen | Decision to go to A or B next
---|---|---
0 | 0 | Choose B
1 | 0 | Choose A
1 | 1 | Choose B
2 | 0 | Choose A
2 | 1 | Choose B
2 | 2 | Choose B
3 | 0 | Choose A
3 | 1 | Choose A
3 | 2 | Choose B
3 | 3 | Choose B
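The backward induction above is compact enough to sketch directly in R. This is our own illustrative implementation, not the post's code; we use the observed frequency w/n for pbEst (with 1/2 at the root, where no trials have yet run) and break exact ties toward B:

```r
# Sketch of the backward induction described above, for pA = 1/2 and
# n = 4 total events. pbEst is the observed B-success frequency w/n
# (taken as 1/2 at the root, where no B-trials have run) -- an assumption
# matching the "naive priors" of this example.
ab_strategy <- function(n = 4, pA = 0.5) {
  value <- vector("list", n + 1)   # value[[step+1]][w+1]: optimal-play value
  choice <- vector("list", n)      # choice[[step+1]][w+1]: "A" or "B"
  value[[n + 1]] <- numeric(n + 1) # no plays remain after step n
  for (step in (n - 1):0) {
    v <- numeric(step + 1); ch <- character(step + 1)
    for (w in 0:step) {
      pbEst <- if (step == 0) 0.5 else w / step
      # expected value of playing B once, then continuing optimally
      valueB <- pbEst * (1 + value[[step + 2]][w + 2]) +
        (1 - pbEst) * value[[step + 2]][w + 1]
      # switching to A: known rate times plays remaining
      valueA <- pA * (n - step)
      v[w + 1] <- max(valueA, valueB)
      ch[w + 1] <- if (valueB >= valueA) "B" else "A"
    }
    value[[step + 1]] <- v; choice[[step + 1]] <- ch
  }
  choice
}
strategy <- ab_strategy()
strategy[[1]]  # root decision: "B"
strategy[[4]]  # after 3 B-trials, by wins 0..3: "A" "A" "B" "B"
```

The decisions this produces match the strategy table above, including the root decision to try B first and the switch to A after an early B failure.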
We don’t have to use what we have called naive estimates for pbEst. If we have good prior expectations on the likely success rate of B we can work this into our solution through the form of Bayesian beta priors. For example if our experience was such that the expected value of B is in fact around 0.25 (much worse than A) with a standard deviation of 0.25 (somewhat diffuse, so there is a non-negligible chance B could be better than A) we could design our pbEst calculation with that prior (which is easy to implement as a beta distribution with parameters alpha=0.5 and beta=1.5). In the case where we are only going to try A or B at total of four times we get the following diagram:
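The quoted prior can be checked directly: a beta(alpha=0.5, beta=1.5) distribution has mean 0.25 and standard deviation 0.25, and the corresponding posterior-mean estimate follows from the standard beta-binomial update:

```r
# Check of the prior quoted above: beta(alpha=0.5, beta=1.5)
# has mean 0.25 and standard deviation 0.25.
alpha <- 0.5; beta <- 1.5
prior_mean <- alpha / (alpha + beta)
prior_sd <- sqrt(alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)))
c(prior_mean, prior_sd)  # 0.25 0.25

# with w successes seen in n B-trials, the posterior-mean estimate is:
pbEst <- function(w, n) (w + alpha) / (n + alpha + beta)
pbEst(0, 0)  # before any trials: the prior mean, 0.25
```

Swapping this `pbEst` into the backward induction in place of the observed frequency is all that is needed to run the dynamic program under the pessimistic prior.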
This diagram is saying: for only four trials we already know enough to never try B. The optimal strategy is to stick with A all four times. However, if the total number of trials we were budgeted to run was larger (say 20) then it actually starts to make sense to try B a few times to see if it is in fact a higher rate than A (despite our negative prior belief). We demonstrate the optimal design for this pessimal prior and n=20 in the following diagram.
And this is the magic of the dynamic programming solution. It uses the knowledge of how long you are going to run your business to decide how to value exploration (possibly losing money by giving traffic to B) versus exploitation (going with which of A or B is currently thought to be better). Notice the only part of the diagram or strategy table we need to keep is the list of nodes where we decide to no longer ever try B (the filled red stopping nodes). This is why we call this variation of the A/B test a stopping time problem.
All the B-rate calculations above are exactly correct if we in fact had the exact right priors for B’s rate. If the priors were correct at the root node, then by Bayes’ law the pbEst probability estimates are in fact exactly correct posterior estimates at each node, and every decision made in the strategy is then correct. For convenience we have been using a beta distribution as our prior (as it has some justification, and makes calculation very easy), but there is no guarantee that the actual prior is in fact beta, or that we even have the right beta distribution as our initial choice (a beta distribution is determined by two parameters, alpha and beta).
However, with n large enough (i.e. a budget of enough proposed events to design a good experiment) the strategy performance starts to become insensitive to the chosen prior (see the Bernstein–von Mises theorem for some motivation). So the strategy performs nearly as well with a prior we can supply as with the unknown perfect prior. As long as we start with an optimistic prior (one that allows our algorithm to route traffic to B for some time) we tend to do well.
In practice we would never know the exact expected value of A (and certainly not know it prior to starting the experiment). In the more realistic situation where we are trying to choose between an A and a B and have things to learn about both groups, the dynamic programming solution still applies: we just get a larger dynamic programming table. Each state is indexed by four numbers (the number of A-trials nA, A-successes wA, B-trials nB, and B-successes wB):
For each so-labeled state we have four derived values:
And again, an optimal strategy is one that just chooses A or B depending on whether valueA > valueB or not. Notice that in this case an optimal strategy may switch back and forth between using A or B experiments. The derived values are filled in from states at or near the end of the entire experiment just as before. We now have an index consisting of four numbers (nA,wA,nB,wB) instead of just two numbers (nB,wB) so it is harder to graphically present the intermediate calculations and the final strategy tables.
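A minimal sketch of this larger recursion (Python rather than the article's R; the independent uniform Beta priors on both arms are assumptions, since this excerpt does not fix them). Memoization over the four-number index replaces the two-dimensional table:

```python
from functools import lru_cache

def plan2(n, aA=1.0, bA=1.0, aB=1.0, bB=1.0):
    # Expected successes under the optimal strategy when BOTH rates
    # are unknown. aA, bA, aB, bB are assumed Beta prior parameters
    # (uniform priors by default).
    @lru_cache(maxsize=None)
    def value(nA, wA, nB, wB):
        if nA + nB == n:
            return 0.0  # trial budget exhausted
        pA = (wA + aA) / (nA + aA + bA)  # posterior mean for A's rate
        pB = (wB + aB) / (nB + aB + bB)  # posterior mean for B's rate
        valueA = (pA * (1 + value(nA + 1, wA + 1, nB, wB))
                  + (1 - pA) * value(nA + 1, wA, nB, wB))
        valueB = (pB * (1 + value(nA, wA, nB + 1, wB + 1))
                  + (1 - pB) * value(nA, wA, nB + 1, wB))
        return max(valueA, valueB)
    return value(0, 0, 0, 0)
```

With a budget of one trial and uniform priors the value is the prior mean 1/2; with two trials the exploration bonus lifts the value above 2 × 1/2.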
Here is an example that is closer to the success rates and length of business seen in email or web advertising (though one problem for email advertising is that this is a sequential plan: we need all earlier results back before making later decisions). Suppose we are going to run an A/B campaign for a total of 10,000 units of traffic, we assume the A success rate is exactly 1%, and we will use the uninformative Jeffreys prior for B (which is actually pretty generous to B, as this prior has initial expected value 1/2). That is it: our entire problem specification is the assumed A-rate, the total amount of traffic to plan over, and the choice of B-prior. This is specific enough for the dynamic programming strategy to completely solve the problem of maximizing expected revenue.
The dynamic program solution to this problem can be concisely represented by the following graph:
The plan is: we route all traffic to B, always measuring the empirical return rate of B (number of B successes over number of B trials). The number of B trials is the x-axis of our graph and we can use the current estimated B return rate as our y-height. The decision is: if you end up in the red area (below the curve) you stop B and switch over to A forever. Notice B is initially given a lot of leeway. It can fail to pay off a few hundred times and we don’t insist on it having a success rate near A’s 1% until well over 5,000 trials have passed.
Dynamic programming offers an interesting alternative solution to A/B test planning (in contrast to the classic methods we outlined here).
All the solutions and diagrams were produced by R code we share here.
We will switch from “statistics as it should be” back to “R as it is” and discuss the best ways to incrementally collect data or results in R.
]]>What the Sharpe ratio does is give you a dimensionless score to compare similar investments that may vary both in riskiness and returns, without needing to know the investor’s risk tolerance. It does this by separating the task of valuing an investment (which can be made independent of the investor’s risk tolerance) from the task of allocating/valuing a portfolio (which must depend on the investor’s preferences).
But what we have noticed is nobody is willing to honestly say what a good value for this number is. We will use the R analysis suite and Yahoo finance data to produce some example real Sharpe ratios here so you can get a qualitative sense of the metric.
“What is a good Sharpe ratio” was a fairly popular query in our search log (until search engines stopped sharing the incoming queries with mere bloggers such as myself). When you do such a search you see advice of the form:
… a ratio of 1 or better is considered good, 2 and better is very good, and 3 and better is considered excellent …
Some sources of this statement include:
To give you some insight, a ratio of 1 or better is considered good, 2 and better is very good, and 3 and better is considered excellent.
A Sharpe ratio of 1 is considered good, while 2 is considered great and 3 is considered exceptional.
To give you some insight, a ratio of 1 or better is considered good, 2 and better is very good, and 3 and better is considered excellent.
… frankly a Sharpe of 1+ is a yawn, and *no*one* notices. Above 2 and you get attention.
Reading these together you see a bit of a content-free echo chamber. Remember: on the web when you see the exact same answer again and again it is more likely due to copying than due to authoritativeness. The last reference indicates a part of the problem: once somebody claims some specific number (such as 1) is a middling Sharpe ratio, no-one dares call any smaller number good (for fear of looking weak).
One also wonders if “2 is good” is some sort of confounding interpretation of the Sharpe ratio as a Fisher style Z statistic (which uses the same ratio of mean over deviation). The point being the rule of thumb “two standard deviations has a two-sided significance of 0.0455” falls fairly close to the heavily ritualized p-value of 0.05.
The correct perspectives about Sharpe ratio are a bit more nuanced:
Of course, the higher the Sharpe ratio the better. But given no other information, you can’t tell whether a Sharpe ratio of 1.5 is good or bad. Only when you compare one fund’s Sharpe ratio with that of another fund (or group of funds) do you get a feel for its risk-adjusted return relative to other funds.
In fact it is Morningstar that gave a specific range for annual returns (around 0.3) that I used in my article Betting with their money (though now I am not sure I have enough context to be sure if the number they gave was a real example or just notional).
The theory of the Sharpe ratio is: if you have access to the ability to borrow or lend money, then for two similar investments you should always prefer the one with higher Sharpe ratio. So the Sharpe ratio is definitely used for comparison. When I was in finance I used the Sharpe ratio for comparison, but I didn’t have a Sharpe ratio goal.
Reading from a primary source we see estimating the Sharpe ratio of a particular investment at a particular time depends on at least the choice of: (1) the time scale of the returns being compared (annualizing daily returns involves a sqrt(365) style correction, and daily returns are per unit of time more volatile than yearly returns); (2) the “risk-free” rate of return used as the baseline; and (3) the window or method used to estimate the variance/deviation of returns. The Sharpe ratio is the ratio of excess return to deviation, and these two quantities are not varying in similar ways as we change scale. So without holding at least these three choices constant, it doesn’t make a lot of sense to compare.
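To make the scale sensitivity concrete: under a stylized i.i.d.-returns assumption (an illustration, not a claim from the article), mean excess return grows linearly in the horizon while deviation grows only as its square root, so the ratio itself grows like the square root of the horizon:

```python
import math

# Stylized i.i.d. model: over h days the mean return scales as h
# while the standard deviation scales as sqrt(h), so the Sharpe
# ratio scales as sqrt(h). Rescale a Sharpe ratio between horizons:
def rescale_sharpe(sharpe, from_days, to_days):
    return sharpe * math.sqrt(to_days / from_days)

# A daily Sharpe of 0.05 annualizes (sqrt(365) convention) to ~0.955
annual = rescale_sharpe(0.05, 1, 365)
```

This is why quoting a Sharpe ratio without its time scale is close to meaningless.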
We emphasize: because the Sharpe ratio itself varies over time (even with the above windowing), it in fact does not strictly make sense to talk about “the Sharpe ratio of an investment.” Instead you must consider “the Sharpe ratio of an investment at a particular time” or “the distribution of the Sharpe ratio of an investment over a particular time interval.” This means if you want to estimate a Sharpe ratio for an investment you must at least specify an additional time scale or smoothing window to average over.
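Such a windowed estimate can be sketched as follows (Python rather than the article's R; the toy inputs in the usage example are made up for illustration):

```python
import statistics

# Rolling Sharpe ratio: mean excess return over a trailing window
# divided by the sample standard deviation of returns over the
# same window.
def rolling_sharpe(returns, risk_free, window):
    out = []
    for i in range(window, len(returns) + 1):
        chunk = returns[i - window:i]
        excess = [r - f for r, f in zip(chunk, risk_free[i - window:i])]
        out.append(statistics.mean(excess) / statistics.stdev(chunk))
    return out

# Toy usage: five periods of returns, zero risk-free rate, window 3
rs = rolling_sharpe([0.01, 0.02, 0.03, 0.02, 0.01], [0.0] * 5, 3)
```

Changing `window` changes the answer, which is exactly the sensitivity discussed below.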
For example: below is the Sharpe ratio of the S&P500 index annual returns, using a 500 day window to estimate variance/deviation and the 10-year US T-note interest rate as the risk-free rate of return (note we are using the interest rate as the risk-free return; we are not using the returns you would see from buying and selling T-notes). Note: the choice of “risk free” investment here is a bit heterodox.
For our betting article we needed a Sharpe ratio on a 10 day scale. Here are the S&P500 index 10 day returns, using a 500 day window to estimate variance/deviation and the 10-year US T-note interest rate as the risk-free rate of return:
Over the graphed time interval the upper quartile value is 0.1. So the S&P 10 day return Sharpe ratio spends 75% of its time below 0.1. Thus the theoretical 10 day Sharpe ratio of 0.18 in “Betting with their money” is in fact large. Though we have found this calculation is sensitive to the length of the window used to estimate variance (for example, using a window of 30 days gives us mean: 0.08, median: 0.075, 3rd quartile: 0.45).
And for fun here is a similar Sharpe ratio calculation for the PIMCO TENZ bond fund:
This just confirms the last few years have not been good for US bonds.
Note: it is traditional to use very low interest rate instruments as the “safe comparison” in Sharpe ratios. So using 10-year T-note interest rates gives an analysis that is a bit pessimistic (and also ascribes the T-note variance to the instrument being scored). However, the “safe comparison” is really only used in the Sharpe portfolio argument as the rate you can borrow and/or lend money at (which is not in fact risk-free in the real world). So there is some value in using an easy to obtain, realistic “boring investment” as a proxy for the “risk-free return rate.” Ignoring the risk-free rate in the Betting with their money article is also not strictly correct (but also something Sharpe ignored for a while); given the scale of potential wins and losses in that set-up it is not going to cause significant issues.
Basically remember this: there are a lot of analyst-chosen details in estimating a Sharpe ratio. One of the biggest ones you can fudge is the estimate of deviation/variance (be it theoretical/Ex-Ante or Ex-Post). I would say very high Sharpe ratios are more likely evidence of underestimating the deviation/variance of the instrument and the reference return process than evidence of actual astronomical risk-adjusted returns.
The complete R code to produce these graphs from downloaded finance data is given here.
]]>Win-Vector LLC can complete your high value project quickly (some examples), and train your data science team to work much more effectively. Our consultants include the authors of Practical Data Science with R and also the video course Introduction to Data Science. We now offer on site custom master classes in data science and R.
Please reach out to us at contact@win-vector.com for research, consulting, or training.
Follow us on (Twitter @WinVectorLLC), and sharpen your skills by following our technical blog (link, RSS).
]]>An A/B test is a very simple controlled experiment where one group is subject to a new treatment (often group “B”) and the other group (often group “A”) is considered a control group. The classic example is attempting to compare defect rates of two production processes (the current process, and perhaps a new machine).
A/B testing is one of the simplest controlled experimental design problems possible (and one of the simplest examples of a Markov decision process). And that is part of the problem: it is likely the first time a person will need to truly worry about:
All of these are technical terms we will touch on in this article. However, we argue the biggest sticking point of A/B testing is: it requires a lot more communication between the business partner (sponsoring the test) and the analyst (designing and implementing the test) than a statistician or data scientist would care to admit. In this first article of a new series called “statistics as it should be” (in partnership with Revolution Analytics) we will discuss some of the essential issues in planning A/B tests.
Communication is the most important determinant of data science project success or failure. However, communication is expensive. That is one reason why a lot of statistical procedures are designed and taught in a way that minimizes communication. But minimizing communication has its own costs, and is somewhat responsible for the terse style of many statistical conversations.
A typical bad interaction is as follows. The business person wants to see if a new advertising creative is more profitable than the old one. It is unlikely they phrase it as precisely as “I want to maximize my expected return” (what they likely in fact want) or as “I want to test the difference between two means” (what a statistician most likely wants to hear). To make matters worse the “communication” is usually a “clarifying conversation” where the business person is forced to pick a goal that is convenient for analysis. The follow-ups are typically:
This is a very doctrinal and handbook way of talking and leaves little time to discuss alternatives. It kills legitimate statistical discussion (example: for testing difference of rates shouldn’t one consider Poisson or binomial tests such as Fisher’s or Barnard’s over Gaussian approximations?). And it shuts out a typical important business goal: maximizing expected return. Directly maximizing expected return is a legitimate, well-posed goal, but it is not in fact directly solved by any of the methods listed above. For a good discussion of maximizing expected return see here.
What we have to remember is: the responsibility of the statistician or data scientist consultant isn’t to quickly bully the business partner into terms and questions that are easiest for the consultant. The consultant’s responsibility is to spend the time to explore goals with the business partner, formulate an appropriate goal as a well-posed problem, and only then move on to solving that problem.
The problem to solve is the one that is best for the business. For A/B testing the right problem is usually one of:

1. Reliably determine which of A or B is the better treatment.
2. Maximize expected return over the total campaign.
A lot of literature on A/B testing is written as if problem-1 is the only legitimate goal. In many cases problem-1 is the goal, for example when testing drugs and medical procedures. And a good solution to problem-1 is usually a good approximate solution to problem-2. However, in business (as opposed to medicine) problem-2 is often the actual goal. And, as we have said, there are standard ways to solve problem-2 directly.
Once we have a goal we should look to standard solutions. Some of the methods I like to use in working with A/B tests include:
Each of these methods is trying to encapsulate a procedure that, in addition to serving a particular goal, minimizes the amount of prior knowledge needed to run a good A/B test. A lot of the differences in procedure come from using different assumptions to fill in quantities not known prior to starting the A/B test. Also notice a lot of the choice of Bayesian versus frequentist is pivoting on what you are trying to do (and less on which you like more).
Guided interaction with the calculator or exploration of derived decision tables is very important. In all cases you work the problem (maybe with both statistician and client present) by interactively proposing goals, examining the calculated test consequences, and then revising goals (if the proposed test sizes are too long). This ask, watch, reject cycle greatly improves communication between the sponsor and the analyst as it quickly makes apparent concrete consequences of different combinations of goals and prior knowledge.
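As an example of the kind of consequence such a calculator makes concrete, here is a rough frequentist sample-size sketch (a standard normal-approximation formula; it is not any of the specific calculators mentioned, and the example rates are hypothetical):

```python
import math

# Approximate trials needed per arm to detect a lift in success rate
# from p1 to p2 (two-sided test, normal approximation to binomials).
def ab_sample_size(p1, p2, z_alpha=1.959964, z_beta=0.841621):
    # z_alpha: two-sided 5% significance; z_beta: 80% power
    pbar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * pbar * (1 - pbar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Detecting a 10% -> 12% improvement takes thousands of trials per arm
n_per_arm = ab_sample_size(0.10, 0.12)
```

Watching a number like this jump as the sponsor proposes smaller and smaller detectable lifts is exactly the ask, watch, reject cycle described above.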
The following is a quick stab at a list of parameter estimates needed in order to design an efficient A/B test. We call them “prior estimates” as we need them during the test design phase, before the test is run.
Essentially a good test plan depends on having good prior estimates of rates, and a clear picture of future business intentions. Each of the standard solutions has different sensitivity to the answered and ignored points. For example: many of the solutions assume you will be able to use the chosen treatment (A or B) arbitrarily long after an initial test phase, and this may or may not be a feature of your actual business situation.
Given the (often ignored) difficulty in faithfully encoding business goals and in supplying good prior parameter estimates, one might ask why A/B testing as it is practiced ever works. My guess is that practical A/B testing is often not working, or at least not making correct decisions as often as typically thought.
Practitioners have seen that even tests that are statistically designed to make the wrong decision no more than 10% of the time seem to be wrong much more often. But this is noticed only if one comes back to re-check! Some driving issues include using the wrong testing procedure (such as inappropriately applying one-tailed bounds to an actual two-tailed experiment). But even with correct procedures, any mathematical guarantee is contingent on assumptions and idealizations that may not be met in the actual business situation.
Likely a good fraction of all A/B tests ever run have returned wrong results (picked B when the right answer was A, or vice-versa). But as long as the fraction is small enough that the expected value of an A/B test is positive, the business sees large long-term net improvement. For example, if all tested changes are of similar magnitude, then it is okay for even one-third of the tests to be wrong. You don’t know which decisions you made were wrong, but you know about 2/3rds of them were right, and the law of large numbers says your net gain is probably large and positive (again, assuming each change has a similar bounded impact on your business).
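This argument can be checked with a stylized simulation (the ±1 payoffs, the 2/3 accuracy, and the seed are all illustrative assumptions, echoing the similar-magnitude caveat above):

```python
import random

# If each adopted change helps by +1 when the test was right and
# hurts by -1 when it was wrong, and each test is independently
# right with probability 2/3, the net gain over many tests is
# strongly positive (expected gain is about n_tests / 3).
random.seed(2015)
n_tests = 1000
net_gain = sum(1 if random.random() < 2 / 3 else -1
               for _ in range(n_tests))
```

No individual test is trustworthy, but the portfolio of decisions is.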
Or one could say:
One third of the decisions I make based on A/B tests are wrong; the trouble is I don’t know which third.
in place of the famous:
Half the money I spend on advertising is wasted; the trouble is I don’t know which half.
The point is: it may actually make more business sense to apply 10 changes to your business suggested by “too short” A/B tests (so 2 of the suggestions may in fact be wrong) than to tie up your A/B testing infrastructure so long that you only test one possible change. In fact, considering an A/B test as a single event done in isolation (as is typically done) may not always be a good idea (for business reasons, in addition to the usual statistical considerations).
It has pained me to informally discuss the business problem and put off jumping into the math. But that was the point of this note: the problem precedes the math. In our next “Statistics as it should be” article we will jump into math and algorithms when we use a dynamic programming scheme to exactly solve the A/B testing plan problem for the special case when we assume we have answers to some of the questions we are usually afraid to ask.