The talk is called *Improving Prediction using Nested Models and Simulated Out-of-Sample Data*.

In this talk I will discuss nested predictive models. These are models that predict an outcome or dependent variable (called y) using additional submodels that have also been built with knowledge of y. Practical applications of nested models include “the wisdom of crowds”, prediction markets, variable re-encoding, ensemble learning, stacked learning, and superlearners.

Nested models can improve prediction performance relative to single models, but they introduce a number of undesirable biases and operational issues, and when they are improperly used they are statistically unsound. However, modern practitioners have made effective, correct use of these techniques. In my talk I will give concrete examples of nested models, how they can fail, and how to fix failures. The solutions we will discuss include advanced data partitioning, simulated out-of-sample data, and ideas from differential privacy. The theme of the talk is that with proper techniques, these powerful methods can be safely used.

John Mount and I will also be giving a workshop called *A Unified View of Model Evaluation* at **ODSC West 2016 on November 4** (the premium workshop sessions), and **November 5** (the general workshop sessions).

We will present a unified framework for predictive model construction and evaluation. Using this perspective we will work through crucial issues from classical statistical methodology, large data treatment, variable selection, ensemble methods, and all the way through stacking/super-learning. We will present R code demonstrating principled techniques for preparing data, scoring models, estimating model reliability, and producing decisive visualizations. In this workshop we will share example data, methods, graphics, and code.

I’m looking forward to these talks, and I hope some of you will be able to attend.


Like many other empirical sciences, data science plays fast and loose with important statistical questions (confusing exploration with confirmation, confusing significances with posteriors, hiding negative results, ignoring selection bias, and so on).

Some really neat writing on such pitfalls includes:

- “Why most of psychology is statistically unfalsifiable”, Morey and Lakens (in preparation, also see here).
- “HARKing: Hypothesizing After the Results are Known”, Kerr, Personality and Social Psychology Review, 1998, Vol. 2., No. 3. pp. 196-217.
- “Why Most Published Research Findings Are False”, Ioannidis JPA (2005), PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124.
- “How to Make More Published Research True”, Ioannidis JPA (2014), PLoS Med 11(10): e1001747. doi:10.1371/journal.pmed.1001747.

The last reference in particular makes two specific claims:

- Results can be true or false (independently of us being able to determine which). Remember this: reality isn’t a mere social construct.
- Possibly as much as 85% of all scientific research is wasted effort. Unlike Wanamaker’s “Half the money I spend on advertising is wasted; the trouble is I don’t know which half” we can know a priori which efforts could never pay off (those with a sufficiently implausible hypothesis or poor enough experimental design) prior to the waste.

The driving issue is: academic science is a profession measured by publication. This gets perverted (under the theory “the inevitable becomes acceptable”) into: scientists have a right to publish as they need to do so to survive. This is why the simple act of critically reading published papers (presumably why they are published) for statistical typos has been called “methodological terrorism” (presumably under the rubric that the tenured shit on the graduate students, and not the other way around; please see here and here for more commentary).

The above was fairly general. I am going to propose a few cartoon examples to be more specific.

- When I was working in biological science my impression was that about 90% of the published papers were of the form: “we purchased reagent 342234 from the Merck catalogue and used our purchased instrument to measure index of refraction under a grant to cure cancer.” The waste and ridiculousness of this is that usually reagent 342234 could not exist in living tissue (it would kill the tissue or get destroyed or bound) and the index of refraction has very little to do with cancer.
- From the outside (and this is in fact unfair) a lot of psychology studies look like: “we show a simple exercise that has no plausible mechanism linked to our claimed outcome possibly showed effect (below statistical significance) on a deliberately small test population.”
- From nutrition science “we designed an experiment so small it can’t show unhealthy food is bad for you, therefore unhealthy food isn’t bad for you” (related ranting can be found here).
- From neuroscience “we show an expensive fMRI lets us draw a pretty picture, from which we will then draw unrelated conclusions” (wonderfully lampooned in Bennett et al. “Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction” Journal of Serendipitous and Unexpected Results, 2010, see here for some discussion, and here for the meaningless circularity of such logic).
- From the quack physics world (reaction-less drives, free energy, and so on): “we studied something that can not happen (such as reaction-less drive which is impossible due to Noether’s theorem) in conditions (our garage) utterly different than where we claim the effect will be used (deep space).” Examples include the good-old Dean drive and the more modern EmDrive.
- And in our own data science land: “we combined thousands of advanced models to claim a marginal improvement on classifier performance that will never replicate on new data.”

We start with non-statistical people (such as computer scientists) running one experiment and claiming victory. We try to steer them into at least some repetition to get a crude look at the distribution. But repetition by instruction is mere ritual; how do we get to valid empirical science?

What we want from experiments is to know the truth (“is this food good for you or bad for you?”). While that is an unachievable goal in a mere empirical world, knowing the truth should always remain the goal.

Statistically we will settle for solid posterior odds on the statement under question being true. This is a strict formulation of empirical science. It is distinctly Bayesian, and unfortunately unachievable (as it depends on having good objective prior-odds on the effect). Usable subjective or usable un-informative priors are easy to get, but the true “prior odds” of a substantial empirical statement are usually inaccessible (though there is good practical use of so-called “empirical Bayes” methodology).
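The bookkeeping itself is simple arithmetic; the inaccessible part is the prior. A minimal sketch, with all numbers hypothetical for illustration only:

```r
# Posterior odds = prior odds * likelihood ratio (the "Bayes factor").
# Both numbers below are hypothetical inputs, not estimates.
priorOdds <- 0.1                            # subjective prior odds the effect is real
bayesFactor <- 5                            # evidence strength from one experiment
posteriorOdds <- priorOdds * bayesFactor    # 0.5
posteriorProb <- posteriorOdds/(1 + posteriorOdds)
print(posteriorProb)                        # 1/3: still more likely false than true
```

Note even "strong" evidence (a Bayes factor of 5) leaves an implausible hypothesis more likely false than true; this is exactly why the prior matters and why subjectively chosen priors prevent agreement.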

We move on to ideas of positivism, Popperism, falsifiability, and frequentism. Maybe we can’t work out the odds of a statement being true, but we may be able to eliminate some obvious false statements. Under frequentism we can at least complete a calculation (though it may not mean what we hope to claim). This is where science roughly is, and (despite its limitations and not giving proper posterior odds) I think it is about where we need to be, *if done correctly*.

Without access to objective priors, I think frequentism is about as good as empirical science is going to commonly get. However to even correctly apply frequentist methods you need to think deeply on at least the two following issues:

- Frequentist statistics depend on both the sequence of experiments performed and the sequence of experiments even contemplated (counter-factuals)! For example, experimenter intent can matter. Consider an honest experimenter who says they are going to pick an integer `k` from the geometric distribution with p=1/2 and flip a fair coin that many times, reporting the last flip only. Also consider a dishonest experimenter who is going to flip a fair coin until it comes up heads and also reports the last flip. The first reports “heads/tails” at a 50/50 rate, and the second researcher always reports “heads”. Very different outcomes, and the only distinction is procedure; so if we are lied to about the procedure we are at sea.
- Frequentist results are one-sided. You can never succeed. You can only “fail to fail.” This is where frequentism is most abused. A researcher can show the data observed is very unlikely under the assumption of a (hoped to be falsified) null-hypothesis. This lowers the plausibility of the null hypothesis. The hope is that some of this lost plausibility is captured by the researcher’s pet hypothesis (which is not always the case, as it can be captured by other competing possibilities).

There is an analogy to this in constructive mathematics. One of the tenets of classical logic is “`not not P` is equivalent to `P`.” This is the plan hinted at in null-hypothesis testing: if we read “`not P`” as “no effect” we would read the tenet as “falsifying no-effect is equivalent to showing *an* effect.” This is routinely abused into “falsifying no-effect is equivalent to proving *my* hypothesis” (i.e. claiming to have supported a particular reason or mechanism for an effect). However, in constructive logic “`not not P` is equivalent to `P`” is not true in general. Though we do have “`not not not P` is equivalent to `not P`” in intuitionist logic. In terms of hypothesis testing this reads as: “failing to falsify the null-hypothesis is equivalent to maintaining the null-hypothesis.” This statement has more limited content, which is why it remains a tautology even in intuitionist logic.
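The two experimenters described above are easy to simulate; the following sketch confirms the claimed behavior (seed chosen arbitrarily):

```r
set.seed(2016)
# Honest experimenter: pick k from the geometric distribution (p=1/2),
# flip a fair coin k times, and report only the last flip.
honest <- replicate(10000, {
  k <- rgeom(1, 0.5) + 1                      # at least one flip
  flips <- sample(c('H','T'), k, replace=TRUE)
  flips[k]
})
# Dishonest experimenter: flip a fair coin until heads, report the last flip.
dishonest <- replicate(10000, {
  flip <- sample(c('H','T'), 1)
  while(flip != 'H') { flip <- sample(c('H','T'), 1) }
  flip
})
mean(honest=='H')     # near 0.5
mean(dishonest=='H')  # exactly 1
```

Identical coin, identical "report the last flip" summary; only the stopping procedure differs, and it completely changes the distribution of reports.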

What are we to do? Accept results that are only run once (with absolutely no statistics)? Teach basic frequentism, which will be badly abused? Treat subjective Bayesianism (Bayes’ method with subjective priors) as universal? (For example, I have a near zero prior on reaction-less drive, but presumably reaction-less drive advocates have a much larger prior; so we will never agree on the interpretation of any reasonable number of experiments.) Wait for the impossible city of objective Bayesian priors?

I’d say the usable lesson comes from our digression into logic emphasizing that “failing to fail” is not always the same as success. However in teaching I try to move researchers a bit up on the above sequence and ask them to keep their eyes even further up.


R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is instructive to see how ROC plots are produced and how AUC can be calculated. Bob Horton’s article showed how elegantly the points on the ROC plot are expressed in terms of sorting and cumulative summation.

The next step is computing AUC. Obviously computing area is a solved problem. The issue is how you deal with interpolating between points and the conventions of what to do with data that has identical scores. An elegant interpretation of the usual tie breaking rules is: for every point on the ROC curve we must have either all of the data above a given score threshold or none of the data above a given score threshold. This is the issue alluded to when the original article states:

This brings up another limitation of this simple approach; by assuming that the rank order of the outcomes embodies predictive information from the model, it does not properly handle sequences of cases that all have the same score.

This problem is quite easy to explain with an example. Consider the following data.

```
d <- data.frame(pred=c(1,1,2,2),y=c(FALSE,FALSE,TRUE,FALSE))
print(d)
##   pred     y
## 1    1 FALSE
## 2    1 FALSE
## 3    2  TRUE
## 4    2 FALSE
```

Using code adapted from the original article we can quickly get an interesting summary.

```
ord <- order(d$pred, decreasing=TRUE) # sort by prediction, reversed
labels <- d$y[ord]
data.frame(TPR=cumsum(labels)/sum(labels),
           FPR=cumsum(!labels)/sum(!labels),
           labels=labels,
           pred=d$pred[ord])
##   TPR       FPR labels pred
## 1   1 0.0000000   TRUE    2
## 2   1 0.3333333  FALSE    2
## 3   1 0.6666667  FALSE    1
## 4   1 1.0000000  FALSE    1
```

The problem is: we need to take all of the points with the same prediction score as an atomic unit (we take all of them or none of them). Notice also `TPR` is always 1 (an undesirable effect).

We do not really want rows 1 and 3 in our plot or area calculations. In fact the values in rows 1 and 3 are not fully determined, as they can vary depending on details of tie breaking in the sorting (though the values recorded in rows 2 and 4 can not so vary). Also (especially after deleting rows) we may need to add in ideal points with `(FPR,TPR)=(0,0)` and `(FPR,TPR)=(1,1)` to complete our plot and area calculations.

What we want is a plot where ties are handled. Such plots look like the following:

```
# devtools::install_github('WinVector/WVPlots')
library('WVPlots') # see: https://github.com/WinVector/WVPlots
WVPlots::ROCPlot(d,'pred','y',TRUE,'example plot')
```

There is a fairly elegant way to get the necessary adjusted plotting frame: use differencing (the opposite of cumulative sums) to find where the `pred` column changes, and limit to those rows.

The code is as follows (also found in our `sigr` library here):

```
calcAUC <- function(modelPredictions,yValues) {
  ord <- order(modelPredictions, decreasing=TRUE)
  yValues <- yValues[ord]
  modelPredictions <- modelPredictions[ord]
  x <- cumsum(!yValues)/max(1,sum(!yValues)) # FPR = x-axis
  y <- cumsum(yValues)/max(1,sum(yValues))   # TPR = y-axis
  # each point should be fully after or fully before each
  # decision level; remove duplicates to achieve this
  dup <- c(modelPredictions[-1]>=modelPredictions[-length(modelPredictions)],
           FALSE)
  # and add in ideal endpoints just in case (redundancy here is not a problem)
  x <- c(0,x[!dup],1)
  y <- c(0,y[!dup],1)
  # sum areas of segments (triangle topped vertical rectangles)
  n <- length(y)
  area <- sum( ((y[-1]+y[-n])/2) * (x[-1]-x[-n]) )
  area
}
```

This correctly calculates the AUC.

```
# devtools::install_github('WinVector/sigr')
library('sigr') # see: https://github.com/WinVector/sigr
calcAUC(d$pred,d$y)
## [1] 0.8333333
```
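As an independent cross-check (not part of the original article's code): AUC equals the probability that a randomly chosen positive example outscores a randomly chosen negative example, with ties counting one half. On our small example this Mann-Whitney style calculation agrees with `calcAUC`:

```r
# AUC as the tie-aware pairwise win rate of positives over negatives
pred <- c(1,1,2,2)
y <- c(FALSE,FALSE,TRUE,FALSE)
pos <- pred[y]        # scores of positive examples
neg <- pred[!y]       # scores of negative examples
wins <- outer(pos, neg, function(a,b) ifelse(a>b, 1, ifelse(a==b, 0.5, 0)))
mean(wins)            # 0.8333333, matching calcAUC(d$pred, d$y)
```

Here the single positive (score 2) beats the two negatives scored 1 and ties the negative scored 2, giving (1 + 1 + 0.5)/3 = 5/6.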

I think this extension maintains the spirit of the original. We have also shown how complexity increases as you move from code known to work on a particular data set at hand, to library code that may be exposed to data with unanticipated structures or degeneracies (this is why Quicksort, which has an elegant description, often has monstrous implementations; please see here for a rant on that topic).

When teaching “`R` for statistics” to groups of scientists (who tend to be quite well informed in statistics, and just need a bit of help with R) we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms of a larger and more detailed version of the following list:
- To test the quality of a numeric model to a numeric outcome: F-test (as in linear regression).
- To test the quality of a numeric model to a categorical outcome: χ^{2} or “Chi-squared” test (as in logistic regression).
- To test the association of a categorical predictor to a categorical outcome: many tests, including Fisher’s exact test and Barnard’s test.
- To test the quality of a categorical predictor to a numeric outcome: t-test, ANOVA, and Tukey’s “honest significant difference” test.

The above tests are all in terms of checking model results, so we don’t allow re-scaling of the predictor as part of the test (as we would have in a Pearson correlation test, or an area under the curve test). There are, of course, many alternatives such as Wald’s test, but we try to start with a set of tests that are standard, well known, and well reported by `R`. An odd exception has always been the χ^{2} test, which we will write a bit about in this note.

The χ^{2} test is a very useful statistical test. In particular, under fairly mild assumptions, it is a usable probability model for the quality of fit of a logistic regression. It is based on a summary statistic called “deviance” (an odd name, but the quantity is strongly related to likelihood and to entropy). And, after a simple transform, it yields a quantity called “pseudo-R^{2}” (see The Simpler Derivation of Logistic Regression) that reads like “fraction of variation explained.” It is a great final test for well-tuned models designed to estimate probabilities (just as area under the curve is a good early test, as it abstracts out scale and choice of decision threshold).

Yet the χ^{2} test is under-emphasized and under-implemented in R. Consider the following trivial logistic regression model.

```
d <- data.frame(x=c(1,2,3,4,5,6,7,7),
                y=c(TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE))
model <- glm(y~x,data=d,family=binomial)
summary(model)
## Call:
## glm(formula = y ~ x, family = binomial, data = d)
##
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -1.37180 -1.09714 -0.00811  1.08024  1.42939
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.7455     1.6672  -0.447    0.655
## x             0.1702     0.3429   0.496    0.620
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 11.090 on 7 degrees of freedom
## Residual deviance: 10.837 on 6 degrees of freedom
## AIC: 14.837
##
## Number of Fisher Scoring iterations: 4
```

Notice that `R` reported the change in deviance (the measure of model quality; also note the AIC or Akaike information criterion is present), but no significance on the overall model fit is reported (n.b. this is different from the coefficient significances). The significance is probably not there because `glm()` is a fully general generalized linear model fitter (it can fit much more than just logistic regressions) and the error model likely changes as you change the model type (controlled by the `family` and `link` controls).

But logistic regression really deserves to be front and center. This is why in Section 7.2 of *Practical Data Science with R*, Nina Zumel, John Mount, Manning 2014 we take the time to define and show how to calculate the pseudo-R^{2} and the significance for the reported deviance.
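The quantities themselves are easy to pull out of a fitted `glm` object. The following sketch (re-fitting the same model as above for completeness) computes the pseudo-R^{2} and the χ^{2} significance of the deviance improvement directly from the fields `glm` exposes:

```r
d <- data.frame(x=c(1,2,3,4,5,6,7,7),
                y=c(TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE))
model <- glm(y~x, data=d, family=binomial)
# pseudo-R-squared: fraction of deviance "explained" by the model
pseudoR2 <- 1 - model$deviance/model$null.deviance   # ~0.023
# chi-squared significance of the deviance improvement
delta <- model$null.deviance - model$deviance        # ~0.253
df <- model$df.null - model$df.residual              # 1
pValue <- pchisq(delta, df=df, lower.tail=FALSE)
```

On this toy data `pValue` is large (around 0.6), so the tiny deviance improvement is, unsurprisingly, not significant.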

As we observed in “Proofing statistics in papers”, having standard tests and standard reporting of tests is a great advantage. In this spirit we add an “APA-like” report of the χ^{2} significance in our `sigr` package. The use is quick and decisive:

```
library('sigr')
formatChiSqTest(model,pLargeCutoff=1,format='html')$formatStr
```

**Chi-Square Test** summary: *pseudo-R^{2}*=0.023 (…)

The `sigr` package isn’t up on CRAN, but can be installed using `devtools::install_github('WinVector/sigr')`. It includes documentation and an example vignette. `sigr` can format directly to HTML, Latex, Markdown, and Word. It is designed to be used with a `knitr` workflow, and when used with `knit` it will automatically select the correct target formatting. Right now it generates reports for only linear and logistic regressions; we will likely fill out with a few more “most often used, so should have a nice neat format” tests going forward.

Early automated analysis: trial model of a part of the Analytical Engine, built by Babbage, as displayed at the Science Museum (London) (Wikipedia).

From the abstract of the Nuijten et al. paper we have:

This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion.

How did they do that? Has science been so systematized it is finally mechanically reproducible? Did they get access to one of the new open information extraction systems (please see *Open Information Extraction: the Second Generation* Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam for some discussion)?

No, they used the fact that the American Psychological Association defines a formal style for reporting statistical significances, just like they define a formal style for citations. Roughly it looks for text like the following:

The results of the regression indicated the two predictors explained 35.8% of the variance (R2=.38, F(2,55)=5.56, p < .01).

(From a derived style guide found at the University of Connecticut.)

The software looks for fragments like: “(R2=.38, F(2,55)=5.56, p < .01)”. So really we are looking at statistics in psychology papers because they have standards clear enough to facilitate inspection.

These statistical summaries are often put into research papers by cutting and pasting from multiple sources, as not all stat packages report all these pieces in one contiguous string. So there are many chances for human error, and therefore a very high chance the pieces eventually get out of sync. Think of a researcher using Microsoft Word, Microsoft Excel, and some complicated graphical-interface-driven software again and again as data and treatment change throughout a study. Eventually something gets out of sync. We can try to check for inconsistency, as both the reported p-value and R-squared are derivable from the `F(numdf,dendf)=Fvalue` portion.

In fact the cited example has errors. The “explained 35.8% of the variance” should likely be 38% (to match the R2 / coefficient of determination) *and* the “`F(2,55)=5.56`” bit would entail an R-squared closer to the following: **F Test** summary: (*R*^{2}=0.17, *F*(2,55)=5.56, *p*≤0.00632) (we chose to show the actual p-value, but cutting off at a sensible limit is part of the guidelines). Likely this is a *notional* example itself built by copying and pasting to show the format (so we have no intent of mocking it). We derived this result by writing our own R function that takes the F-summaries and re-calculates the R-squared and p-value. In our case we performed the calculation by pasting the following into R: “`formatFTest(numdf=2,dendf=55,FValue=5.56)`”, which performs the calculation and formats the result close to APA style.
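The formulas behind such a check are standard; the base-R sketch below (not the `sigr` implementation itself, just the underlying arithmetic) recovers both numbers from `F(2,55)=5.56`:

```r
numdf <- 2; dendf <- 55; FValue <- 5.56
# p-value implied by the F statistic and its degrees of freedom
pValue <- pf(FValue, numdf, dendf, lower.tail=FALSE)   # ~0.00632
# R-squared implied by the F statistic: F = (R2/numdf)/((1-R2)/dendf)
R2 <- (numdf*FValue)/(numdf*FValue + dendf)            # ~0.17
```

Since both quantities are deterministic functions of `F(numdf,dendf)=Fvalue`, any reported triple that disagrees with these recomputations must contain a transcription error.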

Really this helps point out *why* scientists should strongly prefer workflows that support reproducible research (a topic we teach using R, RStudio, knitr, Sweave, and optionally Latex). It would be better to have correct conclusions automatically transcribed into reports, instead of hoping to catch some fraction of wrong ones later. This is one reason Charles Babbage specified a printer on both his Difference Engine 2 and Analytical Engine (circa 1847): to avoid errors!

The workflow we recommend is to include data-driven results automatically. With R-based tools, after we build a linear regression model (say called `model`) we can include an additional bit of R code in a master document that looks like the following:

```
formatFTest(model,pSmallCutoff=1.0e-12)
```

And then automatic tools would copy the summary statistics in APA format into our report automatically:

**F Test** summary: (*R*^{2}=0.86, *F*(1,18)=110.7, *p*=4.06e-09).

These are methods we teach using tools such as R, knitr, and Sweave.

That being said we recommend reading the original paper. The ability to detect errors gives the ability to collect statistics on errors over time, so there are a number of interesting observations to be made. For more work in this spirit we suggest “An empirical study of FORTRAN programs” Knuth, Donald E., Software: Practice and Experience, Vol. 1, No. 2, 1971, doi: 10.1002/spe.4380010203.

We can even try running statcheck on the guide; it confirms the relation between the F-value and p-value and doesn’t seem to check the R-squared (probably not part of the intended check):

| | x |
|---|---|
| Source | 1 |
| Statistic | F |
| df1 | 2 |
| df2 | 55 |
| Test.Comparison | = |
| Value | 5.56 |
| Reported.Comparison | < |
| Reported.P.Value | 0.01 |
| Computed | 0.006321509 |
| Raw | F(2,55)=5.56, p < .01 |
| Error | FALSE |
| DecisionError | FALSE |
| OneTail | FALSE |
| OneTailedInTxt | FALSE |
| APAfactor | 1 |

Our R code demonstrating how to automatically produce ready to go APA style F-summaries can be found here.

I am just going to add a few additional references (mostly from Nina) and some more discussion on log-normal distributions versus Zipf-style distributions or Pareto distributions.

In analytics, data science, and statistics we often assume we are dealing with nice or tightly concentrated distributions such as the normal or Gaussian distribution. Analysis tends to be very easy in these situations and does not require much data. However, for many quantities of interest (wealth, company sizes, sales, and many more) it becomes obvious that we cannot be dealing with such a distribution. The telltale sign is usually when relative error is more plausible than absolute error. For example it is much more plausible we know our net worth to within plus or minus 10% than to within plus or minus $10.
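One way to see why relative error points toward the log-normal: a quantity built up from many small multiplicative (percentage-sized) shocks has an approximately normal logarithm. A quick simulation sketch (shock size and counts chosen arbitrarily for illustration):

```r
set.seed(2016)
# A wealth-like quantity: the product of many small relative shocks.
w <- replicate(10000, prod(1 + rnorm(50, mean=0, sd=0.05)))
lw <- log(w)
# log(w) is approximately normal: its sample skewness is near zero
skew <- mean((lw - mean(lw))^3)/sd(lw)^3
print(skew)
```

The values `w` themselves are right-skewed with occasional large outcomes, yet plus-or-minus-10%-style statements about them make sense, matching the net-worth intuition above.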

In such cases you have to deal with the consequences of slightly more wild distributions such as at least the log-normal distribution. In fact this is the important point and I suggest you read Nina’s article for motivation, explanation, and methods. We have found this article useful both in working with data scientists and in working with executives and other business decision makers. The article formalizes ideas all of these people already “get” or anticipate into concrete examples.

In addition to trying to use mathematics to make things more clear, there is a mystic sub-population of mathematicians that try to use mathematics to make things more esoteric. They are literally disappointed when things make sense. For this population it isn’t enough to see if switching from a normal to log-normal distribution will fix the issues in their analysis. They want to move on to even more exotic distributions such as Pareto (which has even more consequences) with or without any evidence of such a need.

The issue is: in a log-normal distribution we see rare large events much more often than in a standard normal distribution. Modeling this can be crucial, as it tells us not to be lulled into too strong a sense of security by small samples. This concern *can* be axiomatized into “heavy tailed” or “fat tailed” distributions, but be aware: these distributions tend to be more extreme than what is implied by a relative error model. The usual heavy tail examples are Zipf-style distributions or Pareto distributions (people tend to ignore the truly nasty example, the Cauchy distribution, possibly because it dates back to the 17th century and thus doesn’t seem hip).

The hope seems to be that one is saving the day by bringing in new esoteric or exotic knowledge such as fractal dimension or Zipf’s law. The actual fact is this sort of power-law structure has been known for a very long time under many names. Here are some more references:

- “Power laws, Pareto distributions and Zipf’s law”, Mark Newman, Complex Systems 899, Winter 2006: Theory of Complex Systems.
- “The Long Tail”, Chris Anderson, Wired 10.01.04.
- “A Brief History of Generative Models for Power Law and Lognormal Distributions”, Michael Mitzenmacher, Internet Mathematics Vol. 1, No. 2: 226-251.
- “Zipf’s word frequency law in natural language: a critical review and future directions”, Steven T. Piantadosi, June 2, 2015.
- “On the statistical laws of linguistic distribution”, Vitold Belevitch, Annales de la Société Scientifique de Bruxelles, vol.3, iss.73, pp. 310–326.
- “Living in a Lognormal World,” Nina Zumel, Win-Vector blog, February 3, 2010.

Reading these we see that the relevant statistical issues have been well known since at least the 1920’s (so were not a new discovery by the later loud and famous popularizers). The usual claim of old wine in new bottles is that there is some small detail (and mathematics is a detailed field) that is now set differently. To this I put forward a quote from Banach (from *Adventures of a Mathematician* S.M. Ulam, University of California Press, 1991, page 203):

Good mathematicians see analogies between theorems or theories, the very best ones see analogies between analogies.

Drowning in removable differences and distinctions is the world of the tyro, not the master.

From Piantadosi we have:

The apparent simplicity of the distribution is an artifact of how the distribution is plotted. The standard method for visualizing the word frequency distribution is to count how often each word occurs in a corpus, and sort the word frequency counts by decreasing magnitude. The frequency f(r) of the r’th most frequent word is then plotted against the frequency rank r, yielding typically a mostly linear curve on a log-log plot (Zipf, 1936), corresponding to roughly a power law distribution. This approach— though essentially universal since Zipf—commits a serious error of data visualization. In estimating the frequency-rank relationship this way, the frequency f(r) and frequency rank r of a word are estimated on the same corpus, leading to correlated errors between the x-location r and y-location f(r) of points in the plot.

Let us work through this one detailed criticism using R (all synthetic data/graphs found here). We start with the problem and a couple of observations.

Suppose we are running a business and organize our sales data as follows. We compute what fraction of our sales each item is (be it a count, or be it in dollars) and then rank them (item 1 is top selling, item 2 is next, and so on).

The insight of the Pareto-ists and Zipfians is if we plot sales intensity (probability or frequency) as a function of sales rank we are in fact very likely to get a graph that looks like the following:

Instead of all items selling at the same rate we see the top selling item can often make up a significant fraction of the sales (such as 20%). There are a lot of 80/20 rules based on this empirical observation.

Notice also the graph is fairly illegible, the curve hugs the axes and most of the visual space is wasted. The next suggestion is to plot on “log-log paper” or plot the logarithm of frequency as a function of logarithm of rank. That gives us a graph that looks like the following:

If the original data is Zipfian distributed (as it is in the artificial example) the graph becomes a very legible straight line. The slope of the line is the important feature of the distribution and is (in a very loose sense) the “fractal dimension” of this data. The mystics think that by identifying the slope you have identified some key esoteric fact about the data and can then somehow “make hay” with this knowledge (though they never go on to explain how).
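The “slope on log-log paper” claim is easy to demonstrate with synthetic Zipfian frequencies (the exponent below is chosen arbitrarily for illustration):

```r
# Exact Zipfian frequencies: f(r) proportional to 1/r^s
s <- 1.2
r <- 1:100
f <- r^(-s)
f <- f/sum(f)                      # normalize to frequencies
# On log-log scale the relationship is exactly linear,
# and the fitted slope recovers -s.
fit <- lm(log(f) ~ log(r))
slope <- unname(coef(fit)[2])
print(slope)                       # -1.2
```

Because the synthetic data is an exact power law, the regression recovers the exponent perfectly; real data never fits this cleanly, which is exactly the point of the discussion that follows.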

Chris Anderson in his writings on the “long tail” (including his book) clearly described a very practical use of such graphs. Suppose instead of assuming the line on log-log plots is a consequence of something special, suppose it is a consequence of something mundane. Maybe graphs tend to look like this for catalogs, sales, wealth, company sizes, and so on. So instead of saying the perfect fit is telling us something, look at defects in fit. Perhaps they indicate something. For example: suppose we are selling products online and something is wrong with a great part of our online catalogue. Perhaps many of the products don’t have pictures, don’t have good descriptions, or some other common defect. We might expect our rank/frequency graph to look more like the following:

What happened is that after product 20 something went wrong. In this case (because the problem happened early, at an important low rank) we can see it, but it is even more legible on the log-log plot.

The business advice is: look for that jump, sample items above and below the jump, and look for a difference. As we said the difference could be no images on such items, no free shipping, or some other sensible business impediment. The reason we care is this large population of low-volume items could represent a non-negligible fraction of sales. Below is the theoretical graph if we fixed whatever is wrong with the rarer items and plotted sales:

From this graph we can calculate that the missing sales represent a loss of about 32% of revenue. If we could service these sales cheaply we would want them.
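The arithmetic behind such an estimate is just a ratio of sums. A hedged sketch in R (the rank cutoff of 20 and the 90% sales suppression are invented for illustration, not taken from the example's actual data):

```r
# hypothetical defect: items past rank 20 realize only 10% of their potential sales
rank <- 1:1000
potential <- rank^(-1.07)                        # idealized Zipfian sales
actual <- ifelse(rank <= 20, potential, 0.1*potential)

# share of revenue lost to the defect
lost_fraction <- 1 - sum(actual)/sum(potential)
print(lost_fraction)
```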

In the above I used a theoretical Zipfian world to generate my example. But suppose the world isn’t Zipfian (there are many situations where log-normal is a much more plausible distribution). Just because the analyst wishes things were exotic (requiring their unique heroic contribution) doesn’t mean they are in fact exotic. Log-log paper is legible because it reprocesses the data fairly violently. As Piantadosi said: we may see patterns in such plots that are features of the analysis technique, and not features of the world.

Suppose the underlying sales data is log-normal distributed instead of Zipfian distributed (a plausible assumption until eliminated). If we had full knowledge of every possible sale for all time, we could make a log-log plot over all time and get the following graph.

What we want to point out is: this is not a line. The hook down at the right side means that rare items have far fewer sales than a Zipfian model would imply. It isn’t just a bit of noise to be ignored. This means when one assumes a Zipfian model one is assuming the rare items as a group are in fact very important. This may be true or may be false, which is why you want to measure such a property and not assume it one way or the other.

The above graph doesn’t look so bad. The honest empiricist may catch the defect and say it doesn’t look like a line (though obviously a quantitative test of distributions would also be called for). But this graph was plotting all sales over all time; we would never see that in practice. Statistically we usually model observed sales as a sample drawn from this larger ideal sampling population. Let’s take a look at what that graph may look like. An example is given below.

I’ll confess, I’d have a hard time arguing this wasn’t a line. It may or may not be a line, but it is certainly not strong evidence of a non-line. This data did not come from a Zipfian distribution (I know I drew it from a log-normal distribution), yet I would have a hard time convincing a Zipfian that it wasn’t from a Zipfian source.
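This ambiguity is easy to reproduce. A sketch in R (assuming log-normal item popularity and, as in the example, a sample of 1000 observed sales, with both rank and frequency estimated from the same sample):

```r
set.seed(2016)
# true item popularities are log-normal, not Zipfian
popularity <- rlnorm(10000, meanlog = 0, sdlog = 2)

# draw 1000 observed sales in proportion to popularity
sales <- sample(seq_along(popularity), size = 1000,
                replace = TRUE, prob = popularity)

# estimate both frequency and rank from the same sample
freq <- sort(table(sales), decreasing = TRUE)/1000
plot(log(seq_along(freq)), log(as.numeric(freq)),
     xlab = "log(estimated rank)", ylab = "log(estimated frequency)")
# the result can look deceptively close to a straight line
```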

And this brings us back to Piantadosi’s point. We used the same sample to estimate both sales frequencies and sales ranks. Neither of those are actually known to us (we can only estimate them from samples). And when we use the same sample to estimate both, they necessarily come out very related due to the sampling procedure. Some of the biases seem harmless, such as frequency being monotone decreasing in rank (which is true for the unknown true values). But remember: relations that are true in the full population are not always true in the sample. Suppose we had a peek at the answers and, instead of estimating the ranks, took them from the theoretical source. In this case we could plot true rank versus estimated frequency:

This graph is much less orderly because we have eliminated some of the plotting bias that was introducing its own order. There are still analysis artifacts visible, but visible artifacts are better than hidden ones. For example, the horizontal strips are items that occurred with the same frequency in our sample but had different theoretical ranks. In fact our sample is of size 1000, so the rarest frequency we can measure is 1/1000, which creates the lowest horizontal stripe. The neatness of the previous graph came from dots standing on top of each other when we estimated frequency as a function of rank.

We are not advocating specific changes; we are just saying the log-log plot is a fairly refined view, and as such many of its features are details of processing, not all correctly inferred or estimated features of the world. Again, for a more useful applied view we suggest Nina Zumel’s “Living in a Lognormal World.”

]]>

Classic machine learning (especially as it is taught in classes) emphasizes a nice safe static environment: you are given some unchanging data and are asked to produce a good predictive model one time. It is formally easier than causal inference or statistical inference, as being right often is enough, no matter the reason. It lives in an overly idealized world where one implicitly makes the following simplifying assumptions:

- The world does not know you are trying to model it (and so can’t take counter-measures, for ideas see Ling Huang, Anthony D. Joseph, Blaine Nelson, Benjamin I.P. Rubinstein, and J. D. Tygar. 2011. Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence (AISec ’11). ACM, New York, NY, USA, 43-58. DOI=http://dx.doi.org/10.1145/2046684.2046692).
- Your model has no effect on the world (positive or negative: see
*Weapons of Math Destruction*, Cathy O’Neil, Crown (September 6, 2016) for some discussion).

Adversarial machine learning is the formal name for the study of what happens when we concede even a slightly more realistic alternative to assumptions of these types (harmlessly called “relaxing assumptions”).

At startup.ml’s adversarial machine learning conference Dr. Alyssa Frazee gave a good talk on her work at Stripe. One point she was particularly clear on: once you actually start using your model, you in a sense become an additional adversary.

Her example was denying payment requests. Suppose you have a model that for a transaction `x` returns an estimate `pfraud(x)`, the estimated probability that a payment request is fraudulent. Further suppose you set up your business rules to refuse all transactions `x` where `pfraud(x) ≥ T`, where `T` is a chosen threshold. Then after running your system for a while you will no longer have any recent observations on the behavior of transactions where your model thinks `pfraud(x) ≥ T` (as you never let them through!). In particular you can no longer assess your false-positive rate in a meaningful way, as you are no longer collecting outcome data on items your classifier thinks are in the fraud class.

I don’t want to try to explain the setup or solution any further, as Alyssa Frazee developed it very well and very concretely, and I assume we will be hearing more of her speaking and writing in the future.

The solution suggested is standard, clever, simple, and clear: intentionally let some of the `pfraud(x) ≥ T` cases through to see what happens (though, if possible, spend on additional measures to mitigate potential loss on these) and then use inverse probability weighting to adjust the impact of these test cases. The idea is: if you are letting these “I should have rejected this” items through at a rate of 1 in 100 (instead of the full rejection rate of 0 in 100), then each such request in fact represents a collection of 100 similar requests; so replicate each of them 100 times in your data and you have an estimate of what would have happened had you followed all of these cases through to the end.

The above may sound “dangerous and expensive,” but I’ve never seen anything safer or cheaper that actually works reliably. And it is classic experimental design in disguise (the “accept even though I think I should reject” group can be thought of as having been marked “control” before scoring).

There is a tempting (but very wrong) alternative: treating the data marked as potentially fraudulent as being confirmed fraudulent during re-training (something that can actually happen in semi-supervised learning if you are not careful). I wrote on the dangers of this (incorrect) alternative in my praise of a famous joke (DO NOT USE) method called the data enrichment method.

It is not surprising that the correct adjustment is already well known to statisticians; statistics is largely a field of trying to reliably extract meaningful summaries and inferences from a potentially hostile data environment. This distinction is why I say machine learning stands out from statistics in being a more optimistic (meaning more naive) field.

]]>Nina Zumel and I deliberated over the possibilities for some time before deciding to write *Practical Data Science with R*, Nina Zumel, John Mount, Manning 2014.

In the end we worked very hard to organize and share a lot of good material in what we feel is a very readable manner. But I think the first author may have been signaling and preparing a bit before I was aware we were writing a book. Please read on to see some of her prefiguring work.

- September 4, 2012 “On Writing Technical Articles for the Nonspecialist”
- September 19, 2012 “On Being a Data Scientist”
- October 11, 2012 “I Write, Therefore I Think”
- December 6, 2012 “Good News: We’re Writing a Book!”

Suppose we have the task of predicting an outcome `y` given a number of variables `v1,..,vk`. We often want to “prune variables,” or build models with fewer than all the variables. This can speed up modeling, decrease the cost of producing future data, improve robustness, improve explainability, reduce over-fit, and improve the quality of the resulting model.

For some informative discussion on such issues please see the following:

- How Do You Know if Your Data Has Signal?
- How do you know if your model is going to work?
- Variable pruning is NP hard

In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.

To be truly effective in applied fields (such as data science) one often has to use (with care) methods that “happen to work” in addition to methods that “are known to always work” (or at least be aware you are competing against those who use such methods); hence the interest in mere heuristics.

Let \(L(m;S)\) denote the estimated loss (or badness of performance, so smaller is better) of a model for \(y\) fit using modeling method \(m\) and the variables \(v_i : i \in S\). Let \(d(m;a)\) denote the portion of \(L(m;\{ \})-L(m;\{ a \} )\) credited to the variable \(v_a\). This could be the change in loss, something like \(\mathrm{effectsize}(v_a)\), or \(-\log(\mathrm{significance}(v_a))\); in all cases *larger* is considered better.

For practical variable pruning (during predictive modeling) our intuition often implicitly relies on the following heuristic arguments.

- \(L(m; )\) is monotone decreasing: we expect \(L(m;S \cup \{ a \} )\) to be no larger than \(L(m;S)\). Note this may be achievable “in sample” (on training data), but is often false if \(L(m; )\) accounts for model complexity or is estimated on out-of-sample data (itself a good practice).
- If \(L(m;S \cup \{ a \} )\) is significantly lower than \(L(m;S)\) then we will be lucky enough to have \(d(m;a)\) not too small.
- If \(d(m;a)\) is not too small then we will be lucky enough to have \(d(\mathrm{lm};a)\) be non-negligible (where modeling method `lm` is linear regression or logistic regression).

Intuitively we are *hoping* (for ease of calculation) that variable utility has a roughly diminishing-return structure and that at least some non-vanishing fraction of a variable’s utility can be seen in simple linear or generalized linear models. Obviously this cannot be true in general (interactions in decision trees being a well known situation where variable utility can increase in the presence of other variables, and there are many non-linear relations that escape detection by linear models). Synergy is a good thing; we just would hate to miss it, and one way to prove we don’t miss it would be to know it isn’t there. We will show there is in fact synergy, so naive methods may in fact miss it.

However, if the above were true (or often nearly true) we could effectively prune variables by keeping only the set of variables \(\left\{ a \; \left| \; d(\mathrm{lm};a) \; \text{is non negligible} \right. \right\}\). This is a (user controllable) heuristic built into our `vtreat` R package and it proves to be quite useful in practice.
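A sketch of the flavor of the heuristic (this is not `vtreat`'s actual implementation; the function and threshold names are made up): score each candidate variable with a one-variable linear model and keep those whose significance passes a threshold:

```r
# keep variables whose single-variable linear model is significant
pruneByLM <- function(d, yVar, vars, sigLevel = 0.05) {
  sigs <- vapply(vars, function(v) {
    m <- lm(as.formula(paste(yVar, "~", v)), data = d)
    anova(m)[["Pr(>F)"]][[1]]          # F-test significance of the variable
  }, numeric(1))
  names(sigs)[sigs <= sigLevel]
}

# example: x1 carries signal, x2 is pure noise
set.seed(3)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- d$x1 + rnorm(100)
print(pruneByLM(d, "y", c("x1", "x2")))
```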

I’ll repeat: we feel that on real-world data you can use the above heuristics to usefully prune variables. Complex models do eventually get into a regime of diminishing returns, and real-world engineered useful variables usually (by design) have a hard time hiding. Also, remember data science is an empirical field: methods that happen to work will dominate (even if they do not apply in all cases).

For every heuristic you should crisply know if it is true (and is in fact a theorem) or it is false (and has counter-examples). We stand behind the above heuristics, and will show their empirical worth in a follow-up article. Let’s take some time and show that they are not in fact laws.

We are going to show that per-variable coefficient significances and effect sizes are not monotone in that adding more variables can in fact improve them.

First (using R) we build a data frame where `y` is the indicator of `a == b` (the negation of `a xor b`). This is a classic example of `y` being a function of two variables but not a *linear* function of them (at least over the real numbers; it is an affine relation over the field GF(2)).

```
d <- data.frame(a=c(0,0,1,1),b=c(0,1,0,1))
d$y <- as.numeric(d$a == d$b)
```

We look at the (real) linear relations between `y` and `a`, `b`.

`summary(lm(y~a+b,data=d))`

```
##
## Call:
## lm(formula = y ~ a + b, data = d)
##
## Residuals:
## 1 2 3 4
## 0.5 -0.5 -0.5 0.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.500 0.866 0.577 0.667
## a 0.000 1.000 0.000 1.000
## b 0.000 1.000 0.000 1.000
##
## Residual standard error: 1 on 1 degrees of freedom
## Multiple R-squared: 3.698e-32, Adjusted R-squared: -2
## F-statistic: 1.849e-32 on 2 and 1 DF, p-value: 1
```

`anova(lm(y~a+b,data=d))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## a 1 0 0 0 1
## b 1 0 0 0 1
## Residuals 1 1 1
```

As we expect, linear methods fail to find any evidence of a relation between `y` and `a`, `b`. This clearly violates our hoped-for heuristics.

For details on reading these summaries we strongly recommend *Practical Regression and Anova using R*, Julian J. Faraway, 2002.

In this example the linear model fails to recognize `a` and `b` as useful variables (even though `y` is a function of `a` and `b`). From the linear model’s point of view the variables are not improving each other (so that at least looks monotone), but that is largely because the linear model cannot see the relation unless we add an interaction of `a` and `b` (denoted `a:b`).

Let us develop this example a bit more to get a more interesting counterexample.

Introduce new variables `u = a and b`, `v = a or b`. By the rules of logic we have `y == 1+u-v`, so there is a linear relation.

```
d$u <- as.numeric(d$a & d$b)
d$v <- as.numeric(d$a | d$b)
print(d)
```

```
## a b y u v
## 1 0 0 1 0 0
## 2 0 1 0 0 1
## 3 1 0 0 0 1
## 4 1 1 1 1 1
```

`print(all.equal(d$y,1+d$u-d$v))`

`## [1] TRUE`

We can now see the counter-example effect: together the variables work better than they did alone.

`summary(lm(y~u,data=d))`

```
##
## Call:
## lm(formula = y ~ u, data = d)
##
## Residuals:
## 1 2 3 4
## 6.667e-01 -3.333e-01 -3.333e-01 -1.388e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3333 0.3333 1 0.423
## u 0.6667 0.6667 1 0.423
##
## Residual standard error: 0.5774 on 2 degrees of freedom
## Multiple R-squared: 0.3333, Adjusted R-squared: 5.551e-16
## F-statistic: 1 on 1 and 2 DF, p-value: 0.4226
```

`anova(lm(y~u,data=d))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 0.33333 0.33333 1 0.4226
## Residuals 2 0.66667 0.33333
```

`summary(lm(y~v,data=d))`

```
##
## Call:
## lm(formula = y ~ v, data = d)
##
## Residuals:
## 1 2 3 4
## 5.551e-17 -3.333e-01 -3.333e-01 6.667e-01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0000 0.5774 1.732 0.225
## v -0.6667 0.6667 -1.000 0.423
##
## Residual standard error: 0.5774 on 2 degrees of freedom
## Multiple R-squared: 0.3333, Adjusted R-squared: 0
## F-statistic: 1 on 1 and 2 DF, p-value: 0.4226
```

`anova(lm(y~v,data=d))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## v 1 0.33333 0.33333 1 0.4226
## Residuals 2 0.66667 0.33333
```

`summary(lm(y~u+v,data=d))`

```
## Warning in summary.lm(lm(y ~ u + v, data = d)): essentially perfect fit:
## summary may be unreliable
```

```
##
## Call:
## lm(formula = y ~ u + v, data = d)
##
## Residuals:
## 1 2 3 4
## -1.849e-32 7.850e-17 -7.850e-17 1.849e-32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.00e+00 1.11e-16 9.007e+15 <2e-16 ***
## u 1.00e+00 1.36e-16 7.354e+15 <2e-16 ***
## v -1.00e+00 1.36e-16 -7.354e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.11e-16 on 1 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.056e+31 on 2 and 1 DF, p-value: < 2.2e-16
```

`anova(lm(y~u+v,data=d))`

```
## Warning in anova.lm(lm(y ~ u + v, data = d)): ANOVA F-tests on an
## essentially perfect fit are unreliable
```

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 0.33333 0.33333 2.7043e+31 < 2.2e-16 ***
## v 1 0.66667 0.66667 5.4086e+31 < 2.2e-16 ***
## Residuals 1 0.00000 0.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

In this example we see synergy instead of diminishing returns. Each variable becomes better in the presence of the other. This is on its own a good thing, but it indicates variable pruning is harder than one might expect, even for a linear model.

We can get around the above warnings by adding some rows to the data frame that don’t follow the designed relation. We can even draw rows from this frame to show the effect on a more independent-looking data frame.

```
d0 <- d
d0$y <- 0
d1 <- d
d1$y <- 1
dG <- rbind(d,d,d,d,d0,d1)
set.seed(23235)
dR <- dG[sample.int(nrow(dG),100,replace=TRUE),,drop=FALSE]
summary(lm(y~u,data=dR))
```

```
##
## Call:
## lm(formula = y ~ u, data = dR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8148 -0.3425 -0.3425 0.3033 0.6575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.34247 0.05355 6.396 5.47e-09 ***
## u 0.47235 0.10305 4.584 1.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4575 on 98 degrees of freedom
## Multiple R-squared: 0.1765, Adjusted R-squared: 0.1681
## F-statistic: 21.01 on 1 and 98 DF, p-value: 1.349e-05
```

`anova(lm(y~u,data=dR))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 4.3976 4.3976 21.01 1.349e-05 ***
## Residuals 98 20.5124 0.2093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`summary(lm(y~v,data=dR))`

```
##
## Call:
## lm(formula = y ~ v, data = dR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7619 -0.3924 -0.3924 0.6076 0.6076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7619 0.1049 7.263 9.12e-11 ***
## v -0.3695 0.1180 -3.131 0.0023 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4807 on 98 degrees of freedom
## Multiple R-squared: 0.09093, Adjusted R-squared: 0.08165
## F-statistic: 9.802 on 1 and 98 DF, p-value: 0.002297
```

`anova(lm(y~v,data=dR))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## v 1 2.265 2.26503 9.8023 0.002297 **
## Residuals 98 22.645 0.23107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`summary(lm(y~u+v,data=dR))`

```
##
## Call:
## lm(formula = y ~ u + v, data = dR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8148 -0.1731 -0.1731 0.1984 0.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.76190 0.08674 8.784 5.65e-14 ***
## u 0.64174 0.09429 6.806 8.34e-10 ***
## v -0.58883 0.10277 -5.729 1.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3975 on 97 degrees of freedom
## Multiple R-squared: 0.3847, Adjusted R-squared: 0.3721
## F-statistic: 30.33 on 2 and 97 DF, p-value: 5.875e-11
```

`anova(lm(y~u+v,data=dR))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 4.3976 4.3976 27.833 8.047e-07 ***
## v 1 5.1865 5.1865 32.826 1.133e-07 ***
## Residuals 97 15.3259 0.1580
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Consider the above counter-example as *exceptio probat regulam in casibus non exceptis* (“the exception confirms the rule in cases not excepted”): it roughly outlines the (hopefully labored and uncommon) structure needed to break the otherwise common and useful heuristics.

In later articles in this series we will show more about the structure of model quality and show the above heuristics actually working very well in practice (and adding a lot of value to projects).


In each case I was treated very politely, listened to, and actually got fixes back in a very short timeframe from volunteers. That is really great and probably one of the many reasons R is a great ecosystem.

Please read on for my list of `n=3` interactions.

- While discussing plotting market data I ran into a corner-case with ggplot2. Even though I figured out how to work around it, it is now fixed by the ggplot2 team!
- I wrote an entire article denouncing a default setting of a single argument in the ranger random forest library. The ranger author himself replied with a fix that is very clever and mathematically well-founded (I suspect he had been researching this issue for a while on his own).
- I complained about summary presentation fidelity in base R `summary.default`. You guessed it: the volunteers have generously fielded a patch!

Like any real-world system, R represents a sequence of history and compromises; only unused systems can be perfect and without compromise. It is very evident how eager and able the volunteers who maintain R are to make sure it represents very good compromises.

I would like to offer sincere appreciation and thanks to the R community. If this is what you can expect using R, it is yet another strong argument for R.

And personal thanks to: Martin Maechler, Hadley Wickham, and Marvin N. Wright.

]]>