Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.

– Kohavi, Henne, Sommerfeld, “Practical Guide to Controlled Experiments on the Web” (2007)

A/B tests are one of the simplest ways of running controlled experiments to evaluate the efficacy of a proposed improvement (a new medicine, compared to an old one; a promotional campaign; a change to a website). To run an A/B test, you split your population into a *control* group (let’s call them “A”) and a *treatment* group (“B”). The A group gets the “old” protocol, the B group gets the proposed improvement, and you collect data on the outcome that you are trying to achieve: the rate at which patients are cured; the amount of money customers spend; the rate at which people who come to your website actually complete a transaction. In the traditional formulation of A/B tests, you measure the outcomes for the A and B groups, determine which is better (if either), and whether or not the difference observed is statistically significant. This leads to questions of test size: how big a population do you need to reliably detect a difference to the desired statistical significance? And to answer that question, you need to know how big a difference (*effect size*) matters to you.

The irony is that to detect small differences accurately you need a larger population size, even though in many cases, if the difference is small, *picking the wrong answer matters less*. It can be easy to lose sight of that observation in the struggle to determine correct experiment sizes.
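To see how quickly the required population grows as the effect size shrinks, here is a minimal sketch using `power.prop.test` from R's base `stats` package. The baseline rate of 0.10, the candidate treatment rates, and the 80% power target are illustrative assumptions, not values taken from the discussion above.

```r
# Per-group sample size needed to detect a difference between a baseline
# rate and a treatment rate (two-sided test, 5% significance, 80% power).
# The rates and the power target below are illustrative assumptions.
baseline <- 0.10
treatment <- c(0.15, 0.12, 0.11)

neededN <- sapply(treatment, function(p2) {
  ceiling(power.prop.test(p1 = baseline, p2 = p2,
                          sig.level = 0.05, power = 0.80)$n)
})

data.frame(treatment = treatment,
           effectSize = treatment - baseline,
           perGroupN = neededN)
```

The required group size climbs steeply as the treatment rate approaches the baseline, which is exactly the regime where picking the wrong arm costs the least.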

There is an alternative formulation for A/B tests that is especially suitable for online situations, and that explicitly takes the above observation into account: the so-called *multi-armed bandit* problem. Imagine that you are in a casino, faced with K slot machines (which used to be called “one-armed bandits” because they had a lever that you pulled to play (the “arm”) — and they pretty much rob you of all your money). Each of the slot machines pays off at a different (unknown) rate. You want to figure out which of the machines pays off at the highest rate, then switch to that one — but you don’t want to lose too much money to the suboptimal slot machines while doing so. What’s the best strategy?

The “pulling one lever at a time” formulation isn’t a bad way of thinking about online transactions (as opposed to drug trials); you can imagine all your customers arriving at your site sequentially, and being sent to bandit A or bandit B according to some strategy. Note also, that if the best bandit and the second-best bandit have very similar payoff rates, then settling on the second best bandit, while not optimal, isn’t necessarily that bad a strategy. You lose winnings — but not much.

Traditionally, bandit games are infinitely long, so analysis of bandit strategies is asymptotic. The idea is that you test less as the game continues — but the testing stage can go on for a very long time (often interleaved with periods of pure *exploitation*, or playing the best bandit). This infinite-game assumption isn’t always tenable for A/B tests — for one thing, the world changes; for another, testing is not necessarily without cost. We’ll look at finite games below.

**The Intuition**

Let’s look at the simplest situation. We are in an existing situation A, with a known payoff rate, `pA`. We want to test a proposed improvement, B, with unknown payoff rate `pB`. Our testing strategy is to play the B situation `N` times, and at the end of that period, we will play whichever situation looks better. In other words, if our estimate of `pB` looks higher than `pA` at the end of the test period, then at the `N+1`th step we’ll play B; otherwise, we’ll go back to A. If we pick the right answer, we win. If we pick the wrong answer, we lose `delta = abs(pA - pB)`, which we’ll call the “opportunity loss”. What’s the expected opportunity loss on the `N+1`th turn? It’s `delta` times the probability of picking the wrong bandit.

If in reality `pB` is less than `pA`, then the probability of being wrong is the probability of flipping a coin with heads-probability `pB` `N` times and seeing more than `ceiling(N*pA)` heads:

    pbinom(ceiling(N*pA), N, pB, lower.tail=F)

Or taking both situations (`pB <= pA` and `pB > pA`) into account:

    expectedLoss = function(pA, pB, N) {
      delta = abs(pA - pB)
      # The probability of seeing more than/less than pA*N heads in N flips,
      # if the probability is really pB -- the probability of being wrong.
      prob = (pB <= pA)*pbinom(ceiling(N*pA), N, pB, lower.tail=F) +
             (pB > pA)*pbinom(ceiling(N*pA)-1, N, pB)
      prob*delta
    }
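As a sanity check on the tail probability used above, the following sketch simulates many `N`-turn tests of bandit B and counts how often the test would send us to the wrong bandit. The specific values (`pA = 0.10`, `pB = 0.07`, `N = 100`, 100,000 simulated tests) and the variable names are assumptions chosen for illustration.

```r
set.seed(2014)  # arbitrary seed, only for reproducibility

pA <- 0.10      # known payoff rate of A
pB <- 0.07      # true (hidden) payoff rate of B; here B is worse
N <- 100        # length of the test phase
nSim <- 100000  # number of simulated test phases

# Each simulated test: number of payoffs seen in N plays of B
payoffs <- rbinom(nSim, size = N, prob = pB)

# We wrongly pick B when it shows more than ceiling(N*pA) payoffs
simWrong <- mean(payoffs > ceiling(N * pA))
exactWrong <- pbinom(ceiling(N * pA), N, pB, lower.tail = FALSE)

c(simulated = simWrong, exact = exactWrong)

# Expected opportunity loss on turn N+1, computed both ways
abs(pA - pB) * c(simulated = simWrong, exact = exactWrong)
```

The simulated and exact probabilities should agree up to Monte Carlo noise, which gives some confidence that the `pbinom` call encodes the decision rule described in the text.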

Let’s set `pA = 0.10` and `N=100`, and plot opportunity loss for different `pB`:

    library(ggplot2)
    pA = 0.10
    pBvec = seq(from = 0, to=0.2, by = 0.002)
    loss100 = expectedLoss(pA, pBvec, 100)
    ggplot(data.frame(pB=pBvec, loss=loss100), aes(x=pB, y=loss)) +
      geom_point() + geom_line() +
      geom_vline(xintercept=pA, color="red")

As you can see in the figure, you don’t lose much when `pB` is much smaller or much larger than `pA`, because the probability of picking the wrong bandit is low. You don’t lose much when `pB` is very close to `pA`, because even if you pick the wrong bandit (fairly likely), the payoff rates are close. There is an intermediate difference (roughly 0.03 on either side of `pA`) where the difference in payoffs is notable, and the probability of picking the wrong bandit is fairly high, so the expected opportunity loss is maximized.

We can plot the loss curve for different values of `N` (same `pA`):

As expected, the longer you test, the lower the expected loss on the next turn. The worst-case `pB` moves, too, and the region of largest loss gets smaller.
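The effect of test length can also be checked numerically. The sketch below restates the expected-loss calculation in a local helper (named `lossAfterTest` here only to keep the example self-contained; the name and the particular values of `N` are assumptions) and reports the worst-case `pB` and its loss for each test length.

```r
# Expected opportunity loss on the turn after an N-turn test of B,
# following the same tail-probability reasoning as above.
lossAfterTest <- function(pA, pB, N) {
  cutoff <- ceiling(N * pA)
  probWrong <- ifelse(pB <= pA,
                      pbinom(cutoff, N, pB, lower.tail = FALSE),
                      pbinom(cutoff - 1, N, pB))
  probWrong * abs(pA - pB)
}

pA <- 0.10
pBgrid <- seq(from = 0, to = 0.2, by = 0.002)
testLengths <- c(50, 100, 200, 500)  # illustrative choices

# For each test length, find the pB on the grid with the largest loss
worstCase <- do.call(rbind, lapply(testLengths, function(n) {
  loss <- lossAfterTest(pA, pBgrid, n)
  data.frame(N = n,
             worstPB = pBgrid[which.max(loss)],
             maxLoss = max(loss))
}))
worstCase
```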

Of course, you might pick the right bandit, too. So the expected value of the next turn is:

    # after an N-length test, what's the expected value of a turn?
    expectedValue = function(pA, pB, N) {
      # case where pA >= pB
      areaBoverA = pbinom(ceiling(N*pA), N, pB, lower.tail=F)  # probability we guess wrong
      value1 = (pB <= pA) * (areaBoverA*pB + (1-areaBoverA)*pA)
      # case where pB > pA
      areaAoverB = pbinom(ceiling(N*pA)-1, N, pB)
      value2 = (pB > pA)*(areaAoverB*pA + (1-areaAoverB)*pB)
      value1 + value2
    }

    # Expected value of a turn after 100 test-turns
    val100 = expectedValue(pA, pBvec, 100)
    ggplot(data.frame(pB=pBvec, value=val100), aes(x=pB, y=value)) +
      geom_point() + geom_line() +
      geom_vline(xintercept=pA, color="red") +
      geom_line(aes(y=pmax(pA, pB)), color="red", linetype=2)

The above graph suggests that if you test `pB` for 100 turns, the expected value of the next turn goes to “the right answer” for `pB` outside the region `(0.05, 0.18)`. If you test `pB` longer, you can even shrink that region. If your game is infinitely long (that is, you will go with your chosen bandit from turn `N+1` on, forevermore), then whatever opportunity you lost during the testing phase is a negligible part of your total expected value, and it’s in your interest to test for a very long time (this is not, however, the best way to play an infinite-length bandit game).

**Finite Games**

But games aren’t necessarily infinite, as we mentioned above. Let’s look at a simple finite game. As before, `pA` is known, `pB` is what you want to test. The entire game consists of `M` turns, and you will spend the first `N` turns testing `pB`. After that, you play the bandit that appears to have the higher payoff rate for the remainder of the game (we’ll call that the *exploitation* phase). Now what’s the best choice of `N`?

The expected value of the entire game is the value of the testing phase, `pB*N`, plus the expected value of the rest of the game: `expectedValue(pA, pB, N)*(M-N)`. We can compare that to the perfect game, where you psychically know which bandit is better, and play that one for the entire game: `pmax(pA, pB)*M`.

    gameMatrix = function(pa, pb, M, N) {
      psychicPlay = pmax(pa,pb)*M
      valueTest = pb*N
      expValuePlay = expectedValue(pa, pb, N)*(M-N)
      data.frame(pB=pb, testValue=valueTest, expPlayVal=expValuePlay,
                 value=valueTest+expValuePlay, psychicPlay=psychicPlay, N=N)
    }

We can evaluate a 1000-turn game for different values of `N` (`pA = 0.1`), and plot the expected total opportunity loss, as compared to the perfect game:

If `pB > pA`, then it’s best to play for a long time; you are not losing opportunity during the test phase, and you will only lose opportunity in the exploitation phase if you incorrectly choose `pA`, so test long enough to make that unlikely. If `pB < pA`, you are losing opportunity in the testing phase, which argues for a small `N`, but if you don’t test long enough, you are more likely to pick the wrong bandit at the end of the test phase, and hence will lose even more opportunity. On the other hand, if `pB` is very small, you are “wasting” opportunity by continuing to test even after it’s clear that `pB < pA`, and so losing opportunity when you could be playing optimally (that’s why the loss curve turns up again as `pB` approaches zero). The trick is to balance the tradeoffs.
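The tradeoff can be tabulated for a few concrete cases. This sketch recomputes the total expected loss of a 1000-turn game against the perfect game for one bandit that is worse than A and one that is better; the helper names (`turnValue`, `gameLoss`) and the chosen `pB` and `N` values are assumptions made for illustration.

```r
# Expected value of one turn after an N-turn test of B:
# we get the worse rate if we pick wrong, the better rate otherwise
turnValue <- function(pA, pB, N) {
  cutoff <- ceiling(N * pA)
  probWrong <- ifelse(pB <= pA,
                      pbinom(cutoff, N, pB, lower.tail = FALSE),
                      pbinom(cutoff - 1, N, pB))
  probWrong * pmin(pA, pB) + (1 - probWrong) * pmax(pA, pB)
}

# Total expected loss of an M-turn game (N test turns on B, then
# exploitation) relative to always playing the better bandit
gameLoss <- function(pA, pB, M, N) {
  pmax(pA, pB) * M - (pB * N + turnValue(pA, pB, N) * (M - N))
}

pA <- 0.10
M <- 1000
cases <- expand.grid(N = c(10, 50, 100, 200, 400), pB = c(0.07, 0.13))
cases$loss <- gameLoss(pA, cases$pB, M, cases$N)
cases
```

Reading down the `loss` column for each `pB` shows how the cost of testing too briefly trades off against the cost of testing too long.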

**An Adversarial Approach**

Imagine that the universe is actively working against you: no matter what `N` you choose, the universe arranges that bandit B will have the worst possible `pB` for that testing length. Then, for all the test lengths that you want to consider, figure out what that worst-case `pB` would be, and its expected opportunity loss. Call that `maxloss_N`. The best N to use is the N for which `maxloss_N` is minimized.

    M = 1000  # total game length: the 1000-turn game described above
    N = seq(from=10, to=200, by=10)
    for(n in N) {
      if(n==10) {
        gameValue = gameMatrix(pA, pBvec, M, n)
      } else {
        gameValue = rbind(gameValue, gameMatrix(pA, pBvec, M, n))
      }
    }
    #
    # I'm using sqldf to do the "group by", but you can use
    # aggregate() or a similar function instead.
    #
    options(gsubfn.engine="R")  # need this on a Mac
    library(sqldf)
    maxloss = sqldf('select N, max(psychicPlay-value) as mloss from gameValue group by N')
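The code comment above mentions `aggregate()` as an alternative to `sqldf`. Here is a minimal sketch of that group-by, shown on a small made-up data frame that borrows the column names of `gameValue` (the numbers are invented purely to illustrate the call).

```r
# Toy stand-in for gameValue: invented values, same column names
toy <- data.frame(N = c(10, 10, 20, 20),
                  psychicPlay = c(100, 134, 100, 134),
                  value = c(93, 125, 95, 129))

# Opportunity loss per row, then the maximum loss within each N
toy$loss <- toy$psychicPlay - toy$value
maxlossToy <- aggregate(loss ~ N, data = toy, FUN = max)
names(maxlossToy)[2] <- "mloss"
maxlossToy

# The N with the smallest worst-case loss
maxlossToy$N[which.min(maxlossToy$mloss)]
```

Using base R here avoids the extra package dependency, at the price of a slightly less readable query than the SQL version.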

For the game we are playing, `N = 50` is the best choice. The maximum expected loss occurs at `pB = 0.134`, with an expected value of 128.1; that’s a loss of 5.89 payoffs, or 4.4% fewer than the perfect game for that value of `pB`, which has an expected value of 134 (1000*0.134). There’s another local maximum at `pB = 0.072`, with an expected value of 94.65. Compared to the perfect game’s expected value of 100, that’s a loss of 5.35 payoffs, or 5.35% of the perfect game.

Compare this to the number of times you would have to play bandit B at `pB = 0.134` or `pB = 0.07` in order to separate it from bandit A to *p* = 0.05 significance:

    library(gtools)
    tailprobs = function(pA, pB, N) {
      if(pB <= pA) {
        prob = pbinom(ceiling(N*pA), N, pB, lower.tail=F)
      } else {  # (pB > pA)
        prob = pbinom(ceiling(N*pA)-1, N, pB)
      }
      prob
    }
    #
    # Do binary search to find the minimum N that achieves
    # the desired significance
    #
    sigtarget = 0.05
    s1 = binsearch(function(k) {tailprobs(pA, 0.134, k) - sigtarget},
                   range=c(1,ceiling(10000/pA)))
    max(s1$where)  # 256
    s2 = binsearch(function(k) {tailprobs(pA, 0.07, k) - sigtarget},
                   range=c(10,ceiling(100/pA)))
    max(s2$where)  # 131

Given the numbers above, we know that if `pB` is about 0.03 away from our `pA`, an `N=50` game with the given parameters will make a lot of mistakes identifying which bandit has the better payoff, but we also know from our previous analysis that (assuming we don’t know the true `pB`) the lost opportunity costs are as low as we can make them. Unless it is absolutely critical that you identify the correct bandit, the above analysis shows that it is possible to achieve utility before you achieve significance.

Linear Regression is one of the most common statistical modeling techniques. It is very powerful, important, and (at first glance) easy to teach. However, because it is such a broad topic it can be a minefield for teaching and discussion. It is common for angry experts to accuse writers of carelessness, ignorance, malice and stupidity. If the type of regression the expert reader is expecting doesn’t match the one the writer is discussing then the writer is assumed to be ill-informed. The writer is especially vulnerable to experts when writing for non-experts. In such writing the expert finds nothing new (as they already know the topic) and is free to criticize any accommodation or adaptation made for the intended non-expert audience. We argue that many of the corrections are not so much evidence of wrong ideas but more due to a lack of empathy for the informality necessary in concise writing. You can only define so much in a given space, and once you write too much you confuse and intimidate a beginning audience.

Let’s start with a common definition of regression modeling from The Cambridge Dictionary of Statistics (B. S. Everitt, Cambridge 2005 printing):

Regression Modeling: A frequently applied statistical technique that serves as a basis for studying and characterizing a system of interest, by formulating a mathematical model of the relation between a response variable, y and a set of q explanatory variables x1, x2, … xq. The choice of the explicit form of the model may be based on previous knowledge of the system or on considerations such as “smoothness” and continuity of y as a function of the x variables. In very general terms all such models can be considered to be of the form.

`y = f(x1,...xq) + e`

where the function f reflects the true but unknown relationship between y and the explanatory variables. The random additive error e which is assumed to have mean 0 and variance sigma_e^2 reflects the dependence of y on quantities other than x1,…,xq. The goal is to formulate a function fhat(x1,x2,…,xp) [sic] that is a reasonable approximation of f. If the correct parametric form of f is known, then methods such as least squares estimation or maximum likelihood estimation can be used to estimate the set of unknown coefficients. If f is linear in the parameters, for example, then the model is that of multiple regression. If the experimenter is unwilling to assume a particular parametric form for f then nonparametric regression modeling can be used, for example kernel regression smoothing, recursive partitioning regression or multivariate adaptive regression splines.

This is a bit long for a non-expert audience. Also notice a single typo (writing p where you clearly mean q) is *not* evidence of a lack of knowledge, care or effort (typos happen). The definition has a lot of conditions, caveats and alternatives. For practical writing you need to take a slice of this definition (the slice closest to what you are actually going to use) and go with that.

And even this definition is not complete enough to be strictly correct to a hostile reader. An easy (and common) cheap shot would be to write the following: “The writer clearly does not understand the nature of regression as he fails to correctly define *regression* as estimating expectations when attempting to discuss *regression modeling*.” Note: this is *not* the case: Everitt clearly has a deep understanding of regression. But knowing what you are talking about seems not to be a sufficient protection or defense.

Here is what our hypothetical critic claimed to be missing:

Linear regression: A term usually reserved for the simple linear model involving a response y, that is a continuous variable and a single explanatory variable, x, related by the equation.

`E(y) = a + b x`

Where E denotes expected value. See also multiple regression and least squares estimation. [ARA Chapter 1.]

Except this isn’t missing. This definition is also from Everitt. He just doesn’t have space for this aspect of regression in his discussion of regression modeling. You can only emphasize so many things at once (as you add more generalizations, caveats, conclusions and consequences you dilute core ideas).
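To make the `E(y) = a + b x` reading concrete, here is a small simulation sketch: data are generated with assumed values `a = 2` and `b = 0.5` plus zero-mean noise, and `lm` is used to estimate the coefficients of the expectation. All numeric values and variable names are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed for reproducibility

a <- 2      # assumed true intercept
b <- 0.5    # assumed true slope
n <- 1000

x <- runif(n, min = 0, max = 10)
y <- a + b * x + rnorm(n, mean = 0, sd = 1)  # E(y) = a + b x, plus noise

fit <- lm(y ~ x)
coef(fit)  # estimates should land near a and b
```

No individual `y` sits on the line, yet the fitted coefficients track the expectation; that is the sense of "regression" the hypothetical critic was asking for.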

When writing for the non-expert you need to make sure what you are writing is correct (so you are actually usefully educating) but you need to also be concise and (at least initially) anticipate the new reader’s initially naive expectations. If you spend a lot of time on a side issue, the non-expert will assume the side issue was the actual central topic of discussion. For example you don’t repeat over and over that you must assume the variances sigma_e^2 are uniformly bounded (which is in fact important), but use the fact that a new reader’s intuition often doesn’t yet include random variables with unbounded variance (saving you discussing the precaution). You spend your initial time addressing issues that are likely causing the reader conceptual trouble (such as how can a linear function approximate a non-linear one and how can you simultaneously estimate coefficients). You only bring in stuff that the naive reader isn’t likely to worry about later, and only if it is something they need to defend against. This style of writing is needed if you actually want to teach to a broad audience. But it leaves the writer vulnerable to the accusation that they don’t know what they are talking about (because they didn’t spend time on something that could theoretically invalidate their work, but that doesn’t tend to happen in the application at hand). Beginning learners do need correct definitions, but they also need succinct and situationally relevant discourse.

As an example that even pure mathematics writing is commonly informal (and requires a friendly, not a hostile, reading to make strict sense), consider one combinatorics course I attended (combinatorics being a specific type of mathematics) where the lecturer used the following convention. For every theorem the phrase “for all sets” was to be taken to mean either “for all sets except the empty set” or “for all sets including the empty set” depending on which specialization actually worked in the theorem in question. This “sloppiness” improved and sped up discourse greatly. Many mathematicians do this. Take problem 1C from page 6 of “A Course in Combinatorics” van Lint and Wilson (1st Ed. 1993 reprint): “Show a connected graph on n vertices is a tree if and only if it has n-1 edges.” If you worry all the way down about empty sets you have a hard time *sensibly* deciding if an empty graph can be a tree (having to decide if a graph can have zero vertices, and having to decide if a zero vertex graph is connected; my skim of the book definitions seems to allow the empty graph as a connected graph; leading to the reasoning failure that a 0-node graph is a connected graph that is not a tree as it fails to have the required -1 edges).

But back to statistics and regression. Where does the term regression come from, and what does it mean in this context? From Wikipedia:

The term “regression” was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean). For Galton, regression had only this biological meaning, but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925. Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher’s assumption is closer to Gauss’s formulation of 1821.

Galton’s regression is the observation that repeated experiments (like heights of descendants) tend to revert to the mean (meaning the children of a tallest child are not necessarily much taller than their cousins). The term “regression” is about expectations and implies separations of explainable variation, and unexplainable variation (treated as 0-mean noise, compatible with reversions to the mean).
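Galton's observation is easy to reproduce in simulation. The sketch below assumes standardized parent and child heights with correlation 0.5 (an illustrative value, not Galton's estimate) and compares the average height of the tallest parents with the average height of their children.

```r
set.seed(42)  # arbitrary seed for reproducibility

n <- 10000
rho <- 0.5  # assumed parent-child correlation (illustrative)

# Standardized heights: mean 0, standard deviation 1
parent <- rnorm(n)
child <- rho * parent + sqrt(1 - rho^2) * rnorm(n)

tall <- parent > quantile(parent, 0.9)  # tallest 10% of parents
c(parentMean = mean(parent[tall]),
  childMean = mean(child[tall]))

# The regression line of child on parent has slope near rho (< 1)
coef(lm(child ~ parent))
```

The children of tall parents are still taller than average, but less extremely so than their parents: the explainable part of the variation carries over, the zero-mean noise does not.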

From “On the Theory of Correlation”, G Udny Yule, Journal of the Royal Statistical Society, 1897 vol. 60 (4) pp. 812-854 (and speaking about the typical fit curve of y as a function of x over many data points):

It is a fact attested by statistical experience that these means do not lie chaotically all over the table, but range themselves more or less closely round a smooth curve, which we will name the curve of regression of x on y.

So regression methods evolve from finding the curve of regression, which itself is the best fit for *groups* of observations after allowing some of the variation to be declared “unexplained” and left in a noise term. This is an advance from mere fitting or solving where you might be trying to explain all of the observed variation in n-individuals using as many as n-variables.

Regression methods have multiple formulations with different strength of assumptions and different strength of conclusions. You can use varying assumptions to trade generality for power at will. Thus a hostile reader can equally criticize a writer who carefully states a distributional assumption (as they “clearly don’t know how general the method is”) or a writer that fails to make a distributional assumption (as they “clearly don’t know what they are doing”).

A wide range of applications fall under the rubric of linear regression. Some include:

- Simple least squares fitting. Running a line through only known data to minimize the total sum of square errors. No probability model or statistical theory is initially involved (you are not interpreting the fit as being maximum likelihood or useful for prediction), so few assumptions are actually needed. You can criticize a writer for failing to assume “the errors have expectation zero and are uncorrelated and have equal variances” (because then they can’t apply the Gauss-Markov theorem that least squares is a best linear unbiased estimator, and so miss the full power of the method) or you can criticize a writer for making any such assumption (because then they fail to realize that least squares by definition minimizes square error, and so miss the full generality of the method).
- Estimating coefficients (either in a frequentist or Bayesian manner). Now you certainly have to make statistical assumptions (you see the data as a noisy transformation of unknown coefficients). For a frequentist analysis you need distributional assumptions on the noise process (to turn losses into likelihoods) and for the Bayesian analysis you need to make assumptions on the prior distribution of the unknown coefficients (to turn conditional likelihoods on observations into posterior likelihoods on parameters).
- Making predictions. One thing that surprises most data scientists is that statisticians do not consider making predictions as the most important use of models. Statisticians tend to emphasize finding relations as more important (hence their assumptions tend to be designed to make coefficient estimation correct, but no stronger to preserve generality). It turns out to make reliable predictions you may need slightly different assumptions (at the very least something like exchangeability of data). So you can always criticize good work on relations as not being theoretically sound for predictions or good work on predictions as not being theoretically sound for extracting relations (or criticize work careful enough to meet both goals as bringing in too many assumptions).
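The first item in the list can be made concrete: least squares is just a minimization, and its solution can be computed with linear algebra alone, with no probability model in sight. This sketch uses a small made-up data set (the values are arbitrary) and compares the solution of the normal equations with the coefficients returned by `lm`.

```r
# Small made-up data set (values are arbitrary)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.1, 1.9, 3.2, 3.8, 5.1, 6.2)

# Design matrix with an intercept column
X <- cbind(1, x)

# Solve the normal equations (X'X) beta = X'y
beta <- solve(t(X) %*% X, t(X) %*% y)

# Same fit through lm()
fit <- lm(y ~ x)

cbind(normalEquations = as.numeric(beta), lm = unname(coef(fit)))
```

The statistical assumptions only enter when you start interpreting these numbers as estimates, likelihoods, or predictions.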

Please read carefully the following from Professor Andrew Gelman (a statistics professor I highly respect):

In section 3.6 of my book with Jennifer we list the assumptions of the linear regression model. In decreasing order of importance, these assumptions are:

- Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. . . .
- Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors . . .
- Independence of errors. . . .
- Equal variance of errors. . . .
- Normality of errors. . . .
Further assumptions are necessary if a regression coefficient is to be given a causal interpretation . . .

Normality and equal variance are typically minor concerns, unless you’re using the model to make predictions for individual data points.

Notice that what to worry about depends on how you intend to use the regression result. Then look at the diversity of comments on the original article. Regression is clearly not a single method with one best set of conditions and one strongest set of consequences. There isn’t a “one true weakest set of assumptions that simultaneously gives sharpest results.”


Recently Hugh Howey shared some eBook sales data spidered from Amazon.com: The 50k Report. The data is largely a single scrape of statistics about various anonymized books. Howey’s analysis tries to break sales down by declared category and source, but there are a lot of difficulties due to the quality of the tags in the data. A lot of the questions we would like to look into (such as do reviews drive sales or sales drive reviews) are not practical unless we had a more longitudinal data set that includes many observations on a repeated set of books over time.

However, we can try to relate one type of reported outcome (sales rank, a number Amazon visibly shares on ebook product pages) and number of sales (a harder to find quantity). Note: we are not really doing any *predictive* modeling as we are not trying to predict future sales from features, but instead we are just trying to learn an approximate relation between two different encodings of outcomes (sales count and sales rank).

We share the steps to convert the Excel data to a usable R format here on GitHub. A quick use of the data is as follows:

```
# download the prepared data and load it into the R session
library('RCurl')
url <- paste('https://raw.github.com/WinVector/',
             'Examples/master/',
             'AmazonBookData/amazonBookData.Rdata', sep='')
load(rawConnection(getBinaryURL(url)))
```

The data is now in a data frame named “`d`”. The crude analysis we want to do is to relate `Kindle.eBooks.Sales.Rank` to `Daily.Units.Sold`. We will do this on “log-log” paper (where famously most anything looks like a line).

```
model <- lm(log(Daily.Units.Sold) ~ log(Kindle.eBooks.Sales.Rank),
            data=d)
d$EstLogUnitsSold <- predict(model, newdata=d)
library('ggplot2')
ggplot(data=d, aes(x=log(Kindle.eBooks.Sales.Rank))) +
  geom_point(aes(y=log(Daily.Units.Sold))) +
  geom_line(aes(y=EstLogUnitsSold))
```

The line fit looks plausible for ebooks in the sales-rank range around 200 through 150,000. Let’s take a quick look at the model:

```
print(model)

Call:
lm(formula = log(Daily.Units.Sold) ~ log(Kindle.eBooks.Sales.Rank),
    data = d)

Coefficients:
                  (Intercept)  log(Kindle.eBooks.Sales.Rank)
                      11.5063                        -0.9334
```

This is roughly saying `Daily.Units.Sold ~ exp(11.5 - 0.93*log(Kindle.eBooks.Sales.Rank))` or (with a little algebra): `Daily.Units.Sold ~ 99339.64 / Kindle.eBooks.Sales.Rank^0.93`.

This isn’t too far from the following easy rule of thumb: `Daily.Units.Sold ~ 100000 / Kindle.eBooks.Sales.Rank`. Applying this we would expect a typical ebook ranked at position 100,000 to sell about 1 copy a day. Now we don’t want to read too much into this, as fitting a line onto log-log paper is a classic example of heavy-handed econometrics (in econometrics you often force the structure of the results by model selection; see “Bad models and the end of the world” for some enjoyable vitriol on abuses of the idea).
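A quick way to see how far the rule of thumb sits from the fitted power law is to evaluate both at a few ranks. In this sketch the coefficients are copied from the printed model above; the helper names (`fittedUnits`, `ruleOfThumb`) and the chosen ranks are assumptions for illustration.

```r
# Coefficients as printed by the model above
intercept <- 11.5063
slope <- -0.9334

# Fitted power law and the simpler rule of thumb
fittedUnits <- function(rank) exp(intercept + slope * log(rank))
ruleOfThumb <- function(rank) 100000 / rank

ranks <- c(200, 1000, 10000, 100000, 150000)
cmp <- data.frame(rank = ranks,
                  fitted = fittedUnits(ranks),
                  ruleOfThumb = ruleOfThumb(ranks))
cmp$ratio <- cmp$fitted / cmp$ruleOfThumb
cmp

exp(intercept)  # the multiplier in the power-law form
```

The `ratio` column makes the size of the approximation explicit across the range where the line fit looked plausible.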

However this rule of thumb is consistent with Chris Anderson’s point in The Long Tail. The fact we see a plausible power law over a large range means we can (crudely) estimate the entire expected sales of an infinite sized catalog as: `sum_{rank=1...infinity} 99339.64 rank^pow`. In our case `pow=-0.93` which is `≥ -1`: meaning the sum diverges or the total is infinite. If `pow` had been something smaller (like `pow=-2`) then even an infinite catalog would only have a finite total value. But in this case the theory says the ebook distributor can grow their total revenue to just about any level, if they can add enough books cheaply (they don’t get overwhelmed by diminishing revenue returns early).
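The difference between the two exponents shows up clearly in partial sums. This sketch (the helper name `partialSum` and the catalog sizes are assumptions) drops the constant multiplier and sums `rank^pow` over growing catalogs; for `pow = -2` the infinite sum is known to converge to `pi^2/6`, while for `pow = -0.93` the partial sums keep growing.

```r
# Partial sum of rank^pow over a catalog of n books
partialSum <- function(pow, n) sum((1:n)^pow)

catalogSizes <- c(1e3, 1e4, 1e5, 1e6)
data.frame(catalogSize = catalogSizes,
           powMinus0.93 = sapply(catalogSizes, partialSum, pow = -0.93),
           powMinus2 = sapply(catalogSizes, partialSum, pow = -2))

pi^2 / 6  # limit of the pow = -2 sum, for comparison
```

The growth for `pow = -0.93` is slow, so "infinite" here means a catalog has to become very large before the tail adds much; the divergence is a statement about the limit, not about any practical catalog size.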

Amazon clearly wants the large revenue found in the popular (or “head”) books, but you can see that it is plausible they will always have more opportunity to grow their business by increasing coverage (and handling) of many less popular products (the so-called “long tail”). Not a new observation, but fun to be able to pull it quickly from shared data.

(Funny side note. This sort of analysis can be stretched to say that the expected lifetime sales of any book that stays in print forever is infinite. This argument only works *if* cumulative sales rank has an exponent of `-1` or larger (and Amazon seems to be using a recent sales rank, so we don’t actually have any estimate for the exponent of cumulative sales rank). Suppose our book starts at rank-A and each day k more books are written and they all are more popular than our book. Then the modeled total unit sales of our book is `sum_{rank=A,A+k,A+2k...infinity} 100000/rank` which also diverges (though would stay bounded if we added a reasonable discount term for future value). Mostly we are showing you can push these analyses way too far; to get better results you need to correctly model more of the market.)

In our new book (Practical Data Science with R) we didn’t get into the lack of pointers for a purely didactic reason. To tell a general audience (perhaps one new to scripting or programming) that they don’t need to know about pointers, we would have to first explain what pointers are (somewhat losing the cognitive savings). We settled for demonstrating R’s (primarily) call by value semantics for functions (which we already needed to explain) with the following example:

```
> vec <- c(1,2)
> fun <- function(v) { v[[2]]<-5; print(v)}
> fun(vec)
[1] 1 5
> print(vec)
[1] 1 2
```

Notice how the mutation (changing an entry to 5) does not escape the function as a side effect. Because R is a bit of a kitchen sink (everything and its opposite is pretty much available) we had to cautiously title this example as “R behaves like a call-by-value language” in our book (R in fact has a number of sharable reference structures including `environment`s, `ReferenceClasses`, lazy evaluation systems like promises/`delayedAssign`, and more). (The ugly `[[]]` notation is something we recommend as it catches a few more errors than the more common `[]` notation. For details please see appendix A of our book.)

What we didn’t discuss is that you get this sort of change isolation and safety in R in just about every situation (not just when binding values to function arguments). Here is another example (this time not from the book):

```
> vec <- c(1,2)
> v2 <- vec
> v2[[2]] <- 5
> print(v2)
[1] 1 5
> print(vec)
[1] 1 2
```

Unlike many languages the assignment “`v2 <- vec`” does not end up with `vec` and `v2` as references (or pointers) entangled to the same object. Instead they behave as if they are two different objects. This does prevent using these two symbols to communicate results (a legitimate programming practice) but it also prevents a whole host of errors and confusions that beginning programmers run into in the presence of such *shared mutability*. R protects the programmer by treating objects directly without exposing the additional ideas of references or pointers. Many ideal functional programming languages more directly expose references but mitigate their danger by insisting on immutable structures; but this requires the user to learn (in addition to data handling, statistics and programming) the fairly alien discipline of composing immutable data structures.
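For contrast, here is a minimal sketch of what shared mutability looks like in R when you do opt into a reference structure. It uses an `environment` (one of the sharable structures mentioned earlier); the variable names are illustrative.

```r
# Vectors behave like values: the copy is independent
vec <- c(1, 2)
v2 <- vec
v2[[2]] <- 5
vec[[2]]   # still 2

# Environments behave like references: both names see the change
e1 <- new.env()
assign("x", 1, envir = e1)
e2 <- e1
assign("x", 5, envir = e2)
get("x", envir = e1)   # now 5: the change made through e2 is visible via e1
```

This is the entanglement that ordinary R values spare you from, and the reason the reference structures are best treated as an advanced tool.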

We encourage beginning programmers to think of programs as organizing sequences of transformations over data. So the simpler (and fewer) the mutations are, the easier it is to reason about programs. When you program in R you are mostly working with values and not variables (which is good, as it leaves you more time to think about data). So, as much as we complain about R, it is in fact a good choice for teaching, analysis, data science and even basic scripting tasks.

However, you do eventually have to deal with the unpleasant details of side-effects and shared mutability. One place where R doesn’t hide the sharp edges from you is in *closures* (the structure R uses to represent the context of a function). Consider the following code puzzle: what gets printed by the code below?

```
# make an array of 3 functions
f <- vector('list',3)
# set the i'th function to return i
for(i in 1:length(f)) {
  f[[i]] <- function() { i }
}
# apply the functions using a different loop variable
for(j in 1:length(f)) {
  print(f[[j]]())
}
```

Note this is one place where you really do need to use the uglier `[[]]` notation. In the current version of R (3.0.2) if you try to use `[]` you get the error message “cannot coerce type ‘closure’ to vector of type ‘list’.” But the puzzle is: what do you expect to be printed? If R were binding the value of `i` into the `i`’th function you would expect to see the sequence “1,2,3.” Instead each function in fact gets its value for `i` by using what is current in its capture of the evaluation environment. So this code in fact prints “3,3,3”, as this is the value `i` has after the first loop is finished. This is unfortunate, as a lot of productive programming patterns depend on capturing safe isolated values, not capturing entangled references.

This sort of puzzle may seem unpleasant and unnatural, but when pointers (and other sort of shared references) are involved you are forced to solve this sort of puzzle to understand the meaning or semantics of a code fragment or program. It is because these puzzles are laborious that languages like R emphasize isolation, so there is much less to worry about when you try to compose useful data transformations.

Closures and environments are very powerful tools (many of R’s features are built in terms of them). And this common shared mutability of them is a huge source of confusion in many programming languages (Javascript also has this issue, and Java only allows closures to capture final variables to try and cut down on some of the possible interference). To get the behavior we want (each function capturing the current value of `i` in its closure and not sharing a common reference) we can write the following code:

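```
f <- vector('list',3)
for(i in 1:length(f)) {
  f[[i]] <- function() { i }
  # build a fresh environment holding a private copy of the current i
  e <- new.env()
  assign('i',i,envir=e)
  # make the function look up i in that private environment
  environment(f[[i]]) <- e
}
for(j in 1:length(f)) {
  print(f[[j]]())
}
```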

And this prints 1,2,3 as we would hope. Note we are now in *very* deep programming ground (closures being at least as confusing to beginners as pointers) and no longer even thinking about data. We have to admit: we really counted to 3 the hard way.
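A shorter way to get the same isolation, offered here only as a sketch of a common R idiom and not as the method from the book, is a function factory: each call to the factory creates its own environment, and `force()` pins the argument's value at that moment. The helper name `make_const` is hypothetical.

```r
# hypothetical helper: build a function that returns a fixed value
make_const <- function(i) {
  force(i)            # evaluate the argument now, so the value is pinned
  function() { i }    # the returned closure sees only this call's i
}

# each call to make_const gets its own private environment
f <- lapply(1:3, make_const)
for(j in 1:length(f)) {
  print(f[[j]]())
}
```

Each function now carries its own copy of `i`, so the loop prints 1, 2 and 3 without any explicit environment manipulation.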

It took a little longer than we’d hoped, but we did it! *Practical Data Science with R* will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in all three formats: PDF, ePub, and Kindle.

If you haven’t yet, order it now!

(softbound 416 pages, black and white; includes access to color PDF, ePub and Kindle when available)


Let’s work with a simple but very common example. You are asked to build a classification engine for a rare event: say default in credit card accounts. In good times for well managed accounts it is easy to imagine the default rate per year could be well under 1%. In this situation you do not want to propose predicting which accounts will actually default in a given year. This may be what the client asks for, but it isn’t reasonable to presume this is always achievable. You need to talk the client out of a business process that requires perfect prediction and work with them to design a business process that works well with reasonable forecasting.

Why is such prediction hard? Usually prediction in these situations is hard because, while you usually have access to a lot of broad summary data for each account (net-worth, age, family size, number of years account has been active, patterns of borrowing, amount of health insurance, amount of life insurance, patterns of re-payment and so on), you usually do not have access to many of the factors that trigger the default. Even when you do, such variables are not available very long before the event to be predicted. Trigger events for default can include sudden illness, physical accident, falling victim to a crime and other acute set-backs. The point is: two families without health insurance may have an equally elevated probability of credit default, but until you know which family gets sick you don’t know which one is much more likely to default.

Why does everybody ask for prediction? First: good prediction would be fantastic, if they could get it. Second: most laymen have no familiar notion of classifier quality other than accuracy (and measures similar to accuracy). And if all you know is accuracy then all you are prepared to discuss is prediction. So the client is really unlikely to ask to optimize a metric they are unfamiliar with. The measures that help get you out of this rut are statistical deviance and information theoretic entropy; so you will want to start hinting at these measures early on.

How do we show the value of achievable forecasting? For this discussion we define forecasting credit default as the calculation of a good conditional probability estimate of credit default. To evaluate forecasts we need measures beyond accuracy: measures that directly evaluate scores (without having to set a threshold to convert scores into predictions).

Back to our example. Suppose that in our population we expect 1% of the accounts to default. And we build a good forecast or scoring procedure that for 2% of the population returns a score of 0.3 and for the remaining 98% of the population returns a score near 0.01. Further suppose our scoring algorithm is well calibrated and excellent: the 2% of the population that it returns a score of 0.3 and above on actually tends to default at a rate of 30%.

Such a forecast identifies a 2% subset of the population that has a 30% chance of defaulting. Treated as a classifier it never says “yes” because it has not identified any examples that are estimated to have at least a 50% chance of defaulting (obviously we can force it to say “yes” by monkeying with scoring thresholds). So the classifier is not a silver bullet predictor. But it may (when backed with the right business process) be a fantastic forecaster: the subset it identifies is only 2% of the overall population yet has 60% of the expected defaults! Designing procedures to protect the lender from these accounts (insurance, cancellation, intervention, tighter limits, tighter payment schedule or even assistance) represents a potential opportunity to manage more than half of the lender’s losses at minimal cost. To benefit, the client must both be able to sort or score accounts and have a business process that is not forced to treat all accounts as identical.
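The arithmetic behind the 60% figure can be sketched in a few lines of R. The variable names are ours. The last two quantities (the rate implied for the unflagged accounts and the accuracy of the “never says yes” classifier) are derived under the assumption that the overall default rate is exactly 1%.

```r
base_rate    <- 0.01  # overall default rate (from the example)
flagged_frac <- 0.02  # fraction of accounts given the high score
flagged_rate <- 0.30  # default rate among the flagged accounts

# share of all expected defaults that fall in the flagged group
share_of_defaults <- flagged_frac * flagged_rate / base_rate
# lift: how much more likely a flagged account is to default than average
lift <- flagged_rate / base_rate
# default rate implied for the unflagged accounts (assumes overall rate is exactly 1%)
rest_rate <- (base_rate - flagged_frac * flagged_rate) / (1 - flagged_frac)
# accuracy of a classifier that never says "yes"
accuracy_never_yes <- 1 - base_rate

print(c(share = share_of_defaults, lift = lift,
        rest = rest_rate, accuracy = accuracy_never_yes))
```

The flagged group carries 60% of the expected defaults at a lift of 30, while the classifier that never says “yes” still scores 99% accuracy. That gap is why accuracy alone hides the value of the forecast.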

As we have said: laymen tend to only be familiar with accuracy. And accuracy is not a good measure of forecasts (see: “I don’t think that means what you think it means;” Statistics to English Translation, Part 1: Accuracy Measures). What you need to do is shop through metrics before starting your project and find one that is good for your client. Finding a metric that is good for your client involves helping them specify how classifier information will be used (i.e. you have to help them design a business process). Some types of scores to try with your client include: lift, precision/recall, sensitivity/specificity, AUC, deviance, KL-divergence and log-likelihood.
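To make the contrast between accuracy and a score-based metric concrete, here is a small sketch on a synthetic population of 10,000 accounts built to match the running example. The counts, the unflagged score of 40/9800, and the helper names `accuracy` and `deviance_score` are all assumptions for illustration, not data from a real portfolio.

```r
# synthetic population: 200 flagged accounts, 9800 unflagged (assumed counts)
flagged <- c(rep(TRUE, 200), rep(FALSE, 9800))
default <- c(rep(TRUE, 60), rep(FALSE, 140),    # 30% of the flagged accounts default
             rep(TRUE, 40), rep(FALSE, 9760))   # the remaining 40 defaults

# two candidate forecasts: the scoring procedure and a constant base-rate score
score_forecast <- ifelse(flagged, 0.3, 40/9800)
score_constant <- rep(mean(default), length(default))

# accuracy after thresholding scores at 0.5
accuracy <- function(score, y) { mean((score >= 0.5) == y) }
# deviance: -2 times the log-likelihood of the outcomes under the scores
deviance_score <- function(score, y) {
  -2 * sum(ifelse(y, log(score), log(1 - score)))
}

print(c(accuracy(score_forecast, default), accuracy(score_constant, default)))
print(c(deviance_score(score_forecast, default), deviance_score(score_constant, default)))
```

Both forecasts have identical accuracy (neither ever crosses the 0.5 threshold), but the deviance of the scoring procedure is clearly lower, which is the kind of difference a score-based metric lets you show a client.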

Time spent researching and discussing these metrics with your client is more valuable to the client than endless tweaking and tuning of a machine learning algorithm.

For more on designing projects around good data science project metrics please see Zumel, Mount, “Practical Data Science with R” Chapter 5 Choosing and Evaluating Models which discusses many of the above metrics.


“Practical Data Science with R” started formal work in October of 2012. We had always felt the Win-Vector blog represented practice and research for such an effort, but this is when we started outlining a concrete book proposal. Most of a book proposal is specifying and limiting scope down to something that has a coherent point of view.

By May 2013 we had three chapters written and were able to launch the MEAP (Manning Early Access Program, where chapter drafts are shared with subscribers). By December 2013 the book was “content complete” (everything had been written and was accepted by initial editors and technical reviewers). Even though a lot of work had gone into writing, editing and technical review (see On writing a technical book) the pace actually picked up at this point.

We continue working with additional formal technical reviewers, proof editors, copy editors, indexers, graphic artists, layout specialists, QA readers and many more to give the book what one editor called “the sparkle the book deserves.” The MEAP now has all chapters available to subscribers, though even subscribers will not see a great number of the fixes and improvements until the final book is released.

But let’s get down to some of the numbers produced in the process of writing the book.

- Final chapter count: 11 (one chapter got moved to the appendixes).
- Page count: 416 (softbound black and white).
- Number of figures: 159.
- Number of words: about 130,000.
- Size of book text: 1.8MB.
- Number of git commits in book text repository: 742.
- Number of example code extracts: 274 (about 1.1MB).
- Size of example support site: 100MB.
- Number of git commits in example repository: 151.
- Number of book related emails in my email folder: 968.

We (Nina, myself and Manning Publications Co.) have put a *lot* into this book to make it easier for readers to get a lot out of it. We can’t wait to put it in your hands.

Just for the fun: the cover page of a book I very much respect that got me thinking about counting things.

We normally don’t write about science here at Win-Vector, but we do sometimes examine the statistics and statistical methods behind scientific announcements and issues. NASA’s new technique is a cute and relatively straightforward (statistically speaking) approach.

From what I understand of the introduction to the paper, there are two ways to determine whether or not a planet candidate is really a planet: the first is to confirm the fact with additional measurements of the target star’s gravitational wobble, or by measurements of the transit times of the apparent planets across the face of the star. Getting sufficient measurements can take time. The other way is to “validate” the planet by showing that it’s highly unlikely that the sighting was a false positive. Specifically, the probability that the signal observed was caused by a planet should be at least 100 times larger than the probability that the signal is a false positive. The validation analysis is a Bayesian approach that considers various mechanisms that produce false positives, determines the probability that these various mechanisms could have produced the signal in question, and compares them to the probability that a planet produced the signal.

The basic idea behind verification by multiplicity is that planets are often clustered in multi-planet star systems, while false positive measurements (mistaken identification of potential planets) occur randomly. Putting this another way: if false positives are random, then they won’t tend to occur together near the same star. So if you observe a star with multiple “planet signals,” it’s unlikely that all the signals are false positives. We can use that observation to quantify how much more likely it is that a star with multiple candidates actually hosts a planet. The resulting probability can be used as an improved prior for the planet model when doing the statistical validation described above.

You can read the rest of the article here.


The most commonly reported significance is the frequentist p-value. Formally the p-value is the probability a repeat of the current experiment would show an effect at least as large as the current one, assuming the null hypothesis that there is in fact no effect present. This is frequentist because we are assuming an unknown fixed state of the world and variation in the possibility of alternative or repeated experiments. The issue is: significance tests are neither as simple as one would like nor as powerful as one would hope. Usually significance is misstated (either through sloppiness, ignorance, or malice) as being the chance the given result is false. Wrongly rejecting the null hypothesis is only one possible source of error, so a low p-value is a necessary but in no way sufficient condition for having a good result. False positives of this sort are not reproducible and show what is called reversion to mediocrity.

The Bayesian version of such a test would assume a prior distribution of the unknown quantity and hope to infer a low posterior probability on the “no effect” alternative. This leads to a similar calculation as the frequentist one, but with the ability to interpret a low probability of mistake as a high probability of success. An issue with the Bayesian analysis is you must supply priors, so your conclusion is dependent on and sensitive to your choice of priors (another possible avenue of abuse).

At best what a p-value represents is the degree of filtering the experiment has (under ideal conditions) against non-results. Run 100 experiments at p=0.05 and you expect to see about 5 results that *appear to be* good, even if there was in fact no improvement to be measured. This is unfortunately standard practice for many. It is not enough to work hard on many projects and report your good results, see: “Why Most Published Research Findings Are False” John P A Ioannidis. Plos Med, 2005 vol. 2 (8) p. e124; and “Does your model weigh the same as a Duck?” Ajay N Jain and Ann E Cleves, J Comput Aided Mol Des, 2011 vol. 26 (1) pp. 57-67. Also shotgun style A/B testing of pointless variations is particularly problematic (see “Most winning A/B test results are illusory” Martin Goodson, qubitproducts.com, 2014). Projects like 41 blues are not only bad design, they are likely bad science.
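A quick simulation makes the filtering point tangible. This is only a sketch: the seed, the sample size of 50 per group and the choice of a two-sample t-test are our assumptions. Every simulated “experiment” compares two groups drawn from the same distribution, so any apparent win is a false positive.

```r
set.seed(2014)           # arbitrary seed so the sketch is repeatable
n_experiments <- 100
alpha <- 0.05

# each experiment compares two samples with no true difference between them
p_values <- replicate(n_experiments,
                      t.test(rnorm(50), rnorm(50))$p.value)

print(sum(p_values < alpha))            # number of apparent "wins" in this run
print(n_experiments * alpha)            # expected number of false positives
print(1 - (1 - alpha)^n_experiments)    # chance of at least one false positive
```

The count of apparent wins varies from run to run, but with 100 null experiments the chance of seeing at least one is better than 99%.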

Combine a large number of bad hypotheses with the impossibility of “accepting the null hypothesis” and you have no reason to believe any result through mere first report. The issue being: while only 5% of the tests run falsely appear to succeed, if mostly useless experiments are run it can easily be that nearly 100% of what gets published and acted on are false results. A stream of nonsense can drown out and hide more expensive and rare actual good work, if your filter is sloppy enough. Add in the inability to reproduce results and you have a large problem.
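How a 5% filter can pass mostly false results is a short Bayes-rule calculation. The sketch below assumes a fraction `prior_good` of submitted ideas really work and that a real effect is detected with probability `power`. Both numbers, and the helper name `false_share`, are hypothetical.

```r
# fraction of "significant" results that are false, by Bayes' rule
false_share <- function(prior_good, power, alpha) {
  false_pos <- alpha * (1 - prior_good)   # useless ideas that pass the test
  true_pos  <- power * prior_good         # good ideas that pass the test
  false_pos / (false_pos + true_pos)
}

# confirming ideas that mostly have good reasons to work
print(false_share(prior_good = 0.5, power = 0.8, alpha = 0.05))
# shotgun testing where only 1 idea in 1000 is any good
print(false_share(prior_good = 0.001, power = 0.8, alpha = 0.05))
```

With half the ideas good, under 6% of the passing results are false. With one good idea in a thousand, over 98% of them are, even though the threshold never changed.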

Two questions we want to comment on: why would a researcher submit a bunch of bad work to testing, and is there an easy fix?

Why are unsubstantiated work and ideas submitted for testing? Ideally testing is a means of scientific confirmation: you submit an idea that has good reasons to work in principle and then confirm the improvement in performance with a test. In fact to correctly design an A/B test you must propose the smallest difference you expect to detect. The reasons you get so many meaningless changes submitted as meaningful experiments are varied. First, A/B testing has been sold as a way to avoid bike shedding (avoiding the debate of meaningless differences by attempting to test meaningless differences). Also you get what you reward: if there is a benefit (getting a publication or bonus) for having the appearance of a good result, then you will eventually only get results that merely appear to be good. Once people figure out the appearance of success is rewarded, your field becomes dominated by shotgun studies (proposing many useless variations is easier than inventing a plausible improvement) using a fixed p-threshold (p=0.05, because you are not traditionally allowed to get away with p any higher and p any lower just makes it take longer to appear to succeed).

There is an easy fix: apply the Bonferroni correction. This is just a fancy way of saying: if we allow somebody to submit 10 ideas to test and report success if any of them look good, then we need to tighten the test criterion. If we are convinced that p=0.05 is a valid threshold for a single test (which should not be automatic; just because everybody uses p=0.05 doesn’t mean you should) then we should force somebody submitting 10 tests to run each test at p=0.005 to try and compensate for their venue shopping. A possible Bayesian adjustment would be to force the prior estimate of the probability of success to fall linearly in the number of experiments run.
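The 10-test example can be sketched in R as follows. The p-values in `p_raw` are made up for illustration, and the two family-wise error rates assume the tests are independent. `p.adjust` is R's built-in adjustment function; with `method = 'bonferroni'` it multiplies each p-value by the number of tests (capping at 1), which is equivalent to dividing the threshold.

```r
alpha <- 0.05
n_tests <- 10

print(alpha / n_tests)                     # per-test threshold: 0.005
print(1 - (1 - alpha)^n_tests)             # chance of some false win, uncorrected
print(1 - (1 - alpha / n_tests)^n_tests)   # chance of some false win, corrected

# hypothetical p-values from 10 submitted ideas
p_raw <- c(0.003, 0.02, 0.04, 0.11, 0.2, 0.35, 0.5, 0.62, 0.8, 0.95)
p_adj <- p.adjust(p_raw, method = 'bonferroni')

print(sum(p_raw < alpha))   # ideas that look good without correction
print(sum(p_adj < alpha))   # ideas that survive the correction
```

Without correction the chance of at least one spurious success across 10 null tests is about 40%; with the corrected threshold it drops back to just under 5%. In this made-up batch, three ideas pass the raw threshold but only one survives.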

Tests are filters. What p-value you should use is not set in stone at p=0.05. It depends on your prior model of the distribution of items you are going to test (are we confirming experiments thought to work, or are we running through a haystack looking for a rumored needle?) and your estimates of the relative costs of type-1 versus type-2 errors (is this an early screen where false negatives are to be avoided, or a final decision where false positives are to be avoided?). With a good loss model and prior estimates it is mere arithmetic to pick an optimal p-value.
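As a rough illustration of that arithmetic, the sketch below picks the threshold that minimizes expected loss on a grid. Everything in it is an assumption made for the example: a one-sided z-test, a real effect worth `effect` standard errors, the prior fraction `prior_good` of ideas that work, the two error costs, and the helper names `expected_loss` and `best_alpha`.

```r
# expected loss per experiment for a one-sided z-test run at threshold alpha
expected_loss <- function(alpha, prior_good, effect, cost_fp, cost_fn) {
  # power: chance a real effect (in standard-error units) clears the threshold
  power <- 1 - pnorm(qnorm(1 - alpha) - effect)
  cost_fp * (1 - prior_good) * alpha +    # type-1: useless idea accepted
    cost_fn * prior_good * (1 - power)    # type-2: good idea rejected
}

# search a grid of candidate thresholds for the lowest expected loss
best_alpha <- function(prior_good, effect, cost_fp, cost_fn) {
  grid <- seq(0.001, 0.5, by = 0.001)
  loss <- expected_loss(grid, prior_good, effect, cost_fp, cost_fn)
  grid[which.min(loss)]
}

# confirming ideas thought to work versus searching a haystack
print(best_alpha(prior_good = 0.5,  effect = 2.5, cost_fp = 1, cost_fn = 1))
print(best_alpha(prior_good = 0.01, effect = 2.5, cost_fp = 1, cost_fn = 1))
```

Under these assumptions the haystack search calls for a far stricter threshold than the confirmation setting, and neither answer has any particular reason to land on 0.05.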

Experimental design and significance encompass the whole experimental process. To calculate correct significances you must include facts about many experiments, not just a given single experiment. You must think in terms of actual probability of correctness, not mere procedures.
