Accumulating wheat (Photo: Cyron Ray Macey, some rights reserved)

In this latest “R as it is” (again in collaboration with our friends at Revolution Analytics) we will quickly become expert at efficiently accumulating results in R.

A number of applications (most notably simulation) require the incremental accumulation of results prior to processing. For our example, suppose we want to collect rows of data one by one into a data frame. Take the `mkRow` function below as a simple example source that yields a row of data each time we call it.

```
mkRow <- function(nCol) {
  x <- as.list(rnorm(nCol))
  # make row mixed types by changing first column to string
  x[[1]] <- ifelse(x[[1]]>0, 'pos', 'neg')
  names(x) <- paste('x', seq_len(nCol), sep='.')
  x
}
```

The obvious “`for`-loop” solution is to collect or accumulate many rows into a data frame by repeated application of `rbind`. This looks like the following function.

```
mkFrameForLoop <- function(nRow, nCol) {
  d <- c()
  for(i in seq_len(nRow)) {
    ri <- mkRow(nCol)
    di <- data.frame(ri,
                     stringsAsFactors=FALSE)
    d <- rbind(d, di)
  }
  d
}
```

This would be the solution most familiar to many non-R programmers. The problem is: in R the above code is incredibly slow.

In R most common objects are immutable and can not change. So when you write an assignment like “`d <- rbind(d,di)`” you are *usually* not actually adding a row to an existing data frame, but constructing a new data frame that has an additional row. This new data frame replaces your old data frame in your current execution environment (R execution environments are mutable, to implement such changes). This means that to accumulate or add `n` rows incrementally to a data frame, as in `mkFrameForLoop`, we actually build `n` different data frames of sizes `1,2,...,n`. Since we do the work of copying each row of each intermediate data frame (in R data frame columns can potentially be shared, but not rows), we pay the cost of processing `n*(n+1)/2` rows of data. So: *no matter how expensive creating each row is*, for large enough `n` the time wasted re-allocating rows (again and again) during the repeated `rbind`s eventually dominates the calculation time. For large enough `n` you are wasting most of your time in the repeated `rbind` steps.
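The `n*(n+1)/2` count is just the arithmetic series `1 + 2 + ... + n`: the i-th `rbind` re-copies all `i` rows of the new frame. A one-line sanity check of that count (our own sketch, not part of the original timing code):

```
# total rows processed when growing a frame one row at a time:
# the i-th rbind copies all i rows of the frame it builds
rowsProcessed <- function(n) {
  sum(as.numeric(seq_len(n)))
}
rowsProcessed(100)  # 100*101/2 = 5050 rows processed for 100 rows kept
```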

To repeat: it isn’t just that accumulating rows one by one is “a bit less efficient than the right way for R.” Accumulating rows one by one becomes arbitrarily slower than the right way (which should only need to manipulate `n` rows to collect `n` rows into a single data frame) as `n` gets large. Note: it isn’t that beginning R programmers don’t know what they are doing; it is that they are designing to the reasonable expectation that a data frame is row-oriented and that R objects are mutable. The fact is R data frames are column-oriented and R structures are largely immutable (despite the syntax appearing to signal the opposite), so the optimal design is not what one might expect.

Given this, how does anyone ever get real work done in R? The answers are:

- Experienced R programmers avoid the `for`-loop structure seen in `mkFrameForLoop`.
- In some specialized situations (where value visibility is sufficiently limited) R can avoid a number of the unnecessary user-specified calculations by actual in-place mutation (which means R can in some cases change things when nobody is looking, so only *observable* object semantics are truly immutable).

The most elegant way to avoid the problem is to use R’s `lapply` (or list apply) function as shown below:

```
mkFrameList <- function(nRow, nCol) {
  d <- lapply(seq_len(nRow), function(i) {
    ri <- mkRow(nCol)
    data.frame(ri,
               stringsAsFactors=FALSE)
  })
  do.call(rbind, d)
}
```

What we did is take the contents of the `for`-loop body and wrap them in a function. This function is then passed to `lapply`, which creates a list of rows. We then batch-apply `rbind` to these rows using `do.call`. It isn’t that the `for`-loop is slow (which many R users mistakenly believe); it is that the *incremental* collection of results into a data frame is slow, and that is one of the steps the `lapply` method is avoiding. While you can prefer `lapply` to `for`-loops always for stylistic reasons, it is important to understand when `lapply` is in fact quantitatively better than a `for`-loop (and to know when a `for`-loop is in fact acceptable). In fact a `for`-loop with a better binder such as `data.table::rbindlist` is among the fastest variations we have seen (as suggested by Arun Srinivasan in the comments below; other top contenders are file-based Split-Apply-Combine methods as suggested in comments by David Hood, ideas also seen in Map-Reduce).
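For reference, a sketch of what the `data.table::rbindlist` binding looks like (this assumes the data.table package is installed; the one-row frames below are stand-ins for `mkRow` output):

```
# build a list of one-row data frames, then bind them in one batch
rows <- lapply(seq_len(100), function(i) {
  data.frame(x.1 = ifelse(rnorm(1) > 0, 'pos', 'neg'),
             x.2 = rnorm(1),
             stringsAsFactors = FALSE)
})
d <- as.data.frame(data.table::rbindlist(rows))
```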

If you don’t want to learn about `lapply` you can write fast code by collecting the rows in a list as below.

```
mkFrameForList <- function(nRow, nCol) {
  d <- as.list(seq_len(nRow))
  for(i in seq_len(nRow)) {
    ri <- mkRow(nCol)
    di <- data.frame(ri,
                     stringsAsFactors=FALSE)
    d[[i]] <- di
  }
  do.call(rbind, d)
}
```

The above code still uses a familiar `for`-loop notation and is in fact fast. Below is a comparison of the time (in ms) for each of the above algorithms to assemble data frames of various sizes. The quadratic cost of the first method is seen in the slight upward curvature of its smoothing line. Again, to make this method truly fast replace `do.call(rbind,d)` with `data.table::rbindlist(d)` (examples here).

Execution time (ms) for collecting a number of rows (x-axis) for each of the three methods discussed. Slowest is the incremental for-loop accumulation.

The reason `mkFrameForList` is tolerable is that in some situations R can avoid creating new objects and in fact manipulate data in place. In this case the list “`d`” is not in fact re-created each time we add an additional element, but instead mutated or changed in place.

(edit) The common advice is that we should prefer in-place edits. We tried that, but it wasn’t until we (after getting feedback in the comments below) threw out the data frame class attribute that we got really fast code. The code and latest run are below (but definitely check out the comments following this article for the reasoning chain).

```
mkFrameInPlace <- function(nRow, nCol, classHack=TRUE) {
  r1 <- mkRow(nCol)
  d <- data.frame(r1,
                  stringsAsFactors=FALSE)
  if(nRow>1) {
    d <- d[rep.int(1, nRow), ]
    if(classHack) {
      # lose data.frame class for a while
      # changes what S3 methods implement
      # assignment.
      d <- as.list(d)
    }
    for(i in seq.int(2, nRow, 1)) {
      ri <- mkRow(nCol)
      for(j in seq_len(nCol)) {
        d[[j]][i] <- ri[[j]]
      }
    }
  }
  if(classHack) {
    d <- data.frame(d, stringsAsFactors=FALSE)
  }
  d
}
```

More timings.

Note that the in-place list-of-vectors method is faster than any of `lapply/do.call(rbind)`, `dplyr::bind_rows/replicate`, or `plyr::ldply`. This is despite having nested for-loops (one for rows, one for columns; though this is also why methods of this type can speed up even more if we use `compiler::cmpfun`). At this point you should see: it isn’t the for-loops that are the problem, it is any sort of incremental allocation, re-allocation, and checking.
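To make the `compiler::cmpfun` remark concrete, here is a minimal sketch of byte-compiling a tight-loop function (`mkFrameInPlace` above could be compiled the same way; note that recent R versions JIT-compile functions automatically, so the explicit call matters less than it once did):

```
library(compiler)

# a small function with the same tight-inner-loop shape
fillSquares <- function(n) {
  v <- numeric(n)
  for(i in seq_len(n)) {
    v[i] <- i^2
  }
  v
}

# byte-compiled version: same results, historically faster loops
fillSquaresC <- cmpfun(fillSquares)
stopifnot(identical(fillSquares(10), fillSquaresC(10)))
```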

At this point we are avoiding both the complexity waste (running an algorithm that takes time proportional to the square of the number of rows) and a lot of linear waste (re-allocation, type-checking, and name matching).

However, any in-place change (without which the above code would again be unacceptably slow) depends critically on the list value associated with “`d`” having very limited visibility. Even copying this value to another variable or passing it to another function can break the visibility heuristic and cause arbitrarily expensive object copying.

The fragility of the visibility heuristic is best illustrated with an even simpler example.

Consider the following code that returns a vector of the squares of the first `n` positive integers.

```
computeSquares <- function(n, messUpVisibility) {
  # pre-allocate v
  # (doesn't actually help!)
  v <- 1:n
  if(messUpVisibility) {
    vLast <- v
  }
  # print details of v
  .Internal(inspect(v))
  for(i in 1:n) {
    v[[i]] <- i^2
    if(messUpVisibility) {
      vLast <- v
    }
    # print details of v
    .Internal(inspect(v))
  }
  v
}
```

Now of course part of the grace of R is we never would have to write such a function. We could do this very fast using vector notation such as `seq_len(n)^2`. But let us work with this notional example.
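As a quick sanity check that the loop and `seq_len(n)^2` agree (our sketch, with the `inspect` calls removed):

```
n <- 5
vLoop <- numeric(n)
for(i in seq_len(n)) {
  vLoop[[i]] <- i^2
}
stopifnot(identical(vLoop, seq_len(n)^2))
```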

Below is the result of running `computeSquares(5,FALSE)`. In particular look at the lines printed by the `.Internal(inspect(v))` statements, and at the first field of these lines (which is the address of the value “`v`” refers to).

```
computeSquares(5,FALSE)
## @7fdf0f2b07b8 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,3,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,9,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,9,16,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,9,16,25
## [1]  1  4  9 16 25
```

Notice that the address `v` refers to changes only once (when the value type changes from integer to real). After the one change the address remains constant (`@7fdf0e2ba740`) and the code runs fast, as each pass of the for-loop alters a single position in the value referred to by `v` without any object copying.

Now look what happens if we re-run with `messUpVisibility`:

```
computeSquares(5,TRUE)
## @7fdf0ec410e0 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0d9718e0 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0d971bb8 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,3,4,5
## @7fdf0d971c88 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,9,4,5
## @7fdf0d978608 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,9,16,5
## @7fdf0d9788e0 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,9,16,25
## [1]  1  4  9 16 25
```

Setting `messUpVisibility` causes the value referenced by “`v`” to also be referenced by a new variable named “`vLast`”. Evidently this small change is enough to break the visibility heuristic, as we see the address of the value “`v`” refers to now changes after each update. Meaning we have triggered a lot of expensive object copying. So we should consider the earlier `for`-loop code a bit fragile, as small changes in object visibility can greatly change performance.
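If you don’t want to reach for `.Internal(inspect())`, base R’s `tracemem()` reports the same kind of copies (a sketch of ours; `tracemem` requires an R build with memory profiling, which the standard distributions have, and the exact copy points vary by R version):

```
v <- seq_len(5) + 0.0   # start as REALSXP so no type-change copy occurs
tracemem(v)             # ask R to report copies of this value
v[[2]] <- 4             # on current R, typically altered in place
vLast <- v              # a second reference to the same value
v[[3]] <- 9             # now tracemem reports a copy
untracemem(v)
```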

The thing to remember is: for the most part R objects are immutable. So code that appears to alter R objects is often actually simulating mutation by expensive copies. This is the concrete reason that functional transforms (like `lapply`) should be preferred to incremental or imperative appends.

R code for all examples in this article can be found here (this includes methods like pre-reserving space, and the original vector experiments that originally indicated the object mutation effect).

“ASCII Code Chart-Quick ref card” by Namazu-tron – See above description. Licensed under Public Domain via Wikimedia Commons

Usually when I am working with text, my goals are a bit loftier than messing with individual characters. However, it is a step you have to get right. So you would like working correctly with arbitrary characters to be as easy as possible, as any problems here are mere distractions from your actual goals.

The other day I thought it would be nice to get a list of all the article titles and URLs from the Win-Vector blog. It is a low ambition task that should be easy to do. At some point I thought: I’ll just scan the XML export file, it has all of the information in a structured form. And the obvious Python program to do this fails out with:

```
xml.etree.ElementTree.ParseError:
  not well-formed (invalid token): line 27758, column 487
```

Why is that? The reason is WordPress wrote a document with a suffix “.xml” and a header directive of “`<?xml version="1.0" encoding="UTF-8" ?>`” that is not in fact valid utf-8 encoded XML. Oh, it looks like modern XML (bloated beyond belief and full of complicated namespaces referring to URIs that get mis-used as concrete URLs). But unless your reader is bug-for-bug compatible with the one WordPress uses, you can’t read the file. Heck, I am not even sure WordPress can read the file back in; I’ve never tried it and confirmed such a result. This is the world you get with “fit to finish,” or code written in the expectation of downstream fixes due to mis-readings of Postel’s law.

So the encoding is not in fact XML over utf-8, but some variation of wtf-8. Clearly something downstream can’t handle some character or character encoding. We would like to at least process what we have (and not abort or truncate).

Luckily there is a Python library called unidecode which will let us map exotic (at least to Americans) characters to Latin analogues (allowing us to render Erdős as Erdos instead of the even worse Erds). The Python3 code is here:

```
# Python3, read wv.xml (WordPress export)
# write to stdout title/url lines
import random
import codecs
import unidecode
import xml.etree.ElementTree
import string

# read WordPress export
with codecs.open('wv.xml', 'r', 'utf-8') as f:
    dat = f.read()
# WordPress export full of bad characters
dat = unidecode.unidecode(dat)
dat = ''.join([str(char) for char in dat if char in string.printable])
namespaces = {'wp': "http://wordpress.org/export/1.2/"}
root = xml.etree.ElementTree.fromstring(dat.encode('utf-8'))
items = [item.find('title').text + " " + item.find('link').text
         for item in root.iter('item')
         if item.find('wp:post_type', namespaces).text == 'post']
random.shuffle(items)
for item in items:
    print(item)
```

It only took two extra lines to work around the parse problem (the `unidecode.unidecode()` call followed by the filter down to `string.printable`). But such a simple work-around depends on not actually having to represent your data in a completely faithful and reversible manner (often the case for analysis, hence my strong TSV proposal; but almost never the case in storage and presentation).

Also, it takes a bit of searching to find even this work-around, and it is distracting to have to worry about this when you are in the middle of doing something else. The fix was only quick because we used a pragmatic language like Python, where somebody supplied a library to demote characters to something usable (not exactly ideologically pure). Imagine having to find which framework in Java (as mere libraries tend to be beneath Java architects) might actually supply a function simply performing a useful task.

How did text get so complicated? There are some essential difficulties, but many of the problems are inessential stumbling blocks due to architecting without use cases, and the usual consequence of committees.

A good design works from a number of explicitly stated use cases. Historically strings have been used to:

- store text
- compare text
- sort text
- search text
- compute over text
- manipulate text
- present text

Unicode/UTF tends to be amazingly weak at all of these. Search by regular expressions is notoriously weak over Unicode (try to even define a subset of Unicode that is an alphabet, versus other graphemes). With so many alternative ways to represent things, just forget human-readable collating/sorting, or having normal forms strong enough to support any reasonable notion of comparison/equivalence. It appears to be a research question (or even a political science question) whether you can even convert a Unicode string reliably to upper case.

I accept: a concept of characters and character encoding rich enough to support non-Latin languages is going to be more effort than ASCII was. Manipulating strings may no longer be as simple as working over individual bytes.

We would expect the standards to come with useful advice and reference implementations.

But what actually happens is:

- Perverse “distinctions without differences.” Computer science has a history of these bad ideas (such as Algol 58 which “specified three different syntaxes: a reference syntax, a publication syntax, and an implementation syntax” largely so one member wouldn’t be reduced to using “apostrophe for quote” even though no computer at the time supported both characters). Continuing the tradition Unicode supports all of characters with accents, accents applied to characters, invisible characters, and much more.
- Time wasted with text support for clearly graphical elements such as emoji. We already have container media for mixing text and other elements (HTML being one example), so we do not need to crudely repeat this functionality in the character set.
- Time wasted fending off troll proposals to incorporate fictional languages (such as Klingon) into base Unicode.

Perhaps if we could break Unicode’s back with enough complexity it would die and something else could evolve to occupy the niche. My own trollish proposal would be along the following lines.

Pile on one bad idea too many and make string comparison as hard as a general instance of the Post correspondence problem, and thus undecidable. Unicode/utf-8 is not there yet (due to its current unambiguity, fixed reading direction, and bounded length), but I keep hoping.

The idea is many characters in Unicode have more than one equivalent representation (and these can have different lengths; examples include use of combining characters versus precomposed characters). So, roughly, checking that two sequences of code points represent the same string becomes the following grouping problem:

For a sequence of integers 1…t define an “ordered sequence partition” P as a sequence of contiguous sets of integers (P1,…,Ph) such that:

- Each Pi is a contiguous sequence of integers from the range 1 … t.
- Union_i Pi = {1,…,t}
- max(Pi) < min(Pj) for all 1 ≤ i < j ≤ h.

For two sequences of code points a1,a2,…,am and b1,b2,…,bn checking “string equivalence” is therefore checking if the sequences of integers 1…m and 1…n can be ordered sequence partitioned into A=(A1,…,Au) and B=(B1,…,Bu) such that: for i=1…u the sequences of code points a_{Ai} and b_{Bi} are all valid and equivalent Unicode characters.

What stops this from encoding generally hard problems is the lack of ambiguity in the code-point to character dictionaries, ensuring there is only one partition of each code point sequence such that all elements are valid code-points. Thus we can find the unique partition of each code point sequence using a left-to-right read, and then we just check if the partitions match.
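To make the “left-to-right read” concrete, here is a toy version of that check with an invented, unambiguous, longest-match code dictionary (plainly not real Unicode; the entry names and canonical labels are our own):

```
# toy dictionary mapping code-point groups to canonical characters:
# "e'" (letter plus combining accent) and "E" (precomposed) are equivalent
dict <- list("e'" = "E_ACUTE",
             "E"  = "E_ACUTE",
             "e"  = "E_PLAIN",
             "x"  = "X")

# left-to-right read: repeatedly take the longest dictionary entry
# matching at the current position, emit its canonical label
canonicalize <- function(s) {
  keys <- names(dict)[order(-nchar(names(dict)))]
  out <- character(0)
  i <- 1
  while(i <= nchar(s)) {
    rest <- substring(s, i)
    k <- keys[startsWith(rest, keys)][1]
    out <- c(out, dict[[k]])
    i <- i + nchar(k)
  }
  out
}

stopifnot(identical(canonicalize("e'x"), canonicalize("Ex")))   # equivalent
stopifnot(!identical(canonicalize("ex"), canonicalize("Ex")))   # not equivalent
```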

So all we have to do to successfully encode hard problems is trick the standards committee into introducing a sufficient number of ambiguous groupings (things like code-points “a1 a2 a3 a4” such that both “(a1 a2 a3)” and “(a1 a2)” are valid groupings). This will kill the obvious comparison algorithms, and with some luck we get a code dictionary that will allow us to encode NP-hard problems as string equivalence.

To get undecidable problems we just have to trick the committee into introducing a bad idea I’ll call “fix insertions.” We will say “a1 a2 a3 a4” can be grouped into “(a1 x1 a2) (a3 a4 x2 x3)” by the insertion of the implied or “fix” code-points x1, x2, x3. Then, with some luck, we could build a code dictionary that could encode general instances of the Post correspondence problem and make Unicode string comparison Turing complete (and thus undecidable).

So I think all we need is some clever design (to actually get a dangerously expressive encoding, not just the suspicion there is one), and to stand up a stooge linguist or anthropologist to claim a few additional “harmless lexical equivalences and elisions” (such as leaving vowels out of written script) are needed to faithfully represent a few more languages.

Okay, the last section was a joke (and not even a good joke). Let’s look at what I would really want if text encoding were still on the table.

Unicode is attempting to put everything in one container and thus has become essentially a multimedia format (like HTML). There is no “Unicode light” where you only need to solve the processing problems of one or two languages to get your work done. Unicode is all or nothing: you have to be able to represent everything to represent anything. Frankly I’d like to see a more modular approach where nesting and containment are separate from string/character encoding. A text could be represented as a container of multiple string segments, where each segment is encoded in a single named limited-capability codebook. Things like including a true Hungarian name in English text would be done at the container level, and not at the string/character level.

We have to, as computer scientists, show more discipline in what we do not allow into standards and designs. As Prof. Dr. Edsger W. Dijkstra wrote, we must:

… educate a generation of programmers with a much lower threshold for their tolerance of complexity …

Complexity tends to synergize multiplicatively (not merely additively). And real world systems already have enough essential difficulty and complexity, so we can not afford a lot more unnecessary extra complexity.

Illustration: Boris Artzybasheff

photo: James Vaughan, some rights reserved

**The Example Problem**

Recall that you have a mobile app with both free (A) and paid (B) actions; if a customer’s tasks involve too many paid actions, they will abandon the app. Your goal is to detect when a customer is in a state when they are likely to abandon, and offer them (perhaps through an in-app ad) a more economical alternative, for example a “Pro User” subscription that allows them to do what they are currently doing at a lower rate. You don’t want to be too aggressive about showing customers this ad, because showing it to someone who doesn’t need the subscription service is likely to antagonize them (and convince them to stop using your app).

You want to build a model that predicts whether a customer will abandon the app (“exit”) within seven days. Your training set is a set of 648 customers who were present on a specific reference day (“day 0”); their activity on day 0 and the ten days previous to that (days 1 through 10); and how many days until each customer exited (`Inf` for customers who never exit), counting from day 0. For each day, you constructed all possible windows within those ten days, and counted the relative rates of A events and B events in each window. This gives you 132 features per row. You also have a hold-out set of 660 customers, with the same structure. You can download the wide data set used for these examples as an `.rData` file here. The explanation of the variable names is in the previous post in this series.

In the previous installment, we built a regularized (ridge) logistic regression model over all 132 features. This model didn’t perform too badly, but in general there is more danger of overfitting when working with very wide data sets; in addition, it is quite expensive to analyze a large number of variables with standard implementations of logistic regression. In this installment, we will look for potentially more robust and less expensive ways of analyzing this data.

**The Ideal Case**

Ideally you would know some appropriate window lengths, from understanding of the domain. For instance, if you knew that the trend towards abandonment manifested itself over the course of a month, then weekly or twice-a-week aggregations might be all you need. But perhaps you aren’t entirely sure what the appropriate aggregation windows are. Is there any way of teasing them out?

**Greedy Forward Stepwise Regression**

One way to find the best features is to pick them one at a time: find the one-variable model that optimizes some model quality function, then add another variable that, combined with the first, again optimizes model quality, and so on, until the model “stops improving.” I’ve put that in quotes because in general one stops when the incremental improvement is smaller than some threshold. Because many standard model quality metrics, like R-squared, squared error, or deviance, tend to improve as the number of parameters increases (potentially leading to bias and overfit), standard stepwise regression uses criteria like the AIC or BIC, which attempt to compensate for the complexity of the model. Here (for pedagogical purposes) we will step by hand rather than use R’s `step()` function, simply minimize deviance, and use an ad-hoc procedure for picking an appropriate number of variables.

As before, we’ll use L2-regularized logistic regression as the base model.

```
library(glmnet)

# stepwise ridge regression: add one more variable
# to existing model
#
# xframe: data frame of independent variables
# y: vector of dependent variable
# current_vars: variables in the current model
# current_dev: deviance of current model
# candidate_vars: variables to be potentially added
#                 to model
# Returns:
#   new set of current_vars
#   new current_dev
#   improvement from previous model
add_var = function(xframe, y, current_vars, current_dev, candidate_vars) {
  best_dev = current_dev
  newvar = NULL
  for(var in candidate_vars) {
    active = c(current_vars, var)
    xf = xframe[,active]
    if(length(active) > 1) {
      model = glmnet(as.matrix(xf), y, alpha=0, lambda=0.001,
                     family="binomial")
    } else {
      # glmnet requires > 1 variable
      model = glm.fit(xframe[,active], y, family=binomial(link="logit"))
    }
    moddev = deviance(model)
    if(moddev < best_dev) {
      newvar = var
      best_dev = moddev
    }
  }
  improvement = 1 - (best_dev/current_dev)
  list(current_vars = c(current_vars, newvar),
       current_dev = best_dev,
       improvement = improvement)
}

# stepwise ridge regression: entire loop
#
# data: training data frame
# vars: variables to consider
# yVar: name of dependent variable
# min_improve: terminate when model
#              improvement is less than this
#
# returns final set of variables,
# along with improvements and deviances
stepwise_ridge = function(data, vars, yVar, min_improve=1e-6) {
  current_vars = c()
  candidate_vars = vars
  devs = numeric(length(vars))
  improvement = numeric(length(vars))
  current_dev = null_deviance(data[[yVar]])
  do_continue = TRUE
  while(do_continue) {
    iter = add_var(data, data[[yVar]], current_vars,
                   current_dev, candidate_vars)
    current_vars = iter$current_vars
    current_dev = iter$current_dev
    count = length(current_vars)
    devs[count] = current_dev
    improvement[count] = iter$improvement
    candidate_vars = setdiff(vars, current_vars)
    do_continue = (length(candidate_vars) > 0) &&
      (iter$improvement > min_improve)
  }
  list(current_vars = current_vars,
       deviances = devs,
       improvement = improvement)
}
```

```
# load vars (names of vars), yVar (name of y column),
# dTrainS, dTestS
load("wideData.rData")

# number of candidate variables
length(vars)
## [1] 132

# fix the Infs in the training data
# shouldn't be many of them
isInf = dTrainS$daysToX == Inf
maxfinite = max(dTrainS$daysToX[!isInf])
dTrainS$daysToX[isInf] = maxfinite

# null deviance:
# the deviance of the mean value
# of the y variable
null_deviance(dTrainS[[yVar]])
## [1] 892.3776

# model using all variables
allvar_model = ridge_model(dTrainS[,vars], dTrainS[[yVar]])

# the deviance of the model
# with all variables
deviance(allvar_model)
## [1] 722.1471

# greedy forward stepwise regression
modelparams = stepwise_ridge(dTrainS, vars, yVar)
current_vars = modelparams$current_vars
devs = modelparams$deviances
improvement = modelparams$improvement

# number of variables selected
length(current_vars)
## [1] 27

final_model = ridge_model(dTrainS[,current_vars], dTrainS[[yVar]])
final_model$deviance
## [1] 722.1666

current_vars[1:7]
## "B_1_1" "A_0_0" "B_6_0"
## "B_7_2" "B_9_3" "A_1_0" "A_5_5"
```
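The helpers `null_deviance()` and `ridge_model()` are used above but not shown. Plausible definitions consistent with how they are called would look like the following (our sketch, not necessarily the originals):

```
# deviance of predicting the training mean for every row
# (y assumed to be a 0/1 or logical outcome)
null_deviance <- function(y) {
  p <- mean(y)
  -2*sum(y*log(p) + (1-y)*log(1-p))
}

# L2-regularized ("ridge") logistic model, matching the
# glmnet call used inside add_var()
ridge_model <- function(xframe, y) {
  glmnet::glmnet(as.matrix(xframe), y, alpha=0, lambda=0.001,
                 family="binomial")
}
```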

We can reduce the number of variables from 132 to 27 without substantially increasing the training deviance (recall that large deviance is bad).

If we look at the first few selected variables, we see that the model looks at the rate of B events occurring “yesterday” (`B_1_1`) and compares it with the rate of B events over sliding windows of 6-7 days from today, yesterday, and the day before yesterday. It also looks at the rate of A events from today and yesterday (and 5 days ago). Recall that in this simulated data a customer’s rates of A and B actions stay constant until they switch to the “at-risk” state, at which time their rate of B actions increases to a new constant (see the previous installment) — in other words, there is an edge after which the customer’s B rate is notably higher. Given that knowledge (which we of course wouldn’t have in a real data situation), comparing the current B rate with running averages from the last few days makes sense.

So by simply stepping through the variables that we generated through naive sessionization, we can reduce the number of features to a more tractable number. In fact, we suspect that we can decrease the number of variables even more. Let’s look at how deviance changed as we added variables.

The top plot is deviance as a function of the number of variables; the bottom plot is the improvement from the previous model — kind of the “derivative of the deviance.” After about ten variables, the model improvement leveled off. It’s a folk theorem, when looking at graphs like these (model quality as a function of a parameter), that the optimal value for the parameter occurs at the “elbow” of the model quality graph, or alternatively at either the maximum or elbow of the improvement graph. Which point is the elbow of this deviance graph is a fuzzy question; the improvement graph is easier to read. The maximum is 2 variables; the elbow is 4. There’s an argument to be made for 6 variables, too, so let’s look at all these models, this time on hold-out data.

```
# more reduced models
final2_model = ridge_model(dTrainS[,current_vars[1:2]], dTrainS[[yVar]])
final4_model = ridge_model(dTrainS[,current_vars[1:4]], dTrainS[[yVar]])
final6_model = ridge_model(dTrainS[,current_vars[1:6]], dTrainS[[yVar]])

# Compare all the (non-trivial) models on holdout data
# See https://github.com/WinVector/SessionExample/blob/master/NarrowChurnModel.Rmd
# for the evaluate() function code
rbind(evaluate(allvar_model, dTestS, dTestS[[yVar]], "all variables"),
      evaluate(final_model, dTestS, dTestS[[yVar]], "stepwise run"),
      evaluate(final2_model, dTestS, dTestS[[yVar]], "best 2 variables"),
      evaluate(final4_model, dTestS, dTestS[[yVar]], "best 4 variables"),
      evaluate(final6_model, dTestS, dTestS[[yVar]], "best 6 variables"))
##              label deviance    recall precision  accuracy
## 1    all variables 756.6919 0.7752809 0.7360000 0.7287879
## 2     stepwise run 755.8788 0.7752809 0.7360000 0.7287879
## 3 best 2 variables 769.7035 0.7696629 0.7080103 0.7045455
## 4 best 4 variables 743.1230 0.7921348 0.7540107 0.7484848
## 5 best 6 variables 743.2160 0.7977528 0.7533156 0.7500000
```

The four-variable model dominates all the others on the hold-out data on deviance and precision, and isn’t too far behind the six-variable model on recall and accuracy. This indicates that the model using all the variables was slightly overfitting, as was even the model with 27 variables. For domain reasons, you still might prefer to use the six-variable model — I would feel more comfortable using 3 running average measurements instead of two, and I like having more A rate information in the model. The performance difference between the two models is slight, and 6 variables is still far, far fewer than 132.

Note that we could also have used n-fold cross validation to select the best number of variables.

**Discussion**

This approach isn’t perfect. You still have to generate all the naive sessionization features, and you still have to run through them all, multiple times. However, if M is the number of naive sessionization features, and M is large, then fitting M*k small logistic regression models (where k < M) may still be less expensive than fitting one logistic regression model of size M. Also, if M is so large that you have trouble fitting it in memory (it can happen), you can simply generate each feature on the fly, as needed.

If you really want to, you can cut down the computation a little by not fitting a model to all the current variables at every step; you can freeze the previous model and use its predictions as an offset to the next model (via the `offset` parameter). This means you are only fitting a single-variable model at every iteration. If you do this, it’s a good idea to do one last polishing step at the end by refitting all the selected variables at once.
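A minimal sketch of the offset idea, using base R’s `glm()` rather than the post’s `glmnet` ridge models; the data frame `d`, the column `prevLink`, and the variable names are hypothetical stand-ins, not the code from this post.

```r
set.seed(4622)
# hypothetical data: two numeric predictors and a binary outcome
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- with(d, rbinom(100, 1, plogis(x1 + 0.5*x2)))

# step 1: fit a single-variable model
m1 <- glm(y ~ x1, data = d, family = binomial)

# step 2: freeze m1's predictions (on the link scale) as an offset,
# so the next fit only models the residual signal with x2
d$prevLink <- predict(m1, type = "link")
m2 <- glm(y ~ x2, data = d, family = binomial, offset = prevLink)

# polishing step: refit all selected variables at once
mFinal <- glm(y ~ x1 + x2, data = d, family = binomial)
```

Since the joint refit can only improve the training fit over any of the frozen stage models, the polishing step never hurts in-sample deviance.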

You can interpret freezing the previous model and using it as an offset to the next model as minimizing the “residual deviance” at every iteration. If this sounds familiar, it should: incrementally building up a model by minimizing residual deviance and iterating is one of the basic ideas behind gradient boosting, though the details are different, and gradient boosting usually boosts trees rather than single-variable models. So rather than trying the procedure I just described, why don’t we just try gradient boosting?

**Gradient Boosting with Additive Models**

We’ll use the `gbm()` function from the `gbm` package, with `interaction.depth=1`, since we didn’t use interactions in our logistic regression models.

```
library(gbm)

# wrapper functions for prediction
gbm_predict_function = function(model, nTrees) {
  force(model)
  function(xframe) {
    predict(model, newdata=xframe, type='response',
            n.trees=nTrees)
  }
}

# wrapper function for fitting.
# Returns: a prediction function
#          variable influences
gbm_model = function(dframe, formula, weights=NULL) {
  if(is.null(weights)) {
    nrows = dim(dframe)[1]
    weights = numeric(nrows)+1 # all 1
  }
  modelGBM
```

The function `gbm.perf()` (as we’ve called it in `gbm_model()`) uses cross-validation to pick the optimal number of boosting iterations (trees):

The black curve shows model deviance on training data as a function of the number of iterations; the green curve shows model deviance on holdout. The algorithm selects the point where the holdout deviance begins to increase again: in this case, 83 trees. Since we have set the interaction depth to 1, this is essentially the number of variables in the model.
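A toy illustration of this selection step (synthetic data, not the churn data from this post, so the selected iteration count will differ from the 83 trees above):

```r
library(gbm)
set.seed(2353)
# synthetic binary-outcome data
d <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
d$y <- as.numeric(with(d, x1 + x2 + rnorm(500)) > 0)

m <- gbm(y ~ x1 + x2, data = d,
         distribution = "bernoulli",
         n.trees = 200,
         interaction.depth = 1,   # additive model, as in the post
         shrinkage = 0.05,
         cv.folds = 5)

# picks the iteration where cross-validated deviance bottoms out
nTrees <- gbm.perf(m, method = "cv", plot.it = FALSE)
```

Predictions should then be made with `n.trees=nTrees`, as the `gbm_predict_function()` wrapper above does.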

We can compare the resulting model to our (reduced) stepwise model.

```
# compare to the best ridge model
bestridge_model = final6_model
bestn = 6
rbind(evaluate(bestridge_model, dTestS, dTestS[[yVar]], "best stepwise model"),
      evaluate(modelGBM, dTestS, dTestS[[yVar]], "gbm model, interaction=1"))

##                      label deviance    recall precision  accuracy
## 1      best stepwise model 743.2160 0.7977528 0.7533156 0.7500000
## 2 gbm model, interaction=1 728.4487 0.7387640 0.7758112 0.7439394
```

The gradient boosting model has lower deviance on hold-out, so it’s predicting probabilities better. It’s also more precise, but has lower recall. Unfortunately, if you want to use the gbm model, you still have to use all the features as input, so you lose the variable reduction advantage, not only during model application, but during model fitting — this matters if you can’t get all the features into memory at once.

The summary of a `gbm` model returns the variable influences, which we can use as proxies for variable importance. So you can try the “elbow” trick on the graph of influence versus number of variables, then refit a model using only those variables. I won’t show the graph here, but I decided on 7 variables, not only because it appeared to be an elbow on the influence graph, but also because 7 is nearly the same number of variables as we used in our reduced logistic regression model. The resulting variables are different from those the stepwise procedure selected:

```
##         var   rel.inf
## B_3_0 B_3_0 25.610258
## B_2_0 B_2_0 16.911399
## B_4_0 B_4_0 14.006369
## A_2_0 A_2_0 12.537114
## A_0_0 A_0_0 11.648087
## A_1_0 A_1_0 11.206434
## B_1_0 B_1_0  8.080339
```

The performance of the reduced gbm model is similar to that of the full model, and also similar to the performance of the reduced logistic regression model. Again the gradient boosting models have better deviance, but inferior recall.

```
##                               label deviance    recall precision  accuracy
## best stepwise model                 743.2160 0.7977528 0.7533156 0.7500000
## gbm model, interaction=1            728.4487 0.7387640 0.7758112 0.7439394
## gbm model with best gbm variables   720.9506 0.7303371 0.7784431 0.7424242
```

Recall that in addition to accurate classification, you want the model to identify about-to-exit customers early enough for you to intervene with them. So you also want to compare the reduced stepwise and gradient boosted models for the timeliness of their predictions. Here, we show the distribution of days to exit for all customers who exited within 7 days in the hold-out set (shown as the green bars), along with how many of those customers each model identified (shown as the points with stems).

Both models did a good job identifying customers who will exit “today” or “tomorrow” (perhaps too soon for you to intervene with them), but the stepwise regression model did a little better at early identification of customers who will exit in three to seven days.

**Takeaways**

- Though it is not a complete substitute for domain knowledge, stepwise regression can be a useful tool for teasing out good predictive variables from a very large pool of candidate variables in sessionized data — or in any very wide dataset.
- Since additional variables can artificially improve the training performance of a model, hold-out evaluation is essential when evaluating a model found by stepwise procedures. Either use hold-out data as we did in this post, or use cross-validation measurements like the PRESS statistic.
- Stepwise regression and gradient boosting are fairly related ideas.

You can download the wide sessionized data sets that we used in this post here.

You can download an R markdown script showing all the steps we did in this post (and more) here.

**Next:**

We will continue to explain important steps in sessionization.

One notable exception is log data. Log data is a very thin data form where different facts about different individuals are written across many different rows. Converting log data into a ready for analysis form is called *sessionizing*. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time and of picking a business appropriate goal when evaluating predictive models.

For this article we are going to assume that we have sessionized our data by picking a concrete near-term goal (predicting cancellation of account or “exit” within the next 7 days) and that we have already selected variables for analysis (a number of time-lagged windows of recent log events of various types). We will use a simple model without variable selection as our first example. We will use these results to show how you examine and evaluate these types of models. In later articles we will discuss how you sessionize, how you choose examples, variable selection, and other key topics.

**The Setup**

One lesson of survival analysis is that it is a lot more practical to model the *hazard function* (the fraction of accounts terminating at a given date, conditioned on the account being active just prior to the date) than to directly model account lifetime or account survival. Knowing to re-state your question in terms of hazard is a big step (as is figuring out how to sessionize your data, how to define positive and negative instances, how to select variables, and how to evaluate a model). Let’s set up our example modeling situation.
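To make the hazard idea concrete, here is a small sketch of an empirical hazard estimate on synthetic lifetimes (our own illustration, not the data from this post): hazard(t) is the fraction of accounts terminating at day t, among accounts still active just before day t.

```r
set.seed(52)
# synthetic account lifetimes: geometric, i.e. a constant daily hazard of 0.2
lifetimes <- rgeom(10000, prob = 0.2) + 1

# empirical hazard for days 1..10
hazard <- sapply(1:10, function(t) {
  atRisk <- sum(lifetimes >= t)   # accounts active just before day t
  exits  <- sum(lifetimes == t)   # accounts terminating on day t
  exits / atRisk
})
# for a memoryless process these estimates hover near the constant rate 0.2
```

The point is that the hazard is a per-day conditional rate, which is much easier to model from log-style data than the full lifetime distribution.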

Suppose you have a mobile app with both free (A) and paid (B) actions; if a customer’s tasks involve too many paid actions, they will abandon the app. Your goal is to detect when a customer is in a state when they are likely to abandon, and offer them (perhaps through an in-app ad) a more economical alternative, for example a “Pro User” subscription that allows them to do what they are currently doing at a lower rate. You don’t want to be too aggressive about showing customers this ad, because showing it to someone who doesn’t need the subscription service is likely to antagonize them (and convince them to stop using your app).

Suppose the idealized data is collected in a log-style form, like the following:

```
   dayIndex accountId eventType
1       101  act10000         A
2       101  act10000         A
3       101  act10000         A
4       101  act10000         A
5       101  act10000         A
6       101  act10000         A
7       101  act10000         A
8       101  act10000         A
9       101  act10000         A
10      101  act10003         B
11      101  act10003         A
12      101  act10003         A
13      101  act10003         A
14      101  act10003         A
15      101  act10003         A
16      101  act10003         A
17      101  act10003         A
18      101  act10012         B
```

For every customer, on every day (`dayIndex`, which we can think of as the date), we’ve recorded each action, and whether it’s A or B. In realistic data you’d likely have more information, for example exactly what the actions were, perhaps how much the customer paid per B action, and other details about customer history or demographics. But this simple case is enough for our discussion.

Even just analyzing data of this type raises several issues:

**Ragged vs. uniform use of time when generating training examples**

There are two ways to collect customers to use in the training set:

(1) pick a specific date, say one month ago, select a subset of your customer set from that day, and use those customers’ historical data (say, the last few months’ activity for those customers) as the training set. We’ll call this a *uniform time* training set.

(2) select a subset from the set of all your customers over all time (including some who may not currently be customers), and use their historical data as the training set. We’ll call this a *ragged time* training set.

The first method has the advantage that the training set exactly reflects how the model will be applied in practice: on a set of customers all on the same date. However, it limits the size of your training set, and if abandonment is very rare, then it limits the number of positive examples available for the modeling algorithm to learn from. The second method potentially allows you to build a larger training set (with more positive examples), but it has a number of pitfalls:

- *The prevalence of positive examples in the training set may differ from the prevalence you would observe on a given day.* If the abandonment process in your customer population is *stationary* — it does not change over time and has no trends — then the abandonment rate in a ragged training set will look like the abandonment rate in a uniform training set. It’s unlikely that the abandonment process is stationary (though perhaps it’s nearly so). If you are using a modeling algorithm that is sensitive to class prevalence, like logistic regression, this can cause a problem. A corollary to this observation is that even if you use a uniform training set, you should be prepared to retrain or otherwise update the model at a reasonable frequency, to account for concept drift.

- *Time trends in the features.* Variables may have different meanings in different time periods. In our example, you may have changed the prices of the paid features in your phone app; in another domain, a $200,000 home may mean something different today than it did last year (relative to the median home price in the region, for example). If you are using data from different time periods, you should account for such effects.

You could consider using several uniform time sets: pick a date from last month, one from the month before, and so on. If the abandonment process changes slowly enough, this alleviates the data scarcity issue without affecting the prevalence of positive examples. You may still have issues with time trends in the variables, and you will have duplicated data: many customers from a month ago were also customers two months ago, and so can show up in the data twice. Depending on the domain and your goal, this may or may not matter. Also, you need to be careful that the same customer does not end up in both the training and test sets (see our article on structured test/train splits).

**Defining Positive Examples**

What do you consider a positive example? A customer who will leave tomorrow, within the next week, or within the next year? Predicting abandonment from long range data is nice, but it’s also a noisier problem; someone who will leave a year from now probably looks today a lot like someone who won’t leave in a year. If minimizing false positives is a subgoal (as it is in our example problem), then you might not want to attempt predicting long-range. Hopefully the signals will be stronger the closer a customer gets to abandoning, but you also want to catch them while you still have time to do something about it.

**Picking the Features**

In this example, you suspect that customers abandon your app when they start to access paid features at too high a rate. But what’s too high a rate? Is that measured in absolute terms, or relative to their total app usage? And what’s the proper measurement window? You want to measure their usage rates over a window that’s not too noisy, but still detects relevant patterns in time for the information to be useful.

**The Data**

For this artificial example, we created a population of customers who initially begin in a “safe” state in which they generate events via two Poisson processes, with A events generated at ten times the rate of B events. Customers also have a 10% chance every day of switching to an “at risk” state, in which they begin to generate B events at five times the rate that they did in the “safe” state (they also generate A events at a reduced rate, so that their total activity rate stays constant). Once they are in the “at risk” state, they have a 20% chance each day of exiting (abandoning the app — recorded as state X).

To build a data set, we start with an initial customer population of 1500, let the simulation run for 100 days to “warm up” the population and get rid of boundary conditions, then collect data for 100 more days to form the data set. We also generate new customers every day via a Poisson process with an intensity of 100 customers per day. The expected time for a customer to go into “at risk” is ten days; once they are in the “at risk” state, they stay another five days (in expectation), giving an expected lifetime of fifteen days (of course in reality you wouldn’t know about the internal state changes of your customers). Note that by the way we’ve constructed the population, the lifetime process is in fact stationary and memoryless.
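The state process for a single customer can be sketched in a few lines (the function and parameter names here are our own illustration, not the simulation code used for the post):

```r
set.seed(355)
# one customer's day-by-day state sequence:
# "safe" -> (10%/day) -> "atRisk" -> (20%/day) -> "X" (exit)
simulateCustomer <- function(pAtRisk = 0.1, pExit = 0.2, maxDays = 1000) {
  state <- "safe"
  states <- character(0)
  for (day in seq_len(maxDays)) {
    if (state == "safe" && runif(1) < pAtRisk) state <- "atRisk"
    else if (state == "atRisk" && runif(1) < pExit) state <- "X"
    states <- c(states, state)
    if (state == "X") break
  }
  states
}

# expected lifetime is roughly 1/0.1 + 1/0.2 = 15 days
lifetimes <- replicate(2000, length(simulateCustomer()))
mean(lifetimes)
```

Averaging many simulated customers recovers the fifteen-day expected lifetime described above.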

This is obviously much cleaner data than you would have in real life, but it’s enough to let us walk through the analysis process.

**The Data Treatment**

We chose a uniform time training set: a set of customers present on a reference day (“day 0”) and the ten days previous to that (days 1 through 10), and recorded how many days until each customer exited (`Inf` for customers who never exit), counting from day 0. The hold-out set is of the same structure. We defined positive examples as those customers who would exit within seven days of day 0. Rather than guessing the appropriate sessionizing window length ahead of time, we constructed all possible windows within those ten days, and counted the relative rates of A events and B events in each window. This gave us data sets of approximately 650 rows (648 for training, 660 for hold-out) and 132 features; one row per customer, one feature per window. We’ll discuss how we created the wide data sets from the “skinny” log data in a future post; you can download the wide data set we used as an `.rData` file here.

The resulting data has the following columns:

```
colnames(dTrainS)
  [1] "accountId" "A_0_0"     "A_1_0"
  [4] "A_1_1"     "A_10_0"    "A_10_1"
  [7] "A_10_10"   "A_10_2"    "A_10_3"
 [10] "A_10_4"    "A_10_5"    "A_10_6"
 [13] "A_10_7"    "A_10_8"    "A_10_9"
 [16] "A_2_0"     "A_2_1"     "A_2_2"
 [19] "A_3_0"     "A_3_1"     "A_3_2"
 [22] "A_3_3"     "A_4_0"     "A_4_1"
 [25] "A_4_2"     "A_4_3"     "A_4_4"
 [28] "A_5_0"     "A_5_1"     "A_5_2"
 [31] "A_5_3"     "A_5_4"     "A_5_5"
 [34] "A_6_0"     "A_6_1"     "A_6_2"
 [37] "A_6_3"     "A_6_4"     "A_6_5"
 [40] "A_6_6"     "A_7_0"     "A_7_1"
 [43] "A_7_2"     "A_7_3"     "A_7_4"
 [46] "A_7_5"     "A_7_6"     "A_7_7"
 [49] "A_8_0"     "A_8_1"     "A_8_2"
 [52] "A_8_3"     "A_8_4"     "A_8_5"
 [55] "A_8_6"     "A_8_7"     "A_8_8"
 [58] "A_9_0"     "A_9_1"     "A_9_2"
 [61] "A_9_3"     "A_9_4"     "A_9_5"
 [64] "A_9_6"     "A_9_7"     "A_9_8"
 [67] "A_9_9"     "B_0_0"     "B_1_0"
 [70] "B_1_1"     "B_10_0"    "B_10_1"
 [73] "B_10_10"   "B_10_2"    "B_10_3"
 [76] "B_10_4"    "B_10_5"    "B_10_6"
 [79] "B_10_7"    "B_10_8"    "B_10_9"
 [82] "B_2_0"     "B_2_1"     "B_2_2"
 [85] "B_3_0"     "B_3_1"     "B_3_2"
 [88] "B_3_3"     "B_4_0"     "B_4_1"
 [91] "B_4_2"     "B_4_3"     "B_4_4"
 [94] "B_5_0"     "B_5_1"     "B_5_2"
 [97] "B_5_3"     "B_5_4"     "B_5_5"
[100] "B_6_0"     "B_6_1"     "B_6_2"
[103] "B_6_3"     "B_6_4"     "B_6_5"
[106] "B_6_6"     "B_7_0"     "B_7_1"
[109] "B_7_2"     "B_7_3"     "B_7_4"
[112] "B_7_5"     "B_7_6"     "B_7_7"
[115] "B_8_0"     "B_8_1"     "B_8_2"
[118] "B_8_3"     "B_8_4"     "B_8_5"
[121] "B_8_6"     "B_8_7"     "B_8_8"
[124] "B_9_0"     "B_9_1"     "B_9_2"
[127] "B_9_3"     "B_9_4"     "B_9_5"
[130] "B_9_6"     "B_9_7"     "B_9_8"
[133] "B_9_9"     "daysToX"   "defaultsSoon"
```

The feature columns are labeled by type of event (A or B), the first day of the window, and the last day of the window: so `A_0_0` means “fraction of events that were A events today (day 0)”, `B_8_5` means “fraction of events that were B events from eight days back to five days back” (a window of length 4), and so on. The column `daysToX` is the number of days until the customer exits; `defaultsSoon` is true if `daysToX <= 7`.
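A sketch of how one such window feature could be computed from skinny log data; `logd` and the `windowFrac` helper are our own illustration, not the sessionizing code used for the post.

```r
# hypothetical log data for one customer: how many days back each event
# occurred, and its type
logd <- data.frame(
  daysBack  = c(0, 0, 1, 2, 3, 5, 5, 8),
  eventType = c("A", "B", "A", "A", "B", "A", "B", "B")
)

# fraction of events in the window [first, last] days back that are 'type'
windowFrac <- function(logd, first, last, type = "B") {
  inWin <- logd$daysBack <= first & logd$daysBack >= last
  if (!any(inWin)) return(0)
  mean(logd$eventType[inWin] == type)
}

B_8_5 <- windowFrac(logd, first = 8, last = 5)  # window of length 4
```

Computing this for every (type, first, last) combination is what produces the 132 features above.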

This naive sessionizing can quickly generate very wide data sets, especially if: there are more than two classes of events; if we want to consider wider windows; or if we have several types of log measurements that we want to aggregate and sessionize. You can imagine situations where you generate more features than you have datums (customers) in the training set. In future posts we will look at alternative approaches.

**Modeling**

Principled feature selection (or even better, principled feature generation) before modeling is a good idea, but for now let's just feed the sessionized data into regularized (ridge) logistic regression and see how well it can predict soon-to-exit customers.

```
library(glmnet)

# loads vars (names of vars), yVar (name of y column),
# dTrainS, dTestS
load("wideData.rData")

# assuming the xframe is entirely numeric
# if there are factor variables, use model.matrix
ridge_model = function(xframe, y, family="binomial") {
  model = glmnet(as.matrix(xframe), y,
                 alpha=0, lambda=0.001,
                 family=family)
  list(coef = coef(model),
       deviance = deviance(model),
       predfun = ridge_predict_function(model))
}

# assuming xframe is entirely numeric
ridge_predict_function = function(model) {
  # to get around the 'unfulfilled promise' leak. blech.
  force(model)
  function(xframe) {
    as.numeric(predict(model, newx=as.matrix(xframe),
                       type="response"))
  }
}

model = ridge_model(dTrainS[,vars], dTrainS[[yVar]])
testpred = model$predfun(dTestS[,vars])
dTestS$pred = testpred
```

**Evaluating the Model**

You can plot the distribution of model scores on the holdout data as a function of class label:

The model mostly separates about-to-exit customers from the others, although far from perfectly (the AUC of this model is 0.78). To evaluate whether this model is good enough, you should take into account how the output of the model is to be used. You can use the model as a classifier, by picking a threshold score (say 0.5) to sort the customers into "about to exit" and not. In this case, look at the confusion matrix:

```
dTestS$predictedToLeave = dTestS$pred > 0.5

# confusion matrix
cmat = table(pred=dTestS$predictedToLeave,
             actual=dTestS[[yVar]])
cmat
##        actual
## pred    FALSE TRUE
##   FALSE   205   80
##   TRUE     99  276

recall = cmat[2,2]/sum(cmat[,2])
recall
## [1] 0.7752809

precision = cmat[2,2]/sum(cmat[2,])
precision
## [1] 0.736
```

The model found 78% of the about-to-exit customers in the holdout set; of the customers identified as about-to-exit, about 74% actually did exit within seven days (in other words, about 26% of the flagged customers were false positives).

Alternatively you could use the model to prioritize your customers with respect to who should see in-app ads that encourage them to consider a subscription service. The improvement you can get by using the model score to prioritize ad placement is summarized in the gain curve:

If you sort your customers by model score (decreasing), then the blue curve shows what fraction of about-to-leave customers you will reach, as a fraction of the number of customers you target based on the model's recommendations; the green curve shows the best you can do on this population of customers, and the diagonal line shows what fraction of about-to-leave customers you reach if you target at random. As shown on the graph, if you target the 20% highest-risk customers (as scored by the model), you will reach 30% of your about-to-leave customers. This is an improvement over the 20% you would expect to hit at random; the best you could possibly do targeting only 20% of your customers is about 37% of the about-to-leaves.
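The gain-curve calculation itself is straightforward; here is a sketch on synthetic scores and labels (our own illustration, not the model or data from this post):

```r
set.seed(22)
# synthetic scores and outcomes: higher score, more likely positive
score  <- runif(200)
actual <- rbinom(200, 1, score)

# sort by score, decreasing
ord <- order(score, decreasing = TRUE)
# x-axis: fraction of customers targeted
fracTargeted <- seq_along(ord) / length(ord)
# y-axis: fraction of all positives captured so far
fracReached  <- cumsum(actual[ord]) / sum(actual)

# e.g. fraction of positives captured when targeting the top 20%
gainAt20 <- fracReached[ceiling(0.2 * length(ord))]
```

Plotting `fracReached` against `fracTargeted` gives the blue curve; the diagonal is what random targeting would achieve.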

The confusion matrix and the gain curve help you to pick a trade-off between targeting in-app ads to try to retain at-risk customers, without antagonizing customers who are not at risk by showing too many of them an irrelevant ad.

**Evaluating Utility**

The distribution of days until exit by class label confirms that "risky" (according to the model) customers do in general exit sooner:

But you also want to double-check that the model identifies abandoning customers soon enough. Once the model has identified someone as being at risk, how long do you have to intervene?

```
# make daysToX finite. The idea is that the live-forevers should be rare
isinf = dTestS$daysToX==Inf
maxval = max(dTestS$daysToX[!isinf])
dTestS$daysToX = with(dTestS, ifelse(daysToX==Inf, maxval, daysToX))

# how long on average until flagged customers leave?
posmean = mean(dTestS[dTestS$predictedToLeave, "daysToX"])
posmean
## [1] 5.693333

# how many days until true positives (customers flagged as leaving
# who really do leave) leave?
tpfilter = dTestS$predictedToLeave & dTestS[[yVar]]
trueposmean = mean(dTestS[tpfilter, "daysToX"])
trueposmean
## [1] 2.507246
```

Ideally, you'd like the above distribution to be skewed to the right: that is, you want the model to identify at-risk customers as early as possible. You probably can't intervene in time to save customers who are leaving today (day 0) or tomorrow (you can think of these customers as recall errors from "yesterday's" application of the model). Fortunately, on average this model catches at-risk customers a few days before they leave, giving you time to put the appropriate in-app ad in front of them. Once you put this model into operation, you will further want to monitor the flagged customers, to see if your intervention is effective.

**Conclusion**

For sessionized problems the easiest way to make a “best classifier” is to cheat the customer and try only to predict events right before they happen. This allows your model to use small windows of near-term data and look artificially good. In practice you need to negotiate with your customer how far out a prediction is useful for the customer and build a model with training data oriented towards that goal. Even then you must re-inspect such a model, as even a properly trained near-term event model will have a significant (and low-utility) component given by events that are essentially happening immediately. These “immediate events” are technically correct predictions (so they don’t penalize precision and recall statistics), but are also typically of low business utility as they don’t give the business time for a meaningful intervention.

**Next:**

As mentioned above, you would prefer to have a principled variable selection technique. This will be the topic of our next article in this series.

The R markdown script describing our analysis is here. The plots are generated using our own in-progress visualization package, `WVPlots`. You can find the source code for the package on GitHub, here.

The plot of aggregated door traffic log data shown at the top of the post uses data from Ihler, Hutchins and Smyth, "Adaptive event detection with time-varying Poisson processes", *Proceedings of the 12th ACM SIGKDD Conference (KDD-06)*, August 2006. The data can be downloaded from the UCI Machine Learning Repository, here.

For this article we are assigning two different advertising messages to our potential customers. The first message, called “A”, we have been using a long time, and we have a very good estimate of the rate at which it generates sales (we are going to assume all sales are for exactly $1, so all we are trying to estimate are rates or probabilities). We have a new proposed advertising message, called “B”, and we wish to know: does B convert traffic to sales at a higher rate than A?

We are assuming:

- We know the exact success rate of A events.
- We know exactly how long we are going to be in this business (how many potential customers we will ever attempt to message, or the total number of events we will ever process).
- The goal is to maximize expected revenue over the lifetime of the project.

As we wrote in our previous article: in practice you usually do not know the answers to the above questions. There is always uncertainty in the value of the A-group, you never know how long you are going to run the business (in terms of events or in terms of time, and you would also want to time-discount any far-future revenue), and often you value things other than revenue (such as knowing whether B is greater than A, or maximizing risk-adjusted returns instead of gross returns). This is a severe idealization of the A/B testing problem, but one that will let us solve the problem exactly using fairly simple R code. The solution comes from the theory of binomial option pricing (which is in turn related to Pascal’s triangle).

Yang Hui’s (ca. 1238–1298) version of (Pascal’s) triangle, as depicted using Chinese rod numerals.

For this “statistics as it should be” (in partnership with Revolution Analytics) article let us work the problem (using R) pretending things are this simple.

Abstractly we have two streams of events (“A” events and “B” events). Each event returns a success or a failure (valued at $1 and $0, respectively), and we want to maximize our overall success rate. The special feature of this problem formulation is that we assume we know how long we are going to run the business: there is an n such that the total number of events routed to A (call this a) plus the total number of events routed to B (call this b) satisfies a+b=n.

To make things simple assume:

- There are no time-varying factors (*very* unrealistic; dealing with time-varying factors is one of the reasons you run A and B at the same time).
- All potential opportunities are considered identical and exchangeable.

The usual method of running an A/B test is to fix some parameters (prior distribution of expected values, acceptable error rate, acceptable range of error in dollars per event) and then design an experiment that estimates which of A or B is the most valuable event stream. After the experiment is over you then only work with whichever of A or B you have determined to be the better event stream. You essentially divide your work into an experimentation phase followed by an exploitation phase.

Suppose that instead of deriving formal statistical estimates we solved the problem using ideas from operations research, and asked for an adaptive strategy that directly maximizes expected return. What would that even look like? It turns out you get a sensing procedure that routes all of its experiments to B for a while and then, depending on observed returns, may switch over to working only with A. This again looks like a sensing phase followed by an exploitation phase, but the exact behavior is determined by the algorithm interacting with experiment returns and is not something specified by the user. Let’s make things concrete by working a very specific example.

For the sake of argument: suppose we are willing to work with exactly four events ever, A’s conversion rate is exactly 1/2, and we are going to use what I am calling “naive priors” on the rate at which B returns success. The entire task is to pick whether to work next with an event from A or from B. One strategy is a fill-in of the following table:

| Number of B-trials run | Number of B-successes seen | Decision to go to A or B next |
|---|---|---|
| 0 | 0 | ? |
| 1 | 0 | ? |
| 1 | 1 | ? |
| 2 | 0 | ? |
| 2 | 1 | ? |
| 2 | 2 | ? |
| 3 | 0 | ? |
| 3 | 1 | ? |
| 3 | 2 | ? |
| 3 | 3 | ? |

Notice we have not recorded the number of times we have tried the A-event. Because we are assuming we know the exact expected value of A (in this case 1/2) there is an optimal strategy that never tries an A-event until the strategy decides to give up on B. So we only need to record how many B’s we have tried, how many successes we have seen from B, and if we are willing to go on with B. Remember we have exactly four events to route to A and B in combined total, and this is why we don’t need to know what decision to make after the fourth trial.

We can present the decision process more graphically as the following directed graph:

Each row of the decision table is represented as a node in the graph. Each node contains the following summary information:

- step: the number of B-trials we have run prior to this stage. In our procedure, once we decide to run an A we switch to A for all the remaining trials (since we assume we know the A success rate perfectly, we can’t learn anything more about A going forward, so we would have no reason to ever switch back).
- bwins: the number of successes we have seen from our B-trials prior to this stage.
- pbEst: the empirical estimate of the expected win-rate for this node. The idea is that a node that has tried B n times and seen w wins should naively estimate the unknown true success rate of B as w/n (which is what is written in the node). A special case is the first node, which we start at 1/2 instead of 0/0.
- valueA: the value to be gained by switching over to the A events at this time. This is just how many events remain to process (four at the root node, and only 1 at the leaves) times our known value of A (1/2 in this case). Notice this is an expected value; we are not actually running the A’s and recording empirical frequencies, but instead multiplying the known expected value by the number of remaining trials.
- valueB: the value to be gained by trying B one more time and then optimally continuing until the end of the 4-event trial (picking B’s or A’s for the remaining events as appropriate). Note that valueB ignores any payoff we have seen in getting to this node (that is already recorded in bwins and pbEst); it is only the value we expect from the remaining plays assuming our next draw is from B. If valueB>valueA then our optimal strategy is to go with B (and we could fill in our earlier table with this decision). If we could just solve for valueB we would have our optimal strategy.
- value (not shown): defined as max(valueA,valueB), the future value of a given node under the optimal strategy.

The first two values (step,bwins) are essentially the keys that identify the node and the other fields (known or unknown) are derived values. In our example directed graph we have written down everything that is easy to derive (pbEst), but still don’t know the thing we want: valueB (or equivalently whether to try B one more time at each node).

It turns out there is an easy way to fill in all of the unknown valueB answers in this diagram. The idea is called dynamic programming, and this application of it is inspired by something called the binomial options pricing model. But the idea is so simple yet powerful that we can just directly derive it for our problem.

Consider the leaf-nodes of the directed graph (the nodes with no exiting edges, representing the state of the world before our last decision). For these nodes we do have an estimate of valueB: pbEst! We can fill in this estimate to get the following refined diagram:

For the final four nodes we know whether to try B again (the open green nodes) or to give up on B and switch to A (the shaded red nodes). The decision is based on our stated goal: maximizing expected value. And in our last play we should go with B only if its estimated expected value is higher than the known expected value of A. Using the observed frequency of B-successes as our estimate of the probability of B (or the expected value of B) may seem slightly bold in this context, but it is the standard way to infer (we can justify it either through Frequentist arguments or through Bayesian arguments using an appropriate beta prior distribution).

So we now know how to schedule the fourth and final stage. That wouldn’t seem to help us much: as the first decision (the top row, or root node) is what we need first, and it still has a “?” for valueB. But look at the three nodes in the third stage. We can now estimate their value using known values from the fourth stage.

Define:

- pbEst[step=n,bwins=w]: as the number written for pbEst in the node labeled “step n bwins w”.
- valueA[step=n,bwins=w]: as the number written for valueA in the node labeled “step n bwins w”.
- valueB[step=n,bwins=w]: as the number written for valueB in the node labeled “step n bwins w”.
- value[step=n,bwins=w]: max(valueA[step=n,bwins=w],valueB[step=n,bwins=w]).

The formula for valuing any non-leaf node in our diagram is:

```
valueB[step=n,bwins=w] =
   ( pbEst[step=n,bwins=w] * (1+value[step=n+1,bwins=w+1]) ) +
   ( (1-pbEst[step=n,bwins=w]) * value[step=n+1,bwins=w] )
```

So if we know all of pbEst[step=n,bwins=w], value[step=n+1,bwins=w+1], and value[step=n+1,bwins=w], then we know valueB[step=n,bwins=w]. This is just saying the valueB of a node is the expected immediate payoff of playing B once (a bonus of 1 if B gives us a success) plus the expected value of the node we end up at.

For example we can calculate the valueB of the “step 2 bwins 1” node as:

```
valueB[step=2,bwins=1] =
   ( pbEst[step=2,bwins=1] * (1+value[step=3,bwins=2]) ) +
   ( (1-pbEst[step=2,bwins=1]) * value[step=3,bwins=1] )
 =
   ( 0.5 * (1+0.67) ) +
   ( (1-0.5) * 0.5 )
 = 1.085
```

All this is just done by reading quantities off the current diagram. We can do this for all of the nodes in the third row yielding the following revision of the diagram.

In the above diagram we have rendered nodes we consider unreachable (nodes we would never go to when following the optimal strategy) with dashed lines. We now have enough information in the diagram to use the equation to fill in the second row:

And finally we fill in the first row (or root node) and have a complete copy of the optimal strategy.

We can copy this from the diagram back to our original strategy table by writing “Choose A” or “Choose B” depending on whether valueA ≥ valueB for the node corresponding to the line in the table.

Number of B-trials run | Number of B-successes seen | Decision to go to A or B next
---|---|---
0 | 0 | Choose B
1 | 0 | Choose A
1 | 1 | Choose B
2 | 0 | Choose A
2 | 1 | Choose B
2 | 2 | Choose B
3 | 0 | Choose A
3 | 1 | Choose A
3 | 2 | Choose B
3 | 3 | Choose B
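The whole table can be reproduced by filling in the diagram bottom-up. Here is a minimal sketch (in Python for illustration; the article’s own code is R). The known A-rate of 0.5 and the neutral 0.5 estimate at the empty root are assumptions read off the worked example, not stated constants of the problem:

```python
# 4 plays total; A's success rate is assumed known to be 0.5, and once we
# abandon B we play A for every remaining play (the stopping-time strategy).
N, pA = 4, 0.5

def pb_est(step, bwins):
    # Naive frequency estimate; 0.5 at the empty root is an assumed neutral value.
    return bwins / step if step > 0 else 0.5

value, decision = {}, {}
for step in range(N - 1, -1, -1):          # fill the last row of the diagram first
    for bwins in range(step + 1):
        valueA = pA * (N - step)           # switch to A for all remaining plays
        p = pb_est(step, bwins)
        nxt = lambda w: value.get((step + 1, w), 0.0)  # past the last row: 0
        valueB = p * (1 + nxt(bwins + 1)) + (1 - p) * nxt(bwins)
        value[(step, bwins)] = max(valueA, valueB)
        decision[(step, bwins)] = 'Choose A' if valueA >= valueB else 'Choose B'
```

Running this recovers every row of the strategy table above, including the 1.085 (exactly 13/12) computed for the “step 2 bwins 1” node.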

We don’t have to use what we have called naive estimates for pbEst. If we have good prior expectations on the likely success rate of B we can work this into our solution in the form of Bayesian beta priors. For example, if our experience was such that the expected value of B is in fact around 0.25 (much worse than A) with a standard deviation of 0.25 (somewhat diffuse, so there is a non-negligible chance B could be better than A) we could design our pbEst calculation with that prior (which is easy to implement as a beta distribution with parameters alpha=0.5 and beta=1.5). In the case where we are only going to try A or B a total of four times we get the following diagram:
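The prior-adjusted estimate is just the posterior mean of the beta distribution. A minimal sketch (Python for illustration; the article’s own code is R):

```python
def pb_est(bwins, ntrials, alpha=0.5, beta=1.5):
    # Posterior mean of a beta(alpha, beta) prior after observing
    # bwins successes in ntrials trials: (alpha + wins) / (alpha + beta + trials).
    return (alpha + bwins) / (alpha + beta + ntrials)

# With alpha=0.5, beta=1.5 the prior mean is 0.5/2 = 0.25 and the prior
# standard deviation is 0.25, matching the pessimistic prior in the text.
```

Note the estimate only approaches the naive frequency bwins/ntrials as ntrials grows; for small counts the prior dominates.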

This diagram is saying: for only four trials we already know enough to never try B. The optimal strategy is to stick with A all four times. However, if the total number of trials we were budgeted to run was larger (say 20) then it actually starts to make sense to try B a few times to see if it is in fact a higher rate than A (despite our negative prior belief). We demonstrate the optimal design for this pessimal prior and n=20 in the following diagram.

And this is the magic of the dynamic programming solution. It uses the knowledge of how long you are going to run your business to decide how to value exploration (possibly losing money by giving traffic to B) versus exploitation (going with which of A or B is currently *thought* to be better). Notice the only part of the diagram or strategy table we need to keep is the list of nodes where we decide to no longer ever try B (the filled red stopping nodes). This is why we call this variation of the A/B test a stopping time problem.

All the B-rate calculations above are exactly correct if we in fact had the exact right priors for B’s rate. If the priors were correct at the root node, then by Bayes’ law the pbEst probability estimates are in fact exactly correct posterior estimates at each node, and every decision made in the strategy is then correct. For convenience we have been using a beta distribution as our prior (as it has some justification, and makes calculation very easy), but there is no guarantee that the actual prior is in fact beta or that we even have the right beta distribution as our initial choice (a beta distribution is determined by two parameters alpha and beta).

However, with n large enough (i.e. a budget of enough proposed events to design a good experiment) the strategy performance starts to become insensitive to the chosen prior (see the Bernstein–von Mises theorem for some motivation). So the strategy performs nearly as well with a prior we can supply as with the unknown perfect prior. As long as we start with an optimistic prior (one that allows our algorithm to route traffic to B for some time) we tend to do well.

In practice we would never know the exact expected value of A (and certainly not know it prior to starting the experiment). In the more realistic situation where we assume we are trying to choose between an A and B where we have things to learn about both groups the dynamic programming solution still applies: we just get a larger dynamic programming table. Each state is indexed by four numbers:

- nA: number of A trials already tried.
- awins: number of A successes already seen.
- nB: number of B trials already tried.
- bwins: number of B successes already seen.

For each so-labeled state we have four derived values:

- paEst: the estimated expected value of A in this state; this is a simple function of the state index label.
- pbEst: the estimated expected value of B in this state; this is a simple function of the state index label.
- valueA: the estimated expected continuation value of sending the next unit of traffic to A and then continuing on an optimal strategy. This starts out unknown.
- valueB: the estimated expected continuation value of sending the next unit of traffic to B and then continuing on an optimal strategy. This starts out unknown.

And again, an optimal strategy is one that just chooses A or B depending on whether valueA > valueB or not. Notice that in this case an optimal strategy may switch back and forth between the A and B treatments. The derived values are filled in from states at or near the end of the entire experiment just as before. We now have an index consisting of four numbers (nA,awins,nB,bwins) instead of just two numbers (step,bwins), so it is harder to graphically present the intermediate calculations and the final strategy tables.
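The four-index recursion can be sketched directly. The following is a hypothetical Python illustration (the article’s production code is R) that assumes uniform beta(1,1) priors on both arms, using the posterior mean as the success estimate:

```python
from functools import lru_cache

def two_arm_values(total, alpha=1.0, beta=1.0):
    # Expected future successes from state (nA, awins, nB, bwins) with
    # `remaining` plays left, under optimal continuation. Each call returns
    # the pair (valueA, valueB); max() over the pair is the node's value.
    @lru_cache(maxsize=None)
    def value(nA, awins, nB, bwins, remaining):
        if remaining == 0:
            return 0.0, 0.0
        paEst = (alpha + awins) / (alpha + beta + nA)  # posterior mean for A
        pbEst = (alpha + bwins) / (alpha + beta + nB)  # posterior mean for B
        valueA = (paEst * (1 + max(value(nA + 1, awins + 1, nB, bwins, remaining - 1)))
                  + (1 - paEst) * max(value(nA + 1, awins, nB, bwins, remaining - 1)))
        valueB = (pbEst * (1 + max(value(nA, awins, nB + 1, bwins + 1, remaining - 1)))
                  + (1 - pbEst) * max(value(nA, awins, nB + 1, bwins, remaining - 1)))
        return valueA, valueB
    return value(0, 0, 0, 0, total)
```

With identical priors the root state is symmetric in A and B, so valueA and valueB agree there; asymmetry appears as soon as the two arms accumulate different histories.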

Here is an example that is closer to the success rates and length of business seen in email or web advertising (though one problem for email advertising is that this is a sequential plan: we need all earlier results back to make later decisions). Suppose we are going to run an A/B campaign for a total of 10,000 units of traffic, we assume the A success rate is exactly 1%, and we will use the uninformative Jeffreys prior for B (which is actually pretty generous to B, as this prior has initial expected value 1/2). That is it: our entire problem specification is the assumed A-rate, the total amount of traffic to plan over, and the choice of B-prior. This is specific enough for the dynamic programming strategy to completely solve the problem of maximizing expected revenue.

The dynamic program solution to this problem can be concisely represented by the following graph:

The plan is: we route all traffic to B, always measuring the empirical return rate of B (number of B successes over number of B trials). The number of B trials is the x-axis of our graph and we can use the current estimated B return rate as our y-height. The decision is: if you end up in the red area (below the curve) you stop B and switch over to A forever. Notice B is initially given a lot of leeway. It can fail to pay off a few hundred times and we don’t insist on it having a success rate near A’s 1% until well over 5,000 trials have passed.

Dynamic programming offers an interesting alternative solution to A/B test planning (in contrast to the classic methods we outlined here).

All the solutions and diagrams were produced by R code we share here.

We will switch from “statistics as it should be” back to “R as it is” and discuss the best ways to incrementally collect data or results in R.

What the Sharpe ratio does is: give you a dimensionless score to compare similar investments that may vary both in riskiness and returns without needing to know the investor’s risk tolerance. It does this by separating the task of valuing an investment (which can be made independent of the investor’s risk tolerance) from the task of allocating/valuing a portfolio (which must depend on the investor’s preferences).

But what we have noticed is nobody is willing to honestly say what a good value for this number is. We will use the R analysis suite and Yahoo finance data to produce some example real Sharpe ratios here so you can get a qualitative sense of the metric.

“What is a good Sharpe ratio” was a fairly popular query in our search log (until search engines stopped sharing the incoming queries with mere bloggers such as myself). When you do such a search you see advice of the form:

… a ratio of 1 or better is considered good, 2 and better is very good, and 3 and better is considered excellent …

Some sources of this statement include:

- Investopedia: Understanding The Sharpe Ratio

To give you some insight, a ratio of 1 or better is considered good, 2 and better is very good, and 3 and better is considered excellent.

- Yahoo Finance: Why you should use the Sharpe ratio when investing in the medical device industry

A Sharpe ratio of 1 is considered good, while 2 is considered great and 3 is considered exceptional.

- HowTheMarketWorks: Sharpe Ratio

To give you some insight, a ratio of 1 or better is considered good, 2 and better is very good, and 3 and better is considered excellent.

- Nuclearphynance: How high a Sharpe is considered “good?”

… frankly a Sharpe of 1+ is a yawn, and *no*one* notices. Above 2 and you get attention.

Reading these together you see a bit of a content-free echo chamber. Remember: on the web when you see the exact same answer again and again it is more likely due to copying than due to authoritativeness. The last reference indicates a part of the problem: once somebody claims some specific number (such as 1) is a middling Sharpe ratio, no-one dares call any smaller number good (for fear of looking weak).

One also wonders if “2 is good” is some sort of confounding interpretation of the Sharpe ratio as a Fisher style Z statistic (which uses the same ratio of mean over deviation). The point being the rule of thumb “two standard deviations has a two-sided significance of 0.0455” falls fairly close to the heavily ritualized p-value of 0.05.

The correct perspectives about Sharpe ratio are a bit more nuanced:

- Morningstar classroom: How to Use the Sharpe Ratio

Of course, the higher the Sharpe ratio the better. But given no other information, you can’t tell whether a Sharpe ratio of 1.5 is good or bad. Only when you compare one fund’s Sharpe ratio with that of another fund (or group of funds) do you get a feel for its risk-adjusted return relative to other funds.

In fact it is Morningstar that gave a specific range for annual returns (around 0.3) that I used in my article Betting with their money (though now I am not sure whether the number they gave was a real example or just notional).

The theory of the Sharpe ratio is: if you have access to the ability to borrow or lend money, then for two similar investments you should always prefer the one with higher Sharpe ratio. So the Sharpe ratio is definitely used for comparison. When I was in finance I used the Sharpe ratio for comparison, but I didn’t have a Sharpe ratio goal.

Reading from a primary source we see estimating the Sharpe ratio of a particular investment at a particular time depends on at least the choice of:

- Investment time frame: are we talking about daily, monthly, quarterly, or annual returns? Changing scale changes returns (daily returns compound about 365 times to get yearly returns!) and changes deviation (theoretically daily returns tend to have a deviation that is around `sqrt(365)` times more volatile than yearly returns). The Sharpe ratio is the ratio of these two quantities, and they are not varying in similar ways as we change scale.
- Choice of “risk free” reference returns. “Return” in the Sharpe ratio is actually defined as “excess return over a chosen risk-free investment.” Choose a comparison investment with low returns and you *artificially* look good.
- Length of data used to estimate empirical variance (as we are talking about the Ex Post Sharpe Ratio, which means we don’t have a theoretical variance to use). In theory (for normal data or data with bounded theoretical variance) variance/deviation estimates should stabilize at moderate window sizes. Picking too small a window may let you avoid some rare losses and display an elevated Sharpe ratio. Picking too large a window may let in data from market climates not relevant to the current market.

So without holding at least these three choices constant, it doesn’t make a lot of sense to compare.

We emphasize because the Sharpe ratio itself varies over time (even with the above windowing) it in fact does not strictly make sense to talk about “the Sharpe ratio of an investment.” Instead you must consider “the Sharpe ratio of an investment at a particular time” or “the distribution of the Sharpe ratio of an investment over a particular time interval.” This means if you want to estimate a Sharpe ratio for an investment you at least must specify an additional time scale or smoothing window to average over.
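A trailing-window estimate is the standard way to get such a time-indexed Sharpe ratio. Here is a minimal sketch (Python for illustration; the article’s graphs were produced in R), which makes the window-size sensitivity discussed above easy to experiment with:

```python
import statistics

def rolling_sharpe(returns, risk_free, window=500):
    # Trailing-window Sharpe ratio: mean excess return over the window
    # divided by the deviation of returns estimated on that same window.
    out = []
    for i in range(window, len(returns) + 1):
        chunk = returns[i - window:i]
        excess = [r - risk_free for r in chunk]
        sd = statistics.pstdev(chunk)
        out.append(statistics.mean(excess) / sd if sd > 0 else float('nan'))
    return out
```

Changing `window` (500 days versus 30 days) changes the answer materially, which is exactly the sensitivity reported below for the S&P500 calculation.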

For example: below is the Sharpe ratio of the S&P500 index annual returns using a 500 day window to estimate variance/deviation using the 10-year US T-note interest rate as the risk-free rate of return (note we are using the interest rate as the risk-free return, we are *not* using the returns you would see from buying and selling T-notes). Note: the choice of “risk free” investment here is a bit heterodox.

Notice two things:

- The ratio is all over the map: we can call the S&P Sharpe ratio just about any value between -2 and 2 by picking the right 5 year period to consider “typical.” The mean over the period graphed (1960 through 2015) is 0.3 and the median is 0.17.
- The ratio spends a lot of time well below 1 over this history.

For our betting article we needed a Sharpe ratio on a 10 day scale. Here is the S&P500 index 10 day returns using a 500 day window to estimate variance/deviation using the 10-year US T-note interest rate as the risk-free rate of return:

Over the graphed time interval the upper quartile value is 0.1. So the S&P 10 day return Sharpe ratio spends 75% of its time below 0.1. Thus the *theoretical* 10 day Sharpe ratio of 0.18 in “Betting with their money” is in fact large. Though we have found this calculation is sensitive to the length of the window used to estimate variance (for example using a window of 30 days gives us mean: 0.08, median: 0.075, 3rd quartile: 0.45).

And for fun here is a similar Sharpe ratio calculation for the PIMCO TENZ bond fund:

This just confirms the last few years have not been good for US bonds.

Note: it is traditional to use very low interest rate instruments as the “safe comparison” in the Sharpe ratio. So using 10-year T-note interest rates gives an analysis that is a bit pessimistic (and also ascribes the T-note variance to the instrument being scored). However, the “safe comparison” is really only used in the Sharpe portfolio argument as the rate you can borrow and/or lend money at (which is not in fact risk-free in the real world). So there is some value in using an easy to obtain realistic “boring investment” as a proxy for the “risk-free return rate.” The ignoring of the risk-free rate in the Betting with their money article is also not strictly correct (but also something Sharpe ignored for a while), but given the scale of potential wins and losses in that set-up it is not going to cause significant issues.

Basically remember this: there are a lot of analyst-chosen details in estimating a Sharpe ratio. One of the biggest ones you can fudge is the estimate of deviation/variance (be it theoretical/Ex-Ante or Ex-Post). I would say very high Sharpe ratios are more likely evidence of underestimating the deviation/variance (of the investment and of the reference return process) than evidence of actual astronomical risk-adjusted returns.

The complete R code to produce these graphs from downloaded finance data is given here.

Win-Vector LLC can complete your high value project quickly (some examples), and train your data science team to work much more effectively. Our consultants include the authors of Practical Data Science with R and also the video course Introduction to Data Science. We now offer on site custom master classes in data science and R.

Please reach out to us at contact@win-vector.com for research, consulting, or training.

Follow us on (Twitter @WinVectorLLC), and sharpen your skills by following our technical blog (link, RSS).

An A/B test is a very simple controlled experiment where one group is subject to a new treatment (often group “B”) and the other group (often group “A”) is considered a control group. The classic example is attempting to compare defect rates of two production processes (the current process, and perhaps a new machine).

Illustration: Boris Artzybasheff

(photo James Vaughan, some rights reserved)

A/B testing is one of the simplest controlled experimental design problems possible (and one of the simplest examples of a Markov decision process). And that is part of the problem: it is likely the first time a person will need to truly worry about:

- Power/Significance
- Design of experiments
- Defining utility
- Priors or beliefs
- Efficiency of inference

All of these are technical terms we will touch on in this article. However, we argue the biggest sticking point of A/B testing is: it requires a lot more communication between the business partner (sponsoring the test) and the analyst (designing and implementing the test) than a statistician or data scientist would care to admit. In this first article of a new series called “statistics as it should be” (in partnership with Revolution Analytics) we will discuss some of the essential issues in planning A/B tests.

Communication is the most important determiner of data science project success or failure. However, communication is expensive. That is one reason why a lot of statistical procedures are designed and taught in a way to minimize communication. But minimizing communication has its own costs and is somewhat responsible for the terse style of many statistical conversations.

A typical bad interaction is as follows. The business person wants to see if a new advertising creative is more profitable than the old one. It is unlikely they phrase it as precisely as “I want to maximize my expected return” (what they likely in fact want) or as “I want to test the difference between two means” (what a statistician most likely wants to hear). To make matters worse the “communication” is usually a “clarifying conversation” where the business person is forced to pick a goal that is convenient for analysis. The follow-ups are typically:

- You want to test the difference between two means? ANOVA.
- You want to check significance after the test is run? t-Test or F-Test (depending on distribution).
- Oh, you want to know how long to run the test? Here is a power/significance calculator.

This is a very doctrinal and handbook way of talking and leaves little time to discuss alternatives. It kills legitimate statistical discussion (example: for testing difference of rates shouldn’t one consider Poisson or binomial tests such as Fisher or Barnard over Gaussian approximations?). And it shuts out a typical important business goal: maximizing expected return. Directly maximizing expected return is a legitimate well-posed goal, but it is not in fact directly solved by any of the methods we listed above. For a good discussion of maximizing expected return see here.

What we have to remember is: the responsibility of the statistician or data scientist consultant isn’t to quickly bully the business partner into terms and questions that are easiest for the consultant. The consultant’s responsibility is to spend the time to explore goals with the business partner, formulate an appropriate goal into a well-posed problem, and *only then* move on to solving the problem.

The problem to solve is the one that is best for the business. For A/B testing the right problem is usually one of:

- With high probability correctly determine which of A or B has higher expected value. (power/significance formulation)
- Route an amount of business to A and B that maximizes the expected return. (maximizing utility formulation)

A lot of literature on A/B testing is written as if problem-1 is the only legitimate goal. In many cases problem-1 is the goal, for example when testing drugs and medical procedures. And a good solution to problem-1 is usually a good approximate solution to problem-2. However, in business (as opposed to medicine) problem-2 is often the actual goal. And, as we have said, there are standard ways to solve problem-2 directly.

Once we have a goal we should look to standard solutions. Some of the methods I like to use in working with A/B tests include:

- Frequentist power/significance planners/calculators (here is a simplified interactive one). These tend to be very good for the traditional task of ensuring a given accuracy in picking A versus B correctly.
- Bayesian posterior planners. These tend to be good at targeting a given efficiency in expected return.
- Online or bandit formulations. These are good at maximizing returns.
- A dynamic programming solution inspired by binomial option pricing (the topic of our next A/B test article).
- Wald’s graphical sequential inspection technique (the topic of a future article).

Each of these methods is trying to encapsulate a procedure that, in addition to serving a particular goal, minimizes the amount of prior knowledge needed to run a good A/B test. A lot of the differences in procedure come from using different assumptions to fill in quantities not known prior to starting the A/B test. Also notice a lot of the choice of Bayesian versus frequentist is pivoting on what you are trying to do (and less on which you like more).
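For the first, power/significance style of planner, the core calculation is the standard two-proportion sample size formula under the normal approximation. A minimal sketch (Python for illustration; this is the textbook formula, not the internals of any particular calculator linked above):

```python
from math import ceil, sqrt
from statistics import NormalDist

def ab_test_size(pA, pB, alpha=0.05, power=0.8):
    # Per-group sample size for a two-sided two-proportion z-test
    # (normal approximation; reasonable for rates not too close to 0 or 1).
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # significance quantile
    z_b = NormalDist().inv_cdf(power)           # power quantile
    p_bar = (pA + pB) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(pA * (1 - pA) + pB * (1 - pB))) ** 2
    return ceil(num / (pA - pB) ** 2)
```

The qualitative behavior matters more than the exact constant: halving the rate difference you wish to detect roughly quadruples the required traffic, which is why the “what are your initial bounds on the rates?” question below is unavoidable.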

Guided interaction with the calculator or exploration of derived decision tables is very important. In all cases you work the problem (maybe with both statistician and client present) by interactively proposing goals, examining the calculated test consequences, and then revising goals (if the proposed test sizes are too long). This ask, watch, reject cycle greatly improves communication between the sponsor and the analyst as it quickly makes apparent concrete consequences of different combinations of goals and prior knowledge.

The following is a quick stab at a list of parameter estimates needed in order to design an efficient A/B test. We call them “prior estimates” as we need them during the test design phase, before the test is run.

- What likelihood of being wrong is acceptable? Power and significance calculators need these as goals.
- How much money are you willing to lose to experimentation if the new process is in fact no better than your current process? (hint: if the answer is zero, then you can’t run any meaningful test).
- What are your prior experiences and beliefs on the alternative treatments being proposed? Is it an obvious speed improvement or bug fix (which may only need to be confirmed, not fully estimated)? Or is it one of a long stream of random proposals that usually don’t work (which means you have to test longer!)?
- What are your initial bounds on the rates? Power/significance based tests get expensive as you try to measure differences between similar rates. Some experimental design procedures use a business-supplied bound on rates and differences in essential ways. Most frameworks require one or two questions be answered in this direction.
- How long are you going to use the result? This is the question almost none of the frameworks ever ask. However, it is a key question. How much you are willing to spend (by having both the A test and B test up, intentionally sending some traffic to a possibly inferior new treatment) to determine the best group should depend on how long you expect to exploit the knowledge. You don’t ask the hotel concierge for dinner recommendations the morning you are flying out (as at that point the information no longer has value). Similarly if you are running a business for 100 days: you don’t want to run a test for 99 days and then only switch to the perceived better treatment for the single final day.
- Is the business person going to look at and possibly make decisions on intermediate results? Allowing early termination of experiments can lower accuracy if proper care is not taken (related issues include the multiple comparison problem).

Essentially a good test plan depends on having good prior estimates of rates, and a clear picture of future business intentions. Each of the standard solutions has different sensitivity to the answered and ignored points. For example: many of the solutions assume you will be able to use the chosen treatment (A or B) arbitrarily long after an initial test phase, and this may or may not be a feature of your actual business situation.

Given the (often ignored) difficulty in faithfully encoding business goals and in supplying good prior parameter estimates, one might ask why A/B testing *as it is practiced* ever works. My guess is that practical A/B testing is often not working. Or at least not making correct decisions as often as typically thought.

Practitioners have seen that even tests that are statistically designed to make the wrong decision no more than 10% of the time seem to be wrong much more often. But this is noticed only if one comes back to re-check! Some driving issues include using the wrong testing procedure (such as inappropriately applying one-tailed bounds to an actual two-tailed experiment). But even with correct procedures, any mathematical guarantee is contingent on assumptions and idealizations that may not be met in the actual business situation.

Likely a good fraction of all A/B tests run have returned wrong results (picked B when the right answer was A, or vice-versa). But as long as that fraction is small enough that the expected value of an A/B test is positive, the business sees large long-term net improvement. For example if all tested changes are of similar magnitude, then it is okay for even one-third of the tests to be wrong. You don’t know which decisions you made were wrong, but you know about 2/3rds of them were right and the law of large numbers says your net gain is probably large and positive (again, assuming each change has a similar bounded impact on your business).
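The arithmetic behind that claim is easy to check by simulation. A hypothetical sketch (Python), assuming each adopted change helps or hurts by one equal-sized unit:

```python
import random

random.seed(7)

def net_gain(n_changes, p_right=2/3, trials=100_000):
    # Each adopted change helps (+1) with probability p_right, hurts (-1)
    # otherwise; average the total impact over many simulated businesses.
    total = 0
    for _ in range(trials):
        total += sum(1 if random.random() < p_right else -1
                     for _ in range(n_changes))
    return total / trials

# Expected gain per change is 2/3 - 1/3 = 1/3, so ten changes adopted from
# "too short" tests still average roughly +3.3 units of improvement.
```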

Or one could say:

One third of the decisions I make based on A/B tests are wrong; the trouble is I don’t know which third.

in place of the famous:

Half the money I spend on advertising is wasted; the trouble is I don’t know which half.

The point is: it may actually make more business sense to apply 10 changes to your business suggested by “too short” A/B tests (so 2 of the suggestions may in fact be wrong) than to tie up your A/B testing infrastructure so long you only test one possible change. In fact considering an A/B test as a single event done in isolation (as is typically done) may not always be a good idea (for business reasons, in addition to the usual statistical considerations).

It has pained me to informally discuss the business problem and put off jumping into the math. But that was the point of this note: the problem precedes the math. In our next “Statistics as it should be” article we will jump into math and algorithms when we use a dynamic programming scheme to exactly solve the A/B testing plan problem for the special case when we assume we have answers to some of the questions we are usually afraid to ask.

A number of researchers had previously done this (many cited in their references), but the authors added more good ideas:

- Enforce a “natural image constraint” through insisting on near-pixel correlations.
- Start the search from another real image. For example: if the net’s internal activation is constrained to recognize buildings and you start the image optimization from a cloud you can get a cloud with building structures. This is a great way to force interesting pareidolia-like effects.
- They then “apply the algorithm iteratively on its own outputs and apply some zooming after each iteration.” This gives them wonderful fractal architecture with repeating motifs and beautiful interpolations.
- Freeze the activation pattern on intermediate layers of the neural network.
- (not claimed, but plausible given the look of the results) Use the access to the scoring gradient for final image polish (likely cleans up edges and improves resolution).

From Michael Tyka’s Inceptionism gallery

Likely this used a lot of GPU cycles. The question is, can we play with some of the ideas on our own (and on the cheap)? The answer is yes.

I share complete instructions, and complete code for a baby (couple of evenings) version of related effects.

What we need to optimize images through a neural net scoring function is at least the following:

- A trained image recognizing neural net. This supplies our objective function. I chose Caffe after seeing it featured in another fun article.
- Somewhere to run the whole thing. I chose Amazon EC2. I tried to assemble complete instructions for installing Caffe on a fresh EC2 instance.
- A source of images and image transformations. Instead of modifying images directly (which likely is a bit of work to do effectively for an arbitrary scoring net) I decided to use an evolving image process I already had access to: my 1995 genetic art project. The code is getting a bit creaky, but is available here. This system already had a cross-over combinator for the underlying formulas that generate the images, so we have a ready process we can try to optimize over (through crude evolutionary algorithms). Exact EC2 instructions are in the included file `ec2Steps.txt`.
- A “natural image constraint.” I dashed this off quickly by saying an image is “natural” if `skimage.restoration.denoise_tv_chambolle` doesn’t pull pixels too far away from the original image. What we are fighting is the now well-known issue that convolutional neural net deep learning machines seem to (unfortunately) base a lot of their classification on what humans consider to be visual static (see the references included in the original article), so something as simple as a regularization control should work here.
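The TV-denoising check above amounts to penalizing images with lots of high-frequency “static.” A crude, stdlib-only stand-in (a hypothetical Python sketch, not the `skimage` call used in the project) is to measure total variation directly:

```python
def total_variation(img):
    # Sum of absolute differences between horizontally and vertically
    # adjacent pixels; "static"-like images score high, smooth ones low.
    h, w = len(img), len(img[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(img[i + 1][j] - img[i][j])
            if j + 1 < w:
                tv += abs(img[i][j + 1] - img[i][j])
    return tv

def looks_natural(candidate, original, budget=0.1):
    # Accept a candidate only if the optimizer has not added much more
    # high-frequency content (per pixel) than the original image had.
    n = len(candidate) * len(candidate[0])
    return (total_variation(candidate) - total_variation(original)) / n <= budget
```

The `budget` threshold is an assumed tuning knob; the idea is only that some such regularization keeps the optimizer from exploiting the net’s sensitivity to visual static.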

Given this set-up I decided to optimize the genetic art for “crab-like pictures” (as defined by the classification categories from the chosen pre-trained neural net). This is a tip of the hat to Michael Witbrock (one of my collaborators, along with Scott Neal Reilly on the 1995 genetic art project) who inspired us with a (probably apocryphal) story of crabs perhaps naturally selected to have patterns resembling human faces.

Samurai crab, H. japonica and stylized Kabuki samurai face (inset). From: Samurai Crabs: Transmogrified Japanese warriors, the product of artificial selection, or pareidolia?

A quick run yielded an image that the neural net was 99.12% sure was some sort of crab:

Here it is rendered at 256×256 (the net’s concept space):

Artificial “crab” image (rendered 256×256, as this is the net’s concept space).

And re-rendered at a higher resolution (with some anti-aliasing):

The genetic art project really only seems to have so many images in its concept space, but even with a crude evolutionary optimizer over its underlying representation (which is text formulas, not images) it can evolve pictures that fool the image classification net (the adversary seems to have the easy side in adversarial machine learning). With more care (better “natural image” function, richer representation language) we could probably do a lot more.

Now the “winner” was not a very legible or natural image (so we need a better “natural image” filter, which we could definitely develop). But check out the renders of some images we got on the way to this one.

These are images saved as “new record matches” while running the genetic art *unattended*. Obviously purely artificial images scored against a low-resolution image classifier are not going to have as many realistic features as images built starting from high resolution sources on a high-resolution net (and repeated and re-zoomed). But I think there is something here. The image classification neural net seems to work as a passable “is interesting” function. This is noteworthy because one of the inspirations for my 1995 project was:

Shumeet Baluja, Dean Pomerleau and Todd Jochem, “Simulating User’s Preferences: Towards Automated Artificial Evolution for Computer Generated Images” Technical Report CMU-CS-93-198, Carnegie Mellon University. Pittsburgh, PA. October 1993.

A paper whose goal was to train a neural net to recognize interesting images from a stream of artificial images (trained from previous user decisions).

And for a more generative approach to image synthesis check out Scott Draves’ 1993 Fuse work (warning: some of the image sources were pornographic).

Blackjack (Photo: Wikimedia)

According to the article Don Johnson developed a reputation as a non card-counting high roller in gambling circles. This tempted revenue-hungry Atlantic City casinos to invite him to private room gambling with custom rules and special considerations. Mr. Johnson, who describes himself as “not naive in math,” got different casinos to agree to the following two important game changes:

- Various rule changes to blackjack changing the player’s ability to split hands, dealer’s hitting rules, and other things. Mr. Johnson is quoted as estimating the rule changes brought the house advantage or edge down to about 0.25% per game. This is much less than the typical house edge of over 0.5% for multiple deck games. But it is still in the house’s favor (no doubt casinos understand the mathematics of blackjack very well).
- A per-visit refund or rebate privilege: on any visit where Mr. Johnson leaves the table down $500,000 or more he needs only to pay 80% of his deficit to settle his account. This is essentially free money, but most gamblers lack the discipline to take advantage of it.

The blackjack rule changes were not the problem. The issue was the rebate. Casinos regularly give gamblers initial stakes (such as $100 of chips just for walking in) and apparently routinely negotiate different refund and rebate programs with high rollers. Again, the casinos know what they are doing. Rebates are free money, but most gamblers lack the discipline to hold onto the money.

However, Mr. Johnson clearly is a disciplined gambler. In the article he stated his strategy was to cash in if he lost enough to trigger the rebate (so probably stop if he was $500,000 in the hole) but continue to bet and bet large if he was ahead. His delightfully disarming quote on why he bets when he is ahead is:

So my philosophy at that point was that I can afford to take an additional risk here, because I’m battling with their money, using their discount against them.

And he is very right. There is an advantage to betting with the house’s money. Here is some math I am sure Mr. Johnson knows (either formally or intuitively, probably both).

Mathematically, the discount is an obvious bad bet on the part of the casino. They know this. It is simple to exploit: come in every night and make only a single $500,000 bet. If you win, end the visit; if you lose, pay off the 80% ($400,000). Roughly every two days the casino gives you about $100,000 (or $50,000 a day on average).
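The arithmetic of this simplest exploit can be checked directly (a minimal sketch, assuming one even-odds $500,000 bet per visit):

```python
# Expected value of the simplest rebate exploit: a single $500,000
# even-odds bet per visit, paying only 80% of a losing night.
win = 0.5 * 500_000    # collect the full $500,000 on a win
loss = 0.5 * 400_000   # pay only 80% of a $500,000 loss
ev_per_visit = win - loss
print(ev_per_visit)  # 50000.0: about $50,000 per visit on average
```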

The casino likely feels confident offering such a deal to a high-roller, because likely the high-roller has been losing or winning more than $50,000 a day when playing, and the casino can just cancel the deal if they notice the gambler has started to accumulate profits. The hoped-for benefit is: the gambler lacks discipline and loses a great deal of money every night at your casino (and not at a competitor’s casino). If the gambler never ends a night ahead, then any money collected by the casino *seems* like profit (no matter how deep a discount is offered).

Mr. Johnson no doubt knows how to break this scheme. He will play a very disciplined strategy that loses a bounded amount of money most days (keeping the casino happy, yet exploiting the discount), looks like the behavior of an undisciplined gambler, but happens to have a positive expected return for Mr. Johnson.

Suppose you try to exploit these rules. Call the amount of money you are up or down for the day $X ($X starts at zero). For this section everything will be only “for the day.” Let’s make the generous assumption that you negotiated so many rule changes, play so well, control bet sizes, and do just enough card counting to make the odds 50/50: that for each and every bet the casino has exactly a 50% chance of winning.

Suppose your current net-winnings for the day are $X (X starts at zero). Further suppose you gamble with the following strategy. You start what we call “a phase” by writing down your current net-winnings for the day $X and a positive integer $B. The phase continues until your net-winnings for the day are driven to a new value that is either $(X-B) or $(X+B).

A phase is played as follows:

- If and only if your net winnings for the day are $(X-B) or $(X+B) (our interval boundaries) the phase ends.
- Otherwise if your current net-winnings for the day are $Y with $(X-B) < $Y < $(X+B) you bet any integer number of dollars between $1 and $min(Y-(X-B),X+B-Y) (i.e. any amount at least $1 that won’t jump over the interval boundaries).

For example you can start a phase by betting $B (which guarantees the phase will be exactly one bet long), or run a phase of always betting $1 until you hit the boundary. Because we assumed this was a “fair game” (neither you nor the casino have any advantage on bets) martingale theory tells us the expected value at the end of a phase must equal the value at the beginning of the phase. You started the phase with a value of $X, so the expected value at the end of the phase must also be $X. The expected value is p*(X-B) + (1-p)*(X+B) where p is the (unknown) probability of your betting exiting the interval at the losing boundary $(X-B). But if p*(X-B) + (1-p)*(X+B) = X then we must have p=0.5, no matter what variation of bets you execute.
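A quick Monte Carlo check of this martingale fact. This is a sketch assuming a fair coin for each bet and a toy boundary of $10 instead of $500,000; the random bet-size policy is deliberately arbitrary, since the claim is that p=0.5 regardless of how bet sizes are chosen:

```python
import random

def run_phase(B, rng):
    """Play one 'phase' of a fair game starting at net winnings 0,
    betting a random legal amount each round, until net winnings
    hit -B or +B. Returns True if we exit at the winning boundary."""
    y = 0
    while -B < y < B:
        # any integer bet from $1 up to the nearer boundary,
        # so we can never jump over the interval edges
        bet = rng.randint(1, min(y + B, B - y))
        y += bet if rng.random() < 0.5 else -bet
    return y == B

rng = random.Random(42)
B = 10
trials = 20_000
wins = sum(run_phase(B, rng) for _ in range(trials))
print(wins / trials)  # close to 0.5, whatever the bet-size policy
```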

This is a standard result in martingale theory. And it just means: if neither you nor the house have an advantage on individual games played, then neither you nor the house have an advantage on any sequence of games, even with you choosing the bet sizes and choosing when to stop playing. This is one of the hooks of gambling: players think there is great power in choosing bet size and when to stop, when usually there is no way to use those to any advantage beyond choosing not to gamble.

Things change when we add in the discount.

Let us always pick B such that X-B is -$500,000. So B = X+500,000 and our interval is -$500,000 to $(2X + 500,000). And our odds of exiting the interval at the left or right boundary remain 50/50. However, if you walk away when you are down $500,000 you are only expected to pay $400,000. So the expected value of a bet of $(X+500,000) is: 0.5*(-400,000) + 0.5*(2X + 500,000) = $(X+50,000), not $X. It actually makes sense (in terms of expected value, not in terms of risk) for you to bet. You have an expected profit of $50,000 every time you bet $(X+500,000), due to the casino’s generous discount. In a sense the casino is subsidizing every one of your large bets, not just the end of day settlement. Also notice the player betting a bit more than their daily winnings each time they win is pretty much equivalent to the house betting a bit more than their daily loss, or the doubling pattern of the “small martingale” (a famously risky system).
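The rebated phase expectation can be verified for a few starting values of X (a minimal check of the formula above):

```python
# Expected value of one rebated phase with interval
# [-500,000, 2X+500,000]: exit is 50/50, but a losing exit
# costs only $400,000 (80% of $500,000).
def phase_ev(X):
    return 0.5 * (-400_000) + 0.5 * (2 * X + 500_000)

for X in (0, 500_000, 1_500_000):
    # the edge over the starting value is $50,000 at any X
    print(X, phase_ev(X) - X)
```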

Let’s push this strategy forward a bit. One complete multi-phase strategy is: pick a daily winning target (say $3,500,000) and run betting phases of the form $(X+500,000) until you are at least that far ahead or until you are $500,000 down, and then quit. That is: each night you either see three winning phases in a row and take home $3,500,000, or see a loss and pay $400,000. On average you would take home $3,500,000 one night in 8 and be down $500,000 each of 7 nights in 8. Since 7*$500,000 = $3,500,000, expected winnings match expected losses prior to the discount. But you are only paying $400,000 each night you lose, so your expected net take home over 8 nights is $3,500,000 – 7*$400,000 = $700,000. So, in expectation, the casino is leaking about $87,500 a night to you.
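The nightly arithmetic of the three-phase strategy, as a one-line check:

```python
# Nightly expectation of the three-phase strategy: win all three
# 50/50 phases (probability 1/8) and take home $3,500,000,
# otherwise lose a phase and settle the rebated $400,000 loss.
p_win_night = 0.5 ** 3
ev_night = p_win_night * 3_500_000 - (1 - p_win_night) * 400_000
print(ev_night)  # 87500.0 expected per night
```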

The previous section assumed “fair odds.” Mr. Johnson is quoted as saying he thinks his rule changes took the house’s edge down to about 0.25%. This is good for multiple deck blackjack, but still a problem.

Even if the odds were fair or near-fair you want to make large bets to exit the phase interval in a reasonable amount of time. You are not going to win or lose a million dollars quickly in $100 bets. The issue is the scheme requires virtuoso play on each hand. You have to play well almost every hand to eat into the house advantage, and that is going to be exhausting.

Once the house odds are against you, you don’t just want to make large bets, you *need* to make large bets. The only way of having a high probability of exiting on the profitable side of one of your phase intervals is to make large bets. With small bets the law of large numbers will almost always force you to lose the phase (and the day). With a large bet you don’t exit with the 50/50 odds, but you at least win with appreciable probability. Also, with bad odds you find that above a certain size you no longer want to bet. The house has an expectation advantage proportional to the size of your bet, so eventually the bonus you are pulling per-bet (due to the house insuring losses) is overwhelmed and bets become unprofitable.

The fair-odds phases strategy pretty much guarantees a profit if one is allowed to play long enough. To do this you have to have enough bankroll to finance enough losing nights to have a good chance of a win, and you have to continue to have access to the discount.

We know three casinos eventually stopped giving Mr. Johnson a discount when he had nights where he took home $4 million, $5 million, and $6 million. Suppose the casino will immediately stop your play if you are ever ahead $3,500,000 or more (the article says casinos cut Mr. Johnson off on nights at winnings somewhat above this value), and not invite you back if you are ever net-ahead over a few nights.

When we add this possibility of getting barred: the overall betting scheme loses money if it loses 9 nights in a row (as we assumed the casino will not allow you to win enough to recoup that loss). Suppose you now play at the casino until you have a winning night and are barred (in our case a day that wins $3,500,000), or until you lose 10 nights in a row (and are thus down $4,000,000). Some calculation shows this scheme has a 30% chance of losing money and a positive expected value of $515,847.10. This represents a high risk expected return on the stake (the most money the scheme is prepared to lose) of around 13% in ten days. The Sharpe ratio is 0.18, which is very large for a 10 day investment (for example Morningstar quotes a good Sharpe ratio over an annual term as being 0.40, and an average one as being closer to 0.29). The idea is: Sharpe ratios on smaller time-period investments are necessarily smaller (smaller returns, and larger variances). A crude conversion to move from a monthly scale to a yearly scale would be to just look at how variance should decline (by a factor of about sqrt(12)), so a good monthly financial instrument might have a Sharpe ratio of around 0.4/sqrt(12) or around 0.12 (and this is ignoring any issue of compounding of value). So this is a high-risk but also high-return scheme.
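These figures can be reproduced directly (a sketch assuming the stated rules: win $3,500,000 with probability 1/8 each night, otherwise pay $400,000, giving up after 10 straight losses):

```python
# Expected value of the capped scheme: each night win $3,500,000
# with probability 1/8 (then stop, barred), else lose $400,000;
# give up after 10 losing nights in a row.
p, q = 0.125, 0.875

# win on night k nets $3,500,000 minus (k-1) earlier $400,000 losses
ev = sum(q ** (k - 1) * p * (3_500_000 - 400_000 * (k - 1))
         for k in range(1, 11))
ev += q ** 10 * (-4_000_000)   # lost all ten nights

# you end up behind if you lose all 10 nights, or win only on night
# 10 (the $3,500,000 no longer covers nine $400,000 losses)
p_lose_money = q ** 10 + q ** 9 * p   # simplifies to q**9, ~0.3007
print(round(ev, 2), round(p_lose_money, 4))
```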

To lower your risk you would want to play this scheme at multiple casinos, as the Atlantic reported Mr. Johnson did. Mr. Johnson is reported to have won millions from three casinos. Since the strategy has about a 70% win rate, it is safe to assume that to see 3 wins we would have to play at about 4 casinos.

Unfortunately my calculations show the chance of losing money remains high (now around 33%), though your Sharpe ratio is now 0.35, showing an improved reward to risk ratio. This is also assuming The Atlantic doesn’t write an article about you and get you barred from some casino private rooms before you finish your betting sequence.

I don’t actually play blackjack. I do however, love thinking about martingales.
