Category Archives: Statistics

Efficient accumulation in R

R has a number of very good packages for manipulating and aggregating data (plyr, sqldf, ScaleR, data.table, and more), but when it comes to accumulating results the beginning R user is often at sea. The R execution model is a bit exotic so many R users are very uncertain which methods of accumulating results are efficient and which are inefficient.

Accumulating wheat (Photo: Cyron Ray Macey, some rights reserved)

In this latest “R as it is” (again in collaboration with our friends at Revolution Analytics) we will quickly become expert at efficiently accumulating results in R.
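As a rough illustration of the kind of contrast at stake (a minimal sketch, not code from the post itself), here are three ways to accumulate the same vector of squares. The variable names and the choice of `n` are hypothetical.

```r
n <- 10000

# Growing a vector one element at a time: each c() call copies the result so far
slow <- c()
for (i in seq_len(n)) {
  slow <- c(slow, i^2)
}

# Pre-allocating the result, then filling it in place
fast <- numeric(n)
for (i in seq_len(n)) {
  fast[i] <- i^2
}

# Letting an apply-style function allocate and fill the result
applied <- vapply(seq_len(n), function(i) i^2, numeric(1))
```

All three produce the same values; wrapping each in `system.time()` is a simple way to compare their cost on your own machine.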

Working with Sessionized Data 2: Variable Selection

In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we sessionized the data by considering all possible aggregations (window widths) of the data as features. Such naive sessionization can quickly lead to very wide data sets, with potentially more features than you have datums (and collinear features, as well). In this post, we will use the same example, but try to select our features more intelligently.

Illustration: Boris Artzybasheff
photo: James Vaughan, some rights reserved

The Example Problem

Recall that you have a mobile app with both free (A) and paid (B) actions; if a customer’s tasks involve too many paid actions, they will abandon the app. Your goal is to detect when a customer is in a state when they are likely to abandon, and offer them (perhaps through an in-app ad) a more economical alternative, for example a “Pro User” subscription that allows them to do what they are currently doing at a lower rate. You don’t want to be too aggressive about showing customers this ad, because showing it to someone who doesn’t need the subscription service is likely to antagonize them (and convince them to stop using your app).

You want to build a model that predicts whether a customer will abandon the app (“exit”) within seven days. Your training set covers 648 customers who were present on a specific reference day (“day 0”). For each customer it records their activity on day 0 and on the ten days before it (days 1 through 10), and how many days until the customer exited (Inf for customers who never exit), counting from day 0. For each day, you constructed all possible windows within those ten days, and counted the relative rates of A events and B events in each window. This gives you 132 features per row. You also have a hold-out set of 660 customers, with the same structure. You can download the wide data set used for these examples as an .rData file here. The explanation of the variable names is in the previous post in this series.
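One way to arrive at 132 features is to take every contiguous window over the 11 days (day 0 through day 10) for each of the two event types. The sketch below does this for a single made-up customer. The counts, the function names, the feature names, and the definition of a window's rate (mean events per day) are all assumptions for illustration, not the construction or naming used in the actual data set.

```r
# Hypothetical daily event counts for one customer; index 1 is day 0,
# index 11 is day 10 (ten days before the reference day).
set.seed(42)
days <- 0:10
countsA <- rpois(length(days), lambda = 3)
countsB <- rpois(length(days), lambda = 1)

# All contiguous windows [start, end] with start <= end: 11 * 12 / 2 = 66 of them.
windows <- subset(expand.grid(start = days, end = days), start <= end)

# Assumed definition of a window's rate: mean events per day in the window.
windowRate <- function(counts, start, end) mean(counts[(start:end) + 1])

# One named feature per window for a given event type.
featureRow <- function(counts, prefix) {
  rates <- mapply(windowRate, start = windows$start, end = windows$end,
                  MoreArgs = list(counts = counts))
  names(rates) <- paste(prefix, windows$start, windows$end, sep = "_")
  rates
}

features <- c(featureRow(countsA, "A"), featureRow(countsB, "B"))
length(features)  # 66 windows x 2 event types = 132
```

Many of these windows overlap heavily, which is where the collinearity mentioned earlier comes from.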

In the previous installment, we built a regularized (ridge) logistic regression model over all 132 features. This model didn’t perform too badly, but in general there is more danger of overfitting when working with very wide data sets; in addition, it is quite expensive to analyze a large number of variables with standard implementations of logistic regression. In this installment, we will look for potentially more robust and less expensive ways of analyzing this data.


Working with Sessionized Data 1: Evaluating Hazard Models

When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a “ready to analyze” form each individual example gets a row of data and every fact about the example is a column. Usually transforming data into this form is a matter of performing the equivalent of a number of SQL joins (for example, Lecture 23 (“The Shape of Data”) from our paid video course Introduction to Data Science discusses this).
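As a small illustration of the join step, here is a sketch in base R using `merge()`. The two tables and their column names are hypothetical stand-ins for separate systems of record.

```r
# Two hypothetical systems of record, keyed by customer id
customers <- data.frame(custID = c(1, 2, 3),
                        region = c("east", "west", "east"),
                        stringsAsFactors = FALSE)
billing <- data.frame(custID = c(1, 2, 3),
                      plan = c("free", "paid", "paid"),
                      stringsAsFactors = FALSE)

# Equivalent of a SQL left join: one row per customer, one column per fact
wide <- merge(customers, billing, by = "custID", all.x = TRUE)
```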


One notable exception is log data. Log data is a very thin data form where different facts about different individuals are written across many different rows. Converting log data into a ready for analysis form is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time and of picking a business appropriate goal when evaluating predictive models.
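To make the thin-versus-wide distinction concrete, here is a minimal sketch that rolls a tiny made-up event log up to one row per customer. The log contents and column names are hypothetical, and real sessionizing also has to deal with time windows, which this ignores.

```r
# Hypothetical event log: one row per event, several rows per customer
logData <- data.frame(
  custID = c(1, 1, 2, 1, 2, 3),
  day    = c(0, 0, 0, 1, 1, 1),
  event  = c("A", "B", "A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Count events of each type per customer: one row per customer, one column per event type
counts <- as.data.frame.matrix(table(logData$custID, logData$event))
counts$custID <- rownames(counts)
```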

For this article we are going to assume that we have sessionized our data by picking a concrete near-term goal (predicting cancellation of account or “exit” within the next 7 days) and that we have already selected variables for analysis (a number of time-lagged windows of recent log events of various types). We will use a simple model without variable selection as our first example. We will use these results to show how you examine and evaluate these types of models. In later articles we will discuss how you sessionize, how you choose examples, variable selection, and other key topics.


A dynamic programming solution to A/B test design

Our last article on A/B testing described the scope of the realistic circumstances of A/B testing in practice and gave links to different standard solutions. In this article we will take an idealized specific situation, allowing us to show a particularly beautiful solution to one very special type of A/B test.

For this article we are assigning two different advertising messages to our potential customers. The first message, called “A”, we have been using a long time, and we have a very good estimate of the rate at which it generates sales (we are going to assume all sales are for exactly $1, so all we are trying to estimate are rates or probabilities). We have a new proposed advertising message, called “B”, and we wish to know: does B convert traffic to sales at a higher rate than A?

We are assuming:

  • We know the exact rate of A events.
  • We know exactly how long we are going to be in this business (how many potential customers we will ever attempt to message, or the total number of events we will ever process).
  • The goal is to maximize expected revenue over the lifetime of the project.

As we wrote in our previous article: in practice you usually do not know the answers to the above questions. There is always uncertainty in the value of the A-group, you never know how long you are going to run the business (in terms of events or in terms of time, and you would also want to time-discount any far future revenue), and often you value things other than revenue (valuing knowing if B is greater than A, or even maximizing risk adjusted returns instead of gross returns). This represents a severe idealization of the A/B testing problem, one that will let us solve the problem exactly using fairly simple R code. The solution comes from the theory of binomial option pricing (which is in turn related to Pascal’s triangle).

Yang Hui (ca. 1238–1298) (Pascal’s) triangle, as depicted by the Chinese using rod numerals.

For this “statistics as it should be” (in partnership with Revolution Analytics) article let us work the problem (using R) pretending things are this simple.
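Under the three assumptions listed above, a backward-induction calculation over a Pascal-style triangle of states might be sketched as follows. This is not the code from the full article. The function name `abValue` is hypothetical, and the uniform Beta(1,1) prior on B's unknown rate is an added assumption that the excerpt does not specify.

```r
# Expected lifetime revenue under an optimal policy (a sketch under stated assumptions).
# Assumptions: A's rate pA is known exactly; B's unknown rate has a uniform
# Beta(1,1) prior; every sale is worth $1; N >= 1 is the known total number of events.
abValue <- function(N, pA) {
  # State (n, k): B has been shown n times with k sales, so N - n events remain.
  # V[k + 1] holds the value of state (n, k) for the n currently being processed.
  V <- numeric(N + 1)            # at n = N no events remain, so every value is 0
  for (n in (N - 1):0) {
    k <- 0:n
    pB <- (k + 1) / (n + 2)      # posterior mean of B's rate after k sales in n trials
    tryB <- pB * (1 + V[k + 2]) + (1 - pB) * V[k + 1]
    stayA <- (N - n) * pA        # A teaches us nothing new, so moving to A is permanent
    V <- pmax(tryB, stayA)
  }
  V[1]
}
```

For example, `abValue(1, 0.3)` returns 0.5: with a single event and a prior mean of 0.5 for B, trying B beats the known 0.3 rate of A. With more events remaining, the value of trying B also includes what is learned for later decisions.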

A bit about Win-Vector LLC

Win-Vector LLC is a consultancy founded in 2007 that specializes in research, algorithms, data-science, and training. (The name is an attempt at a mathematical pun.)

Win-Vector LLC can complete your high value project quickly (some examples), and train your data science team to work much more effectively. Our consultants include the authors of Practical Data Science with R and also the video course Introduction to Data Science. We now offer on site custom master classes in data science and R.


Please reach out to us for research, consulting, or training.

Follow us on (Twitter @WinVectorLLC), and sharpen your skills by following our technical blog (link, RSS).

Why does designing a simple A/B test seem so complicated?

Why does planning something as simple as an A/B test always end up feeling so complicated?

An A/B test is a very simple controlled experiment where one group is subject to a new treatment (often group “B”) and the other group (often group “A”) is considered a control group. The classic example is attempting to compare defect rates of two production processes (the current process, and perhaps a new machine).

Illustration: Boris Artzybasheff
(photo James Vaughan, some rights reserved)
In our time an A/B test typically compares the conversion to sales rate of different web-traffic sources or different web-advertising creatives (like industrial defects, a low rate process). An A/B test uses a randomized “at the same time” test design to help mitigate the impact of any possible interfering or omitted variables. So you do not run “A” on Monday and then “B” on Tuesday, but instead continuously route a fraction of your customers to each treatment. Roughly, a complete “test design” is: how much traffic to route to A, how much traffic to route to B, and how to choose A versus B after the results are available.
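The randomized routing described above can be sketched in a few lines of R. The customer count and the fraction sent to B are hypothetical planning numbers, not recommendations.

```r
set.seed(2015)
nCustomers <- 10000   # hypothetical number of arriving customers
fractionB <- 0.10     # hypothetical share of traffic routed to the new treatment

# Each customer is independently routed to B with probability fractionB,
# so both treatments run at the same time rather than on different days
treatment <- ifelse(runif(nCustomers) < fractionB, "B", "A")
table(treatment)
```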

A/B testing is one of the simplest controlled experimental design problems possible (and one of the simplest examples of a Markov decision process). And that is part of the problem: it is likely the first time a person will need to truly worry about:

  • Power/Significance
  • Design of experiments
  • Defining utility
  • Priors or beliefs
  • Efficiency of inference

All of these are technical terms we will touch on in this article. However, we argue the biggest sticking point of A/B testing is: it requires a lot more communication between the business partner (sponsoring the test) and the analyst (designing and implementing the test) than a statistician or data scientist would care to admit. In this first article of a new series called “statistics as it should be” (in partnership with Revolution Analytics) we will discuss some of the essential issues in planning A/B tests.
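As one concrete instance of the power/significance question, base R's `power.prop.test()` can estimate how many customers each group needs. The conversion rates, significance level, and power below are hypothetical planning inputs; choosing them is exactly the kind of decision that needs the business partner.

```r
# Hypothetical planning inputs: A converts at 1.0%, and we want to detect B at 1.2%
design <- power.prop.test(p1 = 0.010, p2 = 0.012,
                          sig.level = 0.05, power = 0.80)

# Required sample size per group, rounded up to whole customers
nPerGroup <- ceiling(design$n)
nPerGroup
```

Low-rate processes such as conversions need large samples to separate small differences, which is part of why test design matters.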

I do not believe Google invented the term A/B test

The June 4, 2015 Wikipedia entry on A/B Testing claims Google data scientists were the origin of the term “A/B test”:

Google data scientists ran their first A/B test at the turn of the millennium to determine the optimum number of results to display on a search engine results page.[citation needed] While this was the origin of the term, very similar methods had been used by marketers long before “A/B test” was coined. Common terms used before the internet era were “split test” and “bucket test”.

It is very unlikely Google data scientists were the first to use the informal shorthand “A/B test.” Test groups have been routinely called “A” and “B” at least as early as the 1940s. So it would be natural for any working group to informally call their test comparing abstract groups “A” and “B” an “A/B test” from time to time. Statisticians are famous for using the names of variables (merely chosen by convention) as formal names of procedures (p-values, t-tests, and many more).

Even if other terms were dominant in earlier writing, it is likely A/B test was used in speech. And writings of our time are sufficiently informal (or like speech) that they should be compared to earlier speech, not just earlier formal writing.

Image: apothecary’s balance with steel beam and brass pans (Wellcome L0058880)

That being said, a quick search yields some examples of previous use. We list but a few below.

Wanted: A Perfect Scatterplot (with Marginals)

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki:


The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

library(ggplot2)
library(ggExtra)

frm = read.csv("tips.csv")

# central scatterplot of tip against total bill
plot_center = ggplot(frm, aes(x=total_bill,y=tip)) + 
  geom_point()

# default: type="density"
ggMarginal(plot_center, type="histogram")

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.
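For reference, the numbers behind such an annotation are easy to compute in base R. The sketch below uses synthetic stand-in data (the real tips.csv is not reproduced here), with column names chosen to match the plot above.

```r
# Synthetic stand-in for the tips data; values are made up for illustration
set.seed(1)
frm <- data.frame(total_bill = runif(100, min = 5, max = 50))
frm$tip <- 1 + 0.1 * frm$total_bill + rnorm(100)

# Goodness of the linear fit, two ways
fit <- lm(tip ~ total_bill, data = frm)
rSquared <- summary(fit)$r.squared
pearson <- cor(frm$total_bill, frm$tip)

# For a one-variable linear fit with an intercept, R-squared is the squared Pearson correlation
all.equal(rSquared, pearson^2)
```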


The ggMarginal() function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.

# our own (very beta) plot package: details later
frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            title="Tips vs. Total Bill")


You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the ggMarginal version. If you’re curious, the code is here. It relies on some functions in the file sharedFunctions.R in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.

Thanks to Slawa Rokicki’s excellent ggplot2: Cheatsheet for Visualizing Distributions for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.

What is new in the vtreat library?

The Win-Vector LLC vtreat library is a library we supply (under a GPL license) for automating the simple domain independent part of variable cleaning and preparation.

The idea is you supply (in R) an example general data.frame to vtreat’s designTreatmentsC method (for single-class categorical targets) or designTreatmentsN method (for numeric targets) and vtreat returns a data structure that can be used to prepare data frames for training and scoring. A vtreat-prepared data frame is nice in the sense:

  • All result columns are numeric.
  • No odd type columns (dates, lists, matrices, and so on) are present.
  • No columns have NA, NaN, +-infinity.
  • Categorical variables are expanded into multiple indicator columns with all levels present which is a good encoding if you are using any sort of regularization in your modeling technique.
  • No rare indicators are encoded (limiting the number of indicators on the translated data.frame).
  • Categorical variables are also impact coded, so even categorical variables with very many levels (like zip-codes) can be safely used in models.
  • Novel levels (levels not seen during design/train phase) do not cause NA or errors.
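To give a feel for two of the properties above (impact coding, and novel levels not producing NA), here is a deliberately simplified base R sketch. It is not vtreat's implementation and does not use the vtreat API. The data, the `encodeZip` helper, and the specific formula (level mean minus grand mean) are assumptions for illustration.

```r
# Hypothetical training data: a categorical variable and a numeric outcome
dTrain <- data.frame(zip = c("a", "a", "b", "b", "c"),
                     y = c(1, 0, 1, 1, 0),
                     stringsAsFactors = FALSE)

# Simplified impact code: how far each level's mean outcome sits from the grand mean
grandMean <- mean(dTrain$y)
levelMeans <- tapply(dTrain$y, dTrain$zip, mean)
impactLevels <- names(levelMeans)
impactValues <- as.numeric(levelMeans) - grandMean

# Encode new data as a single numeric column; unseen levels get 0 (no evidence either way)
encodeZip <- function(x) {
  v <- impactValues[match(x, impactLevels)]
  v[is.na(v)] <- 0
  v
}

encodeZip(c("a", "b", "zzz"))
```

The result is a single numeric column regardless of how many levels the variable has, which is why this style of coding scales to variables like zip codes.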

The idea is vtreat automates a number of standard inspection and preparation steps that are common to all predictive analytic projects. This leaves the data scientist more time to work on important domain specific steps. vtreat also leaves as much of variable selection as possible to the downstream modeling software. The goal of vtreat is to reliably (and repeatably) generate a data.frame that is safe to work with.

This note explains a few things that are new in the vtreat library.