Category Archives: Tutorials

Efficient accumulation in R

R has a number of very good packages for manipulating and aggregating data (plyr, sqldf, ScaleR, data.table, and more), but when it comes to accumulating results the beginning R user is often at sea. The R execution model is a bit exotic, so many R users are uncertain which methods of accumulating results are efficient and which are not.
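
As a taste of the contrast at issue, here is a minimal sketch (not the article's code; mkRow is a hypothetical stand-in for real per-step work) of a common slow accumulation pattern next to a faster one:

mkRow = function(i) { data.frame(i = i, x = rnorm(1)) }  # hypothetical stand-in for real work

# slow pattern: growing a data frame re-copies it on every pass (quadratic work)
d = data.frame()
for (i in 1:1000) { d = rbind(d, mkRow(i)) }

# faster pattern: accumulate rows in a pre-allocated list and bind once at the end
rows = vector("list", 1000)
for (i in 1:1000) { rows[[i]] = mkRow(i) }
d2 = do.call(rbind, rows)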

Accumulating wheat (Photo: Cyron Ray Macey, some rights reserved)

In this latest “R as it is” (again in collaboration with our friends at Revolution Analytics) we will quickly become expert at efficiently accumulating results in R. Continue reading Efficient accumulation in R

Working with Sessionized Data 1: Evaluating Hazard Models

When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a “ready to analyze” form each individual example gets a row of data and every fact about the example is a column. Usually transforming data into this form is a matter of performing the equivalent of a number of SQL joins (for example, Lecture 23 (“The Shape of Data”) from our paid video course Introduction to Data Science discusses this).


One notable exception is log data. Log data is a very thin data form where different facts about different individuals are written across many different rows. Converting log data into a ready-for-analysis form is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time and of picking a business-appropriate goal when evaluating predictive models.

For this article we are going to assume that we have sessionized our data by picking a concrete near-term goal (predicting account cancellation or “exit” within the next 7 days) and that we have already selected variables for analysis (a number of time-lagged windows of recent log events of various types). We will use a simple model without variable selection as our first example, and we will use these results to show how to examine and evaluate these types of models. In later articles we will discuss how to sessionize, how to choose examples, how to select variables, and other key topics.
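
To make that concrete, here is a minimal sketch (synthetic data and hypothetical column names, not the article's dataset) of the kind of simple model we mean: a logistic regression predicting exit within 7 days from time-lagged event counts.

set.seed(2015)
d = data.frame(events_last_1_day = rpois(1000, 1),
               events_last_7_days = rpois(1000, 5))
# synthetic outcome: probability of exit rises with recent event counts
d$exit_next_7_days = runif(1000) < plogis(-2 + 0.3*d$events_last_1_day)
model = glm(exit_next_7_days ~ events_last_1_day + events_last_7_days,
            data = d, family = binomial(link = "logit"))
d$pred = predict(model, type = "response")   # predicted probability of exit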

Continue reading Working with Sessionized Data 1: Evaluating Hazard Models

What is a good Sharpe ratio?

We have previously written that we like the investment performance summary called the Sharpe ratio (though it does have some limits).

The Sharpe ratio gives you a dimensionless score for comparing similar investments that may vary in both riskiness and returns, without needing to know the investor’s risk tolerance. It does this by separating the task of valuing an investment (which can be done independently of the investor’s risk tolerance) from the task of allocating/valuing a portfolio (which must depend on the investor’s preferences).
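
For reference, here is a minimal sketch of the calculation itself (synthetic daily returns and an assumed risk-free rate, not the Yahoo data used later): mean excess return over its standard deviation, annualized by sqrt(252) trading days.

set.seed(42)
daily_returns = rnorm(252, mean = 0.0004, sd = 0.01)  # stand-in for real return data
risk_free_daily = 0.0001                              # assumed daily risk-free rate
excess = daily_returns - risk_free_daily
sharpe_annualized = sqrt(252) * mean(excess) / sd(excess)
sharpe_annualized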

But we have noticed that nobody is willing to say plainly what a good value for this number is. We will use the R analysis suite and Yahoo finance data to produce some real example Sharpe ratios here, so you can get a qualitative sense of the metric. Continue reading What is a good Sharpe ratio?

Why does designing a simple A/B test seem so complicated?

Why does planning something as simple as an A/B test always end up feeling so complicated?

An A/B test is a very simple controlled experiment where one group is subject to a new treatment (often group “B”) and the other group (often group “A”) is considered a control group. The classic example is attempting to compare defect rates of two production processes (the current process, and perhaps a new machine).

Illustration: Boris Artzybasheff
(photo James Vaughan, some rights reserved)
In our time an A/B test typically compares the conversion-to-sales rate of different web-traffic sources or different web-advertising creatives (like industrial defects, a low-rate process). An A/B test uses a randomized “at the same time” test design to help mitigate the impact of any possible interfering or omitted variables. So you do not run “A” on Monday and then “B” on Tuesday, but instead continuously route a fraction of your customers to each treatment. Roughly, a complete “test design” specifies: how much traffic to route to A, how much traffic to route to B, and how to choose between A and B after the results are available.

A/B testing is one of the simplest controlled experimental design problems possible (and one of the simplest examples of a Markov decision process). And that is part of the problem: it is likely the first time a person will need to truly worry about:

  • Power/Significance
  • Design of experiments
  • Defining utility
  • Priors or beliefs
  • Efficiency of inference

All of these are technical terms we will touch on in this article. However, we argue the biggest sticking point of A/B testing is that it requires a lot more communication between the business partner (sponsoring the test) and the analyst (designing and implementing the test) than a statistician or data scientist would care to admit. In this first article of a new series called “statistics as it should be” (in partnership with Revolution Analytics) we will discuss some of the essential issues in planning A/B tests. Continue reading Why does designing a simple A/B test seem so complicated?
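
As a small taste of the power/significance item above, here is a minimal sketch (assumed conversion rates, not from the article) using base R’s power.prop.test() to estimate how much traffic each arm needs to detect a lift from a 5% to a 6% conversion rate:

power.prop.test(p1 = 0.05, p2 = 0.06, power = 0.80, sig.level = 0.05)
# the reported n is the required sample size per group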

Neural net image salad again (with code)

Alexander Mordvintsev, Christopher Olah, and Mike Tyka recently posted a great research blog article where they tried to visualize what an image classification neural net “wants to see.” They achieve this by optimizing the input to correspond to a fixed pattern of neural net internal node activation. This generated truly beautiful and fascinating phantasmagorical images (or an “image salad,” by analogy to word salad). It is sort of like a search for eigenfaces (but a lot more fun).

A number of researchers had previously done this (many cited in their references), but the authors added more good ideas:

  • Enforce a “natural image constraint” by insisting on near-pixel correlations.
  • Start the search from another real image. For example: if the net’s internal activation is constrained to recognize buildings and you start the image optimization from a cloud, you can get a cloud with building structures. This is a great way to force interesting pareidolia-like effects.
  • They then “apply the algorithm iteratively on its own outputs and apply some zooming after each iteration.” This gives them wonderful fractal architecture with repeating motifs and beautiful interpolations.
  • Freeze the activation pattern on intermediate layers of the neural network.
  • (not claimed, but plausible given the look of the results) Use the access to the scoring gradient for final image polish (likely cleans up edges and improves resolution).

From Michael Tyka’s Inceptionism gallery

Likely this used a lot of GPU cycles. The question is, can we play with some of the ideas on our own (and on the cheap)? The answer is yes.
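
To illustrate the core idea at toy scale (a hedged sketch, not the article’s code): fix a tiny random linear “layer,” pick a target activation pattern, and gradient-step the input toward producing it, with a crude neighbor-averaging step standing in for the natural-image constraint.

set.seed(1)
n = 16                           # our "image" is just a length-n vector of pixels
W = matrix(rnorm(4*n), 4, n)     # fixed random weights: 4 hidden units
target = c(1, 0, 0, 1)           # activation pattern we want the input to evoke
x = rnorm(n, sd = 0.01)          # start from a near-blank input
for (step in 1:500) {
  a = as.numeric(W %*% x)                            # current activations
  x = x - 0.01 * as.numeric(t(W) %*% (a - target))   # step input toward target activations
  x = 0.95*x + 0.05*(c(x[-1], x[n]) + c(x[1], x[-n]))/2  # mild neighbor averaging ("natural image" stand-in)
}
round(as.numeric(W %*% x), 2)    # should be much closer to the target pattern than at the start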

I share complete instructions and complete code for a baby (couple of evenings) version of related effects. Continue reading Neural net image salad again (with code)

Betting with their money

The recent Atlantic article “The Man Who Broke Atlantic City” tells the story of Don Johnson, who won millions of dollars playing private-room, custom-rules, high-stakes blackjack. The method Mr. Johnson reportedly used is, surprisingly, not card counting (as made famous by Professor Edward O. Thorp in Beat the Dealer). It is instead likely an amazingly simple process I will call a martingale money pump. Naturally The Atlantic wouldn’t want to go into the math, but we can do that here.

Blackjack board (image: Wikimedia)
Continue reading Betting with their money

Wanted: A Perfect Scatterplot (with Marginals)

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki:


The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.
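
For example, a rugged version of the same scatterplot (a sketch using the same tips data as below) is just:

library(ggplot2)
frm = read.csv("tips.csv")
ggplot(frm, aes(x=total_bill, y=tip)) + geom_point() + geom_rug()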

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

library(ggplot2)
library(ggExtra)   # for ggMarginal()

frm = read.csv("tips.csv")

plot_center = ggplot(frm, aes(x=total_bill, y=tip)) +
  geom_point()

# default: type="density"
ggMarginal(plot_center, type="histogram")

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.


The ggMarginal() function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.

# our own (very beta) plot package: details later
frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            title="Tips vs. Total Bill")


You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the ggMarginal version. If you’re curious, the code is here. It relies on some functions in the file sharedFunctions.R in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.

Thanks to Slawa Rokicki’s excellent “ggplot2: Cheatsheet for Visualizing Distributions” for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.

I still think you can manufacture an unfair coin

In Gelman and Nolan’s paper “You Can Load a Die, But You Can’t Bias a Coin” (The American Statistician, November 2002, Vol. 56, No. 4) it is argued that you can’t easily produce a coin that is biased when flipped (and caught). A number of variations that can easily be biased (such as spinning) are also discussed.

Obviously Gelman and Nolan are smart and careful people, and we are discussing a well-regarded peer-reviewed article, so we don’t expect there is a major error. What we say is that the abstraction they are using doesn’t match the physical abstraction I would pick. I pick a different one, and I get different results. This is what I would like to discuss. Continue reading I still think you can manufacture an unfair coin