Category Archives: Pragmatic Machine Learning

A bit about Win-Vector LLC

Win-Vector LLC is a consultancy founded in 2007 that specializes in research, algorithms, data-science, and training. (The name is an attempt at a mathematical pun.)

Win-Vector LLC can complete your high value project quickly (some examples), and train your data science team to work much more effectively. Our consultants include the authors of Practical Data Science with R and also the video course Introduction to Data Science. We now offer on site custom master classes in data science and R.

IMG 6061

Please reach out to us at contact@win-vector.com for research, consulting, or training.

Follow us on (Twitter @WinVectorLLC), and sharpen your skills by following our technical blog (link, RSS).

Why does designing a simple A/B test seem so complicated?

Why does planning something as simple as an A/B test always end up feeling so complicated?

An A/B test is a very simple controlled experiment where one group is subject to a new treatment (often group “B”) and the other group (often group “A”) is considered a control group. The classic example is attempting to compare defect rates of two production processes (the current process, and perhaps a new machine).


4140852076 4a9da0a43f o
Illustration: Boris Artzybasheff
(photo James Vaughan, some rights reserved)
In our time an A/B test typically compares the conversion to sales rate of different web-traffic sources or different web-advertising creatives (like industrial defects, a low rate process). An A/B test uses a randomized “at the same time” test design to help mitigate the impact of any possible interfering or omitted variables. So you do not run “A” on Monday and then “B” on Tuesday, but instead continuously route a fraction of your customers to each treatment. Roughly a complete “test design” is: how much traffic to route to A, how much traffic to route to B, and how to chose A versus B after the results are available.

A/B testing is one of the simplest controlled experimental design problems possible (and one of the simplest examples of a Markov decision process). And that is part of the problem: it is likely the first time a person will need to truly worry about:

  • Power/Significance
  • Design of experiments
  • Defining utility
  • Priors or beliefs
  • Efficiency of inference

All of these are technical terms we will touch on in this article. However, we argue the biggest sticking point of A/B testing is: it requires a lot more communication between the business partner (sponsoring the test) and the analyst (designing and implementing the test) than a statistician or data scientist would care to admit. In this first article of a new series called “statistics as it should be” (in partnership with Revolution Analytics) we will discuss some of the essential issues in planning A/B tests. Continue reading Why does designing a simple A/B test seem so complicated?

Wanted: A Perfect Scatterplot (with Marginals)

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki:

NewImage

The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

library(ggplot2)
library(ggExtra)
frm = read.csv("tips.csv")

plot_center = ggplot(frm, aes(x=total_bill,y=tip)) + 
  geom_point() +
  geom_smooth(method="lm")

# default: type="density"
ggMarginal(plot_center, type="histogram")

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.

PltggMarginal

The ggMarginal() function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.

# our own (very beta) plot package: details later
library(WVPlots)
frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            smoothmethod="lm",
            annot_size=3,
            title="Tips vs. Total Bill")

PlotPkg

You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the ggMarginal version. If you’re curious, the code is here. It relies on some functions in the file sharedFunctions.R in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.

Thanks to Slawa Rokicki’s excellent ggplot2: Cheatsheet for Visualizing Distributions for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.

What is new in the vtreat library?

The Win-Vector LLC vtreat library is a library we supply (under a GPL license) for automating the simple domain independent part of variable cleaning an preparation.

The idea is you supply (in R) an example general data.frame to vtreat’s designTreatmentsC method (for single-class categorical targets) or designTreatmentsN method (for numeric targets) and vtreat returns a data structure that can be used to prepare data frames for training and scoring. A vtreat-prepared data frame is nice in the sense:

  • All result columns are numeric.
  • No odd type columns (dates, lists, matrices, and so on) are present.
  • No columns have NA, NaN, +-infinity.
  • Categorical variables are expanded into multiple indicator columns with all levels present which is a good encoding if you are using any sort of regularization in your modeling technique.
  • No rare indicators are encoded (limiting the number of indicators on the translated data.frame).
  • Categorical variables are also impact coded, so even categorical variables with very many levels (like zip-codes) can be safely used in models.
  • Novel levels (levels not seen during design/train phase) do not cause NA or errors.

The idea is vtreat automates a number of standard inspection and preparation steps that are common to all predictive analytic projects. This leaves the data scientist more time to work on important domain specific steps. vtreat also leaves as much of variable selection to the down-stream modeling software. The goal of vtreat is to reliably (and repeatably) generate a data.frame that is safe to work with.

This note explains a few things that are new in the vtreat library. Continue reading What is new in the vtreat library?

Does Balancing Classes Improve Classifier Performance?

It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical of the claim that artificially balancing the classes (through resampling, for instance) always helps, when the model is to be run on a population with the native class prevalences.

On the other hand, there are situations where balancing the classes, or at least enriching the prevalence of the rarer class, might be necessary, if not desirable. Fraud detection, anomaly detection, or other situations where positive examples are hard to get, can fall into this case. In this situation, I’ve suspected (without proof) that SVM would perform well, since the formulation of hard-margin SVM is pretty much distribution-free. Intuitively speaking, if both classes are far away from the margin, then it shouldn’t matter whether the rare class is 10% or 49% of the population. In the soft-margin case, of course, distribution starts to matter again, but perhaps not as strongly as with other classifiers like logistic regression, which explicitly encodes the distribution of the training data.

So let’s run a small experiment to investigate this question.

Continue reading Does Balancing Classes Improve Classifier Performance?

Random Test/Train Split is not Always Enough

Most data science projects are well served by a random test/train split. In our book Practical Data Science with R we strongly advise preparing data and including enough variables so that data is exchangeable, and scoring classifiers using a random test/train split.

With enough data and a big enough arsenal of methods, it’s relatively easy to find a classifier that looks good; the trick is finding one that is good. What many data science practitioners (and consumers) don’t seem to remember is that when evaluating a model, a random test/train split may not always be enough.

Continue reading Random Test/Train Split is not Always Enough

The Geometry of Classifiers

As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado, et.al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI Machine Learning Repository. For fun, we decided to do a follow-up study, using their data and several classifier implementations from scikit-learn, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing if there is a “geometry” of classifiers: which classifiers produce predictions patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.

Continue reading The Geometry of Classifiers

Bias/variance tradeoff as gamesmanship

Continuing our series of reading out loud from a single page of a statistics book we look at page 224 of the 1972 Dover edition of Leonard J. Savage’s “The Foundations of Statistics.” On this page we are treated to an example attributed to Leo A. Goodman in 1953 that illustrates how for normally distributed data the maximum likelihood, unbiased, and minimum variance estimators of variance are in fact typically three different values. So in the spirit of gamesmanship you always have at least two reasons to call anybody else’s estimator incorrect. Continue reading Bias/variance tradeoff as gamesmanship