Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing effects of relative error distributions (so it isn’t an exotic idea you bring to analysis, it is a likely fact about the world that comes at you). The article is a good example of how to plot and reason about such situations.

Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you want a flexible score) and “deviance” late in a project (when you want a strict score).

When working on practical problems work with your business partners to find out which of precision/recall, or sensitivity/specificity most match their business needs. If you have time show them and explain the ROC plot and invite them to price and pick points along the ROC curve that most fit their business goals. Finance partners will rapidly recognize the ROC curve as “the efficient frontier” of classifier performance and be very comfortable working with this summary.

That being said it always seems like there is a bit of gamesmanship in that somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Some choice of measure is signaling your pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background) and hoping to befuddle others.

Stanley Wyatt illustration from “Mathmanship” Nicholas Vanserg, 1958, collected in A Stress Analysis of a Strapless Evening Gown, Robert A. Baker, Prentice-Hall, 1963

The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.

In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in a more automated fashion.

One reason not to use accuracy: an example where a classifier that does nothing is “more accurate” than one that actually has some utility. (Figure credit Nina Zumel, slides here)

And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more).

In my recent article on optimizing set diversity I mentioned the primary abstraction was of “diminishing returns” and is formalized by the theory of monotone submodular functions (though I did call out some of my own work which used a different abstraction). A proof that appears again and again in the literature is: showing that when maximizing a monotone submodular function the greedy algorithm run for k steps picks a set that is scores no worse than 1-1/e less than the unknown optimal pick (or picks up at least 63% of the possible value). This is significant, because naive optimization may only pick a set of value 1/k of the value of the optimal selection.

The proof that the greedy algorithm does well in maximizing monotone increasing submodular functions is clever and a very good opportunity to teach about reading and writing mathematical proofs. The point is: one needs an active reading style as: most of what is crucial to a proof isn’t written, and that which is written in a proof can’t all be pivotal (else proofs would be a lot more fragile than they actually are).

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book Practical Data Science with R.

We also came upon another cool approach, in the mixtools package for mixture model analysis. As with clustering, if you want to fit a mixture model (say, a mixture of gaussians) to your data, it helps to know how many components are in your mixture. The boot.comp function estimates the number of components (let’s call it k) by incrementally testing the hypothesis that there are k+1 components against the null hypothesis that there are k components, via parametric bootstrap.

You can use a similar idea to estimate the number of clusters in a clustering problem, if you make a few assumptions about the shape of the clusters. This approach is only heuristic, and more ad-hoc in the clustering situation than it is in mixture modeling. Still, it’s another approach to add to your toolkit, and estimating the number of clusters via a variety of different heuristics isn’t a bad idea.

We here at Win-Vector LLC been working through an ad-hoc series about A/B testing combining elements of both operations research and statistical points of view.

Our most recent article was a dynamic programming solution to the A/B test problem. Explicitly solving such dynamic programs gets long and tedious, so you are well served by finding and introducing clever invariants to track (something better than just raw win-rates). That clever idea is called “sequential analysis” and was introduced by Abraham Wald (somebody we have written about before). If you have ever heard of a test plan such as “first process to get more than 30 wins ahead of the other is the one we choose” you have seen methods derived from Wald’s sequential analysis technique.

Our last article on A/B testing described the scope of the realistic circumstances of A/B testing in practice and gave links to different standard solutions. In this article we will be take an idealized specific situation allowing us to show a particularly beautiful solution to one very special type of A/B test.

For this article we are assigning two different advertising message to our potential customers. The first message, called “A”, we have been using a long time, and we have a very good estimate at what rate it generates sales (we are going to assume all sales are for exactly $1, so all we are trying to estimate rates or probabilities). We have a new proposed advertising message, called “B”, and we wish to know does B convert traffic to sales at a higher rate than A?

We are assuming:

We know exact rate of A events.

We know exactly how long we are going to be in this business (how many potential customers we will ever attempt to message, or the total number of events we will ever process).

The goal is to maximize expected revenue over the lifetime of the project.

As we wrote in our previous article: in practice you usually do not know the answers to the above questions. There is always uncertainty in the value of the A-group, you never know how long you are going to run the business (in terms of events or in terms of time, and you would also want to time-discount any far future revenue), and often you value things other than revenue (valuing knowing if B is greater than A, or even maximizing risk adjusted returns instead of gross returns). This represents severe idealization of the A/B testing problem, one that will let us solve the problem exactly using fairly simple R code. The solution comes from the theory of binomial option pricing (which is in turn related to Pascal’s triangle).

What the Sharpe ratio does is: give you a dimensionless score to compare similar investments that may vary both in riskiness and returns without needing to know the investor’s risk tolerance. It does this by separating the task of valuing an investment (which can be made independent of the investor’s risk tolerance) from the task of allocating/valuing a portfolio (which must depend on the investor’s preferences).