A client recently came to us with a question: what’s a good way to monitor data or model output for changes? That is, how can you tell if new data is distributed differently from previous data, or if the distribution of scores returned by a model have changed? This client, like many others who have faced the same problem, simply checked whether the mean and standard deviation of the data had changed more than some amount, where the threshold value they checked against was selected in a more or less ad-hoc manner. But they were curious whether there was some other, perhaps more principled way, to check for a change in distribution.
We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas do programmers find interesting about vectors, what concepts do they consider safe starting points, and how to condense and present the material.
Please check the lectures out.
- Vectors for Programmers and Data Scientists (with Python examples)
- Vectors for Programmers and Data Scientists (with R examples)
- (support materials for both lectures)
While working on a variation of the
RcppDynProg algorithm we derived the following beautiful identity of 2 by 2 real matrices:
The superscript “top” denoting the transpose operation, the ||.||^2_2 denoting sum of squares norm, and the single |.| denoting determinant.
Here at Win-Vector LLC we like permutation tests. Our team has written on them (for example: How Do You Know if Your Data Has Signal?) and they are used to estimate significances in our sigr and WVPlots R packages. For example permutation methods are used to estimate the significance reported in the following ROC plot.
Permutation tests have their own literature and issues (examples: Permutation, Parametric and Bootstrap Tests of Hypotheses, Springer-Verlag, NY, 1994 (3rd edition, 2005), 2, 3, and 4).
In our R packages the permutation tests are estimated by a sampling procedure, and not computed exactly (or deterministically). It turns out this is likely a necessary concession; a complete exact permutation test procedure at scale would be big news. Please read on for my comments on this issue.
Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing effects of relative error distributions (so it isn’t an exotic idea you bring to analysis, it is a likely fact about the world that comes at you). The article is a good example of how to plot and reason about such situations.
I am just going to add a few additional references (mostly from Nina) and some more discussion on log-normal distributions versus Zipf-style distributions or Pareto distributions. Continue reading Relative error distributions, without the heavy tail theatrics
Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?”
My concrete advice is:
- Read Nina Zumel’s excellent series on scoring classifiers.
- Keep notes.
- Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you want a flexible score) and “deviance” late in a project (when you want a strict score).
- When working on practical problems work with your business partners to find out which of precision/recall, or sensitivity/specificity most match their business needs. If you have time show them and explain the ROC plot and invite them to price and pick points along the ROC curve that most fit their business goals. Finance partners will rapidly recognize the ROC curve as “the efficient frontier” of classifier performance and be very comfortable working with this summary.
That being said it always seems like there is a bit of gamesmanship in that somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Some choice of measure is signaling your pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background) and hoping to befuddle others.
The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.
In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in a more automated fashion.
At Strata+Hadoop World “R Day” Tutorial, Tuesday, March 29 2016, San Jose, California we spent some time on classifier measures derived from the so-called “confusion matrix.”
We repeated our usual admonition to not use “accuracy itself” as a project quality goal (business people tend to ask for it as it is the word they are most familiar with, but it usually isn’t what they really want).
One reason not to use accuracy: an example where a classifier that does nothing is “more accurate” than one that actually has some utility. (Figure credit Nina Zumel, slides here)
And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more).
Please read on to see what stood out. Continue reading A bit on the F1 score floor
In my recent article on optimizing set diversity I mentioned the primary abstraction was of “diminishing returns” and is formalized by the theory of monotone submodular functions (though I did call out some of my own work which used a different abstraction). A proof that appears again and again in the literature is: showing that when maximizing a monotone submodular function the greedy algorithm run for k steps picks a set that is scores no worse than
1-1/e less than the unknown optimal pick (or picks up at least
63% of the possible value). This is significant, because naive optimization may only pick a set of value
1/k of the value of the optimal selection.
The proof that the greedy algorithm does well in maximizing monotone increasing submodular functions is clever and a very good opportunity to teach about reading and writing mathematical proofs. The point is: one needs an active reading style as: most of what is crucial to a proof isn’t written, and that which is written in a proof can’t all be pivotal (else proofs would be a lot more fragile than they actually are).
Uwe Kils “Iceberg”
In this article I am attempting to reproduce some fraction of the insight found in: Polya “How to Solve It” (1945) and Doron Zeilberger “The Method of Undetermined Generalization and Specialization Illustrated with Fred Galvin’s Amazing Proof of the Dinitz Conjecture” (1994).
So I repeat the proof here (with some annotations and commentary). Continue reading Reading and writing proofs
One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book Practical Data Science with R.
We also came upon another cool approach, in the
mixtools package for mixture model analysis. As with clustering, if you want to fit a mixture model (say, a mixture of gaussians) to your data, it helps to know how many components are in your mixture. The
boot.comp function estimates the number of components (let’s call it k) by incrementally testing the hypothesis that there are k+1 components against the null hypothesis that there are k components, via parametric bootstrap.
You can use a similar idea to estimate the number of clusters in a clustering problem, if you make a few assumptions about the shape of the clusters. This approach is only heuristic, and more ad-hoc in the clustering situation than it is in mixture modeling. Still, it’s another approach to add to your toolkit, and estimating the number of clusters via a variety of different heuristics isn’t a bad idea.