Relative error distributions, without the heavy tail theatrics


Nina Zumel prepared an excellent article on the consequences of working with relative-error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing the effects of relative error distributions (so it isn’t an exotic idea you bring to the analysis; it is a likely fact about the world that comes at you). The article is a good example of how to plot and reason about such situations.

I am just going to add a few additional references (mostly from Nina) and some more discussion of log-normal distributions versus Zipf-style distributions or Pareto distributions.
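As a small illustration of the distinction (our own sketch, not taken from the article; the parameter choices are arbitrary), one can sample from both families in R and compare upper quantiles. The Pareto tail decays only polynomially, so its extreme quantiles run away far faster than the log-normal’s:

```r
# Compare log-normal and Pareto samples: both are right-skewed,
# but the Pareto tail is polynomial while the log-normal tail falls faster.
set.seed(2016)
n <- 100000
lnormDraws <- rlnorm(n, meanlog = 0, sdlog = 1)
# Pareto sample via inverse-CDF: x = xmin / U^(1/alpha)
alpha <- 1.5
xmin <- 1
paretoDraws <- xmin / runif(n)^(1/alpha)
# the extreme quantiles separate the two families dramatically
print(quantile(lnormDraws, c(0.5, 0.99, 0.999)))
print(quantile(paretoDraws, c(0.5, 0.99, 0.999)))
```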

Did she know we were writing a book?


Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge.

Nina Zumel and I definitely deliberated over the possibilities for some time before deciding to write Practical Data Science with R (Nina Zumel, John Mount, Manning 2014).


In the end we worked very hard to organize and share a lot of good material in what we feel is a very readable manner. But I think the first author may have been signaling, and preparing, a bit earlier than I was aware we were writing a book. Please read on to see some of her prefiguring work.

Variables can synergize, even in a linear model



Suppose we have the task of predicting an outcome y given a number of variables v1,...,vk. We often want to “prune variables,” or build models using fewer than all the variables. This can speed up modeling, decrease the cost of producing future data, improve robustness, improve explainability, reduce over-fitting, and improve the quality of the resulting model.

For some informative discussion on such issues please see the following:

In this article we are going to deliberately (and artificially) find and test one of the limits of variable pruning. We recommend simple variable pruning, but also think it is important to be aware of its limits.
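A tiny synthetic example of the kind of limit in question (our own construction for this note, not necessarily the article’s example): a variable that looks useless on its own, yet is indispensable jointly.

```r
# x2 alone carries no signal, but combined with x1 it cancels the
# shared nuisance term; pruning x2 by single-variable screening
# would destroy a nearly perfect linear model.
set.seed(2016)
n <- 1000
z <- rnorm(n)    # shared nuisance
y <- rnorm(n)    # outcome
x1 <- y + z      # signal contaminated by nuisance
x2 <- z          # pure nuisance, uncorrelated with y
summary(lm(y ~ x2))$r.squared       # near 0: x2 looks useless alone
summary(lm(y ~ x1))$r.squared       # about 0.5
summary(lm(y ~ x1 + x2))$r.squared  # near 1: y = x1 - x2 exactly
```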


The R community is awesome (and fast)


Recently I whined/whinged or generally complained about a few sharp edges in some powerful R systems.

In each case I was treated very politely, listened to, and actually got fixes back in a very short timeframe from volunteers. That is really great and probably one of the many reasons R is a great ecosystem.

Please read on for my list of n=3 interactions.

vtreat 0.5.27 released on CRAN


Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.27 has been released on CRAN.

vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

(from the package documentation)

Very roughly: vtreat accepts an arbitrary “from the wild” data frame (with varied column types, NAs, NaNs, and so forth) and returns a transformation that reliably and repeatably converts similar data frames into numeric (matrix-like) frames, with all independent variables numeric and free of NAs, NaNs, infinities, and so on, ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations, such as random forest, and also bring a danger of statistical over-fitting). It leaves the analyst more time to incorporate domain-specific data preparation, as vtreat tries to handle as much of the common work as practical. For more of an overall description please see here.
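The basic pattern looks roughly like the following, a minimal sketch using vtreat’s documented designTreatmentsN() and prepare() functions (the toy data frame is ours):

```r
library(vtreat)
# a toy "from the wild" frame: mixed types, NAs, a character column
d <- data.frame(
  x = c(1, 2, NA, 4, 5, NA),
  z = c("a", "b", "a", NA, "c", "b"),
  y = c(1.1, 2.2, 1.3, 4.4, 5.0, 2.1),
  stringsAsFactors = FALSE
)
# "design" learns the treatment plan from the data
plan <- designTreatmentsN(d, varlist = c("x", "z"), outcomename = "y")
# "prepare" applies the plan, yielding an all-numeric, NA-free frame
dTreated <- prepare(plan, d)
str(dTreated)
```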

We suggest all users update (and re-run any “design” steps, rather than mixing “design” and “prepare” from two different versions of vtreat).

For what is new in version 0.5.27 please read on.

My criticism of R numeric summary


My criticism of R‘s numeric summary() method is: it is unfaithful to its numeric arguments (due to bad default behavior) and frankly should be considered unreliable. summary() likely represents good work by high-ability researchers, and its sharp edges are probably there for historic and compatibility reasons; but in my opinion it does not currently represent a desirable set of tradeoffs.
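The unfaithfulness is easy to demonstrate (a small example of ours): summary()’s default significant-digit formatting can print identical minimum and maximum values for a vector whose endpoints differ.

```r
x <- c(123456.1, 123456.9)
summary(x)               # with default digits, min and max print identically
summary(x, digits = 12)  # asking for more digits recovers the distinction
c(min(x), max(x))        # the underlying values do differ
```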


The Big Lebowski, 1998.

Please read on for some context and my criticism.

Edit 8/25/2016: Martin Maechler generously committed a fix! Assuming this works out in testing, it looks like we could see an improvement to this core function in April 2017. I really want to say “thank you” to Martin Maechler and the rest of the team, not only for this, but for all the things they do, and for putting up with me.


The Win-Vector parallel computing in R series


With our recent publication of “Can you nest parallel operations in R?” we now have a nice series on “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details.

For your convenience here they are in order:

  1. A gentle introduction to parallel computing in R
  2. Running R jobs quickly on many machines
  3. Can you nest parallel operations in R?

Please check them out, and please do tweet/share these tutorials.

Can you nest parallel operations in R?


Parallel programming is a technique to decrease how long a task takes by performing more parts of it at the same time (using additional resources). When we teach parallel programming in R we start with the basic use of parallel (please see here for an example). This is, in our opinion, a necessary step before getting into clever notation and wrappers such as doParallel and foreach. Only then do students have a sufficiently explicit interface to frame important questions about the semantics of parallel computing. Beginners really need a solid mental model of what services are actually being provided by their tools, and need to test edge cases early.
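By “basic use of parallel” we mean something like the following minimal pattern (a sketch, not taken from the linked lesson):

```r
library(parallel)
cl <- makeCluster(2)              # a small local cluster of 2 workers
# distribute a simple computation over the workers
res <- parLapply(cl, 1:4, function(x) x^2)
stopCluster(cl)                   # always release the workers
unlist(res)                       # 1 4 9 16
```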

One question that comes up over and over again is “can you nest parLapply?”

The answer is “no.” This is in fact an advanced topic, but it is one of the things that pops up when you start worrying about parallel programming. Please read on for why that is the right answer and how to work around it (that is, how to simulate a “yes”).

I don’t think the above question is usually given sufficient consideration (nesting parallel operations can in fact make a lot of sense). You can’t directly nest parLapply, but whether one can invent a work-around is a different issue. For example, a “yes” answer (really meaning there are work-arounds) can be found here. Again, this is a different question from “is there a way to nest foreach loops” (which is possible through the nesting operator %:%, which presumably works around the nesting issues in parLapply).
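One common work-around (our sketch, not necessarily the linked solution) is to flatten the two nested loops into a single loop over the cross product of the indices, and run that one loop in parallel:

```r
library(parallel)
cl <- makeCluster(2)
A <- 1:3
B <- 1:2
# enumerate all (a, b) pairs up front instead of nesting parLapply calls
pairs <- expand.grid(a = A, b = B)
work <- parLapply(cl, seq_len(nrow(pairs)),
                  function(i, pairs) { pairs$a[i] * pairs$b[i] },
                  pairs = pairs)
stopCluster(cl)
unlist(work)
```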


The magrittr monad


Monads are a formal theory of composition where programmers get to invoke some very abstract mathematics (category theory) to argue the minutiae of annotating, scheduling, and sequencing operations and side effects. On the positive side, the monad axioms are a guarantee that related ways of writing code are in fact substitutable and equivalent; so you want your supplied libraries to obey such axioms to make your life easy. On the negative side, the theory is complicated.
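For intuition: magrittr’s pipe already behaves like composition for a trivial monad, with stages chaining left to right and the placeholder dot building reusable pipelines (a small sketch of standard magrittr features):

```r
library(magrittr)
# pipe a value through a chain of stages, left to right
c(1, 4, 9) %>% sqrt %>% sum        # same as sum(sqrt(c(1, 4, 9)))
# the dot placeholder builds a "functional sequence": a reusable pipeline
pipeline <- . %>% sqrt %>% sum
pipeline(c(1, 4, 9))               # 6
```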

In this article we will consider the latest entry in our mad “programming theory in R” series (see Some programming language theory in R, You don’t need to understand pointers to program using R, Using closures as objects in R, and How and why to return functions in R): category theory!


A budget of classifier evaluation measures


Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?”

My concrete advice is:

  • Read Nina Zumel’s excellent series on scoring classifiers.
  • Keep notes.
  • Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you want a flexible score) and “deviance” late in a project (when you want a strict score).
  • When working on practical problems, work with your business partners to find out which of precision/recall or sensitivity/specificity best match their business needs. If you have time, show and explain the ROC plot, and invite them to price and pick points along the ROC curve that best fit their business goals. Finance partners will rapidly recognize the ROC curve as “the efficient frontier” of classifier performance and be very comfortable working with this summary.
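For reference, both of our preferred scores can be computed in a few lines of base R (our sketch; here score is a vector of predicted probabilities and y the 0/1 outcomes, both toy data):

```r
y <- c(1, 1, 0, 1, 0, 0, 1, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1)
# AUC: probability a random positive outscores a random negative
pos <- score[y == 1]
neg <- score[y == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
# mean deviance of the scores treated as probabilities (lower is better)
eps <- 1e-12
dev <- -2 * mean(y * log(pmax(score, eps)) +
                 (1 - y) * log(pmax(1 - score, eps)))
c(auc = auc, deviance = dev)
```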

That being said, there always seems to be a bit of gamesmanship: somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Some choices of measure signal one’s pedigree (precision/recall implies a data mining background; sensitivity/specificity a medical science background) and serve to befuddle others.


Stanley Wyatt illustration from “Mathmanship,” Nicholas Vanserg, 1958; collected in A Stress Analysis of a Strapless Evening Gown, Robert A. Baker, Prentice-Hall, 1963.

The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.
