Relative error distributions, without the heavy tail theatrics

Categories: Expository Writing, Mathematics, Opinion, Statistics, Tutorials

Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing the effects of relative error distributions (so it isn’t an exotic idea you bring to analysis; it is a likely fact about the world that comes at you). The article is a good example of how to plot and reason about such situations.

I am just going to add a few references (mostly from Nina) and some more discussion on log-normal distributions versus Zipf-style distributions or Pareto distributions.
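As a rough illustration of why relative error quantities tend to look log-normal, here is a minimal Python sketch (it is not code from Nina Zumel's article). It assumes a quantity that starts at 1 and is hit by a series of independent relative shocks. The function name `simulate_relative_error`, the uniform plus-or-minus 20% shock, and the sample sizes are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def simulate_relative_error(n_samples=5000, n_steps=50, max_shock=0.2, seed=2016):
    """Return n_samples values, each the product of n_steps relative shocks.

    Assumption for illustration: every shock multiplies the current value by
    (1 + u), with u drawn uniformly from [-max_shock, +max_shock].
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        x = 1.0
        for _ in range(n_steps):
            x *= 1.0 + rng.uniform(-max_shock, max_shock)
        values.append(x)
    return values


values = simulate_relative_error()
logs = [math.log(v) for v in values]

# On the original scale the distribution is right-skewed: mean above median.
print("mean:", statistics.mean(values), "median:", statistics.median(values))
# On the log scale it is roughly symmetric: mean and median nearly agree.
print("log mean:", statistics.mean(logs), "log median:", statistics.median(logs))
```

Under these assumptions the mean sits well above the median on the original scale, while on the log scale the two nearly agree. That is one reason plotting such quantities on a log scale is often the more natural view.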

Adversarial machine learning

Categories: Opinion, Statistics

I just got back from a very good conference organized by Adversarial Machine Learning. Please read on for my comments on part of one of the very good talks.

Did she know we were writing a book?

Categories: Administrativia, Expository Writing, Opinion, Practical Data Science, Statistics

Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge.

Nina Zumel and I definitely troubled over the possibilities for some time before deciding to write Practical Data Science with R (Nina Zumel and John Mount, Manning, 2014).


In the end we worked very hard to organize and share a lot of good material in what we feel is a very readable manner. But I think the first author may have been signaling and preparing a bit before I was aware we were writing a book. Please read on to see some of her prefiguring work.

Variables can synergize, even in a linear model

Categories: Statistics, Tutorials


Suppose we have the task of predicting an outcome y given a number of variables v1, ..., vk. We often want to “prune variables” or build models with fewer than all the variables. We may do this to speed up modeling, decrease the cost of producing future data, improve robustness, improve explainability, reduce over-fit, and even improve the quality of the resulting model.

For some informative discussion on such issues please see the following:

In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.
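To make the kind of limit at issue concrete, here is a minimal Python sketch. It is not the construction used in the article, only an assumed toy example: two variables that each look nearly useless when scored alone, yet predict y almost perfectly when used together in a linear model. The names `nuisance`, `signal`, `r2_one`, and `r2_two` are hypothetical helpers invented for this illustration.

```python
import random


def center(v):
    """Subtract the mean so the fits below implicitly include an intercept."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def r2_one(x, y):
    """R-squared of the least squares fit y ~ x (the squared correlation)."""
    xc, yc = center(x), center(y)
    return dot(xc, yc) ** 2 / (dot(xc, xc) * dot(yc, yc))


def r2_two(x1, x2, y):
    """R-squared of the least squares fit y ~ x1 + x2 via the normal equations."""
    a, b, yc = center(x1), center(x2), center(y)
    s11, s12, s22 = dot(a, a), dot(a, b), dot(b, b)
    s1y, s2y = dot(a, yc), dot(b, yc)
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    explained = b1 * s1y + b2 * s2y  # explained sum of squares
    return explained / dot(yc, yc)


rng = random.Random(2016)
n = 2000
nuisance = [rng.gauss(0.0, 10.0) for _ in range(n)]  # large shared component
signal = [rng.gauss(0.0, 1.0) for _ in range(n)]     # the part y depends on
v1 = [z + s for z, s in zip(nuisance, signal)]       # signal buried in nuisance
v2 = nuisance                                        # nuisance alone
y = signal                                           # so y = v1 - v2

print("R-squared, v1 alone:", r2_one(v1, y))
print("R-squared, v2 alone:", r2_one(v2, y))
print("R-squared, v1 and v2:", r2_two(v1, v2, y))
```

In this toy setup a pruning rule that scores each variable on its own would be tempted to discard both v1 and v2, even though the pair together recovers y. Real data is rarely this extreme, which is why simple pruning remains a reasonable default.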


The R community is awesome (and fast)

Categories: Administrativia, Statistics

Recently I whined/whinged or generally complained about a few sharp edges in some powerful R systems.

In each case I was treated very politely, listened to, and actually got fixes back in a very short timeframe from volunteers. That is really great and probably one of the many reasons R is a great ecosystem.

Please read on for my list of n=3 interactions.

Variable pruning is NP hard

Categories: Computer Science, math programming, Statistics, Tutorials

I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing some examples is that summaries such as model quality (especially out of sample quality) and variable significances are not quite as simple as one would hope (they in fact lack a lot of the monotone structure or submodular structure that would make things easy).

That being said we have a lot of powerful and effective heuristics to discuss in upcoming articles. I am going to leave such positive results for my later articles and here concentrate on an instructive technical negative result: picking a good subset of variables is theoretically quite hard.

vtreat 0.5.27 released on CRAN

Categories: Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.27 has been released on CRAN.

vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

(from the package documentation)

Very roughly, vtreat accepts an arbitrary “from the wild” data frame (with different column types, NAs, NaNs, and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames (all independent variables numeric and free of NAs, NaNs, infinities, and so on) ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations such as random forest, and also bring in a danger of statistical over-fitting), and it leaves the analyst more time to incorporate domain-specific data preparation (as vtreat tries to handle as much of the common stuff as practical). For more of an overall description please see here.

We suggest that all users update (and you will want to re-run any “design” steps instead of mixing “design” and “prepare” from two different versions of vtreat).

For what is new in version 0.5.27 please read on.

My criticism of R numeric summary

Categories: Rants, Statistics, Tutorials

My criticism of R’s numeric summary() method is: it is unfaithful to numeric arguments (due to bad default behavior) and frankly it should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of tradeoffs. summary() likely represents good work by high-ability researchers, and the sharp edges are due to historically necessary trade-offs.


The Big Lebowski, 1998.

Please read on for some context and my criticism.

Edit 8/25/2016: Martin Maechler generously committed a fix! Assuming this works out in testing it looks like we could see an improvement on this core function in April 2017. I really want to say “thank you” to Martin Maechler and the rest of the team, not only for this, but for all the things they do and for putting up with me.


The Win-Vector parallel computing in R series

Categories: Administrativia, Programming, Statistics, Tutorials

With our recent publication of “Can you nest parallel operations in R?” we now have a nice series of “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details.

For your convenience here they are in order:

  1. A gentle introduction to parallel computing in R
  2. Running R jobs quickly on many machines
  3. Can you nest parallel operations in R?

Please check them out, and please do Tweet/share these tutorials.

Can you nest parallel operations in R?

Categories: Programming, Tutorials

Parallel programming is a technique to decrease how long a task takes by performing more parts of it at the same time (using additional resources). When we teach parallel programming in R we start with the basic use of parallel (please see here for example). This is, in our opinion, a necessary step before getting into clever notation and wrapping such as doParallel and foreach. Only then do the students have a sufficiently explicit interface to frame important questions about the semantics of parallel computing. Beginners really need a solid mental model of what services are really being provided by their tools and to test edge cases early.

One question that comes up over and over again is “can you nest parLapply?”

The answer is “no.” This is in fact an advanced topic, but it is one of the things that pops up when you start worrying about parallel programming. Please read on for why that is the right answer and how to work around that (simulate a “yes”).

I don’t think the above question is usually given sufficient consideration (nesting parallel operations can in fact make a lot of sense). You can’t directly nest parLapply, but that is a different question from whether one can invent a work-around. For example: a “yes” answer (really meaning there are work-arounds) can be found here. Again, this is a different question than “is there a way to nest foreach loops” (which is possible through the nesting operator %:% which presumably handles working around nesting issues in parLapply).
