Categories: data science, Practical Data Science, Pragmatic Data Science, Programming, Statistics, Tutorials

The Zero Bug

I am going to write about an insidious statistical, data analysis, and presentation fallacy I call “the zero bug” and the habits you need to cultivate to avoid it.

The zero bug


Here is the zero bug in a nutshell: common data aggregation tools often cannot “count to zero” from examples, and this causes problems. Please read on for what this means, the consequences, and how to avoid the problem.
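One way this can show up is sketched below. The data and the day names are made up for illustration and are not taken from the full article: a group with no rows simply never appears in the aggregate, so any summary computed over the result quietly ignores the zeros.

```python
from collections import Counter

# Hypothetical example data: one record per observed event.
# "Tuesday" had no events, so it never appears in the data at all.
events = ["Monday", "Monday", "Wednesday", "Monday", "Wednesday"]

# Aggregating only what was observed: groups with no rows are silently missing.
observed_counts = Counter(events)

# One possible habit: carry an explicit list of expected groups and fill in zeros.
expected_days = ["Monday", "Tuesday", "Wednesday"]
complete_counts = {day: observed_counts.get(day, 0) for day in expected_days}

# A summary over the incomplete table is biased upward.
mean_observed = sum(observed_counts.values()) / len(observed_counts)
mean_complete = sum(complete_counts.values()) / len(complete_counts)

print(dict(observed_counts))
print(complete_counts)
print(mean_observed, mean_complete)
```

The average events per day looks higher in the first table only because the day with zero events was never counted.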

Categories: Programming, Statistics, Tutorials

Evolving R Tools and Practices

One of the distinctive features of the R platform is how explicit and user-controllable everything is. This allows the style of use of R to evolve fairly rapidly. I will discuss this and end with some new notations, methods, and tools I am nominating for inclusion into your view of the evolving “current best practice style” of working with R.

Categories: Exciting Techniques, Statistics, Tutorials

Using the Bizarro Pipe to Debug magrittr Pipelines in R

I have just finished and released a free new R video lecture demonstrating how to use the “Bizarro pipe” to debug magrittr pipelines. I think R dplyr users will really enjoy it.

Please read on for the link to the video lecture.

Categories: Computers, Opinion, Public Service Article, Statistics, Tutorials

Upgrading to macOS Sierra (nee OSX) for R users

A good fraction of R users use Apple computers. Apple machines historically have sat at a sweet spot of convenience, power, and utility:

  • Convenience: Apple machines are available at retail stores, come with purchasable support, and can run a lot of common commercial software.
  • Power: R packages such as parallel and Rcpp work better on top of a POSIX environment.
  • Utility: OSX was good at interoperating with the Linux your big data systems are likely running on, and some R packages expect a native operating system supporting a POSIX environment (which historically has not been a Microsoft Windows strength, despite claims to the contrary).

Frankly the trade-off is changing:

  • Apple is neglecting its computer hardware and operating system in favor of phones and watches. And (for claimed license prejudice reasons) the lauded OSX/macOS “Unix userland” is woefully out of date (try “bash --version” in an Apple Terminal; it is about 10 years out of date!).
  • Microsoft Windows Unix support is improving (Windows 10 bash is interesting, though R really can’t take advantage of that yet).
  • Linux hardware support is improving (though not fully there for laptops, modern trackpads, touch screens, or even some wireless networking).

Our current R platform remains Apple macOS. But our next purchase is likely a Linux laptop with the addition of a legal copy of Windows inside a virtual machine (for commercial software not available on Linux). It has been a while since Apple last “sparked joy” around here, and if Linux works out we may have a few Apple machines sitting on the curb with paper bags over their heads (Marie Kondo’s advice for humanely disposing of excess inanimate objects that “see”, such as unloved stuffed animals with eyes and laptops with cameras).


That being said: how does one update an existing Apple machine to macOS Sierra and then restore enough functionality to resume working? Please read on for my notes on the process.

Categories: math programming, Pragmatic Machine Learning, Statistics, Tutorials

Why do Decision Trees Work?

In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal of this article is to set up terminology so we can state in one or two sentences why decision trees tend to work well in practice.


Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

A Theory of Nested Cross Simulation

[Reader’s Note. Some of our articles are applied and some of our articles are more theoretical. The following article is more theoretical, and requires fairly formal notation to even work through. However, it should be of interest as it touches on some of the fine points of cross-validation that are quite hard to perceive or discuss without the notational framework. We thought about including some “simplifying explanatory diagrams”, but so many entities are introduced and manipulated by the processes we are describing that we found equation notation to be cleaner than the diagrams we attempted and rejected.]

Please consider either of the following common predictive modeling tasks:

  • Picking hyper-parameters, fitting a model, and then evaluating the model.
  • Variable preparation/pruning, fitting a model, and then evaluating the model.

In each case you are building a pipeline where “y-aware” (or outcome aware) choices and transformations made at each stage affect later stages. This can introduce undesirable nested model bias and over-fitting.

Our current standard advice to avoid nested model bias is either:

  • Split your data into 3 or more disjoint pieces, such as separate sets for variable preparation/pruning, model fitting, and model evaluation.
  • Reserve a test-set for evaluation and use “simulated out of sample data” or “cross-frame”/“cross simulation” techniques to simulate dividing data among the first two model construction stages.

The first practice is simple and computationally efficient, but statistically inefficient. This may not matter if you have a lot of data, as in “big data”. The second procedure is more statistically efficient, but is also more complicated and has some computational cost. For convenience the cross simulation method is supplied as a ready to go procedure in our R data cleaning and preparation package vtreat.
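To make the second practice concrete, here is a minimal sketch of the general cross-frame idea: each row's y-aware encoding is computed only from rows in the other folds, so no row's own outcome leaks into its encoding. This is an illustration only, not vtreat's implementation; the function names `level_means` and `cross_frame`, the toy data, and the choice of a per-level mean encoding are all assumptions made for the example.

```python
import random
from collections import defaultdict

def level_means(rows):
    """Map each categorical level to the mean outcome seen in `rows`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in rows:
        sums[x] += y
        counts[x] += 1
    return {level: sums[level] / counts[level] for level in sums}

def cross_frame(rows, n_folds=3, seed=0):
    """Encode each row's level using only rows from the *other* folds."""
    rng = random.Random(seed)
    # Assign rows to folds of (nearly) equal size, in random order.
    fold_of = [i % n_folds for i in range(len(rows))]
    rng.shuffle(fold_of)
    encoded = [None] * len(rows)
    for k in range(n_folds):
        train = [r for r, f in zip(rows, fold_of) if f != k]
        means = level_means(train)
        # Levels unseen in the other folds fall back to the mean of those folds.
        fallback = sum(y for _, y in train) / len(train)
        for i, (x, _) in enumerate(rows):
            if fold_of[i] == k:
                encoded[i] = means.get(x, fallback)
    return encoded, fold_of

# Hypothetical toy data: (categorical level, numeric outcome) pairs.
rows = [("a", 1.0), ("a", 0.0), ("b", 1.0), ("b", 1.0), ("a", 0.0), ("c", 1.0)]
encoded, fold_of = cross_frame(rows)
print(list(zip(rows, fold_of, encoded)))
```

A downstream model would then be fit on `encoded` rather than on an encoding built from all rows at once, while a separately reserved test set is still used for the final evaluation.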

What would it look like if we insisted on using cross simulation or simulated out of sample techniques for all three (or more) stages? Please read on to find out.


Hyperbole and a Half copyright Allie Brosh (use allowed in some situations with attribution)

Categories: Coding, Opinion, Programming, Statistics, Tutorials

Comparative examples using replyr::let

Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier.

Archie’s Mechanics #2 (1954) copyright Archie Publications

(edit: great news! CRAN just accepted our replyr 0.2.0 fix release!)

Please read on for examples comparing standard notations and replyr::let.

Categories: Practical Data Science, Pragmatic Data Science, Statistics, Tutorials

Be careful evaluating model predictions

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.

This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio of about 4.4482216 (one pound-force is about 4.4482216 newtons), and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
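The point can be checked numerically with a small sketch. The data and the scale factor of 3.0 below are arbitrary illustrative choices (any constant positive ratio behaves the same way): predictions that are off by a constant factor still have a correlation of essentially 1.0 with the truth, while a direct error measure such as root mean squared error shows they are badly wrong.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rmse(preds, truth):
    """Root mean squared error of predictions against the true values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth))

# Hypothetical "true" values, and predictions reported in the wrong units.
truth = [10.0, 20.0, 35.0, 50.0]
scale = 3.0  # arbitrary illustrative unit ratio
preds_wrong_units = [t / scale for t in truth]

corr = pearson(preds_wrong_units, truth)
err = rmse(preds_wrong_units, truth)
print(corr, err)
```

Correlation scores the best possible re-scaling of the predictions, not the predictions themselves, which is why it cannot see the unit error.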

The need for a convenient direct F-test without accidentally triggering the implicit re-scaling that is associated with calculating a correlation is one of the reasons we supply the sigr R library. However, even then things can become confusing.


Please read on for a nasty little example.

Categories: Pragmatic Data Science, Programming, Tutorials

MySql in a container

I have previously written on using containerized PostgreSQL with R. This post shows the steps for using containerized MySQL with R.

Categories: Statistics, Tutorials

A quick look at RStudio’s R notebooks

A quick demo of RStudio’s R Notebooks shown by John Mount (of Win-Vector LLC, a statistics, data science, and algorithms consulting and training firm).


It looks like some of the new in-line display behavior is back-ported to R Markdown and some of the difference is the delayed running and different level of interactivity in the HTML document. This makes it a bit hard to call out which of RStudio’s improvements are “R notebooks” versus “R markdown”, but it means there is a lot of new functionality available. I’ve updated the video to reflect the subtlety (unfortunately on YouTube that means a new URL as you can’t replace videos).


And some just-in-case declarations/clarifications/reminders: this video is not from RStudio (the company), and the RStudio client (the software) is a user interface that is separate from the R analysis system itself.