Many data scientists (and even statisticians) often suffer under one of the following misapprehensions:
- They believe a technique doesn’t work in their current situation (when in fact it does), leading to useless precautions and missed opportunities.
- They believe a technique does work in their current situation (when in fact it does not), leading to failed experiments or incorrect results.
I feel this happens less often if you are working with observable and composable tools of the proper scale. Somewhere between monolithic all in one systems, and ad-hoc one-off coding is a cognitive sweet spot where great work can be done.
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”
The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.
Kudos to Professor Andrew Gelman for telling a great joke at his own expense:
He brilliantly burlesqued a frustrating common occurrence many people say they “have never seen happen.”
One of the pains of writing about data science is there is a (small but vocal) sub-population of statisticians jump on your first mistake (we all make errors) and then expand it into an essay on how you: known nothing, are stupid, are ignorant, are unqualified, and are evil.
I get it: many people writing about data science do not know enough statistics. However, not every person writing from a data science point of view is statistically ignorant. That is not to say computer science (my original field) doesn’t have similar problems.
Trying to destroy a sweater by pulling on a loose thread in no way establishes that it wasn’t a nice sweater in the first place (or how nice a sweater it would be if the loose thread were fixed).
(BTW: the book in question is in fact excellent. Chapter 12 alone is worth at least ten times the list price of the book.)
A note to
dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside
dplyr::mutate() statements. Continue reading Please inspect your dplyr+database code
Our article "Let’s Have Some Sympathy For The Part-time R User" includes two points:
- Sometimes you have to write parameterized or re-usable code.
- The methods for doing this should be easy and legible.
The first point feels abstract, until you find yourself wanting to re-use code on new projects. As for the second point: I feel the
wrapr package is the easiest, safest, most consistent, and most legible way to achieve maintainable code re-use in
In this article we will show how
wrapr makes code-rewriting even easier with its new
let x=x automation.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an
R user we strongly suggest you incorporate
vtreat into your projects. Continue reading Upcoming data preparation and modeling article series
There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.
Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.
“Character is what you are in the dark.”
I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.