Four years ago today, authors Nina Zumel and John Mount received their author's copies of Practical Data Science with R!
I know I write a lot about coding in R. But it is in the service of supporting statistics, analysis, predictive analytics, and data science.
R without data is like going to the theater to watch the curtain go up and down.
(Adapted from Ben Katchor's Julius Knipl, Real Estate Photographer: Stories, Little, Brown and Company, 1996, page 72, "Excursionist Drama 2".)
Usually you come to R to work with data. If you think and plan in terms of data and values (including introducing more data to control processing), you will usually work in a much faster, more explainable, and more maintainable fashion.
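As a minimal sketch of "introducing more data to control processing" (the data frame, column names, and mapping here are invented for illustration): a named lookup vector can replace a tangle of nested ifelse() calls, so the recoding logic lives in data rather than in code.

```r
# Hypothetical example data.
d <- data.frame(code = c("a", "b", "c", "a"))

# Instead of nested ifelse() calls, introduce more data:
# a named vector mapping codes to labels.
map <- c(a = "apple", b = "banana", c = "cherry")

# Indexing a named vector by character performs the recode.
d$label <- map[d$code]

print(d)
```

Adding a new code later means editing the mapping data, not the control flow.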
There are a number of easy ways to avoid illegible code nesting problems in R. In this R tip we will expand upon the above statement with a simple example.
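One easy way to avoid deep nesting (a sketch with made-up data, not the tip's own example): name the intermediate results instead of composing calls inline, so the steps read top to bottom instead of inside out.

```r
x <- c(3, 1, 4, 1, 5, 9, 2, 6)

# Nested form: must be read inside-out.
r1 <- sqrt(cumsum(sort(x)))

# Sequential form: each step gets a descriptive name.
sorted  <- sort(x)
running <- cumsum(sorted)
r2      <- sqrt(running)

# Both forms compute the same result.
stopifnot(identical(r1, r2))
```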
Take care if trying the new RPostgres database connection package. By default it returns some non-standard types that code developed against other database drivers may not expect, and may not be ready to defend against.
Danger, Will Robinson!
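As a defensive sketch (the database name and table name are hypothetical): after reading a table through RPostgres, check the column classes before assuming base-R types. For example, BIGINT columns can come back as bit64::integer64, which many downstream functions do not treat like base integers.

```r
library(DBI)

# Hypothetical connection and table.
con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")
d <- dbReadTable(con, "mytable")

# Inspect what types the driver actually returned.
print(sapply(d, class))

# One defensive option: convert integer64 columns to numeric explicitly.
is_int64 <- vapply(d, inherits, logical(1), what = "integer64")
d[is_int64] <- lapply(d[is_int64], as.numeric)

dbDisconnect(con)
```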
Many data scientists (and even statisticians) often suffer under one of the following misapprehensions:
- They believe a technique doesn’t work in their current situation (when in fact it does), leading to useless precautions and missed opportunities.
- They believe a technique does work in their current situation (when in fact it does not), leading to failed experiments or incorrect results.
I feel this happens less often if you are working with observable and composable tools of the proper scale. Somewhere between monolithic, all-in-one systems and ad hoc, one-off coding there is a cognitive sweet spot where great work can be done.
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”
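A small illustration of why that claim is misleading (timings will vary by machine; this is a sketch, not the note's benchmark): vectorized base-R primitives are themselves implemented in C, so vectorized base-R code can be very fast, while an explicit R-level loop over the same data is much slower.

```r
n <- 1e7
x <- runif(n)

# Vectorized base R: the multiply and sum run in compiled code.
system.time({
  s1 <- sum(x * x)
})

# Explicit R-level loop: interpreted per element, much slower.
system.time({
  s2 <- 0
  for (xi in x) s2 <- s2 + xi * xi
})

# Same answer up to floating-point tolerance.
stopifnot(isTRUE(all.equal(s1, s2)))
```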
The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.