Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:
dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
Species sdlower mean sdupper iqrlower median iqrupper
1 setosa 4.653510 5.006 5.358490 4.8000 5.0 5.2000
2 versicolor 5.419829 5.936 6.452171 5.5500 5.9 6.2500
3 virginica 5.952120 6.588 7.223880 6.1625 6.5 6.8375
For a specific data frame, with known column names, such a table is easy to construct using dplyr::group_by and dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in dplyr can get quite hairy, quite quickly. Try it yourself, and see.
It is a bit of a shock when Rdplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with dplyr in one modality and hope to move to another back-end without significant debugging and work-arounds. replyr attempts to provide a few helpful work-arounds.
Our new package replyr supplies methods to get a grip on working with remote tbl sources (SQL databases, Spark) through dplyr. The idea is to add convenience functions to make such tasks more like working with an in-memory data.frame. Results still do depend on which dplyr service you use, but with replyr you have fairly uniform access to some useful functions.
With our recent publication of “Can you nest parallel operations in R?” we now have a nice series of “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details.
Parallel programming is a technique to decrease how long a task takes by performing more parts of it at the same time (using additional resources). When we teach parallel programming in R we start with the basic use of parallel (please see here for example). This is, in our opinion, a necessary step before getting into clever notation and wrapping such as doParallel and foreach. Only then do the students have a sufficiently explicit interface to frame important questions about the semantics of parallel computing. Beginners really need a solid mental model of what services are really being provided by their tools and to test edge cases early.
One question that comes up over and over again is “can you nest parLapply?”
The answer is “no.” This is in fact an advanced topic, but it is one of the things that pops up when you start worrying about parallel programming. Please read on for what that is the right answer and how to work around that (simulate a “yes”).
I don’t think the above question is usually given sufficient consideration (nesting parallel operations can in fact make a lot of sense). You can’t directly nest parLapply, but that is a different issue than can one invent a work-around. For example: a “yes” answer (really meaning there are work-arounds) can be found here. Again this is a different question than “is there a way to nest foreach loops” (which is possible through the nesting operator %.% which presumably handles working around nesting issues in parLapply).
We are pleased to release a new free data science video lecture: Debugging R code using R, RStudio and wrapper functions. In this 8 minute video we demonstrate the incredible power of R using wrapper functions to catch errors for later reproduction and debugging. If you haven’t tried these techniques this will really improve your debugging game.
Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. The idea is: we sacrifice some of the flexibility and composability inherent to ggplot2 in R for a menu of prescribed presentation solutions (which we are sharing on Github).
For example the plot below showing both an observed discrete empirical distribution (as stems) and a matching theoretical distribution (as bars) is a built in “one liner.”
A common complaint from new users of R is: the string processing notation is ugly.
Using paste(,,sep='') to concatenate strings seems clumsy.
You are never sure which regular expression dialect grep()/gsub() are really using.
Remembering the difference between length() and nchar() is initially difficult.
As always things can be improved by using additional libraries (for example: stringr). But this always evokes Python’s “There should be one– and preferably only one –obvious way to do it” or what I call the “rule 42” problem: “if it is the right way, why isn’t it the first way?”
At this moment the King, who had been for some time busily writing in his note-book, cackled out `Silence!' and read out from his book, `Rule Forty-two. All persons more than a mile high to leave the court.'
Everybody looked at Alice.
`I'm not a mile high,' said Alice.
`You are,' said the King.
`Nearly two miles high,' added the Queen.
`Well, I shan't go, at any rate,' said Alice: `besides, that's not a regular rule: you invented it just now.'
`It's the oldest rule in the book,' said the King.