In this note we will show how to speed up work in
R by partitioning data and using process-level parallelization. We will show the technique with three different
R packages, including dplyr. The methods shown will also work with base-R and other packages.
For each of the above packages we speed up work by using
wrapr::execute_parallel which in turn uses
wrapr::partition_tables to partition un-related
data.frame rows and then distributes them to different processors to be executed.
rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together.
The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We will demonstrate everything with simple code examples and minimal discussion.
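The partition-then-distribute idea can be sketched with base R's parallel package. This is a minimal illustration under stated assumptions, not the wrapr::execute_parallel workflow itself: the data frame d, the grouping column grp, and the per-group function f are all hypothetical names made up for the example.

```r
library(parallel)

# Hypothetical example data: 'grp' is the user-prepared grouping column
# that marks which rows must be kept together.
d <- data.frame(
  grp = rep(c("a", "b", "c", "d"), each = 3),
  x = 1:12
)

# Hypothetical per-group work: center x within each group.
# This is only correct if all rows of a group are processed together.
f <- function(di) {
  di$x_centered <- di$x - mean(di$x)
  di
}

# Partition the rows by the grouping column.
parts <- split(d, d$grp)

# Distribute the partitions to worker processes, then shut the workers down.
cl <- makeCluster(2)
res_parts <- parLapply(cl, parts, f)
stopCluster(cl)

# Re-assemble the pieces into a single data.frame.
res <- do.call(rbind, res_parts)
rownames(res) <- NULL
```

Process-level parallelism has start-up and data-transfer costs, so a sketch like this is only likely to pay off when the per-group work is substantial.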
Continue reading Speed up your R Work
We are pleased to announce that seplyr version 0.5.8 is now available on CRAN.
seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part-time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.
Continue reading seplyr 0.5.8 Now Available on CRAN
R Tip: be wary of “...”.
The following code example contains an easy error in using the R function unique():
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")
unique(vec1, vec2)
#  "a" "b" "c"
Notice none of the novel values from
vec2 are present in the result. Our mistake was: we (improperly) tried to use
unique() with multiple value arguments, as one would use
union(). Also notice no error or warning was signaled. We used
unique() incorrectly and nothing pointed this out to us. What compounded our error was R’s
“...” function signature feature.
In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign of carelessness, laziness, or weakness. I am well aware that every time I admit to making a mistake (I have indeed made the above mistake) those who claim to never make mistakes have a laugh at my expense. Honestly I feel the reason I see more mistakes is I check a lot more.
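One possible defense, sketched here in base R, is to have a function refuse any unexpected extra arguments rather than silently accept them. The wrapper name strict_unique is hypothetical and is not taken from the article; it only illustrates the "make the mistake impossible" principle.

```r
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")

# union() is the function designed to combine two vectors.
u <- union(vec1, vec2)
# u is "a" "b" "c" "d"

# Hypothetical wrapper: same as unique(), but any extra argument
# is treated as a caller error instead of being quietly consumed.
strict_unique <- function(x, ...) {
  if (length(list(...)) > 0) {
    stop("strict_unique: unexpected arguments")
  }
  unique(x)
}

# Correct use works as before.
ok <- strict_unique(vec1)

# The mistaken two-vector call now fails loudly; we capture the message.
err <- tryCatch(
  strict_unique(vec1, vec2),
  error = function(e) conditionMessage(e)
)
```

The trade-off is that such a wrapper also blocks legitimate optional arguments, so it suits cases where the extra arguments are never wanted.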
Continue reading R Tip: Be Wary of “…”
R Tip: use isTRUE().
A lot of R functions are type unstable, which means they return different types or classes depending on details of their values.
For example consider
all.equal(), it returns the logical value
TRUE when the items being compared are equal:
all.equal(1:3, c(1, 2, 3))
#  TRUE
However, when the items being compared are not equal
all.equal() instead returns a message:
all.equal(1:3, c(1, 2.5, 3))
#  "Mean relative difference: 0.25"
This can be inconvenient in using functions similar to
all.equal() as tests in
if()-statements and other program control structures.
The saving function is isTRUE(). isTRUE() returns
TRUE if its argument value is equivalent to
TRUE, and returns FALSE otherwise. isTRUE() makes
R programming much easier.
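A small sketch of the pattern, using only base R (the variable names are made up for the example):

```r
a <- 1:3
b <- c(1, 2.5, 3)

# all.equal() returns a character message here, not FALSE.
cmp <- all.equal(a, b)
cmp_is_message <- is.character(cmp)

# Using cmp directly as an if() condition would signal an error,
# because a character message is not interpretable as a logical.
# Wrapping the comparison in isTRUE() always yields a single TRUE or FALSE.
status <- if (isTRUE(cmp)) "equal" else "different"

# The equal case still works as expected.
status_same <- if (isTRUE(all.equal(1:3, c(1, 2, 3)))) "equal" else "different"
```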
Continue reading R Tip: use isTRUE()
R tip: use slices.
R has a very powerful array slicing ability that allows for some very slick data processing.
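A few base R slicing idioms, as a hedged illustration; the full article may use different examples, and the data here is invented:

```r
x <- c(10, 13, 19, 20)

# Negative indices drop elements: x[-1] is "all but the first",
# x[-length(x)] is "all but the last". Subtracting the two slices
# gives successive differences without writing a loop.
diffs <- x[-1] - x[-length(x)]

# A logical vector selects the elements where it is TRUE.
evens <- x[x %% 2 == 0]

# Matrices slice by row and column; an empty index means "all".
m <- matrix(1:6, nrow = 2)
col2 <- m[, 2]

# drop = FALSE keeps the result a matrix instead of collapsing to a vector.
col2_matrix <- m[, 2, drop = FALSE]
```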
Continue reading R Tip: Use Slices
R tip: first organize your tasks in terms of data, values, and desired transformation of values, not initially in terms of concrete functions or code.
I know I write a lot about coding in
R. But it is in the service of supporting statistics, analysis, predictive analytics, and data science.
R without data is like going to the theater to watch the curtain go up and down.
(Adapted from Ben Katchor’s Julius Knipl, Real Estate Photographer: Stories, Little, Brown, and Company, 1996, page 72, “Excursionist Drama 2”.)
Usually you come to
R to work with data. If you think and plan in terms of data and values (including introducing more data to control processing) you will usually work in a much faster, more explainable, and more maintainable fashion.
Continue reading R Tip: Think in Terms of Values
Here is an R tip. Want to re-map a column of values? Use a named vector as the mapping.
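A minimal sketch of the idea in base R; the mapping and the column names are hypothetical:

```r
# The names of the vector are the old values, the entries are the new values.
map <- c(a = "apple", b = "banana", c = "cherry")

d <- data.frame(
  code = c("b", "a", "c", "a"),
  stringsAsFactors = FALSE
)

# Indexing the named vector by the column looks up every row at once.
# unname() drops the lookup keys so the new column is a plain vector.
d$fruit <- unname(map[d$code])
```

If the column were a factor, indexing would use the underlying integer codes rather than the labels, so converting with as.character() first is the safer habit.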
Continue reading R Tip: Use Named Vectors to Re-Map Values
Another R tip. Need to replace a name in some R code or make R code re-usable? Use let().
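A small sketch of name re-mapping with let() from the wrapr package, assuming wrapr is installed. The data frame, the column name, and the placeholder symbol COL are all invented for this example:

```r
library(wrapr)

d <- data.frame(x = 1:3)

# The column to work on is held as a string, e.g. supplied by a caller.
col_name <- "x"

# let() replaces the placeholder symbol COL with the name stored in
# col_name and then evaluates the code, so d$COL is run as d$x.
total <- let(
  c(COL = col_name),
  sum(d$COL)
)
```

Changing col_name is then enough to re-use the same block of code on a different column.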
Continue reading R Tip: Use let() to Re-Map Names