Categories: data science, Opinion, Practical Data Science, Pragmatic Data Science, Tutorials

R Tip: Give data.table a Try

If your R or dplyr work is taking what you consider to be too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour) then try data.table.

For some tasks data.table is routinely faster than alternatives at pretty much all scales (example timings here).

If your project is large (millions of rows, hundreds of columns) you really should rent an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.
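To give a feel for the style of code involved, here is a minimal sketch of a grouped summary in data.table. The data and the column names (g, x) are made up for this example, and the code is guarded so it only runs if data.table is installed.

```r
# Hypothetical example data; the names g and x are illustrative only.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  dt <- data.table(
    g = c("a", "a", "b", "b", "b"),
    x = c(1, 2, 3, 4, 5)
  )

  # Grouped summary: per-group mean of x and per-group row count.
  res <- dt[, .(mean_x = mean(x), n = .N), by = g]
  print(res)
}
```

On data this small there is nothing to time; the point is only the compact `dt[i, j, by]` form, which is the same form you would use on much larger tables.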

Categories: data science, Programming

Speed up your R Work


In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel, which in turn uses wrapr::partition_tables to partition unrelated data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
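The partition-then-distribute idea can be sketched in base R with the parallel package. This is not the wrapr or rqdatatable implementation, only an illustration of the technique; the data, the grouping column name grp, and the per-partition function f are all hypothetical.

```r
library(parallel)

# Hypothetical example data: grp is the grouping column that marks
# which rows must be kept together.
d <- data.frame(
  grp = rep(c("a", "b", "c", "d"), each = 3),
  x = 1:12
)

# Partition the rows by the grouping column.
parts <- split(d, d$grp)

# Illustrative per-partition work: add a within-group total of x.
f <- function(di) {
  di$group_total <- sum(di$x)
  di
}

# Distribute the partitions over two worker processes.
cl <- makeCluster(2)
res_list <- parLapply(cl, parts, f)
stopCluster(cl)

# Reassemble the per-partition results into one data.frame.
res <- do.call(rbind, res_list)
rownames(res) <- NULL
print(res)
```

Because every row of a group lands in the same partition, the within-group total is correct no matter which worker handles it. That is the property the grouping column is there to guarantee.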


Categories: data science, Opinion, Programming, Tutorials

seplyr 0.5.8 Now Available on CRAN

We are pleased to announce that seplyr version 0.5.8 is now available on CRAN.

seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part-time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.


Categories: Coding, Tutorials

R Tip: Be Wary of “…”

R Tip: be wary of “...”.

The following code example contains an easy error in using the R function unique().

vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")
unique(vec1, vec2)
# [1] "a" "b" "c"

Notice none of the novel values from vec2 are present in the result. Our mistake was that we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice no error or warning was signaled. We used unique() incorrectly and nothing pointed this out to us. What compounded our error was R's “...” function signature feature.
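For comparison, here is a small sketch of two ways to get the result we presumably wanted: call union(), or combine the vectors first and then call unique() on the single combined vector.

```r
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")

# Option 1: union() is designed to take two value arguments.
u1 <- union(vec1, vec2)
print(u1)
# [1] "a" "b" "c" "d"

# Option 2: combine first, then de-duplicate one vector.
u2 <- unique(c(vec1, vec2))
print(u2)
# [1] "a" "b" "c" "d"
```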

In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign of carelessness, laziness, or weakness. I am well aware that every time I admit to making a mistake (I have indeed made the above mistake) those who claim to never make mistakes have a laugh at my expense. Honestly I feel the reason I see more mistakes is I check a lot more.
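One possible defensive pattern, shown here only as a sketch, is a wrapper that refuses any unexpected extra arguments instead of silently accepting them. The name strict_unique is hypothetical and not part of base R or any package.

```r
# Hypothetical helper: like unique(), but signals an error if anything
# beyond the single value argument is supplied.
strict_unique <- function(x, ...) {
  if (length(list(...)) > 0) {
    stop("strict_unique: unexpected extra arguments")
  }
  unique(x)
}

vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")

# The earlier mistake now fails loudly instead of returning a wrong answer.
outcome <- tryCatch(
  strict_unique(vec1, vec2),
  error = function(e) "error signaled"
)
print(outcome)
# [1] "error signaled"
```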


Categories: Programming, Tutorials

R Tip: use isTRUE()

R Tip: use isTRUE().

A lot of R functions are type unstable, which means they return different types or classes depending on details of their values.

For example, consider all.equal(). It returns the logical value TRUE when the items being compared are equal:

all.equal(1:3, c(1, 2, 3))
# [1] TRUE

However, when the items being compared are not equal, all.equal() instead returns a character message describing the difference:

all.equal(1:3, c(1, 2.5, 3))
# [1] "Mean relative difference: 0.25"

This makes it inconvenient to use functions like all.equal() as tests in if() statements and other program control structures.

The saving function is isTRUE(). isTRUE() returns TRUE if its argument is a single logical TRUE value, and returns FALSE otherwise. isTRUE() makes R programming much easier.
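A minimal sketch of the pattern, using the same values as the all.equal() examples above (the variable names a, b, cmp, and status are just for illustration):

```r
a <- 1:3
b <- c(1, 2.5, 3)

# all.equal() returns a character message here, not FALSE.
cmp <- all.equal(a, b)
print(is.character(cmp))
# [1] TRUE

# isTRUE() turns that result into a plain TRUE/FALSE that if() can use.
if (isTRUE(cmp)) {
  status <- "equal"
} else {
  status <- "different"
}
print(status)
# [1] "different"
```

Passing cmp straight to if() would signal an error here, since a character string is not a usable condition.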


Categories: Opinion, Programming, Statistics

Neglected R Super Functions

R has a lot of under-appreciated, super powerful functions. I list a few of our favorites below.


Atlas, carrying the sky. Royal Palace (Paleis op de Dam), Amsterdam.

Photo: Dominik Bartsch, CC some rights reserved.
