Saghir Bashir of ilustat recently shared a nice getting started with
In addition they were generous enough to link to Dirk Eddelbuette’s later adaption of the guide to use
This type of cooperation and user choice is what keeps the
R community vital. Please encourage it. (Heck, please insist on it!)
According to a KDD poll fewer respondents (by rate) used only
R in 2017 than in 2016. At the same time more respondents (by rate) used only
Python in 2017 than in 2016.
Let’s take this as an excuse to take a quick look at what happens when we try a task in both systems.
Continue reading Running the Same Task in Python and R
I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem.
Please read on for a brief benchmark comparing these methods/solutions.
Continue reading Timing Column Indexing in R
We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.
Continue reading Using a Column as a Column Index
dplyr work is taking what you consider to be a too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour) then try
For some tasks
data.table is routinely faster than alternatives at pretty much all scales (example timings here).
If your project is large (millions of rows, hundreds of columns) you really should rent an an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.
This note shares an experiment comparing the performance of a number of data processing systems available in
R. Our notional or example problem is finding the top ranking item per group (group defined by three string columns, and order defined by a single numeric column). This is a common and often needed task.
Continue reading Timings of a Grouped Rank Filter Task
R package is really good at sorting. Below is a comparison of it versus
dplyr for a range of problem sizes.
Continue reading data.table is Really Good at Sorting
In this note we will show how to speed up work in
R by partitioning data and process-level parallelization. We will show the technique with three different
dplyr. The methods shown will also work with base-
R and other packages.
For each of the above packages we speed up work by using
wrapr::execute_parallel which in turn uses
wrapr::partition_tables to partition un-related
data.frame rows and then distributes them to different processors to be executed.
rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with
The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
Continue reading Speed up your R Work
rquery is an
R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on
rqdatatable is a new package that supplies a screaming fast implementation of the
rquery system in-memory using the
rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now
rquery is also one of the fastest methods to wrangle data in-memory in
R (thanks to
data.table, via a thin adaption supplied by
Continue reading rqdatatable: rquery Powered by data.table