I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.
Let’s see what rquery is good at, and what new features are making it even faster.
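To make the task concrete, here is a minimal sketch (my own illustration, not the client’s workflow or the timed benchmark code) of a grouped in-place aggregation in rquery, assuming a hypothetical group column g and value column v:

```r
library(wrapr)        # supplies the %.>% pipe
library(rquery)
library(rqdatatable)  # supplies an in-memory executor for rquery pipelines

# Hypothetical example data: one group column, one value column.
d <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 3))

# Grouped "in-place" aggregation: add a per-group aggregate as a new
# column, so each row carries the maximum of v within its g-group.
ops <- local_td(d) %.>%
  extend(., v_max := max(v), partitionby = "g")

d %.>% ops
```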
This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement.
The original published timings were as follows:
With performance metrics, measurement is marketing. So let’s dig into the above a bit.
This type of cooperation and user choice is what keeps the
R community vital. Please encourage it. (Heck, please insist on it!)
Let’s use this as an excuse to take a quick look at what happens when we try the same task in both systems.
I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem.
Please read on for a brief benchmark comparing these methods/solutions.
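As a concrete sketch of just one of those methods (my own illustration with made-up column names, not the benchmark code), base R matrix indexing solves the problem in a single vectorized step:

```r
# Hypothetical data: per row, the "choice" column names which of x or y to take.
d <- data.frame(
  x = c(1, 2, 3),
  y = c(4, 5, 6),
  choice = c("x", "y", "x"),
  stringsAsFactors = FALSE
)

# Base R matrix indexing: a two-column (row, column) index matrix
# picks one cell per row without any explicit loop.
m <- as.matrix(d[, c("x", "y")])
d$derived <- m[cbind(seq_len(nrow(d)), match(d$choice, colnames(m)))]

d$derived  # 1 5 3
```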
If your dplyr work is taking what you consider to be too long (seconds instead of instant, minutes instead of seconds, hours instead of minutes, or a day instead of an hour), then try data.table.
For some tasks, data.table is routinely faster than alternatives at pretty much all scales (example timings here).
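As a hedged sketch of what such a switch can look like (illustrative column names, not the timed benchmark code), here is the same per-group aggregation written in both systems:

```r
library(dplyr)
library(data.table)

# Hypothetical example data.
d <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 3))

# dplyr version: mean of v per group g.
d %>%
  group_by(g) %>%
  summarize(mean_v = mean(v))

# data.table translation of the same aggregation.
dt <- as.data.table(d)
dt[, .(mean_v = mean(v)), by = g]
```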
If your project is large (millions of rows, hundreds of columns) you really should rent an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.
This note shares an experiment comparing the performance of a number of data processing systems available in
R. Our example problem is finding the top-ranking item per group (groups defined by three string columns, order defined by a single numeric column). This is a common and frequently needed task.
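To fix ideas, here is one way to express the task itself (a sketch with made-up key and value column names, not any particular benchmarked implementation), written in data.table:

```r
library(data.table)

# Hypothetical data: three string key columns and one numeric order column.
set.seed(2019)
n <- 100000
dt <- data.table(
  k1 = sample(letters, n, replace = TRUE),
  k2 = sample(letters, n, replace = TRUE),
  k3 = sample(letters, n, replace = TRUE),
  v  = runif(n)
)

# Top-ranking item per group: within each (k1, k2, k3) group,
# keep the row with the largest v.
top <- dt[, .SD[which.max(v)], by = .(k1, k2, k3)]
```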