In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames.
We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work-paradigms, so we use porting small tasks as approximate stand-ins for measuring porting whole systems.
We tend to work with medium size data (hundreds of columns and millions of rows in memory), so that is the scale we simulate and study.
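As a rough illustration of the kind of measurement meant here, the following is a minimal base R sketch using system.time(). The data frame d, its column names, and the repetition count are made up for the example, and the data is far smaller than the scale described above; this is not the benchmark code from the study itself.

```r
# Hypothetical small example data (the study itself works at a much larger scale).
set.seed(2019)
n <- 1e5
d <- data.frame(a = seq_len(n), b = runif(n))

reps <- 100  # assumed repetition count, chosen only to get measurable timings

# Time selecting a column.
t_col <- system.time(for (i in seq_len(reps)) v <- d[["a"]])

# Time altering a column.
t_alter <- system.time(for (i in seq_len(reps)) d[["b"]] <- d[["b"]] + 1)

# Time selecting a single row (kept as a data.frame).
t_row <- system.time(for (i in seq_len(reps)) r <- d[7, , drop = FALSE])

# Elapsed seconds for each task; values vary by machine.
print(c(column = t_col[["elapsed"]],
        alter = t_alter[["elapsed"]],
        row = t_row[["elapsed"]]))
```

For serious comparisons a dedicated benchmarking package would likely give more stable numbers than raw system.time() loops.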
Continue reading Timing Working With a Row or a Column from a data.frame
Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user might be running to the edge of where the package developer intended their package to work, and thus often where things can go wrong.
Let’s see what happens when we try to stick a fork in the power-outlet.
Continue reading Data Manipulation Corner Cases
This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement.
The original published timings were as follows:
With performance metrics: measurements are marketing. So let’s dig into the above a bit.
Continue reading Timing Grouped Mean Calculation in R
Our interference from the environment issue was a bit subtle. But there are variations that can be a bit more insidious.
Please consider the following.
Continue reading A Better Example of the Confused By The Environment Issue
It is no great secret: I like value oriented interfaces that preserve referential transparency. It is the side of the public debate I take in
"One of the most useful properties of expressions is that called by Quine referential transparency. In essence this means that if we wish to find the value of an expression which contains a sub-expression, the only thing we need to know about the sub-expression is its value."
Christopher Strachey, "Fundamental Concepts in Programming Languages", Higher-Order and Symbolic Computation, 13, 11-49, 2000, Kluwer Academic Publishers (lecture notes written by Christopher Strachey for the International Summer School in Computer Programming at Copenhagen in August, 1967).
Please read on for discussion of a subtle bug shared by a few popular non-standard evaluation interfaces.
Continue reading A Subtle Flaw in Some Popular R NSE Interfaces
I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem.
Please read on for a brief benchmark comparing these methods/solutions.
Continue reading Timing Column Indexing in R
We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.
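To make the question concrete, here is one possible base R approach using matrix indexing. The data frame d and the column names x, y, choice, and derived are invented for this sketch; it is not necessarily the solution given in the full article.

```r
# Hypothetical example data: `choice` names the column to read from in each row.
d <- data.frame(
  x = c(1, 2, 3),
  y = c(10, 20, 30),
  choice = c("x", "y", "x"),
  stringsAsFactors = FALSE
)

# Columns that `choice` is allowed to refer to.
value_cols <- c("x", "y")

# Build a two-column index matrix: one (row, column) pair per row of d.
idx <- cbind(seq_len(nrow(d)), match(d$choice, value_cols))

# Indexing a matrix by a two-column matrix picks one cell per index row.
d$derived <- as.matrix(d[, value_cols])[idx]

print(d$derived)
```

Note that as.matrix() coerces all selected columns to a common type, so this form is best suited to value columns that already share a type.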
Continue reading Using a Column as a Column Index
One thing that is sure to get lost in my long note on macros in R is just how concise and powerful macros are. The problem is macros are concise, but they do a lot for you. So you get bogged down when you explain the joke.
Let’s try to be concise.
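As a small taste, here is a minimal sketch of substituting a column name into an expression with base R's bquote(). The variable names col, expr, and d are chosen only for illustration and are not taken from the full article.

```r
# Hypothetical: the column we want to work with, held as a symbol.
col <- as.name("x")

# bquote() quotes its argument, except that .(col) is replaced by the value of col.
expr <- bquote(.(col) + 1)
print(expr)

# The resulting expression can then be evaluated against a data.frame.
d <- data.frame(x = c(1, 2, 3))
result <- eval(expr, d)
print(result)
```

The point of the sketch is that the code is built once, with the name filled in, and only then evaluated.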
Continue reading Parameterizing with bquote
This note shares an experiment comparing the performance of a number of data processing systems available in R. Our notional or example problem is finding the top ranking item per group (group defined by three string columns, and order defined by a single numeric column). This is a common and often needed task.
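For readers who want to see the task itself, the following is a small base R sketch of a grouped rank filter. The data frame d and the column names g1, g2, g3, and score are assumptions made for the example; the full article compares several systems and this is not its benchmark code.

```r
# Hypothetical example data: three string grouping columns and one numeric ordering column.
d <- data.frame(
  g1 = c("a", "a", "b", "b"),
  g2 = c("u", "u", "v", "v"),
  g3 = c("p", "p", "q", "q"),
  score = c(0.2, 0.9, 0.5, 0.1),
  stringsAsFactors = FALSE
)

# Sort by group, then by descending score within each group.
ord <- d[order(d$g1, d$g2, d$g3, -d$score), ]

# Keep the first (highest-scoring) row of each group.
top <- ord[!duplicated(ord[, c("g1", "g2", "g3")]), ]
rownames(top) <- NULL

print(top)
```

Ties in score are broken here by original row order, which may or may not be what a given application wants.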
Continue reading Timings of a Grouped Rank Filter Task
The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.
Continue reading data.table is Really Good at Sorting