I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.
data_algebra project: a data processing tool family available in Python. These tools are designed to transform data either in memory or on remote databases.
In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations.
In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with
We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work-paradigms, so we use porting small tasks as approximate stand-ins for measuring porting whole systems.
We tend to work with medium-size data (hundreds of columns and millions of rows, held in memory), so that is the scale we simulate and study.
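As a rough illustration of the kind of measurement described above, here is a minimal Python sketch. It is not the code used for the study: the dict-of-lists `table`, the helper names, and the sizes are all assumptions made for illustration, and the sizes are deliberately far smaller than the scale we actually care about so the sketch runs quickly.

```python
import timeit

# Hypothetical stand-in for a data frame: a dict mapping column names to lists.
# Sizes are kept small so the sketch runs quickly; the note itself targets
# hundreds of columns and millions of rows.
N_ROWS = 10_000
N_COLS = 20
table = {f"col_{j}": list(range(N_ROWS)) for j in range(N_COLS)}


def select_column(t):
    """Return one column by name (no copy is made)."""
    return t["col_3"]


def alter_column(t):
    """Overwrite one column with a transformed copy."""
    t["col_3"] = [v + 1 for v in t["col_3"]]


def select_row(t, i=N_ROWS // 2):
    """Collect the i-th value from every column."""
    return {name: values[i] for name, values in t.items()}


def time_task(fn, repeats=5, number=20):
    """Best-of-`repeats` average seconds per call."""
    return min(timeit.repeat(lambda: fn(table), repeat=repeats, number=number)) / number


timings = {fn.__name__: time_task(fn) for fn in (select_column, alter_column, select_row)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} microseconds per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other activity on the machine. The absolute numbers will differ from any real data frame implementation; only the shape of the comparison carries over.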
Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user might be running to the edge of where the package developer intended their package to work, and thus often where things can go wrong.
Let’s see what happens when we try to stick a fork in the power-outlet.
This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement.
The original published timings were as follows:
With performance metrics: measurements are marketing. So let’s dig into the above a bit.
Our interference-from-the-environment issue was a bit subtle. But there are variations that can be a bit more insidious.
Please consider the following.
"One of the most useful properties of expressions is that called by Quine referential transparency. In essence this means that if we wish to find the value of an expression which contains a sub-expression, the only thing we need to know about the sub-expression is its value."
Christopher Strachey, "Fundamental Concepts in Programming Languages", Higher-Order and Symbolic Computation, 13, 11–49, 2000, Kluwer Academic Publishers (lecture notes written by Christopher Strachey for the International Summer School in Computer Programming at Copenhagen in August, 1967).
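To make the quoted property concrete, here is a small Python sketch (not the R example this note goes on to discuss). The function names `scale` and `scale_from_env` and the variable `factor` are hypothetical, chosen only for illustration. The first function is referentially transparent; the second silently reads a value from its environment, so the same expression text can produce different values.

```python
def scale(x, factor):
    """Pure: the result depends only on the arguments."""
    return x * factor


# Referentially transparent: replacing the sub-expression scale(2, 3)
# with its value 6 does not change the enclosing expression.
assert scale(2, 3) + 1 == 6 + 1

factor = 3  # ambient state that the next function silently reads


def scale_from_env(x):
    """Not referentially transparent: reads `factor` from the environment."""
    return x * factor


before = scale_from_env(2)  # evaluated while factor is 3
factor = 10                 # an unrelated-looking assignment elsewhere
after = scale_from_env(2)   # same expression text, evaluated while factor is 10
print(before, after)        # 6 20
```

Knowing the value of `scale_from_env(2)` at one point in the program is not enough to substitute it elsewhere, which is the kind of environment interference at issue here.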
Please read on for discussion of a subtle bug shared by a few popular non-standard evaluation interfaces.
I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem.
Please read on for a brief benchmark comparing these methods/solutions.
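For readers who have not met the task, here is a minimal Python sketch of the problem itself. It is not one of the R solutions being benchmarked; the column names `choice`, `a`, and `b` and the helper `choose_by_column` are hypothetical. Each row carries a column whose value names which other column to read.

```python
# Each row's "choice" entry names the column whose value should be taken.
rows = [
    {"choice": "a", "a": 1, "b": 10},
    {"choice": "b", "a": 2, "b": 20},
    {"choice": "a", "a": 3, "b": 30},
]


def choose_by_column(rows, choice_col="choice"):
    """For each row, look up the column named by that row's choice_col entry."""
    return [row[row[choice_col]] for row in rows]


chosen = choose_by_column(rows)
print(chosen)  # [1, 20, 3]
```

This row-at-a-time lookup is easy to read, but it is the kind of approach that vectorized solutions usually aim to outperform on large data.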