I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.
I’ve been writing a lot about a category theory interpretations of data-processing pipelines and some of the improvements we feel it is driving in both the
data_algebra and in
I think I’ve found an even better category theory re-formulation of the package, which I will describe here.
In our recent note What is new for
rquery December 2019 we mentioned an ugly processing pipeline that translates into
SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the
This is an attempt to get back to writing about how to use the package to work with data (versus the other-day’s discussion of package design/implementation).
Please check it out.
Lets see what
rquery is good at, and what new features are making
- The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
- The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
- The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).
data_algebraproject: a data processing tool family available in
Python. These tools are designed to transform data either in-memory or on remote databases.
In particular we will discuss the
Python implementation (also called
data_algebra) and its relation to the mature
R implementations (
John Mount, Nina Zumel; Win-Vector LLC 2019-04-27
In this note we will use five real life examples to demonstrate data layout transforms using the
R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.
Here is an example how easy it is to use
cdata to re-layout your data.
Tim Morris recently tweeted the following problem (corrected).
Please will you take pity on me #rstats folks? I only want to reshape two variables x & y from wide to long! Starting with: d xa xb ya yb 1 1 3 6 8 2 2 4 7 9 How can I get to: id t x y 1 a 1 6 1 b 3 8 2 a 2 7 2 b 4 9 In Stata it's: . reshape long x y, i(id) j(t) string In R, it's: . an hour of cursing followed by a desperate tweet 👆 Thanks for any help! PS – I can make reshape() or gather() work when I have just x or just y.
This is not to make fun of Tim Morris: the above should be easy. Using diagrams and slowing down the data transform into small steps makes the process very easy.