- The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
- The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
- The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).
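The three points above can be illustrated with a minimal sketch in plain pandas. This is not the cdata or data_algebra API: the `apply_layout` helper, the column names, and the measurements are all hypothetical, chosen only to show a transform that is written down as data, can be printed before it runs, and can be serialized for use by another system.

```python
import json
import pandas as pd

# Hypothetical input: one row per id, one column per (measure, year) pair.
d = pd.DataFrame({
    "id": [1, 2],
    "h18": [150, 160], "h19": [152, 161],
    "w18": [50, 60], "w19": [51, 62],
})

# The transform, written as data: each row of this table describes one output
# row per input record; cells under "height" and "weight" name input columns.
control = pd.DataFrame({
    "year": [2018, 2019],
    "height": ["h18", "h19"],
    "weight": ["w18", "w19"],
})

def apply_layout(d, control, record_keys):
    """Illustrative helper (an assumption, not a library function)."""
    key_col = control.columns[0]
    value_cols = list(control.columns[1:])
    pieces = []
    for row in control.to_dict(orient="records"):
        piece = d[record_keys].copy()
        piece[key_col] = row[key_col]
        for v in value_cols:
            # Copy the input column named in the control table cell.
            piece[v] = d[row[v]].to_numpy()
        pieces.append(piece)
    out = pd.concat(pieces, ignore_index=True)
    return out.sort_values(record_keys + [key_col]).reset_index(drop=True)

# The transform can be inspected before it is applied.
print(control)

# Because it is plain data, it can be shared as text and rebuilt elsewhere.
spec_text = json.dumps(control.to_dict(orient="list"))
control_back = pd.DataFrame(json.loads(spec_text))

result = apply_layout(d, control_back, record_keys=["id"])
print(result)
```

The design choice worth noticing is that the control table looks like one record of the desired output, which is what "by example" means here.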
data_algebra project: a data processing tool family available in Python. These tools are designed to transform data either in-memory or on remote databases. In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (
John Mount, Nina Zumel; Win-Vector LLC 2019-04-27
In this note we will use five real-life examples to demonstrate data layout transforms using the R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.
Here is an example of how easy it is to use cdata to re-lay out your data.
Tim Morris recently tweeted the following problem (corrected).
Please will you take pity on me #rstats folks? I only want to reshape two variables x & y from wide to long!

Starting with:

    id xa xb ya yb
     1  1  3  6  8
     2  2  4  7  9

How can I get to:

    id t x y
     1 a 1 6
     1 b 3 8
     2 a 2 7
     2 b 4 9

In Stata it's:
    . reshape long x y, i(id) j(t) string
In R, it's:
    . an hour of cursing followed by a desperate tweet 👆

Thanks for any help!

PS – I can make reshape() or gather() work when I have just x or just y.
This is not to make fun of Tim Morris: the above should be easy. Using diagrams and breaking the data transform down into small steps makes the process very easy.
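As a rough illustration of the small-steps idea, here is a sketch of the same reshape in plain pandas. It is a swapped-in technique, not the cdata solution the note goes on to present, and the intermediate names (`long`, `column`, `variable`, `value`) are assumptions made for this example.

```python
import pandas as pd

# The starting layout from the tweet.
d = pd.DataFrame({
    "id": [1, 2],
    "xa": [1, 2], "xb": [3, 4],
    "ya": [6, 7], "yb": [8, 9],
})

# Step 1: one row per (id, source column) pair.
long = d.melt(id_vars=["id"], var_name="column", value_name="value")

# Step 2: split names such as "xa" into the variable "x" and the key "a".
long["variable"] = long["column"].str[0]
long["t"] = long["column"].str[1:]

# Step 3: spread the variable names back out into columns x and y.
result = (
    long.pivot(index=["id", "t"], columns="variable", values="value")
    .reset_index()
    .rename_axis(columns=None)
)
print(result)
```

Each intermediate frame can be printed and checked against the target layout, which is most of what makes the stepwise approach easy to reason about.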
Jacqueline Cochran: at the time of her death, no other pilot held more speed, distance, or altitude records in aviation history than Cochran.
In this note I want to share some exciting and favorable initial rquery benchmark timings.