Nina Zumel and I have two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version.
This reflects our opinion on the question “which is better for data science, R or Python?” Both are great. So start with one, and expect to eventually work with both (if you are lucky).
Continue reading Data re-Shaping in R and in Python
I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.
Continue reading New Timings for a Grouped In-Place Aggregation Task
In our recent note What is new for rquery December 2019 we mentioned an ugly processing pipeline that translates into SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the
Continue reading Better SQL Generation via the data_algebra
We have a new rquery vignette here: Working with Many Columns.
This is an attempt to get back to writing about how to use the package to work with data (versus the other-day’s discussion of package design/implementation).
Please check it out.
Our goal has been to make rquery the best query generation system for R (and to make data_algebra the best query generator for Python). Let's see what rquery is good at, and what new features are making
Continue reading What is new for rquery December 2019
John Mount, Nina Zumel; Win-Vector LLC 2019-04-27
In this note we will use five real-life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.
Continue reading Data Layout Exercises
Here is an example of how easy it is to use cdata to re-layout your data.
Tim Morris recently tweeted the following problem (corrected).
Please will you take pity on me #rstats folks?
I only want to reshape two variables x & y from wide to long!
id xa xb ya yb
1 1 3 6 8
2 2 4 7 9
How can I get to:
id t x y
1 a 1 6
1 b 3 8
2 a 2 7
2 b 4 9
In Stata it's:
. reshape long x y, i(id) j(t) string
In R, it's:
. an hour of cursing followed by a desperate tweet 👆
Thanks for any help!
PS – I can make reshape() or gather() work when I have just x or just y.
This is not to make fun of Tim Morris: the above should be easy. Using diagrams and breaking the data transform into small steps makes the process very easy.
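To make the "small steps" idea concrete, here is a minimal sketch of the wide-to-long transform from the tweet in plain Python (standard library only). It is not the cdata API. The names `control` and `wide_to_long` are hypothetical, chosen for illustration. The idea is to first write down a small table saying which wide column feeds each long column for each value of t, and then apply that table row by row.

```python
# Wide records from the tweet: one row per id, with columns xa, xb, ya, yb.
wide = [
    {"id": 1, "xa": 1, "xb": 3, "ya": 6, "yb": 8},
    {"id": 2, "xa": 2, "xb": 4, "ya": 7, "yb": 9},
]

# Step 1: describe the layout (hypothetical "control" table).
# Each entry says which wide columns supply x and y for a given value of t.
control = [
    {"t": "a", "x": "xa", "y": "ya"},
    {"t": "b", "x": "xb", "y": "yb"},
]


def wide_to_long(rows, control, key="id"):
    """Step 2: emit one long row per (input row, control entry) pair."""
    out = []
    for row in rows:
        for spec in control:
            out.append({
                key: row[key],
                "t": spec["t"],
                "x": row[spec["x"]],
                "y": row[spec["y"]],
            })
    return out


long_rows = wide_to_long(wide, control)
for r in long_rows:
    print(r["id"], r["t"], r["x"], r["y"])
```

Keeping the layout description (`control`) separate from the code that applies it means handling a different set of columns only requires editing the small table, not the loop.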
Continue reading Controlling Data Layout With cdata