Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , , Leave a comment on Data re-Shaping in R and in Python

Data re-Shaping in R and in Python

Nina Zumel and I have a two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version of the tutorial.

This reflects our opinion on the “which is better for data science R or Python?” They both are great. So start with one, and expect to eventually work with both (if you are lucky).

Continue reading Data re-Shaping in R and in Python

Posted on Categories data science, Opinion, Pragmatic Data Science, TutorialsTags , , , , , , , , , 1 Comment on New Timings for a Grouped In-Place Aggregation Task

New Timings for a Grouped In-Place Aggregation Task

I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.

Continue reading New Timings for a Grouped In-Place Aggregation Task

Posted on Categories data science, Pragmatic Data Science, TutorialsTags , , , ,

A Richer Category for Data Wrangling

I’ve been writing a lot about a category theory interpretations of data-processing pipelines and some of the improvements we feel it is driving in both the data_algebra and in rquery/rqdatatable.

I think I’ve found an even better category theory re-formulation of the package, which I will describe here.

Continue reading A Richer Category for Data Wrangling

Posted on Categories Administrativia, Computer Science, Pragmatic Data ScienceTags , , , ,

Better SQL Generation via the data_algebra

In our recent note What is new for rquery December 2019 we mentioned an ugly processing pipeline that translates into SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra.

Continue reading Better SQL Generation via the data_algebra

Posted on Categories data science, TutorialsTags , , , , ,

New rquery Vignette: Working with Many Columns

We have a new rquery vignette here: Working with Many Columns.

This is an attempt to get back to writing about how to use the package to work with data (versus the other-day’s discussion of package design/implementation).

Please check it out.

Posted on Categories data science, Exciting Techniques, TutorialsTags , , , , , 3 Comments on What is new for rquery December 2019

What is new for rquery December 2019

Our goal has been to make rquery the best query generation system for R (and to make data_algebra the best query generator for Python).

Lets see what rquery is good at, and what new features are making rquery better.

Continue reading What is new for rquery December 2019

Posted on Categories data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , , , , , , , ,

Advanced Data Reshaping in Python and R

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”).

The advantages of data_algebra and cdata are:

  • The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
  • The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
  • The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Continue reading Advanced Data Reshaping in Python and R

Posted on Categories Administrativia, data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , , , , , , , ,

Introducing data_algebra

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases.

In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable).

Continue reading Introducing data_algebra

Posted on Categories data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, TutorialsTags , ,

Data Layout Exercises

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.

Continue reading Data Layout Exercises

Posted on Categories Coding, Practical Data Science, Pragmatic Data Science, TutorialsTags , , 4 Comments on Controlling Data Layout With cdata

Controlling Data Layout With cdata

Here is an example how easy it is to use cdata to re-layout your data.

Tim Morris recently tweeted the following problem (corrected).

Please will you take pity on me #rstats folks?
I only want to reshape two variables x & y from wide to long!

Starting with:
    d xa xb ya yb
    1  1  3  6  8
    2  2  4  7  9

How can I get to:
    id t x y
    1  a 1 6
    1  b 3 8
    2  a 2 7
    2  b 4 9
    
In Stata it's:
 . reshape long x y, i(id) j(t) string
In R, it's:
 . an hour of cursing followed by a desperate tweet 👆

Thanks for any help!

PS – I can make reshape() or gather() work when I have just x or just y.

This is not to make fun of Tim Morris: the above should be easy. Using diagrams and slowing down the data transform into small steps makes the process very easy.

Continue reading Controlling Data Layout With cdata