Posted on Categories data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, TutorialsTags , ,

Data Layout Exercises

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.

Continue reading Data Layout Exercises

Posted on Categories Coding, Practical Data Science, Pragmatic Data Science, TutorialsTags , , 4 Comments on Controlling Data Layout With cdata

Controlling Data Layout With cdata

Here is an example how easy it is to use cdata to re-layout your data.

Tim Morris recently tweeted the following problem (corrected).

Please will you take pity on me #rstats folks?
I only want to reshape two variables x & y from wide to long!

Starting with:
    d xa xb ya yb
    1  1  3  6  8
    2  2  4  7  9

How can I get to:
    id t x y
    1  a 1 6
    1  b 3 8
    2  a 2 7
    2  b 4 9
    
In Stata it's:
 . reshape long x y, i(id) j(t) string
In R, it's:
 . an hour of cursing followed by a desperate tweet 👆

Thanks for any help!

PS – I can make reshape() or gather() work when I have just x or just y.

This is not to make fun of Tim Morris: the above should be easy. Using diagrams and slowing down the data transform into small steps makes the process very easy.

Continue reading Controlling Data Layout With cdata

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, TutorialsTags , ,

The blocks and rows theory of data shaping

We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the cdata data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.

Rowrecs to blocks

Posted on Categories Coding, data science, Programming, TutorialsTags , , , , 15 Comments on Using a Column as a Column Index

Using a Column as a Column Index

We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.

Continue reading Using a Column as a Column Index

Posted on Categories Coding, data science, Exciting Techniques, Programming, Statistics, TutorialsTags , , ,

Wanted: cdata Test Pilots

I need a few volunteers to please “test pilot” the development version of the R package cdata, please.

Jackie Cochran at 1938 Bendix Race
Jacqueline Cochran: at the time of her death, no other pilot held more speed, distance, or altitude records in aviation history than Cochran.

Continue reading Wanted: cdata Test Pilots

Posted on Categories Computer Science, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, ProgrammingTags , , , , , , 3 Comments on rquery: Fast Data Manipulation in R

rquery: Fast Data Manipulation in R

Win-Vector LLC recently announced the rquery R package, an operator based query generator.

In this note I want to share some exciting and favorable initial rquery benchmark timings.

Continue reading rquery: Fast Data Manipulation in R

Posted on Categories Administrativia, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Programming, StatisticsTags , , , , , , , , 4 Comments on Getting started with seplyr

Getting started with seplyr

A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working R and big data I think the seplyr package can be a valuable tool.


Safety
Continue reading Getting started with seplyr

Posted on Categories Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , ,

Data Wrangling at Scale

Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template).

Fd

Please check it out.

Posted on Categories data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , ,

Teaching pivot / un-pivot

Authors: John Mount and Nina Zumel

Introduction

In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.

One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering“) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.

Please read on for our thoughts on teaching pivoting data. Continue reading Teaching pivot / un-pivot

Posted on Categories data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , , 1 Comment on Coordinatized Data: A Fluid Data Specification

Coordinatized Data: A Fluid Data Specification

Authors: John Mount and Nina Zumel.

Introduction

It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).

Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.

In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in R.

Continue reading Coordinatized Data: A Fluid Data Specification