Posted on

## Advanced Data Reshaping in Python and R

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”).

The advantages of data_algebra and cdata are:

• The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
• The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
• The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).
Posted on Categories Administrativia, Statistics, Tutorials8 Comments on Update on coordinatized or fluid data

## Update on coordinatized or fluid data

We have just released a major update of the `cdata` R package to CRAN.

If you work with `R` and data, now is the time to check out the `cdata` package. Continue reading Update on coordinatized or fluid data

Posted on

## Introduction

In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.

One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering“) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.

Posted on 1 Comment on Coordinatized Data: A Fluid Data Specification

## Introduction

It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).

Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.

In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in `R`.