Posted on Categories Coding, TutorialsTags , ,

How cdata Control Table Data Transforms Work

With all of the excitement surrounding cdata style control table based data transforms (the cdata ideas being named as the “replacements” for tidyr‘s current methodology, by the tidyr authors themselves!) I thought I would take a moment to describe how they work.

cdata defines two primary data manipulation operators: rowrecs_to_blocks() and blocks_to_rowrecs(). These are the fundamental transforms that convert between data representations. The two representations it converts between are:

  • A world where all facts about an instance or record are in a single row (“rowrecs”).
  • A world where all facts about an instance or record are in groups of rows (“blocks”).

It turns out once you develop the idea of specifying the data transformation as explicit data (an application of Erick S. Raymond’s admonition: “fold knowledge into data, so program logic can be stupid and robust.”), you have also a great tool for reasoning and teaching data transforms.

For example:

rowrecs_to_blocks() does the following. For each row record, make a replicant of the of the control table with values filled in. In relational terms rowrecs_to_blocks() is therefore a join of the data to the control table. Conversely blocks_to_rowrecs() combines groups of rows into single rows, so in relational terms it is an aggregation or projection. If each of these operations is faithful (keeps enough information around) they are then inverse of each other.

We share some nifty tutorials on the ideas here:

One can build fairly clever illustrations and animations to teach the above.

The most common special cases of the above have been popularized in R as unpivot/pivot (pivot invented by Pito Salas), stack/unstack, melt/cast, or gather/spread. These special cases are handled in cdata by convenience functions unpivot_to_blocks() and pivot_to_rowrecs(). A great example of a “higher order” transform that isn’t one of the common ones is given here.

Note: the above theory and implementation is joint work of Nina Zumel and John Mount and can be found here. We would really appreciate any citations or credit you can send our way (or even politely correcting those who don’t attribute the work or attribute the work to others, as there are already a lot of mentions without credit or citation).

citation("cdata")

To cite package ‘cdata’ in publications use:

  John Mount and Nina Zumel (2019). cdata: Fluid Data Transformations. https://github.com/WinVector/cdata/,
  https://winvector.github.io/cdata/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {cdata: Fluid Data Transformations},
    author = {John Mount and Nina Zumel},
    year = {2019},
    note = {https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/},
  }

2 thoughts on “How cdata Control Table Data Transforms Work”

  1. Notationally rowrecs_to_blocks() is multiplying by the control table and blocks_to_rowrecs() is divide by the control table. We have the operator notation worked out here.

Comments are closed.