
Update on coordinatized or fluid data

We have just released a major update of the cdata R package to CRAN.

If you work with R and data, now is the time to check out the cdata package.

Among the changes in the 0.5.* version of cdata package:

  • All coordinatized data or fluid data operations are now in the cdata package (no longer split between the cdata and replyr packages).
  • The transforms are now centered on the more general table-driven moveValuesToRowsN() and moveValuesToColumnsN() operators (pivot and un-pivot remain available as convenient special cases).
  • All the transforms are now implemented in SQL through DBI (no longer using tidyr or dplyr, though we do include examples of using cdata with dplyr).
  • This is (unfortunately) a user-visible API change; however, adapting to the changed API is deliberately straightforward.
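
To make the "table-driven" idea concrete, here is a minimal sketch (in Python with SQLite, not the cdata implementation or its API) of how a control table can direct an un-pivot expressed purely in SQL. The function name `unpivot_query`, the table `d`, and its columns are hypothetical and chosen only for illustration; SQLite stands in for whatever database a DBI connection would point at.

```python
import sqlite3

def unpivot_query(wide_table, id_cols, key_col, control):
    """Return SQL that moves values to rows as directed by `control`.

    `control` is a list of dicts: each dict holds the key value under
    `key_col` and maps every other (result column) name to a source column
    of `wide_table`. Identifiers are pasted in unescaped, so this sketch is
    for trusted input only.
    """
    result_cols = [c for c in control[0] if c != key_col]
    # The control keys become a small inline table to cross join against.
    keys = " UNION ALL ".join(
        "SELECT '{0}' AS {1}".format(row[key_col], key_col) for row in control
    )
    # One CASE expression per result column picks the source column by key.
    cases = []
    for rc in result_cols:
        whens = " ".join(
            "WHEN '{0}' THEN w.{1}".format(row[key_col], row[rc])
            for row in control
        )
        cases.append("CASE c.{0} {1} END AS {2}".format(key_col, whens, rc))
    select_list = ", ".join(
        ["w." + c for c in id_cols] + ["c." + key_col] + cases
    )
    order_by = ", ".join(["w." + c for c in id_cols] + ["c." + key_col])
    return "SELECT {0} FROM {1} w CROSS JOIN ({2}) c ORDER BY {3}".format(
        select_list, wide_table, keys, order_by
    )

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE d (id INTEGER, AUC REAL, R2 REAL)")
con.executemany("INSERT INTO d VALUES (?, ?, ?)", [(1, 0.6, 0.2), (2, 0.8, 0.3)])

# Each control row names a key and the wide column that feeds `value`.
control = [
    {"measure": "AUC", "value": "AUC"},
    {"measure": "R2", "value": "R2"},
]
tall_rows = con.execute(unpivot_query("d", ["id"], "measure", control)).fetchall()
for row in tall_rows:
    print(row)
```

A simple un-pivot is the special case where every control row maps one key to one source column, as here; a control table with several result columns would move whole blocks of values at once.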

cdata now supplies very general data transforms on both in-memory data.frames and remote or large data systems (PostgreSQL, Spark/Hive, and so on). These transforms include operators such as pivot/un-pivot that were previously not conveniently available for these data sources (for example tidyr does not operate on such data, despite dplyr doing so).
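
As a rough illustration of why such transforms can run where the data lives, the following sketch writes a pivot (moving values to columns) as a single SQL aggregation. It is an assumption-laden example, not cdata code: the helper `pivot_query`, the table `tall`, and its columns are made up, and an in-memory SQLite database stands in for a remote system such as PostgreSQL.

```python
import sqlite3

def pivot_query(tall_table, id_col, key_col, value_col, keys):
    """Return SQL that spreads `value_col` into one column per key.

    Identifiers and keys are pasted in unescaped (trusted input only).
    """
    # MAX(CASE ...) keeps the one matching value per group and key.
    picks = ", ".join(
        "MAX(CASE WHEN {0} = '{1}' THEN {2} END) AS {1}".format(
            key_col, k, value_col
        )
        for k in keys
    )
    return "SELECT {0}, {1} FROM {2} GROUP BY {0} ORDER BY {0}".format(
        id_col, picks, tall_table
    )

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tall (id INTEGER, measure TEXT, value REAL)")
con.executemany(
    "INSERT INTO tall VALUES (?, ?, ?)",
    [(1, "AUC", 0.6), (1, "R2", 0.2), (2, "AUC", 0.8), (2, "R2", 0.3)],
)

wide_rows = con.execute(
    pivot_query("tall", "id", "measure", "value", ["AUC", "R2"])
).fetchall()
for row in wide_rows:
    print(row)
```

Because the whole transform is one query, only the query text travels to the database and the data itself never has to be pulled into local memory.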

To help transition we have updated the existing documentation:

The fluid data document is a bit long, as it covers a lot of concepts quickly. We hope to develop more targeted training material going forward.

In summary: cdata theory and package now allow very concise and powerful transformations of big data using R.

8 thoughts on “Update on coordinatized or fluid data”

  1. Really interesting, and the new docs are helpful!
    Re. “rows and columns are inessential implementation details when reasoning about your data,” that’s true, and a somewhat challenging abstraction for some. Though, of course, it’s difficult to decipher the chicken/egg problem there, given how many of my earliest experiences with “data” were spreadsheet-based.

    I’m not sure it’s of any consequence to the point, but the rstudio data transformation cheat sheet has been updated.

    1. Thanks Mara,

      We are still getting our heads around the concept here (there are a lot of interesting ramifications). We all use mental models for data, and a spreadsheet (or table) is not a bad one at all. The mental model this work is getting towards is something like “RDF triples” (though we don’t want the nightmares that are typical for implementations of such).

      What is the update you mention? Can you share a link?

      Nina is going to go through this and write even better (more readable) docs “in her copious spare time” (i.e., when she has some time). Definitely appreciate any help promoting the work.

      We already have the transforms working at big data scale (Spark and PostgreSQL) at client sites (in fact one of the reasons to package this work is so these consulting clients can get away from one-off custom code).

      John

        1. Thanks,

          I like the common R “denormalized data” model for reasoning about data (everything you want to know about an instance in a single row, using the term denormalized to offer some respect for Codd’s relational data model). I’ve never seen durable RDF triples used well (often a sign of an architectural astronaut), but I wanted to give them credit as being the “where is the data during jumps” (i.e., when it is moving from one table to another, and a reference to “where is the electron during a quantum jump” from physics). Explicitly basing things on RDF (instead of just using ideas) was one of the many blunders documented in Scott Rosenberg’s enjoyable book “Dreaming in Code” (Crown 2007).

  2. Interesting — I’m not a data scientist, just traditional field scientist. I do my analyses on data in the ‘denormalized’ and ‘entry/attr/value’ forms (what Wickham calls wide and long I think?). Anyway, the ‘normalised’ form has never occurred to me, but I can see the value in terms of data integrity. You showed how all the different formats could be converted into denormalised form, but is there a way to convert from denormalised to normalised using the operations you outline?

    1. “is there a way to convert from denormalised to normalised using the operations you outline?”

      Emphatically: yes, there is. It involves more than one step (joins, projections, and so on). In this article we imagine four different storage strategies for the same data, and show the transforms that move between them (I think I only added the last few arcs to make the graph of scientists strongly connected in the comments of the original article here).
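
      As a minimal sketch of the "projections and joins" route (illustrative only; the table and column names are hypothetical and this is not taken from the article), a denormalized table can be split into normalized tables with two projections, and a join recovers the original rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE denorm (site TEXT, region TEXT, species TEXT, n INTEGER)")
rows = [
    ("A", "north", "finch", 3),
    ("A", "north", "wren", 5),
    ("B", "south", "finch", 2),
]
con.executemany("INSERT INTO denorm VALUES (?, ?, ?, ?)", rows)

# Projection 1: facts that depend only on the site get their own table,
# stored once per site instead of once per observation.
con.execute("CREATE TABLE sites AS SELECT DISTINCT site, region FROM denorm")
# Projection 2: observations keep only the site key plus their own facts.
con.execute("CREATE TABLE counts AS SELECT site, species, n FROM denorm")

site_rows = con.execute("SELECT site, region FROM sites ORDER BY site").fetchall()

# Joining the two normalized tables rebuilds the denormalized form.
rejoined = con.execute(
    "SELECT c.site, s.region, c.species, c.n "
    "FROM counts c JOIN sites s ON s.site = c.site "
    "ORDER BY c.site, c.species"
).fetchall()
print(site_rows)
print(rejoined)
```

      The data-integrity benefit shows up in the `sites` table: each site's region is recorded in exactly one row, so it cannot disagree with itself across observations.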

      The main point: which format is “best” depends a lot on context. Machine learning wants denormalized form (wide form); data recorders tend to produce the tall form (which `ggplot2` also uses, and which is called “molten” when thinking in terms of melt/cast); and normal form (multiple tables) is preferred in data warehouses.

      Our group has found that classical scientists (chemists, field scientists, biologists, physicists, and so on) tend to understand wide/tall conversions much better than data engineers. Data engineers tend to understand de-normalized/normalized conversions better than scientists. I think it comes down to what you actually need/use to get your actual work done. Codd’s theory was game changing.
