We have been writing a lot on higher-order data transforms lately:
What I want to do now is "write a bit more, so I finally feel I have been concise."
Continue reading Arbitrary Data Transforms Using cdata
Continue reading dplyr in Context
R users often come to the false impression that the popular packages
tidyr are both all of
R and sui generis inventions (in that they might be unprecedented and there might no other reasonable way to get the same effects in
R). These packages and their conventions are high-value, but they are results of evolution and implement a style of programming that has been available in
R for some time. They evolved in a context, and did not burst on the scene fully armored with spear in hand.
Authors: John Mount and Nina Zumel
In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.
One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering“) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.
Please read on for our thoughts on teaching pivoting data. Continue reading Teaching pivot / un-pivot
Authors: John Mount and Nina Zumel.
It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).
Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.
In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in
Continue reading Coordinatized Data: A Fluid Data Specification