In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.
One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering“) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.
There are a number of statistical principles that are perhaps more honored in the breach than in the observance. For fun I am going to name a few, and show why they are not always the “precision surgical knives of thought” one would hope for (working more like large hammers).
It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).
Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.
In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in R.
I had the privilege of having Peter Wang himself demonstrate datashader for me and answer a few of my questions.
I am so excited about datashader capabilities I literally will not wait for the functionality to be exposed in R through rbokeh. I am going to leave my usual knitr/rmarkdown world and dust off Jupyter Notebook just to use datashader plotting. This is worth trying, even for diehard R users. Continue reading Datashader is a big deal
sigr is a simple R package that conveniently formats a few statistics and their significance tests. This allows the analyst to use the correct test no matter what modeling package or procedure they use.