Posted on Categories Opinion, Pragmatic Data Science, TutorialsTags ,

Why we Did Not Name the cdata Transforms wide/tall/long/short

We recently saw this UX (user experience) question from the tidyr author as he adapts tidyr to cdata techniques.

NewImage

NewImage

While adopting the cdata methodology into tidyr, the terminology that he is not adopting from cdata is “unpivot_to_blocks()” and “pivot_to_rowrecs()”. One of the research ideas in the cdata package is that the important thing to call out is record structure.

The key point is: are we in a very de-normalized form where all facts about an instance are in a single row (which we called “row records”), or are we in a record oriented form where all the facts about an instances are in several rows (which we called “block records”)? The point is: row records don’t necessarily have more columns than block records. This makes shape based naming of the transforms problematic, no matter what names you pick for the shapes. There is an advantage to using intent or semantic based naming.

Below is a simple example.

library("cdata")

# example 1 end up with more rows, fewer columns
d <- data.frame(AUC = 0.6, R2 = 0.7, F1 = 0.8)
print(d)
#>   AUC  R2  F1
#> 1 0.6 0.7 0.8
unpivot_to_blocks(d,
                  nameForNewKeyColumn= 'meas',
                  nameForNewValueColumn= 'val',
                  columnsToTakeFrom= c('AUC', 'R2', 'F1')) 
#>   meas val
#> 1  AUC 0.6
#> 2   R2 0.7
#> 3   F1 0.8

# example 2 end up with more rows, same number of columns
d <- data.frame(AUC = 0.6, R2 = 0.7)
print(d)
#>   AUC  R2
#> 1 0.6 0.7
unpivot_to_blocks(d,
                  nameForNewKeyColumn= 'meas',
                  nameForNewValueColumn= 'val',
                  columnsToTakeFrom= c('AUC', 'R2')) 
#>   meas val
#> 1  AUC 0.6
#> 2   R2 0.7

# example 3 end up with same number of rows, more columns
d <- data.frame(AUC = 0.6)
print(d)
#>   AUC
#> 1 0.6
unpivot_to_blocks(d,
                  nameForNewKeyColumn= 'meas',
                  nameForNewValueColumn= 'val',
                  columnsToTakeFrom= c('AUC'))
#>   meas val
#> 1  AUC 0.6

Notice the width of the result relative to input width varies as function of the input data, even though we were always calling the same transform. This makes it incorrect to characterize these transforms as merely widening or narrowing.

There are still some subtle points (for instance row records are in fact instances of block records), but overall the scheme we (Nina Zumel, and myself: John Mount) worked out, tested, and promoted is pretty good. A lot of our work researching this topic can be found here.

2 thoughts on “Why we Did Not Name the cdata Transforms wide/tall/long/short”

  1. These tiny tables strike me as degenerate cases. Is there a much larger class of cases for which the wider/longer intuition is incorrect?

    1. There are larger examples in that we can add in more rows, but this particular issue is driven by the situation where one only has a few measurements (so taking space to specify the measurement name stands out). But that being said: my training is as a mathematician and the principle there is often “statements that are true are true for extreme examples.”

      Our actual serious point is: Nina Zumel and I feel slowing things down to pick a chosen record shape name (row records and block records) helps a lot. Also, this is our work being adapted, so I feel we can comment on which changes are for the better and which are not (no matter how minor).

      Or: design by principle (throw out names, even if you liked them) is better than design by whim, is better than design by committee, is better than design by quick poll. It is a sign of how much the chosen audience loves the tidyverse that the winning name wasn’t something like “tablymctablyface.”

Comments are closed.