
What is “Tidy Data”?

I would like to write a bit on the meaning and history of the phrase “tidy data.”

Hadley Wickham has been promoting the term “tidy data.” For example in an eponymous paper, he wrote:

In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Wickham, Hadley “Tidy Data”, Journal of Statistical Software, Vol 59, 2014.

Let’s try to apply this definition to the following data set from Wikipedia:

Tournament Winners

  Tournament            Year  Winner          Winner Date of Birth
  Indiana Invitational  1998  Al Fredrickson  21 July 1975
  Cleveland Open        1999  Bob Albertson   28 September 1968
  Des Moines Masters    1999  Al Fredrickson  21 July 1975
  Indiana Invitational  1999  Chip Masterson  14 March 1977

This would seem to be a nice “ready to analyze” data set. Rows are keyed by tournament and year, and rows carry additional key-derived observations of winner’s name and winner’s date of birth. From such a data set we could look for repeated winners, and look at the age of winners.

A question is: is such a data set “tidy”? The paper itself claims the above definitions are “Codd’s 3rd normal form.” So no, the above table is not “tidy” under that paper’s definition. The winner’s date of birth is a fact about the winner alone, and not a fact about the joint row keys (the tournament plus year) as required by the rules of Codd’s 3rd normal form. The critique is that this data presentation does not express the intended data invariant that Al Fredrickson must have the same “Winner Date of Birth” in all rows.
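To make the normal-form objection concrete, here is a minimal Python sketch (standard library only) of splitting the example table in two. The names `rows`, `tournaments`, and `players` are illustrative, and using the winner's name as a key assumes names are unique, which the source table does not promise.

```python
# The example table as (tournament, year, winner, winner date of birth) rows.
rows = [
    ("Indiana Invitational", 1998, "Al Fredrickson", "21 July 1975"),
    ("Cleveland Open", 1999, "Bob Albertson", "28 September 1968"),
    ("Des Moines Masters", 1999, "Al Fredrickson", "21 July 1975"),
    ("Indiana Invitational", 1999, "Chip Masterson", "14 March 1977"),
]

# Table 1: facts about the joint key (tournament, year).
tournaments = {(t, y): winner for t, y, winner, _ in rows}

# Table 2: facts about the winner alone (assumes names identify winners).
players = {}
for _, _, winner, dob in rows:
    # setdefault stores the first date seen and returns the stored one,
    # so a conflicting date in a later row is caught here.
    if players.setdefault(winner, dob) != dob:
        raise ValueError(f"conflicting dates of birth for {winner}")

print(len(tournaments), len(players))  # 4 3
```

After the split, Al Fredrickson's date of birth is stored exactly once, so the invariant holds by construction rather than by convention.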

Around January of 2017 Hadley Wickham apparently retconned the “tidy data” definition to be:

Tidy data is data where:

  1. Each variable is in a column.
  2. Each observation is a row.
  3. Each value is a cell.

tidyr commit 58cf5d1ebad6b26bd33ad1c94cc5e5e7ef1acf7e

Notice point-3 is now something possibly more related to Codd’s guaranteed access rule, and now the example table is plausibly “tidy.”

The above concept was already well known in statistics and called a “data matrix.” For example:

A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual.

Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 1.

One must understand that in statistics, “individual” often refers to observations, not people.

The above reference clearly considers “data matrix” to be a noun phrase already in common use in statistics. It is in the book’s index, and often used in phrases such as:

Suppose X is an n × p data matrix …

Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 75.

So statistics not only already has the data organization concept, it also already has standard terminology for it. Data engineering often calls this data organization a “de-normalized form.”
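As a rough illustration of how a de-normalized data matrix relates to normalized storage, the following Python sketch joins two small tables back into one row per observation. The table names and the dictionary-based layout are assumptions made for this example, not anything prescribed by the sources quoted here.

```python
from collections import Counter

# Normalized tables (hypothetical layout): winners keyed by (tournament, year),
# dates of birth keyed by winner name.
tournaments = {
    ("Indiana Invitational", 1998): "Al Fredrickson",
    ("Cleveland Open", 1999): "Bob Albertson",
    ("Des Moines Masters", 1999): "Al Fredrickson",
    ("Indiana Invitational", 1999): "Chip Masterson",
}
players = {
    "Al Fredrickson": "21 July 1975",
    "Bob Albertson": "28 September 1968",
    "Chip Masterson": "14 March 1977",
}

# Join: one row per observation, with every fact about it present in that row.
data_matrix = [
    (tournament, year, winner, players[winner])
    for (tournament, year), winner in tournaments.items()
]

# With all facts in each row, a question like "who won more than once?"
# needs no further lookups.
counts = Counter(row[2] for row in data_matrix)
repeated = [winner for winner, n in counts.items() if n > 1]
print(repeated)  # ['Al Fredrickson']
```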

As a further example, the statistical system R itself uses variations of the above standard terminology. Take for instance the help() text for R’s data.matrix() function:

  data.matrix {base}    R Documentation

  Convert a Data Frame to a Numeric Matrix

  Description

  Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes.

What is the extra “Factors and ordered factors are replaced by their internal codes” part going on about? That is also fairly standard. Let’s expand the earlier data matrix quote a bit to see this.

A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual. When presenting data in this form, it is customary to assign a numerical code to the categories of a qualitative variable …

Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 1.
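The idea of assigning numerical codes to categories can be sketched in a few lines of Python. This is not R's implementation. The helper `factor_codes` is a hypothetical name, and the choice of sorted levels with codes starting at 1 is an assumption made to resemble how R factors are commonly coded.

```python
def factor_codes(values):
    """Return (levels, codes): sorted distinct values and 1-based integer codes."""
    levels = sorted(set(values))
    index = {level: i + 1 for i, level in enumerate(levels)}
    return levels, [index[v] for v in values]

# The Winner column from the example table.
winners = ["Al Fredrickson", "Bob Albertson", "Al Fredrickson", "Chip Masterson"]
levels, codes = factor_codes(winners)
print(levels)  # ['Al Fredrickson', 'Bob Albertson', 'Chip Masterson']
print(codes)   # [1, 2, 1, 3]
```

The codes are only labels: the ordering 1 < 2 < 3 carries no meaning for an unordered category, which is one reason a different encoding is often preferred for modeling.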

Note: for many R analyses the model.matrix() command is implicitly called in preference to the data.matrix() command, as this conversion expands factors into “dummy variables,” which is a representation often more useful for modeling. The model.matrix() documentation starts as follows:

  model.matrix {stats}  R Documentation

  Construct Design Matrices

  Description

  model.matrix creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly.
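To show what “expanding factors to a set of dummy variables” can look like, here is a small Python sketch. It is an illustration under stated assumptions rather than a reimplementation of model.matrix(): the helper `dummy_columns` is hypothetical, it handles a single categorical variable, and it uses a treatment-style coding in which the first sorted level is the reference and gets no column of its own.

```python
def dummy_columns(values):
    """Expand one categorical variable into an intercept plus 0/1 indicators.

    The first sorted level is the reference level: it has no indicator
    column and is represented by all indicators being 0.
    """
    levels = sorted(set(values))
    others = levels[1:]  # every level except the reference
    header = ["(Intercept)"] + others
    rows = [[1] + [1 if v == level else 0 for level in others] for v in values]
    return header, rows

winners = ["Al Fredrickson", "Bob Albertson", "Al Fredrickson", "Chip Masterson"]
header, rows = dummy_columns(winners)
print(header)  # ['(Intercept)', 'Bob Albertson', 'Chip Masterson']
for row in rows:
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 0, 0]
# [1, 0, 1]
```

R's own column names and its handling of contrasts and interactions differ in detail; the sketch only shows the shape of the resulting design matrix.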

So to summarize: the whole time we have been talking about well understood concepts of organizing data for analysis that have a long history.

Frankly it appears “tidy data” is something akin to a trademark or marketing term, especially in its “tidyverse” variation.

4 thoughts on “What is “Tidy Data”?”

  1. Of course, “tidy” is not a dogma. Neither is the Wikipedia data set: e.g. why not also split the date into YYYY, MM, DD, or keep it in the date format YYYYMMDD[hhmmss[n..n][TZ]] and use this as a standard time?

  2. I always find Hadley somewhat… conflicted. On the one hand, he wants ‘rigour’ in the data, but on the other hand, he wants the ‘row’ to be the classic ‘observation’, i.e., what we RDBMS folk call a ‘flat file image’. There is no way to reconcile the two notions. Yes, one can, and should, store data in xNF, but until someone writes stat packs inside SQL databases, IOW automagically doing the joins, RDBMS storage will be NF while stat pack data will be flat (it has to be with current and legacy stat packs).

    It would be more useful if Hadley, et al, would be more forceful in directing stat folks to do data storage in SQL/RDBMS, using things like PL/R possibly, rather than creating yet another impedance barrier with SQL-veiling syntax in R.

    1. The way I say it is: analysis likes a denormalized image (all facts ready in one row). But your systems of record should not use this, so treat this format as something transitory you build from other sources. The analogy falls apart a bit when collecting data from multiple times in one row: which table is “ready” depends on if you think of time as a key or not (it often is for the data, but not for the analysis).

      As far as syntax goes: SQL is universal, but it can be daunting. Even Codd thought there may be multiple query languages, merely insisting the query language must be strong enough to perform all tasks (including DB management).
