
The case for index-free data manipulation

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting data into this “ready to model” format (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This is how R data.frames describe themselves (try “str(data.frame(x=1:2))” in an R console to see this), and it is part of the tidy data manifesto.

Tools like SQL (structured query language) and dplyr can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below.

SQL

SQL is actually a rich modeling environment with interesting semantics due to its relation to E.F. Codd’s relational model. However, working with SQL databases presents a number of practical difficulties.

  • Analysts are forced to think in relational terms (normalized forms, key columns, constraints, and so on)
  • Not being dedicated to transient analysis, SQL databases are often configured to prohibit the alteration of data in place (which makes sense: analysis should not change the initial facts).
  • “Full” databases discourage copying tables or making a large number of intermediate structures
  • Ongoing transactions cause data to change over time.

Working around these difficulties, perhaps by setting up a staging database, or even containerizing the database, still leaves two problems:

  1. The difficulty of the SQL syntax itself
  2. The idea that you are not supposed to identify rows or depend on implicit row order.

These two problems are side-effects of being forced to “think in relational terms.” The first problem can be mitigated with adapters, such as dplyr. The second problem is one to take to heart: stop thinking about row numbers and implicit row order. Learn to re-phrase your analysis in a row-ID independent manner. Most predictive analytic goals are row-ID independent (example: fitting a linear model doesn’t depend on row order), so it makes sense from a point of purity to insist that the intermediate analysis steps also not depend on row-IDs.
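As a small illustration of this re-phrasing (using made-up data, not an example from the original post): instead of asking for “the second row,” describe the rows you want in terms of their values. Here we select the latest observation per id with a grouped filter:

```r
library('dplyr')

d <- data.frame(id = c('a', 'a', 'b'),
                t  = c(1, 2, 1),
                y  = c(10, 20, 30))

# Row-ID dependent (fragile): depends on physical row order.
# d[2, ]

# Row-ID independent: characterize the desired rows by their values.
d %>%
  group_by(id) %>%
  filter(t == max(t)) %>%
  ungroup() -> latest
```

The second form states the intent (“latest observation per id”) directly, so it remains correct no matter how the rows happen to be stored or ordered.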

We are not going to directly demonstrate any SQL queries in this article, but be aware that SQL concepts and relational model concepts are the foundation of a lot of the user-visible semantics of substantial data stores (such as PostgreSQL and Spark).

Split-Apply-Combine and dplyr

Most transformations that appear to need row-ids (such as subsetting, and computing cumulative sums) can be re-phrased in terms of grouping and ordering. Many ideas for accomplishing this reformulation can be found in Hadley Wickham’s article “The Split-Apply-Combine Strategy for Data Analysis” (link).
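For example (a minimal sketch with made-up data), a cumulative sum that might naively be written with row indices can instead be phrased with explicit grouping and ordering:

```r
library('dplyr')

d <- data.frame(group = c('a', 'b', 'a', 'b'),
                t     = c(2, 1, 1, 2),
                v     = c(5, 10, 1, 2))

# In-group cumulative sum, phrased with grouping and ordering --
# no row numbers or implicit row order assumed.
d %>%
  arrange(group, t) %>%
  group_by(group) %>%
  mutate(cum_v = cumsum(v)) %>%
  ungroup() -> res
```

Because the ordering is stated explicitly (arrange on t within group), the result is well-defined even if the source rows arrive in arbitrary order.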



Three obvious ways to implement the Split-Apply-Combine strategy are:

  1. base::split followed by lapply(.,f) and dplyr::bind_rows
  2. tidyr::nest followed by lapply(.,f) and tidyr::unnest
  3. dplyr::group_by followed by f and dplyr::ungroup

Of the three, only the dplyr::group_by realization is commonly practical on large remote data sources. However, in our own replyr::gapply we are also experimenting with an actual split-based approach.
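To make the first and third realizations concrete (a small sketch on toy data; `f` here is just an illustrative per-group function):

```r
library('dplyr')

d <- data.frame(g = c('a', 'a', 'b'),
                x = c(1, 2, 10))

# Realization 1: base::split, lapply, dplyr::bind_rows.
f <- function(di) { di$mx <- max(di$x); di }
r1 <- bind_rows(lapply(split(d, d$g), f))

# Realization 3: dplyr::group_by / ungroup.
# Same result, and this form also translates to remote
# (database-backed) data sources.
d %>% group_by(g) %>% mutate(mx = max(x)) %>% ungroup() -> r3
```

Both compute the per-group maximum and attach it to every row of the group; the group_by form never materializes the per-group pieces in user-visible R memory.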

A proper modern SQL implementation (such as PostgreSQL or Spark SQL) supports all of the operations of Split-Apply-Combine (though pivoting/un-pivoting or gather/spread has fairly shaky database support, despite database documentation claiming otherwise). SQL just expresses grouping and ordering in an ugly way: with the grouping and ordering both being aspects of the query and not annotations on the data. This makes sense as one of the cases SQL considers is read-only data with no temporary tables, so there is nowhere for the user to maintain the annotations other than in the query.

SQL comes from the age of data centers when the computer’s time was more valuable than the analysts’ time. Thus, burdens were placed on the analysts (i.e. in the SQL). However, as mentioned above, this difficulty can be overcome with adapters. SQL queries can hold all the annotations and constraints needed for sophisticated analysis, and this process can be automated. The analyst does not have to be the one to put them in the query.

dplyr seeks to mitigate creeping SQL query complexity through adapters and many notation improvements, and includes one major semantic improvement: grouping is a visible annotation on the data. For example, we move from writing seemingly convenient index-driven code that looks like:

  data(iris)
  irism <- iris
  setosaIndexes <- which(irism$Species=='setosa')
  irism[setosaIndexes,'Sepal.Length'] <- 1 + irism[setosaIndexes,'Sepal.Length']

To writing purer transforms such as:

  library('dplyr')
  data(iris)
  irism <- iris
  irism %>% split(irism$Species) -> itable
  itable$setosa$Sepal.Length <- 1 + itable$setosa$Sepal.Length 
  irism <- dplyr::bind_rows(itable)

At first this seems like a lot of work, but the row-ID free notation eventually pays off. The Split-Apply-Combine approach is much more composable and usable on many remote data sources (many of which deliberately do not have concepts of durable row-order or row-numbers).

Split-Apply-Combine is interesting because it lets us perform arbitrary simultaneous grouped and ordered operations such as: in-group min/max, in-group ranking, in-group cumulative sums, in-group replacement of values, and so on. Split-Apply-Combine achieves this by maintaining groups through splitting, and imposing order via later sub-queries or per-group operations (SQL tables are not guaranteed to maintain row-orders, but sub-queries can). However, I have come to think of splitting as an implementation detail whose lifetime and visibility we want to limit.
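For instance (a minimal sketch with made-up data), in-group ranking can be expressed as a grouped mutate, with no user-visible splitting at all:

```r
library('dplyr')

d <- data.frame(g = c('a', 'a', 'a', 'b', 'b'),
                v = c(3, 1, 2, 9, 7))

# In-group ranking: the grouping is an annotation on the data,
# and the split (if any) happens inside the implementation.
d %>%
  group_by(g) %>%
  mutate(r = min_rank(v)) %>%
  ungroup() -> ranked
```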

I am coming to think the useful Split-Apply-Combine primitive pattern looks like a wrapped function supplying the semantics:

    data %>% split(column1) %>% lapply(arrange(column2) %>% f()) %>% bind_rows()

where “f()” is a user supplied function or magrittr pipeline.
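A rough local sketch of that wrapped-function semantics might look like the following. Note: `gapply_local` is a hypothetical name invented for this illustration; it is not replyr::gapply’s actual implementation, just one in-memory realization of the pattern.

```r
library('dplyr')

# Hypothetical sketch of the Grouped-Ordered-Apply pattern:
# split by a grouping column, order within each group,
# apply a user function f, and recombine.
gapply_local <- function(data, gcolumn, ocolumn, f) {
  pieces <- split(data, data[[gcolumn]])
  pieces <- lapply(pieces, function(di) {
    di <- di[order(di[[ocolumn]]), , drop = FALSE]
    f(di)
  })
  dplyr::bind_rows(pieces)
}

# Example use: per-group cumulative sums, ordered by t within g.
d <- data.frame(g = c('a', 'a', 'b'),
                t = c(2, 1, 1),
                v = c(5, 1, 10))
res <- gapply_local(d, 'g', 't',
                    function(di) { di$s <- cumsum(di$v); di })
```

The point of the wrapper is that splitting and re-binding stay inside the function: the caller only supplies the grouping column, the ordering column, and the per-group transform.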

In my next article in this series I will further discuss the specialization of Split-Apply-Combine to this Grouped-Ordered-Apply pattern.

2 thoughts on “The case for index-free data manipulation”

  1. I’m having a hard time following the examples here. The “index-driven” example uses indexes artificially and unnecessarily. Really all you need is

    irism <- iris
    setosa <- irism$Species=='setosa'
    irism[setosa,'Sepal.Length'] <- 1 + irism[setosa,'Sepal.Length']

    or, in dplyr style:

    iris %>% mutate(Sepal.Length = if_else(Species == 'setosa', 1 + Sepal.Length, Sepal.Length)) -> ir2

    1. Thanks for the great question. My apologies, concise examples are never as clear as one hopes.

      Your example is great, and what I would consider “index-driven.” The “setosa” variable is the indices (in logical form instead of integer form) and would remain so even if we landed them as a column. And that is the whole game of “index-free”: write down what you want to do, then find the generic service closest to that, and then re-phrase your work into that service.
