Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , , ,

The Zero Bug

I am going to write about an insidious statistical, data analysis, and presentation fallacy I call “the zero bug” and the habits you need to cultivate to avoid it.


The zero bug

The zero bug

Here is the zero bug in a nutshell: common data aggregation tools often can not “count to zero” from examples, and this causes problems. Please read on for what this means, the consequences, and how to avoid the problem.

Our Example

For our example problem please consider aggregating significant earthquake events in the United States of America.

To do this we will start with data from:

The National Geophysical Data Center / World Data Service (NGDC/WDS): Significant Earthquake Database, National Geophysical Data Center, NOAA, doi:10.7289/V5TD9V7K.

They database is described thusly:

The Significant Earthquake Database contains information on destructive earthquakes from 2150 B.C. to the present that meet at least one of the following criteria: Moderate damage (approximately $1 million or more), 10 or more deaths, Magnitude 7.5 or greater, Modified Mercalli Intensity X or greater, or the earthquake generated a tsunami.

I queried the form for “North America and Hawaii”:”USA” in tab delimited form. For simplicity and reproducibility I saved a the result in the url given in the R example below. Starting our example we can use R to load the data from the url and start our project.

url <- 'http://www.win-vector.com/dfiles/earthquakes.tsv'
d <- read.table(url, 
                header=TRUE, 
                stringsAsFactors = FALSE,
                sep='\t')
View(d)
head(d[,c('I_D','YEAR','INTENSITY','STATE')])
#    I_D YEAR INTENSITY STATE
# 1 6697 1500        NA    HI
# 2 6013 1668         4    MA
# 3 9954 1700        NA    OR
# 4 5828 1755         8    MA
# 5 5926 1788         7    AK
# 6 5927 1788        NA    AK

We see the data is organizes such that each row is an event (with I_D), and contains many informational columns including “YEAR” (the year the event happened) and “STATE” (the state abbreviation where the event happened). Using R tools and packages we can immediately start to summarize and visualize the data.

For example: we can count modern (say 1950 and later) US earthquakes by year.

library("ggplot2")
library("dplyr")

dModern <- d[d$YEAR>=1950, , drop=FALSE]

# histogram
ggplot(data=dModern, mapping=aes(x=YEAR)) +
  geom_histogram(binwidth=1) +
  ggtitle('number of modern USA earthquakes by year')

NewImage

Or we can get use dplyr to build the count summaries by hand and present the summary in a stem-plot instead of the histogram.

# aggregate the data
byYear <- dModern %>% 
  group_by(YEAR) %>%
  summarize(count = n()) %>%
  arrange(YEAR)

# produce a stem-style plot
ggplot(data=byYear, mapping=aes(x=YEAR, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=YEAR, yend=0),
               color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byYear$count))) +
  ggtitle('number of modern USA earthquakes by year')

NewImage

The problem

We have already snuck in the mistake. The by-hand aggregation “byYear” is subtly wrong. The histogram is almost correct (given its graphical convention of showing counts as height), but the stem plot is revealing problems.

Here is the problem:

summary(byYear$count)

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    1.00    1.00    2.00    2.50    3.75    7.00 

Notice the above summary implies the minimum number of significant earthquakes seen in the United States in a modern year is 1. Looking at our graphs we can see it should in fact be 0. This wasn’t so bad for graphing, but can be disastrous in calculation or in directing action.

The fix

This is the kind of situation that anti_join is designed to fix (and is how replyr::replyr_coalesce deals with the problem).

A simple ad-hoc fix I recommend is to build a second synthetic summary frame that carries explicit zero counts. We then add these counts into our aggregation and get correct summaries.

For example the following code:

# add in zero summaries
zeroObs <- data.frame(YEAR=1950:2016, count=0)
byYear <- rbind(byYear, zeroObs) %>%
  group_by(YEAR) %>%
  summarize(count = sum(count)) %>%
  arrange(YEAR)

# re-plot
ggplot(data=byYear, mapping=aes(x=YEAR, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=YEAR, yend=0),
               color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byYear$count))) +
  ggtitle('number of modern USA earthquakes by year')

gives us the corrected stem plot:

NewImage

The figure above is the correct stem-plot. Remember: while a histogram denotes counts by filled-heights of bars, a stem-plot denotes counts by positions of visible points. The point being: in a proper stem-plot zero counts are not invisible (and are in fact distinguishable from missing summaries).

A harder example

This issue may seem trivial but that is partly because I deliberately picked a simple example where it is obvious there is missing data. This is not always the case. Remember: hidden errors can be worse than visible errors. In fact even in the original histogram it was not obvious what to think about the missing years 1950 (which apparently had no significant US earthquakes) and 2017 (which is an incomplete year). We really have to explicitly specify the complete universe (also called range or support set) of keys (as in YEAR=1950:2016, and not use a convenience such as YEAR=min(dModern$YEAR):max(dModern$YEAR)).

This is much more obvious if we summarize by state instead of year.

byState <- dModern %>% 
  group_by(STATE) %>%
  summarize(count = n()) %>%
  arrange(STATE)

ggplot(data=byState, mapping=aes(x=STATE, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=STATE, yend=0),
               color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byState$count))) +
  ggtitle('number of modern USA earthquakes by state')

NewImage

We do not have a proper aggregation where each state count is represented with an explicit zero, and not by implicit missingness (which can’t differentiate states with zero-counts from non-states or misspellings). To produce the proper summary we need a list of US state abbreviations to merge in.

A user of the above graph isn’t likely to be too confused. Such a person likely knows there are 50 states and will presume the missing states have zero counts. However downstream software (which can be punishingly literal) won’t inject such domain knowledge and will work on the assumption that there are 17 states and all states have had significant earthquakes in modern times. Or suppose we are aggregating number treatments given to a set of patients; in this case the unacceptable confusion of not-present and zero-counts can really hide huge, problems: such as not noticing some patients never received treatment.

R actually has list of state abbreviations (in “datasets::state.abb“) and we can use replyr::replyr_coalesce to quickly fix our issue:

library('replyr')

support <- data.frame(STATE= c('', datasets::state.abb),
                      stringsAsFactors = FALSE)

byState <- byState %>%
  replyr_coalesce(support,
                  fills= list(count= 0)) %>%
  arrange(STATE)

An important feature of replyr_coalesce is: it checks that the count-carrying rows (the data) are contained in the support rows (the range definition). It would (intentionally) throw an error if we tried to use just datasets::state.abb as the support (or range) definition as this signals the analyst didn’t expect the blank state in the data set.

“The Dog that Didn’t Bark”

An additional illustrative example is from Sir Arthur Conan Doyle’s story “The Adventure of Silver Blaze”.

NewImage

In this story Sherlock Holmes deduces that a horse had been absconded not by a stranger, but by somebody well known at its stable. Holmes explains the key clue here:

   Gregory (Scotland Yard): "Is there any other point to which you 
                             would wish to draw my attention?"
   Holmes: "To the curious incident of the dog in the night-time."
   Gregory: "The dog did nothing in the night-time."
   Holmes: "That was the curious incident."

For this type of reasoning to work: Holmes has to know that there was a dog present, the dog would have barked if a stranger approached the stables, and none of his witnesses reported hearing the dog. Holmes needs an affirmative record that no barks were heard: a zero written in a row, not a lack of rows.

Conclusion

When describing “common pitfalls” the reaction is often: “That isn’t really interesting, as I would never make such an error.” I believe “the zero bug” is in fact common and not noticed as it tends to hide. The bug is often there and invisible, but can produce invalid results.

Any summary that is produced by counting un-weighted rows can never explicitly produce a value of zero (it can only hint at such through missingness). n() can never form zero, so if zero is important it must be explicitly joined in after counting. In a sense organizing rows for counting with n() censors out zeros.

I can’t emphasize enough how important it is to only work with explicit representations (records indicating counts, even if they are zero), and not implicit representations (assuming non-matched keys indicate zero-counts). The advantages of explicit representations are one of the reasons R has a notation for missing values (“NA“) in the first place.

This is, of course, not a problem due to use of dplyr. This is a common problem in designing SQL database queries where you can think of it as “inner-join collapse” (failing to notice rows that disappear due to join conditions).

My advice: you should be suspicious of any use of summarize(VALUE=n()) until you find the matching zero correction (replyr_coalesce, or a comment documenting why no such correction is needed). This looking from n() to a matching range-control should be become as habitual as looking from an opening parenthesis to its matching closing parenthesis. Even in data science, you really have to know you have all the little things right before you rely on the large things.

16 thoughts on “The Zero Bug”

  1. It is still being tested: but replyr_coalesce is designed to work properly with remote data sources such as PostgreSQL and Spark (via dplyr and sparklyr). One of my current goals is to make working with big data through R as simple and idiomatic as working with local data frames. That is the purpose of the replyr package and the kinds of things I'll be teaching at my Strata workshop this March (please sign up!!).

  2. A question that has come up: am I against sparse matrices? No I like sparse matrices.

    The sparse matrix version of the above would be that the sparse matrix should carry a representation of what keys are possible (and use them to bounds-check attempted indexing). This can be very compact as a list of dimension definitions can represent a huge cross-product.

    Database users routinely solve this problem with “dimension fact tables”, which I also consider a great explicit representation. So I definitely do not feel that having all the extra rows around is the only solution.

    The only “implicit representation” I am against is unchecked: “it didn’t match, maybe it was zero?” I consider a sparse matrix with bounds checking fully explicit. The fact it doesn’t have to store the zeros is an implementation detail as long as it offers the user at least the semantics of the matching dense representation (and perhaps some good “skip zero” traversal methods).

  3. In this example, an easy solution is to just explicitly include all the relevant years as levels of a factor, and then you can use base functions like xtabs() or table():

    d$YEAR <- factor(d$YEAR)
    levels(d$YEAR) <- as.character(1500:2016)
    x <- xtabs(~YEAR,d)
    range(x)
    ### [1] 0 7

    1. Very good point, probably one of the reasons factors are a type in R (to usefully carry the support/range definition around). I don’t think a lot of the dplyr verbs tend to have this extra behavior for factors (they tend to see factors as just a strange encoding for strings). ggplot2 itself I think does make missing factor levels visible.

      I’d like to take this moment to point out that replyr_coalesce accepts an arbitrary data.frame as its support/range definition. This means you can throw in cross-products or even arbitrary ragged structures.

      library("replyr")
      # complex key example
      data <- data.frame(year = c(2005,2007,2010),
                         count = c(6,1,NA),
                         name = c('a','b','c'),
                         stringsAsFactors = FALSE)
      support <- expand.grid(year=2005:2010,
                             name= c('a','b','c','d'),
                             stringsAsFactors = FALSE)
      filled <- replyr_coalesce(data, support,
                                fills=list(count=0))
      filled <- filled[order(filled$year, filled$name), ]
      filled
      

      Again you can do more than single columns or cross-products. For example you could have x1 be a date and the matching x2s for each x be exactly last 5 business days.

  4. I was looking at someones data the other day where they had a similar problem:
    – give people a random selection of options
    – note which they ticked

    but what they actually needed to know to answer their question was which were seen and ticked, which were seen and unticked (effectively a 0), and which were not presented to the person (effectively a NULL).

    Fortunately, they had records of who got which options, so it was a relatively simple bit of recoding.

    But, to me, at a broad level, all of these things come back to epistemological/ set theory kind of questions about how do the attributes of the data we are working with match up to the attributes of the question being solved. In your case if we imagine a venn diagram of a circle of years with a smaller circle of earthquakes inside it (because all earthquakes occur at a time but not all times have an earthquake) then from the earthquake data alone we can only ask questions where both attributes exist. To ask a question of all years, we need to introduce another source of information, even synthetic.

    1. Thanks for the link, I liked the writeup. Also thanks for you note, and sorry if I caused confusion.

      Also I forgot to point out: but I first show how to fix the problem directly using a support data set and dplyr::summarize(count = sum(count)), so after that we are mostly talking about hardening and wrapping.

      There are similarities between tidyr::complete() and replyr::replyr_coalesce():

      • Both calculations and calling interfaces are very similar.
      • Neither method is part of the dplyr package (though tidyr::complete() is part of the “tidyverse”).

      And there are some differences:

      • tidyr::complete() in “nesting-mode” infers/guesses what cross-product of data should be the support set from the data at hand. So without adding extra pseudo-observations tidyr::complete(byYear, nesting(YEAR)) won’t fill-in the year 1950 in our byYear example.

        Though it does take our specifications in the ‘…’ section, as in: tidyr::complete(byYear, data.frame(YEAR= 1950:2016)).

      • tidyr::complete() joins down to the support definition, while replyr::replyr_coalesce() intentionally throws an error if data not anticipated by the support definition is discovered. That is: tidyr::complete(byYear, data.frame(YEAR= 1970:1979)) quietly returns 10 rows and replyr::replyr_coalesce(byYear, data.frame(YEAR= 1970:1979)) (intentionally) throws an error.
      • tidyr::complete() alters all rows with the fill values, replyr::replyr_coalesce() only fills columns in rows it added. For example look at “row c” in the following example.

        library('tidyr')
        library('replyr')
        
        d <- data.frame(x=c('a','c'), n=c(1,NA), 
                        stringsAsFactors=FALSE)
        s <- data.frame(x=c('a','b','c'), 
                        stringsAsFactors=FALSE)
        
        print(d)
        #   x  n
        # 1 a  1
        # 2 c NA
        
        tidyr::complete(d, s,
                        fill=list(n=0))
        #   x n
        # 1 a 1
        # 2 b 0
        # 3 c 0
        
        replyr::replyr_coalesce(d,s,
                                fills=list(n=0))
        #   x  n
        # 1 a  1
        # 2 c NA
        # 3 b  0
        
      • tidyr::complete() doesn't currently work on databases. Example:

        my_db <- dplyr::src_sqlite(":memory:", create = TRUE)
        dYear <- dplyr::copy_to(my_db, byYear)
        dSupport <- dplyr::copy_to(my_db, 
           data.frame(YEAR=1950:2016))
        tidyr::complete(dYear, dSupport)
        # Error in UseMethod("complete_") : 
        #   no applicable method for 'complete_' 
        #   applied to an object of class 
        #   "c('tbl_sqlite', 'tbl_sql', 'tbl_lazy', 'tbl')"
        

        Whereas you can use:

        replyr::replyr_coalesce(dYear, 
          dSupport)
        

      The main reason I haven't been using a lot of tidyr::* methods is my current tasks are all on databases and Spark. So a lot of the cool tidyr::* ideas (like stuffing data frames into columns or cells) are not fully supported there (I guess you could do something evil with serialization and blobs). The main purpose of the replyr package is to implement a number of good verbs (such as bind_rows, nrow, quantile, and split) so the work the same over many data providers (local, tlb, Spark, PostgreSQL, and SQLite).

      Everyone should use the method they prefer, especially if it already covers all the use cases they anticipate needing. I probably should have mentioned tidyr::complete(). As I said I haven't been using (and can't use) tidyr in my current project, so it was not something I I even thought about.

      Sorry that got long.

    1. I am actually working on summary. It isn’t ready for prime-time yet (the released version can’t handle columns named “n”, fixed in the Github dev version) but replyr_summary works on various data sources, counts NA, and returns a data.frame (instead of text).

      d <- data.frame(x=c(NA,'b'), y=c(1,NA),
                         stringsAsFactors= FALSE)
      summary(d)
      # x                   y    
      # Length:2           Min.   :1  
      # Class :character   1st Qu.:1  
      # Mode  :character   Median :1  
      # Mean   :1  
      # 3rd Qu.:1  
      # Max.   :1  
      # NA's   :1  
      replyr::replyr_summary(d)
      # column index     class nrows nna nunique min max mean sd lexmin lexmax
      # 1      x     1 character     2   1       1  NA  NA   NA NA      b      b
      # 2      y     2   numeric     2   1       1   1   1    1 NA   <NA>   <NA>
      

      Under the cover this is all powered by dplyr. One might want something like this as dplyr::glimpse works more like str or head and replyr_summary is “broom-like” in that it returns a machine readable object (such as a data.frame).

      1. I dunno… that’s a bit extreme, no? to invalidate the whole thing due to a few NAs? Maybe go the other way and write a wrapper function for `summary()` that takes all NAs and converts to zero?

  5. Thanks for this writing John. It is a very common problem, that I come across very often. It motivated me to create the padr package, that is now on CRAN. It does automated filling for dates and datetimes that are missing from a timeseries. Your earthquake example would be solved by `padr` in the following way:

    byYear %>% pad %>% fill_by_value(count)

    The only thing that is required here, is that your timestamp column is of class Date or POSIXt. I am working on an implementation of integers, that would work directly on these data.

    1. Thanks Edwin, I’ll check it out. I think special fill-in methods for dates and times is a good idea. I’ve edited your comment to link to padr on CRAN (WordPress often rejects links from comments, so it is probably just as well you didn’t include one).

  6. Your post inspired me to implement a padding version for integers right away. It is in the dev version on github (edwinth/padr) and will be in the next release of padr. Thanks again for the post.

Comments are closed.