
My criticism of R numeric summary

My criticism of R’s numeric summary() method is this: it is unfaithful to its numeric arguments (due to bad default behavior), and frankly it should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of trade-offs. summary() likely represents good work by high-ability researchers, and the sharp edges are due to historically necessary trade-offs.


[Image: The Big Lebowski, 1998.]

Please read on for some context and my criticism.

Edit 8/25/2016: Martin Maechler generously committed a fix! Assuming this works out in testing, it looks like we could see an improvement to this core function in April 2017. I really want to say “thank you” to Martin Maechler and the rest of the team, not only for this but for all the things they do, and for putting up with me.

Introduction

My group has been doing a lot more professional training lately. This is interesting because bright students really put a lot of interesting demands on how you organize and communicate. They want things that make sense (so they can learn them), that are powerful (so it is worth learning them), and that are regular (so they can compose them and move beyond what you are teaching). Students are less sympathetic to implementation history and unstated conventions, as new users tend not to benefit from them. Remember that a new R student is still deciding whether they want to use R. To them it is new, so an instructor needs to defend R’s current trade-offs (not its evolutionary path). We find it is best to point out both what is great in R and what isn’t (versus skipping such portions, or worse, trying to justify them).

Please keep this in mind when I demonstrate what goes wrong when one attempts to teach R’s summary() function to the laity.

The Issue

Suppose you had a list or vector of numbers in R. It would be useful to be able to produce and view some summaries or statistics about these numbers. The primary way to do this in R is to call the summary() method. Here is an example:

numbers <- 1:7
print(numbers)
##  [1]  1  2  3  4  5  6  7

summary(numbers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.5     4.0     4.0     5.5     7.0 

From the names attached to the results you can get the meanings and move on. But the whole time you are hoping none of your students call summary() on a single number. Because if they do, they have a very good chance of seeing summary() fail. And now you have broken trust in R.

Let’s tack into the wind and demonstrate the failure:

summary(15555)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15560   15560   15560   15560   15560   15560 

summary() is claiming the minimum value from the set of numbers c(15555) is 15560. Now this is a deliberately trivial example where we can see what is going on (it sure looks like presentation rounding). To make matters worse, this isn’t just confusion generated during presentation: the actual stored values are wrong.

str(summary(15555))
## Classes 'summaryDefault', 'table'  Named num [1:6] 15560 15560 15560 15560 15560 ...
##   ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...

summary(15555)[['Min.']] == min(15555)
## [1] FALSE

It may seem silly to expect that the slots from a summary() call on a vector would be used in calculation (when we have direct functions such as quantile() and mean() for getting the same results), but using values from summaries of models is standard practice in R. The summary of a trivial linear model, summary(lm(y~0,data.frame(y=15555))), also shows rounded results (though it appears to hold accurate values and only round during presentation; use unclass() to inspect the actual values).
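As a rough check of that last claim, the sketch below fits the same trivial model and reads the residual standard error directly from the summary object. The names fit, fit_summary, stored_sigma and displayed_sigma are mine, and treating the residual standard error as the rounded field is my assumption about which part of the printout is being shortened.

```r
# Fit the trivial model from the text: no intercept and no predictors.
fit <- lm(y ~ 0, data = data.frame(y = 15555))
fit_summary <- summary(fit)

# Read the stored value directly instead of relying on the printed summary.
stored_sigma <- fit_summary$sigma
print(stored_sigma, digits = 10)

# Assumed display step: shortening to 4 significant digits,
# applied by hand here for comparison with the stored value.
displayed_sigma <- signif(stored_sigma, 4)
print(displayed_sigma)
```

If the stored and displayed values differ, the rounding lives in the presentation step only, which is the behavior we would like from summary() on plain numeric vectors too.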

Why it Matters

This is in fact a problem. You can say this is a consequence of the “default settings of summary()” and it is my fault for not changing those settings. But frankly it is quite fair to expect the default settings to be safe and sane.

Let us also appeal to authority:

The many computational steps between original data source and displayed results must all be truthful, or the effect of the analysis may be worthless, if not pernicious. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive.

John Chambers, Software for Data Analysis: Programming with R, Springer 2008.

The point is you are delegating work to your system. If it needlessly fails (no matter how trivially) when observed, how can you trust it when unobserved? John Chambers’ point is that trust is very expensive to build up, so you really don’t want to squander it.

I used to try to “lecture this away” as just being “rounding in the presentation for neatness.” But this runs into two objections:

  • Why doesn’t the presentation hint at this by switching to scientific notation such as 1.556e+4?
  • If summary() “is just presentation” wouldn’t it be a string?

We are losing substitutability. We would love to be able to say to students that “summary() is a convenient shorthand and you can treat the following as equivalent”:

  • summary(x)[['Min.']] == min(x)
  • summary(x)[['1st Qu.']] == quantile(x,0.25)
  • summary(x)[['Median']] == median(x)
  • summary(x)[['Mean']] == mean(x)
  • summary(x)[['3rd Qu.']] == quantile(x,0.75)
  • summary(x)[['Max.']] == max(x)

But the above isn’t always the case. What we would like is for summary() to contain these exact values and to get pretty printing by using the S3 or S4 object system to override the print() method. It is quite likely summary() predates these object systems, and so achieved pretty printing by rounding the values themselves.
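Here is a minimal sketch of that design, assuming an S3 class. The names exact_summary and print.exact_summary are hypothetical (they are not part of base R), and this is an illustration of the idea rather than how R itself implements summary().

```r
# Hypothetical constructor (not part of base R): stores the exact statistics.
exact_summary <- function(x) {
  qq <- stats::quantile(x, names = FALSE)
  values <- c(qq[1:3], mean(x), qq[4:5])
  names(values) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
  structure(values, class = "exact_summary")
}

# Rounding happens only here, at display time; the stored values are untouched.
print.exact_summary <- function(x, digits = max(3, getOption("digits") - 3), ...) {
  print(signif(unclass(x), digits), ...)
  invisible(x)
}

s <- exact_summary(15555)
print(s)                     # the display is rounded for neatness
s[["Min."]] == min(15555)    # the comparison uses the stored, unrounded value
```

With this split the display can stay as terse as before, while the equivalences in the list above hold for the stored values.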

What is going on?

We can take a look at the actual code and see what is happening. We are looking for a reason, not an excuse.

From help(summary) we see summary() takes a digits option with default value digits = max(3, getOption("digits")-3) (let’s not even get into why setting digits directly does one thing and the system default is shifted by 3). getOption("digits") returns 7 on my machine, so we are asking for four-digit rounding, which is consistent with what we saw. Digging through the dispatch rules we can determine that for a numeric vector summary() eventually calls summary.default(). By calling print(summary.default) we can look at the code. The offending snippet is:

        qq <- stats::quantile(object)
        qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)

After computing the quantiles, summary() then calls signif() to round the results. R isn’t inaccurate; it just went out of its way to round the results.
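The two quoted lines can be replayed by hand, which shows the effect regardless of how the installed summary() behaves. In this sketch the variable names x, digits and rounded are mine, and digits is fixed at 4 on the assumption that getOption("digits") is 7.

```r
x <- 15555

# Assumed default: max(3, getOption("digits") - 3) evaluates to 4
# when getOption("digits") is 7.
digits <- 4

# The same two steps as the snippet from summary.default() quoted above.
qq <- stats::quantile(x)
rounded <- signif(c(qq[1L:3L], mean(x), qq[4L:5L]), digits)

print(unname(rounded))       # every entry has been rounded to 4 significant digits
rounded[[1]] == min(x)       # the rounded minimum no longer equals min(x)
```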

Why is this whiny rant so long?

One reason this article is long is the behavior we are describing breaks expectations. So we end up having to document what is actually going on (a laborious process) instead of being able to rely on shared educated expectations. The whining is where actualities and expectations diverge.

Conclusion

summary() attempts to achieve neatness and legibility. This is a laudable goal, if achievable. Numeric analysis is not so simple that rounding could safely achieve such a goal.

It is well known that rounding is not a safe or faithful operation (it loses information, and can be catastrophic if naively applied in many stages of a complex calculation). Because it is obvious rounding is dangerous, sophisticated students are surprised that it defaults to “on” in common calculations without indication or warning (such as moving to scientific notation). summary() compounds this error by returning rounded values (instead of rounding only at print/presentation). As summary() is often a first view of data (along with print()), we encounter confusing, inconsistent situations where un-rounded values (presentation of original data) and rounded values are compared.

Of course, we can (and should) teach students to call mean(x) and quantile(x) rather than summary(x) when they want to reuse the summary statistics. But then we have to explain why. After seeing something like this it becomes an unfortunate additional teaching goal to convince students that more of R doesn’t behave like summary().
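For completeness, here is a small sketch of the direct calls. The data in x and the name exact_stats are made up for illustration.

```r
# Hypothetical data for illustration.
x <- c(15555, 15557, 15562)

# Compute each statistic directly so the stored values are exact.
exact_stats <- c(
  min    = min(x),
  q1     = quantile(x, 0.25, names = FALSE),
  median = median(x),
  mean   = mean(x),
  q3     = quantile(x, 0.75, names = FALSE),
  max    = max(x)
)

print(exact_stats)

# The values can be reused in further calculation without hidden rounding,
# for example the interquartile range.
iqr <- exact_stats[["q3"]] - exact_stats[["q1"]]
print(iqr)
```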

8 thoughts on “My criticism of R numeric summary”

  1. Desc from the DescTools package seems to handle this sort of thing a little better. It goes a little TMI on you, but it seems accurate. Unlike summary(), stored values will be correct, even though the default display shows rounded values.

  2. To be fair, the attitude that you’d pick apart the results of summary instead of using the simpler min/max/mean/median is very SAS-like. Do a PROC SUMMARY and pick out your answers.

    I was surprised by what you found, and that is bad: you want to minimize surprise in a language. But proposing that students might be taught that summary()[1] could be used instead of min() is doing them a total disservice by teaching them non-idiomatic — in fact anti-idiomatic, drawn from other languages — ways of doing things.

    Last, R does have a function that does what you want: fivenum. Summary is for, well, summarizing and is not PROC SUMMARY. So changing summary doesn’t really make sense. It’s not meant to do what you (and I have to admit I) might expect — it’s not duplicating other functions and it’s not PROC SUMMARY.

    1. I do see your point. But I would like to emphasize I don’t teach students to use summary()[1]. The issue arises from having to explain why summary() doesn’t look the way they expect. I had been hoping it was in presentation only (not in the slots) so one could say “it is just presentation.” Also a lot of the slots on summary.lm and summary.glm are a bit of trouble to re-derive (though how to calculate each and every such slot is in fact something I do teach).
