
Bias/variance tradeoff as gamesmanship

Continuing our series of reading out loud from a single page of a statistics book we look at page 224 of the 1972 Dover edition of Leonard J. Savage’s “The Foundations of Statistics.” On this page we are treated to an example attributed to Leo A. Goodman in 1953 that illustrates how for normally distributed data the maximum likelihood, unbiased, and minimum variance estimators of variance are in fact typically three different values. So in the spirit of gamesmanship you always have at least two reasons to call anybody else’s estimator incorrect.

Let us work the example.

Consider n identically distributed independent normal random variables x_1, … x_n. A common naive estimate of the unknown common mean U and variance V of the generating distribution is given as follows:

  u = sum_{i=1...n} x_i / n
  v = sum_{i=1...n} (x_i - u)^2 / n

That is: we are calculating simple estimates u,v that we hope will be close to the unknown true population values U,V. Unfortunately if you show this estimate to a statistical audience you will likely be open to ridicule. The problem is that the preferred estimate is not what we just wrote, but in fact:

  u = sum_{i=1...n} x_i / n
  v' = sum_{i=1...n} (x_i - u)^2 / (n-1)
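To make the difference concrete, here is a minimal Python sketch (the function names are ours, not from any standard library) computing both estimates on the same sample; the two differ only in the divisor:

```python
import random

def ml_estimates(xs):
    """Maximum likelihood estimates u, v: divide the sum of squares by n."""
    n = len(xs)
    u = sum(xs) / n
    v = sum((x - u) ** 2 for x in xs) / n
    return u, v

def unbiased_variance(xs):
    """Unbiased variance estimate v': divide by n - 1 instead of n."""
    n = len(xs)
    u = sum(xs) / n
    return sum((x - u) ** 2 for x in xs) / (n - 1)

random.seed(2024)
xs = [random.gauss(0.0, 1.0) for _ in range(10)]
u, v = ml_estimates(xs)
vp = unbiased_variance(xs)
# vp is always larger than v, by exactly the factor n/(n-1)
```

Note this is the same split seen in common numeric libraries: for example NumPy's `var` defaults to the divide-by-n estimate and takes `ddof=1` to get the divide-by-(n-1) one.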

The proffered argument will be that the estimate v is biased (indeed, an undesirable property) and the estimate v' is unbiased. If one wants to be rude one can take pleasure in accusing the author (me) of not knowing the difference between sample variance and population variance.

In my opinion the actual reason for disagreement is: statistics, at least when taught out of major, is largely taught as a prescriptive practice; you follow the exact specified procedure or you are wholly wrong.

Let us take the time to reason about our naive estimate a bit more. We have indeed made a mistake in using it. The mistake is we didn’t state the intended goal of the estimator. That is sloppy thinking; we should always have some goal in mind (right or wrong) and not blindly execute procedures. If the goal is an unbiased estimator we have indeed picked the wrong estimator. But suppose we had been more careful and said we wanted a maximum likelihood estimator: u,v is in fact the maximum likelihood estimate and u,v' is not. Unbiasedness is not the only possible performance criterion, and it is often incompatible with other estimation goals (see here, and here for more examples).

The usual derivation that u,v' is unbiased starts by defining:


  Q := (sum_{i=1...n} x_i^2) - 
          (1/n) (sum_{i=1...n} x_i)^2

A bit of algebra that is very familiar to statisticians shows that our earlier maximum likelihood estimate v is in fact equal to Q/n. We can also derive (using our knowledge of the non-central moments of the x_i in terms of U,V) that E[Q|U,V] = (n-1) V (and not n V). A small amount of algebra then gives the unbiased estimate u,v'.
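This expectation is easy to check by simulation. A quick Monte Carlo sketch in Python (the sample size, seed, and parameter values here are our own arbitrary choices):

```python
import random

def Q(xs):
    """Q = (sum of squares) - (1/n) (sum)^2, as defined above."""
    n = len(xs)
    return sum(x * x for x in xs) - (sum(xs) ** 2) / n

random.seed(7)
n, U, V = 5, 2.0, 3.0   # arbitrary illustrative choices
reps = 200000
mean_Q = sum(
    Q([random.gauss(U, V ** 0.5) for _ in range(n)])
    for _ in range(reps)
) / reps
# mean_Q lands near (n-1)*V = 12, not n*V = 15
```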

This seems superior and fine, until you notice the following. A glob of messy algebra gives you E[Q^2|U,V] = (n^2 - 1) V^2 (claimed in Savage; the derivation needs the stated additional distributional assumption that the data are normal, to ensure the facts we need about the first four moments of the observed data hold). This is enough to show that Q/(n+1) has lower expected square error (as an estimate of V) than both the maximum likelihood estimate v=Q/n and the unbiased estimate v'=Q/(n-1). So if we had stated our goal was a more statistically efficient estimate of the unknown variance (lower expected square error in our estimate of the variance) we might have preferred an estimate of the form:


  u = sum_{i=1...n} x_i / n
  v'' = sum_{i=1...n} (x_i - u)^2 / (n+1)
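We can watch the three divisors compete directly. The following simulation sketch (parameter choices are ours) estimates the expected square error E[(Q/d - V)^2] for d = n-1, n, n+1; plugging a = 1/d into Savage's moment claims above predicts 2/(n-1), (2n-1)/n^2, and 2/(n+1) respectively, so for n = 5 and V = 1 we expect roughly 0.5, 0.36, and 1/3:

```python
import random

random.seed(42)
n, V = 5, 1.0           # arbitrary illustrative choices
reps = 100000
sq_err = {n - 1: 0.0, n: 0.0, n + 1: 0.0}
for _ in range(reps):
    xs = [random.gauss(0.0, V ** 0.5) for _ in range(n)]
    u = sum(xs) / n
    q = sum((x - u) ** 2 for x in xs)
    for d in sq_err:
        sq_err[d] += (q / d - V) ** 2
mse = {d: s / reps for d, s in sq_err.items()}
# mse[n+1] < mse[n] < mse[n-1]: the divide-by-(n+1) estimate wins
```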

What is going on is that the empirically observed variance is a different beast than the empirically observed mean: it is not normally distributed, even for normal variates. For one, the empirically observed variance is a non-negative random variable (so it itself is certainly not normal). And, unlike the case of the empirical mean, the maximum likelihood, zero bias, and minimal variance estimates do not all co-occur.

The math isn’t too bad. From Savage:


  E[ (a Q - V)^2  | U, V] 
    = (a^2 (n^2 - 1) - 2 a (n - 1) + 1) V^2
    = ((a - 1/(n+1))^2 (n^2 - 1) + 2/(n+1)) V^2
    >= 2 V^2 / (n+1)

And this bound is tight at a = 1/(n+1). Note that the algebra is only valid when n>1, but Q=0 when n=1 or V=0, which means a*Q=0 for all a. So we will assume n>1 and V>0. Thus: when n>1 and V>0 we have v'' = Q/(n+1) is the unique least expected-square-error estimate of the form a*Q where a is a constant (not depending on n, U, V, or the x_i).
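The completed-square step can be machine-checked. Here is a small Python sketch (the function names are ours) verifying that the two forms of E[(a Q - V)^2]/V^2 agree, and that the minimum value 2/(n+1) is attained exactly at a = 1/(n+1):

```python
def mse_coeff(a, n):
    # E[(a Q - V)^2] / V^2 = a^2 (n^2 - 1) - 2 a (n - 1) + 1
    return a * a * (n * n - 1) - 2 * a * (n - 1) + 1

def completed_square(a, n):
    # same quantity rewritten: (a - 1/(n+1))^2 (n^2 - 1) + 2/(n+1)
    return (a - 1.0 / (n + 1)) ** 2 * (n * n - 1) + 2.0 / (n + 1)

# the two forms agree for a spread of a and n (up to float roundoff)
max_gap = max(
    abs(mse_coeff(a, n) - completed_square(a, n))
    for n in range(2, 20)
    for a in (0.0, 1.0 / (n - 1), 1.0 / n, 1.0 / (n + 1), 0.5)
)
```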

Frankly we have never seen an estimate of the form v'' = Q/(n+1) in use. It is unlikely the additional distributional assumptions are worth the promised reduction in estimation error. But the point is: we have exhibited three different “optimal” estimates of the variance, so it is a bit harder to claim one is always obviously preferred (especially without context).

Or (following the math with an attempt at interpretation): estimating the variance of even a population of normal variates is a common example of where there are lower variance estimators than the standard unbiased choices (without getting into the complications of Stein’s example, James-Stein estimators, or Hodges–Le Cam estimators). In fact it is such a common example it is often ignored.

Or (without the math): as long as our estimators are what statisticians call consistent and n is large (which is one of the great advantages of big data) we really can afford to be civil about the differences between these estimates.