
Nassim Nicholas Taleb recently wrote an article advocating abandoning standard deviation in favor of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure, but there is a reason that standard deviation is important even if you do not like it: it prefers models that get totals and averages correct. Absolute deviation measures do not prefer such models. So while MAD may be great for reporting, it can be a problem when used to optimize models.

Let’s suppose we have 2 boxes of 10 lottery tickets: all tickets were purchased for \$1 each for the same game in an identical fashion at the same time. For our highfalutin data science project let’s look at the payoffs on the tickets in the first box and try to build a best predictive model for the tickets in the second box (without looking at the payoffs in the second box). We then use our model to predict the total value of the 10 tickets in the second box.

Now, since all tickets are identical, if we are making a mere point-prediction (a single number value estimate for each ticket instead of a detailed posterior distribution) then there is an optimal prediction that is a single number V. Let’s explore potential values for V and how they differ if we use different measures of variation (square error, mean absolute deviation and median absolute deviation). To get the ball rolling let’s further suppose the payoffs of the tickets in the first box are nine zeros and one \$5 payoff. We are going to use a general measure of model goodness called a “loss function” or “loss” and ignore any issues of parametric modeling, incorporating prior knowledge or distributional summaries.

Suppose we use mean absolute deviation as our measure of model quality. Then the loss (or badness) of a value V is `loss(V) = 9*|V-0| + 1*|V-5|` which is minimized at V=\$0. That is, it says the best model under mean absolute error is that all the lottery tickets are worthless. I personally feel that way about lotteries, but the mean absolute deviation is missing a lot of what is going on. In fact, if we have nine tickets with zero payoff and a single ticket with a non-zero payoff, the mean absolute deviation is minimized at V=\$0 for any positive payoff on the last ticket. The mean absolute deviation says the best model for a lottery ticket given 9 non-payoffs and one \$1,000,000 payoff is that tickets are worth \$0. So we may not always want to think in terms of the mean absolute deviation summary.

Here is some R-code demonstrating what models (values of V) total absolute deviation prefers (for our original problem):

```
library(ggplot2)

# Candidate predictions V to evaluate
d <- data.frame(V=seq(-5,10,by=0.1))
# Total absolute deviation: nine $0 tickets and one $5 ticket
f <- function(V) { 9*abs(V-0) + 1*abs(V-5) }
d$loss <- f(d$V)
ggplot(data=d,aes(x=V,y=loss)) + geom_line()
```

Notice that while there is a slope change at V=\$5, the minimum is at V=\$0.
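To check the claim that the size of the winning payoff does not matter, here is a minimal sketch in base R. It relies on the standard fact that a median of the data minimizes the total absolute deviation. The helper name `compare_candidates` and the choice of payoffs (\$5, \$100, \$1,000,000) are mine, picked only for illustration.

```r
# Total absolute deviation of a single prediction V against observed payoffs.
total_abs_loss <- function(V, payoffs) sum(abs(V - payoffs))

# Hypothetical helper: compare the median and the mean as candidate values of V
# for nine zero payoffs and one winning payoff of size `win`.
compare_candidates <- function(win) {
  payoffs <- c(rep(0, 9), win)
  data.frame(
    win = win,
    V_median = median(payoffs),
    loss_at_median = total_abs_loss(median(payoffs), payoffs),
    V_mean = mean(payoffs),
    loss_at_mean = total_abs_loss(mean(payoffs), payoffs)
  )
}

results <- do.call(rbind, lapply(c(5, 100, 1e6), compare_candidates))
print(results)
```

In each row the median candidate is \$0 and has a lower absolute-deviation loss than the mean, however large the winning ticket is.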

Suppose instead we use median absolute deviation as our measure of model quality (the more standard expansion of the MAD acronym). Things are pretty much as bad: V=\$0 is the “optimal model” for 10 tickets, 9 of which pay off zero, no matter what the payoff of the last ticket is.
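A rough sketch of the same experiment for the median version follows. It assumes the loss is the median of the absolute errors `|V - payoff|` over the ten tickets; the function name `median_abs_loss` and the grid of candidate values are my own choices.

```r
# Median absolute deviation of a single prediction V against observed payoffs.
median_abs_loss <- function(V, payoffs) median(abs(V - payoffs))

payoffs <- c(rep(0, 9), 5)

# Evaluate the loss on the same grid of candidate values used earlier.
d <- data.frame(V = seq(-5, 10, by = 0.1))
d$loss <- sapply(d$V, median_abs_loss, payoffs = payoffs)

# Grid point with the smallest loss.
best_V <- d$V[which.min(d$loss)]
print(best_V)

# The same loss evaluated at V = 0 when the single winner pays $1,000,000.
big_win_loss <- median_abs_loss(0, c(rep(0, 9), 1e6))
print(big_win_loss)
```

Because nine of the ten absolute errors equal `|V|`, the median ignores the winning ticket entirely, which is why the grid minimum sits at zero.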

Finally, suppose instead of trendy MAD measures we use plain old square error like poor old Karl Pearson used in the 19th century. Then for our original example we have: `loss(V) = 9*(V-0)^2 + 1*(V-5)^2` which is minimized at V=\$0.5. This says these lottery tickets seem to be worth about \$0.5 each while they cost \$1 each (typical of lotteries). Also notice that 10*V equals \$5, the actual total value of all of the tickets in the first box of lottery tickets. This is a key advantage of RMSE: it gets group totals and averages right even when it doesn’t know how to value individual tickets. You want this property.
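The square error case can be checked numerically with a short sketch. It uses base R's `optimize()` for a one-dimensional search and, as a cross-check, an intercept-only `lm()` fit (which also minimizes square error). The variable names and the search interval are assumptions of mine.

```r
payoffs <- c(rep(0, 9), 5)

# Total square error of a single prediction V.
sq_loss <- function(V) sum((V - payoffs)^2)

# Numerically minimize over the same range of V plotted earlier.
fit <- optimize(sq_loss, interval = c(-5, 10))
print(fit$minimum)          # numerically close to the sample mean

# The sample mean, and the total it implies for a box of 10 tickets.
print(mean(payoffs))
print(10 * mean(payoffs))
print(sum(payoffs))

# Cross-check: an intercept-only linear regression fits the same value.
model <- lm(y ~ 1, data = data.frame(y = payoffs))
print(coef(model))
print(sum(fitted(model)))   # fitted values add up to the observed total
```

The regression check is the useful one in practice: the fitted values of a least-squares model with an intercept sum to the observed total on the training data.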

How can we design loss functions that get totals correct? What we want is a loss function such that, when we minimize it, we end up recovering the totals in our original data. That is, the loss function, whatever it is, should have a stationary point at the value that recovers the total. So in our original example we should have: `d(10*loss(V))/dV = 0` when V=\$0.5 (the per-ticket value for which 10*V is the \$5 total we are trying to recover). Any loss function of the form `loss(V) = f(9*(V-0)^2 + (V-5)^2)` with differentiable f has a stationary point at V=\$0.5 (just an application of the chain rule for derivatives: the derivative of the inner square error is zero there). This is why square error, root mean square error and the standard deviation all pick the same optimal V=\$0.5. This is the core point of regression and logistic regression, which both emphasize getting totals correct. This is the other reason you report RMSE: it is what regression optimizers are minimizing (so it is a good diagnostic). We can also say that, in some sense, loss functions that get totals and averages right have derivatives that look locally a bit like RMSE (near the average value), which implies the loss function looks a bit like RMSE (or some transform of it) near the average value. This is one reason logistic regression can be related to standard regression by the idea of iterative re-weighting.
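To illustrate the chain-rule argument, the sketch below minimizes several transforms of the same square error and compares where the minima land. The particular transforms (mean, square root of the mean, and logarithm) are examples I picked. All are smooth and increasing on the relevant range, since the square error is strictly positive here.

```r
payoffs <- c(rep(0, 9), 5)

# Inner square error; every loss below is a transform f(sse(V)).
sse <- function(V) sum((V - payoffs)^2)

losses <- list(
  sse     = sse,
  mse     = function(V) sse(V) / length(payoffs),
  rmse    = function(V) sqrt(sse(V) / length(payoffs)),
  log_sse = function(V) log(sse(V))   # illustrative transform, not from the post
)

# Minimize each loss over the same interval and collect the minimizers.
best <- sapply(losses, function(f) optimize(f, interval = c(-5, 10))$minimum)
print(best)
```

Up to the numerical tolerance of `optimize()`, all four searches should stop at the same V, the sample mean.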

The overall point is: there are a lot of different useful measures of error and fit. Taleb is correct: the measure you use should not depend on mere habit or ritual. But the measure you use should depend on your intended application (in this case preferring models that get expected values and totals correct) and not merely on your taste and sophistication. We also like non-variance based methods (like quantile regression, see this example) but find that for many problems you really have to pick your measure correctly. RMSE itself is often misused: it is not the right measure for scoring classification and ranking models (you want to prefer something like precision/recall or deviance).

1. To be clear: I like MAD. But it is like Alberto Calderon remarked: “as soon as people realize that they cannot make change of variables, the theory of distributions will be in trouble.” The point being you can’t be “the one way” while simultaneously fouling up traditional uses. Funny thing: I found that quote on page 223 of Rota’s “Indiscrete Thoughts” and yet still highly recommend Richards and Youn’s “Theory of Distributions.”

2. Brian Slesinsky says:

Thanks, this is very clear. It also explains problems where standard deviation doesn’t apply:

The most common kind of analysis I do is performance analysis, measuring latency in particular. Summing over the latencies of all the requests seems fairly meaningless, and the curve isn’t gaussian; typically there’s a minimum value (best case for no cache misses and a lightly loaded machine) and a long tail, cut off wherever you set the request timeout. So it seems like the mean and standard deviation aren’t useful for this case.

I expect Taleb is more concerned about losses in situations where, unlike a lottery, you can’t easily predict the values of significant but rare events. In that case the total isn’t really predictable from the data you have.

And come to think of it, in this particular example, if lottery wins are rare, a sample of 10 tickets doesn’t seem like enough to figure out the expected value of a lottery. Given the expected long-tailed distribution, it seems likely that you’ll either get all losses (and conclude that lottery tickets are worthless) or get lucky the first time and erroneously conclude that buying lottery tickets is profitable, so you should buy more tickets.

3. @Brian Slesinsky Thanks Brian, we could say it’s 100,000 tickets of which 9/10ths lose and 1/10th pay \$5 each and get pretty much the same result. I know what you mean about “more tickets”: we really can’t expect to measure the actual rare events (like the 1 in a billion chance of winning \$300,000,000) without a *lot* more data.

4. And just a note: obviously I accidentally computed total absolute deviation instead of mean absolute deviation. But obviously this is just a matter of remembering to divide by population size (10) and does not change where any of the minima are.

5. Thomas Speidel says:

Very good post. I thought MAD was median absolute deviation. At any rate, this is exactly how I often use it in an exploratory setting ever since I read Rand Wilcox. I too like quantile regression, except I rarely have enough data to afford it…

6. @Thomas Speidel Thanks. MAD is indeed median absolute deviation (my link said so, but I have now edited the article to emphasize this). But Taleb discussed mean absolute deviation and both have similar issues.

7. Alan Parker says:

AFAIK the key advantage of the MAD is that it is much more robust than the variance. You do not mention this issue explicitly, and I suppose that is Taleb’s concern. Black Swans are pretty much outliers, right? So my naive point of view is that if you are sure that there are no outliers then the MAD is not very helpful. BUT, if you have outliers then variance is going to lead to wrong conclusions and the MAD should be your preferred statistic. Am I missing something?

8. Really good post.

I’m providing a reference below as a courtesy to help readers and search engines differentiate between this MAD and another MAD that refers to something different. Different context, but the audiences for Taleb’s MAD and the other MAD are converging or will soon converge.

Cohen, Jeffrey, et al.
Proceedings of the VLDB Endowment
Volume 2 Issue 2, August 2009
Pages 1481-1492

9. Lottery isn’t such a good example, since it is pure expected value, and all lotteries (if you dig) tell you the exact odds of winning the full prize. It’s then just a simple matter of calculating prob * value, and asking whether that number is greater than the ticket cost. Civilians understand this intuitively; as jackpots grow, ticket sales grow disproportionately.

10. @Alan Parker
MAD is a legit robust statistic. Which means that somebody who tells you their time series has a low MAD can hide a few rare events from you. So from a detection point of view standard deviation freaking out can sometimes be an advantage. It depends on if you want reports of small variation to be reliable (standard deviations) or reports of large variation to be reliable (MAD).

11. @Robert Young A law that is true remains true in the limit. And the reverse of a lottery is a good abstraction of something Taleb has criticized very well: running a trading desk that takes profits every year while not accounting for the low probability high risk events they are inflicting on their company and customers (essentially selling lottery tickets without the capital to pay off a very large winner). And checking if some cool method can even compute an average is a standard trick of mine: see Newton-Raphson can compute an average.

12. Philip Apps says:

Taleb’s point is about as deep as saying that we should abandon mean, and use median instead.

13. Thomas Speidel says:

I wanted to follow up on this since there appears to be much confusion. In an earlier post, I wrote that I was under the impression that MAD was median absolute deviation. I now realize that the term appears to be used ambiguously and this is confusing.

Sheskin (2011) p. 119 defines MAD as mean absolute deviation
Wilcox (2010) p. 33 defines MAD as median absolute deviation

Sheskin D J. Handbook of parametric and non parametric statistical procedures (2011)
Wilcox R R. Fundamentals of modern statistical methods (2010)