What is the Gauss-Markov theorem?

From “The Cambridge Dictionary of Statistics”, B. S. Everitt, 2nd Edition:

A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.

This is pretty much the “big boy” reason least squares fitting can be considered a good implementation of linear regression.

Suppose you are building a model of the form:

```
y(i) = B . x(i) + e(i)
```

where `B` is a vector (to be inferred), `i` is an index that runs over the available data (say `1` through `n`), `x(i)` is a per-example vector of features, and `y(i)` is the scalar quantity to be modeled. Only `x(i)` and `y(i)` are observed. The `e(i)` term is the un-modeled component of `y(i)`, and you typically hope that the `e(i)` can be thought of as unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak or strong the assumptions you put on the `e(i)` (and other quantities) are depends on what you know, what you are trying to do, and which theorems' pre-conditions you need to meet. The Gauss-Markov theorem assures a good estimate of `B` under weak assumptions.
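To make the setup concrete, here is a minimal sketch in Python (using NumPy) that simulates data of this form and recovers `B` by least squares. The coefficient values, the sample size, and the choice of uniform noise for the `e(i)` are all assumptions made up for illustration; the helper name `fit_least_squares` is likewise hypothetical.

```python
import numpy as np


def fit_least_squares(X, y):
    """Return the least squares estimate of B for y(i) = B . x(i) + e(i).

    X holds one feature vector x(i) per row; y holds the scalars y(i).
    """
    B_hat, _residuals, _rank, _singular_values = np.linalg.lstsq(X, y, rcond=None)
    return B_hat


rng = np.random.default_rng(2024)
n = 1000

# Hypothetical "true" coefficients, chosen only for this illustration.
B_true = np.array([2.0, -1.0, 0.5])

# Rows of X are the observed feature vectors x(i).
X = rng.normal(size=(n, 3))

# Un-modeled component e(i): mean zero, equal variance, independent,
# and deliberately NOT normally distributed.
e = rng.uniform(-1.0, 1.0, size=n)

# Observed outcomes y(i).
y = X @ B_true + e

B_hat = fit_least_squares(X, y)
print("true B:     ", B_true)
print("estimated B:", B_hat)
```

In practice only `X` and `y` would be available; `B_true` and `e` are visible here only because we simulated the data ourselves.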

How to interpret the theorem

The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the `e(i)` and without distributional assumptions about the `x(i)`. However, if you are using Bayesian methods or generative models for predictions you may want to use additional stronger conditions (perhaps even normality of errors and even distributional assumptions on the `x`s).
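One way to see this claim in action is a small simulation. The sketch below (Python with NumPy) compares least squares against one other linear unbiased estimator, a weighted least squares fit with arbitrarily chosen weights, when the errors are skewed rather than normal. The design matrix, the weights, the exponential error distribution, and all variable names are assumptions picked for illustration; this checks a single competitor, so it illustrates the theorem rather than proving it.

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 50, 5000

# Hypothetical true coefficients (intercept and slope).
B_true = np.array([1.0, 3.0])

# Fixed design: a column of ones plus one feature on [0, 1].
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])

# Arbitrary weights for the competing estimator (an assumption for the demo).
w = np.linspace(0.1, 10.0, n)

# Both estimators have the form B_est = A @ y with A @ X = I,
# which makes them linear in y and unbiased when E[e(i)] = 0.
A_ols = np.linalg.solve(X.T @ X, X.T)   # least squares
XtW = X.T * w                           # X^T W with W = diag(w)
A_alt = np.linalg.solve(XtW @ X, XtW)   # weighted least squares

# Theoretical variances for unit-variance, uncorrelated errors: diag(A A^T).
theory_ols = np.diag(A_ols @ A_ols.T)
theory_alt = np.diag(A_alt @ A_alt.T)

# Skewed, non-normal errors with mean 0 and variance 1.
E = rng.exponential(1.0, size=(trials, n)) - 1.0
Y = X @ B_true + E                      # one simulated data set per row

est_ols = Y @ A_ols.T                   # shape (trials, 2)
est_alt = Y @ A_alt.T

var_ols = est_ols.var(axis=0)
var_alt = est_alt.var(axis=0)

print("mean of OLS estimates:     ", est_ols.mean(axis=0))
print("mean of weighted estimates:", est_alt.mean(axis=0))
print("variance of OLS estimates:     ", var_ols, "theory:", theory_ols)
print("variance of weighted estimates:", var_alt, "theory:", theory_alt)
```

Both estimators should average out near `B_true`, but the least squares estimates should show the smaller spread, even though nothing about the errors is normal.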

We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.