Posted on Categories Computer Science, Mathematics, StatisticsTags , , , ,

Why No Exact Permutation Tests at Scale?

Here at Win-Vector LLC we like permutation tests. Our team has written on them (for example: How Do You Know if Your Data Has Signal?) and they are used to estimate significances in our sigr and WVPlots R packages. For example permutation methods are used to estimate the significance reported in the following ROC plot.

NewImage

Permutation tests have their own literature and issues (examples: Permutation, Parametric and Bootstrap Tests of Hypotheses, Springer-Verlag, NY, 1994 (3rd edition, 2005), 2, 3, and 4).

In our R packages the permutation tests are estimated by a sampling procedure, and not computed exactly (or deterministically). It turns out this is likely a necessary concession; a complete exact permutation test procedure at scale would be big news. Please read on for my comments on this issue.

Introduction

In a basic permutation test we check if a relation is important by deliberately breaking the relation and seeing if outcome quality changes. The idea is: we build a deliberately damaged data set by permuting the values in one or more columns independent of other columns (breaking any per-instance relations between the permuted and non-permuted columns). If damage is unnoticeable, then what was damaged may not be critical. If damage is large, then what was damaged may be critical.

Permutation tests are related to other empirical re-sampling methods (such as the bootstrap) but have their own flavor. One of the most famous uses is random forest variable importance testing.

Definitions

Let’s define our terms using a simple example:

Suppose we have a sequence of examples (x_i, v_i, y_i) where i takes integer values from 1 to n and is our “instance number”. x_i is one set of variable values (variables we are using, but not estimating the importance of) associated with the i-th example and v_i is another set of observations associated (a variable or group or variables we are using and estimating the importance of), and y_i is the known outcome we are trying to match.

Define:

  • predict(x_i,v_i): a function or model (probably trained on earlier data) that takes two arguments and returns a prediction (a prediction being a number in the case of regression, or a probability or class label in the case of classification). For n-vectors x and v we will say predict(x,v) is the n-vector whose i-th entry is predict(x_i,v_i).
  • loss(y, yhat): a function that takes the n-vector y of known outcomes and an n-vector yhat of matching predictions and returns a criticism or score. Often loss() is a simple additive function (such as SSE = sum_{i:1 ...n} (y_i - yhat_i)^2), or it can be more involved such as 1-AUC.
  • Let Sn be the set of all permutations of integers 1 through n. For a permutation p in Sn and an n-vector v we write p(v) to represent the n-vector whose i-th entry is v_{p(i)}.

For each of these functions we will say

In this notation a permutation test of the effect of the v variables is constructed as follows:

  • Compute: lV = loss(y, predict(x,v)), the observed loss value.
  • For an arbitrary real number T, compute or estimate: pV(T) = (1/|Sn|) sum_{p in Sn} (loss(y, predict(x, p(v))) ≥ T) (where loss(y, predict(x, p(v))) ≥ T is valued at 1 if true and 0 if false).

The idea is: lV is the loss we saw, and pV(lV) is how often a random relation achieves a loss at least as large. If 1-pV(lV) is small we can say the difference is significant in a frequentist sense.

The problem is that |Sn| = n! which is typically enormous. So we don’t want to actually compute the above sum in practice. We can estimate pV(T) by sampling terms uniformly at random (which works especially well as the terms all have bounded size), but this has the (minor) undesirable side-effect that our estimator isn’t deterministic (two different honest experimenters will typically get different results for the same data).

Why no exact solution at scale?

The question is: can we directly and exactly calculate pV(T) (instead of estimating it)?

The answer is: in the general case to do so would be a very big new result. There is definitely not a publicly known trusted completely general fast method to do so.

Let’s go over why that is.

Take the simple case of “additive loss”, where: loss(y, yhat) is defined as sum_{i = 1...n} loss(y_i, yhat_i). Losses such as sum of square errors, and number of mis-classifications are of this simple form. For such a loss we can define a matrix l such that:

  • l(i,j) = loss(y_i, predict(x_i, v_j)) where i,j are both integers chosen from 1 to n.

We are going to further restrict to the case where loss(y_i, predict(x_i, v_j)) is always 0 or 1 (such as in counting mis-classifications). The issue is: it is easy to confirm that every n by n matrix of 0/1 entries can be generated by picking the appropriate predict() example. For example: encode a given 0/1 matrix M as: y = 0, x = v = (1, 2, ..., n), predict(x_i, v_j) = M(x_i, x_j), and loss(y, p) = 0 if y==p and 1 otherwise.

If we could perfectly calculate arbitrary pV(T) we would be able to compute P[loss(y, predict(x, p(v))) ≥ n] = pV(n) where P[] is the probability computed over the uniform distribution of p in Sn. If we could always exactly compute that efficiently we could already solve a deep open problem: computing the 0/1 permanent quickly. So such a method would be big news.

Here is our derivation:

 P[loss(y, predict(x, p(v))) ≥ n] 
  = (1/|Sn|) sum_{p in Sn} (loss(y, predict(x, p(v))) ≥ n)
  = (1/|Sn|) sum_{p in Sn} ((sum_{i = 1...n} l(i,p(i))) ≥ n)
  = (1/|Sn|) sum_{p in Sn} product_{i = 1...n} l(i,p(i))
  = (1/|Sn|) permanent(l)
  = (1/(n!)) permanent(l)

We are using the fact that while sum doesn’t equal product in general, for a 0/1 n-term loss we have loss(y, predict(x, p(v)) ≥ n if and only if product_{i = 1...n} l(i,p(i)) == 1. Now n! is easy to compute; so if P[loss(y, predict(x, p(v)) ≥ n] were easy to compute, then permanent(l) would also be easy to compute.

And we now have the catch: it would be big news if somebody could efficiently exactly compute arbitrary 0/1-permanents. As we said: our restricted loss function is already rich enough to every 0/1 matrix.

To be able to exactly compute pV(n) for all examples (our desired goal) would mean we are also able to compute the permanent for all 0/1 matrices (something thought to be hard). It is the expressive power that hangs us: being able to compute the loss for only some examples would not be an issue, as that would not imply the power exactly compute all 0/1 permanents.