I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing some examples is that summaries such as model quality (especially out of sample quality) and variable significances are not quite as simple as one would hope (they in fact lack a lot of the monotone structure or submodular structure that would make things easy).

That being said we have a lot of powerful and effective heuristics to discuss in upcoming articles. I am going to leave such positive results for my later articles and here concentrate on an instructive technical negative result: picking a good subset of variables is theoretically quite hard.

## Introduction

When we say something is “theoretically hard” we mean we can contrive examples of it that encode instances of other problems thought to be hard. Thus the ability to solve arbitrary instances of our problem serves to solve arbitrary instances of the thought to be hard problem. This is a technical statement and doesn’t mean we don’t know how to do a good job on the problem. It just means it would be incredibly noteworthy to claim efficient complete optimality in *all* possible cases.

## Formally

Let `Z`

denote the set of integers and `Q`

denote the rational numbers. The problem we are considering is:

INSTANCE: An integer `T`

, integer `K`

, and a data set `x(i),y(i)`

with `x(i) in Z^n`

and `y(i) in Z`

for `i=1,...,m`

.

QUESTION: Is there `B0 in Q`

and `B in Q^n`

such that `sum_{i=1,...,m} (B0 + B.x(i) - y(i))^2 ≤ T`

and no more than `K`

entries of `B`

are non-zero?

Call this problem “size `K`

regression model of quality `T`

” or (“sKrT” for short). Phrasing sKrT as a decision problem is a mere technical detail, we consider answering if there is a sKrT solution to be pretty much equivalent to finding such solutions. The input is taken to be integers for technical reasons, and can one can approximate various real number problems by scaling and rounding.

The hope is sKrT makes precise the goal one hopes stepwise regression is approximating: finding a good model for `y`

using only `K`

of the `x`

variables (at least on training data, there are also issues of multiple comparison to consider).

What I would like to point out is: solving sKrT is at least as hard as NP. That is: if we could always answer sKrT question quickly and correctly we could solve arbitrary problems in the complexity class NP (itself thought to be difficult).

The quickest way to see sKrT is likely hard is through the classic reference “Computers and Intractability, A Guide to the Theory of NP-Completeness”, Michael R. Gary, David S. Johnson, W.H. Freedman and Company, 1979. Their problem number MP5 “Minimum Weight Solution to Linear Equations” would be easy to solve given the ability to quickly solve sKrT instances. Formally MP5 is defined as:

INSTANCE: Finite set

`X`

of pairs`(x,b)`

where`x`

is an`m`

-tuple of integers and`b`

is an integer, and a positive integer`K ≤ m`

.QUESTION: Is there an

`m`

-tuple`y`

with rational entries such that`y`

has at most`K`

non-zero entries and such that`x . y = b`

for all`(x,b) in X`

?

The encoding is trivial. We encode an MP5 instance as a analogous sKrT instance with `T=0`

and one additional row of the form `(x(i)=0 in Z^n,y(i)=0 in Z)`

(which pushes `B0`

to `0`

). It should be obvious that checking for a zero sum of squared error linear regression is at least as powerful as checking for solvability of linear equations.

This means an intuition that MP5 may be hard becomes an intuition that sKrT may be hard.

## Consequences

All a hardness result seem to prohibit is a “magic wand” approach that aways returns perfect answers quickly (and it doesn’t actually prohibit it, but it means it would be very big news to find and certify such a magic wand). In many cases one can find approximately best solutions with high probability. Essentially this is a signal that in discussing variable selection it makes sense to consider heuristic and empirical results (trust methods that have tended to work well). In our follow-up articles we will discuss why you would want to prune down to `K`

variables (speed up algorithms, cut down on over-fit, and more) and effective pruning techniques (though Nina Zumel already has shared useful notes here).

Nice. It’s also easy to show it’s NP-hard to find a minimum-sized feature set that gives you a consistent input-to-output mapping on completely discrete, deterministic problems:

http://www.cs.cmu.edu/~scottd/aaai94.pdf

(Simple reduction from vertex-cover and some basic experiments on classifier accuracy while using consistency-based variable pruning. A rather trivial paper written by some overeager undergraduate bozo…)

That was a good property to took into, thank you for the reference and great to hear from you after all this time.