Rudie can’t fail (if majorized)
We have been writing for a while about the convergence of Newton steps applied to a logistic regression (See: What does a generalized linear model do?, How robust is logistic regression? and Newton-Raphson can compute an average). This is all based on our principle of working examples for understanding. This eventually progressed to some writing on the nature of problem solving (a nice complement to our earlier writing on calculation). In the course of research we were directed to a very powerful technique called the MM algorithm (see: “The MM Algorithm” Kenneth Lang, 2007; “A Tutorial on MM Algorithms”, David R. Hunter, Kenneth Lange, Amer. Statistician 58:30–37, 2004; and “Monotonicity of Quadratic-Approximation Algorithms”, Dankmar Bohning, Bruce G. Lindsay, Ann. Inst. Statist. Math, Vol. 40, No. 4, pp 641-664, 1988). The MM algorithm introduces an essential idea: majorized functions (not to be confused with the majorized order on R^d). Majorization it is an interesting way to modify Newton methods to be reliable contractions (and therefore converge in a manner similar to EM algorithms).
Here we will work an example of the MM method. We will not work it in its most general form, but in a form that quickly reveals much of the beauty of the method. We also introduce a “collared Newton step” which guarantees convergence without resorting to line-search (essentially resolving the issues in solving a logistic regression by Newton style methods).To find an optimal logistic regression model parameters we try to find a B minimizing a function of the form:
f(B) = - sum_i log(P(x(i),y(i),B))
where P(x(i),y(i),B) is the probability the model assigns to an outcome of y(i) given an input of x(i) (see The Simpler Derivation of Logistic Regression). To do this it is traditional to find a stationary point (a point where the vector f’(B) is zero) by taking Newton-Rapshon steps of the form:
B(k+1) = B(k) - inverse(f''(B(k))) f'(B(k))
where B(k) is our k’th estimate of the coefficients (starting with B(1) = 0), f’() is the gradient vector of f() and f”() is the Hessian matrix of f() (which is also the Jacobian matrix of the vector f’()). We terminate if f’(B(k)) is sufficiently close to zero. When f() is a quadratic function with a unique minimum the method is exact and B(2) is the optimal simultaneous setting of parameters. By analogy if f() is locally well approximated by a quadratic function then the method converges very fast to the optimal point.
But the method is not always monotone in f(), sometimes f() is larger after an update step instead of getting smaller. With some starts the updates diverge and fail to solve the problem. Many implementations ignore this possibility (and in fact fail). Some implementations perform a line search and use the update:
B(k+1) = B(k) - s(k) inverse(f''(B(k))) f'(B(k))
where s(k) is some near optimal step size between zero and one. This fixes the problem, but is a little unsatisfying: we still need to argue that s(k) isn’t forced to zero (preventing progress). Also, one of the virtues of the Newton-Raphson method is it tells you exactly what to do without requiring additional search. The simplicity of the original Newton-Raphson “iterate this formula until done” is much preferred to the search and bookkeeping required when using methods such as those found in “Algorithms for Minimization Without Derivatives”, Richard P. Brent ,1973.
So the question is: can we do a bit more math to avoid having to do a bit more programming?
One candidate idea is: introducing a majorizing function g(,). Such a g(,) is a function that takes two vector arguments and obeys the following relations:
- f(B) = g(B,B) for all B
- f(z) ≤ g(z,B) for all z in N(B) (the “neighborhood of B”).
Note that the standard definition takes N(B) to be all space (i.e. their is no neighborhood restriction). For our application we will take N(B) to be all z such that ||z-B|| ≤ r(k) for some r(k) > 0 (perhaps taking r(k)=1, for example). g(,) becomes very valuable when we insist on one more property:
- We can find a vector C(g,B) in N(B) such that g(C(g,B),B) < g(B,B).
Suppose we could produce such a g(,). The MM algorithm it to compute a series of parameter B(k) as follows:
B(k+1) = C(g,B(k)).
By our assumptions we have:
f(B(k+1)) = f(C(g,B(k))) ≤ g(C(g,B(k)),B(k)) < g(B(k),B(k)) = f(B(k))
or the sequence of B(k)’s is driving f() to lower and lower values as desired.
The following diagram illustrates the desired situation:
That is well and nice. But how do we know such a g(,) with all of these nice properties even exists? The more nice things we insist on the less likely there is such a g(,). Fortunately in this case we have not yet asked too much. Using the Lagrange form of Taylor’s remainder theorem can quickly write-down a suitable g(,).
The first few terms of the Taylor series of our particular f() (expanded around B) are:
f(x) ~ f(B) + f'(B)(x-B) + transpose(x-B) f''(B) (x-B)/(2!) + ...
The remainder form says we exactly have:
f(x) = f(B) + f'(B)(x-B) + transpose(x-B) f''(Z) (x-B)/(2!)
where Z is some point between x and B (in our case we will weaken this to just some point in N(B)). Further suppose we found a quadratic form G(B) such that:
transpose(z-B) G(B) (z-B) ≥ transpose(z-B) f''(Z) (z-B)
for all z,Z in N(B). Then: a great majorizing function g(,) would just be the quadratic form:
g(x,B) = f(B) + f'(B)(x-B) + transpose(x-B) G(B) (x-B)/(2!).
This g(,) satisfies: g(B,B) = f(B), f(C) ≤ g(C,B) for all C in N(B). And because g(x,B) is a quadratic form in x we know the Newton-Raphson method minimizes it in exactly one step so we can set C(g,B) to be this solution:
C(g,B) = B - inverse(G(B)) f'(B).
We know g(C(g,B),B) < g(B,B) (unless f’(B) = 0 or G(B) is rank-deficient).
So we have the majorizing g(,) we need if we can just produce a G(B) with the properties mentioned above. From The Simpler Derivation of Logistic Regression we know f”(B) itself has the form:
f''(B) = sum_i P_i(1-P_i) x(i) transpose(x(i))
where x(i) is the vector known values for the i-th example and P_i is shorthand for P(B,x(i)): the model’s current estimated probability of an example with known values x(i) is a positive or true example. Notice since P_i is between 0 and 1 we have P_i(1 – P_i) is never more than 1/4. So define G(B) as the sum:
G(B) = sum_i (1/4) x(i) transpose(x(i)).
We can show we have for all Z in N(B): G(B) ≥ f”(Z) under the usual order of positive semi-definite matrices. This means that for all z,Z in N(B) we have transpose(z-B) G(B) (z-B) ≥ transpose(z-B) f”(Z) (z-B). And we have the G(B) we needed. Thus our g(,) constructed from G() has all of the specified properties.
Also notice for the first step standard Newton Step the P_i are all equal to 1/2 (given B(1) = 0) so the original 1/4 bound is both correct and tight. We now have shown that the first Newton-Raphson step can not increase f() when solving a logistic regression (starting from zero) unless we have a rank-deficiency. A direct Newton-Raphson implementation may still diverge (as we showed in our earlier writing), but it won’t show an increase on the first step. It is not that f() is exactly a quadratic function at the first step (it is not). It is that f() is majorized by its own truncated Taylor series g(,0) at the first step.
When we take N(B) as the whole space the above is the standard majorizing construction. That standard construction has the additional advantage that G(B) is independent of B; so we can form the inverse(G(B)) just once and re-use it in every iteration.
Our non-standard suggestions (base on using smaller N(B)) are as follows. Instead of using the global relation 1/4 ≥ P_i(1-P_i) we can instead pick potentially smaller b(i) such that b(i) ≥ P_i(1-P_i) for all points in N(B). The point being if we only need the bound to be true over all of N(B) (instead of over all space) we may be able to pick b(i) much closer to the actual P_i(1-P_i). We can try b(i) = sup P(z,x(i)) (1-P(z,x(i)) for all z in N(B) (which in later iterations may be much less that 1/4 for many data points). A closer bound would potentially drive faster convergence (allowing the MM steps to behave more like the often more beneficial Newton-Raphson steps). Our point is: the standard global bound may not be as useful later in the optimization (hence the introduction of the N(B) allowing us to use possibly tighter b(i) as long as we limit steps to stay in the N(B)).
Another non-standard use of the majorizing g(,) function would be to not take the majorizing update, but instead take the Newton-Raphson step if g(B(k+1),B(k)) < g(B(k),B(k)) (which ensures the Newton-Raphson step is productive: f(B(k+1)) < f(B(k))). If this relation fails we can take the MM step (B(k+1) = B(k) – inverse(G(B(k))) f’(B(k))). Or we can perform a partial Newton-Raphson step (B(k+1) = B(k) – s(k) inverse(f”(B(k))) f’(B(k))) but use g(,B(k)) to pick the s(k) instead more expensive line-search on f().
The point is the more complicated function g(,) allows us to encode more useful facts and guarantees about the properties of f(). The richer structure and additional choice in constructing a g(,) (already given on f()) allows us to come up with reliable optimization procedures that do not require line search.