
Skimming statistics papers for the ideas (instead of the complete procedures)

I have been reading a lot of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd edition lately. Overall, in the Bayesian framework some ideas (such as regularization and imputation) are way easier to justify (though calculating some seemingly basic quantities becomes tedious). A big advantage (and weakness) of this formulation is that statistics has a much less “shrink wrapped” feeling than in the classic frequentist presentations. You feel like the material is being written to peers instead of written to calculators (of the human or mechanical variety). In the Bayesian formulation you don’t feel like you will be yelled at for using 1 tablespoon of sugar when the recipe calls for 3 teaspoons (at least if you live in the United States).

Some other stuff reads differently after this though.

For example, I finally got around to skimming Li, K-C. (1991) “Sliced Inverse Regression for Dimension Reduction”, Journal of the American Statistical Association, 86, 316–327. The problem formulation in this paper is very clever: suppose y isn’t just a linear function of the x, but an unknown function of an unknown low rank linear image of them. In this case how do you efficiently infer? This is a clear statement of an idea that can be used to move a lot of current “wide data” (lots of variables) heuristics onto solid ground. Very very roughly, the analysis method involves working with something the author calls “the inverse regression curve”, which in centered form is E[x|y] - E[x] (y being an outcome variable and x being an input variable).
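To make the inverse regression curve a bit more concrete, here is a rough sketch of estimating E[x|y] - E[x] by sorting on the outcome, cutting the rows into slices, and averaging x within each slice. This is only the slice-mean step and not Li's full procedure. The function name, the number of slices, and the synthetic data are all assumptions made for illustration.

```python
import random
from statistics import mean


def inverse_regression_curve(xs, ys, n_slices=5):
    """Estimate E[x|y] - E[x] within each slice of the sorted outcome y.

    xs is a list of equal-length rows of inputs, ys the matching outcomes.
    Hypothetical helper for illustration, not code from the paper.
    """
    n_vars = len(xs[0])
    overall = [mean(row[j] for row in xs) for j in range(n_vars)]
    # Sort row indices by outcome and cut them into roughly equal slices.
    order = sorted(range(len(ys)), key=lambda i: ys[i])
    size = len(order) // n_slices
    curve = []
    for s in range(n_slices):
        stop = (s + 1) * size if s < n_slices - 1 else len(order)
        idx = order[s * size:stop]
        curve.append(
            [mean(xs[i][j] for i in idx) - overall[j] for j in range(n_vars)]
        )
    return curve


# Synthetic data (an assumption): y depends on x only through x[0].
random.seed(0)
xs = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(2000)]
ys = [row[0] ** 3 + 0.1 * random.gauss(0, 1) for row in xs]

curve = inverse_regression_curve(xs, ys)
for slice_means in curve:
    print([round(v, 2) for v in slice_means])
```

In this toy setup the first coordinate of the slice means moves with y while the second stays near zero, which is the kind of signal a dimension reduction step could then pick up.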

Now I have a bit of math background, so I am familiar with the idea that “inverse”, “reverse”, or “co” is a way to sex things up. Too many people have written about homologies? Then write about co-homologies! Another example: Avis, Fukuda, “Reverse Search for Enumeration”, Discrete Applied Mathematics, 1996, volume 65, pp. 21-46 (which actually is a quite good result and paper). There are technical distinctions, but perhaps you should check your arrows if you are sprinkling your titles with “inverse”, “reverse”, or “co” (weak attempt at a category theory joke).

If we treat these inverse regression curves very loosely (as we do in other writings about regression), we can try to find a pre-existing common idea or procedure that they may at least be similar to.

Suppose we are in the special case where x and y are both indicator variables that each take the value 1 when their respective conditions are met and are 0 otherwise. So instead of working with E[x] and E[x|y] we work with the related quantities P[x=TRUE] and P[x=TRUE|y=TRUE] (actually E[x|y] is encoding information about both P[x=TRUE|y=TRUE] and P[x=TRUE|y=FALSE], but let us allow this further specialization for convenience of notation).

In Chapter 6 of Practical Data Science with R we suggest re-encoding variables as their log change in likelihood: log(P[y=TRUE|x=TRUE]/P[y=TRUE]). Now PDSwR was written long after inverse regression was invented, so we are in no way claiming priority. But we certainly are not the first people to use a log likelihood ratio. Let’s work with this quantity a bit:

By Bayes’ law P[y=TRUE|x=TRUE] = P[x=TRUE|y=TRUE] P[y=TRUE] / P[x=TRUE]. So log(P[y=TRUE|x=TRUE]/P[y=TRUE]) = log(P[x=TRUE|y=TRUE]/P[x=TRUE]) which is in turn equal to log(P[x=TRUE|y=TRUE]) - log(P[x=TRUE]).
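The Bayes’ law step is easy to check numerically. Here is a small sketch using a made-up 2x2 table of joint counts (the counts and variable names are assumptions for illustration only); it computes both sides of the identity and confirms they agree.

```python
from math import isclose, log

# Hypothetical joint counts keyed by (x, y); made up for illustration.
counts = {
    (True, True): 30,
    (True, False): 10,
    (False, True): 20,
    (False, False): 40,
}
n = sum(counts.values())

p_x = sum(c for (x, _), c in counts.items() if x) / n  # P[x=TRUE]
p_y = sum(c for (_, y), c in counts.items() if y) / n  # P[y=TRUE]
p_xy = counts[(True, True)] / n                        # P[x=TRUE, y=TRUE]

p_y_given_x = p_xy / p_x  # P[y=TRUE|x=TRUE]
p_x_given_y = p_xy / p_y  # P[x=TRUE|y=TRUE]

# Left side: log change in likelihood of y=TRUE given x=TRUE.
forward = log(p_y_given_x / p_y)
# Right side: the "inverse" form, log(P[x=TRUE|y=TRUE]) - log(P[x=TRUE]).
inverse = log(p_x_given_y) - log(p_x)

print(forward, inverse)
print(isclose(forward, inverse))
```

With these counts P[y=TRUE|x=TRUE] is 0.75 against a base rate of 0.5, so both sides come out to log(1.5).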

So we have log(P[y=TRUE|x=TRUE]/P[y=TRUE]) = log(P[x=TRUE|y=TRUE]) - log(P[x=TRUE]). The term on the left is the change in log likelihood of y=TRUE given x=TRUE (which we argue is a very useful and natural quantity to work with). The term on the right is in the same form as the inverse regression curve except we are writing log(P[]) instead of E[]. So I would argue that the quantities E[x|y]-E[x] and E[y|x]-E[y] (modulo some centering and scaling monkey business) are, by analogy to Bayes’ law, likely of similar utility in a regression. You can likely do whatever dimension reduction you want on either (though I prefer the E[y|x]-E[y] forward form, as it seems more natural and is scale-invariant with respect to x).
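The scale-invariance remark can be illustrated with a toy example. The sketch below, which assumes a discrete x and uses a made-up data set and a hypothetical helper name, computes both centered conditional means, then multiplies x by 1000 and recomputes. The forward quantities E[y|x]-E[y] do not move, while the inverse quantities E[x|y]-E[x] pick up the factor of 1000.

```python
from collections import defaultdict
from statistics import mean


def centered_conditional_means(keys, values):
    """Return {k: E[values | keys == k] - E[values]} for each distinct key."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    overall = mean(values)
    return {k: mean(vs) - overall for k, vs in groups.items()}


# Toy data, made up for illustration.
x = [1, 1, 2, 2, 3, 3, 3, 1]
y = [0, 0, 0, 1, 1, 1, 0, 1]
x_scaled = [1000 * v for v in x]  # same variable in different units

forward = centered_conditional_means(x, y)                # E[y|x] - E[y]
forward_scaled = centered_conditional_means(x_scaled, y)  # after rescaling x
inverse = centered_conditional_means(y, x)                # E[x|y] - E[x]
inverse_scaled = centered_conditional_means(y, x_scaled)  # after rescaling x

print(list(forward.values()), list(forward_scaled.values()))
print(inverse, inverse_scaled)
```

This only shows the point for a relabeling of discrete levels; it says nothing about how either quantity behaves under the centering and standardization steps a full procedure would apply.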

Maybe the original paper needs the original quantity for some of the later steps, but that requires a much more thorough reading.