One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — namely, the model’s explicit estimates of variables’ explanatory value, and explicit insight into and control of variable to variable dependence.

Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.

Continue reading Modeling Trick: Impact Coding of Categorical Variables with Many Levels