One of the services that the `R` package `vtreat` provides is *level coding* (what we sometimes call *impact coding*): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.

By default, `vtreat` level codes to the difference between the conditional means and the grand mean (`catN` variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (`catB` variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the `ranger` package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by `vtreat`’s coding. This often isn’t a problem, but sometimes it may be.
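To make the distinction concrete, here is a minimal base-R sketch on made-up toy data. It does not call `vtreat` or `ranger`, and it is not how either package implements its coding (for instance, it does no smoothing or cross-validation); the data and all variable names are hypothetical and chosen only for illustration. Levels `"b"` and `"c"` are constructed to have the same conditional mean.

```r
# Toy data (hypothetical): levels "b" and "c" share the same conditional mean of y.
d <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(1, 3, 4, 6, 5, 5),
  stringsAsFactors = FALSE
)

# Grand mean of the outcome, and the mean of y within each level of x.
grand_mean <- mean(d$y)
cond_means <- vapply(split(d$y, d$x), mean, numeric(1))

# Impact-style code: conditional mean minus grand mean.
# Levels with equal conditional means receive the same code.
impact_code <- cond_means - grand_mean

# Ordinal code: position of each level when sorted by conditional mean.
# ties.method = "first" breaks ties by order of appearance, so tied
# levels still receive distinct integer codes.
ordinal_code <- rank(cond_means, ties.method = "first")

# Map the per-level codes back onto the rows of the data.
d$x_impact  <- as.numeric(impact_code[d$x])
d$x_ordinal <- as.numeric(ordinal_code[d$x])

print(d)
```

In this sketch `"b"` and `"c"` get identical values in `x_impact` but different values in `x_ordinal`, so a tree-based model could still split between them using the ordinal column.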

So the data scientist may want to use a level coding different from what `vtreat` defaults to. In this article, we will demonstrate how to implement custom level encoders in `vtreat`. We assume you are familiar with the basics of `vtreat`: the types of derived variables, how to create and apply a treatment plan, etc.