## Encoding categorical variables: one-hot and beyond (or: how to correctly use `xgboost` from `R`)

`R` has "one-hot" encoding hidden in most of its modeling paths. Asking an `R` user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.

For example, we can see evidence of one-hot encoding in the variable names chosen by a linear regression (the `xb` and `xc` coefficients in the summary below):

```
dTrain <- data.frame(x = c('a', 'b', 'b', 'c'),
                     y = c(1, 2, 1, 2))
summary(lm(y ~ x, data = dTrain))
```

```
##
## Call:
## lm(formula = y ~ x, data = dTrain)
##
## Residuals:
##          1          2          3          4
## -2.914e-16  5.000e-01 -5.000e-01  2.637e-16
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.7071   1.414    0.392
## xb            0.5000     0.8660   0.577    0.667
## xc            1.0000     1.0000   1.000    0.500
##
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared: 0.5, Adjusted R-squared: -0.5
## F-statistic: 0.5 on 2 and 1 DF, p-value: 0.7071
```
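The coefficient names `xb` and `xc` come from indicator columns that `R` builds behind the scenes. As a minimal sketch (re-creating the same toy `dTrain`; the names `designMatrix` and `oneHot` are illustrative only and not part of the original example), base `R`'s `model.matrix()` can show those columns directly. Under the default treatment contrasts the level `a` is folded into the intercept, so strictly speaking this is one-hot encoding with a reference level dropped; removing the intercept gives one indicator column per level.

```r
# Same toy data as in the regression example above.
dTrain <- data.frame(x = c('a', 'b', 'b', 'c'),
                     y = c(1, 2, 1, 2))

# model.matrix() builds the numeric design matrix implied by the formula.
# With default treatment contrasts, level 'a' is the reference level and
# is absorbed into the intercept; 'b' and 'c' get 0/1 indicator columns.
designMatrix <- model.matrix(y ~ x, data = dTrain)
print(colnames(designMatrix))
print(designMatrix)

# Dropping the intercept (~ 0 + x) yields one indicator column per level,
# i.e. a full one-hot encoding of x.
oneHot <- model.matrix(~ 0 + x, data = dTrain)
print(colnames(oneHot))
print(oneHot)
```

The column names of `designMatrix` match the coefficient names in the `lm()` summary, which is why the encoding is easy to overlook: the formula interface performs it without being asked.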
