
## Some Details on Running xgboost

While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she establishes the issue prior to discussing mitigation).

In doing that I ran into one more avoidable, but strange, issue in using xgboost: when run for a small number of rounds, xgboost at first appears not to get even the unconditional or grand average right (let alone the conditional averages Nina was working with)!

Let’s take a look at that by running a trivial example in R.
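Here is a minimal sketch of the effect. The data set and parameter choices below are my own illustrative assumptions, not the original example from the note, and it assumes a recent `xgboost` (whose squared-error objective is named `reg:squarederror`; older versions called it `reg:linear`):

```r
library(xgboost)

# Illustrative data: the outcome is constant, so the grand average
# is obviously 5, and any reasonable model should predict it.
d <- data.frame(x = 1:10, y = 5)
x_mat <- as.matrix(d[, "x", drop = FALSE])

# Fit for only a few rounds, leaving eta and base_score at their defaults.
model <- xgboost(
  data = x_mat,
  label = d$y,
  nrounds = 2,
  objective = "reg:squarederror",
  verbose = 0)

# The average prediction falls well short of mean(d$y) = 5: predictions
# start at base_score (default 0.5), and each boosting round moves them
# only a fraction (roughly eta, default 0.3) of the remaining distance
# toward the target, so a few rounds are not enough to reach the mean.
mean(predict(model, x_mat))
```

Raising `nrounds` (or `eta`) lets the average prediction converge to the grand average, which is what makes the issue avoidable.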


## Encoding categorical variables: one-hot and beyond (or: how to correctly use `xgboost` from `R`)

`R` has "one-hot" encoding hidden in most of its modeling paths. Asking an `R` user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.

For example, we can see evidence of one-hot encoding in the variable names chosen by a linear regression:

```r
dTrain <- data.frame(
  x = c('a', 'b', 'b', 'c'),
  y = c(1, 2, 1, 2))
summary(lm(y ~ x, data = dTrain))
```
```
##
## Call:
## lm(formula = y ~ x, data = dTrain)
##
## Residuals:
##          1          2          3          4
## -2.914e-16  5.000e-01 -5.000e-01  2.637e-16
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.7071   1.414    0.392
## xb            0.5000     0.8660   0.577    0.667
## xc            1.0000     1.0000   1.000    0.500
##
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:   -0.5
## F-statistic:   0.5 on 2 and 1 DF,  p-value: 0.7071
```
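
To make the hidden encoding explicit, we can ask `R` for the design matrix that `lm()` builds from the formula; `model.matrix()` is the standard function behind this step:

```r
# The design matrix R derives from y ~ x: the reference level 'a' is
# absorbed into the intercept, while 'b' and 'c' each get a 0/1
# indicator column, matching the coefficient names "xb" and "xc" above.
model.matrix(~ x, data = dTrain)
```

(Strictly speaking this is treatment coding, which drops one level; a full one-hot encoding would keep an indicator column for 'a' as well.)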