One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, `glm`

models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.

As many R users know (but often forget), a `glm`

model object carries a copy of its training data by default. You can use the settings `y=FALSE`

and `model=FALSE`

to turn this off.

set.seed(2325235)
# Set up a synthetic classification problem of a given size
# and two variables: one numeric, one categorical
# (two levels).
synthFrame = function(nrows) {
d = data.frame(xN=rnorm(nrows),
xC=sample(c('a','b'),size=nrows,replace=TRUE))
d$y = (d$xN + ifelse(d$xC=='a',0.2,-0.2) + rnorm(nrows))>0.5
d
}
# first show that model=F and y=F help reduce model size
dTrain = synthFrame(1000)
model1 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'))
model2 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'),
y=FALSE)
model3 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'),
y=FALSE, model=FALSE)
#
# Estimate the object's size as the size of its serialization
#
length(serialize(model1, NULL))
# [1] 225251
length(serialize(model2, NULL))
# [1] 206341
length(serialize(model3, NULL))
# [1] 189562
dTest = synthFrame(100)
p1 = predict(model1, newdata=dTest, type='response')
p2 = predict(model2, newdata=dTest, type='response')
p3 = predict(model3, newdata=dTest, type='response')
sum(abs(p1-p2))
# [1] 0
sum(abs(p1-p3))
# [1] 0

Read more…