One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows with the number of coefficients, not with the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, each with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows each). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.
As many R users know (but often forget), a glm model object carries a copy of its training data by default. You can use the settings y=FALSE and model=FALSE to turn this off.
set.seed(2325235)

# Set up a synthetic classification problem of a given size
# and two variables: one numeric, one categorical
# (two levels).
synthFrame = function(nrows) {
   d = data.frame(xN=rnorm(nrows),
                  xC=sample(c('a','b'),size=nrows,replace=TRUE))
   d$y = (d$xN + ifelse(d$xC=='a',0.2,-0.2) + rnorm(nrows))>0.5
   d
}

# first show that model=F and y=F help reduce model size

dTrain = synthFrame(1000)
model1 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'))
model2 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'),
             y=FALSE)
model3 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'),
             y=FALSE, model=FALSE)

#
# Estimate the object's size as the size of its serialization
#
length(serialize(model1, NULL))
# [1] 225251
length(serialize(model2, NULL))
# [1] 206341
length(serialize(model3, NULL))
# [1] 189562

dTest = synthFrame(100)
p1 = predict(model1, newdata=dTest, type='response')
p2 = predict(model2, newdata=dTest, type='response')
p3 = predict(model3, newdata=dTest, type='response')
sum(abs(p1-p2))
# [1] 0
sum(abs(p1-p3))
# [1] 0
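To see where the remaining bytes go, one way is to serialize each component of the fitted object separately and look at the largest contributors. This is a minimal sketch of our own (the name compSizes is ours, not from the code above), assuming model3 from the previous block is still in the workspace:

# Sketch: measure the serialized size (in bytes) of each component
# of the trimmed glm object, then list the largest contributors.
# A glm fit is a named list, so sapply() walks its components directly.
compSizes = sapply(model3,
                   function(component) length(serialize(component, NULL)))
head(sort(compSizes, decreasing=TRUE))

Components such as the formula, terms, and family still reference environments, which is why trimming y and model alone does not shrink the object all the way down to the size of its coefficients.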