Trimming the Fat from glm() Models in R

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.

As many R users know (but often forget), a glm model object carries a copy of its training data by default. You can use the settings y=FALSE and model=FALSE to turn this off.

set.seed(2325235)


# Set up a synthetic classification problem of a given size
# and two variables: one numeric, one categorical
# (two levels).
synthFrame = function(nrows) {
   d = data.frame(xN=rnorm(nrows),
      xC=sample(c('a','b'),size=nrows,replace=TRUE))
   d$y = (d$xN + ifelse(d$xC=='a',0.2,-0.2) + rnorm(nrows))>0.5
   d
}


# first show that model=F and y=F help reduce model size

dTrain = synthFrame(1000)
model1 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'))
model2 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'),
             y=FALSE)
model3 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'),
              y=FALSE, model=FALSE)

#
# Estimate the object's size as the size of its serialization
#
length(serialize(model1, NULL))
# [1] 225251
length(serialize(model2, NULL))
# [1] 206341
length(serialize(model3, NULL))
# [1] 189562

dTest = synthFrame(100)
p1 = predict(model1, newdata=dTest, type='response')
p2 = predict(model2, newdata=dTest, type='response')
p3 = predict(model3, newdata=dTest, type='response')
sum(abs(p1-p2))
# [1] 0
sum(abs(p1-p3))
# [1] 0

So we see (as expected) that removing the training data from the model decreases the size of the model (as estimated by the size of its serialization), without affecting the model’s predictions. What happens when you increase the training data size? The size of the model (with y=FALSE and model=FALSE) should not grow.

ndata = seq(from=0, to=100000, by=5000)[-1]
#
# A function to estimate the size of a model for
# our synthetic problem, with a training set of size n
#
getModelSize = function(n) {
  data = synthFrame(n)
  model = glm(y~xN+xC,data=data,family=binomial(link='logit'),
              y=FALSE, model=FALSE)
  length(serialize(model, NULL))
}

size1 = sapply(ndata, FUN=getModelSize)
library(ggplot2)

ggplot(data.frame(n=ndata, modelsize=size1), aes(x=n, y=modelsize)) +
  geom_point() + geom_line()

Plot1

Lo and behold, we see that the model size still grows linearly in the size of the training data! The model objects are still holding something that is proportional to the size of the training data. Where?

We can use our serialization trick to find the size of the individual components of a model:

breakItDown = function(mod) {
  sapply(mod, FUN=function(x){length(serialize(x, NULL))}, simplify=T)
}

Now let’s compare two models trained with datasets of different sizes (one ten times the size of the other).

mod1 = glm(y~xN+xC,data=synthFrame(1000),
           family=binomial(link='logit'),
           y=FALSE, model=FALSE)
c1 = breakItDown(mod1)

mod2 = glm(y~xN+xC,data=synthFrame(10000),
           family=binomial(link='logit'),
           y=FALSE, model=FALSE)
c2 = breakItDown(mod2)

# For pretty-printing a vector to a vertical blog-friendly format:
# return a string of vector formatted as a column with names
# use cat to echo the value
vfmtN = function(v) { 
  width = max(sapply(names(v),nchar))
  paste(
    sapply(1:length(v),function(i) { paste(format(names(v)[i],
                                                  width=width),
                                           format(v[[i]])) }),
    collapse='\n')
}

cat(vfmtN(c1))
# coefficients      119
# residuals         18948
# fitted.values     18948
# effects           16071
# R                 261
# rank              26
# qr                35261
# family            25160
# linear.predictors 18948
# deviance          30
# aic               30
# null.deviance     30
# iter              26
# weights           18948
# prior.weights     18948
# df.residual       26
# df.null           26
# converged         26
# boundary          26
# call              373
# formula           193
# terms             836
# data              16278
# offset            18
# control           140
# method            37
# contrasts         96
# xlevels           91

cat(vfmtN(c2))
# coefficients      119
# residuals         198949
# fitted.values     198949
# effects           160071
# R                 261
# rank              26
# qr                359262
# family            25160
# linear.predictors 198949
# deviance          30
# aic               30
# null.deviance     30
# iter              26
# weights           198949
# prior.weights     198949
# df.residual       26
# df.null           26
# converged         26
# boundary          26
# call              373
# formula           193
# terms             836
# data              160278
# offset            18
# control           140
# method            37
# contrasts         96
# xlevels           91

Look carefully, and you will see that certain objects in the glm model are large, and growing with data size.

r = c2/c1
cat(vfmtN(r))
# coefficients      1
# residuals         10.49974
# fitted.values     10.49974
# effects           9.960239
# R                 1
# rank              1
# qr                10.18865
# family            1
# linear.predictors 10.49974
# deviance          1
# aic               1
# null.deviance     1
# iter              1
# weights           10.49974
# prior.weights     10.49974
# df.residual       1
# df.null           1
# converged         1
# boundary          1
# call              1
# formula           1
# terms             1
# data              9.846296
# offset            1
# control           1
# method            1
# contrasts         1
# xlevels           1

cat(vfmtN(r[r>1]))
# residuals         10.49974
# fitted.values     10.49974
# effects           9.960239
# qr                10.18865
# linear.predictors 10.49974
# weights           10.49974
# prior.weights     10.49974
# data              9.846296

Now strictly speaking, all you need to know to apply a glm model are the coefficients of the model, and the appropriate link function. All the other things the glm model object carries around are for the purpose of characterizing the model. An example would be calculating coefficient significances (and really, for most purposes, one could just calculate the quantities one wants to know, save those, and throw the data away — but we’re here to discuss R as it is, not as it should be). Once you’ve examined a model and decided that it’s satisfactory, all you probably want to do is predict. So let’s try trimming all those large objects away.

cleanModel1 = function(cm) {
  # just in case we forgot to set
  # y=FALSE and model=FALSE
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr = c()  
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()
  cm
}

cm1 = cleanModel1(mod1)
cm2 = cleanModel1(mod2)

dTest = synthFrame(100)
p1=predict(cm1, newdata=dTest, type='response') # FAILS 
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
# Rank zero or should not have used lm(.., qr=FALSE).

Ooops. We can’t null out the qr member of the model object if we want to predict. Incidentally, this is related to the observation that if you try to call lm(...., y=FALSE, model=FALSE, qr=FALSE), the result is a model object that fails to either predict or summarize. Don’t ask me why qr=FALSE is even an option. But back to the glm. What’s in the model’s qr field?

breakItDown(mod1$qr)
# qr  rank qraux pivot   tol 
# 35042    26    46    34    30 

breakItDown(mod2$qr)
# qr   rank  qraux  pivot    tol 
# 359043     26     46     34     30 

It turns out that we don’t actually need model’s qr$qr to predict, so let’s trim just that away:

cleanModel2 = function(cm) {
  cm$y = c()
  cm$model = c()
  
  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()  
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()
  cm
}

# More reduction in model size
length(serialize(mod2, NULL))
# [1] 1701600
cm2 = cleanModel2(mod2)
length(serialize(cm2, NULL))
# [1] 27584

# And prediction works, too
resp.full = predict(mod2, newdata=dTest, type="response")
resp.cm = predict(cm2, newdata=dTest, type="response")
sum(abs(resp.full-resp.cm))
# [1] 0

Are we done?

getModelSize = function(n) {
  data = synthFrame(n)
  model = cleanModel2(glm(y~xN+xC,data=data,
                         family=binomial(link='logit'),
                         y=FALSE, model=FALSE))
  length(serialize(model, NULL))
}

size2 = sapply(ndata, FUN=getModelSize)

ggplot(data.frame(n=ndata, modelsize=size2), aes(x=n, y=modelsize)) +
  geom_point() + geom_line()

Plot2

The models are substantially smaller than when we started, but they still grow with training data size.

A rough explanation for this is that glm hides pointers to the environment and things from the environment deep in many places. We didn’t notice this when we built models in the global environment because all those pointers pointed to the same things, so even though the models are much bigger than they need to be, they are all “too big” by the same amount, and hence don’t appear to grow as the training data grows. But when you build the models in a function (as we did in getModelSize(), you get more transient environments that are proportional to the size of the training data — and so model size grows with training data size. This isn’t going to seem clear, because it depends on a lot of complicated implementation details (for a taste of how complicated it can get, see here).

After much trial and error, this is the set of fields and attributes of the model that we found were growing with data size, and that we could eliminate without breaking predict().

stripGlmLR = function(cm) {
  cm$y = c()
  cm$model = c()
  
  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()  
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()

  
  cm$family$variance = c()
  cm$family$dev.resids = c()
  cm$family$aic = c()
  cm$family$validmu = c()
  cm$family$simulate = c()
  attr(cm$terms,".Environment") = c()
  attr(cm$formula,".Environment") = c()
  
  cm
}

getModelSize = function(n) {
  data = synthFrame(n)
  model = stripGlmLR(glm(y~xN+xC,data=data,
                          family=binomial(link='logit'),
                          y=FALSE, model=FALSE))
  length(serialize(model, NULL))
}

size3 = sapply(ndata, FUN=getModelSize)

ggplot(data.frame(n=ndata, modelsize=size3), aes(x=n, y=modelsize)) +
  geom_point() + geom_line()

Plot3

Yahoo! It worked! The models are constant size with respect to training data size. And prediction works.

cm2 = stripGlmLR(mod2)
resp.full = predict(mod2, newdata=dTest, type="response")
resp.cm = predict(cm2, newdata=dTest, type="response")
sum(abs(resp.full-resp.cm))
# [1] 0

Comparing the size of the final stripped-down models (in variable size3 in the demonstration code) to the originals (size1), we find that the final model is 3/10th of a percent the size of the original model for small (n=5000) training data sets, and 0.015% the size of the original model for “large” (n=100,000) data sets. That’s a heckuva savings. And we probably haven’t gotten rid of all the unnecessary baggage, but at least we seem to have stopped the growth. This was enough trimming to accomplish our task for the client (producing working models that stored and loaded quickly), so we stopped.

One point and one caveat. You can null out model$family entirely; the predict function will still return its default value, the link value (that is, predict(model, newdata=data)) will work). However, predict(model, newdata=data, type='response') will fail. You can still recover the response by passing the link value through the inverse link function: in the case of logistic regression, this is the sigmoid function, sigmoid(x) = 1/(1 + exp(-x)). I didn’t test type=terms.

The caveat: many of the other things besides predict that you might like to do with a glm model will fail on the stripped-down version: in particular summary(), anova() and step(). So any characterization that you want to do on a candidate model should be done before trimming down the fat. Once you have decided on a satisfactory model, you can strip it down and save it for use in future predictions.


  • You can trim lm and gam models in a similar way, too. The exact fields to trim are a bit different. We will leave this as an exercise for the reader.
  • We are aware of the bigglm package, for fitting generalized linear models to big data. We didn’t test it, but I would imagine that it doesn’t have this problem. Note, though that the problem here isn’t the size of the training data per se (which is only of moderate size); it’s the inordinate size of the resulting model.

Be Sociable, Share!

10 thoughts on “Trimming the Fat from glm() Models in R”

  1. I’ve encountered a similar problem with lm() performance. I have a technique I use where I fit thousands of small lm() models on KNN selected local manifolds. It works really well on certain kinds of problems, and has the advantage of creating results which can be understood by the analyst.
    I originally deployed it in lisp, and as regression fitting is just a matrix invert, it didn’t add any noticable overhead. When I tried the same trick on an R contract using lm(), it was beastly slow. Lots of the cruft that comes along with an lm is apparently computationally intensive. I’m assuming all such cruft is occasionally useful, but it is a bit ridiculous. Fortunately, lm.fit wasn’t so bad, so I didn’t have to resort to lapack calls, but it was one of those examples of R trying to do everything, and ending up awfully bloated.

  2. @Scott Locklin We’ve had various moments of “is R leaking memory?” in the past. Never ran it down, then Nina saw this one stick out like a sore thumb during a save. She tried object.size(), but it was hiding things.

  3. Absolutely wonderful post. Thank you for taking the time to thoroughly dissect the glm object in R and show people where the pain points are, and to find a way to strip a GLM object down to the minimal necessary for prediction. This is a really valuable contribution!

  4. This begs the question of why the model object is so ‘obese’ in the first place. My understanding is that glm(), lm() and friends are really for interactive use when people are exploring things, and not really suitable for, e.g. use within functions or loops for other more automatic computations (of the kind in which SAS excells, of course, …) In such cases the R philosophy is to create a model object that is as complete and informative as possible, allowing maximum interrogation and use afterwards. R objects are vehicles for information.

    When this obesity becomes a problem, my preference is not to pare down the fitted model object, but not to create it in the first place. Rather, I prefer to use low level tools that just to the job in hand, such as model.matrix(), qr() and/or lm.fit() for linear models, glm.fit() for GLMs and low-level matrix tools for standard operations. It is pretty easy to build up a special purpose model fitting function that returns a stripped down object, and the tools are still there.

    I should also acknowledge, though, that this is a very useful post in demonstrating the structure of the bog standard fitted model objects and alerting people to what is going on, often without their full knowledge. With many R functions it is good to look under the bonnet (or ‘hood’ in the USA) sometimes – that’s one reason you have the code – as well as at the help information, (which practically no one understands without a legal training, I know…). My take on what to do about it, though, is a bit different.

  5. @Bill Venables Thanks for your comment!

    I definitely think it R’s fault for building/leaving so much cruft on the standard call. I like your advice to try the .fit() forms, but I am not sure that is always convenient (it turns out you run into some complications).

    lm.fit() and glm.fit() are possibilities, but you lose the convenience of having a ready to go predict() function which might be inconvenient if you have data that includes factor valued columns (you need to pass the factor encoding from train to test somehow to make sure you are using the same level encodings). So you are going to want to use the family and possibly some other slots from the returned object. And it turns out the serialization size of the list returned by glm.fit() also grows proportional to the training data size (try the new code at the bottom of here). So you are again back at cherry picking which fields to save or which ones to excise.

    Also in packages like mgcv you see documentation such as: “gam.fit {mgcv} … is an internal function of package mgcv.” Which really makes you wary of calling it directly.

  6. Just two quick points in reply.

    One person’s “cruft” can be another’s “craft”. The focus of the discussion here has been on the predict() method. That’s probably the most important, but I am not at all sure people will never write useful methods for fitted model objects that require some of the “cruft” for which we can not yet see a use. There is a need to be balanced about this, and I concede that the objects, when created under the default settings, may have gone a bit overboard, but the philosophy of R favours erring on the side of completeness in this context.

    Second point is a query to ponder: Who is this “R” whose fault it is? It is in anyone’s hands to write a package, say GLM, which fits a model and returns a stripped down object suitable ONLY for prediction, and I think it would be very useful contribution. We are all “R”, when you think about it.

    (Disclaimer: I didn’t write any of the lm or glm software, but I know and respect the people who did.)

    Again: this has been a useful post and an even more useful discussion.

  7. @Bill Venables We are probably are on the same side on this one, just my emphasis is different than the tone I think I am reading from your comments.

    At some point design trade-offs are so out of balance that we can actually call them wrong. It doesn’t mean the developers aren’t great. But just because they are great developers doesn’t mean there isn’t a flaw. Having structures proportional to data set size in a linear or generalized linear model (and no prominent documentation of this, or built in way to turn them off) violates the principle of least astonishment. Remember good statistical software is both good statistics (so it must follow good statistical practices) and good software (so it must follow good software engineering practices).

    There are obvious functions that do currently use all the stored data (so I know it can’t be fixed): predict(model) (which really is not an improvement over predict(model,newdata=data)).

    And in some sense Nina has already supplied code that strips things down to work for prediction.

  8. Another thing R users may not know: step() takes time proportional to the data set size even for lm(). The reason this is a surprise is it it is possible (for lm() at least) to perform all the calculations needed for a stepwise regression from the summaries X^T X and X^T y (where X is the so-called design matrix and y are the training outcomes). These summaries are of a size that is independent of the number of training rows (their size is a function only of the number of columns which come from variables and levels). Probably the original data is being used to drive re-fits (which is probably necessary for a method general enough to work for both lm() and glm()). This mean for lm() one could write a faststep package (though not everybody likes step(), though we could at least add L2 regularization at this point). The graph of the step timings is here: .

    The timing code is here. Obviously this kind of thing is what the biglm packages is for, but it would be nice if the base system lm package had dome some of these things.

Comments are closed.