While following up on Nina Zumel’s excellent Trimming the Fat from glm() Models in R I got to thinking about code style in R. And I realized: you can make your code much prettier by designing more of your functions to return `data.frame`

s. That may seem needlessly heavy-weight, but it has a lot of down-stream advantages. Continue reading

# Skimming statistics papers for the ideas (instead of the complete procedures)

Been reading a lot of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd edition lately. Overall in the Bayesian framework some ideas (such as regularization, and imputation) are way easier to justify (though calculating some seemingly basic quantities becomes tedious). A big advantage (and weakness) of this formulation is statistics has a much less “shrink wrapped” feeling than the classic frequentist presentations. You feel like the material is being written to peers instead of written to calculators (of the human or mechanical variety). In the Bayesian formulation you don’t feel like you will be yelled at for using 1 tablespoon of sugar when the recipe calls for 3 teaspoons (at least if you live in the United States).

Some other stuff reads differently after this though. Continue reading

# How does Practical Data Science with R stand out?

There are a lot of good books on statistics, machine learning, analytics, and R. So it is valid to ask: how does Practical Data Science with R stand out? Why should a data scientist or an aspiring data scientist buy it?

We admit, it isn’t the only book we own. Some relevant books from the Win-Vector LLC company library include:

# Trimming the Fat from glm() Models in R

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, `glm`

models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.

As many R users know (but often forget), a `glm`

model object carries a copy of its training data by default. You can use the settings `y=FALSE`

and `model=FALSE`

to turn this off.

set.seed(2325235) # Set up a synthetic classification problem of a given size # and two variables: one numeric, one categorical # (two levels). synthFrame = function(nrows) { d = data.frame(xN=rnorm(nrows), xC=sample(c('a','b'),size=nrows,replace=TRUE)) d$y = (d$xN + ifelse(d$xC=='a',0.2,-0.2) + rnorm(nrows))>0.5 d } # first show that model=F and y=F help reduce model size dTrain = synthFrame(1000) model1 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit')) model2 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'), y=FALSE) model3 = glm(y~xN+xC,data=dTrain,family=binomial(link='logit'), y=FALSE, model=FALSE) # # Estimate the object's size as the size of its serialization # length(serialize(model1, NULL)) # [1] 225251 length(serialize(model2, NULL)) # [1] 206341 length(serialize(model3, NULL)) # [1] 189562 dTest = synthFrame(100) p1 = predict(model1, newdata=dTest, type='response') p2 = predict(model2, newdata=dTest, type='response') p3 = predict(model3, newdata=dTest, type='response') sum(abs(p1-p2)) # [1] 0 sum(abs(p1-p3)) # [1] 0

# Save 50% on Practical Data Science with R (and other titles) at Manning through May 30, 2014

Manning Publications Inc. is launching an exciting new MEAP: Practical Probabilistic Programming (which we have already subscribed to) by offering a 50% discount on Practical Probabilistic Programming and other titles (including Practical Data Science with R!). To get the discount put the books in your Manning shopping car and then add the promotional code **ppplaunch50** (through May 30, 2014) into the coupon code field in the “other” section on towards the bottom of the account form. See below for other Manning books eligible for this generous discount. Continue reading

# Save 45% on Practical Data Science with R (expires May 21, 2014)

Please share this generous deal from Manning publications: save 45% on Practical Data Science with R through May 21, 2014. Please tweet, forward and share!

Edit: we are going to try and keep the current best deals on the book at the bottom of the Practical Data Science with R page. So look there for updates (also the book is always available at Amazon.com so you may want to look what the discount there is).

# R has some sharp corners

R is definitely our first choice go-to analysis system. In our opinion you really shouldn’t use something else until you have an articulated reason (be it a need for larger data scale, different programming language, better data source integration, or something else). The advantages of R are numerous:

- Single integrated work environment.
- Powerful unified scripting/programming environment.
- Many many good tutorials and books available.
- Wide range of machine learning and statistical libraries.
- Very solid standard statistical libraries.
- Excellent graphing/plotting/visualization facilities (especially ggplot2).
- Schema oriented data frames allowing batch operations, plus simple row and column manipulation.
- Unified treatment of missing values (regardless of type).

For all that we always end up feeling just a little worried and a little guilty when introducing a new user to R. R is very powerful and often has more than one way to perform a common operation or represent a common data type. So you are never very far away from a strange and painful corner case. This why when you get R training you need to make sure you get an R expert (and not an R apologist). One of my favorite very smart experts is Norm Matloff (even his most recent talk title is smart: “What no one else will tell you about R”). Also, buy his book; we are very happy we purchased it.

But back to corner cases. For each method in R you really need to double check if it actually works over the common R base data types (numeric, integer, character, factor, and logical). Not all of them do and and sometimes you get a surprise.

Recent corner case problems we ran into include:

- randomForest regression fails on character arguments, but works on factors.
- mgcv
`gam()`

model doesn’t convert strings to formulas. - R maps can’t use the empty string as a key (that is the string of length 0, not a
`NULL`

array or`NA`

value).

These are all little things, but can be a pain to debug when you are in the middle of something else. Continue reading

# Great book discount from Manning (and more about one of our authors)

Found this great offer from mkt@manning.com in our email today! Very excited to see Nina Zumel get some recognition and thought we would share it (and the generous discount) here. Continue reading

# A clear picture of power and significance in A/B tests

A/B tests are one of the simplest reliable experimental designs.

Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.

“Practical guide to controlled experiments on the web: listen to your customers not to the HIPPO” Ron Kohavi, Randal M Henne, and Dan Sommerfield, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007 pp. 959-967.

The ideas is to test a variation (called “treatment” or “B”) in parallel with continuing to test a baseline (called “control” or “A”) to see if the variation drives a desired effect (increase in revenue, cure of disease, and so on). By running both tests at the same time it is hoped that any confounding or omitted factors are nearly evenly distributed between the two groups and therefore not spoiling results. This is a much safer system of testing than retrospective studies (where we look for features from data already collected).

Interestingly enough the multi-armed bandit alternative to A/B testing (a procedure that introduces online control) is one of the simplest non-trivial Markov decision processes. However, we will limit ourselves to traditional A/B testing for the remainder of this note. Continue reading

# A bit of the agenda of Practical Data Science with R

The goal of Zumel/Mount: Practical Data Science with R is to teach, through guided practice, the skills of a data scientist. We define a data scientist as the person who organizes client input, data, infrastructure, statistics, mathematics and machine learning to deploy useful predictive models into production.

Our plan to teach is to:

- Order the material by what is expected from the data scientist.
- Emphasize the already available bread and butter machine learning algorithms that most often work.
- Provide a large set of worked examples.
- Expose the reader to a number of realistic data sets.

Some of these choices may put-off some potential readers. But it is our goal to try and spend out time on what a data scientist needs to do. Our point: the data scientist is responsible for end to end results, which is not always entirely fun. If you want to specialize in machine learning algorithms or only big data infrastructure, that is a fine goal. However, the job of the data scientist is to understand and orchestrate all of the steps (working with domain experts, curating data, using data tools, and applying machine learning and statistics).

Once you define what a data scientist does, you find fewer people want to work as one.

We expand a few of our points below. Continue reading