Archive

Author Archive

Diversion: Win-Vector LLC’s Nina Zumel takes time off to publish a literary book review

July 17th, 2014 No comments

Win-Vector LLC’s Nina Zumel takes some time off to publish a literary book review: Reading Red Spectres: Russian Gothic Tales.

Hundertwasser domes

Nina Zumel also examines aspects of the supernatural in literature and in folk culture at her blog, multoghost.wordpress.com. She writes about folklore, ghost stories, weird fiction, or anything else that strikes her fancy. Follow her on Twitter @multoghost.

Automatic bias correction doesn’t fix omitted variable bias

July 8th, 2014 1 comment

Page 94 of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd Edition (which we will call BDA3) provides a great example of what happens when common broad frequentist bias criticisms are over-applied to predictions from ordinary linear regression: the predictions appear to fall apart. BDA3 goes on to exhibit what might be considered the kind of automatic/mechanical fix responding to such criticisms would entail (producing a bias corrected predictor), and rightly shows these adjusted predictions are far worse than the original ordinary linear regression predictions. BDA3 makes a number of interesting points and is worth studying closely. We work their example in a bit more detail for emphasis. Read more…

Frequentist inference only seems easy

July 1st, 2014 3 comments

Two of the most common methods of statistical inference are frequentism and Bayesianism (see Bayesian and Frequentist Approaches: Ask the Right Question for some good discussion). In both cases we are attempting to perform reliable inference of unknown quantities from related observations. And in both cases inference is made possible by introducing and reasoning over well-behaved distributions of values.

As a first example, consider the problem of trying to estimate the speed of light from a series of experiments.

In this situation the frequentist method quietly does some heavy philosophical lifting before you even start work. Under the frequentist interpretation since the speed of light is thought to have a single value it does not make sense to model it as having a prior distribution of possible values over any non-trivial range. To get the ability to infer, frequentist philosophy considers the act of measurement repeatable and introduces very subtle concepts such as confidence intervals. The frequentist statement that a series of experiments places the speed of light in vacuum at 300,000,000 meters a second plus or minus 1,000,000 meters a second with 95% confidence does not mean there is a 95% chance that the actual speed of light is in the interval 299,000,000 to 301,000,000 (the common incorrect recollection of what a confidence interval is). It means if the procedure that generated the interval were repeated on new data, then 95% of the time the speed of light would be in the interval produced: which may not be the interval we are looking at right now. Frequentist procedures are typically easy on the practitioner (all of the heavy philosophic work has already been done) and result in simple procedures and calculations (through years of optimization of practice).

Bayesian procedures on the other hand are philosophically much simpler, but require much more from the user (production and acceptance of priors). The Bayesian philosophy is: given a generative model, a complete prior distribution (detailed probabilities of the unknown value posited before looking at the current experimental data) of the quantity to be estimated, and observations: then inference is just a matter of calculating the complete posterior distribution of the quantity to be estimated (by correct application of Bayes’ Law). Supply a bad model or bad prior beliefs on possible values of the speed of light and you get bad results (and it is your fault, not the methodology’s fault). The Bayesian method seems to ask more, but you have to remember it is trying to supply more (complete posterior distribution, versus subjunctive confidence intervals).

In this article we are going to work a simple (but important) problem where (for once) the Bayesian calculations are in fact easier than the frequentist ones. Read more…

R minitip: don’t use data.matrix when you mean model.matrix

June 10th, 2014 No comments

A quick R mini-tip: don’t use data.matrix when you mean model.matrix. If you do so you may lose (without noticing) a lot of your model’s explanatory power (due to poor encoding). Read more…

R style tip: prefer functions that return data frames

June 6th, 2014 3 comments

While following up on Nina Zumel’s excellent Trimming the Fat from glm() Models in R I got to thinking about code style in R. And I realized: you can make your code much prettier by designing more of your functions to return data.frames. That may seem needlessly heavy-weight, but it has a lot of down-stream advantages. Read more…

Skimming statistics papers for the ideas (instead of the complete procedures)

June 2nd, 2014 No comments

Been reading a lot of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd edition lately. Overall in the Bayesian framework some ideas (such as regularization, and imputation) are way easier to justify (though calculating some seemingly basic quantities becomes tedious). A big advantage (and weakness) of this formulation is statistics has a much less “shrink wrapped” feeling than the classic frequentist presentations. You feel like the material is being written to peers instead of written to calculators (of the human or mechanical variety). In the Bayesian formulation you don’t feel like you will be yelled at for using 1 tablespoon of sugar when the recipe calls for 3 teaspoons (at least if you live in the United States).

Some other stuff reads differently after this though. Read more…

How does Practical Data Science with R stand out?

June 2nd, 2014 2 comments

There are a lot of good books on statistics, machine learning, analytics, and R. So it is valid to ask: how does Practical Data Science with R stand out? Why should a data scientist or an aspiring data scientist buy it?

We admit, it isn’t the only book we own. Some relevant books from the Win-Vector LLC company library include:

Books Read more…

Save 50% on Practical Data Science with R (and other titles) at Manning through May 30, 2014

May 21st, 2014 No comments

Manning Publications Inc. is launching an exciting new MEAP: Practical Probabilistic Programming (which we have already subscribed to) by offering a 50% discount on Practical Probabilistic Programming and other titles (including Practical Data Science with R!). To get the discount put the books in your Manning shopping car and then add the promotional code ppplaunch50 (through May 30, 2014) into the coupon code field in the “other” section on towards the bottom of the account form. See below for other Manning books eligible for this generous discount. Read more…

Save 45% on Practical Data Science with R (expires May 21, 2014)

May 16th, 2014 6 comments

Please share this generous deal from Manning publications: save 45% on Practical Data Science with R through May 21, 2014. Please tweet, forward and share!


Edit: we are going to try and keep the current best deals on the book at the bottom of the Practical Data Science with R page. So look there for updates (also the book is always available at Amazon.com so you may want to look what the discount there is).

R has some sharp corners

May 15th, 2014 2 comments

R is definitely our first choice go-to analysis system. In our opinion you really shouldn’t use something else until you have an articulated reason (be it a need for larger data scale, different programming language, better data source integration, or something else). The advantages of R are numerous:

  • Single integrated work environment.
  • Powerful unified scripting/programming environment.
  • Many many good tutorials and books available.
  • Wide range of machine learning and statistical libraries.
  • Very solid standard statistical libraries.
  • Excellent graphing/plotting/visualization facilities (especially ggplot2).
  • Schema oriented data frames allowing batch operations, plus simple row and column manipulation.
  • Unified treatment of missing values (regardless of type).

For all that we always end up feeling just a little worried and a little guilty when introducing a new user to R. R is very powerful and often has more than one way to perform a common operation or represent a common data type. So you are never very far away from a strange and painful corner case. This why when you get R training you need to make sure you get an R expert (and not an R apologist). One of my favorite very smart experts is Norm Matloff (even his most recent talk title is smart: “What no one else will tell you about R”). Also, buy his book; we are very happy we purchased it.

But back to corner cases. For each method in R you really need to double check if it actually works over the common R base data types (numeric, integer, character, factor, and logical). Not all of them do and and sometimes you get a surprise.

Recent corner case problems we ran into include:

  • randomForest regression fails on character arguments, but works on factors.
  • mgcv gam() model doesn’t convert strings to formulas.
  • R maps can’t use the empty string as a key (that is the string of length 0, not a NULL array or NA value).

These are all little things, but can be a pain to debug when you are in the middle of something else. Read more…