Win-Vector LLC’s Nina Zumel takes some time off to publish a literary book review: Reading Red Spectres: Russian Gothic Tales.

Nina Zumel also examines aspects of the supernatural in literature and in folk culture at her blog, multoghost.wordpress.com. She writes about folklore, ghost stories, weird fiction, or anything else that strikes her fancy. Follow her on Twitter @multoghost.

Page 94 of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd Edition (which we will call BDA3) provides a great example of what happens when common broad frequentist bias criticisms are over-applied to predictions from ordinary linear regression: the predictions appear to fall apart. BDA3 goes on to exhibit what might be considered the kind of automatic/mechanical fix responding to such criticisms would entail (producing a bias corrected predictor), and rightly shows these adjusted predictions are far worse than the original ordinary linear regression predictions. BDA3 makes a number of interesting points and is worth studying closely. We work their example in a bit more detail for emphasis. Read more…

Two of the most common methods of statistical inference are frequentism and Bayesianism (see Bayesian and Frequentist Approaches: Ask the Right Question for some good discussion). In both cases we are attempting to perform reliable inference of unknown quantities from related observations. And in both cases inference is made possible by introducing and reasoning over well-behaved distributions of values.

As a first example, consider the problem of trying to estimate the speed of light from a series of experiments.

In this situation the frequentist method quietly does some heavy philosophical lifting before you even start work. Under the frequentist interpretation since the speed of light is thought to have a single value it does not make sense to model it as having a prior distribution of possible values over any non-trivial range. To get the ability to infer, frequentist philosophy considers the act of measurement repeatable and introduces very subtle concepts such as *confidence intervals*. The frequentist statement that a series of experiments places the speed of light in vacuum at 300,000,000 meters a second plus or minus 1,000,000 meters a second with 95% confidence does not mean there is a 95% chance that the actual speed of light is in the interval 299,000,000 to 301,000,000 (the common incorrect recollection of what a confidence interval is). It means if the procedure that generated the interval were repeated on new data, then 95% of the time the speed of light would be in the interval produced: which may not be the interval we are looking at right now. Frequentist procedures are typically easy on the practitioner (all of the heavy philosophic work has already been done) and result in simple procedures and calculations (through years of optimization of practice).

Bayesian procedures on the other hand are philosophically much simpler, but require much more from the user (production and acceptance of priors). The Bayesian philosophy is: *given* a generative model, a complete prior distribution (detailed probabilities of the unknown value posited before looking at the current experimental data) of the quantity to be estimated, and observations: then inference is just a matter of calculating the complete posterior distribution of the quantity to be estimated (by correct application of Bayes’ Law). Supply a bad model or bad prior beliefs on possible values of the speed of light and you get bad results (and it is your fault, not the methodology’s fault). The Bayesian method seems to ask more, but you have to remember it is trying to supply more (complete posterior distribution, versus subjunctive confidence intervals).

In this article we are going to work a simple (but important) problem where (for once) the Bayesian calculations are in fact easier than the frequentist ones. Read more…

Been reading a lot of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd edition lately. Overall in the Bayesian framework some ideas (such as regularization, and imputation) are way easier to justify (though calculating some seemingly basic quantities becomes tedious). A big advantage (and weakness) of this formulation is statistics has a much less “shrink wrapped” feeling than the classic frequentist presentations. You feel like the material is being written to peers instead of written to calculators (of the human or mechanical variety). In the Bayesian formulation you don’t feel like you will be yelled at for using 1 tablespoon of sugar when the recipe calls for 3 teaspoons (at least if you live in the United States).

Some other stuff reads differently after this though. Read more…

There are a lot of good books on statistics, machine learning, analytics, and R. So it is valid to ask: how does Practical Data Science with R stand out? Why should a data scientist or an aspiring data scientist buy it?

We admit, it isn’t the only book we own. Some relevant books from the Win-Vector LLC company library include:

Read more…

R is definitely our first choice go-to analysis system. In our opinion you really shouldn’t use something else until you have an articulated reason (be it a need for larger data scale, different programming language, better data source integration, or something else). The advantages of R are numerous:

- Single integrated work environment.
- Powerful unified scripting/programming environment.
- Many many good tutorials and books available.
- Wide range of machine learning and statistical libraries.
- Very solid standard statistical libraries.
- Excellent graphing/plotting/visualization facilities (especially ggplot2).
- Schema oriented data frames allowing batch operations, plus simple row and column manipulation.
- Unified treatment of missing values (regardless of type).

For all that we always end up feeling just a little worried and a little guilty when introducing a new user to R. R is very powerful and often has more than one way to perform a common operation or represent a common data type. So you are never very far away from a strange and painful corner case. This why when you get R training you need to make sure you get an R expert (and not an R apologist). One of my favorite very smart experts is Norm Matloff (even his most recent talk title is smart: “What no one else will tell you about R”). Also, buy his book; we are very happy we purchased it.

But back to corner cases. For each method in R you really need to double check if it actually works over the common R base data types (numeric, integer, character, factor, and logical). Not all of them do and and sometimes you get a surprise.

Recent corner case problems we ran into include:

- randomForest regression fails on character arguments, but works on factors.
- mgcv
`gam()`

model doesn’t convert strings to formulas.
- R maps can’t use the empty string as a key (that is the string of length 0, not a
`NULL`

array or `NA`

value).

These are all little things, but can be a pain to debug when you are in the middle of something else. Read more…

What is meant by regression modeling?

Linear Regression is one of the most common statistical modeling techniques. It is very powerful, important, and (at first glance) easy to teach. However, because it is such a broad topic it can be a minefield for teaching and discussion. It is common for angry experts to accuse writers of carelessness, ignorance, malice and stupidity. If the type of regression the expert reader is expecting doesn’t match the one the writer is discussing then the writer is assumed to be ill-informed. The writer is especially vulnerable to experts when writing for non-experts. In such writing the expert finds nothing new (as they already know the topic) and is free to criticize any accommodation or adaption made for the intended non-expert audience. We argue that many of the corrections are not so much evidence of wrong ideas but more due a lack of empathy for the necessary informality necessary in concise writing. You can only define so much in a given space, and once you write too much you confuse and intimidate a beginning audience. Read more…

We use R to take a *very* brief look at the distribution of e-book sales on Amazon.com. Read more…

Many data science projects and presentations are needlessly derailed by not having set shared business relevant quantitative expectations early on (for some advice see Setting expectations in data science projects). One of the most common issues is the common layman expectation of “perfect prediction” from classification projects. It is important to set expectations correctly so your partners know what you are actually working towards and do not consider late choices of criteria disappointments or “venue shopping.” Read more…

Categories: data science, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials Tags: classifier quality, deviance, Entropy, likelihood, log-likelihood
Some researchers (in both science and marketing) abuse a slavish view of p-values to try and falsely claim credibility. The incantation is: “we achieved p = x (with x ≤ 0.05) so you should trust our work.” This might be true if the published result had been performed as a single project (and not as the sole shared result in longer series of private experiments) and really points to the fact that even frequentist significance is a subjective and intensional quantity (an accusation usually reserved for Bayesian inference). In this article we will comment briefly on the negative effect of un-reported repeated experiments and what should be done to compensate. Read more…