Categories: data science, Pragmatic Machine Learning, Statistics, Tutorials

Bad Bayes: an example of why you need hold-out testing

We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.

The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a lot of training bias. You can get models that look good on training data even though they have no actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to having fit mere noise.

Often there is a feeling that if a model is doing really well on training data then there must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false, as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is that you are working through variations of worthless models that only appear to be good on training data due to overfitting. And more “tweaking, tuning, and fixing” only appears to improve things: as you peek at your test data (some of which you really should have held out until the very end of the project for final acceptance), your test data becomes less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).

Any researcher that does not have proper per-feature significance checks or hold-out testing procedures will be fooled into promoting faulty models.
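
A minimal sketch of the effect described above, not the post's own experiment: the data sizes, the names make_data, fit_level_scores and predict_scores, and the smoothing constant 0.5 are all assumptions made for illustration. Every feature is pure noise with many rare levels, and a simple Naive-Bayes-style score is fit on the training rows and then checked on a hold-out set.

```r
set.seed(2014)

n_features <- 10   # number of noise features (assumed)
n_levels <- 400    # levels per feature; with 500 rows most levels are rare

# Features and outcome are drawn independently, so there is nothing to learn.
make_data <- function(n) {
  cols <- lapply(seq_len(n_features), function(i) {
    sample(paste0("lev", seq_len(n_levels)), n, replace = TRUE)
  })
  names(cols) <- paste0("f", seq_len(n_features))
  d <- as.data.frame(cols, stringsAsFactors = FALSE)
  d$y <- runif(n) > 0.5
  d
}

train <- make_data(500)
test <- make_data(500)
features <- paste0("f", seq_len(n_features))

# Smoothed per-level log-odds of y, estimated from the training rows only.
fit_level_scores <- function(d, feature) {
  pos <- tapply(d$y, d[[feature]], sum)
  tot <- tapply(d$y, d[[feature]], length)
  setNames(as.numeric(log((pos + 0.5) / (tot - pos + 0.5))), names(pos))
}
scores <- lapply(features, function(f) fit_level_scores(train, f))
names(scores) <- features

# Naive-Bayes-style prediction: add up the per-feature log-odds.
predict_scores <- function(d) {
  total <- numeric(nrow(d))
  for (f in features) {
    s <- unname(scores[[f]][match(d[[f]], names(scores[[f]]))])
    s[is.na(s)] <- 0   # a level never seen in training carries no evidence
    total <- total + s
  }
  total > 0
}

train_accuracy <- mean(predict_scores(train) == train$y)
test_accuracy <- mean(predict_scores(test) == test$y)
print(c(train = train_accuracy, test = test_accuracy))
```

Because each rare level's score is dominated by the outcome of the few training rows that carry it, the model largely memorizes the training outcomes. Only the hold-out accuracy reveals that the features carry no information.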

Categories: Mathematics, Rants, Statistics, Tutorials

Use standard deviation (not mad about MAD)

Nassim Nicholas Taleb recently wrote an article advocating abandoning standard deviation in favor of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure, but there is a reason that standard deviation is important even if you do not like it: it prefers models that get totals and averages correct. Absolute deviation measures do not prefer such models. So while MAD may be great for reporting, it can be a problem when used to optimize models.
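
The claim about totals can be checked with a small sketch (the simulated data and the names sq_loss and abs_loss are assumptions for illustration, not taken from the article). For a skewed sample we find the single constant prediction that minimizes squared error and the one that minimizes absolute error, then compare the totals each implies.

```r
set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)   # skewed, income-like data (assumed)

# Loss of predicting the same constant for every observation.
sq_loss <- function(constant) sum((x - constant)^2)
abs_loss <- function(constant) sum(abs(x - constant))

# Both losses are convex in the constant, so a 1-d search is enough.
best_sq <- optimize(sq_loss, interval = range(x))$minimum
best_abs <- optimize(abs_loss, interval = range(x))$minimum

print(c(best_sq = best_sq, mean = mean(x)))
print(c(best_abs = best_abs, median = median(x)))

# Totals implied by each constant prediction, next to the actual total.
print(c(actual = sum(x),
        squared_error = length(x) * best_sq,
        absolute_error = length(x) * best_abs))
```

The squared-error minimizer lands on the mean, so its predictions sum to the observed total. The absolute-error minimizer lands on the median, which for skewed data implies a noticeably smaller total.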

Categories: Coding, math programming, Statistics, Tutorials

The Extra Step: Graphs for Communication versus Exploration

Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical.

One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of data in ways that improve both data exploration and communication. Of course, getting at the right graph can be a bit of work, and often I will stop when I get to a visualization that tells me what I need to know — even if no one can read that graph but me. In this post I’ll look at a couple of ggplot graphs that take the extra step: communicating effectively to others.

For my examples I’ll use a pre-treated sample from the 2011 U.S. Census American Community Survey. The dataset is available as an R object in the file phsample.RData; the data dictionary and additional information can be found here. Information about getting the original source data from the U.S. Census site is at the bottom of this post.

The file phsample.RData contains two data frames: dhus (household information), and dpus (information about individuals; they are joined to households using the column SERIALNO). We will only use the dhus data frame.

library(ggplot2)
load("phsample.RData")

# Restrict to non-institutional households
# (No jails, schools, convalescent homes, vacant residences)
hhonly <- subset(dhus, (TYPE == 1) & (NP > 0))
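
As a rough sketch of the difference between an exploration graph and a communication graph (this is not one of the post's graphs): the code below uses a simulated stand-in data frame called households rather than hhonly, and it assumes NP records the number of people in a household. The titles and labels are likewise invented for illustration.

```r
library(ggplot2)

set.seed(7)
# Simulated stand-in for household sizes; the real data lives in phsample.RData.
households <- data.frame(NP = pmin(1 + rpois(2000, lambda = 1.5), 8))

# Exploration: the quickest plot that answers the analyst's own question.
explore <- ggplot(households, aes(x = NP)) +
  geom_bar()

# Communication: same data, with the extra step of a title, readable
# axis labels, and a discrete axis so each household size is named.
communicate <- ggplot(households, aes(x = factor(NP))) +
  geom_bar(fill = "gray40") +
  labs(title = "Most households are small",
       x = "People in household",
       y = "Number of households") +
  theme_minimal()

print(explore)
print(communicate)
```

Both plots draw the same bars. The second costs a few more lines but can be read by someone who has never seen the column name NP.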


Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

Generalized linear models for predicting rates

I often need to build a predictive model that estimates rates. The defining example of our age is ad click-through rates (how often a viewer clicks on an ad, estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.

In this note we will work a toy problem and suggest some relevant R analysis libraries.
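
A minimal sketch of the logistic regression recommendation, using simulated click data rather than the post's toy problem. The column names ad_quality, viewer_interest and clicked, and the coefficients used to simulate them, are assumptions for illustration.

```r
set.seed(42)
n <- 5000

# Simulated ad impressions with two made-up features.
d <- data.frame(ad_quality = runif(n), viewer_interest = runif(n))
true_rate <- plogis(-4 + 2 * d$ad_quality + 1.5 * d$viewer_interest)
d$clicked <- rbinom(n, size = 1, prob = true_rate)

# Linear regression on the 0/1 outcome, for comparison.
linear_model <- lm(clicked ~ ad_quality + viewer_interest, data = d)

# Logistic regression: a generalized linear model with a logit link.
logistic_model <- glm(clicked ~ ad_quality + viewer_interest,
                      data = d, family = binomial(link = "logit"))

pred_linear <- predict(linear_model, newdata = d)
pred_logistic <- predict(logistic_model, newdata = d, type = "response")

# Logistic predictions are always valid rates strictly between 0 and 1;
# linear predictions carry no such guarantee.
print(range(pred_linear))
print(range(pred_logistic))

# With an intercept, the fitted rates add up to the observed click count.
print(c(observed = sum(d$clicked), predicted = sum(pred_logistic)))
```

Note the use of type = "response" in predict(): without it, glm predictions are returned on the log-odds scale rather than as rates.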

Categories: Administrativia, data science, Practical Data Science, Statistics

Big News! Practical Data Science with R is content complete!

The last appendix has gone to the editors; the book is now content complete. What a relief!

We are hoping to release the book late in the first quarter of next year. In the meantime, you can still get early drafts of our chapters through Manning’s Early Access program, if you haven’t yet. The link is here.

We look forward to sharing the final version of the book with you next year.

Categories: Pragmatic Machine Learning, Statistics, Tutorials

Sample size and power for rare events

We have written a bit on sample size for common events, on rare events, and on frequentist significance testing. We would now like to specialize our sample size analysis to rare events (which allows us to derive a somewhat tighter estimate).
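
For orientation, here is a sketch using the generic calculation in base R's power.prop.test, which is not the tighter rare-event estimate the post derives. The rates, significance level and power below are assumed values chosen for illustration.

```r
# Sample size per group needed to detect a 20% relative lift,
# at 5% significance and 80% power (all values assumed).

# Rare event: a 1% base rate rising to 1.2%.
rare <- power.prop.test(p1 = 0.010, p2 = 0.012,
                        sig.level = 0.05, power = 0.80)

# Common event: a 50% base rate rising to 60%.
common <- power.prop.test(p1 = 0.50, p2 = 0.60,
                          sig.level = 0.05, power = 0.80)

print(c(rare_n_per_group = ceiling(rare$n),
        common_n_per_group = ceiling(common$n)))
```

The same relative improvement needs far more observations when the event is rare, which is why a sample size analysis specialized to rare events is worth having.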

Categories: Administrativia, Practical Data Science, Statistics

Practical Data Science with R: Manning Deal of the Day November 19th 2013

Please share: Manning Deal of the Day November 19: Half off Practical Data Science with R. Use code dotd1119au at www.manning.com/zumel/.

Categories: Administrativia, data science, Practical Data Science, Statistics

Practical Data Science with R October 2013 update

A quick status update on our upcoming book “Practical Data Science with R” by Nina Zumel and John Mount.

We are really happy with how the book is coming out. We were able to cover almost everything we hoped to. Part 1 (especially chapter 3) is already being used in courses, and has some very good stuff on how to review data. Part 2 covers the “statistical / machine-learning canon,” and turns out to be a very complete demonstration of what odd steps are needed to move from start to finish for each example in R. Part 3 is going to finish with the important (but neglected) topics of delivering results to production, and building good documentation and presentations.

Categories: Administrativia, Practical Data Science, Statistics

Practical Data Science with R, deal of the day Aug 1 2013

Deal of the Day August 1: Half off my book Practical Data Science with R. Use code dotd0801au at www.manning.com/zumel/

Categories: Administrativia, data science, Pragmatic Data Science, Statistics

What is “Practical Data Science with R”?

A bit about our upcoming book “Practical Data Science with R”. Nina and I share our current draft of the front matter from the book, a description that will help you decide if this is the book for you (we hope that it is). Or this could be the book that helps explain what you do to others.
