The release date for Zumel, Mount “Practical Data Science with R” is getting close. I thought I would share a few statistics about what goes into this kind of book. Continue reading Some statistics about the book
There’s a new post up at the ninazumel.com blog that looks at the statistics of “verification by multiplicity” — the statistical technique that is behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope.
We normally don’t write about science here at Win-Vector, but we do sometimes examine the statistics and statistical methods behind scientific announcements and issues. NASA’s new technique is a cute and relatively straightforward (statistically speaking) approach.
From what I understand of the introduction to the paper, there are two ways to determine whether or not a planet candidate is really a planet: the first is to confirm the fact with additional measurements of the target star’s gravitational wobble, or by measurements of the transit times of the apparent planets across the face of the star. Getting sufficient measurements can take time. The other way is to “validate” the planet by showing that it’s highly unlikely that the sighting was a false positive. Specifically, the probability that the signal observed was caused by a planet should be at least 100 times larger than the probability that the signal is a false positive. The validation analysis is a Bayesian approach that considers various mechanisms that produce false positives, determines the probability that these various mechanisms could have produced the signal in question, and compares them to the probability that a planet produced the signal.
The basic idea behind verification by multiplicity is that planets are often clustered in multi-planet star systems, while false positive measurements (mistaken identification of potential planets) occur randomly. Putting this another way: if false positives are random, then they won’t tend to occur together near the same star. So if you observe a star with multiple “planet signals,” it’s unlikely that all the signals are false positives. We can use that observation to quantify how much more likely it is that a star with multiple candidates actually hosts a planet. The resulting probability can be used as an improved prior for the planet model when doing the statistical validation described above.
You can read the rest of the article here.
Please forward and share this discount offer for our upcoming book. Manning Deal of the Day February 22: Half off Practical Data Science with R. Use code dotd022214au at www.manning.com/zumel/.
The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving positive interest in and positive reviews of their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool data mining results and usable predictive models. Data mining results like this (and the infamous “Beer and Diapers story”) face an expectation that one is immediately ready to implement something like what is claimed in: “Target Figured Out A Teen Girl Was Pregnant Before Her Father Did” once an association is plotted.
Producing a revenue improving predictive model is much harder than mining an interesting association. And this is what we will discuss here. Continue reading The gap between data mining and predictive models
As a data scientist I have seen variations of principal component analysis and factor analysis so often blindly misapplied and abused that I have come to think of the technique as unprincipled component analysis. PCA is a good technique often used to reduce sensitivity to overfitting. But this stated design intent leads many to (falsely) believe that any claimed use of PCA prevents overfit (which is not always the case). In this note we comment on the intent of PCA like techniques, common abuses and other options.
The idea is to illustrate what can quietly go wrong in an analysis and what tests to perform to make sure you see the issue. The main point is some analysis issues can not be fixed without going out and getting more domain knowledge, more variables or more data. You can’t always be sure that you have insufficient data in your analysis (there is always a worry that some clever technique will make the current data work), but it must be something you are prepared to consider. Continue reading Unprincipled Component Analysis
We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.
The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a lot of training bias. You can get models that look good on training data even though they have no actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to having fit mere noise.
Often there is a feeling if a model is doing really well on training data then must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is you are working through variations of worthless models that only appear to be good on training data due to overfitting. And the more “tweaking, tuning, and fixing” you try only appears to improve things because as you peek at your test-data (which you really should have held some out until the entire end of project for final acceptance) your test data is becoming less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).
Any researcher that does not have proper per-feature significance checks or hold-out testing procedures will be fooled into promoting faulty models. Continue reading Bad Bayes: an example of why you need hold-out testing
Nassim Nicholas Taleb recently wrote an article advocating the abandonment of the use of standard deviation and advocating the use of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure- but there is a reason that standard deviation is important even if you do not like it: it prefers models that get totals and averages correct. Absolute deviation measures do not prefer such models. So while MAD may be great for reporting, it can be a problem when used to optimize models. Continue reading Use standard deviation (not mad about MAD)
Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical.
One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of data in ways that improve both data exploration and communication. Of course, getting at the right graph can be a bit of work, and often I will stop when I get to a visualization that tells me what I need to know — even if no one can read that graph but me. In this post I’ll look at a couple of ggplot graphs that take the extra step: communicating effectively to others.
For my examples I’ll use a pre-treated sample from the 2011 U.S. Census American Community Survey. The dataset is available as an R object in the file
phsample.RData; the data dictionary and additional information can be found here. Information about getting the original source data from the U.S. Census site is at the bottom of this post.
phsample.RData contains two data frames:
dhus (household information), and
dpus (information about individuals; they are joined to households using the column
SERIALNO). We will only use the
dhus data frame.
library(ggplot2) load("phsample.RData") # Restrict to non-institutional households # (No jails, schools, convalescent homes, vacant residences) hhonly = subset(dhus, (dhus$TYPE==1) &(dhus$NP > 0))
I often need to build a predictive model that estimates rates. The example of our age is: ad click through rates (how often a viewer clicks on an ad estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.
In this note we will work a toy problem and suggest some relevant R analysis libraries. Continue reading Generalized linear models for predicting rates
The last appendix has gone to the editors; the book is now content complete. What a relief!
We are hoping to release the book late in the first quarter of next year. In the meantime, you can still get early drafts of our chapters through Manning’s Early Access program, if you haven’t yet. The link is here.
We look forward to sharing the final version of the book with you next year.