Some researchers (in both science and marketing) abuse a slavish view of p-values to try to falsely claim credibility. The incantation is: “we achieved p = x (with x ≤ 0.05) so you should trust our work.” This might be true if the published result had been performed as a single project (and not as the sole shared result in a longer series of private experiments). It really points to the fact that even frequentist significance is a subjective and intensional quantity (an accusation usually reserved for Bayesian inference). In this article we will comment briefly on the negative effect of unreported repeated experiments and what should be done to compensate. Continue reading Drowning in insignificance
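To make the effect of unreported repeats concrete, here is a minimal sketch. It assumes each private experiment is independent and has no real effect, so its p-value is uniform on [0, 1]. The function names and the Bonferroni-style threshold are illustrative assumptions, not necessarily the compensation the article recommends.

```python
import random


def chance_of_false_positive(n_experiments, alpha=0.05):
    """Probability that at least one of n independent no-effect
    experiments reaches p <= alpha purely by chance."""
    return 1.0 - (1.0 - alpha) ** n_experiments


def bonferroni_threshold(n_experiments, alpha=0.05):
    """A standard (conservative) per-experiment threshold that keeps the
    overall false-positive chance at or below alpha."""
    return alpha / n_experiments


def simulate(n_experiments, alpha=0.05, n_trials=20000, seed=0):
    """Monte Carlo check: draw uniform p-values (the no-effect assumption)
    and count how often the best of n_experiments looks 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        best_p = min(rng.random() for _ in range(n_experiments))
        if best_p <= alpha:
            hits += 1
    return hits / n_trials


for k in (1, 5, 14, 20):
    print(k,
          round(chance_of_false_positive(k), 3),
          round(simulate(k), 3),
          round(bonferroni_threshold(k), 4))
```

Under these assumptions, 14 quiet tries are already enough to make a spurious p ≤ 0.05 more likely than not.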
Please forward and share this discount offer for our upcoming book. Manning Deal of the Day February 22: Half off Practical Data Science with R. Use code dotd022214au at www.manning.com/zumel/.
The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving positive interest in and positive reviews of their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool data mining results and usable predictive models. Data mining results like this (and the infamous “Beer and Diapers story”) face an expectation that, once an association is plotted, one is immediately ready to implement something like what is claimed in “Target Figured Out A Teen Girl Was Pregnant Before Her Father Did”.
Producing a revenue-improving predictive model is much harder than mining an interesting association, and that is what we will discuss here. Continue reading The gap between data mining and predictive models
As a data scientist I have seen variations of principal component analysis and factor analysis blindly misapplied and abused so often that I have come to think of the technique as unprincipled component analysis. PCA is a good technique often used to reduce sensitivity to overfitting. But this stated design intent leads many to (falsely) believe that any claimed use of PCA prevents overfit (which is not always the case). In this note we comment on the intent of PCA-like techniques, common abuses, and other options.
The idea is to illustrate what can quietly go wrong in an analysis and what tests to perform to make sure you see the issue. The main point is that some analysis issues cannot be fixed without going out and getting more domain knowledge, more variables, or more data. You can’t always be sure that you have insufficient data in your analysis (there is always a worry that some clever technique will make the current data work), but it must be something you are prepared to consider. Continue reading Unprincipled Component Analysis
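One way a PCA-based analysis can quietly go wrong is sketched below. The data is invented, and this is not necessarily the failure the full article examines. PCA chooses directions by variance alone and never looks at the outcome, so a large-variance nuisance variable can dominate the first component while the informative variable is pushed aside. The variable names are hypothetical, and the two-variable case is used only because its leading direction has a closed form.

```python
import math
import random

rng = random.Random(0)
n = 2000

# Invented data: 'noise' has large variance and no relation to the outcome;
# 'signal' has small variance and drives the outcome.
noise = [rng.gauss(0.0, 10.0) for _ in range(n)]
signal = [rng.gauss(0.0, 1.0) for _ in range(n)]
outcome = [s + rng.gauss(0.0, 0.1) for s in signal]


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def corr(a, b):
    """Sample Pearson correlation."""
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))


# Leading eigenvector of the 2x2 covariance matrix [[a, b], [b, c]]:
# its angle theta satisfies tan(2 * theta) = 2b / (a - c).
theta = 0.5 * math.atan2(2.0 * cov(noise, signal),
                         cov(noise, noise) - cov(signal, signal))
weights = (math.cos(theta), math.sin(theta))
pc1 = [weights[0] * u + weights[1] * v for u, v in zip(noise, signal)]

print("first component weights (noise, signal):", weights)
print("correlation of first component with outcome:", corr(pc1, outcome))
print("correlation of raw signal with outcome:", corr(signal, outcome))
```

A check of this kind (compare what the retained components can say about the outcome against what the raw variables can) is one example of a test that makes the problem visible instead of trusting the projection blindly.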
We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.
The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a lot of training bias. You can get models that look good on training data even though they have no actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to having fit mere noise.
Often there is a feeling that if a model is doing really well on training data then there must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false, as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is that you are working through variations of worthless models that only appear to be good on training data due to overfitting. More “tweaking, tuning, and fixing” only appears to improve things: as you peek at your test data (some of which you really should have held out until the very end of the project for final acceptance), your test data becomes less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).
Any researcher who does not have proper per-feature significance checks or hold-out testing procedures will be fooled into promoting faulty models. Continue reading Bad Bayes: an example of why you need hold-out testing
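The following is a minimal sketch of the kind of failure described above, not the article's own dataset or code. It assumes invented data: every feature is a random categorical level (so most levels are rare or unique, like rare k-grams) and the label is a coin flip unrelated to any feature. A small hand-written Naive Bayes classifier with add-one smoothing stands in for "a good algorithm". All names and sizes are hypothetical.

```python
import math
import random
from collections import defaultdict


def make_data(n_rows, n_features, n_levels, rng):
    """Random categorical rows; the label is a coin flip unrelated to them."""
    rows = [[rng.randrange(n_levels) for _ in range(n_features)]
            for _ in range(n_rows)]
    labels = [rng.random() < 0.5 for _ in range(n_rows)]
    return rows, labels


def fit_naive_bayes(rows, labels, n_levels):
    """Count how often each (feature, level) pair occurs with each class."""
    counts = defaultdict(lambda: [0, 0])  # (feature, level) -> [n_false, n_true]
    class_totals = [0, 0]
    for row, label in zip(rows, labels):
        class_totals[int(label)] += 1
        for j, level in enumerate(row):
            counts[(j, level)][int(label)] += 1
    return counts, class_totals, n_levels


def predict(model, row):
    """Return True when the smoothed log-score for the True class is larger."""
    counts, class_totals, n_levels = model
    scores = []
    for c in (0, 1):
        score = math.log((class_totals[c] + 1.0) / (sum(class_totals) + 2.0))
        for j, level in enumerate(row):
            seen = counts.get((j, level), (0, 0))
            # Add-one smoothing over the n_levels possible levels.
            score += math.log((seen[c] + 1.0) / (class_totals[c] + n_levels))
        scores.append(score)
    return scores[1] > scores[0]


def accuracy(model, rows, labels):
    hits = sum(predict(model, row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


rng = random.Random(2014)
train_rows, train_labels = make_data(400, 10, 1000, rng)
test_rows, test_labels = make_data(2000, 10, 1000, rng)  # held out, never fit on

model = fit_naive_bayes(train_rows, train_labels, n_levels=1000)
train_accuracy = accuracy(model, train_rows, train_labels)
test_accuracy = accuracy(model, test_rows, test_labels)
print("training accuracy:", train_accuracy)
print("hold-out accuracy:", test_accuracy)
```

Because the features carry no information, any training accuracy well above one half here is pure memorization of rare levels. Only the held-out score shows that the model is worthless.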