Many data science projects and presentations are needlessly derailed by not having set shared business relevant quantitative expectations early on (for some advice see Setting expectations in data science projects). One of the most common issues is the common layman expectation of “perfect prediction” from classification projects. It is important to set expectations correctly so your partners know what you are actually working towards and do not consider late choices of criteria disappointments or “venue shopping.” Read more…
We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.
The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a lot of training bias. You can get models that look good on training data even though they have no actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to having fit mere noise.
Often there is a feeling if a model is doing really well on training data then must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is you are working through variations of worthless models that only appear to be good on training data due to overfitting. And the more “tweaking, tuning, and fixing” you try only appears to improve things because as you peek at your test-data (which you really should have held some out until the entire end of project for final acceptance) your test data is becoming less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).
Any researcher that does not have proper per-feature significance checks or hold-out testing procedures will be fooled into promoting faulty models. Read more…
Nassim Nicholas Taleb recently wrote an article advocating the abandonment of the use of standard deviation and advocating the use of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure- but there is a reason that standard deviation is important even if you do not like it: it prefers models that get totals and averages correct. Absolute deviation measures do not prefer such models. So while MAD may be great for reporting, it can be a problem when used to optimize models. Read more…
Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical.
One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of data in ways that improve both data exploration and communication. Of course, getting at the right graph can be a bit of work, and often I will stop when I get to a visualization that tells me what I need to know — even if no one can read that graph but me. In this post I’ll look at a couple of ggplot graphs that take the extra step: communicating effectively to others.
For my examples I’ll use a pre-treated sample from the 2011 U.S. Census American Community Survey. The dataset is available as an R object in the file
phsample.RData; the data dictionary and additional information can be found here. Information about getting the original source data from the U.S. Census site is at the bottom of this post.
phsample.RData contains two data frames:
dhus (household information), and
dpus (information about individuals; they are joined to households using the column
SERIALNO). We will only use the
dhus data frame.
library(ggplot2) load("phsample.RData") # Restrict to non-institutional households # (No jails, schools, convalescent homes, vacant residences) hhonly = subset(dhus, (dhus$TYPE==1) &(dhus$NP > 0))
I was watching my cousins play Unspeakable Words over Christmas break and got interested in the end game. The game starts out as a spell a word from cards and then bet some points game, but in the end (when you are down to one marker) it becomes a pure betting game. In this article we analyze an idealized form of the pure betting end game. Read more…
I often need to build a predictive model that estimates rates. The example of our age is: ad click through rates (how often a viewer clicks on an ad estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.
It recently hit me that I see unit tests as a form of penance (in addition to being a great tool for specification and test driven development). If you fix a bug and don’t add a unit test I suspect you are not actually sorry. Read more…
We have written a bit on sample size for common events, we have written about rare events, and we have written about frequentist significance testing. We would like to specialize our sample size analysis to rare events (which allows us to derive a somewhat tighter estimate). Read more…
I strongly advise using version control, and usually recommend using git as your version control system. Usually I feel a bit guilty about this advice as git is so general that it is more of a toolkit for a version control system than a complete proscriptive version control system (the missing pieces being the selection and documentation of a workflow and conventions).
But I still feel git is the one to use. My requirements involve not writing dot files in every single directory (breaks some OSX tools, and both CVS and Subversion do this), being able to work disconnected (eliminates Perforce), being cross-platform, being actively maintained, and being able to easily change decision such as where the gold standard repository lives (or even changing your mind on collaborating or not). This makes me lean towards BZR, git and Mecurial. Git is the most popular one of the bunch and has the most popular repository aggregator: GitHub.
For beginners I teach treating git like old-school RCS or SCCS: just use git to maintain versions of your local files. Don’t worry about using it to share or distribute files (but do make sure to back-up you directory in some way). To use git in this way you only need to run three commands regularly: “git status,” “git add,” and “git commit” (see Minimal Version Control Lesson: Use It). Roughly status shows you what is going on and add/commit pairs checkpoint your work. To work in this way you don’t need to know anything about branching (version control nerds’ favorite confusing topic), merging and so on. The idea is that as long as you are running add/commit pairs often enough any other problem you run into can be solved (though it make take an hour of searching books and Stack Overflow to find the answer). Git’s user interface is horrible (in part) because “everything is possible,” but that also means you can (with difficulty) solve just about any problem you run into with git (except, it seems, nested or dependent repositories).
However eventually you want to work with a collaborator or distribute your results to a client. To do that effectively with git you need to start using additional commands such as “git pull,” “git rebase,” and “git push.” Things seem more confusing at this point (though you still do not yet need to worry about branching in its full generality), but are in fact far less confusing and far less error prone than ad-hoc solutions such as emailing zip files. I almost always advise sharing work in “star workflow” where each worker has their own repository and a single common “naked” repository (that is a repository with only git data structures and no ready to use files) is used to coordinate (thought of as a server or gold standard, often named “origin”). This is treating git as if it were just a better CVS or SVN (the difference being if you want to perform a truly distributed step like pushing code to a collaborator without using the main server, you can and git will actually help with the record keeping). The central repository can be GitHub, GitLab or even a directory on a machine with ssh access. A lot of ink is spilled on how such a workflow doesn’t feel like a “distributed workflow,” but it is (you can work when disconnected from the central repository, and if the central repository is lost any up to date worker can provision a new central repository).
To get familiar with git I recommend a good book such as Jon Loeliger and Matthew McCullough’s “Version Control with Git” 2nd Edition, O’Reilly 2012. Or, better yet, work with people who know git. In all cases you need to keep notes, git issues are often solved by sequences of of three to five esoteric commands. Even after working with git for some time I still run into major “hair pullers.” One of these major “hair pullers” I run into is what I call “pseudo conflicts” and is what I am going to describe in this article. Read more…