Nassim Nicholas Taleb recently wrote an article advocating the abandonment of the use of standard deviation and advocating the use of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure- but there is a reason that standard deviation is important even if you do not like it: it prefers models that get totals and averages correct. Absolute deviation measures do not prefer such models. So while MAD may be great for reporting, it can be a problem when used to optimize models. Read more…
Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical.
One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of data in ways that improve both data exploration and communication. Of course, getting at the right graph can be a bit of work, and often I will stop when I get to a visualization that tells me what I need to know — even if no one can read that graph but me. In this post I’ll look at a couple of ggplot graphs that take the extra step: communicating effectively to others.
For my examples I’ll use a pre-treated sample from the 2011 U.S. Census American Community Survey. The dataset is available as an R object in the file
phsample.RData; the data dictionary and additional information can be found here. Information about getting the original source data from the U.S. Census site is at the bottom of this post.
phsample.RData contains two data frames:
dhus (household information), and
dpus (information about individuals; they are joined to households using the column
SERIALNO). We will only use the
dhus data frame.
library(ggplot2) load("phsample.RData") # Restrict to non-institutional households # (No jails, schools, convalescent homes, vacant residences) hhonly = subset(dhus, (dhus$TYPE==1) &(dhus$NP > 0))
I was watching my cousins play Unspeakable Words over Christmas break and got interested in the end game. The game starts out as a spell a word from cards and then bet some points game, but in the end (when you are down to one marker) it becomes a pure betting game. In this article we analyze an idealized form of the pure betting end game. Read more…
I often need to build a predictive model that estimates rates. The example of our age is: ad click through rates (how often a viewer clicks on an ad estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.
The last appendix has gone to the editors; the book is now content complete. What a relief!
We are hoping to release the book late in the first quarter of next year. In the meantime, you can still get early drafts of our chapters through Manning’s Early Access program, if you haven’t yet. The link is here.
We look forward to sharing the final version of the book with you next year.
It recently hit me that I see unit tests as a form of penance (in addition to being a great tool for specification and test driven development). If you fix a bug and don’t add a unit test I suspect you are not actually sorry. Read more…
I have been doing a lot of writing lately (the book, clients, blog, status updates, and the occasional tweet). This has made me acutely aware of how different many of these writing tasks tend to be. Read more…
We have written a bit on sample size for common events, we have written about rare events, and we have written about frequentist significance testing. We would like to specialize our sample size analysis to rare events (which allows us to derive a somewhat tighter estimate). Read more…
Bitcoin continues to surge in buzz and price (194,993 coin transfer, US Senate hearings and astronomical price and total capitalization). This gets me to thinking: what in finance terms is Bitcoin? It claims aspire to be a currency, but what is it actually behaving like? Read more…
Please share: Manning Deal of the Day November 19: Half off Practical Data Science with R. Use code dotd1119au at www.manning.com/zumel/.