I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) generated useful advice for designing effective graphics .
I confess I don’t always follow his advice. Sometimes it’s because I don’t agree with him, but also it’s because I use ggplot for visualization, and I’m lazy. I like ggplot because it excels at layering multiple graphics into a single plot and because it looks good; but deviating from the default presentation is often a bit of work. How much am I losing out on by this? I decided to do the work and find out.
Details of specific plots aside, the key points of Cleveland’s philosophy are:
- A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
- Visualization is an iterative process. Graph the data, learn what you can, and then regraph the data to answer the questions that arise from your previous graphic.
Of course, when you are your own viewer, part of the cognitive strain in visualization comes from difficulty generating the desired graphic. So we’ll start by making the easiest possible ggplot graph, and working our way from there — Cleveland style.
Continue reading Revisiting Cleveland’s The Elements of Graphing Data in ggplot2
Recently Heroku was accused of using random queue routing while claiming to supply something similar to shortest queue routing (see: James Somers – Heroku’s Ugly Secret and more discussion at hacker news: Heroku’s Ugly Secret). If this is true it is pretty bad. I like randomized algorithms and I like queueing theory, but you need to work through proofs or at least simulations when playing with queues. You don’t want to pick an arbitrary algorithm and claim it works “due to randomness.” We will show a very quick example where randomized routing is very bad with near certainty. Just because things are “random” doesn’t mean you can’t or shouldn’t characterize them. Continue reading A randomized algorithm that fails with near certainty
One thing I have observed in multiple software engineering and data science projects is inconvenient steps get skipped. This is negative and happens despite rules, best intentions and effort. Recently I have noticed a positive re-formulation of this in that project quality increases rapidly when you make it take more effort to do the wrong thing. Continue reading Make it more effort to do the wrong thing
Given the range of wants, diverse data sources, required innovation and methods it often feels like data science projects are immune to planning, scoping and tracking. Without a system to break a data science project into smaller observable components you greatly increase your risk of failure. As a followup to the statistical ideas we shared in setting expectations in data science projects we share a few project planning ideas from software engineering. Continue reading Data science project planning
A bit more on the ROC/AUC
The receiver operating characteristic curve (or ROC) is one of the standard methods to evaluate a scoring system. Nina Zumel has described its application, but I would like to call out some additional details. In my opinion while the ROC is a useful tool, the “area under the curve” (AUC) summary often read off it is not as intuitive and interpretable as one would hope or some writers assert.
Continue reading More on ROC/AUC
Check out Nina Zumel’s latest article: On Balance.
XCOM: Enemy Unknown is a turn based video game where the player choses among actions (for example shooting an alien) that are labeled with a declared probability of success.
Image copyright Firaxis Games
A lot of gamers, after missing a 80% chance of success shot, start asking if the game’s pseudo random number generator is fair. Is the game really rolling the dice as stated, or is it cheating? Of course the matching question is: are player memories at all fair; would they remember the other 4 out of 5 times they made such a shot?
This article is intended as an introduction to the methods you would use to test such a question (be it in a video game, in science, or in a business application such as measuring advertisement conversion). There are already some interesting articles on collecting and analyzing XCOM data and finding and characterizing the actual pseudo random generator code in the game, and discussing the importance of repeatable pseudo-random results. But we want to add a discussion pointed a bit more at analysis technique in general. We emphasize methods that are efficient in their use of data. This is a statistical term meaning that a maximal amount of learning is gained from the data. In particular we do not recommend data binning as a first choice for analysis as it cuts down on sample size and thus is not the most efficient estimation technique.
Continue reading How to test XCOM “dice rolls” for fairness
I know “officially” data scientists all always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain.
I would like to make a plea to my fellow data scientists to stop using Excel-like formats for informal data exchange and become much stricter in producing and insisting on truly open machine readable files. Open files are those in an open format (not proprietary like Microsoft Excel) and machine readable in this case means readable by a very simple program (preferring simple escaping strategies to complicated quoting strategies). A lot of commonly preferred formats surprisingly do not meet these conditions: for example Microsoft Excel, XML and quoted CSV all fail the test. A few formats that do meet these conditions: SQL dumps, JSON and what I call “strong TSV.” I will illustrate some of the difficulty in using ad-hoc formats in R and suggest work-arounds. Continue reading Please stop using Excel-like formats to exchange data
From time to time we work on projects that would benefit from a free lightweight pure Java linear programming library. That is a library unencumbered by a bad license, available cheaply, without an infinite amount of file format and interop cruft and available in Java (without binary blobs and JNI linkages). There are a few such libraries, but none have repeatably, efficiently and reliably met our needs. So we have re-packaged an older one of our own for release under the Apache 2.0 license. This code will have its own rough edges (not having been used widely in production), but I still feel fills an important gap. This article is brief introduction to our WVLPSolver Java library. Continue reading Yet Another Java Linear Programming Library
von Neumann and Morgenstern’s “Theory of Games and Economic Behavior” is the famous basis for game theory. One of the central accomplishments is the rigorous proof that comparative “preference methods” over fairly complicated “event spaces” are no more expressive than numeric (real number valued) utilities. That is: for a very wide class of event spaces and comparison functions “>” there is a utility function u() such that:
a > b (“>” representing the arbitrary comparison or preference for the event space) if and only if u(a) > u(b) (this time “>” representing the standard order on the reals).
However, an active reading of sections 1 through 3 and even the 2nd edition’s axiomatic appendix shows that the concept of “events” (what preferences and utilities are defined over) is deliberately left undefined. There is math and objects and spaces, but not all of them are explicitly defined in term of known structures (are they points in R^n, sets, multi-sets, sums over sets or what?). The word “event” is used early in the book and not in the index. Axiomatic treatments often rely on intentionally leaving ground-concepts undefined, but we are going to work a concrete example through von Neumann and Morgenstern to try and illustrate a bit more of the required intuition and deep nature of their formal notions of events and utility. I also will illustrate how, at least in discussion, von Neuman and Morgenstern may have held on to a naive “single outcome” intuition of events and a naive “direct dollars” intuition of utility despite erecting a theory carefully designed to support much more structure. This is possible because they never have to calculate in the general event space: they prove access to the preference allows them to construct the utility funciton u() and then work over the real numbers. Sections 1 through 3 are designed to eliminate the need for a theory of preference or utility and allow von Neuman and Morgenstern to work with real numbers (while achieving full generality). They never need to make the translations explicit, because soon after showing the translations are possible they assume they have already been applied. Continue reading Working an example of von Neumann and Morgenstern utility