Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope after reading this feel a least a small urge to double check your work and presentations to make sure you have not reported correlation where R-squared, likelihood or root mean square error (RMSE) would have been more appropriate.
It is tempting (but wrong) to use correlation to track the performance of model predictions. The temptation arises because we often (correctly) use correlation to evaluate possible model inputs. And the correlation function is often a convenient built-in function. Continue reading Don’t use correlation to track prediction performance
I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) generated useful advice for designing effective graphics .
I confess I don’t always follow his advice. Sometimes it’s because I don’t agree with him, but also it’s because I use ggplot for visualization, and I’m lazy. I like ggplot because it excels at layering multiple graphics into a single plot and because it looks good; but deviating from the default presentation is often a bit of work. How much am I losing out on by this? I decided to do the work and find out.
Details of specific plots aside, the key points of Cleveland’s philosophy are:
- A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
- Visualization is an iterative process. Graph the data, learn what you can, and then regraph the data to answer the questions that arise from your previous graphic.
Of course, when you are your own viewer, part of the cognitive strain in visualization comes from difficulty generating the desired graphic. So we’ll start by making the easiest possible ggplot graph, and working our way from there — Cleveland style.
Continue reading Revisiting Cleveland’s The Elements of Graphing Data in ggplot2
Recently Heroku was accused of using random queue routing while claiming to supply something similar to shortest queue routing (see: James Somers – Heroku’s Ugly Secret and more discussion at hacker news: Heroku’s Ugly Secret). If this is true it is pretty bad. I like randomized algorithms and I like queueing theory, but you need to work through proofs or at least simulations when playing with queues. You don’t want to pick an arbitrary algorithm and claim it works “due to randomness.” We will show a very quick example where randomized routing is very bad with near certainty. Just because things are “random” doesn’t mean you can’t or shouldn’t characterize them. Continue reading A randomized algorithm that fails with near certainty
One thing I have observed in multiple software engineering and data science projects is inconvenient steps get skipped. This is negative and happens despite rules, best intentions and effort. Recently I have noticed a positive re-formulation of this in that project quality increases rapidly when you make it take more effort to do the wrong thing. Continue reading Make it more effort to do the wrong thing
Given the range of wants, diverse data sources, required innovation and methods it often feels like data science projects are immune to planning, scoping and tracking. Without a system to break a data science project into smaller observable components you greatly increase your risk of failure. As a followup to the statistical ideas we shared in setting expectations in data science projects we share a few project planning ideas from software engineering. Continue reading Data science project planning