The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.

– William Cleveland, *The Elements of Graphing Data*, Chapter 2

In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.

I tend to follow Cleveland’s philosophy, quoted above: these graphs show me, and hopefully you, aspects of data and models that I might not otherwise see. Some of them, however, are non-standard and tend to require explanation. My purpose here is to share some ideas for graphical analysis that are either useful to you directly or will give you some ideas of your own.

Read more…

Categories: Applications, Opinion, Pragmatic Machine Learning, Statistics, Tutorials
Tags: boxplots, ggplot, ggplot2, graphical perception, linear regression, Logistic Regression, R, statistical graphs

One of the best tools currently in the machine learning toolbox is the 1930s statistical technique called logistic regression. We explain how to add professional-quality logistic regression to your analytic repertoire and describe a bit beyond that. Read more…

Categories: Applications, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials
Tags: Learn a Powerful Machine Learning Tool, Logistic Regression, Max-Ent, Maximum Entropy, R, Regularization, Statistics

We extend the ideas from Automatic Differentiation with Scala to include *reverse accumulation*. Reverse accumulation is a non-obvious improvement to automatic differentiation that can, in many cases, vastly speed up the calculation of gradients. Read more…
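To make the idea concrete, here is a minimal sketch of reverse accumulation. It is written in Python rather than the article's Scala, and the names `Var` and `backward` are hypothetical, chosen for illustration only. The forward pass records each operation's local partial derivatives. A single backward pass then yields the derivative of the output with respect to every input at once, which is where the speedup for gradients comes from.

```python
class Var:
    """A node in the computation graph: a value plus links to its inputs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input node, local partial derivative)
        self.grad = 0.0         # will hold d(output)/d(this node)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) for every node, working from output to inputs."""
    order, seen = [], set()

    def visit(node):
        # Post-order walk: a node is appended only after all of its inputs.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Reversed post-order processes each node before any of its inputs.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule


x, y = Var(3.0), Var(4.0)
f = x * y + x * x   # f(x, y) = x*y + x^2
backward(f)
# x.grad now holds df/dx = y + 2x, and y.grad holds df/dy = x
```

Forward accumulation would need one pass per input variable to get the same gradient. Here one backward pass serves all inputs, at the cost of storing the graph.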

Categories: Applications, Coding, Exciting Techniques, math programming, Mathematics, Programming, Tutorials
Tags: Automatic Differentiation, Conjugate Gradient, Gradient, Mathematical Bedside Reading, Optimization, Reverse Accumulation, Scala

This article is a worked-out exercise in applying the Scala type system to solve a small-scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion. Read more…

Categories: Applications, Coding, Computer Science, Exciting Techniques, Mathematics, Programming, Tutorials
Tags: Automatic Differentiation, Conjugate Gradient, Dual Numbers, Geometric Median, Numeric Methods, Optimization, Scala, Steiner Tree

Recently, we had a client come to us with (among other things) the following question:

Who is more valuable, Customer Type A, or Customer Type B?

This client already tracked the net profit and loss generated by every customer who used his services, and had begun to analyze his customers by group. He was especially interested in Customer Type A; his gut instinct told him that Type A customers were quite profitable compared to the others (Type B) and he wanted to back up this feeling with numbers.

He found that, on average, Type A customers generate about $92 profit per month, and Type B customers average about $115 per month. (The data and figures that we are using in this discussion aren’t actual client data, of course, but a notional example.) He also found that while Type A customers make up about 4% of the customer base, they generate less than 4% of the net profit per month. So Type A customers actually seem to be less profitable than Type B customers. Apparently, our client was mistaken.

Or was he? Read more…
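The "less than 4%" figure follows directly from the two averages. The short Python check below assumes that every customer is either Type A or Type B, and it treats the rounded figures from the text (4%, $92, $115) as exact. The variable names are illustrative only.

```python
share_a = 0.04    # Type A fraction of the customer base (from the text)
mean_a = 92.0     # average monthly profit per Type A customer, in dollars
mean_b = 115.0    # average monthly profit per Type B customer, in dollars

# Expected monthly profit contributed per customer drawn from the whole base.
profit_a = share_a * mean_a
profit_b = (1 - share_a) * mean_b

# Fraction of total net profit that comes from Type A customers.
profit_share_a = profit_a / (profit_a + profit_b)
print(f"Type A share of net profit: {profit_share_a:.1%}")  # prints 3.2%
```

Because this calculation uses only the group means, it can only restate the comparison of averages. It says nothing about how profit is distributed within each group.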

Categories: Applications, Expository Writing, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Statistics To English Translation
Tags: customer value, lognormal distribution, long tail theory, McPhee's Theory of Exposure, median versus mean, power law distribution, Statistics

In the previous installment of the Statistics to English Translation, we discussed the technical meaning of the term “significant”. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance in research papers: statements like “

”.

As in the last article, we will concentrate on situations where we want to test the difference of means. You should read that previous article first, so you are familiar with the terminology that we use in this one.
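As a rough illustration of what testing a difference of means involves, the sketch below computes Welch's t statistic for two samples using only the Python standard library. This is one common choice, assumed here for illustration, and the article itself may use a different test. The function name `welch_t` and the sample values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    # Squared standard error of each mean: sample variance divided by sample size.
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    # Difference of means, measured in units of its estimated standard error.
    return (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)


# Invented measurements, for illustration only.
treated = [5.1, 4.9, 5.6, 5.8, 5.3]
control = [4.4, 4.8, 4.6, 5.0, 4.7]
t = welch_t(treated, control)
print(f"t = {t:.2f}")
```

The larger the magnitude of `t`, the harder it is to attribute the observed difference to chance alone. Turning `t` into a p-value additionally requires the t distribution, which this sketch leaves out.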

A pdf version of this current article can be found here.

Read more…

In this installment of our ongoing Statistics to English Translation series^{1}, we will look at the technical meaning of the term “significant”. As you might expect, what it means in statistics is not exactly what it means in everyday language.

As always, a pdf version of this article is available as well. Read more…

Scientists, engineers, and statisticians share similar concerns about evaluating the accuracy of their results, but they don’t always talk about it in the same language. This can lead to misunderstandings when reading across disciplines, and the problem is exacerbated when technical work is communicated to and by the popular media.

The “Statistics to English Translation” series is a new set of articles that we will be posting from time to time, as an attempt to bridge the language gaps. Our goal is to increase statistical literacy: we hope that you will find it easier to read and understand the statistical results in research papers, even if you can’t replicate the analyses. We also hope that you will be able to read popular media accounts of statistical and scientific results more critically, and to recognize common misunderstandings when they occur.

The first installment discusses some different accuracy measures that are commonly used in various research communities, and how they are related to each other. There is also a more legible PDF version of the article here.
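For readers who want to see how such measures relate, here is a small Python sketch that derives several of them from the same four confusion-matrix counts. The measures shown are the ones named in this post's tags. The function name `accuracy_measures` and the counts are invented for illustration.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Common accuracy measures computed from confusion-matrix counts.

    tp/fp: true/false positives; fn/tn: false/true negatives.
    """
    return {
        "precision": tp / (tp + fp),      # of the items flagged, how many were right
        "recall": tp / (tp + fn),         # of the true positives, how many were found
        "sensitivity": tp / (tp + fn),    # same quantity as recall, different name
        "specificity": tn / (tn + fp),    # of the true negatives, how many were cleared
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Invented counts, for illustration only.
measures = accuracy_measures(tp=40, fp=10, fn=20, tn=30)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

Recall and sensitivity are two names for the same quantity, which is the kind of cross-discipline vocabulary gap the series is about.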

Read more…

Categories: Applications, Expository Writing, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Statistics To English Translation
Tags: Accuracy Measures, Classifiers, Diagnostic Tests, Precision and Recall, ROC Curves, Sensitivity and Specificity, Statistics

REPOST (now in HTML in addition to the original PDF).

This paper demonstrates and explains some of the basic techniques used in data mining. It also serves as an example of some of the kinds of analyses and projects Win Vector LLC engages in. Read more…

We explore some of the ideas from the seminal paper “The Data-Enrichment Method” (Henry R Lewis, Operations Research (1957) vol. 5 (4) pp. 1-5). The paper explains a technique of improving the quality of statistical inference by increasing the effective size of the data-set. This is called “Data-Enrichment.”

Now more than ever, we must be familiar with the consequences of these important techniques, especially if we don’t know whether we might already be victims of them.

Read more…