Posted on Categories Coding, data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , , 9 Comments on Wanted: A Perfect Scatterplot (with Marginals)

Wanted: A Perfect Scatterplot (with Marginals)

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki:

NewImage

The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

library(ggplot2)
library(ggExtra)
frm = read.csv("tips.csv")

plot_center = ggplot(frm, aes(x=total_bill,y=tip)) + 
  geom_point() +
  geom_smooth(method="lm")

# default: type="density"
ggMarginal(plot_center, type="histogram")

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.

PltggMarginal

The ggMarginal() function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.

# our own (very beta) plot package: details later
library(WVPlots)
frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            smoothmethod="lm",
            annot_size=3,
            title="Tips vs. Total Bill")

PlotPkg

You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the ggMarginal version. If you’re curious, the code is here. It relies on some functions in the file sharedFunctions.R in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.

Thanks to Slawa Rokicki’s excellent ggplot2: Cheatsheet for Visualizing Distributions for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.

Posted on Categories Coding, math programming, Statistics, TutorialsTags , , , , , , , 4 Comments on The Extra Step: Graphs for Communication versus Exploration

The Extra Step: Graphs for Communication versus Exploration

Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical.

One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of data in ways that improve both data exploration and communication. Of course, getting at the right graph can be a bit of work, and often I will stop when I get to a visualization that tells me what I need to know — even if no one can read that graph but me. In this post I’ll look at a couple of ggplot graphs that take the extra step: communicating effectively to others.

For my examples I’ll use a pre-treated sample from the 2011 U.S. Census American Community Survey. The dataset is available as an R object in the file phsample.RData; the data dictionary and additional information can be found here. Information about getting the original source data from the U.S. Census site is at the bottom of this post.

The file phsample.RData contains two data frames: dhus (household information), and dpus (information about individuals; they are joined to households using the column SERIALNO). We will only use the dhus data frame.

library(ggplot2)
load("phsample.RData")

# Restrict to non-institutional households
# (No jails, schools, convalescent homes, vacant residences)
hhonly = subset(dhus, (dhus$TYPE==1) &(dhus$NP > 0))

Continue reading The Extra Step: Graphs for Communication versus Exploration

Posted on Categories Coding, data science, Pragmatic Data Science, Statistics, TutorialsTags , , , , 3 Comments on Revisiting Cleveland’s The Elements of Graphing Data in ggplot2

Revisiting Cleveland’s The Elements of Graphing Data in ggplot2

I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) generated useful advice for designing effective graphics [1].

I confess I don’t always follow his advice. Sometimes it’s because I don’t agree with him, but also it’s because I use ggplot for visualization, and I’m lazy. I like ggplot because it excels at layering multiple graphics into a single plot and because it looks good; but deviating from the default presentation is often a bit of work. How much am I losing out on by this? I decided to do the work and find out.

Details of specific plots aside, the key points of Cleveland’s philosophy are:

  • A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
  • Visualization is an iterative process. Graph the data, learn what you can, and then regraph the data to answer the questions that arise from your previous graphic.

Of course, when you are your own viewer, part of the cognitive strain in visualization comes from difficulty generating the desired graphic. So we’ll start by making the easiest possible ggplot graph, and working our way from there — Cleveland style.

Continue reading Revisiting Cleveland’s The Elements of Graphing Data in ggplot2

Posted on Categories Exciting Techniques, Expository Writing, Mathematics, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , , 7 Comments on Good Graphs: Graphical Perception and Data Visualization

Good Graphs: Graphical Perception and Data Visualization

What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective visualization that neither obscures important details, or drowns us in confusing clutter? In 1968, William Cleveland published a text called The Elements of Graphing Data, inspired by Strunk and White’s classic writing handbook The Elements of Style . The Elements of Graphing Data puts forward Cleveland’s philosophy about how to produce good, clear graphs — not only for presenting one’s experimental results to peers, but also for the purposes of data analysis and exploration. Cleveland’s approach is based on a theory of graphical perception: how well the human perceptual system accomplishes certain tasks involved in reading a graph. For a given data analysis task, the goal is to align the information being presented with the perceptual tasks the viewer accomplishes the best. Continue reading Good Graphs: Graphical Perception and Data Visualization