## Wanted: A Perfect Scatterplot (with Marginals)

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki. The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot”; it’s called a “scatterhist” in MATLAB, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.
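The rug option can be sketched as follows. This is a minimal example, assuming ggplot2 is installed; it uses the built-in `mtcars` data as a stand-in for the tips data, and the `sides` and `alpha` settings are illustrative choices.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the tips data used in this post
rug_plot = ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point() +
  geom_rug(sides="bl", alpha=0.5)  # rugs along the bottom and left axes

print(rug_plot)
```

Rugs mark where the points fall along each axis, but they don't show the shape of the two distributions the way a marginal histogram or density does.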

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

```
library(ggplot2)
library(ggExtra)

frm = read.csv("tips.csv")

plot_center = ggplot(frm, aes(x=total_bill,y=tip)) +
  geom_point() +
  geom_smooth(method="lm")

# default: type="density"
ggMarginal(plot_center, type="histogram")
```

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could. The `ggMarginal()` function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.

```
# our own (very beta) plot package: details later
library(WVPlots)

frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            smoothmethod="lm",
            annot_size=3,
            title="Tips vs. Total Bill")
```

You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the `ggMarginal` version. If you’re curious, the code is here. It relies on some functions in the file `sharedFunctions.R` in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.
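As for the annotation itself: for a linear fit with a single predictor, R-squared is just the square of the Pearson correlation, so the two numbers describe the same fit. Here is a minimal sketch in base R, again using the built-in `mtcars` data as a stand-in for the tips data. The variable names and label format are our own illustration, and not necessarily how `ScatterHist` builds its annotation.

```r
# mtcars ships with R; it stands in for the tips data here
fit = lm(mpg ~ wt, data=mtcars)

r_squared = summary(fit)$r.squared      # R-squared of the linear fit
pearson_r = cor(mtcars$wt, mtcars$mpg)  # Pearson correlation of x and y

# with a single predictor, R-squared equals the squared correlation
stopifnot(isTRUE(all.equal(r_squared, pearson_r^2)))

# a label that could be placed on a plot, e.g. with ggplot2's annotate()
fit_label = sprintf("R-squared = %.3f", r_squared)
print(fit_label)
```

The correlation keeps the sign of the relationship, which R-squared discards; R-squared reads directly as the fraction of variance explained by the fit.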

Thanks to Slawa Rokicki’s excellent ggplot2: Cheatsheet for Visualizing Distributions for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.