
# Wanted: A Perfect Scatterplot (with Marginals)

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki. The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot”; it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.
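For reference, here is a minimal sketch of the rug option. It assumes the built-in `mtcars` data as a stand-in (the variable names `wt` and `mpg` come from that dataset, not from the tips data used later).

```r
library(ggplot2)

# Assumption: mtcars stands in for the data; any two numeric columns work.
p_rug <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_rug(alpha = 0.5)  # tick marks along the axes show each marginal

print(p_rug)
```

A rug shows where the points fall along each axis, but it gives no direct sense of density, which is why the marginal histograms or density curves are appealing.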

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

```
library(ggplot2)
library(ggExtra)

frm = read.csv("tips.csv")

plot_center = ggplot(frm, aes(x=total_bill,y=tip)) +
  geom_point() +
  geom_smooth(method="lm")

# default: type="density"
ggMarginal(plot_center, type="histogram")
```

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could. The `ggMarginal()` function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.
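On the choice of annotation: for a linear fit with a single predictor, the R-squared is just the square of the Pearson correlation, so the two annotations carry the same information apart from the sign. The sketch below checks this on simulated data (the data and variable names are made up for illustration; this is not the tips dataset).

```r
# Assumption: simulated data, chosen only to illustrate the identity.
set.seed(1)
x <- runif(100, min = 5, max = 50)
y <- 1 + 0.1 * x + rnorm(100)

fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared  # what our plot reports
pearson_r <- cor(x, y)               # what the seaborn jointplot reports

# For simple linear regression, R-squared equals the squared correlation.
print(all.equal(r_squared, pearson_r^2))
```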

```
# our own (very beta) plot package: details later
library(WVPlots)

frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            smoothmethod="lm",
            annot_size=3,
            title="Tips vs. Total Bill")
```

You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the `ggMarginal` version. If you’re curious, the code is here. It relies on some functions in the file `sharedFunctions.R` in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.
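A possible alternative to padding the labels, sketched here under assumptions rather than taken from the WVPlots code: convert both plots to gtables with `ggplotGrob()` and give them the same column widths, so the panels line up regardless of label length. The example uses the built-in `mtcars` data as a stand-in, and it assumes both plots have the same layout (no legends, same number of gtable columns).

```r
library(ggplot2)
library(grid)

# Assumption: mtcars stands in for the data; both plots share the x variable.
top    <- ggplot(mtcars, aes(x = wt)) + geom_density()
center <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

g_top    <- ggplotGrob(top)
g_center <- ggplotGrob(center)

# Take the larger width for each column (axis labels, panel, margins, ...)
# and apply it to both gtables so the panels occupy the same horizontal span.
shared_widths   <- unit.pmax(g_top$widths, g_center$widths)
g_top$widths    <- shared_widths
g_center$widths <- shared_widths

# Stack the two gtables vertically and draw them.
stacked <- rbind(g_top, g_center, size = "first")
grid.newpage()
grid.draw(stacked)
```

This only aligns the panel edges; the two plots still need the same x-axis limits for the marginal to match the scatterplot point for point.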

Thanks to Slawa Rokicki’s excellent ggplot2: Cheatsheet for Visualizing Distributions for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.

## 9 thoughts on “Wanted: A Perfect Scatterplot (with Marginals)”

1. An easy way to try Nina’s plot is to install the package from GitHub (using devtools):

```
devtools::install_github("WinVector/WVPlots")
```

2. You might be able to do what you’re trying with ggplot2, gtable and grid. I did something requiring similar manipulations and alignments of multiple plot panels for my plot.qcc rewrite (`devtools::install_github("tomhopper/gcc_ggplot")`).

1. Thanks! We’ve been working with grid, but I haven’t tried gtable yet. I will check it out (and your code, too).

1. Thank you! As I said to the commenter above, we’ve not tried gtable. Thanks for the pointer, and I will check out your version as well.

3. D L McArthur says:

Though fancy can certainly be good, a more generalized panelplot might be of some interest. Here is a riff with marginals, borrowing from functionality in `asbio::panel.cor.res`. The robust confidence bounds (in green and grey), robust correlation coefficient and robust analogue of the t-test are from Rand Wilcox. Dotted verticals and horizontals are arithmetic means; red is the linear fit, while blue is the loess estimator (when appropriate). Though the x-axis labeling for binary items does need tweaking, this kind of automated panelplot, called with just one line, provides a lot of payback for datasets containing reasonable k’s and n’s.

1. Thanks for posting this! I’ve seen variations of pair plots like that before. I like them very much when I want to get a quick overview of several variables at once.

I sometimes use a version based on ggplot (package GGally, I think), but your base plot version with the additional annotations (linear fits/loess, means, robust bounds, etc.) is quite nice.

4. I’m glad you found my package easy to use! Your ScatterHist output looks really nice (haven’t looked at the code).

I never really thought of the use case of having both types of plots on top of each other. You’re welcome to submit a PR or open a GitHub issue if it’s something that you think more people will find useful :)

1. Thanks for stopping by! ggExtra was a good find for us. And I liked your marginal boxplot variation on ggMarginal, too.