
## Wanted: A Perfect Scatterplot (with Marginals)

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki.

The graph was produced in Python, using the seaborn package. Seaborn calls it a "jointplot"; it's called a "scatterhist" in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

```
library(ggplot2)
library(ggExtra)

frm = read.csv("tips.csv")

plot_center = ggplot(frm, aes(x=total_bill, y=tip)) +
  geom_point() +
  geom_smooth(method="lm")

# default: type="density"
ggMarginal(plot_center, type="histogram")
```

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.

The `ggMarginal()` function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.

```
# our own (very beta) plot package: details later
library(WVPlots)

frm = read.csv("tips.csv")
ScatterHist(frm, "total_bill", "tip",
            smoothmethod="lm",
            annot_size=3,
            title="Tips vs. Total Bill")
```

You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the `ggMarginal` version. If you’re curious, the code is here. It relies on some functions in the file `sharedFunctions.R` in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.

Thanks to Slawa Rokicki’s excellent ggplot2: Cheatsheet for Visualizing Distributions for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.


## R in a 64 bit world

32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?
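To make the limitation concrete: R's integers are 32-bit signed values, so they top out at 2^31 - 1, and integer arithmetic past that point yields `NA` (with a warning). A quick base-R illustration:

```r
# R integers are 32-bit signed: the largest representable value is 2^31 - 1
print(.Machine$integer.max)
## [1] 2147483647

# integer arithmetic that overflows returns NA (with a warning)
big <- .Machine$integer.max
print(suppressWarnings(big + 1L))
## [1] NA

# doubles represent integers exactly up to 2^53, the usual workaround
print(big + 1)
## [1] 2147483648
```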

We discuss this in the first of a new series of articles on "R as it is," which we are publishing in cooperation with Revolution Analytics. Continue reading R in a 64 bit world


## What is new in the vtreat library?

The Win-Vector LLC vtreat library (supplied under a GPL license) automates the simple, domain-independent part of variable cleaning and preparation.

The idea is that you supply (in R) an example general `data.frame` to vtreat's `designTreatmentsC` method (for single-class categorical targets) or `designTreatmentsN` method (for numeric targets), and vtreat returns a data structure that can be used to `prepare` data frames for training and scoring. A vtreat-prepared data frame is nice in the following senses:

• All result columns are numeric.
• No odd type columns (dates, lists, matrices, and so on) are present.
• No columns have `NA`, `NaN`, `+-infinity`.
• Categorical variables are expanded into multiple indicator columns with all levels present, which is a good encoding if you are using any sort of regularization in your modeling technique.
• Indicators are not encoded for rare levels (limiting the number of indicator columns on the translated `data.frame`).
• Categorical variables are also impact coded, so even categorical variables with very many levels (like zip-codes) can be safely used in models.
• Novel levels (levels not seen during design/train phase) do not cause `NA` or errors.

The idea is that vtreat automates a number of standard inspection and preparation steps that are common to all predictive analytic projects. This leaves the data scientist more time to work on important domain-specific steps. vtreat also leaves as much of the variable selection as possible to the downstream modeling software. The goal of vtreat is to reliably (and repeatably) generate a `data.frame` that is safe to work with.

This note explains a few things that are new in the vtreat library. Continue reading What is new in the vtreat library?


## What can be in an R data.frame column?

As an R programmer, have you ever wondered what can be in a `data.frame` column? Continue reading What can be in an R data.frame column?
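As a teaser: a `data.frame` column is not limited to atomic vectors. For instance, a column can be a list, with each cell holding a value of a different type and length:

```r
d <- data.frame(x = 1:2)

# assign a list as a column: each "cell" can hold anything
d$y <- list(1:3, "a string")

print(class(d$y))
## [1] "list"
print(d$y[[1]])
## [1] 1 2 3
```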


## How and why to return functions in R

One of the advantages of functional languages (such as R) is the ability to create and return functions “on the fly.” We will discuss one good use of this capability and what to look out for when creating functions in R. Continue reading How and why to return functions in R
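For a flavor of what the post covers, here is the classic minimal example of a function that builds and returns another function (the `make_adder` name is ours, for illustration):

```r
# a "function factory": make_adder returns a newly created function
make_adder <- function(n) {
  force(n)  # force evaluation of n now, avoiding lazy-evaluation surprises
  function(x) { x + n }
}

add5 <- make_adder(5)
print(add5(10))
## [1] 15
```

The `force(n)` call is one of the things to look out for: without it, `n` is evaluated lazily and may have changed by the time the returned function is first called.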


## Using closures as objects in R

For more and more clients we have been using a nice coding pattern taught to us by Garrett Grolemund in his book Hands-On Programming with R: make a function that returns a list of functions. This turns out to be a classic functional programming technique: use closures to implement objects (terminology we will explain).

It is a pattern we strongly recommend, but with one caveat: it can leak references, in a manner similar to the one described here. Once you work out how to stomp out the reference leaks, the "function that returns a list of functions" pattern is really strong.

We will discuss this programming pattern and how to use it effectively. Continue reading Using closures as objects in R
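A minimal version of the pattern looks like this (a toy counter of our own; the constituent functions share private state through their common enclosing environment):

```r
# a "constructor" that returns a list of functions sharing private state
new_counter <- function() {
  count <- 0
  list(
    increment = function() {
      count <<- count + 1  # <<- writes to the shared enclosing environment
      invisible(count)
    },
    get = function() { count }
  )
}

ctr <- new_counter()
ctr$increment()
ctr$increment()
print(ctr$get())
## [1] 2
```

Each call to `new_counter()` gets its own fresh environment, so independent counters do not interfere with one another.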


## Check your return types when modeling in R

Just a warning: double check your return types in R, especially when using different modeling packages. Continue reading Check your return types when modeling in R
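The defensive habit is cheap: assert the type you expect immediately after the modeling call, rather than discovering a matrix-where-you-expected-a-vector far downstream. A base-R illustration of the idea:

```r
d <- data.frame(x = 1:10, y = (1:10) * 2 + 0.5)
model <- lm(y ~ x, data = d)

pred <- predict(model, newdata = d)

# fail fast if the return type is not what downstream code assumes;
# other modeling packages may return a matrix or a list here
stopifnot(is.numeric(pred), is.null(dim(pred)))
print(length(pred))
## [1] 10
```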


## Factors are not first-class citizens in R

The primary user-facing data types in the R statistical computing environment behave as vectors. That is, one-dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so on), but the most common data types are vectors. In fact vectors are so common in R that scalar values such as the number `5` are actually represented as length-1 vectors. We commonly think about working over vectors of "logical", "integer", "numeric", "complex", "character", and "factor" types. However, a "factor" is not an R vector. In fact "factor" is not a first-class citizen in R, which can lead to some ugly bugs.

For example, consider the following R code.

```
levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'), levels=levels)
print(f)
## [1] c    a    a    <NA> b    a
## Levels: a b c
print(class(f))
## [1] "factor"
```
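You can see the non-vector nature directly: underneath, a factor is an array of integer codes with attributes, and `is.vector()` rejects it:

```r
f <- factor(c('c','a','a',NA,'b','a'), levels=c('a','b','c'))

print(typeof(f))     # the underlying storage is integer codes
## [1] "integer"
print(is.vector(f))  # the levels and class attributes disqualify it
## [1] FALSE
print(unclass(f))    # the raw codes, plus the levels attribute
```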

This example encodes a series of 6 observations into a known set of factor levels (`'a'`, `'b'`, and `'c'`). As is the case with real data, some of the positions might be missing/invalid values such as `NA`. One of the strengths of R is that we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level `'a'` was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.

```
fRevised <- ifelse(is.na(f), 'a', f)
print(fRevised)
## [1] "3" "1" "1" "a" "2" "1"
print(class(fRevised))
## [1] "character"
```

Notice the new column `fRevised` is an absolute mess (and not even of class/type factor). This sort of fix would have worked if `f` had been a vector of characters or even a vector of integers, but for factors we get gibberish.
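One safe repair (since `'a'` is already one of `f`'s declared levels) is to assign into the factor directly, which preserves both the class and the levels:

```r
f <- factor(c('c','a','a',NA,'b','a'), levels=c('a','b','c'))

fFixed <- f
fFixed[is.na(fFixed)] <- 'a'  # valid because 'a' is a declared level

print(fFixed)
## [1] c a a a b a
## Levels: a b c
print(class(fFixed))
## [1] "factor"
```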

We are going to work through some more examples of this problem. Continue reading Factors are not first-class citizens in R


## R style tip: prefer functions that return data frames

While following up on Nina Zumel’s excellent Trimming the Fat from glm() Models in R I got to thinking about code style in R. And I realized: you can make your code much prettier by designing more of your functions to return `data.frame`s. That may seem needlessly heavy-weight, but it has a lot of down-stream advantages. Continue reading R style tip: prefer functions that return data frames
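The style in question, in miniature (a hypothetical summarizing function; the names are ours, for illustration):

```r
# instead of returning a loose collection of values,
# package the results as a one-row data.frame
summarize_column <- function(x, name) {
  data.frame(variable = name,
             mean = mean(x),
             sd = sd(x),
             stringsAsFactors = FALSE)
}

# results from many calls row-bind into one tidy table
res <- rbind(summarize_column(1:10, "a"),
             summarize_column(101:110, "b"))
print(res)
```

Because each call returns a `data.frame`, combining, filtering, and printing results all use the standard frame operations, with no ad hoc bookkeeping.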