
## Scatterplot matrices (pair plots) with cdata and ggplot2

In my previous post, I showed how to use the `cdata` package along with `ggplot2`'s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use `cdata` to produce a `ggplot2` version of a scatterplot matrix, or pairs plot?

A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base R graphics, you would use the `pairs()` function. Here is the base version of the pairs plot of the `iris` dataset:

```
pairs(iris[1:4],
      main = "Anderson's Iris Data -- 3 species",
      pch = 21,
      bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])
```

There are other ways to do this, too:

```
# not run

library(ggplot2)
library(GGally)
ggpairs(iris, columns = 1:4, aes(color = Species)) +
  ggtitle("Anderson's Iris Data -- 3 species")

library(lattice)
splom(iris[1:4],
      groups = iris$Species,
      main = "Anderson's Iris Data -- 3 species")
```

But I wanted to see if `cdata` was up to the task. So here we go…
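One way to attempt it (a sketch of the idea, not necessarily the final solution of this post): unpivot the four measurement columns twice with `cdata::unpivot_to_blocks`, join the two long forms on a row id, and facet on both key columns. The names `id`, `xv`, `yv`, `x`, and `y` below are my own choices, not anything from the post:

```r
library(cdata)     # assumes a cdata version providing unpivot_to_blocks()
library(ggplot2)

meas_vars <- colnames(iris)[1:4]
iris_id <- cbind(iris, id = seq_len(nrow(iris)))  # row id to join on later

# long form supplying the x-coordinate of each panel
dx <- unpivot_to_blocks(iris_id,
                        nameForNewKeyColumn   = "xv",
                        nameForNewValueColumn = "x",
                        columnsToTakeFrom     = meas_vars)

# long form supplying the y-coordinate
dy <- unpivot_to_blocks(iris_id,
                        nameForNewKeyColumn   = "yv",
                        nameForNewValueColumn = "y",
                        columnsToTakeFrom     = meas_vars)

# one row per (original row) x (variable pair): 150 * 4 * 4 rows
dat <- merge(dx, dy, by = c("id", "Species"))

ggplot(dat, aes(x = x, y = y, color = Species)) +
  geom_point() +
  facet_grid(yv ~ xv) +
  ggtitle("Anderson's Iris Data -- 3 species")
```

Each panel of the resulting `facet_grid` then holds the scatterplot of one variable pair, much as `pairs()` produces.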


## Faceted Graphs with cdata and ggplot2

In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the `iris` data.

I wanted a plot for petal dimensions and sepal dimensions, but I also felt that two plots took up too much space. So, I thought, why not make a faceted graph that shows both?

Except — which columns do I plot and what do I facet on?

```
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
```

Here’s one way to create the plot I want, using the `cdata` package along with `ggplot2`.
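A sketch of one such approach (assuming `cdata`'s `rowrecs_to_blocks` and a hand-written control table; the names `flower_part`, `Length`, and `Width` are my own choices): the control table maps each pair of wide columns to one block of rows, and the new key column becomes the faceting variable.

```r
library(cdata)
library(ggplot2)

# control table: one row per facet, mapping pairs of wide columns
# to the new (Length, Width) columns
controlTable <- wrapr::build_frame(
  "flower_part", "Length"      , "Width"       |
  "Petal"      , "Petal.Length", "Petal.Width" |
  "Sepal"      , "Sepal.Length", "Sepal.Width" )

iris_aug <- rowrecs_to_blocks(iris,
                              controlTable,
                              columnsToCopy = "Species")

ggplot(iris_aug, aes(x = Length, y = Width, color = Species)) +
  geom_point() +
  facet_wrap(~ flower_part, labeller = label_both, scales = "free")
```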


## Partial Pooling for Lower Variance Variable Encoding

In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in `vtreat`. In this article, we will discuss a little more about the how and why of partial pooling in `R`.

We will use the `lme4` package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The `lme4` documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.

> The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called random effects, a term that refers to the randomness in the probability model for the group-level coefficients….
>
> The term fixed effects is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…
>
> – Gelman and Hill 2007, Chapter 11.4

We will also restrict ourselves to the case that `vtreat` considers: partially pooled estimates of conditional group expectations, with no other predictors considered.
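As a concrete sketch of that restricted case (toy data of my own; a varying-intercept model with no other predictors):

```r
library(lme4)

set.seed(2017)
# toy data: five groups of very different sizes, drawn around group means
d <- data.frame(gp = rep(letters[1:5], times = c(50, 20, 10, 5, 2)))
d$y <- rnorm(nrow(d), mean = as.numeric(factor(d$gp)), sd = 2)

# no pooling: per-group sample means
unpooled <- tapply(d$y, d$gp, mean)

# partial pooling: varying-intercept model, no other predictors
fit <- lmer(y ~ 1 + (1 | gp), data = d)
pooled <- coef(fit)$gp[["(Intercept)"]]

# smaller groups are shrunk harder toward the grand mean
comparison <- data.frame(gp       = names(unpooled),
                         unpooled = as.numeric(unpooled),
                         pooled   = pooled)
comparison
```

Comparing the two columns shows the characteristic behavior of partial pooling: estimates for the small groups move substantially toward the overall mean, while the large groups' estimates barely change.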


## Custom Level Coding in vtreat

One of the services that the `R` package `vtreat` provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.

By default, `vtreat` level codes to the difference between the conditional means and the grand mean (`catN` variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (`catB` variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the `ranger` package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by `vtreat`’s coding. This often isn’t a problem, but sometimes it may be.
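To make the default `catN` coding concrete, here is a minimal base-R illustration of impact coding for a numeric outcome (toy data and names of my own; `vtreat` itself also handles significance estimation and cross-validation concerns not shown here):

```r
# toy data: categorical x, numeric outcome y
d <- data.frame(
  x = c("a", "a", "b", "b", "b", "c"),
  y = c(1,   3,   2,   4,   6,   10)
)

grand_mean <- mean(d$y)              # 26/6, about 4.33
cond_means <- tapply(d$y, d$x, mean) # a = 2, b = 4, c = 10

# catN-style impact code: conditional mean minus grand mean, per level
impact <- cond_means - grand_mean
# a is about -2.33, b about -0.33, c about 5.67

# encode the original column as a single numeric variable
d$x_catN <- impact[as.character(d$x)]
```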

So the data scientist may want to use a level coding different from what `vtreat` defaults to. In this article, we will demonstrate how to implement custom level encoders in `vtreat`. We assume you are familiar with the basics of `vtreat`: the types of derived variables, how to create and apply a treatment plan, etc.


## Introduction

In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivoting.

One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering”), is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner.

We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”), is harder to explain, as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations, we find pivoting can be usefully conceptualized as a simple single-row-to-single-row mapping followed by a grouped aggregation.
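To make that factoring concrete, here is a minimal base-R sketch (toy data and names my own): each thin row maps to a one-row wide frame that is `NA` everywhere except its own attribute's column, and a grouped aggregation then collapses each group to a single wide row.

```r
# thin (entity/attribute/value) form
thin <- data.frame(
  id    = c(1, 1, 2, 2),
  attr  = c("height", "weight", "height", "weight"),
  value = c(180, 75, 165, 60)
)

# step 1: single row -> single row; spread each value into its own column
rows <- lapply(seq_len(nrow(thin)), function(i) {
  r <- thin[i, ]
  out <- data.frame(id = r$id, height = NA_real_, weight = NA_real_)
  out[[as.character(r$attr)]] <- r$value
  out
})
expanded <- do.call(rbind, rows)

# step 2: grouped aggregation; collapse each id group to one wide row
wide <- aggregate(cbind(height, weight) ~ id, data = expanded,
                  FUN = max, na.rm = TRUE, na.action = na.pass)
wide
# id = 1: height 180, weight 75; id = 2: height 165, weight 60
```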


## A Simple Example of Using replyr::gapply

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns `measurement` and `process_that_produced_measurement`. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, `replyr::gapply`, from our latest package, `replyr`.
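The pattern itself is easy to sketch in base R (toy data of my own; `replyr::gapply` supplies the same idea for `dplyr`-style data sources):

```r
# toy "long" data: measurements tagged with the process that produced them
d <- data.frame(
  process     = rep(c("p1", "p2"), each = 4),
  measurement = c(10, 12, 11, 13, 20, 19, 21, 22)
)

# split: one data frame per process
pieces <- split(d, d$process)

# apply: summarize each piece independently
summaries <- lapply(pieces, function(p) {
  data.frame(process = p$process[1],
             mean_m  = mean(p$measurement),
             sd_m    = sd(p$measurement))
})

# combine: bind the per-process summaries back together
result <- do.call(rbind, summaries)
result  # p1 has mean 11.5, p2 has mean 20.5
```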

Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.

The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.


## Using replyr::let to Parameterize dplyr Expressions

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

```
dist_intervals(iris, "Sepal.Length", "Species")

# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375
```

For a specific data frame, with known column names, such a table is easy to construct using `dplyr::group_by` and `dplyr::summarize`. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in `dplyr` can get quite hairy, quite quickly. Try it yourself, and see.

Enter `let`, from our new package `replyr`.
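As a sketch of the idea (assuming `replyr::let`'s `alias`/`expr` interface; `VALUECOL`, `GROUPCOL`, and `mean_by_group` are placeholder names of my own): `let` substitutes the actual column names for the placeholders before the `dplyr` expression is evaluated.

```r
library(dplyr)
library(replyr)

# a reusable summary function over arbitrary value/grouping columns
mean_by_group <- function(d, valuecol, groupcol) {
  let(alias = list(VALUECOL = valuecol, GROUPCOL = groupcol),
      expr  = {
        d %>%
          group_by(GROUPCOL) %>%
          summarize(mean_value = mean(VALUECOL))
      })
}

mean_by_group(iris, "Sepal.Length", "Species")
```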


## Upcoming Talks

I (Nina Zumel) will be speaking at the Women Who Code Silicon Valley meetup on Thursday, October 27.

The talk is called Improving Prediction using Nested Models and Simulated Out-of-Sample Data.

In this talk I will discuss nested predictive models. These are models that predict an outcome or dependent variable (called y) using additional submodels that have also been built with knowledge of y. Practical applications of nested models include “the wisdom of crowds”, prediction markets, variable re-encoding, ensemble learning, stacked learning, and superlearners.

Nested models can improve prediction performance relative to single models, but they introduce a number of undesirable biases and operational issues, and when improperly used they are statistically unsound. However, modern practitioners have made effective, correct use of these techniques. In my talk I will give concrete examples of nested models, how they can fail, and how to fix failures. The solutions we will discuss include advanced data partitioning, simulated out-of-sample data, and ideas from differential privacy. The theme of the talk is that, with proper techniques, these powerful methods can be safely used.

John Mount and I will also be giving a workshop called A Unified View of Model Evaluation at ODSC West 2016 on November 4 (the premium workshop sessions), and November 5 (the general workshop sessions).

We will present a unified framework for predictive model construction and evaluation. Using this perspective we will work through crucial issues from classical statistical methodology, large data treatment, variable selection, ensemble methods, and all the way through stacking/super-learning. We will present R code demonstrating principled techniques for preparing data, scoring models, estimating model reliability, and producing decisive visualizations. In this workshop we will share example data, methods, graphics, and code.

I’m looking forward to these talks, and I hope some of you will be able to attend.


## Principal Components Regression, Pt. 3: Picking the Number of Components

In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in a more automated fashion.
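As a point of comparison, one common automated heuristic (not necessarily the approach this note takes) is to keep the smallest number of components that together explain a fixed fraction of the variance; a base-R sketch:

```r
# principal components of the iris measurements, on scaled data
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# fraction of variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# smallest k whose components together explain at least 95% of variance
k <- which(cum_var >= 0.95)[1]
k  # for iris this gives k = 2
```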