We have recently been working on, and presenting about, nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that the correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested ones, but they are easy to get wrong.

# Category: Tutorials

## Improved vtreat documentation

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here).

vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner.

## Free data science video lecture: debugging in R

We are pleased to release a new free data science video lecture: Debugging R code using R, RStudio, and wrapper functions. In this 8-minute video we demonstrate how R wrapper functions can catch errors for later reproduction and debugging. If you haven’t tried these techniques, they will *really* improve your debugging game.
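The core idea can be sketched as follows (a minimal illustration of the wrapper-function pattern, not the exact code from the video; `debug_wrap` is a hypothetical helper name):

```r
# Wrap a function so that, on error, the offending arguments are
# saved to disk for later reproduction and debugging.
debug_wrap <- function(fn, dumpfile = "last_error.RDS") {
  force(fn)
  function(...) {
    args <- list(...)
    tryCatch(
      do.call(fn, args),
      error = function(e) {
        # save the failing arguments and error message for later
        saveRDS(list(args = args, error = conditionMessage(e)), dumpfile)
        stop(e)
      }
    )
  }
}

# Example: a function that fails on negative input.
f  <- function(x) { if (x < 0) stop("negative input"); sqrt(x) }
fd <- debug_wrap(f)
fd(4)   # works normally, returns 2
# fd(-1) would raise the error *and* save the failing arguments;
# readRDS("last_error.RDS") then lets you re-run the exact failing call.
```

The point is that the failing situation is captured as data, so you can reproduce the error at will in an interactive session.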

All code and examples can be found here and in WVPlots.

## A bit on the F1 score floor

At the Strata+Hadoop World “R Day” Tutorial (Tuesday, March 29, 2016, San Jose, California), we spent some time on classifier measures derived from the so-called “confusion matrix.”

We repeated our usual admonition to not use “accuracy itself” as a project quality goal (business people tend to ask for it as it is the word they are most familiar with, but it usually isn’t what they really want).

One reason not to use accuracy: there are examples where a classifier that does nothing is “more accurate” than one that actually has some utility. (Figure credit: Nina Zumel; slides here.)
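The trap is easy to reproduce. Here is a small sketch in R (the counts are our own illustration, not the figures from the slides): with a rare positive class, the "always predict negative" classifier scores higher accuracy than a classifier that actually finds positives.

```r
# 100 cases, only 5 of which are positive (the interesting class).
y <- c(rep(1, 5), rep(0, 95))

# A classifier that does nothing: always predicts negative.
null_pred <- rep(0, 100)

# A useful classifier: catches 4 of the 5 positives,
# at the cost of 10 false positives.
useful_pred <- c(rep(1, 4), 0, rep(1, 10), rep(0, 85))

accuracy <- function(y, pred) mean(y == pred)
accuracy(y, null_pred)    # 0.95
accuracy(y, useful_pred)  # 0.89
```

By accuracy alone the do-nothing classifier "wins," even though it never finds a single positive case.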

And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more).
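For reference, the standard metrics are all simple functions of the four confusion-matrix counts. A quick sketch (the counts here are illustrative, not from the tutorial):

```r
# Confusion matrix counts: true/false positives and negatives.
TP <- 4; FN <- 1; FP <- 10; TN <- 85

precision   <- TP / (TP + FP)                 # 4/14
recall      <- TP / (TP + FN)                 # 4/5, also called sensitivity
specificity <- TN / (TN + FP)                 # 85/95
accuracy    <- (TP + TN) / (TP + TN + FP + FN)

# F1 is the harmonic mean of precision and recall: 8/19, about 0.421
f1 <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 3)
```

Because F1 is a harmonic mean, it is dragged down toward whichever of precision or recall is worse, which is part of what makes its behavior worth studying.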

Please read on to see what stood out.

## WVPlots: example plots in R using ggplot2

Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. The idea is: we sacrifice some of the flexibility and composability inherent to ggplot2 in R for a menu of prescribed presentation solutions (which we are sharing on Github).

For example, the plot below, showing both an observed discrete empirical distribution (as stems) and a matching theoretical distribution (as bars), is a built-in “one-liner.”
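The WVPlots call itself is a single line; to give a feel for the kind of display meant, here is a rough ggplot2 sketch of stems-plus-bars (our own illustration on simulated data, not the package's actual implementation):

```r
# Simulated discrete data: observed counts vs. a matching
# theoretical (Poisson) distribution.
set.seed(2016)
x <- rpois(500, lambda = 3)

obs <- as.data.frame(table(x))
obs$x <- as.numeric(as.character(obs$x))
obs$freq <- obs$Freq / sum(obs$Freq)          # empirical frequencies

theory <- data.frame(x = 0:max(obs$x))
theory$p <- dpois(theory$x, lambda = mean(x)) # theoretical probabilities

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot() +
    geom_col(data = theory, aes(x = x, y = p), fill = "grey80") +
    geom_segment(data = obs, aes(x = x, xend = x, y = 0, yend = freq)) +
    geom_point(data = obs, aes(x = x, y = freq)) +
    ggtitle("Observed (stems) vs. theoretical (bars)"))
}
```

The packaged version hides all of this setup behind one call, which is exactly the trade of flexibility for convenience described above.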

Please read on for some of the ideas and how to use this package.

## Upcoming Win-Vector LLC appearances

Win-Vector LLC will be presenting on statistically validating models using R and data science at:

- Strata+Hadoop World “R Day” Tutorial 9:00am–5:00pm Tuesday, March 29 2016, San Jose, California.
- ODSC San Francisco Meetup, 6:30pm–9:00pm Thursday, March 31, 2016, San Francisco, California.

We will share code and examples.

Registration required (and Strata is a paid conference). Please Tweet/forward. We hope to see you soon!

## More on preparing data

The Microsoft Data Science User Group just sponsored Dr. Nina Zumel’s presentation “Preparing Data for Analysis Using R.” Microsoft saw Win-Vector LLC’s ODSC West 2015 presentation “Prepping Data for Analysis using R” and generously offered to sponsor improving it and disseminating it to a wider audience.

We feel Nina really hit the ball out of the park, with over 400 new live viewers. Read more for links to even more free materials!

## Neglected optimization topic: set diversity

The mathematical concept of set diversity is a somewhat neglected topic in current applied decision sciences and optimization. We take this opportunity to discuss the issue.

## The problem

Consider the following problem: from a set of items `U = {x_1, …, x_n}`, pick a small subset `X = {x_i1, x_i2, …, x_ik}` such that there is a high probability that at least one of the `x` in `X` is a “success.” By success I mean some standard business outcome such as making a sale (in the sense of any of: propensity, appetency, up-selling, and uplift modeling), clicking an advertisement, adding an account, finding a new medicine, or learning something useful.

This is common in:

- Search engines. The user is presented with a page consisting of “top results” with the hope that one of the results is what the user wanted.
- Online advertising. The user is presented with a number of advertisements or enticements in the hope that one of them matches the user’s taste.
- Science. A number of molecules are simultaneously presented to a biological assay in the hope that at least one of them is a new drug candidate, or that the simultaneous set of measurements shows us where to experiment further.
- Sensor/guard placement. Overlapping areas of coverage don’t make up for uncovered areas.
- Machine learning method design. The random forest algorithm requires diversity among its sub-trees to work well, and tries to ensure this through both per-tree variable selection and re-sampling (some of these issues are discussed here).
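To make the idea concrete, here is a sketch of one simple diversity heuristic in R (our own illustration, not a method from the article): greedy "farthest point" selection, which repeatedly adds the item farthest from its nearest already-selected neighbor.

```r
# Greedily pick k diverse items given a symmetric distance matrix d.
diverse_pick <- function(d, k) {
  picked <- which.max(rowSums(d))   # seed with the most remote item
  while (length(picked) < k) {
    remaining <- setdiff(seq_len(nrow(d)), picked)
    # distance from each candidate to its nearest picked neighbor
    gain <- vapply(remaining, function(i) min(d[i, picked]), numeric(1))
    picked <- c(picked, remaining[which.max(gain)])
  }
  picked
}

# Five items on a line: three clumped near 0, one at 5, one at 10.
pts <- c(0, 0.1, 0.2, 5, 10)
d <- abs(outer(pts, pts, "-"))
diverse_pick(d, 3)   # spreads out: the items at 10, 0, and 5
```

Note that a top-3-by-any-single-score selection could easily return the three clumped items; a diversity-aware selection spreads the picks out, which is the point when successes within a clump are highly correlated.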

In this note we will touch on key applications and some of the theory involved. While our group specializes in practical data science implementations, applications, and training, our researchers experience great joy when they can re-formulate a common problem using known theory/math and the reformulation is game changing (as it is in the case of set-scoring).

(Figure: minimal spanning trees, the basis of one set diversity metric.)

## Using PostgreSQL in R: A quick how-to

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a *serverless* SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.

We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.

We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OS X, or Unix, use the instructions at PostgreSQL downloads. If you happen to be on a Mac, Postgres.app provides a “serverless” (or application-oriented) install option.

For the rest of this post, we give a quick how-to on using the `RPostgreSQL` package to interact with PostgreSQL databases in R.
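The basic round-trip looks like the sketch below (connection details such as `host`, `dbname`, `user`, and `password` are placeholders you must replace with your own; this assumes a running PostgreSQL server):

```r
library(RPostgreSQL)

# Connect to a local PostgreSQL database (placeholder credentials).
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 host = "localhost", port = 5432,
                 dbname = "testdb",
                 user = "ruser", password = "rpassword")

# Push a data.frame into the database as a table.
d <- data.frame(x = 1:3, y = c("a", "b", "c"))
dbWriteTable(con, "d", d, overwrite = TRUE)

# Use SQL for the data manipulation, pull the result back into R.
res <- dbGetQuery(con, "SELECT x, y FROM d WHERE x > 1")
print(res)

# Clean up.
dbDisconnect(con)
dbUnloadDriver(drv)
```

This is the "SQL Screwdriver" pattern in miniature: the heavy data handling happens in SQL, and only the (small) result comes back into R.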