Categories: math programming, Pragmatic Machine Learning, Statistics, Tutorials

Why do Decision Trees Work?

In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal of this article is to set up terminology so we can state in one or two sentences why decision trees tend to work well in practice.


Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

A Theory of Nested Cross Simulation

[Reader’s Note. Some of our articles are applied and some of our articles are more theoretical. The following article is more theoretical, and requires fairly formal notation to even work through. However, it should be of interest as it touches on some of the fine points of cross-validation that are quite hard to perceive or discuss without the notational framework. We thought about including some “simplifying explanatory diagrams” but so many entities are being introduced and manipulated by the processes we are describing we found equation notation to be in fact cleaner than the diagrams we attempted and rejected.]

Please consider either of the following common predictive modeling tasks:

  • Picking hyper-parameters, fitting a model, and then evaluating the model.
  • Variable preparation/pruning, fitting a model, and then evaluating the model.

In each case you are building a pipeline where “y-aware” (or outcome aware) choices and transformations made at each stage affect later stages. This can introduce undesirable nested model bias and over-fitting.

Our current standard advice to avoid nested model bias is either:

  • Split your data into 3 or more disjoint pieces, such as separate sets for variable preparation/pruning, model fitting, and model evaluation.
  • Reserve a test-set for evaluation and use “simulated out of sample data” or “cross-frame”/“cross simulation” techniques to simulate dividing data among the first two model construction stages.

The first practice is simple and computationally efficient, but statistically inefficient. This may not matter if you have a lot of data, as in “big data”. The second procedure is more statistically efficient, but is also more complicated and has some computational cost. For convenience the cross simulation method is supplied as a ready to go procedure in our R data cleaning and preparation package vtreat.
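As a minimal sketch of the first practice, the following base R code assigns every row of a data frame to exactly one of three disjoint pieces. The data, the stage names, and the variable names are hypothetical, and this is not the cross simulation procedure supplied by vtreat.

```r
# Sketch: split a data frame into three disjoint pieces, one per pipeline stage.
# The synthetic data and the stage names are assumptions for illustration.
set.seed(2016)
d <- data.frame(x = rnorm(90), y = rnorm(90))

# Randomly assign every row to exactly one stage.
stage <- sample(rep(c("prep", "fit", "eval"), length.out = nrow(d)))
pieces <- split(d, stage)

d_prep <- pieces[["prep"]]  # used only for y-aware variable preparation/pruning
d_fit  <- pieces[["fit"]]   # used only for fitting the model
d_eval <- pieces[["eval"]]  # used only for the final evaluation
```

Each stage sees only a third of the rows, which is the statistical inefficiency mentioned above.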

What would it look like if we insisted on using cross simulation or simulated out of sample techniques for all three (or more) stages? Please read on to find out.

CleanAllTheThings

Hyperbole and a Half copyright Allie Brosh (use allowed in some situations with attribution)

Categories: Coding, Computer Science, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics

A Simple Example of Using replyr::gapply

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, replyr::gapply, from our latest package, replyr.
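The Split-Apply-Combine pattern itself can be sketched in base R, without replyr. The small table and the column names below are made up for illustration; replyr::gapply is not used here.

```r
# Sketch of Split-Apply-Combine in base R on a hypothetical "long" table.
d <- data.frame(
  process = c("a", "a", "b", "b", "b"),
  measurement = c(1, 3, 10, 20, 30),
  stringsAsFactors = FALSE
)

# Split: one data frame per process.
parts <- split(d, d$process)

# Apply: summarize each piece on its own.
summaries <- lapply(parts, function(di) {
  data.frame(
    process = di$process[[1]],
    n = nrow(di),
    mean_measurement = mean(di$measurement),
    stringsAsFactors = FALSE
  )
})

# Combine: stack the per-process summaries into one table for comparison.
result <- do.call(rbind, summaries)
print(result)
```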


Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.

The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.
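For reference, the four measures named above can be computed from predicted probabilities and true outcomes as in the following sketch. The function name score_model, the 0.5 decision threshold, and the toy inputs are assumptions for illustration, not code from the article.

```r
# Sketch: deviance, accuracy, precision and recall for a binary classifier.
# score_model() is a hypothetical helper; threshold = 0.5 is an assumption.
score_model <- function(pred_prob, truth, threshold = 0.5) {
  pred_class <- pred_prob >= threshold
  tp <- sum(pred_class & truth)    # true positives
  fp <- sum(pred_class & !truth)   # false positives
  fn <- sum(!pred_class & truth)   # false negatives
  tn <- sum(!pred_class & !truth)  # true negatives
  data.frame(
    # deviance: -2 times the log likelihood of the observed outcomes
    deviance = -2 * sum(ifelse(truth, log(pred_prob), log(1 - pred_prob))),
    accuracy = (tp + tn) / length(truth),
    precision = tp / (tp + fp),
    recall = tp / (tp + fn)
  )
}

scores <- score_model(
  pred_prob = c(0.9, 0.4, 0.2, 0.6),
  truth = c(TRUE, TRUE, FALSE, FALSE)
)
print(scores)
```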


Categories: Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

The case for index-free data manipulation

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. This is how R data.frames describe themselves (try “str(data.frame(x=1:2))” in an R-console to see this) and is part of the tidy data manifesto. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation.
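The str() call mentioned above can be run as follows. The variable names d and description are only for illustration.

```r
# A data.frame describes itself in terms of observations (rows)
# and variables (columns).
d <- data.frame(x = 1:2)
str(d)

# Capture the same description as text, e.g. for inspection in a script.
description <- capture.output(str(d))
```

The first line of the description reports the number of observations and the number of variables.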

Tools like SQL (structured query language) and dplyr can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below.

Categories: Coding, Computer Science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics

Using replyr::let to Parameterize dplyr Expressions


Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

dist_intervals(iris, "Sepal.Length", "Species")

# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
      <fctr>    <dbl> <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

For a specific data frame, with known column names, such a table is easy to construct using dplyr::group_by and dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in dplyr can get quite hairy, quite quickly. Try it yourself, and see.
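To make the target concrete, here is a sketch of such a function written in base R rather than dplyr, with column names passed as strings. The name dist_intervals_base is hypothetical, and this is not the replyr::let solution the article goes on to describe. The interval definitions (mean plus/minus one standard deviation, median plus/minus half the IQR) are inferred from the table above.

```r
# Sketch: per-group summary table for an arbitrary value column and
# grouping column, both given by name. Base R only.
dist_intervals_base <- function(d, value_col, group_col) {
  groups <- split(d[[value_col]], d[[group_col]])
  rows <- lapply(names(groups), function(g) {
    v <- groups[[g]]
    data.frame(
      group = g,
      sdlower = mean(v) - sd(v),
      mean = mean(v),
      sdupper = mean(v) + sd(v),
      iqrlower = median(v) - IQR(v) / 2,
      median = median(v),
      iqrupper = median(v) + IQR(v) / 2,
      stringsAsFactors = FALSE
    )
  })
  res <- do.call(rbind, rows)
  names(res)[1] <- group_col  # label the first column with the grouping column's name
  res
}

res <- dist_intervals_base(iris, "Sepal.Length", "Species")
print(res)
```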

Enter let, from our new package replyr.


Categories: Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

vtreat data cleaning and preparation article now available on arXiv

Nina Zumel and I are happy to announce a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP].

vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include: infinity, NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). vtreat::prepare should be your first choice for real world data preparation and cleaning.
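To illustrate one of the problems listed above, the following base R sketch shows what can happen without such preparation when a new categorical level appears at application time. The data is made up, and vtreat itself is not used here; the sketch only demonstrates the failure that vtreat is described as defending against.

```r
# Sketch: a level seen during application, but not during training,
# makes a plain lm() model fail at prediction time.
train <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)
model <- lm(y ~ x, data = train)

# "d" never occurred in the training data.
application <- data.frame(x = c("a", "d"), stringsAsFactors = FALSE)

# Capture the error message instead of stopping the script.
outcome <- tryCatch(
  predict(model, newdata = application),
  error = function(e) conditionMessage(e)
)
print(outcome)
```

Here outcome holds an error message rather than predictions, because predict() refuses the unseen level.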

We hope this article will make getting started with vtreat much easier. We also hope this helps with citing the use of vtreat in scientific publications.

Categories: Coding, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming

New R package: replyr (get a grip on remote dplyr data services)

It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with dplyr in one modality and hope to move to another back-end without significant debugging and work-arounds. replyr attempts to provide a few helpful work-arounds.

Our new package replyr supplies methods to get a grip on working with remote tbl sources (SQL databases, Spark) through dplyr. The idea is to add convenience functions to make such tasks more like working with an in-memory data.frame. Results still do depend on which dplyr service you use, but with replyr you have fairly uniform access to some useful functions.


Categories: Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

Teaching Practical Data Science with R

Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of.

I have written before how I think this book stands out and why you should consider studying from it.


Please read on for some additional comments on the intent of different sections of the book.

Categories: Administrativia, data science, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

You should re-encode high cardinality categorical variables

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes.

In a sort of “burying the lede” way I feel we may not have sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.

[Figure: model performance (pseudo R-squared) on training data and on held-out test data]

In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, 1 being perfect and below zero worse than using an average). The training pane represents performance on the training data (perfect, but over-fit) and the test pane represents performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.
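A much smaller version of this effect can be reproduced with the sketch below. It is not the code behind the graph. It assumes a single categorical variable with 100 levels that carries no signal at all, an outcome of pure noise, and every level appearing once in training and once in test. All names are hypothetical.

```r
# Sketch: a high cardinality categorical variable with no real signal
# lets a linear model memorize the training data.
set.seed(2016)
n_levels <- 100
levels_x <- paste0("level_", seq_len(n_levels))

# Each level appears exactly once; y is pure noise, unrelated to x.
make_data <- function() {
  data.frame(x = levels_x, y = rnorm(n_levels), stringsAsFactors = FALSE)
}
train <- make_data()
test <- make_data()

model <- lm(y ~ x, data = train)

# Pseudo R-squared: 1 is perfect, below zero is worse than using the average.
pseudo_r2 <- function(pred, y) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

train_r2 <- pseudo_r2(predict(model, newdata = train), train$y)
test_r2 <- pseudo_r2(predict(model, newdata = test), test$y)
print(c(train = train_r2, test = test_r2))
```

The training score is essentially perfect because the model has one coefficient per row, while the test score shows that nothing useful was learned.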

Please read on for how to fix this.

Categories: Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

Some vtreat design principles

We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design.

Introduction

vtreat is something we really feel you should add to your predictive analytics or data science work flow.


vtreat getting a call-out from Dmitry Larko, photo Erin LeDell

vtreat’s design and implementation follows from a number of reasoned assumptions or principles, a few of which we discuss below.
