In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal of this article is to set up terminology so we can state in one or two sentences why decision trees tend to work well in practice.

# Category: Statistics

## A Theory of Nested Cross Simulation

[Reader’s Note. Some of our articles are applied and some of our articles are more theoretical. The following article is more theoretical, and requires fairly formal notation to even work through. However, it should be of interest as it touches on some of the fine points of cross-validation that are quite hard to perceive or discuss without the notational framework. We thought about including some “simplifying explanatory diagrams” but so many entities are being introduced and manipulated by the processes we are describing we found equation notation to be in fact cleaner than the diagrams we attempted and rejected.]

Please consider either of the following common predictive modeling tasks:

- Picking hyper-parameters, fitting a model, and then evaluating the model.
- Variable preparation/pruning, fitting a model, and then evaluating the model.

In each case you are building a pipeline where “y-aware” (or outcome aware) choices and transformations made at each stage affect later stages. This can introduce undesirable nested model bias and over-fitting.

Our current standard advice to avoid nested model bias is either:

- Split your data into 3 or more disjoint pieces, such as separate variable preparation/pruning, model fitting, and model evaluation.
- Reserve a test-set for evaluation and use “simulated out of sample data” or “cross-frame”/“cross simulation” techniques to simulate dividing data among the first two model construction stages.

The first practice is simple and computationally efficient, but statistically inefficient. This may not matter if you have a lot of data, as in “big data”. The second procedure is more statistically efficient, but is also more complicated and has some computational cost. For convenience the cross simulation method is supplied as a ready to go procedure in our `R`

data cleaning and preparation package `vtreat`

.

What would it look like if we insisted on using cross simulation or simulated out of sample techniques for all three (or more) stages? Please read on to find out.

Hyperbole and a Half copyright Allie Brosh (use allowed in some situations with attribution)

## Data Preparation, Long Form and tl;dr Form

Data preparation and cleaning are some of *the* most important steps of predictive analytic and data science tasks. They are laborious, where most of the errors are made, your last line of defense against a wild data, and hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self contained or genteel as tweaking machine learning models or hyperparameter tuning; and that is one of the reasons data preparation represents such an important practical opportunity for improvement.

Photo: NY – http://nyphotographic.com/, License: Creative Commons 3 – CC BY-SA 3.0

Our group is distributing a detailed writeup of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what `vtreat`

does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-`R`

environments (such as `Python`

/`Pandas`

/`scikit-learn`

, `Spark`

, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form. Continue reading Data Preparation, Long Form and tl;dr Form

## Does replyr::let work with data.table?

I’ve been asked if the adapter “`let`

” from our `R`

package `replyr`

works with `data.table`

.

My answer is: it does work. I am not a `data.table`

user so I am not the one to ask if `data.table`

benefits a from a non-standard evaluation to standard evaluation adapter such as `replyr::let`

. Continue reading Does replyr::let work with data.table?

## Comparative examples using replyr::let

Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of `replyr::let`

makes such programming easier.

Archie’s Mechanics #2 (1954) copyright Archie Publications

(edit: great news! CRAN just accepted our `replyr 0.2.0`

fix release!)

Please read on for examples comparing standard notations and `replyr::let`

. Continue reading Comparative examples using replyr::let

## A Simple Example of Using replyr::gapply

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns `measurement`

and `process_that_produced_measurement`

. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, `replyr::gapply`

, from our latest package, `replyr`

.

Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.

The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.

## help(let, package=’replyr’)

A bit more on the `let`

wrapper from our replyr R package. Continue reading help(let, package=’replyr’)

## Organize your data manipulation in terms of “grouped ordered apply”

Consider the common following problem: compute for a data set (say the infamous `iris`

example data set) per-group ranks. Suppose we want the rank of `iris`

`Sepal.Length`

s on a per-`Species`

basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.

Iris, by Diliff – Own work, CC BY-SA 3.0, Link

In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”. Continue reading Organize your data manipulation in terms of “grouped ordered apply”

## The Case For Using -> In R

`R`

has a number of assignment operators (at least “`<-`

“, “`=`

“, and “`->`

“; plus “`<<-`

” and “`->>`

” which have different semantics).

The `R`

-style guides routinely insist on “`<-`

” as being the only preferred form. In this note we are going to *try* to make the case for “`->`

” when using magrittr pipelines. [edit: After reading this article, please be sure to read Konrad Rudolph’s masterful argument for using only “`=`

” for assignment. He also demonstrates a function to land values from pipelines (though that is not his preference). All joking aside, the value-landing part of the proposal does not violate current style guidelines.]

Don Quijote and Sancho Panza, by Honoré Daumier

Continue reading The Case For Using -> In R

## The case for index-free data manipulation

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This is how `R`

`data.frame`

s describe themselves (try “`str(data.frame(x=1:2))`

” in an `R`

-console to see this) and is part of the tidy data manifesto.

Tools like `SQL`

(structured query language) and `dplyr`

can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below. Continue reading The case for index-free data manipulation