On the other hand, there are situations where balancing the classes, or at least enriching the prevalence of the rarer class, might be desirable, or even necessary. Fraud detection, anomaly detection, or other situations where positive examples are hard to get can fall into this case. In this situation, I’ve suspected (without proof) that SVM would perform well, since the formulation of hard-margin SVM is pretty much distribution-free. Intuitively speaking, if both classes are far away from the margin, then it shouldn’t matter whether the rare class is 10% or 49% of the population. In the soft-margin case, of course, distribution starts to matter again, but perhaps not as strongly as with other classifiers like logistic regression, which explicitly encodes the distribution of the training data.

So let’s run a small experiment to investigate this question.

**Experimental Setup**

We used the ISOLET dataset, available at the UCI Machine Learning repository. The task is to recognize spoken letters. The training set consists of 120 speakers, each of whom uttered the letters A-Z twice; 617 features were extracted from the utterances. The test set is another 30 speakers, each of whom also uttered A-Z twice.

Our chosen task was to identify the letter “n”. This target class has a native prevalence of about 3.8% in both the test and training sets, and is to be identified from out of several other distinct co-existing populations. This is similar to a fraud detection situation, where a specific rare event has to be picked out of a population of disparate “innocent” events.

We trained our models against a training set where the target was present at its native prevalence; against training sets where the target prevalence was enriched by resampling to twice, five times, and ten times its native prevalence; and against a training set where the target prevalence was enriched to 50%. This replicates some plausible enrichment scenarios: enriching the rare class by a large multiplier, or simply balancing the classes. All training sets were the same size (N=2000). We then ran each model against the same test set (with the target variable at its native prevalence) to evaluate model performance. We used a threshold of 50% to assign class labels (that is, we labeled the data by the most probable label). To get a more stable estimate of how enrichment affected performance, we ran this loop ten times and averaged the results for each model type.
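
A toy version of this loop may make the setup concrete. This is a sketch only: synthetic data and a plain `glm` stand in for the ISOLET dataset and the three model types used in the actual experiment, and all names here are made up.

```r
set.seed(1)
# Synthetic stand-in for the real data: one informative feature, ~3.8% positives
mkData <- function(n) {
  y <- runif(n) < 0.038
  data.frame(x = rnorm(n) + ifelse(y, 1.5, 0), y = y)
}
train <- mkData(6000)
test  <- mkData(3000)

# For each training prevalence: resample to N=2000, fit,
# then score the un-enriched test set with a 50% threshold
accs <- sapply(c(0.038, 0.076, 0.19, 0.38, 0.50), function(p) {
  ntarget <- round(2000 * p)
  rows <- c(sample(which(train$y),  ntarget,        replace = TRUE),
            sample(which(!train$y), 2000 - ntarget, replace = TRUE))
  model <- glm(y ~ x, data = train[rows, ], family = binomial)
  pred  <- predict(model, newdata = test, type = "response") > 0.5
  mean(pred == test$y)   # accuracy on un-enriched test data
})
round(accs, 3)
```

In the real experiment this loop was repeated ten times per model type and the results averaged.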

We tried three model types:

- `cv.glmnet` from R package `glmnet`: Regularized logistic regression, with `alpha=0` (L2 regularization, or ridge). `cv.glmnet` chooses the regularization penalty by cross-validation.
- `randomForest` from R package `randomForest`: Random forest with the default settings (500 trees, `nvar/3`, or about 205 variables drawn at each node).
- `ksvm` from R package `kernlab`: Soft-margin SVM with the radial basis kernel and `C=1`.

Since there are many ways to resample the data for enrichment, here’s how I did it. The target variable is assumed to be TRUE/FALSE, with TRUE as the class of interest (the rare one). `dataf` is the data frame of training data, `N` is the desired size of the enriched training set, and `prevalence` is the desired target prevalence.

```
makePrevalence = function(dataf, target, prevalence, N) {
  # indices of TRUE/FALSE examples
  tset_ix = which(dataf[[target]])
  others_ix = which(!dataf[[target]])
  ntarget = round(N*prevalence)
  heads = sample(tset_ix, size=ntarget, replace=TRUE)
  tails = sample(others_ix, size=(N-ntarget), replace=TRUE)
  dataf[c(heads, tails),]
}
```
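
For example, here is a hypothetical call on a toy frame (the column name `isTarget` is made up for illustration; `makePrevalence` is repeated so the snippet runs standalone):

```r
# makePrevalence as defined above, repeated so this snippet is self-contained
makePrevalence = function(dataf, target, prevalence, N) {
  tset_ix = which(dataf[[target]])
  others_ix = which(!dataf[[target]])
  ntarget = round(N*prevalence)
  heads = sample(tset_ix, size=ntarget, replace=TRUE)
  tails = sample(others_ix, size=(N-ntarget), replace=TRUE)
  dataf[c(heads, tails),]
}

# Toy data frame with a rare TRUE class
set.seed(3525)
toy <- data.frame(x = rnorm(1000), isTarget = runif(1000) < 0.04)

# Enrich the target to 10% prevalence in a training set of size 2000
enriched <- makePrevalence(toy, 'isTarget', 0.10, 2000)
mean(enriched$isTarget)  # 0.10 by construction
```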

**Training at the Native Target Prevalence**

Before we run the full experiment, let’s look at how each of these three modeling approaches does when we fit models the obvious way — where the training and test sets have the same distribution:

```
## [1] "Metrics on training data"
##  accuracy precision   recall specificity         label
##    0.9985 1.0000000 0.961039     1.00000      logistic
##    1.0000 1.0000000 1.000000     1.00000 random forest
##    0.9975 0.9736842 0.961039     0.99896           svm
## [1] "Metrics on test data"
##  accuracy precision    recall specificity         label
## 0.9807569 0.7777778 0.7000000   0.9919947      logistic
## 0.9717768 1.0000000 0.2666667   1.0000000 random forest
## 0.9846055 0.7903226 0.8166667   0.9913276           svm
```

We looked at four metrics. *Accuracy* is simply the fraction of datums classified correctly. *Precision* is the fraction of datums classified as positive that really were; equivalently, it’s an estimate of the conditional probability of a datum being in the positive class, given that it was classified as positive. *Recall* (also called *sensitivity* or the true positive rate) is the fraction of positive datums in the population that were correctly identified. *Specificity* is the true negative rate, or one minus the false positive rate: the number of negative datums correctly identified as such.
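
As a concrete sketch, all four metrics can be computed from a confusion matrix in a few lines (this is an illustrative helper, not the code used in the experiment):

```r
# Compute the four metrics from logical vectors of
# predicted and actual class labels
classMetrics <- function(pred, truth) {
  tp <- sum(pred & truth)    # true positives
  fp <- sum(pred & !truth)   # false positives
  tn <- sum(!pred & !truth)  # true negatives
  fn <- sum(!pred & truth)   # false negatives
  c(accuracy    = (tp + tn) / length(truth),
    precision   = tp / (tp + fp),
    recall      = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

truth <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
pred  <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
classMetrics(pred, truth)
# accuracy 0.6, precision 0.5, recall 0.5, specificity 2/3
```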

As the table above shows, random forest did perfectly on the training data, and the other two did quite well, too, with nearly perfect precision/specificity and high recall. However, random forest’s recall plummeted on the hold-out set, to 27%. The other two models degraded as well (logistic regression more than SVM), but still managed to retain decent recall, along with good precision and specificity. Random forest also has the lowest accuracy on the test set (although 97% still *looks* pretty good, which is another reason accuracy is not always a good metric for evaluating classifiers: since the target prevalence in the data set is only about 3.8%, a model that always returned FALSE would have an accuracy of 96.2%!).

One could argue that if precision is the goal, then random forest is still in the running. However, remember that the goal here is to identify a rare event. In many such situations (like fraud detection) one would expect that high recall is the most important goal, as long as precision/specificity are still reasonable.

Let’s see if enriching the target class prevalence during training improves things.

**How Enriching the Training Data Changes Model Performance**

First, let’s look at accuracy.

The x-axis is the prevalence of the target in the training data; the y-axis gives the accuracy of the model on the test set (with the target at its native prevalence), averaged over ten draws of the training set. The error bars are the bootstrap estimate of the 98% confidence interval around the mean, and the values for the individual runs appear as transparent dots at each value. The dashed horizontal line represents the accuracy of a model trained at the target class’s true prevalence, which we’ll call the model’s *baseline performance*. Logistic regression degraded the most dramatically of the three models as target prevalence increased. SVM degraded only slightly. Random forest improved, although its best performance (when training at about 19% prevalence, or five times native prevalence) is only slightly better than SVM’s baseline performance, and its performance at 50% prevalence is worse than the baseline performance of the other two classifiers.

Logistic regression’s degradation should be no surprise. Logistic regression optimizes deviance, which is strongly distributional; in fact, logistic regression (without regularization) preserves the marginal probabilities of the training data. Since logistic regression is so well calibrated to the training distribution, changes in the distribution will naturally affect model performance.
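
One can check the marginal-probability property directly on a tiny made-up frame: for an unregularized logistic regression with an intercept, the fitted probabilities reproduce the training class balance (up to convergence tolerance).

```r
d <- data.frame(x = 1:6,
                y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE))
m <- glm(y ~ x, data = d, family = binomial)
p <- predict(m, type = 'response')
# The score equations force sum(y - p) = 0, so:
sum(p)    # 3, the number of TRUE examples (up to convergence tolerance)
mean(p)   # 0.5, the observed training prevalence
```

Enriching the training set changes that prevalence, so the model’s predicted probabilities shift with it.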

The observation that SVM’s accuracy stayed very stable is consistent with my surmise that SVM’s training procedure is not strongly dependent on the class distributions.

Now let’s look at precision:

All of the models degraded on precision, random forest the most dramatically (since it started at a higher baseline), SVM the least. SVM and logistic regression were comparable at baseline.

Let’s look at recall:

Enrichment improved the recall of all the classifiers, random forest most dramatically, although its best performance, at 50% enrichment, is not really any better than SVM’s baseline recall. Again, SVM’s recall moved the least.

Finally, let’s look at specificity:

Enrichment degraded all models’ specificity (i.e. they all make more false positives), logistic regression’s the most dramatically, SVM’s the least.

**The Verdict**

Based on this experiment, I would say that balancing the classes, or enrichment in general, is of limited value if your goal is to apply class labels. It did improve the performance of random forest, but mostly because random forest was a rather poor choice for this problem in the first place (It would be interesting to do a more comprehensive study of the effect of target prevalence on random forest. Does it often perform poorly with rare classes?).

Enrichment is not a good idea for logistic regression models. If you must do some enrichment, then these results suggest that SVM is the safest classifier to use, and even then you probably want to limit the amount of enrichment to less than five times the target class’s native prevalence — certainly a far cry from balancing the classes, if the target class is very rare.

**The Inevitable Caveats**

The first caveat is that we only looked at one data set, only three modeling algorithms, and only one specific implementation of each of these algorithms. A more thorough study of this question would consider far more datasets, and more modeling algorithms and implementations thereof.

The second caveat is that we were specifically supplying class labels, using a threshold. I didn’t show it here, but one of the notable issues with the random forest model when it was applied to hold-out data was that it no longer scored the datums along the full range of 0-1 (which it did on the training data); it generally maxed out at around 0.6 or 0.7. This possibly makes 0.5 a suboptimal threshold. The following graph was produced with a model trained with the target class at native prevalence, and evaluated on our test set.

The x-axis corresponds to different thresholds for setting class labels, ranging between 0.25 (more permissive about marking datums as positive) and 0.75 (less permissive about marking datums as positive). You can see that the random forest model (which didn’t score anything in the test set higher than 0.65) would have better accuracy with a lower threshold (about 0.3). The other two models have fairly close to optimal accuracy at the default threshold of 0.5. So perhaps it’s not fair to look at the classifier performance without tuning the thresholds. However, if you’re tuning a model that was trained on enriched data, you still have to calibrate the threshold on un-enriched data; in which case, you might as well train on un-enriched data, too. In the case of this random forest model, its best accuracy (at threshold=0.3) is about as good as random forest’s accuracy when trained on a balanced data set, again suggesting that balancing the training set doesn’t contribute much. Tuning the threshold may be enough.
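
The threshold sweep itself is simple to sketch (a hypothetical helper; in practice `scores` would be a model’s predicted probabilities on the test set, and the toy scores below just mimic a model that maxes out around 0.65):

```r
# Accuracy as a function of the labeling threshold
accuracyByThreshold <- function(scores, truth,
                                thresholds = seq(0.25, 0.75, by = 0.05)) {
  sapply(thresholds, function(th) mean((scores >= th) == truth))
}

# Toy example: 8 negatives, 2 positives, scores capped near 0.65
truth  <- c(rep(FALSE, 8), rep(TRUE, 2))
scores <- c(0.05, 0.1, 0.1, 0.2, 0.2, 0.25, 0.3, 0.35, 0.6, 0.65)
acc <- accuracyByThreshold(scores, truth)
names(acc) <- seq(0.25, 0.75, by = 0.05)
acc  # best accuracy at thresholds between ~0.4 and 0.6
```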

However, suppose we don’t need to assign class labels? Suppose we only need the score to sort the datums, hoping to sort most of the items of interest to the top? This could be the case when prioritizing transactions to be investigated as fraudulent. The exact fraud score of a questionable transaction might not matter — only that it’s higher than the score of non-fraudulent events. In this case, would enrichment or class balancing help? I didn’t try it (mostly because I didn’t think of it until halfway through writing this), but I suspect not.
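
If ranking is all that matters, a threshold-free metric such as AUC is the natural thing to compare across training prevalences. A quick sketch of AUC via the Mann-Whitney rank formulation (illustrative helper, not code from the experiment):

```r
# AUC as the probability that a random positive scores
# above a random negative (Mann-Whitney rank statistic)
auc <- function(scores, truth) {
  r <- rank(scores)
  npos <- sum(truth)
  nneg <- sum(!truth)
  (sum(r[truth]) - npos * (npos + 1) / 2) / (npos * nneg)
}

truth  <- c(FALSE, FALSE, TRUE, FALSE, TRUE)
scores <- c(0.1, 0.4, 0.35, 0.2, 0.8)
auc(scores, truth)  # 5/6: five of six positive/negative pairs ordered correctly
```

Because AUC depends only on the ordering of the scores, it is invariant to any monotone shift in the scores, which is one reason to suspect enrichment would not help here.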

**Conclusions**

- Balancing class prevalence before training a classifier does *not* improve classifier performance across the board.
- In fact, it is contraindicated for logistic regression models.
- Balancing classes or enriching target class prevalence may improve random forest classifiers.
- But random forest models may not be the best choice for very unbalanced classes.
- If target class enrichment is necessary (perhaps because of data scarcity issues), SVM may be the safest choice for modeling.

A knitr document of our experiment, along with the accompanying R markdown file, can be downloaded here, along with a copy of the ISOLET data.


We designed the course as an introduction to an advanced topic. The course description is:

The R language provides a way to tackle day-to-day data science tasks, and this course will teach you how to apply the R programming language and useful statistical techniques to everyday business situations.

With this course, you’ll be able to use the visualizations, statistical models, and data manipulation tools that modern data scientists rely upon daily to recognize trends and suggest courses of action.

- Use R and RStudio
- Master Modeling and Machine Learning
- Load, Visualize, and Interpret Data

This course is designed for those who are analytically minded and are familiar with basic statistics and programming or scripting. Some familiarity with R is strongly recommended; otherwise, you can learn R as you go.

You’ll learn applied predictive modeling methods, as well as how to explore and visualize data, how to use and understand common machine learning algorithms in R, and how to relate machine learning methods to business problems.

All of these skills will combine to give you the ability to explore data, ask the right questions, execute predictive models, and communicate your informed recommendations and solutions to company leaders.

This course begins with a walk-through of a template data science project before diving into the R statistical programming language.

You will be guided through modeling and machine learning. You’ll use machine learning methods to create algorithms for a business, and you’ll validate and evaluate models.

You’ll learn how to load data into R and learn how to interpret and visualize the data while dealing with variables and missing values. You’ll be taught how to come to sound conclusions about your data, despite some real-world challenges.

By the end of this course, you’ll be a better data analyst because you’ll have an understanding of applied predictive modeling methods, and you’ll know how to use existing machine learning methods in R. This will allow you to work with team members in a data science project, find problems, and come up with solutions.

You’ll complete this course with the confidence to correctly analyze data from a variety of sources, while sharing conclusions that will make a business more competitive and successful.

The course will teach students how to use existing machine learning methods in R, but will not teach them how to implement these algorithms from scratch. Students should be familiar with basic statistics and basic scripting/programming.

The course has a different emphasis than our book Practical Data Science with R and *does not* require the book.

Most of the course materials are freely available from GitHub in the form of pre-prepared knitr workbooks.

Figure: the standard SVM margin diagram, this time with some un-marked data added.

I spend a lot of my time writing and teaching about the proper use and consequences of choosing different machine learning techniques in data science projects. Some of the experience comes from working with our clients (you don’t need a theory to tell you random forest *can* in fact overfit after you see it actually do so on client data, though it does pay to follow-up on such things). Studying implementation details is in fact useful, but it is only one source of insight. It is also an already over-represented teaching choice, and isn’t always the best first exposure for all students.

That being said, my background is as a “hacking theorist.” I do toy with experimental side implementations (some public examples here and here) and even more I like pushing some math around to find the edges of what is possible (see here).

Along these lines, over the holiday I decided to re-study support vector machines, from primary and secondary sources. I wanted to see what was originally claimed, what the original proof ideas were, and try to see what was left open. What I found is that the proof chains are a bit longer than I had hoped, and I feel we should really thank the researchers who took the trouble to re-specialize and re-write all of the proofs into a linear sequence of arguments (instead of merely citing). In particular I came to re-appreciate an item already in my library: Cristianini, N. and Shawe-Taylor, J., *An Introduction to Support Vector Machines*, Cambridge, 2000.

That being said: here are my new notes on the original proofs that large margin establishes low VC dimension (which in turn establishes good generalization error). To my mind there are a few twists and surprises that will have (necessarily) been smoothed over in any first course on support vector machines.


We consider ourselves pretty familiar with R. We have years of experience, many other programming languages to compare R to, and we have taken Hadley Wickham’s Master R Developer Workshop (highly recommended). We already knew R’s `predict` function is pretty idiosyncratic (it takes different arguments per model type, and returns different types depending on model and arguments, which is why we wrapped it in our Bad Bayes article).

But here is an unnecessarily nasty puzzle we ran into recently.

```
library('mgcv')
library('ROCR')
d <- data.frame(x=1:10,y=(1:10)>=5)
model <- lm(y~x,data=d)
d$predLM <- predict(model,type='response')
plot(performance(prediction(d$predLM,d$y),'tpr','fpr'))
model <- gam(y~x,family=binomial,data=d)
d$predGAM <- predict(model,type='response')
plot(performance(prediction(d$predGAM,d$y),'tpr','fpr'))
## Error in plot(performance(prediction(d$predGAM, d$y), "tpr", "fpr")) :
##   error in evaluating the argument 'x' in selecting a method for
##   function 'plot': Error in prediction(d$predGAM, d$y) :
##   Format of predictions is invalid.
```

It is a silly example, but one really wonders why the plot of the `lm` model works and the *exact same code* fails to plot the `gam` model. Now (as with most runtime bugs brought on by overly dynamic languages) we ran into this problem while in the middle of doing something else (while doing data analysis, not while coding). So we were not in the right frame of mind to deduce the solution without further experiment.

Now that we are calm we can try and look for the problem. The first step of effective debugging is to put aside what you had been working on and admit you are now debugging. So you write in your notebook what you had been trying to do, and temporarily clear that from your mind.

Professor Norman Matloff describes debugging as:

Finding your bug is a process of confirming the many things you believe are true, until you find one which is not true.

What do we believe? We believe that `d$predLM` and `d$predGAM` should both give us a plot. So in some sense we believe they have the same structure. They superficially look to have the same structure:

```
> print(d)
    x     y      predLM      predGAM
1   1 FALSE -0.05454545 2.220446e-16
2   2 FALSE  0.09090909 2.220446e-16
3   3 FALSE  0.23636364 2.220446e-16
4   4 FALSE  0.38181818 2.085853e-10
5   5  TRUE  0.52727273 1.000000e+00
6   6  TRUE  0.67272727 1.000000e+00
7   7  TRUE  0.81818182 1.000000e+00
8   8  TRUE  0.96363636 1.000000e+00
9   9  TRUE  1.10909091 1.000000e+00
10 10  TRUE  1.25454545 1.000000e+00
```

Let’s look closer:

```
> print(d$predLM)
 [1] -0.05454545  0.09090909  0.23636364  0.38181818  0.52727273  0.67272727  0.81818182
 [8]  0.96363636  1.10909091  1.25454545
> print(d$predGAM)
           1            2            3            4            5            6            7
2.220446e-16 2.220446e-16 2.220446e-16 2.085853e-10 1.000000e+00 1.000000e+00 1.000000e+00
           8            9           10
1.000000e+00 1.000000e+00 1.000000e+00
```

That is weird: `print` formats them differently. Let’s see what these items really are.

```
> print(typeof(d$predGAM))
[1] "double"
> print(typeof(d$predLM))
[1] "double"
> print(class(d$predLM))
[1] "numeric"
> print(class(d$predGAM))
[1] "array"
> print(str(d$predLM))
 num [1:10] -0.0545 0.0909 0.2364 0.3818 0.5273 ...
NULL
> print(str(d$predGAM))
 num [1:10(1d)] 2.22e-16 2.22e-16 2.22e-16 2.09e-10 1.00 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:10] "1" "2" "3" "4" ...
NULL
```

Using all of `typeof`, `class`, and `str` (which we didn’t know about when we wrote Survive R) gives us the story. `d$predGAM` isn’t a vector in R’s specific peculiar sense of the word:

```
> is.vector(d$predLM)
[1] TRUE
> is.vector(d$predGAM)
[1] FALSE
```

Had we known to look, we could have found the problem in one step with `str(d)`:

```
> print(str(d))
'data.frame':   10 obs. of  4 variables:
 $ x      : int  1 2 3 4 5 6 7 8 9 10
 $ y      : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
 $ predLM : num  -0.0545 0.0909 0.2364 0.3818 0.5273 ...
 $ predGAM: num [1:10(1d)] 2.22e-16 2.22e-16 2.22e-16 2.09e-10 1.00 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ : chr  "1" "2" "3" "4" ...
NULL
```

R’s type system is strange. `typeof` returns what primitive type is used to *implement* the item at hand (in this case a vector of doubles). `class` returns what classes have been declared for this item.

To a computer scientist: `d$predGAM` is a double vector that has some additional attributes (such as the `array` class declaration and shape parameters). It would commonly be thought of as a derived or refined type of vector. The writers of `mgcv` were probably thinking in these terms and figured it is okay to return a more refined type than one would expect from the generic `predict` signature. This is how most object-oriented languages work. It is hard to call this code incorrect when the `help(predict)` documentation (for the generic base-method) is:

To R: `d$predGAM` is an array, which is a class that is different than a double vector (though it is implemented in terms of a double vector, similar to C++’s idea of private or implementation inheritance). In R, arrays support most of the same operations as vectors and can even interoperate with them (you can add an array to a vector). However, the `ROCR` package likely explicitly checked (at runtime) the types of its arguments. This is an actual correct instance of irony: an added type safety check (meant to defend against, and give good error messages in the case of, unexpected types) triggered an error on a type that probably could have quietly been used safely. (Note: in general I *very much like* such checks; they tend to cut down on errors and move detection of errors much closer to error origins, making debugging and maintenance much easier.)
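
In this situation the pragmatic fix is small: strip the extra attributes before handing the scores to `ROCR`, for example with `as.numeric()`. A minimal sketch that reproduces the shape of the problem (a hand-built array standing in for the actual `gam` output):

```r
# A double vector carrying the array class and dimnames,
# mimicking what predict.gam returned
predGAM <- array(c(0.1, 0.9), dimnames = list(c('1', '2')))
is.vector(predGAM)               # FALSE: the dim attribute disqualifies it
# as.numeric() drops the attributes, leaving a plain double vector
is.vector(as.numeric(predGAM))   # TRUE
```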

The two packages failed to interoperate because they happened to have slightly incompatible (but likely each internally consistent) visions of the R type system.

A final question is: why didn’t stuffing the value into a `data.frame` coerce the types or get rid of some of the additional annotations? The reason is that in R data.frames are in fact lists of columns; they only appear to be two-dimensional row-oriented structures due to some clever over-riding of the two-dimensional `[,]` operator. Despite expectations, data.frame columns are not always simple primitive vectors and can hold complex composite objects such as `POSIXlt` (which would break a lot more code if it didn’t have a built-in conversion to numeric).

Here we are extracting an appendix: “Soft margin is not as good as hard margin.” In it we build a toy problem that is not large-margin separated and note that if the dimension of the concept space you were working in was not obvious (i.e. you were forced to rely on the margin derived portion of generalization bounds) then generalization improvement for a soft margin SVM is much slower than you would expect given experience from the hard margin theorems. The punch-line is: every time you get eight times as much training data you only halve your expected excess generalization error bound (whereas once you get below a data-set’s hard-margin bound you expect one to one reduction of the bound with respect to training data set size). What this points out is: the soft margin idea can simulate margin, but it comes at a cost.
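
Spelling out the claimed rates (my paraphrase of the bound scaling, not formulas from the extract): if the soft-margin excess-error bound scales as $m^{-1/3}$ in the training set size $m$, while the hard-margin bound scales as $m^{-1}$, then

```latex
\epsilon_{\text{soft}}(m) = C\,m^{-1/3}
  \;\Rightarrow\;
  \epsilon_{\text{soft}}(8m) = C\,(8m)^{-1/3} = \tfrac{1}{2}\,\epsilon_{\text{soft}}(m),
\qquad
\epsilon_{\text{hard}}(m) = C'\,m^{-1}
  \;\Rightarrow\;
  \epsilon_{\text{hard}}(2m) = \tfrac{1}{2}\,\epsilon_{\text{hard}}(m).
```

so eight times the data halves the soft-margin bound, while merely doubling the data halves the hard-margin bound.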

The PDF extract is here and the work in iPython notebook form is here (though that obviously would take some set-up to run).

R’s “`[,]`” operator is a bit more irregular than I remembered.
The subsetting section of Advanced R has a very good discussion on the subsetting and selection operators found in R. In particular it raises the important distinction of two simultaneously valuable but incompatible desiderata: simplification of results versus preservation of results.

The issue is: when you pull a single row or column out of R’s most important structure (the data frame) do you get a data frame, a list, or a vector? Not all code that works on one of these types works equivalently across all of these types, so this can be a serious issue. We have written about this before (see selection in R). But it wasn’t until we got more into teaching (and co-authored the book Practical Data Science with R) that we really appreciated how confusing this can be for the beginner.

Let’s start with an example.

```
> d <- data.frame(x=c(1,2),y=c(3,4))
> print(d)
  x y
1 1 3
2 2 4
> print(d[1,])
  x y
1 1 3
> print(d[,1])
[1] 1 2
```

What we see is: when using the two-argument `[,]` extract operator on a simple data frame:

- Extracting a single row returns a data frame (confirm with the `class()` method).
- Extracting a single column returns a vector (instead of a data frame).

And this is pretty much what a user sitting in front of an interactive system would want: simplification on columns and preservation on rows. And this is compatible with R’s history as an interactive analysis system (versus as a batch programming language, as outlined here).

Where we run into trouble is when we are writing code that we expect to run correctly in all situations (even when we are not watching). Consider the following example.

```
> selector1 <- c(TRUE,FALSE)
> selector2 <- c(TRUE,TRUE)
> print(d[,selector1])
[1] 1 2
> print(d[,selector2])
  x y
1 1 3
2 2 4
```

In the first case our boolean selection vector returned a vector, and in the second case it returned a data frame. Believe it or not, this is a problem. If we were reading this code and the values of `selector1` and `selector2` were set somewhere else (say as the result of a complicated calculation) we would have no way of knowing what type would be returned by `d[,selector1]`. This is true even if we were lucky enough to have documentation asserting `selector1` and `selector2` are logical vectors of the correct length.

At runtime we can see how many positions of `selector1` are set to `TRUE`. But we can’t reliably infer this count from looking at just an isolated code snippet. So we would not know at coding time what code would be safe to apply to the result `d[,selector1]`. Changing the return type based on mere variation of argument value (not argument type) is a very bad thing in terms of readability. A code reader can’t set simple (non data-dependent) expectations on the code. Nor can they use assumed pre-conditions known about the inputs (such as documented type) to establish useful post-conditions (guaranteed behavior of the code).
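
The defensive idiom is to always pass `drop=FALSE` when the column selection is computed, so the return type is a data frame regardless of how many columns survive:

```r
d <- data.frame(x = c(1, 2), y = c(3, 4))
selector1 <- c(TRUE, FALSE)
selector2 <- c(TRUE, TRUE)

class(d[, selector1])                # "numeric": simplified to a vector
class(d[, selector1, drop = FALSE])  # "data.frame"
class(d[, selector2, drop = FALSE])  # "data.frame": same type either way
```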

Why should we care about prior expectations? Can’t we just consider those uninformed presumptions and teach past them? To my mind this violates some concepts of efficient learning and teaching. In my opinion there is no such thing as passive learning (or completely pure teaching). Students learn by thinking and base their expectations for new material on generalizing and regularizing lessons from older material. The more effective students can be at this, the faster they learn.

Also, pity the student who makes a mistake while trying to learn about the square-bracket extraction operator through the R help system. If they accidentally type `help('[')` instead of `help('[.data.frame')`, then they see the following confusing help.

Figure: `help('[')`.

Instead of seeing the relevant definition, which is as follows.

Figure: `help('[.data.frame')`.

Notice the first help implies there is an argument called `drop` that defaults to `TRUE`. This is true for matrices (what that help page is talking about), but false for data frames (the central class of R; nobody should choose R for the matrix operations). You could (informally) think of `[.data.frame` as being a specialization of the base `[` in the sense of object-oriented inheritance. Except, it is considered very bad form to change the semantics or rules when extending types and operators. The expectations set in the base class (and especially those set in the base-class documentation) should hold in derived classes and methods.

We can confirm `[.data.frame,]` does not act like either of `[.data.frame,,drop=TRUE]` or `[.data.frame,,drop=FALSE]`. It picks its own behavior depending on whether you end up with a single column or not (note: I didn’t say “whether you picked a single column or not”). The code below shows some of the variations in behavior.

```
> print(d[1,])
  x y
1 1 3
> print(d[,1])
[1] 1 2
> print(d[1,,drop=TRUE])
$x
[1] 1

$y
[1] 3

> print(d[,1,drop=TRUE])
[1] 1 2
> print(d[1,,drop=FALSE])
  x y
1 1 3
> print(d[,1,drop=FALSE])
  x
1 1
2 2
```

Notice how none of the complete results of these three experiments (running without the drop argument, running with it set to `TRUE`, and running with it set to `FALSE`) entirely match any of the others.

Also you can trigger the “only one column causes type conversion” issue even when you are not selecting on columns (in fact even when selecting the entire data frame!):

```
> d1 <- data.frame(x=c(1,2))
> print(d1)
  x
1 1
2 2
> print(d1[c(TRUE,TRUE),])
[1] 1 2
```

This is a good point to return to the article about the historic context and influences of R, which gives us the following quote:

Pat begins with how R began as an experimental offshoot from S (there’s an adorable 1990’s-era photo of R’s creators Ross Ihaka and Robert Gentleman in Auckland on page 23, reproduced below), and then evolved into a language used first interactively, and then for programming. The tensions between the two modes of use led to some of the quirkier aspects of R. (Pat’s moral: “if you want to create a beautiful language, for god’s sake don’t make it useful”.)

How would I like R to behave if it evolved anew and didn’t have to support older code? I’d like (but know I can’t have) the following:

- `[,]` is reserved to select sets of rows and columns and by default guarantees “preserving” behavior in all cases (i.e. all variations of `[,]` default to `drop=FALSE`).
- `[[]]` is reserved for extracting a single item and is “simplifying”.
- To extract a single column as a vector from a data frame you must use the single-argument list operator `[[]]`.
- In all cases `[[]]` signals an error if you do not select exactly one element.

When I say I want these things, understand this means both that I already know this is not the way they are and that I know (for practical reasons) they cannot be changed to be so. The fact that none of the above statements is currently true will come as a surprise to many R users. For example, it is widely thought that `[[]]` behaves everywhere as it behaves on lists: properly signaling errors if you try to select more than one element. Notice this does not turn out to be the case. For vectors and lists we have good error-indicating behaviors:

```
> c(1,2,3)[[c(1,2)]]
Error in c(1, 2, 3)[[c(1, 2)]] : attempt to select more than one element
> list(1,2,3)[[c(1,2)]]
Error in list(1, 2, 3)[[c(1, 2)]] : subscript out of bounds
> list(1,2,3)[[2]]
[1] 2
```

For data frames we have a less desirable “anything goes” situation:

```
> d[[c(1,2)]]
[1] 2
```

Remember: a situation that should have signaled an error and did not is worse than a situation with a signaling error. (Note: `subset(d1,x==1,select=c('x'))` seems to reliably avoid unwanted simplification, but is not advised as it invokes non-standard evaluation issues. Look at `getS3method('subset','data.frame')` for details.)

Data frames are guaranteed to be lists of columns (a publicly exposed implementation detail, a bit obscured by the fact that the derived two-argument operator `[,]` superficially appears to be row-oriented). So we would expect `d[[c(1,2)]]` to properly error-out as it does for lists. However, it appears to behave more like a two-dimensional index operator. Probably some code is using this, but it is a pretty clear violation of expectations (especially for a new student). Repeating: data frames are lists of columns (you can check this with `unclass(d)`) and this is not a hidden implementation detail (it is commonly discussed and expected). ~~However the `[[.data.frame` operator has extended or overridden behavior that is different from any notional base-`[[` method/operator.~~ (Please see comments below for corrections on `d[[c(1,2)]]`.)

One of the reasons we need two extraction operators (`[]` and `[[]]`) is: R does not expose true scalar types (even the number `3` is a length-1 vector), so we have no convenient way to signal (even using runtime types) whether we thought we were coding a set-based extraction (through a set/vector of indices or a vector of booleans) or a scalar-based extraction (through a single index, the case where simplification is most likely to be desirable). It is likely the designers understood that return types changing on a mere change in the values of arguments (and not on more fundamental changes in the types of arguments) is confusing and undesirable (as it eliminates any chance at pure type-to-type reasoning), and that this is what led to S/R having so many extraction/selection operators. They saw the need to isolate and document different behaviors. However, these abstractions turn out to be a bit leaky.

For my part I teach designing your code assuming you had simple regular versions of the above operators, and then implementing defensively (specifying `drop`, and preferring `subset()` and `[[]]` to `[]`) to ensure you get good regular behavior.

The related concepts from the two articles are:

- limitations of Random Test/Train splits: a randomized split of data into test and training is generally a good idea. However, in the presence of omitted variables, time dependent effects, serial correlation, concept changes, or data-grouping it can fail to estimate your classifier performance correctly. The point is: splitting data from a retrospective study randomly is nowhere near as powerful as prior randomized test design (though some seem to intentionally conflate the two situations for their own benefit).
- predictive analytics product evaluation: If your end-goal was to predict well only in a back-testing environment, then you in fact could use simple black-box testing as your only evaluation step. If your actual goal is to work well on unknown future data, then you may need to take some additional steps to try and correctly estimate how a product would perform in such a new situation.

The reason that these issues don’t usually get commented on is: usually we exhaust our allotted time trying to get beginning analysts to even implement randomized retrospective testing (a great good, but not a complete panacea). Moving on to proper prior experimental design, or structured simulations of good prior experimental design often seems like a bridge too far.

With enough data and a big enough arsenal of methods, it’s relatively easy to find a classifier that *looks* good; the trick is finding one that *is* good. What many data science practitioners (and consumers) don’t seem to remember is that when evaluating a model, a random test/train split may not always be enough.

The true purpose of a test procedure is to estimate how well a classifier will work in future production situations. We don’t evaluate the classifier on training data because training error has a significant, undesirable upward scoring bias: that is, it is easy to find classifiers that do well on training and then do not work at all on future data. The error on test data — data that the classifier has never seen — is meant to be a better estimate of the model’s future performance. The underlying assumption of using a random test/train split is that future data is exchangeable with past data: that is, the informative variables will be distributed the same way, so that the training data is a good estimate of the test data — and the test data is a good estimate of future data.

However, in many fields your data is not exchangeable, due to time-based issues such as auto-correlation or omitted variables. In these situations, a random test/train split will cause the test data to look *too much* like the training data, and *not enough* like future data. This will tend to make a classifier look better than it really is, so you can’t be sure that your testing procedure has eliminated bad classifiers. In fact, you might accidentally eliminate what would be a good classifier in favor of a worse one that outperforms it in this artificial situation. A random test/train split is clearly unbiased, but bad classifiers benefit more from insensitive tests than good ones do. To prevent this, you must apply some of your domain knowledge to build a testing procedure that will safely simulate the possible future performance of your classifier.
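To make the failure mode concrete, here is a tiny toy simulation (my own sketch, not from the original articles): the single feature drifts with time and the underlying concept flips halfway through, so a 1-nearest-neighbor classifier looks superb under a random split but collapses under a time-ordered (forward) split.

```python
import random

random.seed(0)

n = 1000
# Toy data: the feature drifts with time, and the label is a "concept"
# that flips at the halfway point (an extreme form of concept change).
x = [i + random.gauss(0, 0.5) for i in range(n)]
y = [0 if i < n // 2 else 1 for i in range(n)]

def nn_accuracy(train_idx, test_idx):
    """Score a 1-nearest-neighbor classifier on the held-out indices."""
    correct = 0
    for j in test_idx:
        i = min(train_idx, key=lambda k: abs(x[k] - x[j]))
        correct += (y[i] == y[j])
    return correct / len(test_idx)

# Random split: test points have near-duplicate neighbors in training.
idx = list(range(n))
random.shuffle(idx)
acc_random = nn_accuracy(idx[: n // 2], idx[n // 2 :])

# Forward split: train on the past, test on the future.
acc_forward = nn_accuracy(list(range(n // 2)), list(range(n // 2, n)))

print("random split accuracy: ", acc_random)   # looks excellent
print("forward split accuracy:", acc_forward)  # reveals the failure
```

The random split rewards the classifier for memorizing its temporal neighbors; only the forward split exposes that it has learned nothing that transfers to the future.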

This may seem like contrary information, as many people mis-remember “random test/train split” as being the only possible practice and the only legitimate procedure for things like a clinical trial. This is in fact not true. For example, there are fields where a random test/train split would never be considered appropriate.

One such field is finance. A trading strategy is always tested only on data that is entirely from the future of any data used in training. Nobody ever builds a trading strategy using a random subset of the days from 2014 and then claims it is a good strategy if it makes money on a random set of test days from 2014. You would be laughed out of the market. You could build a strategy using data from the first six months of 2014 and test if it works well on the last six months of 2014, as a pilot study before attempting the strategy in 2015 (though, due to seasonality effects, a full year of training would be much more desirable). This is the basis of what is known in many fields as *backtesting*, or *hindcasting*. Finance would happily use random test-train split — it is much easier to implement and less sensitive to seasonal effects — if it worked for them. But it does not work, due to unignorable details of their application domain, so they have to use domain knowledge to build more representative splits.

Another example is news topic classification. Classifying articles into categories (sports, medicine, finance, and so on) is a common task. The problem is that many articles are duplicated through multiple feeds. So a simple random test/train split (without article clustering and de-duplication) will likely put a burst of near duplicate articles into both the test and train sets, even if all of these articles come out together in a short time frame. Consider a very simple lookup procedure: classify each article as being in the topic of the closest training article. With a simple random test/train split, the test set will almost always contain a near duplicate of each article in the training set, so this nearest-neighbor classifier will work very well in evaluation. But it will not work as well in actual application, because you will not have such close duplicates in your historic training data to rely on. The random test/train split did not respect how time works in the actual application — that it moves forward and there are bursts of very correlated articles — and the bad testing procedure could lead you to pick a very ineffective procedure over other procedures that may work just fine.

Any classification problem where there are alignments to external data, grouping of data, concept changes, time, key omitted variables, auto-correlation, burstiness of data, or any other problem that breaks the exchangeability hypothesis needs a bit of care during model evaluation. Random test/train split may work, but there also may be obvious reasons why it will not work, and you may need to take the time to design application-sensitive testing procedures. A randomized test/train split of *retrospective* data is not the same as a full prospective randomized controlled trial. And you must remember that the true purpose of hold-out testing is to estimate the quality of future performance, so you must take responsibility for designing testing procedures that are good estimates of future application, rather than simply claim random test/train split is always sufficient by an appeal to authority.

I’ll take a quick stab at explaining a very tiny bit of the motivation of schemes. I’m not sure the kind of chain-of-analogies argument I am attempting would work in an obituary (or in a short length), so I certainly don’t presume to advise Professor Mumford on his obituary of a great mathematician (and person).

A quick warning: I am a Ph.D. computer scientist with an undergraduate education in mathematics (plus some graduate work in mathematics). I have never worked with schemes, but I have worked with computational algebraic geometry. I can’t explain schemes to you, because I frankly find them a bit abstract. But I can explain a near-relative or ancestor: varieties. From that I think I can at least motivate schemes. But again, I am only going to explain the sliver that excites me: so I am going to neglect a lot (describe a very important work as being merely important).

What non-mathematicians often don’t know and mathematicians forget to explain is: the reason mathematics tolerates strange and abstract definitions is to make theorems stronger and simpler. Despite what it seems from the outside, obscurity and strangeness are not valued in mathematics.

Let’s start with what is considered a concrete example: the fundamental theorem of algebra. I almost said “let’s start with the complex numbers,” but that is exactly the kind of “cart before the horse” mis-motivation I don’t want to make.

From the Wikipedia: “Peter Rothe, in his book Arithmetica Philosophica (published in 1608), wrote that a polynomial equation of degree n (with real coefficients) may have n solutions.” This is an exciting possibility with tons of applications; people very much wanted this to be true. It would mean you could write any polynomial as a product of linear terms, and you could solve a lot of concrete equations and problems. The catch is: when you get precise you find out the statement isn’t true. There is no real number that is a solution to the polynomial equation `x^2+1=0`.
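As a quick illustration (my own example using SymPy, not something from the text), we can ask a computer algebra system for the solutions of `x^2+1=0` over each number system:

```python
from sympy import symbols, solveset, I, S, FiniteSet

x = symbols('x')

# Over the reals there are no solutions at all ...
real_roots = solveset(x**2 + 1, x, domain=S.Reals)

# ... but over the complex numbers we get the promised two roots
# (counting with repetition): i and -i.
complex_roots = solveset(x**2 + 1, x, domain=S.Complexes)

print(real_roots)
print(complex_roots)
```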

To fix this you go one of two ways.

- You state more complicated, less powerful, and less appealing versions of the theorem that are correct.
Such as: a polynomial equation of degree n (with real coefficients) may be factored into a product of linear and quadratic terms. Notice it isn’t just the proofs that are getting complex; it is also the statements of conditions and results.

This is undesirable: we were forced to move from solutions (numbers that when plugged into the polynomial simplify the whole thing to zero) to factoring polynomials. And we ended up with two types of terms in the factorization: quadratic and linear terms. The theorem is vacuously true when applied to the polynomial `x^2+1`, as it just says it factors to itself. This assertion was sufficiently complicated that as late as the 1740s mathematicians as notable as Gottfried Wilhelm Leibniz and Nikolaus Bernoulli were (incorrectly) claiming to exhibit polynomials that did not so factor.

- You replace your current abstraction (the real numbers) with a new one better suited to encode the theorem you are interested in.
In this case we introduce the complex numbers (a different number system than the reals). Then the following theorem is true: all polynomial equations of degree n (with complex coefficients) have n complex solutions (counting with repetition). From the Wikipedia again: Gerolamo Cardano introduced the complex numbers around 1545. This is a monumental step in mathematics, and by the mid 1750s many attempted proofs of this theorem had been published (now all considered incomplete in that they assumed a few things not yet known/proven). By 1821 complex-number-based proofs were making it into textbooks.

So the complex numbers allowed the simplification of a strongly coveted theorem and drove hundreds of years of mathematical research.

Let’s move one step closer to schemes: polynomial ideals and affine varieties.

I like polynomial ideals. The reason is: a great number of very hard problems can be encoded as asking if a given polynomial (in possibly more than one variable) is in a special type of set called a polynomial ideal. There are algorithms for working with polynomial ideals (in particular Groebner basis reduction and Buchberger’s algorithm). The natural companion object to a polynomial ideal is something called an affine variety. Think of polynomial ideals as special sets of multivariate polynomials and think of affine varieties as sheets of points these polynomials are simultaneously zero on. So polynomial ideals are generalizations of polynomials (to more than one polynomial and more than one variable) and affine varieties are generalizations of solutions (to sets of points describing many different assignments to multiple variables).

This set of mathematical tools and algorithms, under research since the mid 1960s, translates a lot of the most important algorithms from linear algebra (such as Gaussian elimination) and number theory (such as computing greatest common divisors) into a unified framework over multivariate polynomials. These algorithms are why packages like Macsyma, Maple, Mathematica, and SymPy can solve many equations.
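For instance (a sketch of my own, using SymPy, which the text mentions): ideal membership can be tested by reducing a polynomial against a Groebner basis of the ideal; the remainder is zero exactly when the polynomial lies in the ideal.

```python
from sympy import symbols, groebner, reduced, expand

x, y = symbols('x y')

# A toy ideal generated by two polynomials in two variables.
gens = [x**2 + y, x*y - 1]
G = groebner(gens, x, y, order='lex')  # Buchberger-style basis computation

# This polynomial is in the ideal by construction
# (a polynomial combination of the generators) ...
member = expand(y * (x**2 + y) + x * (x*y - 1))
_, r_member = reduced(member, G.exprs, x, y, order='lex')

# ... while x + 1 is not (it does not vanish on the whole variety).
_, r_non = reduced(x + 1, G.exprs, x, y, order='lex')

print(r_member)  # 0: in the ideal
print(r_non)     # nonzero: not in the ideal
```

The key property being used is that reduction by a Groebner basis (unlike reduction by an arbitrary generating set) gives a zero remainder for every member of the ideal.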

Polynomial ideals and varieties are related in an interesting way: the bigger the polynomial ideal the smaller the corresponding variety. For example all of R^n is the solution set of the zero ideal `{ 0 }`. And only the points `{ i, -i }` are solutions to the single variable polynomial `x^2+1`. This sort of linkage is called a Galois connection, and finding theorems like this motivates a lot of category theory. The idea is: we are working directly with polynomial ideals, but the affine varieties (or sets of simultaneous zeros) help us and do a lot of the bookkeeping for us (making it much easier to prove a lot more theorems).

Except, the relation doesn’t quite work as well as we would like. We would like something simpler and more powerful than a mere Galois connection: to have affine varieties be in one to one correspondence with polynomial ideals. That way each one has enough detail to be used for detailed record keeping on the other. It turns out affine varieties are not quite up to the job. Affine varieties cannot carry as much detail as the corresponding polynomial ideals. Affine varieties only track details of a subset of polynomial ideals called radical polynomial ideals. A radical polynomial ideal is such that if `p(x)^k` is in the polynomial ideal for some integer `k ≥ 1` then `p(x)` is in the polynomial ideal. So the ideal generated by `x^2` (all polynomials of the form `x^2 q(x)`) is a polynomial ideal, but not a radical polynomial ideal (the corresponding radical polynomial ideal is the one generated by `x`). A polynomial ideal and its corresponding radical polynomial ideal are associated with the same affine variety (so the space of affine varieties can’t tell them apart).
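As a small check (my own SymPy example): the ideals generated by `x^2` and by `x` are different, yet both carve out the same variety, the single point `{0}`.

```python
from sympy import symbols, solveset, S, FiniteSet

x = symbols('x')

# Different ideals (one generated by x**2, one by x) ...
v_sq = solveset(x**2, x, domain=S.Complexes)
v_lin = solveset(x, x, domain=S.Complexes)

# ... but identical varieties: both solution sets are just {0}.
print(v_sq)
print(v_lin)
```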

The issue is: we wanted to work directly with polynomial ideals (we had some great problems and algorithms ready to go). Affine varieties were only introduced to help with the record keeping (mostly in proofs). We don’t want to mess up our work by switching from polynomial ideals (which encode what we want) to radical polynomial ideals (which add in more constraints). What if instead of fixing the polynomial ideals, we fixed the affine varieties? Affine varieties are the ones not doing their job. It turns out we can in fact work with polynomial ideals: we just need to replace affine varieties with a more detailed abstract structure called schemes.

If you want to work with general ideals (that is, subsets of arbitrary rings closed under addition and absorbing under multiplication, not just ideals of polynomials) then your natural most detailed “sets of solutions” structure is not varieties but schemes. Alexander Grothendieck worked this out in the 1960s. For some specialized fields it was as revelatory as the introduction of complex numbers. Discoveries like this do not happen often.

It turns out the math is very general (so a bunch of fields I have neglected also use schemes). Because it is general it ends up being defined in terms more abstract than polynomials and roots (in terms of morphisms, topologies, Spec, and so on). Schemes are great because they work even over very general concepts (and not because they bring in very general concepts).

(For a text dealing with the algorithmic aspects of ideals and varieties (but not schemes) I recommend David A. Cox, John Little, Donal O’Shea, “Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra” 3rd Edition, 2008. Schemes are a pretty advanced topic; you can try Robin Hartshorne, “Algebraic Geometry”, 1997, but to even use the book you would need a background in commutative algebra.)


The long answer is: when Amazon.com supplies a Kindle edition readers have to deal with the following:

- Amazon.com digital rights management locking the material to a single format and to Amazon.com devices/readers.
- Careless mechanical re-formatting of the book material, yielding either poor rendering or re-packaging of PDFs that you can only zoom and scan across (and not true re-flow of text).
- Amazon.com prices for Kindle versions are often as high as 70 to 90 percent of the print edition, meaning that to get both editions (print and Kindle) you spend at least half as much again as getting either edition alone.

Some readers don’t like this and (rightly) complain. Some of the best books in our field have the occasional 1-star review due to a thoroughly frustrated Amazon Kindle customer. As an author you wish reviews were faceted with completely separate and mandatory sub-scores for vendor experience, price, delivery, print-quality, ebook-rendering, relevance to the particular reader, and finally book quality (instead of a single rating perceived as “book quality”). But from a buyer’s point of view: rating an item low that has given you a bad experience is completely legitimate (be it for print quality, or the utility of the eBook rendering).

Practical Data Science with R does have an e-copy. For our book, when we say e-copy we mean:

- An electronic copy available without any intrusive digital rights management (beyond requiring registration for initial download and a watermark). These are maximally useful copies, as you can search them, print them, and place them on arbitrary devices.
- Unlimited downloads and re-downloads of your copies.
- e-copy available in three formats: PDF, ePub, and Kindle. And you can download all three.
- e-copies are produced and inspected by the actual book editors during the production of the book (not a later mechanical transcription).

We offer readers more than one way to get a good e-copy, though not all customers are aware of all the options.

- Each new standard copy (though *not* the international discount reprint) offers an access code that gives single-user rights to an e-copy. This is true for any new standard edition (be it sold by Manning, Amazon, or any other bookseller). Note: used copies may have already consumed codes and discount international editions do not include codes (so if somebody is re-selling you a book you will want to check if it includes an unused code). This is a good deal, as for the price of a new standard print edition you get both a print and an e-copy (typically much cheaper than buying a p-copy and an e-copy separately).
- Manning itself sells e-copies where for a single discounted price you again get access to non-DRM “e-copy” editions (again giving you all of PDF, ePub, and Kindle). We know some readers do not want a physical book, and expect a discounted e-only option.
- Manning books are often available through Safari online, so you or your enterprise may already have some (restricted online) access through Safari.

In conclusion.

Manning reserves the right to be the only seller of e-only editions of Practical Data Science with R. For a full legitimate e-only copy you must go through them. Manning includes a free e-copy code in all new standard editions of the book. Wherever you buy a legitimate new copy of the standard edition you get the same e-rights as a bonus. Used copies and discount international editions have their roles, but may not have an e-copy included (someone may have consumed the right on a used copy, and the discount international edition doesn’t include a code).

Obviously the customers and readers get to decide what is of value to them. This describes the options we were able to supply.

I thought I would show how to register a Manning e-book from your physical copy. The process is fairly quick: you just need your physical book, an internet connection, and an email address to register a Manning account when prompted.

- Cut open the attached code sheet in the book front-matter.
- This reveals a large code spreadsheet and the redemption URL. Don’t worry: you only have to enter a couple of these cells.
- Go to http://www.manning.com/ebookoffer/ and enter the codes from two cells when prompted.

I know the code sheet differs from book to book. I guess it is large to make it less practical for somebody to peek and copy out the code in a bookstore. I presume once a code is associated with a Manning account it can’t be re-entered with another account. Obviously a direct purchase of an e-only copy directly from Manning is a much less involved process.
