We consider ourselves pretty familiar with R. We have years of experience, many other programming languages to compare R to, and we have taken Hadley Wickham’s Master R Developer Workshop (highly recommended). We already knew R’s `predict` function is pretty idiosyncratic (it takes different arguments per model type, and returns different types depending on model and arguments, which is why we wrapped it in our Bad Bayes article).

But here is an unnecessarily nasty puzzle we ran into recently.

```
library('mgcv')
library('ROCR')
d <- data.frame(x=1:10,y=(1:10)>=5)
model <- lm(y~x,data=d)
d$predLM <- predict(model,type='response')
plot(performance(prediction(d$predLM,d$y),'tpr','fpr'))
model <- gam(y~x,family=binomial,data=d)
d$predGAM <- predict(model,type='response')
plot(performance(prediction(d$predGAM,d$y),'tpr','fpr'))
## Error in plot(performance(prediction(d$predGAM, d$y), "tpr", "fpr")) :
##   error in evaluating the argument 'x' in selecting a method for
##   function 'plot': Error in prediction(d$predGAM, d$y) :
##   Format of predictions is invalid.
```

It is a silly example, but one really wonders why the plot of the `lm` model works and the *exact same code* fails to plot the `gam` model. Now (as with most runtime bugs brought on by overly dynamic languages) we ran into this problem while in the middle of doing something else (while doing data analysis, not while coding). So we were not in the right frame of mind to deduce the solution without further experiment.

Now that we are calm we can try and look for the problem. The first step of effective debugging is to put aside what you had been working on and admit you are now debugging. So you write in your notebook what you had been trying to do, and temporarily clear that from your mind.

Professor Norman Matloff describes debugging as:

Finding your bug is a process of confirming the many things you believe are true, until you find one which is not true.

What do we believe? We believe that `d$predLM` and `d$predGAM` should both give us a plot. So in some sense we believe they have the same structure. They superficially look to have the same structure:

```
> print(d)
    x     y      predLM      predGAM
1   1 FALSE -0.05454545 2.220446e-16
2   2 FALSE  0.09090909 2.220446e-16
3   3 FALSE  0.23636364 2.220446e-16
4   4 FALSE  0.38181818 2.085853e-10
5   5  TRUE  0.52727273 1.000000e+00
6   6  TRUE  0.67272727 1.000000e+00
7   7  TRUE  0.81818182 1.000000e+00
8   8  TRUE  0.96363636 1.000000e+00
9   9  TRUE  1.10909091 1.000000e+00
10 10  TRUE  1.25454545 1.000000e+00
```

Let’s look closer:

```
> print(d$predLM)
 [1] -0.05454545  0.09090909  0.23636364  0.38181818  0.52727273  0.67272727  0.81818182
 [8]  0.96363636  1.10909091  1.25454545
> print(d$predGAM)
           1            2            3            4            5            6            7
2.220446e-16 2.220446e-16 2.220446e-16 2.085853e-10 1.000000e+00 1.000000e+00 1.000000e+00
           8            9           10
1.000000e+00 1.000000e+00 1.000000e+00
```

That is weird: `print` formats them differently. Let’s see what these items really are.

```
> print(typeof(d$predGAM))
[1] "double"
> print(typeof(d$predLM))
[1] "double"
> print(class(d$predLM))
[1] "numeric"
> print(class(d$predGAM))
[1] "array"
> print(str(d$predLM))
 num [1:10] -0.0545 0.0909 0.2364 0.3818 0.5273 ...
NULL
> print(str(d$predGAM))
 num [1:10(1d)] 2.22e-16 2.22e-16 2.22e-16 2.09e-10 1.00 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:10] "1" "2" "3" "4" ...
NULL
```

Using all of `typeof`, `class`, and `str` (which we didn’t know about when we wrote Survive R) gives us the story. `d$predGAM` isn’t a vector in R’s specific, peculiar sense of the word:

```
> is.vector(d$predLM)
[1] TRUE
> is.vector(d$predGAM)
[1] FALSE
```
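Once the diagnosis is in hand the fix is simple. A minimal sketch (using a hand-built 1d array to stand in for the `gam` prediction, so it runs without `mgcv`): stripping the attributes with `as.numeric()` restores a plain vector that `ROCR`’s `prediction()` will accept.

```
# A 1d array with dimnames, structurally like what predict.gam() returned
p <- array(c(0.1, 0.9), dim = 2, dimnames = list(c("1", "2")))
is.vector(p)              # FALSE: the dim attribute disqualifies it
is.vector(as.numeric(p))  # TRUE: as.numeric() drops the attributes
```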

Had we known to look, we could have found the problem in one step with `str(d)`:

```
> print(str(d))
'data.frame': 10 obs. of  4 variables:
 $ x      : int  1 2 3 4 5 6 7 8 9 10
 $ y      : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
 $ predLM : num  -0.0545 0.0909 0.2364 0.3818 0.5273 ...
 $ predGAM: num [1:10(1d)] 2.22e-16 2.22e-16 2.22e-16 2.09e-10 1.00 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ : chr  "1" "2" "3" "4" ...
NULL
```

R’s type system is strange. `typeof` returns what primitive type is used to *implement* the item at hand (in this case a vector of doubles). `class` returns what classes have been declared for this item.

To a computer scientist: `d$predGAM` is a double vector that has some additional attributes (such as the `array` class declaration and shape parameters). It would commonly be thought of as a derived or refined type of vector. The writers of `mgcv` were probably thinking in these terms and figured it is okay to return a more refined type than one would expect from the generic `predict` signature. This is how most object oriented languages work. It is hard to call this code incorrect when the `help(predict)` documentation (for the generic base-method) is:

The form of the value returned by predict depends on the class of its argument. See the documentation of the particular methods for details of what is produced by that method.

To R: `d$predGAM` is an array, which is a class distinct from a double vector (though it is implemented in terms of a double vector, similar to C++’s idea of private or implementation inheritance). In R arrays support most of the same operations as vectors and can even interoperate with them (you can add an array to a vector). However, the `ROCR` package likely explicitly checked (at runtime) the types of its arguments. This is an actual correct instance of irony: an added type safety check (meant to defend against, and give good error messages in the case of, unexpected types) triggered an error on a type that probably could have quietly been used safely. (Note: in general I *very much like* such checks; they tend to cut down on errors and move detection of errors much closer to error origins, making debugging and maintenance much easier.)

The two packages failed to interoperate because they happened to have slightly incompatible (but likely each internally consistent) visions of the R type system.

A final question is: why didn’t stuffing the value into a `data.frame` coerce the types or get rid of some of the additional annotations? The reason is that in R data.frames are in fact lists of columns; they only appear to be two-dimensional row-oriented structures due to some clever over-riding of the two-dimensional `[,]` operator. Despite expectations, data.frame columns are not always simple primitive vectors and can hold complex composite objects such as `POSIXlt` (which would break a lot more code if it didn’t have a built-in conversion to numeric).

Here we are extracting an appendix: “Soft margin is not as good as hard margin.” In it we build a toy problem that is not large-margin separated and note that if the dimension of the concept space you were working in was not obvious (i.e. you were forced to rely on the margin derived portion of generalization bounds) then generalization improvement for a soft margin SVM is much slower than you would expect given experience from the hard margin theorems. The punch-line is: every time you get eight times as much training data you only halve your expected excess generalization error bound (whereas once you get below a data-set’s hard-margin bound you expect one to one reduction of the bound with respect to training data set size). What this points out is: the soft margin idea can simulate margin, but it comes at a cost.

The PDF extract is here and the work in iPython notebook form is here (though that obviously would take some set-up to run).

R’s two-argument extraction operator “`[,]`” is a bit more irregular than I remembered.
The subsetting section of Advanced R has a very good discussion on the subsetting and selection operators found in R. In particular it raises the important distinction of two simultaneously valuable but incompatible desiderata: simplification of results versus preservation of results.

The issue is: when you pull a single row or column out of R’s most important structure (the data frame) do you get a data frame, a list, or a vector? Not all code that works on one of these types works equivalently across all of these types, so this can be a serious issue. We have written about this before (see selection in R). But it wasn’t until we got more into teaching (and co-authored the book Practical Data Science with R) that we really appreciated how confusing this can be for the beginner.

Let’s start with an example.

```
> d <- data.frame(x=c(1,2),y=c(3,4))
> print(d)
  x y
1 1 3
2 2 4
> print(d[1,])
  x y
1 1 3
> print(d[,1])
[1] 1 2
```

What we see when using the two-argument `[,]` extract operator on a simple data frame is:

- Extracting a single row returns a data frame (confirm with the `class()` method).
- Extracting a single column returns a vector (instead of a data frame).

And this is pretty much what a user sitting in front of an interactive system would want: simplification on columns and preservation on rows. And this is compatible with R’s history as an interactive analysis system (versus as a batch programming language, as outlined here).

Where we run into trouble is when we are writing code that we expect to run correctly in all situations (even when we are not watching). Consider the following example.

```
> selector1 <- c(TRUE,FALSE)
> selector2 <- c(TRUE,TRUE)
> print(d[,selector1])
[1] 1 2
> print(d[,selector2])
  x y
1 1 3
2 2 4
```

In the first case our boolean selection vector returned a vector, and in the second case it returned a data frame. Believe it or not, this is a problem. If we were reading this code and the values of `selector1` and `selector2` were set somewhere else (say as the result of a complicated calculation), we would have no way of knowing what type would be returned by `d[,selector1]`. This is true even if we were lucky enough to have documentation asserting `selector1` and `selector2` are logical vectors of the correct length.

At runtime we can see how many positions of `selector1` are set to `TRUE`. But we can’t reliably infer this count from looking at just an isolated code snippet. So we would not know at coding time what code would be safe to apply to the result `d[,selector1]`. The changing of the return type based on mere variation of argument value (not argument type) is a very bad thing in terms of readability. A code reader can’t set simple (non data-dependent) expectations on the code. Nor can they use assumed pre-conditions known about the inputs (such as documented type) to establish useful post-conditions (guaranteed behavior of the code).
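The defensive fix is to always pass `drop=FALSE` when the number of selected columns is not known at coding time; a small sketch:

```
d <- data.frame(x = c(1, 2), y = c(3, 4))
selector1 <- c(TRUE, FALSE)
class(d[, selector1])                 # "numeric": simplified to a vector
class(d[, selector1, drop = FALSE])   # "data.frame": type is now predictable
```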

Why should we care about prior expectations? Can’t we just consider those uninformed presumptions and teach past them? To my mind this violates some concepts of efficient learning and teaching. In my opinion there is no such thing as passive learning (or completely pure teaching). Students learn by thinking, and base their expectations for new material on generalizing and regularizing lessons from older material. The more effective students are at this, the faster they learn.

Also, pity the student who makes a mistake while trying to learn about the square-bracket extraction operator through the R help system. If they accidentally type `help('[')` instead of `help('[.data.frame')`, then they see the following confusing help.

Figure: `help('[')`.

Instead of seeing the relevant definition, which is as follows.

Figure: `help('[.data.frame')`.

Notice the first help implies there is an argument called `drop` that defaults to `TRUE`. This is true for matrices (what the help is talking about), but false for data frames (the central class of R; nobody should choose R for the matrix operations). You could (informally) think of `[.data.frame` as being a specialization of the base `[` in the sense of object-oriented inheritance. Except, it is considered very bad form to change the semantics or rules when extending types and operators. The expectations set in the base class (and especially those set in the base-class documentation) should hold in derived classes and methods.

We can confirm `[.data.frame,]` does not act like either of `[.data.frame,,drop=TRUE]` or `[.data.frame,,drop=FALSE]`. It picks its own behavior depending on if you end up with a single column or not (note: I didn’t say “if you picked a single column or not”). The code below shows some of the variations in behavior.

```
> print(d[1,])
  x y
1 1 3
> print(d[,1])
[1] 1 2
> print(d[1,,drop=TRUE])
$x
[1] 1

$y
[1] 3

> print(d[,1,drop=TRUE])
[1] 1 2
> print(d[1,,drop=FALSE])
  x y
1 1 3
> print(d[,1,drop=FALSE])
  x
1 1
2 2
```

Notice how none of the complete results of these three experiments (running without the drop argument, running with it set to `TRUE`, and running with it set to `FALSE`) entirely match any of the others.

Also you can trigger the “only one column causes type conversion” issue even when you are not selecting on columns (in fact even when selecting the entire data frame!):

```
> d1 <- data.frame(x=c(1,2))
> print(d1)
  x
1 1
2 2
> print(d1[c(TRUE,TRUE),])
[1] 1 2
```
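Again, `drop=FALSE` restores regular behavior even in this single-column case:

```
d1 <- data.frame(x = c(1, 2))
print(d1[c(TRUE, TRUE), ])                # simplified to a plain vector
print(d1[c(TRUE, TRUE), , drop = FALSE])  # stays a data frame
```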

This is a good point to return to the article about the historic context and influences of R, which gives us the following quote:

Pat begins with how R began as an experimental offshoot from S (there’s an adorable 1990’s-era photo of R’s creators Ross Ihaka and Robert Gentleman in Auckland on page 23, reproduced below), and then evolved into a language used first interactively, and then for programming. The tensions between the two modes of use led to some of the quirkier aspects of R. (Pat’s moral: “if you want to create a beautiful language, for god’s sake don’t make it useful”.)

How would I like R to behave if it evolved anew and didn’t have to support older code? I’d like (but know I can’t have) the following:

- `[,]` is reserved to select sets of rows and columns and by default guarantees “preserving” behavior in all cases (i.e. all variations of `[,]` default to `drop=FALSE`).
- `[[]]` is reserved for extracting a single item and is “simplifying”.
- To extract a single column as a vector from a data frame you must use the single argument list operator `[[]]`.
- In all cases `[[]]` signals an error if you do not select exactly one element.

When I say I want these things, understand this means both that I already know this is not the way they are, and that I know (for practical reasons) they can not be changed to be so. The fact that none of the above statements is currently true will come as a surprise to many R users. For example, it is widely thought that `[[]]` behaves everywhere as it behaves on lists: properly signaling errors if you try to select more than one element. Notice this does not turn out to be the case. For vectors and lists we have good error-indicating behaviors:

```
> c(1,2,3)[[c(1,2)]]
Error in c(1, 2, 3)[[c(1, 2)]] : attempt to select more than one element
> list(1,2,3)[[c(1,2)]]
Error in list(1, 2, 3)[[c(1, 2)]] : subscript out of bounds
> list(1,2,3)[[2]]
[1] 2
```

For data frames we have a less desirable “anything goes” situation:

```
> d[[c(1,2)]]
[1] 2
```

Remember: a situation that should have signaled an error and did not is worse than a situation with a signaling error. (Note: `subset(d1,x==1,select=c('x'))` seems to reliably avoid unwanted simplification, but is not advised as it invokes non-standard evaluation issues. Look at `getS3method('subset','data.frame')` for details.)

Data frames are guaranteed to be lists of columns (a publicly exposed implementation detail, a bit obscured by the fact that the derived two-argument operator `[,]` superficially appears to be row-oriented). So we would expect `d[[c(1,2)]]` to properly error-out as it does for lists. However, it appears to behave more like a two-dimensional index operator. Probably some code is using this, but it is a pretty clear violation of expectations (especially for a new student). Repeating: data frames are lists of columns (you can check this with `unclass(d)`) and this is not a hidden implementation detail (it is commonly discussed and expected). ~~However the `[[.data.frame` operator has extended or overridden behavior that is different than any notional base-`[[` method/operator.~~ (Please see comments below for corrections on `d[[c(1,2)]]`.)

One of the reasons we need two extraction operators (`[]` and `[[]]`) is: R does not expose true scalar types (even the number `3` is a length-1 vector), so we have no convenient way to signal (even using runtime types) whether we thought we were coding a set-based extraction (through a set/vector of indices or a vector of booleans) or a scalar-based extraction (through a single index, the case where simplification is most likely to be desirable). It is likely the designers understood that return types changing on mere change in values of arguments (and not on more fundamental changes of types of arguments) is confusing and undesirable (as it eliminates any chance at pure type-to-type reasoning), and that this is what led to S/R having so many extraction/selection operators. They saw the need to isolate and document different behaviors. However, these abstractions turn out to be a bit leaky.

For my part, I teach designing your code assuming you had simple regular versions of the above operators, and then implementing defensively (specifying `drop`, and preferring `subset()` and `[[]]` to `[]`) to ensure you get good regular behavior.
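Concretely, the defensive style looks like this (a small sketch):

```
d <- data.frame(x = c(1, 2), y = c(3, 4))
col <- d[["x"]]              # always a vector, never a data frame
row <- d[1, , drop = FALSE]  # always a data frame, never a list or vector
stopifnot(is.vector(col), is.data.frame(row))
```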

The related concepts from the two articles are:

- limitations of Random Test/Train splits: a randomized split of data into test and training is generally a good idea. However, in the presence of omitted variables, time dependent effects, serial correlation, concept changes, or data-grouping it can fail to estimate your classifier performance correctly. The point is: splitting data from a retrospective study randomly is nowhere near as powerful as prior randomized test design (though some seem to intentionally conflate the two situations for their own benefit).
- predictive analytics product evaluation: If your end-goal was to predict well only in a back-testing environment, then you in fact could use simple black-box testing as your only evaluation step. If your actual goal is to work well on unknown future data, then you may need to take some additional steps to try and correctly estimate how a product would perform in such a new situation.

The reason that these issues don’t usually get commented on is: usually we exhaust our allotted time trying to get beginning analysts to even implement randomized retrospective testing (a great good, but not a complete panacea). Moving on to proper prior experimental design, or structured simulations of good prior experimental design often seems like a bridge too far.

With enough data and a big enough arsenal of methods, it’s relatively easy to find a classifier that *looks* good; the trick is finding one that *is* good. What many data science practitioners (and consumers) don’t seem to remember is that when evaluating a model, a random test/train split may not always be enough.

The true purpose of a test procedure is to estimate how well a classifier will work in future production situations. We don’t evaluate the classifier on training data because training error has a significant undesirable upward scoring bias: that is, it is easy to find classifiers that do well on training and then do not work at all on future data. The error on test data — data that the classifier has never seen — is meant to be a better estimate of the model’s future performance. The underlying assumption of using a random test/train split is that future data is exchangeable with past data: that is, the informative variables will be distributed the same way, so that the training data is a good estimate of the test data — and the test data is a good estimate of future data.

However, in many fields your data is not exchangeable due to time based issues such as auto-correlation, or because of omitted variables. In these situations, a random test/train split will cause the test data to look *too much* like the training data, and *not enough* like future data. This will tend to make a classifier look better than it really is, so you can’t be sure that your testing procedure has eliminated bad classifiers. In fact, you might accidentally eliminate what would be a good classifier in favor of a worse one that outperforms it in this artificial situation. Random test/train split is clearly unbiased, but bad classifiers benefit more from insensitivity of tests than good ones. To prevent this, you must apply some of your domain knowledge to build a testing procedure that will safely simulate the possible future performance of your classifier.

This may seem like contrary information as many people mis-remember “random test/train split” as being the only possible practice and the only legitimate procedure for things like a clinical trial. This is in fact not true. For example, there are fields where a random test/train split would never be considered appropriate.

One such field is finance. A trading strategy is always tested only on data that is entirely from the future of any data used in training. Nobody ever builds a trading strategy using a random subset of the days from 2014 and then claims it is a good strategy if it makes money on a random set of test days from 2014. You would be laughed out of the market. You could build a strategy using data from the first six months of 2014 and test if it works well on the last six months of 2014, as a pilot study before attempting the strategy in 2015 (though, due to seasonality effects, a full year of training would be much more desirable). This is the basis of what is known in many fields as *backtesting*, or *hindcasting*. Finance would happily use random test-train split — it is much easier to implement and less sensitive to seasonal effects — if it worked for them. But it does not work, due to unignorable details of their application domain, so they have to use domain knowledge to build more representative splits.
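A minimal sketch of such a time-ordered (backtest-style) split in R, assuming the rows of a hypothetical data frame `d` are already sorted by date:

```
set.seed(2014)
d <- data.frame(day = 1:100, y = rnorm(100))   # stand-in for time-ordered data
cut <- floor(0.7 * nrow(d))
dTrain <- d[seq_len(cut), , drop = FALSE]           # earliest 70% of days
dTest  <- d[seq(cut + 1, nrow(d)), , drop = FALSE]  # strictly later days
stopifnot(max(dTrain$day) < min(dTest$day))         # no future leaks into training
```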

Another example is news topic classification. Classifying articles into categories (sports, medicine, finance, and so on) is a common task. The problem is that many articles are duplicated through multiple feeds. So a simple random test/train split (without article clustering and de-duplication) will likely put a burst of near duplicate articles into both the test and train sets, even if all of these articles come out together in a short time frame. Consider a very simple lookup procedure: classify each article as being in the topic of the closest training article. With a simple random test/train split, the test set will almost always contain a near duplicate of each article in the training set, so this nearest-neighbor classifier will work very well in evaluation. But it will not work as well in actual application, because you will not have such close duplicates in your historic training data to rely on. The random test/train split did not respect how time works in the actual application — that it moves forward and there are bursts of very correlated articles — and the bad testing procedure could lead you to pick a very ineffective procedure over other procedures that may work just fine.

Any classification problem where there are alignments to external data, grouping of data, concept changes, time, key omitted variables, auto-correlation, burstiness of data, or any other problem that breaks the exchangeability hypothesis needs a bit of care during model evaluation. Random test/train split may work, but there also may be obvious reasons why it will not work, and you may need to take the time to design application-sensitive testing procedures. A randomized test/train split of *retrospective* data is not the same as a full prospective randomized controlled trial. And you must remember that the true purpose of hold-out testing is to estimate the quality of future performance, so you must take responsibility for designing testing procedures that are good estimates of future application, rather than simply claim random test/train split is always sufficient by an appeal to authority.

I’ll take a quick stab at explaining a very tiny bit of the motivation of schemes. I am not sure the kind of chain-of-analogies argument I am attempting would work in an obituary (or in a short length), so I certainly don’t presume to advise Professor Mumford on his obituary of a great mathematician (and person).

A quick warning: I am a Ph.D. computer scientist with an undergraduate education in mathematics (plus some graduate work in mathematics). I have never worked with schemes, but I have worked with computational algebraic geometry. I can’t explain schemes to you, because I frankly find them a bit abstract. But I can explain a near-relative or ancestor: varieties. From that I think I can at least motivate schemes. But again, I am only going to explain the sliver that excites me: so I am going to neglect a lot (describe a very important work as being merely important).

What non-mathematicians often don’t know and mathematicians forget to explain is: the reason mathematics tolerates strange and abstract definitions is to make theorems stronger and simpler. Despite what it seems from the outside, obscurity and strangeness are not valued in mathematics.

Let’s start with what is considered a concrete example: the fundamental theorem of algebra. I almost said “let’s start with the complex numbers,” but that is exactly the kind of “cart before the horse” mis-motivation I don’t want to make.

From the Wikipedia: “Peter Rothe, in his book Arithmetica Philosophica (published in 1608), wrote that a polynomial equation of degree n (with real coefficients) may have n solutions.” This is an exciting possibility with tons of applications; people very much wanted this to be true. It would mean you could write any polynomial as a product of linear terms, and you could solve a lot of concrete equations and problems. The catch is: when you get precise you find out the statement isn’t true. There is no real number that is a solution to the polynomial equation `x^2+1=0`.
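R itself can exhibit the two complex solutions; `polyroot()` takes coefficients in increasing degree, so `x^2+1` is written `c(1, 0, 1)`:

```
# The complex roots of x^2 + 1 (coefficients in increasing degree)
polyroot(c(1, 0, 1))   # the two roots are i and -i
```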

To fix this you go one of two ways.

- You state more complicated, less powerful, and less appealing versions of the theorem that are correct.
Such as: a polynomial equation of degree n (with real coefficients) may be factored into a product of linear and quadratic terms. Notice it isn’t just the proofs that are getting complex; it is also the statements of conditions and results.

This is undesirable: we were forced to move from solutions (numbers that when plugged into the polynomial simplify the whole thing to zero) to factoring polynomials. And we ended up with two types of terms in the factorization: quadratic and linear terms. The theorem is vacuously true when applied to the polynomial `x^2+1`, as it just says it factors to itself. This assertion was sufficiently complicated that as late as the 1740s mathematicians as notable as Gottfried Wilhelm Leibniz and Nikolaus Bernoulli were (incorrectly) claiming to exhibit polynomials that did not so factor.

- You replace your current abstraction (the real numbers) with a new one better suited to encode the theorem you are interested in.
In this case we introduce the complex numbers (a different number system than the reals). Then the following theorem is true: all polynomial equations of degree n (with complex coefficients) have n complex solutions (counting with repetition). From the Wikipedia again: Gerolamo Cardano introduced the complex numbers around 1545. This is a monumental step in mathematics and by the mid 1750s many attempted proofs of this theorem were published (now all considered to be incomplete in that they assumed a few things not yet known/proven). By 1821 complex number based proofs are making it to textbooks.

So the complex numbers allowed the simplification of a strongly coveted theorem and drove hundreds of years of mathematical research.

Let’s move one step closer to schemes: polynomial ideals and affine varieties.

I like polynomial ideals. The reason is: a great number of very hard problems can be encoded as asking if a given polynomial (in possibly more than one variable) is in a special type of set called a polynomial ideal. There are algorithms for working with polynomial ideals (in particular Groebner basis reduction and Buchberger’s algorithm). The natural companion object to a polynomial ideal is something called an affine variety. Think of polynomial ideals as special sets of multivariate polynomials, and think of affine varieties as sheets of points these polynomials are simultaneously zero on. So polynomial ideals are generalizations of polynomials (to more than one polynomial and more than one variable), and affine varieties are generalizations of solutions (to sets of points describing many different assignments to multiple variables).

This set of mathematical tools and algorithms, under research since the mid 1960s, translates a lot of the most important algorithms from linear algebra (such as Gaussian elimination) and number theory (such as computing greatest common divisors) into a unified framework over multivariate polynomials. These algorithms are why packages like Macsyma, Maple, Mathematica, and SymPy can solve many equations.

Polynomial ideals and varieties are related in an interesting way: the bigger the polynomial ideal, the smaller the corresponding variety. For example, all of R^n is a solution to the polynomial ideal equal to `{ 0 }`. And only the points `{ i, -i }` are solutions to the single variable polynomial `x^2+1`. This sort of linkage is called a Galois connection; finding theorems like this motivates a lot of category theory. The idea is: we are working directly with polynomial ideals, but the affine varieties (or sets of simultaneous zeros) help us and do a lot of the bookkeeping for us (making it much easier to prove a lot more theorems).

Except, the relation doesn’t quite work as well as we would like. We would like something simpler and more powerful than a mere Galois connection: to have affine varieties be in one to one correspondence with polynomial ideals. That way each one has enough detail to be used for detailed record keeping on the other. It turns out affine varieties are not quite up to the job. Affine varieties can not carry as much detail as the corresponding polynomial ideals. Affine varieties only track details of a subset of polynomial ideals called radical polynomial ideals. A radical polynomial ideal is such that if `p(x)^k` is in the polynomial ideal for some integer `k ≥ 1` then `p(x)` is in the polynomial ideal. So the set `{x^{2k} | k>=1}` is a polynomial ideal, but not a radical polynomial ideal (the corresponding radical polynomial ideal is `{x^{k} | k>=1}`). A polynomial ideal and its corresponding radical polynomial ideal are associated with the same affine variety (so the space of affine varieties can’t tell them apart).

The issue is: we wanted to work directly with polynomial ideals (we had some great problems and algorithms ready to go). Affine varieties were only introduced to help with the record keeping (mostly in proofs). We don’t want to mess up our work by switching from polynomial ideals (which encode what we want) to radical polynomial ideals (which add in more constraints). What if instead of fixing the polynomial ideals, we fixed the affine varieties? Affine varieties are the ones not doing their job. It turns out we can in fact work with polynomial ideals: we just need to replace affine varieties with a more detailed abstract structure called schemes.

If you want to work with general ideals (that is, subsets of arbitrary rings closed (and absorbing) under multiplication, not just ideals of polynomials) then your natural most detailed “sets of solutions” structure is not varieties but schemes. Alexander Grothendieck worked this out in the 1960s. For some specialized fields it was as revelatory as the introduction of complex numbers. Discoveries like this do not happen often.

It turns out the math is very general (so a bunch of fields I have neglected also use schemes). Because it is general it ends up being defined in terms more abstract than polynomials and roots (in terms of morphisms, topologies, Spec, and so on). Schemes are great because they work even over very general concepts (and not because they bring in very general concepts).

(For a text dealing with the algorithmic aspects of ideals and varieties (but not schemes) I recommend David A. Cox, John Little, and Donal O’Shea, “Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra,” 3rd Edition, 2008. Schemes are a pretty advanced topic; you can try Robin Hartshorne, “Algebraic Geometry,” 1997, but to even use the book you would need a background in commutative algebra.)


The long answer is: when Amazon.com supplies a Kindle edition readers have to deal with the following:

- Amazon.com digital rights management locking the material to a single format and to Amazon.com devices/readers.
- Careless mechanical re-formatting of the book material yielding either poor rendering or re-packaging of PDFs that you can only zoom and scan across (and not true re-flow of text).
- Amazon.com prices for Kindle versions are often as high as 70 to 90 percent of the print edition, meaning that to get both editions (print and Kindle) you spend at least half as much again as getting either edition alone.

Some readers don’t like this and (rightly) complain. Some of the best books in our field have the occasional 1-star review due to a thoroughly frustrated Amazon Kindle customer. As an author you wish reviews were faceted with completely separate and mandatory sub-scores for vendor experience, price, delivery, print quality, ebook rendering, relevance to the particular reader, and finally book quality (instead of a single rating perceived as “book quality”). But from a buyer’s point of view: rating an item low that has given you a bad experience is completely legitimate (be it for print quality, or the utility of the eBook rendering).

Practical Data Science with R does have an e-copy. For our book, when we say e-copy we mean:

- An electronic copy available without any intrusive digital rights management (beyond requiring registration for initial download, plus a watermark). These are maximally useful copies, as you can search them, print them, and place them on arbitrary devices.
- Unlimited downloads and re-downloads of your copies.
- e-copy available in three formats: PDF, ePub, and Kindle. And you can download all three.
- e-copies are produced and inspected by the actual book editors during the production of the book (not a later mechanical transcription).

We offer readers more than one way to get a good e-copy, though not all customers are aware of all the options.

- Each new standard copy (though *not* the international discount reprint) offers an access code that gives single-user rights to an e-copy. This is true for any new standard edition (be it sold by Manning, Amazon, or any other bookseller). Note: used copies may have already-consumed codes, and discount international editions do not include codes (so if somebody is re-selling you a book you will want to check whether it includes an unused code). This is a good deal, as for the price of a new standard print edition you get both a print and an e-copy (typically much cheaper than buying a p-copy and an e-copy separately).
- Manning itself sells e-copies, where for a single discounted price you again get access to non-DRM “e-copy” editions (again giving you all of PDF, ePub, and Kindle). We know some readers do not want a physical book, and expect a discounted e-only option.
- Manning books are often available through Safari online, so you or your enterprise may already have some (restricted online) access through Safari.

In conclusion.

Manning reserves the right to be the only seller of e-only editions of Practical Data Science with R, so for a full legitimate e-only copy you must go through them. Manning includes a free e-copy code in all new standard editions of the book, so wherever you buy a legitimate new copy of the standard edition you get the same e-rights as a bonus. Used copies and discount international editions have their roles, but may not have an e-copy included (someone may have consumed the right on a used copy, and the discount international edition doesn’t include a code).

Obviously the customers and readers get to decide what is of value to them. This describes the options we were able to supply.

I thought I would show how to register a Manning e-book from your physical copy. The process is fairly quick: you just need your physical book, an internet connection, and an email address to register a Manning account when prompted.

- Cut open the attached code sheet in the book front-matter.
- This reveals a large code sheet and the redemption URL. Don’t worry: you only have to enter a couple of these cells.
- Go to http://www.manning.com/ebookoffer/ and enter the codes from two cells when prompted.

I know the code sheet differs from book to book. I guess it is large to make it less practical for somebody to peek and copy out the code in a bookstore. I presume once a code is associated with a Manning account it can’t be re-entered with another account. Obviously a direct purchase of an e-only copy directly from Manning is a much less involved process.

`scikit-learn`, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing whether there is a “geometry” of classifiers: which classifiers produce prediction patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.

We looked at seven classifiers from `scikit-learn`:

- SVM (`sklearn.svm.SVC`) with the radial basis function kernel, gamma=0.001 and C=10
- Random Forest (`sklearn.ensemble.RandomForestClassifier`) with 100 trees, each limited to a maximum depth of 10
- Gradient Boosting (`sklearn.ensemble.GradientBoostingClassifier`)
- Decision Tree (`sklearn.tree.DecisionTreeClassifier`)
- Gaussian Naive Bayes (`sklearn.naive_bayes.GaussianNB`)
- Logistic Regression (`sklearn.linear_model.LogisticRegression`)
- K-Nearest Neighbors (`sklearn.neighbors.KNeighborsClassifier`) with K=5

We predicted class probabilities for each target class (using the `predict_proba()` method), rather than simply predicting class. Note that the decision tree and K-Nearest Neighbors implementations only return 0/1 predictions (that is, they only predict class) even when using their `predict_proba()` methods. We made no effort to optimize the classifier performance on a per-data-set basis.
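A minimal sketch of that difference (not the study’s actual code; the iris data set and these classifier settings are just illustrative stand-ins):

```python
# Sketch: compare predict_proba() output from a depth-limited tree and
# from logistic regression on a toy data set.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=10).fit(X, y)
logit = LogisticRegression(max_iter=1000).fit(X, y)

tree_probs = tree.predict_proba(X)    # rows are often hard 0/1 vectors
logit_probs = logit.predict_proba(X)  # rows are graded probabilities
print(tree_probs[0], logit_probs[0])  # each row sums to 1
```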

We used the 123 pre-prepared data sets compiled by the authors of the DWN study; these data sets have been centered, scaled, and stored in a format (`.arff`) that can be easily read into Python. The data sets vary in size from 16 to over 130,000 rows, and from 3 to 262 variables. They ranged from having 2 to 100 target classes (most were two-class problems; the median number of classes was three; the average, about seven).

As we noted in our previous post, eight of the 123 data sets in the collection encoded categorical variables as real numbers, by hand-converting them to ordered levels (a variation on this trick is to hash the strings that describe each category). As our previous post pointed out, this is not the correct way to encode categoricals; you should instead convert the categories to indicator variables. However, rather than re-doing the encodings, we left them as-is for our study. This disadvantages logistic regression and SVM, which cannot undo the information loss that results from this encoding, and probably disadvantages K-Nearest Neighbors as well; however the number of affected data sets is relatively small.

Some, but not all, of the data sets in the DWN study were broken into training and test sets; we only looked at the training sets. If the training set was larger than 500 rows, then we randomly selected 100 items and held them out as the test set and used the remaining data to train the classifiers. If the training set was smaller than 500 rows, then we used hold-one-out cross validation on 100 random rows (or on all the rows, for data sets with fewer than 100 rows): that is, for each row, we trained on all the data except that row, and then classified the held-out row.
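The splitting rule just described can be sketched as follows (`choose_eval_plan` is a hypothetical helper, not the study’s code):

```python
import random

def choose_eval_plan(n_rows, seed=0):
    """Large sets: hold out 100 random rows as a test set.
    Small sets: hold-one-out cross validation over up to 100 random rows."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    if n_rows > 500:
        return ('holdout', rng.sample(rows, 100))
    return ('hold-one-out', rng.sample(rows, min(100, n_rows)))

print(choose_eval_plan(10000)[0])  # holdout
print(choose_eval_plan(60)[0])     # hold-one-out
```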

We considered two questions in our study. First, which classification methods are most accurate in general — that is, which methods identify the correct class most of the time. Second, which classifiers behave most like each other, in terms of the class probabilities that they assign to each of the target classes.

To answer the first question, we first consider each classifier’s accuracy rate over all data sets:

Over all the data sets we considered, random forest (highlighted in blue) had the highest accuracy, correctly identifying the class of a held-out instance about 82% of the time (considered over all test sets).

Note that we are using accuracy as a quick and convenient indication of classifier quality. See this post (and our discussion later in the present article) for why accuracy should not be considered an end-all measure.

We also counted up how often a classifier did the best on a given test set. For each data set *D*, we call the best that any classifier did on that set *N_best*. For example, if the best any classifier did on data set *D* was 97/100, then *N_best* for *D* is 97/100. Every classifier that achieved 97/100 on *D* is then counted as part of “the winner’s circle”. We then tallied how many times each classifier ended up in the winner’s circle over all the data sets under consideration. Here are the results for all the data sets:

This graph shows that over all the data sets we considered, gradient boosting and random forest (highlighted in blue) reached the best achievable accuracy (*N_best*) the most often: on 39 of the 123 data sets. This is consistent with the findings of the DWN paper, which noted that Random Forest is often among the most accurate classifiers over this sample of data sets. Notice that the sum of counts over all classifiers is greater than 123, meaning that quite often more than one classifier achieved *N_best*.
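The tallying rule can be sketched like this (toy accuracy numbers, not the study’s results):

```python
from collections import Counter

# accuracies[data_set][classifier] = accuracy on that data set's test rows
accuracies = {
    'd1': {'rf': 0.97, 'gbm': 0.97, 'logit': 0.90},
    'd2': {'rf': 0.80, 'gbm': 0.85, 'logit': 0.85},
}

winners = Counter()
for scores in accuracies.values():
    n_best = max(scores.values())   # best accuracy any classifier achieved
    for clf, acc in scores.items():
        if acc == n_best:           # everyone tying N_best joins the circle
            winners[clf] += 1

print(winners['gbm'], winners['rf'], winners['logit'])  # 2 1 1
```

Note that the winner counts can sum to more than the number of data sets, exactly as observed above.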

To answer the second question, we took the vector of class probabilities returned by each classifier for a given data set *D*, and measured squared Euclidean distance between each pair of models, and between each model and ground truth. The (squared) distance between two models is then the sum of the squared Euclidean distance between them over all data sets. We can use these distances to determine the similarity between various models. One way to visualize this is via a dendrogram:

Over this set of data sets, gradient boosting and random forest behaved very similarly (that is, tended to return similar class probability estimates), which is not too surprising, since they are both ensembles over decision trees. It’s not visible from the dendrogram, but they are also the closest to ground truth. Logistic regression and SVM are also quite similar to each other. We found this second observation somewhat surprising, but the similarity of logistic regression and SVM has been observed previously, for example by Jian Zhang et al. for text classification (“Modified Logistic Regression: An Approximation to SVM and Its Applications in Large-Scale Text Classification”, ICML 2003). Nearest neighbor behaves somewhat similarly to the first four classifiers, but the naive Bayes and decision tree classifiers do not.
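The distance computation described above can be sketched with numpy (synthetic probability vectors; `rf` and `gbm` are made deliberately similar to mimic the observed clustering, and the resulting matrix could be fed to, e.g., `scipy.cluster.hierarchy.linkage` to draw a dendrogram):

```python
import numpy as np

# Each model's predicted class probabilities, concatenated over all data
# sets, form one long vector per model (synthetic values here).
rng = np.random.default_rng(0)
base = rng.random(50)
preds = {
    'rf':    base + 0.01 * rng.random(50),
    'gbm':   base + 0.01 * rng.random(50),
    'logit': rng.random(50),
}

models = list(preds)
n = len(models)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # squared Euclidean distance, summed over all predictions
        dist[i, j] = np.sum((preds[models[i]] - preds[models[j]]) ** 2)

print(np.round(dist, 3))  # small rf-gbm entry, larger entries elsewhere
```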

Alternatively, we can visualize the approximate distances between the classifiers using multidimensional scaling. This is a screenshot of a rotatable visualization of the model distances in 3-D that uses the packages `rgl` and `shinyRGL`.

These distances are only approximate, but they are consistent with the dendrogram above.

As a side note, there are other ways to define similarity between classifiers. Since we are looking at distributions over classes, we could consider Kullback-Leibler divergence (although KL divergence is not symmetric, and hence not a metric); or we could try cosine similarity. We chose squared Euclidean distance as the most straightforward measure.

Since we’ve collected all the measurements, we can further explore how the results vary with different properties of the data sets: their size, the number of target classes, or even the “shape” of the data: narrow (few variables relative to the number of datums) or wide (many variables relative to the number of rows). To do that we built a Shiny App that lets us produce the above visualizations for different slices of the data. Since we only have 123 data sets, not all possible combinations of the data set parameters are well represented, but we can still find interesting results. For example, here we see that for small, moderately narrow data sets, logistic regression is frequently a good choice.

Overall, we noticed that random forest and gradient boosting were strong performers over a variety of data set conditions. You can explore for yourself; our app is online at https://win-vector.shinyapps.io/ExploreModels/ .

While the results we got were suggestive, and are consistent with the results of the DWN paper, there are a lot of caveats. First of all, the data sets we used do not really represent all the kinds of data sets that we might encounter in the wild. There are no text data sets in this collection, and the variables tend to be numeric rather than categorical (and when they are categorical, the current data treatment is not ideal, as we discussed above). Many of the data sets are far, far smaller than we would expect to encounter in a typical data science project, and they are all fairly clean; much of the data treatment and feature engineering was done before the data was submitted to the UCI repository.

Second, even though we used accuracy to evaluate the classifiers, this may not be the criterion you want. Classifier performance is (at least) two-dimensional; generally you are trading off one performance metric (like precision) for another (like recall). Simple accuracy may not capture what is most important to you in a classifier.

Other points: we could probably get better performance out of many of the classifiers with some per-data-set tuning, rather than picking one setting for all the data sets. It’s also worth remembering that our results are strictly observations about the `scikit-learn` implementations of these algorithms; different implementations could behave differently relative to each other. For example, the decision tree classifier does not actually return class probabilities, despite the fact that it could (R’s decision tree implementation can return class probabilities, for one). In addition, many of these implementations do not implement true multinomial classification for the multiclass case (even though in theory there may be a multinomial version of the algorithm); instead they use a set of binomial classifiers in either a one-against-one (compare all pairs of classes) or one-against-all (compare each class against all the rest) approach. It may be the case in some situations that one approach will work better than the other.

One data set characteristic that we didn’t investigate, but feel would be interesting, is the rarity of the target class (assuming there is a single target class). In fraud detection, for example, the target class is (hopefully) rare. Some classification approaches will do much better in that situation than others.

Caveats notwithstanding, we feel that the DWN paper (and our little follow-on) represent some good, useful effort toward characterizing classifier performance. We don’t expect that there is a single, one-size-fits-all, best classification algorithm, and we’d like to see some science around what types of data sets (and which problem situations or use cases) represent “sweet spots” for different algorithms. We would also like to see more studies about the similarities of different algorithms. After all, why use a computationally expensive algorithm in a situation where a simpler approach is likely to do just as well?

In short, we’d like to see further studies similar to the DWN paper, hopefully done over data sets that better represent the situations that data scientists are likely to encounter. Hopefully, exploring the current results with our Shiny app will give you some ideas for further work.

We have made our source code available on Github. The repository includes

- The Python script for scoring all the data sets against the different models
- The R script for creating the summary tables used by the Shiny app
- The code for the Shiny app itself

Some of the relative paths for reading or writing files may not be correct, but it should be easy to figure out how to fix them. You will need the `ggplot2`, `sqldf`, `rgl`, and `shinyRGL` packages to run the R code. Tutorials and documentation for Shiny, as well as directions for building and launching an app (it’s quite easy in RStudio, and not much harder without it) are available here.

The DWN paper is an interesting empirical study that measures the performance of a good number of popular classifiers (179 by their own account) on about 120 data sets (mostly from UCI).

This actually represents a bit of work, as the UCI data sets are not all in exactly the same format. The data sets have varying file names, varying separators, varying missing-value symbols, varying quoting/escaping conventions, non-machine-readable headers; some data sets have row-ids, the column to be predicted in varying positions, some data in zip files, and many other painful variations. I have always described UCI as “not quite machine readable.” Working with any one data set is easy, but the prospect of building an adapter for each of a large number of such data sets is unappealing. Combined with the fact that the data sets are often of small size, and often artificial/synthetic (designed to show off one particular inference method), few people work with more than a few of these data sets. The authors of DWN worked with well over 100 *and* shared their fully machine-readable results (`.arff` and apparently standardized `*_R.dat` files) in a convenient single downloadable tar-file (see their paper for the URL).

The stated conclusion of the paper is comforting, and not entirely unexpected: random forest methods are usually in the top 3 classifiers in terms of accuracy.

The problem is: we are always more accepting of an expected outcome. To confirm such a conclusion we will, of course, need more studies (on larger and more industry-typical data sets), better measures than accuracy (see here for some details), and a lot of digging in to methodology (including data preparation).

To be clear: I like the paper. The authors (as good scientists) publicly shared their data and a bit of their preparation code. This is something most authors do not do, and should in fact be our standard for accepting work for evaluation.

But, let us get down to quibbles. Let’s unpack the data and look at an example. Suppose we start with “car,” a synthetic data set we have often used as an example. The UCI repository supplies 3 files: car.c45-names, car.data, and car.names.

- `car.names`: Free-form description of the data set and format.
- `car.data`: Comma-separated data (without header).
- `car.c45-names`: Presumably a machine-readable header for a `C4.5` package.

The standard way to deal with this data is to (by hand) inspect `car.names` or `car.c45-names` and hand-build a custom command to load the data. Example R code to do this is given below:

```
library(RCurl)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
tab <- read.table(text=getURL(url,write=basicTextGatherer()),
                  header=FALSE,sep=',')
colnames(tab) <- c('buying', 'maint', 'doors',
                   'persons', 'lug_boot', 'safety', 'class')
options(width=50)
print(summary(tab))
```

Which (assuming `RCurl` is properly installed) yields:

```
  buying       maint       doors       persons
 high :432   high :432   2    :432   2   :576
 low  :432   low  :432   3    :432   4   :576
 med  :432   med  :432   4    :432   more:576
 vhigh:432   vhigh:432   5more:432
  lug_boot    safety      class
 big  :576   high:576   acc  : 384
 med  :576   low :576   good :  69
 small:576   med :576   unacc:1210
                        vgood:  65
```

For any one data set having to read the documentation and adapt that into custom loading code is not a big deal. However, having to do this for over 100 data sets is an effort. Let’s look into how the DWN paper did this.

The DWN paper `car` directory has 9 items:

- `car.data`: original file from UCI.
- `car.names`: original file from UCI.
- `le_datos.m`: Matlab custom data loading code.
- `car.txt`: Facts about the data set.
- `car.arff`: Derived `.arff` format version of the data set.
- `car.cost`: Pricing of classification errors.
- `car_R.dat`: Derived standard tab separated values file with header.
- `conxuntos.dat`: Likely a result file.
- `conxuntos_kfold.dat`: Likely a result file.

The files I am interested in are `car_R.dat` and `le_datos.m`. `car_R.dat` looks to be a TSV (tab separated values) file with header, likely intended to be read into R. It looks like the file is in a very regular format with row numbers, feature columns first (and named `f*`), and the category to be predicted last (named `clase` and re-encoded as an integer). Notice that all features (which in this case were originally strings or factors) have been re-encoded as floating point numbers. That is potentially a problem. Let’s try to dig into how this conversion may have been done. We look into `le_datos.m` and see the following code fragment:

```
for i_fich=1:n_fich
  f=fopen(fich{i_fich}, 'r');
  if -1==f
    error('erro en fopen abrindo %s\n', fich{i_fich});
  end
  for i=1:n_patrons(i_fich)
    fprintf(2,'%5.1f%%\r', 100*n_iter++/n_patrons_total);
    for j = 1:n_entradas
      t= fscanf(f,'%s',1);
      if j==1 || j==2
        val={'vhigh', 'high', 'med', 'low'};
      elseif j==3
        val={'2', '3', '4', '5-more'};
      elseif j==4
        val={'2', '4', 'more'};
      elseif j==5
        val={'small', 'med', 'big'};
      elseif j==6
        val={'low', 'med', 'high'};
      end
      n=length(val); a=2/(n-1); b=(1+n)/(1-n);
      for k=1:n
        if strcmp(t,val{k})
          x(i_fich,i,j)=a*k+b; break
        end
      end
    end
    t = fscanf(f,'%s',1); % lectura da clase
    for j=1:n_clases
      if strcmp(t,clase{j})
        cl(i_fich,i)=j; break
      end
    end
  end
  fclose(f);
end
```

It looks like for each categorical variable the researchers have hand-coded an ordered choice of levels. Then each level is replaced by an equally spaced code-number from `-1` through `1` (using the linear rule `x(i_fich,i,j)=a*k+b`). Then (in code not shown) possibly more transformations are applied to the numeric variables (such as centering and scaling to unit variance). This changes the original data, which looks like this:
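A sketch of that re-encoding rule in Python, reconstructed from the Matlab fragment above (`encode_levels` is my name for it, not theirs):

```python
def encode_levels(levels):
    """Map n ordered levels to equally spaced values on [-1, 1],
    mirroring the a*k+b rule in le_datos.m (k is the 1-based level index)."""
    n = len(levels)
    a = 2.0 / (n - 1)
    b = (1.0 + n) / (1.0 - n)
    return {lev: a * k + b for k, lev in enumerate(levels, start=1)}

codes = encode_levels(['vhigh', 'high', 'med', 'low'])
print(round(codes['vhigh'], 6), round(codes['low'], 6))  # -1.0 1.0
# The +/-1.34 values seen in car_R.dat suggest a later centering/scaling step.
```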

```
  buying maint doors persons lug_boot safety class
1  vhigh vhigh     2       2    small    low unacc
2  vhigh vhigh     2       2    small    med unacc
3  vhigh vhigh     2       2    small   high unacc
4  vhigh vhigh     2       2      med    low unacc
5  vhigh vhigh     2       2      med    med unacc
6  vhigh vhigh     2       2      med   high unacc
```

to this:

```
        f1       f2       f3       f4       f5       f6 clase
1 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439 -1.22439     1
2 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439        0     1
3 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439  1.22439     1
4 -1.34125 -1.34125 -1.52084 -1.22439        0 -1.22439     1
```

It appears as if one of the machine learning libraries the authors are using only accepts numeric features (I think some of the Python `scikit-learn` methods have this limitation), or the authors believe they are using such a package. Whoever prepared this data seemed to be unaware that the standard way to convert categorical variables to numeric is the introduction of multiple indicator variables (see page 33 of chapter 2 of Practical Data Science with R for more details).

Indicator variables encoding US Census reported levels of education.

The point is: encoding multiple levels of a categorical variable into a single number may seem reversible to a person (as it is a 1-1 map), but some machine learning methods cannot undo the geometric detail lost in such an encoding. For example: with a linear method (be it regression, logistic regression, a linear SVM, or so on) we lose explanatory power unless the encoding has properly guessed both the correct order of the attributes *and* the relative magnitudes. Even tree-based methods (like decision trees, or even random forest) waste part of their explanatory power (roughly, degrees of freedom) trying to invert the encoding (leaving less power remaining to explain the original relation in the data). This sort of ad-hoc encoding may not cause much harm in this one example, but it is exactly what you don’t want to do when there are a great number of levels, when the order isn’t obvious, or when you are comparing different methods (as different methods are damaged to different degrees by this encoding).
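For contrast, the indicator-variable encoding recommended above, sketched with pandas (`sklearn.preprocessing.OneHotEncoder` is another option):

```python
import pandas as pd

# One column per level; no artificial order or magnitude is imposed.
d = pd.DataFrame({'buying': ['vhigh', 'high', 'med', 'low']})
indicators = pd.get_dummies(d, columns=['buying'])
print(sorted(indicators.columns))
# ['buying_high', 'buying_low', 'buying_med', 'buying_vhigh']
```

Each row now has a 1 in exactly one of the four columns, so a linear model can fit each level its own coefficient.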

This sort of “convert categorical features through an arbitrary function” step is something we have seen a few times. It is one of the reasons we explicitly discuss indicator variables in “Practical Data Science with R,” despite the common wisdom that “everybody already knows about them.” When you are trying to get the best possible results for a client, you don’t want to inflict avoidable errors in your data transforms.

If you absolutely don’t want to use indicator variables, consider impact coding or a safe automated transform such as vtreat. In both cases the actual training data is used to try to estimate the order and relative magnitudes of an encoding that would be useful for downstream modeling.

Is there any actual damage in this encoding? Let’s load the processed data set and see.

```
url2 <- 'http://winvector.github.io/uciCar/car_R.dat'
dTreated <- read.table(url2,
                       sep='\t',header=TRUE)
```

The original data set supports a pretty good logistic regression model for unacceptable cars:

```
set.seed(32353)
train <- rbinom(dim(tab)[[1]],1,0.5)==1
m1 <- glm(class=='unacc'~buying+maint+doors+persons+lug_boot+safety,
          family=binomial(link='logit'),
          data=tab[train,])
tab$pred <- predict(m1,newdata=tab,type="response")
print(table(class=tab[!train,'class'],
            unnacPred=tab[!train,'pred']>0.5))
##        unnacPred
## class   FALSE TRUE
##   acc     181   18
##   good     30    0
##   unacc    22  577
##   vgood    35    0
```

The transformed data set does not support as good a logistic regression model:

```
m2 <- glm(clase==1~f1+f2+f3+f4+f5+f6,
          family=binomial(link='logit'),
          data=dTreated[train,])
dTreated$pred <- predict(m2,newdata=dTreated,type="response")
print(table(class=dTreated[!train,'clase'],
            unnacPred=dTreated[!train,'pred']>0.5))
##      unnacPred
## class FALSE TRUE
##     0    35    0
##     1    43  556
##     2   118   81
##     3    28    2
```

Now obviously some modeling methods are more sensitive to this mis-coding than others. In fact, for a moderate number of levels you would expect random forest methods to actually invert the coding. But the fact that some methods are more affected than others is one reason why you don’t want to perform this encoding before making comparisons. As to the question of why one would ever use logistic regression: because when you have a proper encoding of the data and the model structure is in fact somewhat linear, logistic regression can be a very good method.

In the DWN paper 8 data sets (out of 123) have the `a*k+b` fragment in their `le_datos.m` file. So likely the study was largely driven by data sets that natively have only numeric features. Also, we emphasize that the DWN paper shared its data and a bit of its methods, which puts it light-years ahead of most published empirical studies. The only reason we can’t critique other authors in the same way is that many other authors don’t share their work.

It always surprises statisticians that the indicator variable trick is not always first in mind. This means we forget to teach and re-teach the method enough. We also need to do more to root out the incorrect alternatives to the method. The problem is sometimes hard to point out, as indicator encoding is either not done correctly or done silently.

In `R`, `strings` and `factors` can be treated as single columns or variables and are silently converted during model training and application (or can be explicitly built using `model.matrix()`). Oddly enough, `R` also goes out of its way to provide a publicly visible “convert to numbers by using interior codes” method (`data.matrix()`) which in my opinion is almost *always* the wrong method and lures unsuspecting programmers and engineers into error. I have written on this before, but if anything failed to fully appreciate the pervasive nature of the incorrect practice.

`Python`’s `scikit-learn` supplies the correct encoding methods in `sklearn.feature_extraction.DictVectorizer` and `sklearn.preprocessing.OneHotEncoder`. I think a lot of Python users get confused because they do not appreciate that `Pandas` (which deals so well with data representation) and `scikit-learn` (which really only wants to work with numbers) are two independent packages (coded not to depend on each other), and some work is required to faithfully move data from one package to the other.

Note: as expected, `randomForest` does better at reversing the re-encoding. Also, we accidentally left out the variable `f6` in an early version of this post.

```
library(randomForest)
m1F <- randomForest(as.factor(class=='unacc')~
                      buying+maint+doors+persons+lug_boot+safety,
                    data=tab[train,])
tab$predF <- predict(m1F,newdata=tab,type="response")
print(table(class=tab[!train,'class'],
            unnacPred=tab[!train,'predF']))
##        unnacPred
## class   FALSE TRUE
##   acc     193    6
##   good     30    0
##   unacc     9  590
##   vgood    35    0
m2F <- randomForest(as.factor(clase==1)~f1+f2+f3+f4+f5+f6,
                    data=dTreated[train,])
dTreated$predF <- predict(m2F,newdata=dTreated,type="response")
print(table(class=dTreated[!train,'clase'],
            unnacPred=dTreated[!train,'predF']))
##      unnacPred
## class FALSE TRUE
##     0    35    0
##     1    10  589
##     2   193    6
##     3    30    0
```

And we can confirm the encoding is in fact reversible by showing which variables and outcomes are in bijective correspondence. This means something as simple as changing the `type/class` declaration from `real` to `string/factor` would undo the coding problem. The machine learning doesn’t need to know the original names of the levels; it just needs to know to treat the data as levels.

```
print(table(tab$class,dTreated$clase))
##            0    1    2    3
##   acc      0    0  384    0
##   good     0    0    0   69
##   unacc    0 1210    0    0
##   vgood   65    0    0    0
print(table(tab$buying,dTreated$f1))
##         -1.34125 -0.447084 0.447084 1.34125
##   high         0       432        0       0
##   low          0         0        0     432
##   med          0         0      432       0
##   vhigh      432         0        0       0
print(table(tab$maint,dTreated$f2))
##         -1.34125 -0.447084 0.447084 1.34125
##   high         0       432        0       0
##   low          0         0        0     432
##   med          0         0      432       0
##   vhigh      432         0        0       0
print(table(tab$doors,dTreated$f3))
##         -1.52084 -0.168982 0.506946 1.18287
##   2          432         0        0       0
##   3            0       432        0       0
##   4            0         0        0     432
##   5more        0         0      432       0
print(table(tab$persons,dTreated$f4))
##        -1.22439    0 1.22439
##   2         576    0       0
##   4           0  576       0
##   more        0    0     576
print(table(tab$lug_boot,dTreated$f5))
##        -1.22439    0 1.22439
##   big         0    0     576
##   med         0  576       0
##   small     576    0       0
print(table(tab$safety,dTreated$f6))
##        -1.22439    0 1.22439
##   high        0    0     576
##   low       576    0       0
##   med         0  576       0
```

Typeset in The Future has a great example of semiotic sign standards in *Alien* (including the infamous “Purina alien chow” symbol).
