Manning.com is offering FREE shipping with code `SHIP35` (US residents only). Use this link to purchase: http://www.manning.com/?a_aid=zm.

Manning.com is also offering 50% off all eBooks and 35% off all print books. Take advantage of this great deal here http://www.manning.com/?a_aid=zm and learn some new skills!

Please share!


This is an example of regularizing a neural net by its activations (values seen at nodes), conditioned on outcome.

The idea is: at the last layer of a binary classification neural net we ideally want the prediction to be 1 when the training outcome is 1, and 0 when the training outcome is 0. Our improvement is to generalize this training criterion so that it can be applied to the interior layers of the neural net, as follows.

Suppose our training examples are indexed by i for i = 1,…,m and our neural net nodes are labeled by j = 1,…,n. y(i) is the training outcome, which we assume is 0/1. d(j) is the depth of the layer the j-th node is in, ranging from 1 to k: depth 1 denotes the input layer (the explanatory variables) and depth k the output layer. We assume, without loss of generality, that the node with j = n is the unique node of depth k; i.e. it is the output or prediction. Define a(j, i) as the activation or value seen at node j (the value after the node’s transfer and activation) for training example i; so a(n, i) is the output or prediction of the neural net for example i. Note that a(j, i) varies as the neural net weights are updated during training: we are looking at a(j, i) for a given value of the weights, and will need to re-compute a(j, i) after any training step or weight update.

We can generalize the empirical “match the outputs” condition to the following. For node j define the y-conditioned variance quantities as:

```
mean(j, Y) := sum_{i = 1,...,m; y(i) = Y} a(j, i) / sum_{i = 1,...,m; y(i) = Y} 1
cross(j) := (mean(j, 0) - mean(j, 1))^2
var(j, i) := (a(j, i) - mean(j, y(i)))^2
rat(j, i) := var(j, i) / cross(j)
```

The intent is that rat(j, i) should look a lot like a supervised version of the Calinski-Harabasz variance ratio criterion. A small value of rat(j, i) can be taken to mean that, for a given j, most of the variation in a(j, i) is associated with variation in y(i). mean(j, 0), mean(j, 1), and cross(j) are all many-example aggregates; for simplicity we will estimate them per-batch, and thus prefer large batch sizes for better estimates.
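The per-batch estimate above can be sketched in numpy for a single node (a minimal sketch; the function name `rat` and the small `eps` guard against division by zero are our own additions, not part of the original definitions):

```python
import numpy as np

def rat(a_j, y, eps=1e-12):
    """Per-batch rat(j, i) for one node: activations a_j (shape [m])
    and 0/1 outcomes y (shape [m])."""
    mean0 = a_j[y == 0].mean()                # mean(j, 0)
    mean1 = a_j[y == 1].mean()                # mean(j, 1)
    cross = (mean0 - mean1) ** 2              # cross(j)
    mean_yi = np.where(y == 1, mean1, mean0)  # mean(j, y(i)) per example
    var = (a_j - mean_yi) ** 2                # var(j, i)
    return var / (cross + eps)                # rat(j, i)

# toy batch where the activations separate the two classes fairly well
y = np.array([0, 0, 1, 1])
a_j = np.array([0.1, 0.2, 0.8, 0.9])
r = rat(a_j, y)
```

On such well-separated activations every rat(j, i) is small, as the variation around the class means is tiny relative to the squared gap between the class means.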

A typical objective function for a binary classification problem is to minimize the following cross-entropy.

```
loss(i) := - y(i) log(a(n, i)) - (1 - y(i)) log(1 - a(n, i))
total_loss := sum_{i = 1,...,m} loss(i)
```

Minimizing total_loss tends to also make var(n, i) small relative to cross(n). This is because the loss tries to concentrate a(n, i) near 1 for all i such that y(i) is 1, and a(n, i) near 0 for all i such that y(i) is 0. For a perfect fit this would imply var(n, i) = 0 and cross(n) = 1. So we can consider adding rat(n, i) as an auxiliary term to our loss/objective function. Of course this doesn’t yet add much, as total_loss is already a good objective function on the last-layer or prediction activations.
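As a concrete check, the cross-entropy above can be computed directly in numpy (a minimal sketch; the clipping guard against log(0) is our own addition):

```python
import numpy as np

def total_loss(pred, y, eps=1e-12):
    """total_loss = sum_i -y(i) log(a(n, i)) - (1 - y(i)) log(1 - a(n, i))."""
    p = np.clip(pred, eps, 1 - eps)  # guard against log(0)
    loss = -y * np.log(p) - (1 - y) * np.log(1 - p)
    return loss.sum()

y = np.array([0.0, 0.0, 1.0, 1.0])
good = np.array([0.1, 0.2, 0.8, 0.9])  # concentrated near the outcomes
bad = np.array([0.5, 0.5, 0.5, 0.5])   # uninformative predictions
```

On this toy batch the concentrated predictions yield the smaller loss, matching the discussion above.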

Now we try something new that may have advantages: add rat(j, i) for intermediate j as additional terms for our objective function. Define our new regularized loss as:

```
w(j) := (d(j) - 1) / ((k - 1) sum_{a = 2,...,n-1; d(a) = d(j)} 1)
regularized_loss(i) := loss(i) + alpha sum_{j = 2,...,n-1} w(j) rat(j, i)
```

The weight w(j) is chosen so that each node in a layer has the same weight, and so that the early layers, near the explanatory/input variables, get weaker regularization. alpha is a hyper-parameter specifying the strength of the regularization. Notice also that regularized_loss(i) is per-example; we deliberately have not summed it up.
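The weighting scheme can be sketched in numpy (an illustrative sketch; `layer_weights` is our own helper name, taking the interior-node depths directly):

```python
import numpy as np

def layer_weights(depths, k):
    """w(j) for interior nodes j = 2,...,n-1: each node's weight is
    (d(j) - 1) / (k - 1), split evenly among the nodes of its layer."""
    depths = np.asarray(depths)
    # for each node, count how many interior nodes share its depth
    layer_sizes = np.array([(depths == d).sum() for d in depths])
    return (depths - 1) / ((k - 1) * layer_sizes)

# toy net with k = 4 layers: two interior nodes at depth 2, one at depth 3
w = layer_weights([2, 2, 3], k=4)
```

Notice the two depth-2 nodes split their layer's total weight of 1/3, while the lone depth-3 node keeps its layer's full 2/3, so the deeper layer is regularized more strongly.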

The overall idea is: later layers in the neural net should have unsurprising values given the training outcome. So we are adding a simple norm, one of many possible, to try to enforce that. Or: to reduce over-fit, try to minimize non-explanatory variation in the network activations. A variation of the idea is to make values at each layer unsurprising given the values at the layer after it. One can also think of this as taking ideas from the stationary points of an action, with an appropriate stand-in for a Lagrangian or, if we added some useful symmetries, a gauge-like principle.

In this project we demonstrate the effect on a simple data set using Keras and Tensorflow. Now, Keras doesn’t idiomatically supply a simple interface for regularizing activations on all layers. Keras does have generic loss-functions and per-layer weight regularizers, but attempting to code this effect into those interfaces is going against their intent/design. So we use a couple of engineering tricks to get Keras to do the work for us.

- We include the dependent or outcome variable y(i) in our neural net input.
- We build a layer called TrimmingLayer that strips out the y(i) and sends the rest of the inputs for normal processing.
- We build a special layer called ScoringLayer that collects the outputs of all the layers (including the original normal prediction, and the extra outcome value y(i)) and computes the square-root of the regularized loss we have described above. Some debug-logging of how the adaptation is realized can be found here.
- We use an adapter to allow sklearn-style training via Keras/Tensorflow: we train the above neural net as a regression (using square-residual loss) against a target that is identically zero. The idea is that in the adapted network the final layer's output is the regularized loss, which ideally is zero. The original informative true classification outcome is still part of the net input, though isolated from the standard portion of the net by the TrimmingLayer. Only the outer regression is told the outcome is to be all zero.
- After the training we copy the layer weights from the above exotic regression network into a standard network that can then be used to make actual predictions.
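The first two tricks, carrying the outcome along as an extra input column and trimming it back off, amount to simple array slicing. A numpy sketch of the idea (just the concept, not the actual Keras layers):

```python
import numpy as np

# pretend training data: 4 examples, 3 explanatory columns
X = np.arange(12.0).reshape(4, 3)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# append the outcome as one extra input column
X_aug = np.hstack([X, y])

# the "TrimmingLayer" step: slice the outcome back off; note the
# [:, -1:] slice keeps a column shape (rank 2) rather than a bare vector
X_trimmed = X_aug[:, :-1]
y_carried = X_aug[:, -1:]
```

In the actual network the trimmed portion feeds the standard layers, while the carried outcome column is routed only to the scoring layer.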

The entirety of the above process is demonstrated in SmoothedNet.ipynb. The original example is adapted from Jason Brownlee’s “Binary Classification Tutorial with the Keras Deep Learning Library”, which we reproduce for clarity here.

The estimated out-of-sample performance of the y-conditional activation regularized network is graphed below. We are showing the distribution of predictions conditioned on actual outcome.

This y-conditional regularized network had an accuracy of about 89%.

The non-regularized version of the network had the following performance.

The non-regularized, or original, network had an accuracy of about 84%.

We haven’t seen a truly braggable improvement yet (evaluation is noisy, and our regularization introduces one more hyper-parameter), so we need to try this regularization on more data sets and deeper neural nets (where we think the effects will be more pronounced).

Of course, sometimes it takes a while to figure out how to do this. Please read on for a great R matrix lookup problem and solution.

In R we can specify operations over vectors. For arithmetic this is easy, but for some more complex operations you “need to know the trick.”

Patrick Freeman (@PTFreeman) recently asked: what is the idiomatic way to look up a bunch of values from a matrix by row and column keys? This is actually easy to do if we first expand the data matrix into RDF-style triples. Once our data is in that format we can merge/join it against our desired row/column indices.

Let’s start with an example data matrix.

```
# example matrix data
m <- matrix(1:9, nrow = 3)
row.names(m) <- c('R1' ,'R2', 'R3')
colnames(m) <- c('C1', 'C2', 'C3')
knitr::kable(m)
```

|    | C1 | C2 | C3 |
|----|----|----|----|
| R1 | 1  | 4  | 7  |
| R2 | 2  | 5  | 8  |
| R3 | 3  | 6  | 9  |

And here is our data frame containing the indices we want to look up.

```
# row/columns we want
w <- data.frame(
  i = c('R1', 'R2', 'R2'),
  j = c('C2', 'C3', 'C2'))
knitr::kable(w)
```

| i  | j  |
|----|----|
| R1 | C2 |
| R2 | C3 |
| R2 | C2 |

That is: we want to know the matrix values from [R1, C2], [R2, C3], and [R2, C2].

The trick is: how do we convert the matrix into triples? digEmAll has a great solution to that here.

```
# unpack into 3-column format from:
# https://stackoverflow.com/a/9913601
triples <- data.frame(
  i = rep(row.names(m), ncol(m)),
  j = rep(colnames(m), each = nrow(m)),
  v = as.vector(m))
knitr::kable(triples)
```

| i  | j  | v |
|----|----|---|
| R1 | C1 | 1 |
| R2 | C1 | 2 |
| R3 | C1 | 3 |
| R1 | C2 | 4 |
| R2 | C2 | 5 |
| R3 | C2 | 6 |
| R1 | C3 | 7 |
| R2 | C3 | 8 |
| R3 | C3 | 9 |

What the above code has done is: write each entry of the original matrix as a separate row with the original row and column ids landed as new columns. This data format is very useful.

The above code is worth saving as a re-usable snippet, as getting it right is a clever step.

Now we can vectorize our lookup using the merge command, which produces a new joined table where the desired values have been landed as a new column.

```
res <- merge(w, triples, by = c('i', 'j'), sort = FALSE)
knitr::kable(res)
```

| i  | j  | v |
|----|----|---|
| R1 | C2 | 4 |
| R2 | C3 | 8 |
| R2 | C2 | 5 |

And that is it: we have used vectorized and relational concepts to look up many values from a matrix very quickly.
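For comparison, a sketch of the analogous triples-then-join lookup in Python/pandas (the same idea, with `melt` standing in for the `rep` trick; the variable names mirror the R example but are our own):

```python
import pandas as pd

# the same example matrix, as a data frame with named rows
m = pd.DataFrame({'C1': [1, 2, 3], 'C2': [4, 5, 6], 'C3': [7, 8, 9]},
                 index=['R1', 'R2', 'R3'])

# unpack into (i, j, v) triples
triples = (m.reset_index()
            .rename(columns={'index': 'i'})
            .melt(id_vars='i', var_name='j', value_name='v'))

# the row/column pairs we want
w = pd.DataFrame({'i': ['R1', 'R2', 'R2'], 'j': ['C2', 'C3', 'C2']})

# vectorized lookup via a relational join; how='left' preserves w's row order
res = w.merge(triples, on=['i', 'j'], how='left')
```

The resulting `res` carries the looked-up values in its `v` column, in the same order as the requests in `w`.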

This means the `:=` variant of `unpack[]` is now easy to install.

Please give it a try!

vtreat is a system for preparing messy real world data for predictive modeling tasks (classification, regression, and so on). In particular it is very good at re-coding high-cardinality string-valued (or categorical) variables for later use.

A nice introductory video lecture on vtreat can be found here, and the latest copy of the lecture slides here. Or, you can check out chapter 8, “Advanced data preparation”, of Zumel, Mount, *Practical Data Science with R*, 2nd Edition, Manning 2019, which covers the use of vtreat.

The vtreat documentation is organized by task (regression, classification, multinomial classification, and unsupervised), language (R or Python), and interface style (design/prepare, or fit/prepare). In particular the R code now supports variations of the interfaces, allowing users to choose what works best with their coding style: either design/prepare (which is very fluid when combined with wrapr::unpack notation) or fit/prepare (which uses mutable state to organize steps).

- **Regression**: `Python` regression example; `R` regression example, fit/prepare interface; `R` regression example, design/prepare/experiment interface.
- **Classification**: `Python` classification example; `R` classification example, fit/prepare interface; `R` classification example, design/prepare/experiment interface.
- **Unsupervised tasks**: `Python` unsupervised example; `R` unsupervised example, fit/prepare interface; `R` unsupervised example, design/prepare/experiment interface.
- **Multinomial classification**: `Python` multinomial classification example; `R` multinomial classification example, fit/prepare interface; `R` multinomial classification example, design/prepare/experiment interface.

Please read on for our justification.

The issue we are facing is Chesterton’s Fence:

> In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”

How this appears in software or data science projects is often: “harmless cleanup” steps break your project, and you don’t detect this until much later.

The Chesterton’s Fence parable always amused me, as it doesn’t have an actual example of adverse consequences (though I always mis-remember it as having one). Nobody who does actual work is in fact careful enough or knowledgeable enough to always avoid removing Chesterton’s fence as a matter of foresight. However, in hindsight you often can see the problem. Luckily, version control is a time machine that translates common hindsight into more valuable foresight: you can travel to before a mistake, with knowledge of the consequences of making such a mistake.

So, let’s add a minor data science example.

I’ve recently been playing around with a Keras/Tensorflow project, which I will probably write up later. At some point I “cleaned up” the code by replacing an unsightly tensor slice of the form `x[:, (j-1):j]` with a more natural looking indexing `x[:, j-1]`. What I neglected is that Tensorflow uses the tensor rank/shape details to record the difference between a single data-column and a data-frame containing a single data-column (a small distinction that is *very* important to maintain in data science projects). This “cleanup” broke the code in a non-signaling way, as additional Tensorflow re-shaping rules allowed the calculation to move forward with incorrect values. A few changes later I re-ran the project evaluation, and the model performance fell precipitously. I had no idea why a model that recently performed well now didn’t work.

The saving grace was: I had committed at very fine granularity even during the “harmless code clean-up” using git version control. Exactly the set of commits you would be embarrassed to share. These “useless” commits saved me. I could quickly bisection search for the poison commit. The concept is illustrated in chapter 11 of Practical Data Science with R (please check it out!) as follows:

Now git is a bit of a “when you walk with it you need fear no other” protector. In the process of finding the breaking change I accidentally checked out the repository to a given version (instead of a specific file), causing the dreaded “git detached HEAD” issue in my source control repository. But the win was: that was a common researchable problem with known fixes. I was happy to trade my “why did this stop working for no reason” mystery for the routine maintenance task of repairing the repository after finding the root cause.

And that is the nature of source control or version control: it is a bunch of technical considerations that end-up being a net positive as they can save you from worse issues.

After note: a much worse, and more memorable, parable on the value of source control is the following. I remember a mathematics master’s degree candidate at UC Berkeley losing an entire draft of her dissertation when she accidentally typed “`rm * .log`” instead of “`rm *.log`” to clean up side-effect files in her working directory. The extra space allowed the remove command to nuke important files. Without source control, this set her back a month.

For a nice lecture on the inevitability of errors (and thus why we need to mitigate them, as they cannot be fully eliminated) I recommend The Lead Developer’s “Who Destroyed Three Mile Island” presentation.

It is often too much to ask for the data scientist to become a domain expert. However, in all cases the data scientist must develop strong *domain empathy* to help define and solve the right problems.

Interested? Please check it out.

I am now sharing a note that works all of the above as specific examples: “Multiple Split Cross-Validation Data Leak” (a follow-up to our larger article “Cross-Methods are a Leak/Variance Trade-Off”).


Also, we have translated the Python vtreat steps from our recent “Cross-Methods are a Leak/Variance Trade-Off” article into R vtreat steps here.

This R-port demonstrates the new-to-R fit/prepare notation!

We want vtreat to be a platform-agnostic (works in R, works in Python, works elsewhere), well-documented, standard methodology.

To this end: Nina and I have re-organized the basic vtreat use documentation as follows:

- **Regression**: `R` regression example, fit/prepare interface; `R` regression example, design/prepare/experiment interface; `Python` regression example.
- **Classification**: `R` classification example, fit/prepare interface; `R` classification example, design/prepare/experiment interface; `Python` classification example.
- **Unsupervised tasks**: `R` unsupervised example, fit/prepare interface; `R` unsupervised example, design/prepare/experiment interface; `Python` unsupervised example.
- **Multinomial classification**: `R` multinomial classification example, fit/prepare interface; `R` multinomial classification example, design/prepare/experiment interface; `Python` multinomial classification example.