I am just going to add a few additional references (mostly from Nina) and some more discussion on log-normal distributions versus Zipf-style distributions or Pareto distributions.

In analytics, data science, and statistics we often assume we are dealing with nice, tightly concentrated distributions such as the normal or Gaussian distribution. Analysis tends to be very easy in these situations and to not require much data. However, for many quantities of interest (wealth, company sizes, sales, and many more) it becomes obvious that we cannot be dealing with such a distribution. The telltale sign is usually when relative error is more plausible than absolute error. For example, it is much more plausible that we know our net worth to within plus or minus 10% than to within plus or minus $10.

In such cases you have to deal with the consequences of somewhat wilder distributions, such as (at the least) the log-normal distribution. In fact this is the important point, and I suggest you read Nina’s article for motivation, explanation, and methods. We have found the article useful both in working with data scientists and in working with executives and other business decision makers. The article formalizes ideas all of these people already “get” or anticipate into concrete examples.
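As a minimal sketch of why the log-normal shows up with relative-error quantities (synthetic data, my own construction, not from Nina’s article): a quantity built from many small *multiplicative* shocks is roughly log-normal, so percentage error is the natural error model.

```r
# Each synthetic "net worth" is a product of 50 independent shocks near 1,
# so its logarithm is a sum of small terms and is approximately normal.
set.seed(2016)
n <- 10000
worth <- replicate(n, prod(exp(rnorm(50, mean = 0, sd = 0.1))))
# heavily right-skewed on the raw scale, roughly symmetric on the log scale
skewRaw <- mean((worth - mean(worth))^3) / sd(worth)^3
skewLog <- mean((log(worth) - mean(log(worth)))^3) / sd(log(worth))^3
print(c(rawSkew = skewRaw, logSkew = skewLog))
```

The raw-scale skew is large and positive while the log-scale skew is near zero, which is the log-normal signature.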

In addition to those trying to use mathematics to make things clearer, there is a mystic sub-population of mathematicians who try to use mathematics to make things more esoteric. They are literally disappointed when things make sense. For this population it isn’t enough to see if switching from a normal to a log-normal distribution will fix the issues in their analysis. They want to move on to even more exotic distributions such as the Pareto (which has even more consequences), with or without any evidence of such a need.

The issue is: in a log-normal distribution we see rare large events much more often than in a standard normal distribution. Modeling this can be crucial, as it tells us not to be lulled into too strong a sense of security by small samples. This concern *can* be axiomatized into “heavy tailed” or “fat tailed” distributions, but be aware: these distributions tend to be more extreme than what is implied by a relative error model. The usual heavy-tail examples are Zipf-style distributions or Pareto distributions (people tend to ignore the truly nasty example, the Cauchy distribution, possibly because it dates back to the 17th century and thus doesn’t seem hip).

The hope seems to be that one is saving the day by bringing in new esoteric or exotic knowledge such as fractal dimension or Zipf’s law. The actual fact is this sort of power-law structure has been known for a very long time under many names. Here are some more references:

- “Power laws, Pareto distributions and Zipf’s law”, Mark Newman, Complex Systems 899, Winter 2006: Theory of Complex Systems.
- “The Long Tail”, Chris Anderson, Wired 10.01.04.
- “A Brief History of Generative Models for Power Law and Lognormal Distributions”, Michael Mitzenmacher, Internet Mathematics Vol. 1, No. 2: 226-251.
- “Zipf’s word frequency law in natural language: a critical review and future directions”, Steven T. Piantadosi, June 2, 2015.
- “On the statistical laws of linguistic distribution”, Vitold Belevitch, Annales de la Société Scientifique de Bruxelles, vol. 73, 1959, pp. 310–326.
- “Living in a Lognormal World,” Nina Zumel, Win-Vector blog, February 3, 2010.

Reading these we see that the relevant statistical issues have been well known since at least the 1920s (so they were not a new discovery by the later loud and famous popularizers). The usual claim of old wine in new bottles is that there is some small detail (and mathematics is a detailed field) that is now set differently. To this I put forward a quote from Banach (from *Adventures of a Mathematician*, S. M. Ulam, University of California Press, 1991, page 203):

Good mathematicians see analogies between theorems or theories, the very best ones see analogies between analogies.

Drowning in removable differences and distinctions is the world of the tyro, not the master.

From Piantadosi we have:

The apparent simplicity of the distribution is an artifact of how the distribution is plotted. The standard method for visualizing the word frequency distribution is to count how often each word occurs in a corpus, and sort the word frequency counts by decreasing magnitude. The frequency f(r) of the r’th most frequent word is then plotted against the frequency rank r, yielding typically a mostly linear curve on a log-log plot (Zipf, 1936), corresponding to roughly a power law distribution. This approach— though essentially universal since Zipf—commits a serious error of data visualization. In estimating the frequency-rank relationship this way, the frequency f(r) and frequency rank r of a word are estimated on the same corpus, leading to correlated errors between the x-location r and y-location f(r) of points in the plot.

Let us work through this one detailed criticism using R (all synthetic data/graphs found here). We start with the problem and a couple of observations.

Suppose we are running a business and organize our sales data as follows. We compute what fraction of our sales each item is (be it a count, or be it in dollars) and then rank them (item 1 is top selling, item 2 is next, and so on).

The insight of the Pareto-ists and Zipfians is that if we plot sales intensity (probability or frequency) as a function of sales rank we are in fact very likely to get a graph that looks like the following:

Instead of all items selling at the same rate, we see the top-selling item can often make up a significant fraction of the sales (such as 20%). There are a lot of 80/20 rules based on this empirical observation.
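A sketch of this claim with synthetic Zipfian “sales” (my own construction, not the article’s exact code): with `K` items, a Zipf law gives the item of rank `r` a sales frequency proportional to `1/r`, so the top item carries a large share on its own.

```r
# Synthetic Zipfian sales frequencies for K items.
K <- 500
freq <- (1/seq_len(K)) / sum(1/seq_len(K))
print(freq[1])   # share of the top-selling item: roughly 0.15 for K = 500
plot(seq_len(K), freq, xlab = "sales rank", ylab = "sales frequency")
```

The resulting plot hugs the axes exactly as described: nearly all the visual weight sits in the first few ranks.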

Notice also the graph is fairly illegible: the curve hugs the axes and most of the visual space is wasted. The next suggestion is to plot on “log-log paper,” or plot the logarithm of frequency as a function of the logarithm of rank. That gives us a graph that looks like the following:

If the original data is Zipfian distributed (as it is in the artificial example) the graph becomes a very legible straight line. The slope of the line is the important feature of the distribution and is (in a very loose sense) the “fractal dimension” of this data. The mystics think that by identifying the slope you have identified some key esoteric fact about the data and can then somehow “make hay” with this knowledge (though they never go on to explain how).
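Continuing the synthetic Zipf example from above (my own construction): on log-log axes the rank/frequency relation is a straight line, and fitting that line recovers the Zipf exponent, the slope being the “fractal dimension”-like quantity just discussed.

```r
# Fit the log-log line for the synthetic Zipf frequencies.
K <- 500
freq <- (1/seq_len(K)) / sum(1/seq_len(K))
logRank <- log(seq_len(K))
fit <- lm(log(freq) ~ logRank)
print(coef(fit)[["logRank"]])   # slope: exactly -1 for this Zipf law
plot(logRank, log(freq), xlab = "log(rank)", ylab = "log(frequency)")
abline(fit, col = "red")
```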

Chris Anderson in his writings on the “long tail” (including his book) clearly described a very practical use of such graphs. Suppose, instead of assuming the line on log-log plots is a consequence of something special, we take it as a consequence of something mundane. Maybe graphs tend to look like this for catalogs, sales, wealth, company sizes, and so on. So instead of saying the perfect fit is telling us something, look at defects in the fit. Perhaps they indicate something. For example: suppose we are selling products online and something is wrong with a great part of our online catalogue. Perhaps many of the products don’t have pictures, don’t have good descriptions, or have some other common defect. We might expect our rank/frequency graph to look more like the following:

What happened is that after product 20 something went wrong. In this case (because the problem happened early, at an important low rank) we can see it, but it is even more legible on the log-log plot.

The business advice is: look for that jump, sample items above and below the jump, and look for a difference. As we said, the difference could be no images on such items, no free shipping, or some other sensible business impediment. The reason we care is that this large population of low-volume items could represent a non-negligible fraction of sales. Below is the theoretical graph if we fixed whatever is wrong with the rarer items and plotted sales:

From this graph we can calculate that the missing sales represent a loss of about 32% of revenue. If we could service these sales cheaply we would want them.
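As a sketch of this sort of repair calculation (synthetic, with my own made-up parameters, so the loss fraction here differs from the 32% in the article’s example): suppose items of rank greater than 20 sell at only a tenth of their ideal Zipfian rate.

```r
# Compare ideal Zipfian sales frequencies against a damaged long tail.
K <- 500
ideal <- (1/seq_len(K)) / sum(1/seq_len(K))   # frequencies if nothing is wrong
broken <- ideal
broken[21:K] <- broken[21:K] / 10             # defect suppresses the long tail
lostShare <- 1 - sum(broken) / sum(ideal)     # fraction of potential sales lost
print(lostShare)
```

With these parameters the suppressed tail costs over 40% of potential sales, which is the point: the rare items jointly matter.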

In the above I used a theoretical Zipfian world to generate my example. But suppose the world isn’t Zipfian (there are many situations where log-normal is a much more plausible model). Just because the analyst wishes things were exotic (requiring their unique heroic contribution) doesn’t mean they are in fact exotic. Log-log paper is legible because it reprocesses the data fairly violently. As Piantadosi said: we may see patterns in such plots that are features of the analysis technique, and not features of the world.

Suppose the underlying sales data is log-normal distributed instead of Zipfian distributed (a plausible assumption until eliminated). If we had full knowledge of every possible sale for all time, we could make a log-log plot over all time and get the following graph.

What we want to point out is: this is not a line. The hook down at the right side means that rare items have far fewer sales than a Zipfian model would imply. It isn’t just a bit of noise to be ignored. This means when one assumes a Zipfian model one is assuming the rare items as a group are in fact very important. This may be true or may be false, which is why you want to measure such a property and not assume it one way or the other.

The above graph doesn’t look so bad. The honest empiricist may catch the defect and say it doesn’t look like a line (though obviously a quantitative test of distributions would also be called for). But this graph was plotting all sales over all time; we would never see that. Statistically we usually model observed sales as a sample drawn from this larger ideal sampling population. Let’s take a look at what that graph may look like. An example is given below.

I’ll confess, I’d have a hard time arguing this wasn’t a line. It may or may not be a line, but it is certainly not strong evidence of a non-line. This data did not come from a Zipfian distribution (I know I drew it from a log-normal distribution), yet I would have a hard time convincing a Zipfian that it wasn’t from a Zipfian source.
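The sampling setup just described can be sketched as follows (synthetic, my own construction): item popularities drawn log-normally, a sample of 1000 sales, and frequency and rank both estimated from that same sample.

```r
# Draw a sample of sales from a log-normal item-popularity model and
# plot the estimated rank/frequency relation on log-log axes.
set.seed(52352)
nItems <- 1000
popularity <- rlnorm(nItems, meanlog = 0, sdlog = 2)  # true item weights
sales <- sample.int(nItems, size = 1000, replace = TRUE,
                    prob = popularity)
estFreq <- as.numeric(sort(table(sales), decreasing = TRUE)) / length(sales)
plot(log(seq_along(estFreq)), log(estFreq),
     xlab = "log(estimated rank)", ylab = "log(estimated frequency)")
```

The result tends to look disturbingly line-like even though the generating distribution is log-normal, not Zipfian, which is exactly the plotting artifact at issue.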

And this brings us back to Piantadosi’s point. We used the same sample to estimate both sales frequencies and sales ranks. Neither of those is actually known to us (we can only estimate them from samples). And when we use the same sample to estimate both, they necessarily come out very related due to the sampling procedure. Some of the biases seem harmless, such as frequency being monotone decreasing in rank (which is true for the unknown true values). But remember: relations that are true in the full population are not always true in the sample. Suppose we had a peek at the answers and, instead of estimating the ranks, took them from the theoretical source. In this case we could plot true rank versus estimated frequency:

This graph is much less orderly because we have eliminated some of the plotting bias which was introducing its own order. There are still analysis artifacts visible, but that is better than hidden artifacts. For example, the horizontal strips are items that occurred with the same frequency in our sample, but had different theoretical ranks. In fact our sample is size 1000, so the rarest frequency we can measure is 1/1000, which creates the lowest horizontal stripe. The neatness of the previous graph was dots standing on top of each other, as we estimated frequency as a function of rank.

We are not advocating specific changes; we are just saying the log-log plot is a fairly refined view, and as such many of its features are details of processing, not all correctly inferred or estimated features of the world. Again, for a more useful applied view we suggest Nina Zumel’s “Living in a Lognormal World.”


Classic machine learning (especially as it is taught in classes) emphasizes a nice safe static environment where you are given some unchanging data and are asked to produce a nice predictive model one time. It is formally easier than causal inference or statistical inference, as being right often is enough, no matter what the reason. It lives in an overly idealized world where one implicitly assumes the following simplifying assumptions:

- The world does not know you are trying to model it (and so can’t take counter-measures, for ideas see Ling Huang, Anthony D. Joseph, Blaine Nelson, Benjamin I.P. Rubinstein, and J. D. Tygar. 2011. Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence (AISec ’11). ACM, New York, NY, USA, 43-58. DOI=http://dx.doi.org/10.1145/2046684.2046692).
- Your model has no effect on the world (positive or negative: see
*Weapons of Math Destruction*, Cathy O’Neil, Crown (September 6, 2016) for some discussion).

Adversarial machine learning is the formal name for studying what happens when one concedes even a slightly more realistic alternative to assumptions of these types (harmlessly called “relaxing assumptions”).

At startup.ml’s adversarial machine learning conference Dr. Alyssa Frazee gave a good talk on her work at Stripe. One point she was particularly clear on: once you actually start using your model, in a sense you become an additional adversary.

Her example was denying payment requests. Suppose you have a model that for a transaction `x` returns an estimate `pfraud(x)`, the estimated probability that a payment request is fraudulent. Further suppose you set up your business rules to refuse all transactions `x` where `pfraud(x) ≥ T`, where `T` is a chosen threshold. Then after running your system for a while you will no longer have any recent observations on the behavior of transactions where your model thinks `pfraud(x) ≥ T` (as you never let them through!). In particular you can no longer assess your false-positive rate in a meaningful way, as you are no longer collecting outcome data on items your classifier thinks are in the fraud class.

I don’t want to try to explain the setup or solution any further, as Alyssa Frazee developed it very well and very concretely, and I assume we will be hearing more of her speaking and writing in the future.

The solution suggested is standard, clever, simple, and clear: intentionally let some of the `pfraud(x) ≥ T` cases through to see what happens (though, if possible, spend on some additional measures to mitigate potential loss on these) and then use inverse probability weighting to adjust the impact of these test cases. The idea is: if you are letting through these “I should have rejected these” items at a rate of 1 in 100 (instead of the full rejection rate of 0 in 100), then each of these requests in fact represents a collection of 100 similar requests; so replicate each of them 100 times in your data and you have an estimate of what would have happened had you followed all of these cases through to the end.

The above may sound “dangerous and expensive,” but I’ve never seen anything safer or cheaper that actually works reliably. And it is classic experimental design in disguise (the “accept even though I think I should reject” group can be thought of as having been marked “control” before scoring).

There is a tempting (but very wrong) alternative of treating the data marked as potentially fraudulent as confirmed fraudulent during re-training (something that can actually happen in semi-supervised learning if you are not careful). I wrote on the dangers of this (incorrect) alternative in my praise of a famous joke (DO NOT USE) method called the data enrichment method.

It is not surprising that the correct adjustment is already well known to statisticians; statistics is largely a field of trying to reliably extract meaningful summaries and inferences from a potentially hostile data environment. This distinction is why I say machine learning stands out from statistics as being a more optimistic (meaning more naive) field.

Nina Zumel and I definitely mulled over the possibilities for some time before deciding to write *Practical Data Science with R*, Nina Zumel, John Mount, Manning 2014.

In the end we worked very hard to organize and share a lot of good material in what we feel is a very readable manner. But I think the first author may have been signaling and preparing a bit earlier than I was aware that we were writing a book. Please read on to see some of her prefiguring work.

- September 4, 2012 “On Writing Technical Articles for the Nonspecialist”
- September 19, 2012 “On Being a Data Scientist”
- October 11, 2012 “I Write, Therefore I Think”
- December 6, 2012 “Good News: We’re Writing a Book!”

Suppose we have the task of predicting an outcome `y` given a number of variables `v1,..,vk`. We often want to “prune variables,” or build models with fewer than all the variables. This can speed up modeling, decrease the cost of producing future data, improve robustness, improve explainability, reduce over-fit, and improve the quality of the resulting model.

For some informative discussion on such issues please see the following:

- How Do You Know if Your Data Has Signal?
- How do you know if your model is going to work?
- Variable pruning is NP hard

In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.

To be truly effective in applied fields (such as data science) one often has to use (with care) methods that “happen to work” in addition to methods that “are known to always work” (or at least be aware that you are always competing against such); hence the interest in mere heuristics.

Let \(L(m;S)\) denote the estimated loss (or badness of performance, so smaller is better) of a model for \(y\) fit using modeling method \(m\) and the variables \(v_i : i \in S\). Let \(d(m;a)\) denote the portion of \(L(m;\{ \})-L(m;\{ a \} )\) credited to the variable \(v_a\). This could be the change in loss, something like \(\mathrm{effectsize}(v_a)\), or \(-\log(\mathrm{significance}(v_a))\); in all cases *larger* is considered better.

For practical variable pruning (during predictive modeling) our intuition often implicitly relies on the following heuristic arguments.

- \(L(m;S)\) is monotone non-increasing in \(S\): we expect \(L(m;S \cup \{ a \} )\) to be no larger than \(L(m;S)\). Note this may be achievable “in sample” (on training data), but it is often false if \(L(m;S)\) accounts for model complexity or is estimated on out-of-sample data (itself a good practice).
- If \(L(m;S \cup \{ a \} )\) is significantly lower than \(L(m;S)\) then we will be lucky enough to have \(d(m;a)\) not too small.
- If \(d(m;a)\) is not too small then we will be lucky enough to have \(d(\mathrm{lm};a)\) non-negligible (where the modeling method \(\mathrm{lm}\) is linear regression or logistic regression).

Intuitively we are *hoping* (for ease of calculation) that variable utility has a roughly diminishing-return structure and that at least some non-vanishing fraction of a variable’s utility can be seen in simple linear or generalized linear models. Obviously this cannot be true in general (interactions in decision trees being a well-known situation where variable utility can increase in the presence of other variables, and there are many non-linear relations that escape detection by linear models). Synergy is a good thing; we just would hate to miss it, and one way to prove we don’t miss it would be to know it isn’t there. We will show there is in fact synergy, so naive methods may in fact miss it.

However, if the above were true (or often nearly true) we could effectively prune variables by keeping only the set of variables \(\left\{ a \; \left| \; d(\mathrm{lm};a) \; \text{is non negligible} \right. \right\}\). This is a (user controllable) heuristic built into our `vtreat` R package, and it proves to be quite useful in practice.

I’ll repeat: we feel that on real-world data you can use the above heuristics to usefully prune variables. Complex models do eventually get into a regime of diminishing returns, and real-world engineered useful variables usually (by design) have a hard time hiding. Also, remember data science is an empirical field: methods that happen to work will dominate (even if they do not apply in all cases).
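A hand-rolled sketch of this kind of screening (an illustration of the idea only, not `vtreat`’s actual implementation; `screenVars` and its threshold are my own inventions): score each variable by its single-variable linear model significance and keep the strong ones.

```r
# Keep variables whose single-variable linear fit is significant.
screenVars <- function(d, yName, varNames, pThreshold = 0.05) {
  sigs <- vapply(varNames, function(v) {
    fit <- lm(as.formula(paste(yName, "~", v)), data = d)
    summary(fit)$coefficients[2, "Pr(>|t|)"]
  }, numeric(1))
  names(sigs)[sigs <= pThreshold]
}

# Example: y depends on x1; x2 is pure noise.
set.seed(25)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 * d$x1 + rnorm(100)
print(screenVars(d, "y", c("x1", "x2")))
```

The counter-example developed below shows exactly the regime where this style of screening can fail.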

For every heuristic you should crisply know whether it is true (and in fact a theorem) or false (and has counter-examples). We stand behind the above heuristics, and will show their empirical worth in a follow-up article. Let’s take some time and show that they are not in fact laws.

We are going to show that per-variable coefficient significances and effect sizes are not monotone in that adding more variables can in fact improve them.

First (using R) we build a data frame where `y` indicates `a == b` (the complement of `a xor b`, with the same non-linear structure). This is a classic example of `y` being a function of two variables but not a *linear* function of them (at least over the real numbers; it is a linear relation over the field GF(2)).

```
d <- data.frame(a=c(0,0,1,1),b=c(0,1,0,1))
d$y <- as.numeric(d$a == d$b)
```

We look at the (real) linear relations between `y` and `a`, `b`.

`summary(lm(y~a+b,data=d))`

```
##
## Call:
## lm(formula = y ~ a + b, data = d)
##
## Residuals:
## 1 2 3 4
## 0.5 -0.5 -0.5 0.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.500 0.866 0.577 0.667
## a 0.000 1.000 0.000 1.000
## b 0.000 1.000 0.000 1.000
##
## Residual standard error: 1 on 1 degrees of freedom
## Multiple R-squared: 3.698e-32, Adjusted R-squared: -2
## F-statistic: 1.849e-32 on 2 and 1 DF, p-value: 1
```

`anova(lm(y~a+b,data=d))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## a 1 0 0 0 1
## b 1 0 0 0 1
## Residuals 1 1 1
```

As we expect, linear methods fail to find any evidence of a relation between `y` and `a`, `b`. This clearly violates our hoped-for heuristics.

For details on reading these summaries we strongly recommend *Practical Regression and Anova using R*, Julian J. Faraway, 2002.

In this example the linear model fails to recognize `a` and `b` as useful variables (even though `y` is a function of `a` and `b`). From the linear model’s point of view the variables are not improving each other (so that at least looks monotone), but that is largely because the linear model cannot see the relation unless we add an interaction of `a` and `b` (denoted `a:b`).

Let us develop this example a bit more to get a more interesting counterexample.

Introduce new variables `u = a and b` and `v = a or b`. By the rules of logic we have `y == 1+u-v`, so there is a linear relation.

```
d$u <- as.numeric(d$a & d$b)
d$v <- as.numeric(d$a | d$b)
print(d)
```

```
## a b y u v
## 1 0 0 1 0 0
## 2 0 1 0 0 1
## 3 1 0 0 0 1
## 4 1 1 1 1 1
```

`print(all.equal(d$y,1+d$u-d$v))`

`## [1] TRUE`

We can now see the counter-example effect: together the variables work better than they did alone.

`summary(lm(y~u,data=d))`

```
##
## Call:
## lm(formula = y ~ u, data = d)
##
## Residuals:
## 1 2 3 4
## 6.667e-01 -3.333e-01 -3.333e-01 -1.388e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3333 0.3333 1 0.423
## u 0.6667 0.6667 1 0.423
##
## Residual standard error: 0.5774 on 2 degrees of freedom
## Multiple R-squared: 0.3333, Adjusted R-squared: 5.551e-16
## F-statistic: 1 on 1 and 2 DF, p-value: 0.4226
```

`anova(lm(y~u,data=d))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 0.33333 0.33333 1 0.4226
## Residuals 2 0.66667 0.33333
```

`summary(lm(y~v,data=d))`

```
##
## Call:
## lm(formula = y ~ v, data = d)
##
## Residuals:
## 1 2 3 4
## 5.551e-17 -3.333e-01 -3.333e-01 6.667e-01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0000 0.5774 1.732 0.225
## v -0.6667 0.6667 -1.000 0.423
##
## Residual standard error: 0.5774 on 2 degrees of freedom
## Multiple R-squared: 0.3333, Adjusted R-squared: 0
## F-statistic: 1 on 1 and 2 DF, p-value: 0.4226
```

`anova(lm(y~v,data=d))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## v 1 0.33333 0.33333 1 0.4226
## Residuals 2 0.66667 0.33333
```

`summary(lm(y~u+v,data=d))`

```
## Warning in summary.lm(lm(y ~ u + v, data = d)): essentially perfect fit:
## summary may be unreliable
```

```
##
## Call:
## lm(formula = y ~ u + v, data = d)
##
## Residuals:
## 1 2 3 4
## -1.849e-32 7.850e-17 -7.850e-17 1.849e-32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.00e+00 1.11e-16 9.007e+15 <2e-16 ***
## u 1.00e+00 1.36e-16 7.354e+15 <2e-16 ***
## v -1.00e+00 1.36e-16 -7.354e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.11e-16 on 1 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.056e+31 on 2 and 1 DF, p-value: < 2.2e-16
```

`anova(lm(y~u+v,data=d))`

```
## Warning in anova.lm(lm(y ~ u + v, data = d)): ANOVA F-tests on an
## essentially perfect fit are unreliable
```

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 0.33333 0.33333 2.7043e+31 < 2.2e-16 ***
## v 1 0.66667 0.66667 5.4086e+31 < 2.2e-16 ***
## Residuals 1 0.00000 0.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

In this example we see synergy instead of diminishing returns. Each variable becomes better in the presence of the other. This is on its own good, but it indicates variable pruning is harder than one might expect, even for a linear model.

We can get around the above warnings by adding some rows to the data frame that don’t follow the designed relation. We can even draw rows from this frame to show the effect on a “more row independent looking” data frame.

```
d0 <- d
d0$y <- 0
d1 <- d
d1$y <- 1
dG <- rbind(d,d,d,d,d0,d1)
set.seed(23235)
dR <- dG[sample.int(nrow(dG),100,replace=TRUE),,drop=FALSE]
summary(lm(y~u,data=dR))
```

```
##
## Call:
## lm(formula = y ~ u, data = dR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8148 -0.3425 -0.3425 0.3033 0.6575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.34247 0.05355 6.396 5.47e-09 ***
## u 0.47235 0.10305 4.584 1.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4575 on 98 degrees of freedom
## Multiple R-squared: 0.1765, Adjusted R-squared: 0.1681
## F-statistic: 21.01 on 1 and 98 DF, p-value: 1.349e-05
```

`anova(lm(y~u,data=dR))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 4.3976 4.3976 21.01 1.349e-05 ***
## Residuals 98 20.5124 0.2093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`summary(lm(y~v,data=dR))`

```
##
## Call:
## lm(formula = y ~ v, data = dR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7619 -0.3924 -0.3924 0.6076 0.6076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7619 0.1049 7.263 9.12e-11 ***
## v -0.3695 0.1180 -3.131 0.0023 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4807 on 98 degrees of freedom
## Multiple R-squared: 0.09093, Adjusted R-squared: 0.08165
## F-statistic: 9.802 on 1 and 98 DF, p-value: 0.002297
```

`anova(lm(y~v,data=dR))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## v 1 2.265 2.26503 9.8023 0.002297 **
## Residuals 98 22.645 0.23107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`summary(lm(y~u+v,data=dR))`

```
##
## Call:
## lm(formula = y ~ u + v, data = dR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8148 -0.1731 -0.1731 0.1984 0.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.76190 0.08674 8.784 5.65e-14 ***
## u 0.64174 0.09429 6.806 8.34e-10 ***
## v -0.58883 0.10277 -5.729 1.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3975 on 97 degrees of freedom
## Multiple R-squared: 0.3847, Adjusted R-squared: 0.3721
## F-statistic: 30.33 on 2 and 97 DF, p-value: 5.875e-11
```

`anova(lm(y~u+v,data=dR))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 4.3976 4.3976 27.833 8.047e-07 ***
## v 1 5.1865 5.1865 32.826 1.133e-07 ***
## Residuals 97 15.3259 0.1580
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Consider the above counter-example as *exceptio probat regulam in casibus non exceptis* (“the exception confirms the rule in cases not excepted”): it roughly outlines the (hopefully labored and uncommon) structure needed to break the otherwise common and useful heuristics.

In later articles in this series we will show more about the structure of model quality and show the above heuristics actually working very well in practice (and adding a lot of value to projects).


In each case I was treated very politely, listened to, and actually got fixes back in a very short timeframe from volunteers. That is really great and probably one of the many reasons R is a great ecosystem.

Please read on for my list of `n=3` interactions.

- While discussing plotting market data I ran into a corner-case with ggplot2. Even though I figured out how to work around it, it is now fixed by the ggplot2 team!
- I wrote an entire article denouncing a default setting of a single argument in the ranger random forest library. The ranger author himself replied with a fix that is very clever and mathematically well-founded (I suspect he had been researching this issue for a while on his own).
- I complained about summary presentation fidelity in base R’s `summary.default`. You guessed it: the volunteers have generously fielded a patch!

Like any real-world system, R represents a sequence of history and compromises. Only unused systems can be perfect and without compromise. It is very evident how eager and able the volunteers who maintain R are to make sure it represents very good compromises.

I would like to offer a sincere appreciation and thank-you from me to the R community. If this is what you can expect when using R, it is yet another strong argument for R.

And personal thanks to: Martin Maechler, Hadley Wickham, and Marvin N. Wright.

That being said, we have a lot of powerful and effective heuristics to discuss in upcoming articles. I am going to leave such positive results for my later articles and here concentrate on an instructive negative technical result: picking a good subset of variables is theoretically quite hard.

When we say something is “theoretically hard” we mean we can contrive instances of it that encode instances of other problems thought to be hard. Thus the ability to solve arbitrary instances of our problem would let us solve arbitrary instances of the thought-to-be-hard problem. This is a technical statement and doesn’t mean we don’t know how to do a good job on the problem. It just means it would be incredibly noteworthy to claim efficient complete optimality in *all* possible cases.

Let `Z` denote the set of integers and `Q` denote the rational numbers. The problem we are considering is:

INSTANCE: An integer `T`, an integer `K`, and a data set `x(i),y(i)` with `x(i) in Z^n` and `y(i) in Z` for `i=1,...,m`.

QUESTION: Is there `B0 in Q` and `B in Q^n` such that `sum_{i=1,...,m} (B0 + B.x(i) - y(i))^2 ≤ T` and no more than `K` entries of `B` are non-zero?

Call this problem “size `K` regression model of quality `T`” (or “sKrT” for short). Phrasing sKrT as a decision problem is a mere technical detail; we consider answering whether there is a sKrT solution to be pretty much equivalent to finding such solutions. The input is taken to be integers for technical reasons, and one can approximate various real-number problems by scaling and rounding.

The hope is that sKrT makes precise the goal one hopes stepwise regression is approximating: finding a good model for `y` using only `K` of the `x` variables (at least on training data; there are also issues of multiple comparisons to consider).

What I would like to point out is: solving sKrT is at least as hard as NP. That is: if we could always answer the sKrT question quickly and correctly we could solve arbitrary problems in the complexity class NP (itself thought to be difficult).

The quickest way to see sKrT is likely hard is through the classic reference “Computers and Intractability: A Guide to the Theory of NP-Completeness”, Michael R. Garey and David S. Johnson, W. H. Freeman and Company, 1979. Their problem number MP5, “Minimum Weight Solution to Linear Equations”, would be easy to solve given the ability to quickly solve sKrT instances. Formally MP5 is defined as:

INSTANCE: Finite set `X` of pairs `(x,b)` where `x` is an `m`-tuple of integers and `b` is an integer, and a positive integer `K ≤ m`.

QUESTION: Is there an `m`-tuple `y` with rational entries such that `y` has at most `K` non-zero entries and such that `x . y = b` for all `(x,b) in X`?

The encoding is trivial. We encode an MP5 instance as an analogous sKrT instance with `T=0` and one additional row of the form `(x(i)=0 in Z^n, y(i)=0 in Z)` (which pushes `B0` to `0`). It should be obvious that checking for a zero sum of squared error linear regression is at least as powerful as checking for solvability of linear equations.

This means an intuition that MP5 may be hard becomes an intuition that sKrT may be hard.

All a hardness result seems to prohibit is a “magic wand” approach that always returns perfect answers quickly (and it doesn’t actually prohibit it, but it means it would be very big news to find and certify such a magic wand). In many cases one can find approximately best solutions with high probability. Essentially this is a signal that in discussing variable selection it makes sense to consider heuristic and empirical results (trust methods that have tended to work well). In our follow-up articles we will discuss why you would want to prune down to `K` variables (to speed up algorithms, cut down on over-fit, and more) and effective pruning techniques (though Nina Zumel has already shared useful notes here).

`vtreat` is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

Very roughly: `vtreat` accepts an arbitrary “from the wild” data frame (with different column types, `NA`s, `NaN`s, and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames (all independent variables numeric and free of `NA`s, `NaN`s, infinities, and so on), ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations such as random forest, and also bring a danger of statistical over-fitting), and it leaves the analyst more time to incorporate domain-specific data preparation (as `vtreat` tries to handle as much of the common stuff as practical). For more of an overall description please see here.

We suggest all users update (and you will want to re-run any “design” steps instead of mixing “design” and “prepare” steps from two different versions of `vtreat`).

For what is new in version 0.5.27 please read on.

`vtreat` 0.5.27 is a maintenance release. User-visible improvements include:

- Switching `catB` encodings to a logit scale (instead of the previous log scale).
- Increasing the degree of parallelism by separately parallelizing the level pruning steps (using the methods outlined here).
- Changing the default for `catScaling` to `FALSE`. We still think working in logistic link-space is a great idea for classification problems; we are just not fully satisfied that un-regularized logistic regressions are the best way to get there (largely due to issues of separation and quasi-separation). In the meantime we think working in an expectation space is the safer (and now default) alternative.
- Falling back to `stats::chisq.test()` instead of insisting on `stats::fisher.test()` for large counts. This calculation is used for level pruning and is only relevant if `rareSig < 1` (the default is `1`). We caution that setting `rareSig < 1` remains a fairly expensive setting. We are trying to make significance estimation much more transparent; for example, we now return how many extra degrees of freedom are hidden by categorical variable re-encodings in a new score frame column called `extraModelDegrees` (found in `designTreatments*()$scoreFrame`).

The idea is that having data preparation as a re-usable library lets us research, document, optimize, and fine-tune a lot more details than would make sense on any one analysis project. The main design difference from other data preparation packages is that we emphasize “y-aware” (or outcome-aware) processing (using the training outcome to generate useful re-encodings of the data).
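As a tiny illustration of the “y-aware” idea (a sketch of the concept only; the function below is our own, and `vtreat`’s actual encodings additionally handle significance pruning, novel levels, and over-fit concerns), one can re-encode a categorical column as the deviation of each level’s mean outcome from the grand mean:

```r
# Minimal "impact coding" sketch: replace each categorical level with
# how far that level's mean outcome sits from the overall mean outcome.
impact_code <- function(xcol, y) {
  grand <- mean(y)
  level_means <- tapply(y, xcol, mean)
  as.numeric((level_means - grand)[as.character(xcol)])
}

d <- data.frame(zip = c("a", "a", "b", "b", "c"),
                y   = c(10, 12, 20, 22, 15))
impact_code(d$zip, d$y)  # -4.8 -4.8  5.2  5.2 -0.8
```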

We have pre-rendered a lot of the package documentation, examples, and tutorials here.

My criticism of R’s `summary()` method is this: it is unfaithful to numeric arguments (due to bad default behavior) and frankly it should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of trade-offs. `summary()` likely represents good work by high-ability researchers; the sharp edges are due to historically necessary trade-offs.
The Big Lebowski, 1998.

Please read on for some context and my criticism.

Edit 8/25/2016: Martin Maechler generously committed a fix! Assuming this works out in testing it looks like we could see an improvement on this core function in April 2017. I really want to say “thank you” to Martin Maechler and the rest of the team, not only for this, but for all the things they do, and for putting up with me.

My group has been doing a lot more professional training lately. This is interesting because bright students really put a lot of interesting demands on how you organize and communicate. They want things that make sense (so they can learn them), that are powerful (so it is worth learning them), and that are *regular* (so they can compose them and move beyond what you are teaching). Students are less sympathetic to implementation history and unstated conventions, as new users tend not to benefit from them. Remember: a new `R` student is still deciding if they want to use `R`; to them it is new, so an instructor needs to defend `R`’s current trade-offs (not its evolutionary path). We find it is best to point out both what is great in `R` and what isn’t great (versus skipping such, or worse, trying to justify such portions).

Please keep this in mind when I demonstrate what goes wrong when one attempts to teach R’s `summary()` function to the laity.

Suppose you had a list or vector of numbers in R. It would be useful to be able to produce and view some summaries or statistics about these numbers. The primary way to do this in R is to call the `summary()` method. Here is an example:

```
numbers <- 1:7
print(numbers)
## [1] 1 2 3 4 5 6 7
summary(numbers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0     2.5     4.0     4.0     5.5     7.0
```

From the names attached to the results you can get the meanings and move on. But the whole time you are hoping none of your students calls `summary()` on a single number. Because if they do, they have a *very good* chance of seeing `summary()` fail. And now you have broken trust in `R`.

Let’s tack into the wind and demonstrate the failure:

```
summary(15555)
##  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
## 15560   15560  15560  15560   15560  15560
```

`summary()` is claiming the minimum value of the set of numbers `c(15555)` is `15560`. Now this is a deliberately trivial example where we can see what is going on (it sure looks like presentation rounding). To make matters worse, this isn’t just confusion generated during presentation: the actual values are wrong.

```
str(summary(15555))
## Classes 'summaryDefault', 'table'  Named num [1:6] 15560 15560 15560 15560 15560 ...
##  ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
summary(15555)[['Min.']] == min(15555)
## [1] FALSE
```

It may seem silly to expect the slots from a `summary()` call on a vector to be used in calculation (when we have direct functions such as `quantile()` and `mean()` for getting the same results), but using values from summaries of models is standard practice in R. The trivial linear model summary `summary(lm(y~0,data.frame(y=15555)))` shows rounded results (though it appears to hold accurate results and only round during presentation; use `unclass()` to inspect the actual values).

This *is* in fact a problem. You can say it is a consequence of the “default settings of `summary()`” and that it is my fault for not changing those settings. But frankly it is quite fair to expect the default settings to be safe and sane.

Let us also appeal to authority:

The many computational steps between original data source and displayed results must all be truthful, or the effect of the analysis may be worthless, if not pernicious. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive.

John Chambers, Software for Data Analysis: Programming with R, Springer 2008.

The point is you are delegating work to your system. If it needlessly fails (no matter how trivially) when observed, how can you trust it when unobserved? John Chambers’ point is that trust is very expensive to build up, so you really don’t want to squander it.

I used to try to “lecture this away” as just being “rounding in the presentation for neatness.” But this runs into two objections:

- Why doesn’t the presentation hint at this by switching to scientific notation such as `1.556e+4`?
- If `summary()` “is just presentation” wouldn’t it be a string?

We are losing substitutability. We would love to be able to say to students that “`summary()` is a convenient shorthand and you can treat the following as equivalent”:

`summary(x)[['Min.']] == min(x)`

`summary(x)[['1st Qu.']] == quantile(x,0.25)`

`summary(x)[['Median']] == median(x)`

`summary(x)[['Mean']] == mean(x)`

`summary(x)[['3rd Qu.']] == quantile(x,0.75)`

`summary(x)[['Max.']] == max(x)`
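For a vector where four-significant-digit rounding happens to be harmless these identities do hold, which is what makes the failure case so insidious. The check below is our own illustration:

```r
x <- 1:7  # small values: four significant digits lose nothing here
s <- summary(x)
c(s[['Min.']]    == min(x),
  s[['1st Qu.']] == quantile(x, 0.25),
  s[['Median']]  == median(x),
  s[['Mean']]    == mean(x),
  s[['3rd Qu.']] == quantile(x, 0.75),
  s[['Max.']]    == max(x))
## all TRUE for this x, but not for x <- 15555
```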

But the above isn’t always the case. What we would like is for `summary()` to contain these values exactly and get pretty printing by using the S3 or S4 object system to override the `print()` method. It is quite likely `summary()` predates these object systems, and so achieved pretty printing through rounding of the values themselves.
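To be concrete about the alternative, here is a sketch of the design we would prefer (the class and function names are our own invention, not anything in base R): store the exact values and confine all rounding to the `print()` method.

```r
# Exact values stored; rounding happens only at display time via S3 dispatch.
exact_summary <- function(x) {
  qq <- stats::quantile(x)
  v <- c(qq[1L:3L], mean(x), qq[4L:5L])
  names(v) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
  class(v) <- "exact_summary"
  v
}

print.exact_summary <- function(x, digits = max(3, getOption("digits") - 3), ...) {
  print(signif(unclass(x), digits))  # neat display, without corrupting the values
  invisible(x)
}

s <- exact_summary(15555)
s                          # prints the familiar rounded display
s[["Min."]] == min(15555)  # TRUE: the stored minimum is exact
```

With this arrangement students get the tidy printed table, while `s[["Min."]]` stays substitutable for `min(x)`.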

We can take a look at the actual code and see what is happening. We are looking for a reason, not an excuse.

From `help(summary)` we see `summary()` takes a `digits` option with default value `digits = max(3, getOption("digits")-3)` (let’s not even get into why setting `digits` directly does one thing while the system default is shifted by `3`). `getOption("digits")` returns `7` on my machine, so we are asking for four-digit rounding, which is consistent with what we saw. Digging through the dispatch rules we can eventually determine that for a numeric vector `summary()` eventually calls `summary.default()`. By calling `print(summary.default)` we can look at the code. The offending snippet is:

```
qq <- stats::quantile(object)
qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
```

After computing the quantiles, `summary()` calls `signif()` to round the results. `R` isn’t inaccurate; it just went out of its way to round the results.
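The arithmetic is easy to confirm directly: with the default `digits` of `4` (from `max(3, 7 - 3)`), `signif()` reproduces exactly the bad “minimum” we saw.

```r
max(3, 7 - 3)                  # the default digits argument: 4
signif(15555, 4)               # 15560: 15555 rounded to four significant digits
signif(15555, 4) == min(15555) # FALSE, the rounded value is simply wrong
```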

One reason this article is long is that the behavior we are describing breaks expectations. So we end up having to document what is actually going on (a laborious process) instead of being able to rely on shared, educated expectations. The whining is where actualities and expectations diverge.

`summary()` attempts to achieve neatness and legibility. This is a laudable goal, if achievable. Numeric analysis is not so simple that rounding can safely achieve such a goal.

It is well known that rounding is not a safe or faithful operation (it loses information, and can be catastrophic if naively applied in many stages of a complex calculation). Because it is obvious rounding is dangerous, sophisticated students are surprised that it defaults to “on” in common calculations without indication or warning (such as moving to scientific notation). `summary()` compounds this error by returning rounded values (instead of rounding only at `print`/presentation time). As `summary()` is often a first view of data (along with `print()`), we encounter confusing, inconsistent situations where un-rounded values (presentations of original data) and rounded values are compared.

Of course, we can (and should) teach students to call `mean(x)` and `quantile(x)` rather than `summary(x)` when they want to reuse the summary statistics. But then we have to explain *why*. After seeing something like this it becomes an unfortunate additional teaching goal to convince students that more of `R` doesn’t behave like `summary()`.

For your convenience here they are in order:

- A gentle introduction to parallel computing in R
- Running R jobs quickly on many machines
- Can you nest parallel operations in R?

Please check it out, and please do Tweet/share these tutorials.

We have been sharing a series of tutorials on `R`’s `parallel` library (please see here for example). This is, in our opinion, a necessary step. One question that comes up over and over again is: “can you nest `parLapply`?”

The answer is “no.” This is in fact an advanced topic, but it is one of the things that pops up once you start worrying about parallel programming. Please read on for why that is the right answer and how to work around it (simulate a “yes”).

I don’t think the above question is usually given sufficient consideration (nesting parallel operations can in fact make a lot of sense). You can’t directly nest `parLapply`, but that is a different issue than whether one can invent a work-around. For example: a “yes” answer (really meaning there are work-arounds) can be found here. Again, this is a different question than “is there a way to nest `foreach` loops” (which is possible through the nesting operator `%:%`, which presumably handles working around the nesting issues in `parLapply`).

Let’s set up a concrete example, so we can ask and answer a precise question. Suppose we have a list of jobs (coming from an external source) that we will simulate with the code fragment below.

`jobs <- list(1:3,5:10,100:200)`

Notice the jobs have wildly diverging sizes; this is an important consideration.

Suppose the task we want to perform is to sum the square roots of the entries. The standard (non-parallel) calculation would look like the following.

```
worker1 <- function(x) {
sum(sqrt(x))
}
lapply(jobs,worker1)
```

```
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021
```

For didactic purposes please pretend that the `sum` function is very expensive and the `sqrt` function is somewhat expensive.

If it were obvious we always had a great number of small sub-lists, we would want to use parallelization to make sure we are performing many `sum`s at the same time. We would then parallelize over the first level as below.

`clus <- parallel::makeCluster(4)`

`parallel::parLapplyLB(clus,jobs,worker1)`

```
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021
```

Notice that `parallel::parLapplyLB` uses almost the same calling convention as `lapply` and returns the exact same answer.

If it were obvious we had a single large sub-list, we would want to make sure we were always parallelizing the `sqrt` operations, so we would prefer to parallelize as follows:

```
mkWorker2 <- function(clus) {
force(clus)
function(x) {
xs <- parallel::parLapplyLB(clus,x,sqrt)
sum(as.numeric(xs))
}
}
worker2 <- mkWorker2(clus)
lapply(jobs,worker2)
```

```
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021
```

(For the details of building functions and passing values to remote workers please see here.)

If we were not sure what structure we would encounter in the future, we would prefer to schedule all operations for possible parallel execution. This would minimize the number of idle resources and minimize the time to finish the jobs. Ideally that would look like the following (a nested use of `parallel`):

`parallel::parLapplyLB(clus,jobs,worker2)`

`## Error in checkForRemoteErrors(val): 3 nodes produced errors; first error: invalid connection`

Notice the above fails with an error. Wishing for flexible code is what beginners intuitively mean when they ask if you can nest parallel calls. They may not be able to explain it, but they are worried they don’t have a good characterization of the work they are trying to parallelize over. They are not asking if things get magically faster by “parallelizing parallel.”

It isn’t too hard to find out the nature of the error: the communication connection socket file descriptors (`con`) are passed as integers to each machine, but they are not valid descriptors where they arrive (they are just integers). We can see this by looking at the structure of the cluster:

`str(clus)`

```
## List of 4
## $ :List of 3
## ..$ con :Classes 'sockconn', 'connection' atomic [1:1] 5
## .. .. ..- attr(*, "conn_id")=<externalptr>
## ..$ host: chr "localhost"
## ..$ rank: int 1
## ..- attr(*, "class")= chr "SOCKnode"
## $ :List of 3
## ..$ con :Classes 'sockconn', 'connection' atomic [1:1] 6
## .. .. ..- attr(*, "conn_id")=<externalptr>
## ..$ host: chr "localhost"
## ..$ rank: int 2
## ..- attr(*, "class")= chr "SOCKnode"
## $ :List of 3
## ..$ con :Classes 'sockconn', 'connection' atomic [1:1] 7
## .. .. ..- attr(*, "conn_id")=<externalptr>
## ..$ host: chr "localhost"
## ..$ rank: int 3
## ..- attr(*, "class")= chr "SOCKnode"
## $ :List of 3
## ..$ con :Classes 'sockconn', 'connection' atomic [1:1] 8
## .. .. ..- attr(*, "conn_id")=<externalptr>
## ..$ host: chr "localhost"
## ..$ rank: int 4
## ..- attr(*, "class")= chr "SOCKnode"
## - attr(*, "class")= chr [1:2] "SOCKcluster" "cluster"
```

```
mkWorker3 <- function(clus) {
force(clus)
function(x) {
as.character(clus)
}
}
worker3 <- mkWorker3(clus)
parallel::parLapplyLB(clus,jobs,worker3)
```

```
## [[1]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"
##
## [[2]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"
##
## [[3]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"
```

What we are getting wrong is this: we can’t share control of the cluster with each worker just by passing the cluster object around. That would require some central registry and call-back scheme (which is one of the things packages like `foreach` and `doParallel` accomplish when they “register a parallel back-end to use”). Base `parallel` depends more on explicit references to the cluster data structure, so it isn’t “idiomatic parLapply” to assume we can find “the parallel cluster” (there could in fact be more than one in use at the same time).

So what is the work around?

One work-around is to move to sophisticated wrappers (like `doParallel` or even `future`; also see here).

These fixes roughly split the calculation into two phases: one dedicated to the `sqrt` step and a second dedicated to the `sum` step (remember we are pretending both of these operations are expensive). We can directly demonstrate such a reorganization as follows.

```
library('magrittr')
mkWorker4a <- function(clus) {
force(clus)
function(x) {
as.numeric(parallel::parLapplyLB(clus,x,sqrt))
}
}
worker4a <- mkWorker4a(clus)
worker4b <- function(x) {
sum(x)
}
jobs %>%
lapply(X=.,FUN=worker4a) %>%
parallel::parLapplyLB(cl=clus,X=.,fun=worker4b)
```

```
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021
```

The above depends on not too many of the sub-lists being short (and hiding opportunities for parallelism).

Another fix is to re-organize (at the cost of time, effort, and space) the calculation into two sequenced phases, each of which is parallel but not nested. It is a bit involved, but we show how to do that below (using `R`’s `Reduce` and `split` functions to reorganize the data, though one could also use so-called “tidyverse” methods). This fix is more general, but introduces reorganization overhead.

```
# Preparation 1: collect all items into one flat list
sqrtjobs <- as.list(Reduce(c,jobs))
# Phase 1: sqrt every item in parallel
sqrts <- parallel::parLapplyLB(clus,sqrtjobs,sqrt)
# Preparation 2: re-assemble new job list that needs only sums
lengths <- vapply(jobs,length,numeric(1))
pattern <- lapply(seq_len(length(lengths)),
function(i) {rep(i,lengths[[i]])})
pattern <- Reduce(c,pattern)
sumjobs <- split(sqrts,pattern)
sumjobs <- lapply(sumjobs,as.numeric)
names(sumjobs) <- names(jobs)
# Phase 2: sum all items in parallel
parallel::parLapplyLB(clus,sumjobs,sum)
```

```
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021
```

In conclusion: you can’t *directly* nest `parLapply`, but you can usefully sequence through it.

`parallel::stopCluster(clus)`
