Recently Hugh Howey shared some eBook sales data spidered from Amazon.com: The 50k Report. The data is largely a single scrape of statistics about various anonymized books. Howey’s analysis tries to break sales down by declared category and source, but there are a lot of difficulties due to the quality of the tags in the data. A lot of the questions we would like to look into (such as whether reviews drive sales or sales drive reviews) are not practical to answer without a more longitudinal data set that includes many observations on a repeated set of books over time.

However, we can try to relate one type of reported outcome (sales rank, a number Amazon visibly shares on ebook product pages) to the number of sales (a harder to find quantity). Note: we are not really doing any *predictive* modeling as we are not trying to predict future sales from features, but instead we are just trying to learn an approximate relation between two different encodings of outcomes (sales count and sales rank).

We share the steps to convert the Excel data to a usable R format here on GitHub. A quick use of the data is as follows:

```
library('RCurl')
url <- paste('https://raw.github.com/WinVector/',
             'Examples/master/',
             'AmazonBookData/amazonBookData.Rdata', sep='')
load(rawConnection(getBinaryURL(url)))
```

The data is now in a dataframe named “`d`”. The crude analysis we want to do is to relate `Kindle.eBooks.Sales.Rank` to `Daily.Units.Sold`. We will do this on “log-log” paper (where famously most anything looks like a line).

```
model <- lm(log(Daily.Units.Sold) ~ log(Kindle.eBooks.Sales.Rank),
            data=d)
d$EstLogUnitsSold <- predict(model, newdata=d)
library('ggplot2')
ggplot(data=d, aes(x=log(Kindle.eBooks.Sales.Rank))) +
  geom_point(aes(y=log(Daily.Units.Sold))) +
  geom_line(aes(y=EstLogUnitsSold))
```

The line fit looks plausible for ebooks in the sales-rank range around 200 through 150,000. Let’s take a quick look at the model:

```
print(model)

Call:
lm(formula = log(Daily.Units.Sold) ~ log(Kindle.eBooks.Sales.Rank),
    data = d)

Coefficients:
                  (Intercept)  log(Kindle.eBooks.Sales.Rank)
                      11.5063                        -0.9334
```

This is roughly saying `Daily.Units.Sold ~ exp(11.5 - 0.93*log(Kindle.eBooks.Sales.Rank))` or (with a little algebra): `Daily.Units.Sold ~ 99339.64 / Kindle.eBooks.Sales.Rank^0.93`.

This isn’t too far from the following easy rule of thumb: `Daily.Units.Sold ~ 100000 / Kindle.eBooks.Sales.Rank`. Applying this we would expect a typical ebook ranked at position 100,000 to sell about 1 copy a day. Now we don’t want to read too much into this, as fitting a line onto log-log paper is a classic example of heavy-handed econometrics (in econometrics you often force the structure of the results by model selection, see “Bad models and the end of the world” for some enjoyable vitriol on abuses of the idea).
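To see how close the rule of thumb is to the fitted line, the two formulas can be wrapped as small functions and evaluated at a few ranks. This is only a sketch: the function names `fittedUnits` and `ruleOfThumbUnits` are ours (not from the shared analysis), and the constants are the rounded values quoted above.

```r
# Fitted power law, using the rounded constants quoted in the text.
fittedUnits <- function(rank) {
  99339.64 / rank^0.93
}

# The easy rule of thumb: about 100000/rank units per day.
ruleOfThumbUnits <- function(rank) {
  100000 / rank
}

# Compare the two estimates across the range where the fit looked plausible.
ranks <- c(200, 1000, 10000, 100000)
comparison <- data.frame(rank = ranks,
                         fitted = fittedUnits(ranks),
                         ruleOfThumb = ruleOfThumbUnits(ranks))
print(comparison)
```

The two estimates agree best near the top of the list and drift apart toward the tail, where the fitted exponent of 0.93 predicts somewhat more sales than the rule of thumb.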

However this rule of thumb is consistent with Chris Anderson’s point in The Long Tail. The fact we see a plausible power law over a large range means we can (crudely) estimate the entire expected sales of an infinite sized catalog as: `sum_{rank=1...infinity} 99339.64 rank^pow`. In our case `pow=-0.93` which is `≥ -1`: meaning the sum diverges (the total is infinite). If `pow` had been something smaller (like `pow=-2`) then even an infinite catalog would only have a finite total value. But in this case the theory says the ebook distributor can grow their total revenue to just about any level, if they can add enough books cheaply (they don’t get overwhelmed by diminishing revenue returns early).
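The difference between the two exponents is easy to see numerically. The following sketch computes partial sums of the modeled daily units for growing catalog sizes; the helper name `partialSum` and the catalog sizes are our own choices for illustration.

```r
# Partial sum of the modeled daily units over a catalog of n ranks.
partialSum <- function(n, pow, scale = 99339.64) {
  sum(scale * (1:n)^pow)
}

sizes <- c(1e3, 1e4, 1e5, 1e6)
# pow = -0.93: the total keeps growing as the catalog grows.
heavyTail <- sapply(sizes, partialSum, pow = -0.93)
# pow = -2: the total levels off (bounded by scale * pi^2 / 6).
lightTail <- sapply(sizes, partialSum, pow = -2)

print(data.frame(catalogSize = sizes, heavyTail, lightTail))
```

With `pow=-0.93` each tenfold increase in catalog size adds a substantial amount to the total, while with `pow=-2` the total is essentially unchanged after the first few thousand titles.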

Amazon clearly wants the large revenue found in the popular (or “head”) books, but you can see that it is plausible they will always have more opportunity to grow their business by increasing coverage (and making the handling cheaper) of many less popular products (the so-called “long tail”). Not a new observation, but fun to be able to pull it quickly from shared data.

(Funny side note. This sort of analysis can be stretched to say that the expected lifetime sales of any book that stays in print forever is infinite. This argument only works *if* cumulative sales rank has an exponent of `-1` or larger (and Amazon seems to be using a recent sales rank, so we don’t actually have any estimate for the exponent of cumulative sales rank). Suppose our book starts at rank-A and each day k more books are written and they all are more popular than our book. Then the modeled total unit sales of our book is `sum_{rank=A,A+k,A+2k...infinity} 100000/rank` which also diverges (though would stay bounded if we added a reasonable discount term for future value). Mostly we are showing you can push these analyses way too far; to get better results you need to correctly model more of the market.)

In our new book (Practical Data Science with R) we didn’t get into the lack of pointers for a purely didactic reason. To tell a general audience (perhaps one new to scripting or programming) that they don’t need to know about pointers, we would have to first explain what pointers are (somewhat losing the cognitive savings). We settled for demonstrating R’s (primarily) call by value semantics for functions (which we already needed to explain) with the following example:

```
> vec <- c(1,2)
> fun <- function(v) { v[[2]]<-5; print(v)}
> fun(vec)
[1] 1 5
> print(vec)
[1] 1 2
```

Notice how the mutation (changing an entry to 5) does not escape the function as a side effect. Because R is a bit of a kitchen sink (everything and its opposite is pretty much available) we had to cautiously title this example as “R behaves like a call-by-value language” in our book (R in fact has a number of sharable reference structures including `environment`s, `ReferenceClasses`, lazy evaluation systems like promises/`delayedAssign`, and more). (The ugly `[[]]` notation is something we recommend as it catches a few more errors than the more common `[]` notation. For details please see appendix A of our book.)

What we didn’t discuss is that you get this sort of change isolation and safety in R in just about every situation (not just when binding values to function arguments). Here is another example (this time not from the book):

```
> vec <- c(1,2)
> v2 <- vec
> v2[[2]] <- 5
> print(v2)
[1] 1 5
> print(vec)
[1] 1 2
```

Unlike many languages the assignment “`v2 <- vec`” does not end up with `vec` and `v2` as references (or pointers) entangled to the same object. Instead they behave as if they are two different objects. This does prevent using these two symbols to communicate results (a legitimate programming practice) but it also prevents a whole host of errors and confusions that beginning programmers run into in the presence of such *shared mutability*. R protects the programmer by treating objects directly without exposing the additional ideas of references or pointers. Many ideal functional programming languages more directly expose references but mitigate their danger by insisting on immutable structures; but this requires the user to learn (in addition to data handling, statistics and programming) the fairly alien discipline of composing immutable data structures.
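For contrast, here is a small sketch of what shared mutability looks like when you do reach for one of R’s reference structures. It uses only base R (`new.env`, `assign`, `get`); the variable names `e1` and `e2` are ours for illustration.

```r
# Plain vectors behave like values: the copy is independent.
vec <- c(1, 2)
v2 <- vec
v2[[2]] <- 5
print(vec)   # vec is unchanged by the edit to v2

# Environments behave like references: both names refer to one object.
e1 <- new.env()
assign('x', 1, envir = e1)
e2 <- e1                      # no copy is made here
assign('x', 5, envir = e2)    # a change made through e2 ...
print(get('x', envir = e1))   # ... is visible through e1
```

The second half is exactly the entanglement that ordinary R values spare you from.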

We encourage beginning programmers to think of programs as organizing sequences of transformations over data. So the simpler (and fewer) the mutations are, the easier it is to reason about programs. When you program in R you are mostly working with values and not variables (which is good, as it leaves you more time to think about data). So, as much as we complain about R, it is in fact a good choice for teaching, analysis, data science and even basic scripting tasks.

However, you do eventually have to deal with the unpleasant details of side-effects and shared mutability. One place where R doesn’t hide the sharp edges from you is in *closures* (the structure R uses to represent the context of a function). Consider the following code puzzle where we wonder what gets printed by the following:

```
# make an array of 3 functions
f <- vector('list',3)
# set the i'th function to return i
for(i in 1:length(f)) {
  f[[i]] <- function() { i }
}
# apply the functions using a different loop variable
for(j in 1:length(f)) {
  print(f[[j]]())
}
```

Note this is one place where you really do need to use the uglier `[[]]` notation. In the current version of R (3.0.2) if you try to use `[]` you get the error message “cannot coerce type ‘closure’ to vector of type ‘list’.” But the puzzle is: what do you expect to be printed? If R was binding the value of `i` into the `i`’th function you would expect to see the sequence “1,2,3.” Instead each function in fact gets its value for `i` by using what is current in its capture of the evaluation environment. So this code in fact prints “3,3,3”, as this is the value `i` has after the first loop is finished. This is unfortunate, as a lot of productive programming patterns depend on capturing safe isolated values, not capturing entangled references.

This sort of puzzle may seem unpleasant and unnatural, but when pointers (and other sorts of shared references) are involved you are forced to solve this sort of puzzle to understand the meaning or semantics of a code fragment or program. It is because these puzzles are laborious that languages like R emphasize isolation, so there is much less to worry about when you try to compose useful data transformations.

Closures and environments are very powerful tools (many of R’s features are built in terms of them). And this common shared mutability of them is a huge source of confusion in many programming languages (JavaScript also has this issue, and Java only allows closures to capture final variables to try and cut down on some of the possible interference). To get the behavior we want (each function capturing the current value of `i` in its closure and not sharing a common reference) we can write the following code:

```
f <- vector('list',3)
for(i in 1:length(f)) {
  f[[i]] <- function() { i }
  e <- new.env()
  assign('i',i,envir=e)
  environment(f[[i]]) <- e
}
for(j in 1:length(f)) {
  print(f[[j]]())
}
```

And this prints 1,2,3 as we would hope. Note we are now in *very* deep programming ground (closures being at least as confusing to beginners as pointers) and no longer even thinking about data. We have to admit: we really counted to 3 the hard way.
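A less manual way to get the same isolation is to let R create the fresh environment for you. The sketch below shows two common base-R idioms, `local()` and a function factory that calls `force()`. The names `f2`, `k`, `makeConstant` and `g` are ours for illustration and are not part of the original puzzle.

```r
# Idiom 1: local() evaluates its body in a fresh environment,
# so each function captures its own copy k of the loop value.
f2 <- vector('list', 3)
for (i in seq_along(f2)) {
  f2[[i]] <- local({
    k <- i
    function() { k }
  })
}
fromLocal <- vapply(f2, function(fn) fn(), integer(1))

# Idiom 2: a function factory. force() evaluates the argument
# immediately instead of leaving it as a lazy promise.
makeConstant <- function(v) {
  force(v)
  function() { v }
}
g <- lapply(1:3, makeConstant)
fromFactory <- vapply(g, function(fn) fn(), integer(1))

print(fromLocal)
print(fromFactory)
```

Both variants do the same job as the explicit `new.env()`/`assign()` code: each function ends up with a private environment holding its own value.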

It took a little longer than we’d hoped, but we did it! *Practical Data Science with R* will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in all three formats: PDF, ePub, and Kindle.

If you haven’t yet, order it now!

(softbound 416 pages, black and white; includes access to color PDF, ePub and Kindle when available)


Let’s work with a simple but very common example. You are asked to build a classification engine for a rare event: say default in credit card accounts. In good times for well managed accounts it is easy to imagine the default rate per year could be well under 1%. In this situation you do not want to propose predicting which accounts will actually default in a given year. This may be what the client asks for, but it isn’t reasonable to presume this is always achievable. You need to talk the client out of a business process that requires perfect prediction and work with them to design a business process that works well with reasonable forecasting.

Why is such prediction hard? Usually prediction in these situations is hard because, while you usually have access to a lot of broad summary data for each account (net-worth, age, family size, number of years account has been active, patterns of borrowing, amount of health insurance, amount of life insurance, patterns of re-payment and so on), you usually do not have access to many of the factors that trigger the default. Even when you do, such variables are not available very long before the event to be predicted. Trigger events for default can include sudden illness, physical accident, falling victim to a crime and other acute set-backs. The point is: two families without health insurance may have an equally elevated probability of credit default, but until you know which family gets sick you don’t know which one is much more likely to default.

Why does everybody ask for prediction? First: good prediction would be fantastic, if they could get it. Second: most laymen have no familiar notion of classifier quality other than accuracy (and measures similar to accuracy). And if all you know is accuracy then all you are prepared to discuss is prediction. So the client is really unlikely to ask to optimize a metric they are unfamiliar with. The measures that help get you out of this rut are statistical deviance and information theoretic entropy; so you will want to start hinting at these measures early on.

How do we show the value of achievable forecasting? For this discussion we define forecasting credit default as the calculation of a good conditional probability estimate of credit default. To evaluate forecasts we need measures beyond accuracy, and measures that directly evaluate scores (without having to set a threshold to convert scores into predictions).

Back to our example. Suppose that in our population we expect 1% of the accounts to default. And we build a good forecast or scoring procedure that for 2% of the population returns a score of 0.3 and for the remaining 98% of the population returns a score near 0.01. Further suppose our scoring algorithm is well calibrated and excellent: the 2% of the population that it returns a score of 0.3 and above on actually tends to default at a rate of 30%.

Such a forecast identifies a 2% subset of the population that has a 30% chance of defaulting. Treated as a classifier it never says “yes” because it has not identified any examples that are estimated to have at least a 50% chance of defaulting (obviously we can force it to say “yes” by monkeying with scoring thresholds). So the classifier is not a silver bullet predictor. But it may (when backed with the right business process) be a fantastic forecaster: the subset it identifies is only 2% of the overall population yet has 60% of the expected defaults! Designing procedures to protect the lender from these accounts (insurance, cancellation, intervention, tighter limits, tighter payment schedule or even assistance) represents a potential opportunity to manage more than half of the lender’s losses at minimal cost. To benefit, the client must both be able to sort or score accounts and have a business process that is not forced to treat all accounts as identical.
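The arithmetic behind the 60% figure is worth writing out, since it is the kind of calculation you will want to walk a client through. This is a sketch using only the numbers from the example; the variable names are ours, and the last two quantities are derived by us rather than stated in the example.

```r
baseRate  <- 0.01   # overall default rate in the population
flagShare <- 0.02   # fraction of accounts that receive the 0.3 score
flagRate  <- 0.30   # observed default rate among those accounts

# Share of all expected defaults that fall in the flagged group.
capturedShare <- flagShare * flagRate / baseRate

# Lift: how much more default-prone the flagged group is than average.
lift <- flagRate / baseRate

# Default rate the remaining 98% must have for the 1% base rate to hold
# (our derivation; the example only says their score is "near 0.01").
restRate <- (baseRate - flagShare * flagRate) / (1 - flagShare)

# Accuracy of the thresholded classifier that always says "no default".
alwaysNoAccuracy <- 1 - baseRate

print(c(capturedShare = capturedShare, lift = lift,
        restRate = restRate, alwaysNoAccuracy = alwaysNoAccuracy))
```

Note that the always-“no” classifier is 99% accurate yet useless, while the score that never crosses the 50% threshold isolates most of the expected losses. That contrast is the reason to evaluate scores directly rather than thresholded predictions.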

As we have said: laymen tend to only be familiar with accuracy. And accuracy is not a good measure of forecasts (see: “I don’t think that means what you think it means;” Statistics to English Translation, Part 1: Accuracy Measures). What you need to do is shop through metrics before starting your project and find one that is good for your client. Finding a metric that is good for your client involves helping them specify how classifier information will be used (i.e. you have to help them design a business process). Some types of scores to try with your client include: lift, precision/recall, sensitivity/specificity, AUC, deviance, KL-divergence and log-likelihood.

Time spent researching and discussing these metrics with your client is more valuable to the client than endless tweaking and tuning of a machine learning algorithm.

For more on designing projects around good data science metrics, please see Zumel, Mount, “Practical Data Science with R”, Chapter 5 “Choosing and Evaluating Models”, which discusses many of the above metrics.


“Practical Data Science with R” started formal work in October of 2012. We had always felt the Win-Vector blog represented practice and research for such an effort, but this is when we started outlining a concrete book proposal. Most of a book proposal is specifying and limiting scope down to something that has a coherent point of view.

By May 2013 we had three chapters written and were able to launch the MEAP (Manning Early Access Program, where chapter drafts are shared with subscribers). By December 2013 the book was “content complete” (everything had been written and was accepted by initial editors and technical reviewers). Even though a lot of work had gone into writing, editing and technical review (see On writing a technical book) the pace actually picked up at this point.

We continue working with additional formal technical reviewers, proof editors, copy editors, indexers, graphic artists, layout specialists, QA readers and many more to give the book what one editor called “the sparkle the book deserves.” The MEAP now has all chapters available to subscribers, though even subscribers will not see a great number of the fixes and improvements until the final book is released.

But let’s get down to some of the numbers produced in the process of writing the book.

- Final chapter count: 11 (one chapter got moved to the appendixes).
- Page count: 416 (softbound black and white).
- Number of figures: 159.
- Number of words: about 130,000.
- Size of book text: 1.8MB.
- Number of git commits in book text repository: 742.
- Number of example code extracts: 274 (about 1.1MB).
- Size of example support site: 100MB.
- Number of git commits in example repository: 151.
- Number of book related emails in my email folder: 968.

We (Nina, myself and Manning Publications Co.) have put a *lot* into this book to make it easier for readers to get a lot out of it. We can’t wait to put it in your hands.

Just for the fun: the cover page of a book I very much respect that got me thinking about counting things.

We normally don’t write about science here at Win-Vector, but we do sometimes examine the statistics and statistical methods behind scientific announcements and issues. NASA’s new technique is a cute and relatively straightforward (statistically speaking) approach.

From what I understand of the introduction to the paper, there are two ways to determine whether or not a planet candidate is really a planet: the first is to confirm the fact with additional measurements of the target star’s gravitational wobble, or by measurements of the transit times of the apparent planets across the face of the star. Getting sufficient measurements can take time. The other way is to “validate” the planet by showing that it’s highly unlikely that the sighting was a false positive. Specifically, the probability that the signal observed was caused by a planet should be at least 100 times larger than the probability that the signal is a false positive. The validation analysis is a Bayesian approach that considers various mechanisms that produce false positives, determines the probability that these various mechanisms could have produced the signal in question, and compares them to the probability that a planet produced the signal.

The basic idea behind verification by multiplicity is that planets are often clustered in multi-planet star systems, while false positive measurements (mistaken identification of potential planets) occur randomly. Putting this another way: if false positives are random, then they won’t tend to occur together near the same star. So if you observe a star with multiple “planet signals,” it’s unlikely that all the signals are false positives. We can use that observation to quantify how much more likely it is that a star with multiple candidates actually hosts a planet. The resulting probability can be used as an improved prior for the planet model when doing the statistical validation described above.

You can read the rest of the article here.


The most commonly reported significance is the frequentist p-value. Formally the p-value is the probability that a repeat of the current experiment would show an effect at least as large as the current one, assuming the null hypothesis that there is in fact no effect present. This is frequentist because we are assuming an unknown fixed state of the world and variation in the possibility of alternative or repeated experiments. The issue is: significance tests are neither as simple as one would like nor as powerful as one would hope. Usually significance is misstated (either through sloppiness, ignorance, or malice) as being the chance the given result is false. Failure to reject the null hypothesis is only one possible source of error, so a low p-value is a necessary but in no way sufficient condition for having a good result. False positives of this sort are not reproducible and show what is called reversion to mediocrity.

The Bayesian version of such a test would assume a prior distribution of the unknown quantity and hope to infer a low posterior probability on the “no effect” alternative. This leads to a similar calculation as the frequentist, but with the ability to interpret a low probability of mistake as a high probability of success. An issue with the Bayesian analysis is you must supply priors, so your conclusion is dependent on and sensitive to your choice of priors (another possible avenue of abuse).

At best what a p-value represents is the degree of filtering the experiment has (under ideal conditions) against non-results. Run 100 experiments at p=0.05 and you expect to see about 5 results that *appear to be* good, even if there was in fact no improvement to be measured. This is unfortunately standard practice for many. It is not enough to work hard on many projects and report your good results, see: “Why Most Published Research Findings Are False” John P A Ioannidis. PLoS Med, 2005 vol. 2 (8) p. e124; and “Does your model weigh the same as a Duck?” Ajay N Jain and Ann E Cleves, J Comput Aided Mol Des, 2011 vol. 26 (1) pp. 57-67. Also shotgun style A/B testing of pointless variations is particularly problematic (see “Most winning A/B test results are illusory” Martin Goodson, qubitproducts.com, 2014). Projects like 41 blues are not only bad design, they are likely bad science.

Combine a large number of bad hypotheses with the impossibility of “accepting the null hypothesis” and you have no reason to believe any result through mere first report. The issue being: while only 5% of the tests run falsely appear to succeed, if mostly useless experiments are run it can easily be that nearly 100% of what gets published and acted on are false results. A stream of nonsense can drown out and hide more expensive and rare actual good work, if your filter is sloppy enough. Add in the inability to reproduce results and you have a large problem.
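A few lines of arithmetic show how quickly the published record fills with false results as the quality of submitted ideas drops. This is a sketch under stated assumptions: the function name `falseShare` is ours, and the 80% power figure (the chance a real effect passes the test) is an assumption chosen for illustration, not a number from the discussion above.

```r
# Fraction of "significant" results that are false, given the fraction
# of tested ideas that are truly good (priorTrue).
# alpha: significance threshold; power: assumed chance a real effect passes.
falseShare <- function(priorTrue, alpha = 0.05, power = 0.8) {
  falsePositives <- (1 - priorTrue) * alpha
  truePositives  <- priorTrue * power
  falsePositives / (falsePositives + truePositives)
}

priors <- c(0.5, 0.1, 0.01, 0.001)
print(data.frame(priorTrue = priors, falseShare = falseShare(priors)))
```

When half of the tested ideas are good, few passing results are false. When only one idea in a thousand is good, nearly everything that passes the p=0.05 filter is noise.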

Two questions we want to comment on: why would a researcher submit a bunch of bad work to testing and surely there is an easy fix?

Why are unsubstantiated work and ideas submitted for testing? Ideally testing is a means of scientific confirmation: you submit an idea that has good reasons to work in principle and then confirm the improvement in performance with a test. In fact to correctly design an A/B test you must propose the smallest difference you expect to detect. The reasons you get so many meaningless changes submitted as meaningful experiments are varied. First, A/B testing has been sold as a way to avoid bike shedding (avoiding the debate of meaningless differences by attempting to test meaningless differences). Also you get what you reward: if there is a benefit (getting a publication or bonus) for having the appearance of a good result, then you will eventually only get results that merely appear to be good. Once people figure out the appearance of success is rewarded your field becomes dominated by shotgun studies (proposing many useless variations is easier than inventing a plausible improvement) using a fixed p-threshold (p=0.05, because you are not traditionally allowed to get away with p any higher and p any lower just makes it take longer to appear to succeed).

There is an easy fix: apply the Bonferroni correction. This is just a fancy way of saying: if we allow somebody to submit 10 ideas to test and report success if any of them look good, then we need to tighten the test criterion. If we are convinced that p=0.05 is a valid threshold for a single test (which should not be automatic, just because everybody uses p=0.05 doesn’t mean you should) then we should force somebody submitting 10 tests to run each test at p=0.005 to try and compensate for their venue shopping. A possible Bayesian adjustment would be to force the prior estimate of the probability of success to fall linearly in the number of experiments run.
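The effect of the correction can be checked directly. The sketch below assumes the 10 tests are independent (an assumption we add to keep the arithmetic simple) and uses base R’s `p.adjust` with `method='bonferroni'`. The p-values in `pvals` are made up purely for illustration.

```r
alpha <- 0.05   # single-test threshold
m <- 10         # number of ideas submitted

# Bonferroni: run each of the m tests at alpha/m.
perTestAlpha <- alpha / m

# Chance of at least one false "success" among m independent null tests.
familyWiseUncorrected <- 1 - (1 - alpha)^m
familyWiseCorrected   <- 1 - (1 - perTestAlpha)^m
print(c(perTestAlpha = perTestAlpha,
        uncorrected = familyWiseUncorrected,
        corrected = familyWiseCorrected))

# Equivalent view: inflate the p-values and keep the original threshold.
# These p-values are hypothetical, not taken from any real experiment.
pvals <- c(0.004, 0.03, 0.04, 0.2, 0.35, 0.5, 0.6, 0.7, 0.8, 0.9)
adjusted <- p.adjust(pvals, method = 'bonferroni')
print(sum(pvals < alpha))      # apparent successes before correction
print(sum(adjusted < alpha))   # successes that survive the correction
```

Without the correction, ten shots at p=0.05 give roughly a 40% chance of at least one spurious win; with it, the overall chance is back under 5%.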

Tests are filters. What p-value you should use is not set in stone at p=0.05. It depends on your prior model of the distribution of items you are going to test (are we confirming experiments thought to work, or are we running through a haystack looking for a rumored needle?) and your estimates of the relative costs of type-1 versus type-2 errors (is this an early screen where false negatives are to be avoided, or a final decision where false positives are to be avoided?). With a good loss model and prior estimates it is mere arithmetic to pick an optimal p-value.

Experimental design and significance encompass the whole experimental process. To calculate correct significances you must include facts about many experiments, not just a given single experiment. You must think in terms of actual probability of correctness, not mere procedures.

Producing a revenue improving predictive model is *much* harder than mining an interesting association. And this is what we will discuss here.

The following very interesting graph is shared in “The Formation of Love”:

The data preparation is not explained in great detail (that is okay, this is a popular blog post, not a formal paper requiring a materials and methods section). But it looks like the y-axis of the graph is something like directed timeline posts per day, the x-axis is day number (relative to when a relationship is declared), and each dot represents a person or a group of people (more on this last point in a bit). The point of the graph is: there is an increase in timeline posts between people who end up declaring a relationship on Facebook and then a drop off after the declaration of relationship. This is interesting behavioral data. The unwarranted expectation is that with such a strong relation you could immediately use daily posting rates to identify who is likely to declare a relationship and when this might happen (this would be valuable as these are unobserved until “day 0”).

However, as many other commenters have noticed: the range of posts per day is very small, ranging from 1.62 well before declaration, to 1.65 right before declaration and settling to 1.54 well after declaration. And the variation away from the smoothing line seems unbelievably low (the fit is way too good). Finally the dots are not grouped at integer rates or rates with small denominators (suggesting each dot is in fact already aggregated from many users, and it is likely a jitter has been added to make the plot more legible). The possibility of aggregation is consistent with the start of the article which says: “This research has been conducted on anonymized, aggregated data.” The question is: how much aggregation is going on? Trends that can be seen in large aggregates are often very hard to apply to individuals. Or alternately very weak individual effects can be made to stand out if you aggregate enough data. There is no question in my mind we have a valid statistically significant result in front of us; the question is whether there is enough practical or “clinical significance” to immediately drive usable individual predictions.

To start our investigation let’s get an approximate copy of the data from the graph. This can be done by running the graph through an online digitizing tool such as: WebPlotDigitizer. A little work yields the data file NMsg.csv. And a small bit of R code lets us reproduce the graph.

```
# load libraries
require(RCurl)
require(ggplot2)
require(mgcv)
require(reshape2)
# Load data, and make sure x is increasing
urlBase <-
  'https://raw2.github.com/WinVector/Examples/master/MiningGap/'
mkCon <- function(nm) {
  textConnection(getURL(paste(urlBase,nm,sep='')))
}
d <- read.table(mkCon('NMsg.csv'),header=TRUE,sep=',')
d <- d[order(d$DaysBeforeAfterRelationship),]
# plot something similar to original graph
ggplot(data=d,aes(x=DaysBeforeAfterRelationship,
                  y=NumberOfTimelinePosts)) +
  geom_point() +
  stat_smooth(method = "gam", formula = y ~ s(x))
```

Which yields a graph fairly similar to the original:

We generated our intensity estimates as follows:

```
earlyIntensity <- mean(
  d[d$DaysBeforeAfterRelationship >= -100 &
    d$DaysBeforeAfterRelationship < -80,
    'NumberOfTimelinePosts'])
print(earlyIntensity)
## [1] 1.622637
nearIntensity <- mean(
  d[d$DaysBeforeAfterRelationship >= -20 &
    d$DaysBeforeAfterRelationship < 0,
    'NumberOfTimelinePosts'])
print(nearIntensity)
## [1] 1.653082
afterIntensity <- mean(
  d[d$DaysBeforeAfterRelationship >= 80 &
    d$DaysBeforeAfterRelationship < 100,
    'NumberOfTimelinePosts'])
print(afterIntensity)
## [1] 1.539452
```

And here is the problem for a predictive model. If we assume (without justification) that users are generating these messages from a Poisson model with the above given intensities, we find it would be difficult to tell from the number of posts if a user is 100 days from declaring a relationship, about to declare one, or long after declaring one. Now the Poisson assumption is unjustified (and whole families of methods, such as over-dispersed models, have been developed to work around its limits) but it is a reasonable first guess for any sort of event counts per interval problem.

Below is a bar-chart of the probability distribution (under our Poisson model with the estimated intensities) of the observed number of posts expected from a person right before they declare a relationship compared to a person long after they declared a relationship (the pair of situations most easy to tell apart by posting rates):

The issue is: the two distributions are almost indistinguishable. We can’t tell how many posts a person would make given their group (long time from declaration, about to declare, or long after declaration), therefore we can’t reliably tell what group a person is in given their post count (a sort of contrapositive version of Bayes’ law).
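One way to put a number on “almost indistinguishable” is the total variation distance between the two Poisson distributions, which bounds how well any rule could separate the two groups from a single day’s post count. This is a sketch under the same (unjustified) Poisson assumption; the intensities are the estimates printed above, and the equal-odds accuracy calculation is our own addition.

```r
nearIntensity  <- 1.653082   # estimate printed above (right before declaring)
afterIntensity <- 1.539452   # estimate printed above (long after declaring)

# Total variation distance between the two Poisson distributions.
# Counts above 50 have negligible probability at these intensities.
counts <- 0:50
tv <- 0.5 * sum(abs(dpois(counts, lambda = nearIntensity) -
                    dpois(counts, lambda = afterIntensity)))

# Best achievable accuracy when guessing the group from one day's count,
# assuming the two groups are equally likely a priori.
bestAccuracy <- 0.5 + tv / 2

print(c(totalVariation = tv, bestAccuracy = bestAccuracy))
```

The distance comes out well under 0.05, so even the best possible single-day rule does only slightly better than a coin flip.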

The code to make the bar chart is as follows:

```
# What would individuals generating posts
# as a Poisson process with a given intensity
# look like around 100 days before forming
# a relationship, right before forming a relationship,
# and long after a relationship is formed?
density <- data.frame(NumberOfTimelinePosts=0:10)
density$nearCounts <- dpois(
  density$NumberOfTimelinePosts,
  lambda=nearIntensity)
density$afterCounts <- dpois(
  density$NumberOfTimelinePosts,
  lambda=afterIntensity)
dm <- melt(density,
           id.vars=c('NumberOfTimelinePosts'),
           variable.name='group',
           value.name='density')
ggplot(data=dm) +
  geom_bar(stat='identity',position='dodge',
           aes(x=NumberOfTimelinePosts,y=density,fill=group)) +
  scale_x_continuous(breaks=density$NumberOfTimelinePosts)
```

But our bar chart is ignoring even larger blockers to making a usable predictive model. 100 days before a person declares a relationship we would not know how many days they are away from declaring (we know after they declare how many days ago they declared, but we don’t know the future) and we don’t know a priori *who* they are thinking of declaring with. A business doesn’t want a system that uses things that are hard to know (who will get into a relationship with whom) to predict easy measurements (post counts) after the fact. A business wants a real time system that uses things that are easy to measure (post counts) to predict future events that are valuable (for example: who will declare a relationship with whom and when). The original graph shows a relation between things, but a business needs a directional inference (from what is easy and cheap to things that are valuable).

And this is the jump: a plotted relation is not always immediately the basis for a usable predictive model. The popular press and management do not always seem to remember this distinction.

We started the article guessing that the dots must be aggregations of users (otherwise such tight error bars would be beyond what could be expected). Let’s use our (unjustified) Poisson process assumption to estimate how many users are likely aggregated in each dot. To attempt this we need to appeal to three facts:

- The mean equals the variance for a Poisson distribution.
- For independent identically distributed data the expected square distance between pairs of points is twice the variance.
- When you average or aggregate k independently identically distributed items this average has an expected variance of 1/k of the original items.

For individuals emitting posts in a Poisson distribution of intensity 1.5 posts per day we would expect a variance of 1.5, yielding error bars of around ±1.2 (the square root of the variance), much larger than on the graphs we have seen. So we expect each dot is the aggregation of k individuals for some large k.

Suppose x is a random variable that is the average of k independent identically distributed Poisson distributed random variables (each with intensity lambda). Then E[x] = lambda and Var[x] = lambda/k, so we expect E[x]/Var[x] to be independent of lambda and approximately equal to k (we are a little disturbed that our estimate is not dimensionless, but we must remember that Poisson distributions of different intensities have different shapes). Under this reasoning we estimate each dot on the graph represents almost 70,000 individuals. Under our (unjustified) Poisson assumption somebody with less data than this could not get an aggregate graph as tight as the one presented at such low activity rates (however, it is possible that aggregating data across many days could in fact give usable predictive models).
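The three facts above can be checked with a small simulation. This is only a sketch under the same (unjustified) Poisson assumption; the values of `lambda`, `k` and `nDots` are made up for illustration and are not taken from the original graph.

```r
# Simulation sketch: all values below are illustrative assumptions.
set.seed(2014)
lambda <- 1.5    # posts per day for one individual
k <- 50          # individuals averaged into each "dot"
nDots <- 20000   # number of simulated dots

# Each dot is the mean of k independent Poisson(lambda) draws.
draws <- matrix(rpois(nDots*k, lambda=lambda), nrow=nDots, ncol=k)
dots <- rowMeans(draws)

# Half the mean squared difference of disjoint pairs
# estimates the variance of a single dot.
odd <- seq(1, nDots-1, by=2)
varEst <- sum((dots[odd]-dots[odd+1])^2)/(2*length(odd))

# Var[dot] is about lambda/k, so mean/variance should land near k.
meanEst <- mean(dots)
kEst <- meanEst/varEst
print(c(meanEst=meanEst, varEst=varEst, kEst=kEst))
```

The pairing trick is the same one used on the real data further down: differencing neighbors removes a slowly varying trend before the variance is estimated.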

This illustrates the large gap between a good data mining result and a profitable predictive model.

The R code to estimate the de-trended variance and k (the degree of aggregation) is as follows:

```
# If the original data was generated from
# independent individuals with slowly varying
# Poisson intensity, then we could estimate the
# degree of aggregation to see the mean/variance
# ratio in the plot.
odd <- 2*(1:(dim(d)[[1]]/2))-1
varEst <- sum((d[odd,'NumberOfTimelinePosts']
   -d[odd+1,'NumberOfTimelinePosts'])^2)/(2*length(odd))
print(varEst)
## [1] 2.289848e-05
meanEst <- mean(d$NumberOfTimelinePosts)
print(meanEst)
## [1] 1.599492
kEst <- meanEst/varEst
print(kEst)
## [1] 69851.47
print(dim(d)[[1]])
## [1] 193
```

(Note: Corrected an error on my part. I used to try and estimate the total study size multiplying the number of dots times the estimated number of subjects per dot. That made no sense as each dot is presumably the same set of individuals- so the total study size is estimated at around 70,000 people, not 13 million.)


The idea is to illustrate what can quietly go wrong in an analysis and what tests to perform to make sure you see the issue. The main point is some analysis issues can not be fixed without going out and getting more domain knowledge, more variables or more data. You can’t always be sure that you have insufficient data in your analysis (there is always a worry that some clever technique will make the current data work), but it must be something you are prepared to consider.

A typical misuse of principal component analysis is pointed out in iowahawk’s Fables of the Reconstruction. To be clear: iowahawk does good work pointing out the problems in the poor results of others. I would, however, advise against trying to draw any climate or political conclusions. iowahawk performed a very smart reconstruction of a typical poor result. It is a dangerous fallacy to think you can find truth by reversing the conclusion of a poor result. The technique wasn’t bad, it was just the data wasn’t enough to support the desired result and it was wrong to not check more thoroughly and to promote such a weak result. Technique can be fixed, but the hardest thing to get right is having enough good data. The original analysis ideas are fairly well organized and clever, the failings were: not checking enough and promoting the result when the data really wasn’t up to supporting all of the claimed steps.

The typical unprincipled component analysis is organized as follows:

- Collect a bunch of data in wildly different units related to the quantity you are trying to predict.
- Without any concern for dimensionality or scaling extract the principal components of the explanatory variables. Perform no research into appropriate model structure, data exchangeability, omitted variable issues and actual relations to the response variable.
- Do not examine or use any of the principal component diagnostics and include all principal components in your linear model (completely missing the point of PCA).
- Do not attempt to simulate your proposed application with a test/train holdout split and rely on qualitative measures to promote (not check) your result.

We can simulate this typical bad analysis ourselves (so it is me making the mistakes here: in addition to having given deliberately bad advice, we now offer a deliberately bad analysis) using data copied from iowahawk’s Open Office document. The document seems to have changed since I last looked at the example, which then had 23 variables; this is a very good lesson in saving data and recording provenance. For example, the copy of warming.ods we just downloaded has a shasum of 683d931e650f9bbf098985311c400f126a14e5cf, and we stored a copy here. This is something we try to teach in our new data science book. Note: at the time I thought the PCA issue was due to including rotation elements matching near-zero principal components (which can be pure noise vectors, as the need to be orthogonal to early components starts to dominate their calculation). I now think more of the error should be ascribed to simple overfitting and not to any special flaws of PCA, just the fallacy that PCA always prevents overfit.

But let’s start the (bad) analysis. The task at hand was two part: show a recent spike in global temperature and show a very long history of global temperatures with no similar event. If done properly this would demonstrate recent warming, and help support the argument that such warming could be due to recent industrial activity (by finding no such spike in pre-history). Overall this was a reasonable plan. In fact it stands out as more planning than is typical. And for all its warts more data and details (code) ended up being shared than is common. But the analysis did fall short and was over-sold. It needed some more domain specific steps and more checking that the type of model built actually could be used in the way proposed (sorting through correlations and causes and extrapolating into the past).

We are only going to concentrate on the second step: trying to establish a claim that there is no sign of large temperature changes in early history. We will ignore the steps of trying to establish a recent temperature spike and of trying to relate that spike to industrial activity and by-products (also done poorly in the study, but these are separate issues). Also: to be clear any criticism of one analysis is not a refutation of the theory of anthropogenic global warming (which I think is likely true), though obviously a bad study is certainly not a proof of anthropogenic global warming (and should not be promoted as being such). To emphasize: the step we are looking at is only trying to estimate temperatures far in the past using measured outcomes (tree rings, ice layers); we are not trying to model the future and we are not even trying to estimate correlation with environmental conditions (CO2 isn’t even a variable in this data extract) or human activity.

A big issue is this data set has only recorded temperatures from 1856 through 2001, so you don’t have any direct temperature data even from 1400AD. So you are not yet in a position to establish a calm early climate. But you do have a number of measurements of things related to historic temperatures: tree ring widths, ice core layer thicknesses, and so on. With enough of these you could hope to get estimates of historic temperatures. With too many of these you just over-fit your known data and learn nothing about the past. And in this sort of model you have a lot of co-linearity (a good source of fitting problems) between these measurements and between expected temperature year to year (which we are assuming is a slow trend with high-autocorrelation hidden by a noise process).

The simple analysis doesn’t bother with any research on the relations between these variables and temperature (either through experiments, first principles or something like generalized additive models) and goes immediately to a PCA regression. The idea is sound: run the data through a constriction (the PCA) to make things more learnable (one of the good ideas in neural nets). The R code for such a shotgun analysis is roughly as follows (the steps happen to be easy in R, but with sufficiently clumsy tools you could write much longer code and really convince yourself that you are accomplishing something):

```
# load and prepare the data
urlBase <- 'http://www.win-vector.com/dfiles/PCA/'
pv <- read.table(paste(urlBase,'ProxyVariables.csv',sep=''),
   sep=',',header=TRUE,comment.char='')
ot <- read.table(paste(urlBase,'ObservedTemps.csv',sep=''),
   sep=',',header=TRUE,comment.char='')
keyName <- 'Year'
varNames <- setdiff(colnames(pv),keyName)
yName <- setdiff(colnames(ot),keyName)
d <- merge(pv,ot,by=c(keyName))
# perform a regression on new PCA variables
pcomp <- prcomp(d[,varNames])
synthNames <- colnames(pcomp$rotation)
d <- cbind(d,
   as.matrix(d[,varNames]) %*% pcomp$rotation)
f <- as.formula(paste(yName,
   paste(synthNames,collapse=' + '),sep=' ~ '))
model <- step(lm(f,data=d))
# print and plot a little
print(summary(model)$r.squared)
d$pred <- predict(model,newdata=d)
library(ggplot2)
ggplot(data=d) +
   geom_point(aes_string(x=keyName,y=yName)) +
   geom_line(aes_string(x=keyName,y='pred'))
```

And this produces an R-squared of 0.49 (not good, but often valid models achieve less) and the following graph showing recorded temperature (recorded as differences from a reference temperature) as points and the model as a line:

Remember: the purpose of this step is not to establish a temperature spike on the right. The goal is to show that we can use data from the recent past to infer temperatures of the further back past. A cursory inspection of the graph would *seem* to give the impression the modeling idea worked: we get a fit line through history that explains a non-negligible fraction of the variation and seems to track trends. Our step-wise regression model used only six principal components- so we have some generalization occurring.

On the minus side: since we used principal components and have no side knowledge of how temperatures are supposed to affect the stand-in variables we can’t check if the model coefficient signs make sense (an often important task, though we could at least compare signs from the complex model to signs from single variable models). We haven’t looked at the model summary (though it is likely to be forced to be “good” by the use of stepwise regression which rejects many models with bad summaries to get a final model).

And while we have a graph we could talk up, we have not simulated how this model would be used: to try to use relations fit from more recent data to extrapolate past temperatures from past signs. Let’s see if we can even do that on the data we have available.

Our intended use of the data is: use all of the data we have both temperatures and proxy variables for (1856 through 1980) to build a model that imputes older temperatures from proxy variables (at least back to 1400AD). To simulate this intended use of the model we will split our data into what I call a *structured test and train split*. The test and train split will not be random (which tends not to be an effective test for auto-correlated time series data) but instead split in a way related to our intended use: using the future to extrapolate into the past. We will use all data at or after the median known date (1918) for training and all older data to simulate trying to predict further into the past. If we can’t use data after 1917 to predict 1856 (a claim we can test because we have enough data) then we can’t expect to predict back to 1400AD (a claim we can’t check as we don’t know the “true” temperature for 1400AD).

The code to perform the analysis in this style is:

```
d$trainGroup <- d[,keyName]>=median(d[,keyName])
dTrain <- subset(d,trainGroup)
dTest <- subset(d,!trainGroup)
model <- step(lm(f,data=dTrain))
dTest$pred <- predict(model,newdata=dTest)
dTrain$pred <- predict(model,newdata=dTrain)
ggplot() +
   geom_point(data=dTest,aes_string(x=keyName,y=yName)) +
   geom_line(data=dTest,aes_string(x=keyName,y='pred'),
      color='blue',linetype=2) +
   geom_point(data=dTrain,aes_string(x=keyName,y=yName)) +
   geom_line(data=dTrain,aes_string(x=keyName,y='pred'),
      color='red',linetype=1)
```

What we are predicting is the difference in temperature from a reference temperature (which fixes some issues of scale). The analysis produces the graph below which shows that modeling the past is harder than just running a regression through all of our data. Notice how the dashed line (predictions made on data not available during training) is upwardly biased throughout, having just copied trends from the future into the past.

Our first graph only looked as “good” as it did because it was not even simulating the task we were going to try to use the derived model for: extrapolating into the past. In fact we find this model is worse than a type of null-model on the past: using a single (magically guessed) historic temperature difference of `mean(dTest[,yName])`. The point is: the model is better than any single constant on the data it was trained on, but not better than all constants on data it did not see. This is a common symptom of overfitting.

We show a quantitative measure of the poor performance with root mean square error below:

```
rmse <- function(x,y) { sqrt(mean((x-y)^2))}
print(rmse(dTrain[,yName],mean(dTrain[,yName])))
## [1] 0.1404713
print(rmse(dTrain[,yName],dTrain$pred))
## [1] 0.1062604
print(rmse(dTest[,yName],mean(dTest[,yName])))
## [1] 0.1312853
print(rmse(dTest[,yName],dTest$pred))
## [1] 0.2004035
```

We used RMSE instead of correlation to track predictor performance for a subtle reason (see: Don’t use correlation to track prediction performance). And as we said: the RMSE on test is bigger for the model than for using a single historic average: `mean(dTest[,yName])`. Now the historic average is a constant that we would not know in actual application, but for evaluating the quality of the model it is appropriate to compare RMSEs in this manner. In fact the whole point of this sort of evaluation is to expose the data you wouldn’t know in practice (the actual older temperatures) and see if your “looks plausible” model models them well or not. We can present the issue graphically by re-plotting the data and fit with a smoothing line run through the actual data (you want to work with a system that makes exploratory graphing easy, as to get your “hands in the data” you need to produce a lot of plots and summaries).

```
ggplot() +
   geom_point(data=dTest,aes_string(x=keyName,y=yName)) +
   geom_line(data=dTest,aes_string(x=keyName,y='pred'),
      color='blue',linetype=2) +
   geom_segment(aes(x=min(dTest[,keyName]),
      xend=max(dTest[,keyName]),
      y=mean(dTest[,yName]),
      yend=mean(dTest[,yName]))) +
   geom_point(data=dTrain,aes_string(x=keyName,y=yName)) +
   geom_line(data=dTrain,aes_string(x=keyName,y='pred'),
      color='red',linetype=1) +
   geom_segment(aes(x=min(dTrain[,keyName]),
      xend=max(dTrain[,keyName]),
      y=mean(dTrain[,yName]),
      yend=mean(dTrain[,yName]))) +
   geom_smooth(data=d,aes_string(x=keyName,y=yName),
      color='black')
```

The problem is that the predictions made in the data range available during training (1918 and beyond, where the predictions are the red line through the black data dots) look okay, while the predictions in the test region (all data before 1918, where the predictions are the blue dashed line and the data are again the black dots) are systematically off (mostly too high). The black smoothing curve run through all of the data shows this: notice how the predictions center around the curve in the training region, but mostly stay above the curve in the test region. The warning sign is: the behavior of the model is systematically off once we apply it on data it didn’t see during training. The model is good at memorizing what it has been shown, but can’t be trusted where it can’t be checked (as it may not have learned a useful general relation). A good result would have had the model continue to track the smoothing line in the held-out past.

Honestly the fitting ideas were pretty good, but the results just were not good enough to be used as intended (to prove a recent temperature spike would be rare by claiming no such spikes for an interval even longer than we have directly recorded temperature measurements).

Another thing we never did check is whether the PCA actually did anything for us. I doubt it made matters worse, but it does represent extra complexity that should not be allowed until we see an actual benefit (otherwise it is just slavish ritual). Let’s check if there was any PCA benefit by comparing to a naive regression on the raw variables now:

```
g <- as.formula(paste(yName,
   paste(varNames,collapse=' + '),sep=' ~ '))
model <- lm(g,data=dTrain)
dTest$pred <- predict(model,newdata=dTest)
dTrain$pred <- predict(model,newdata=dTrain)
print(rmse(dTrain[,yName],mean(dTrain[,yName])))
## [1] 0.1404713
print(rmse(dTrain[,yName],dTrain$pred))
## [1] 0.1018893
print(rmse(dTest[,yName],mean(dTest[,yName])))
## [1] 0.1312853
print(rmse(dTest[,yName],dTest$pred))
## [1] 0.1973062
```

Almost indistinguishable results. PCA, at least how we used it, brought us almost nothing. Any procedure you use without checking degenerates into mere ritual.

And just to be cruel we have hidden a few other common PCA errors in the above writeup.

First we used PCA in a “dimensionally invalid” way: by not setting the `scale.=TRUE` option in `prcomp()` (which should have been `pcomp <- prcomp(d[,varNames],scale.=TRUE)`). The point is that without scaling you get different principal components depending on the units used in the records. It is not true that something measured in millimeters has more range than the same quantity measured in kilometers, but if you don’t use scaling PCA makes exactly this mistake.
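The unit dependence is easy to see on made-up data. The following is a minimal sketch, not part of the original analysis: the columns `length` and `width` are hypothetical measurements, and the only change between the two data frames is that `width` is re-expressed in millimeters.

```r
# Sketch on synthetic data: all names and values are illustrative assumptions.
set.seed(42)
n <- 200
lengthM <- rnorm(n, mean=10, sd=1)                 # meters
widthM <- 5 + 0.5*lengthM + rnorm(n, sd=0.9)       # meters
inMeters <- data.frame(length=lengthM, width=widthM)
# Same measurements, but width recorded in millimeters.
mixedUnits <- data.frame(length=lengthM, width=1000*widthM)

# Without scaling, the first component is dominated by
# whichever column happens to carry the larger numbers.
unscaled <- prcomp(mixedUnits)
print(abs(unscaled$rotation[,'PC1']))

# With scale.=TRUE the loadings do not depend on the units chosen.
scaledMeters <- prcomp(inMeters, scale.=TRUE)
scaledMixed <- prcomp(mixedUnits, scale.=TRUE)
print(abs(scaledMeters$rotation[,'PC1']))
print(abs(scaledMixed$rotation[,'PC1']))
```

Absolute values are printed because the sign of a principal component is arbitrary.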

Also, it was probably not a good idea to rely only on stepwise regression for variable selection; we should probably never have made the later principal components available to the analysis. This could be done by requiring the synthetic variables to have at least some minimum variation before they are eligible for analysis. Something like the following could work (though you have to pick the threshold sensibly): `synthNames <- colnames(pcomp$rotation)[pcomp$sdev>1]`.

We can easily re-run the PCA analysis with these fixes:

```
pcomp <- prcomp(d[,varNames],scale.=TRUE)
synthNames <- colnames(pcomp$rotation)[pcomp$sdev>1]
# (assumes the PC columns in dTrain and dTest have been
# recomputed from this new pcomp)
f <- as.formula(paste(yName,
   paste(synthNames,collapse=' + '),sep=' ~ '))
model <- step(lm(f,data=dTrain))
dTest$pred <- predict(model,newdata=dTest)
dTrain$pred <- predict(model,newdata=dTrain)
print(rmse(dTrain[,yName],mean(dTrain[,yName])))
## [1] 0.1404713
print(rmse(dTrain[,yName],dTrain$pred))
## [1] 0.1179709
print(rmse(dTest[,yName],mean(dTest[,yName])))
## [1] 0.1312853
print(rmse(dTest[,yName],dTest$pred))
## [1] 0.2778389
```

This gives us a model that uses only three principal components with even worse generalization error. Being able to recognize (and admit to) bad results or equivocal results is a key skill of a good data scientist.

I originally made a funny mistake when trying this test. I accidentally wrote `model <- step(lm(f,data=d))` and trained a model on all of the data without any holdout. This gave RMSEs on the dTrain and dTest sets that were both worse than using the means as the predictors. This is how I noticed I used the wrong data set to train (least squares minimizes RMSE, so it cannot underperform a constant on its actual training set; if I had actually trained on dTrain then I would have to have a non-horrible RMSE on dTrain). This is another (non-principled) bit of evidence we have a bad model: we fit on all the data, yet do not do better than the best constant on large subsets. This isn’t a reliable measure of generalization error, but it is a very bad sign. You don’t get a lot of credit for passing this sort of test (it is a fairly low bar), but when you fail this test things are bad (especially when the subsets you are testing on are structured to simulate your intended application as we have here).
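Both halves of this argument (least squares with an intercept cannot lose to the best constant on its own training data, yet it can lose to a subset’s own mean on a structured subset) can be seen in a small sketch. The data frame `toy` and its columns are made up for illustration; nothing here uses the temperature data.

```r
# Sketch on synthetic data: all names and values are illustrative assumptions.
set.seed(7)
rmse <- function(x,y) { sqrt(mean((x-y)^2)) }
n <- 100
toy <- data.frame(year=1:n)
# The response has a step halfway through; x1 and x2 are pure noise.
toy$y <- ifelse(toy$year>n/2, 1, 0) + rnorm(n, sd=0.1)
toy$x1 <- rnorm(n)
toy$x2 <- rnorm(n)
fit <- lm(y ~ x1 + x2, data=toy)
toy$pred <- predict(fit, newdata=toy)

# On the full training set the fit is at least as good as the mean.
print(rmse(toy$y, toy$pred))
print(rmse(toy$y, mean(toy$y)))

# On a structured subset (the early half) the same model
# does much worse than that subset's own mean.
early <- subset(toy, year<=n/2)
print(rmse(early$y, early$pred))
print(rmse(early$y, mean(early$y)))
```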

But when PCA actually helps: what is it actually trying to do for us? The variables of this problem actually were set up in a way that helps explain what we should be expecting (and do not confuse what we are expecting with what actually happens or with implementation details, the math can be as impressive as you want and still fail to deliver):

```
print(varNames)
## [1] "Quelccaya.Ice.Core.summit..Ice.O.18" "Quelccaya.Ice.Core.summit..Ice.accumulation"
## [3] "Quelccaya.Ice.Core..2..Ice.O.18" "Quelccaya.Ice.Core..2..Ice.accumulation"
## [5] "Svalbard.Ice.melt" "west.Greenland.Ice.o.18"
## [7] "tasmania.tree.ring.width" "north.patagonia.tree.ring.width"
## [9] "NA.treeline.ring.widths" "Southeast.U.S.N..Carolina..Dendro.ring.widths"
## [11] "Southeast.U.S.S..Carolina..Dendro.ring.widths" "Southeast.U.S.Georgia.Dendro.ring.widths"
## [13] "Yakutia.Dendro.ring.widths" "Fennoscandia.Dendro.density"
## [15] "Northern.Urals.Dendro.density"
```

Notice the variables fall into a few small groups: tree ring widths, dendro density, oxygen isotope (O-18) measurements. Each of these is a measurement that is thought to be an outcome affected by temperature (so may record evidence of a temperature at a given date). It would make sense to have factors that combine and summarize all measurements of a given type (such as an average). This is something that principal components loading could be expected to do for us (we have not confirmed if it actually has). PCA can be thought of as a simplification or regularization procedure to be combined with regression. The step-wise regressed PCA model is simpler than a full regression because only six synthetic variables are used instead of arbitrary combinations of all 15 original input variables.

However, once we start thinking in terms of variable groups and regularization we realize we have other ways to directly implement these improvements. We could attempt some sort of domain specific direct regularization on the least squares procedure such as adding a penalty term that says all of the coefficients related to a given variable type should have similar magnitudes (this would be a variant of Tikhonov regularization where instead of saying coefficients should be near zero we say groups of them should be near each other). To do this we would have to use R’s linear algebra capabilities directly (as informing R’s linear regression fitter of the additional loss function would be difficult).
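One way such a penalty could be written with R’s linear algebra directly is sketched below. This is only an illustration of the idea, not a recommended method: the helper `fitGrouped`, the penalty weight `lambda`, the grouping vector and the toy data are all hypothetical, and the penalty only makes sense if the variables in a group are on a common scale. The sketch minimizes the squared error plus `lambda` times the squared distance of each coefficient from the mean of its group.

```r
# Hypothetical helper: least squares plus a penalty that pulls
# coefficients in the same group toward their group mean.
fitGrouped <- function(X, y, groups, lambda) {
   X1 <- cbind(1, X)               # unpenalized intercept column
   p <- ncol(X1)
   P <- matrix(0, nrow=p, ncol=p)  # penalty matrix
   for(g in unique(groups)) {
      idx <- which(groups==g) + 1  # shift past the intercept
      m <- length(idx)
      # centering block: t(b) %*% block %*% b = sum((b - mean(b))^2)
      P[idx,idx] <- diag(m) - matrix(1/m, nrow=m, ncol=m)
   }
   # solve the penalized normal equations
   solve(crossprod(X1) + lambda*P, crossprod(X1, y))
}

# Toy data: a hidden slowly varying signal observed through noisy proxies.
set.seed(11)
n <- 60
temp <- cumsum(rnorm(n, sd=0.1))
X <- cbind(ring1=temp+rnorm(n, sd=0.3),
   ring2=temp+rnorm(n, sd=0.3),
   ring3=temp+rnorm(n, sd=0.3),
   ice1=-temp+rnorm(n, sd=0.3),
   ice2=-temp+rnorm(n, sd=0.3))
groups <- c('ring','ring','ring','ice','ice')
y <- temp + rnorm(n, sd=0.1)
bPlain <- fitGrouped(X, y, groups, lambda=0)    # ordinary least squares
bFused <- fitGrouped(X, y, groups, lambda=100)  # grouped penalty
print(round(cbind(plain=drop(bPlain), fused=drop(bFused)), 3))
```

With `lambda=0` this reduces to ordinary least squares; as `lambda` grows the coefficients within each group are forced closer together.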

Or we could try a more Bayesian tack and say that all measurements in a given group are in fact noisy observations of a hidden ideal variable. So there is an ideal “tree width variable” that has the unobserved temperature as an influence and our tree width measurements are all different derived observations. The Bayesian framework allows us to put a weighted criterion on model coefficients that reflects this pattern of dependence. With the right choice of structure and priors this would simplify the model in a meaningful way (again pushing model coefficients for related variables closer together) and help lessen the potential for overfitting. I don’t have an R package for this I currently recommend, but there are certainly a large number worth trying. The additional complexity of learning a Bayesian regression package is worth the extra control in being able to directly express sensible priors on the model parameters.

My biggest complaint about PCA is: PCA builds synthetic variables without ever looking at their relation to the response variable. This is in principle bad: we can force the modeling system to ignore important variables by adding more and more large variance irrelevant variables. There are some attempts to fix this sort of behavior (such as Partial Least Squares, see Hastie, Tibshirani, Friedman “The Elements of Statistical Learning” 2nd edition, Springer), but explanatory variable scale and variance can still undesirably dominate your analysis.
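A small synthetic example makes the complaint concrete. In this sketch (all names and values are made up for illustration) the response depends only on `signal`, but five irrelevant high-variance columns are added before running an unscaled PCA.

```r
# Sketch on synthetic data: all names and values are illustrative assumptions.
set.seed(3)
n <- 500
signal <- rnorm(n, sd=1)             # the only variable that drives y
y <- 2*signal + rnorm(n, sd=0.1)
noise <- matrix(rnorm(n*5, sd=10), ncol=5)  # irrelevant, large variance
colnames(noise) <- paste('noise', 1:5, sep='')
X <- cbind(signal=signal, noise)

pc <- prcomp(X)
# Squared correlation of the response with each principal component.
r2 <- drop(cor(pc$x, y)^2)
print(round(pc$sdev, 2))
print(round(r2, 3))
```

The leading components (ranked by variance) are built from the noise columns, and the component related to `y` comes last. Any rule that keeps only the high-variance components would discard it. Using `scale.=TRUE` removes this particular variance trick, but PCA still never consults the response.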

So what is one to do? The answer is be careful: read the literature on how to perform a good analysis, use the right tools and test things related to your actual intended application.

Our advice is:

- Don’t use analyses without understanding. You don’t need to study complete implementation details, but you need to understand intent, limits, procedures and consequences.
- Check everything. Use a system (like R) where re-running and re-checking is no harder than copying and pasting. Unchecked procedures become mere ritual. Use the standard tests, but augment them with ad-hoc tests (tests you make up that may not prove anything if passed, but are very bad if they fail). Don’t be afraid to check if things are obviously wrong.
- Don’t treat the assurances (implicit or explicit) that come with a technique as replacements for direct tests of your results. And do not take cross-validation and hold-out techniques as inscrutable methods that always require randomness: see how you can directly re-adapt them to simulate your actual intended application.

It is no coincidence that PCA is most commonly used in fields like social science which have a large emphasis on proper experimental design, variable selection and variable pre-scaling (avoiding many of the potential pitfalls of PCA). We have in fact noticed strong per-field affinities for different modeling methods (and different specializations of methods). For example PLS literature starts in the field of chemometrics, which leads me to conclude it fixes problems that are important to them (scaling and controlling for explanatory variable relevance to the response variable) even if it is not a complete panacea.
