To make things easier, here are links to the original three articles, which work through scores and significance and include a glossary.
A lot of what Nina is presenting can be summed up in the diagram below (also by her). If in the diagram the first row is truth (say red disks are infected) which classifier is the better initial screen for infection? Should you prefer the model 1 80% accurate row or the model 2 70% accurate row? This example helps break dependence on “accuracy as the only true measure” and promote discussion of additional measures.
My concrete advice is:
That being said, it always seems like there is a bit of gamesmanship, in that somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Some choices of measure signal pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background) while hoping to befuddle others.
The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.
Let’s define our terms. We are going to work with “binary classification” problems. These are problems where we have example instances (also called rows) that are either “in the class” (we will call these instances “true”) or not (and we will call these instances “false”). A classifier is a function that, given the description of an instance, tries to determine if the instance is in the class or not. The classifier may either return a decision of “positive”/“negative” (indicating the classifier thinks the instance is in or out of the class) or a probability score denoting the estimated probability of being in the class.
For decision based (or “hard”) classifiers (those returning only a positive/negative determination) the “confusion matrix” is a sufficient statistic in the sense it contains all of the information summarizing classifier quality. All other classification measures can be derived from it.
For a decision classifier (one that returns “positive” and “negative”, and not probabilities) the classifier’s performance is completely determined by four counts:
Notice that in these terms “true” and “false” indicate whether the classifier is correct, not the actual category of each item. This is traditional nomenclature. The first two quantities (TruePositives and TrueNegatives) count instances where the classifier is correct, and the second two count instances where the classifier is incorrect.
It is traditional to arrange these quantities into a 2 by 2 table called the confusion matrix. If we define:
library('ggplot2')
library('caret')
## Loading required package: lattice
library('rSymPy')
## Loading required package: rJython
## Loading required package: rJava
## Loading required package: rjson
A = Var('TruePositives')
B = Var('FalsePositives')
C = Var('FalseNegatives')
D = Var('TrueNegatives')
(Note all code shared here.)
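As a quick cross-check outside the article's R/sympy workflow, the four counts can be tallied directly from parallel truth/prediction vectors. This is a Python sketch; the function name is mine, not from the article:

```python
# Sketch: tally the four confusion-matrix counts from parallel boolean lists.
def confusion_counts(truth, predicted):
    """Return (A, B, C, D) = (TP, FP, FN, TN)."""
    A = sum(t and p for t, p in zip(truth, predicted))              # true positives
    B = sum((not t) and p for t, p in zip(truth, predicted))        # false positives
    C = sum(t and (not p) for t, p in zip(truth, predicted))        # false negatives
    D = sum((not t) and (not p) for t, p in zip(truth, predicted))  # true negatives
    return A, B, C, D

truth     = [True, True, True, False, False, False]
predicted = [True, True, False, True, False, False]
print(confusion_counts(truth, predicted))  # (2, 1, 1, 2)
```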
Then the caret R package defines the confusion matrix as follows (see help("confusionMatrix")):
            Reference
Predicted   Event   No Event
  Event       A        B
  No Event    C        D
Reference is “ground truth” or the actual outcome. We will call examples whose ground truth is true “true examples” (again, please don’t confuse this with “TrueNegatives”, which are false examples correctly scored as being false). We would prefer to have the classifier predictions index columns instead of rows, but we will use the caret notation for consistency.
We can encode what we have written about these confusion matrix summaries as algebraic statements. Caret’s help("confusionMatrix")
then gives us definitions of a number of common classifier scores:
# (A+C) and (B+D) are facts about the data, independent of classifier.
Sensitivity = A/(A+C)
Specificity = D/(B+D)
Prevalence = (A+C)/(A+B+C+D)
PPV = (Sensitivity * Prevalence)/((Sensitivity*Prevalence) + ((1-Specificity)*(1-Prevalence)))
NPV = (Specificity * (1-Prevalence))/(((1-Sensitivity)*Prevalence) + ((Specificity)*(1-Prevalence)))
DetectionRate = A/(A+B+C+D)
DetectionPrevalence = (A+B)/(A+B+C+D)
BalancedAccuracy = (Sensitivity+Specificity)/2
We can (from our notes) also define some more common metrics:
TPR = A/(A+C) # True Positive Rate
FPR = B/(B+D) # False Positive Rate
FNR = C/(A+C) # False Negative Rate
TNR = D/(B+D) # True Negative Rate
Recall = A/(A+C)
Precision = A/(A+B)
Accuracy = (A+D)/(A+B+C+D)
By writing everything down it becomes obvious that Sensitivity == TPR == Recall. That won’t stop somebody from complaining if you say “recall” when they prefer “sensitivity”, but that is how things are.
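These identities can also be spot-checked numerically on arbitrary positive counts. A quick Python sketch (variable names mine), outside the article's sympy workflow:

```python
# Numeric spot-check of Sensitivity == TPR == Recall and PPV == Precision.
A, B, C, D = 7.0, 3.0, 2.0, 11.0  # TP, FP, FN, TN

sensitivity = A / (A + C)
specificity = D / (B + D)
prevalence  = (A + C) / (A + B + C + D)
ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
tpr       = A / (A + C)
recall    = A / (A + C)
precision = A / (A + B)

assert sensitivity == tpr == recall    # Sensitivity == TPR == Recall
assert abs(ppv - precision) < 1e-9    # PPV == Precision
```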
By declaring all of these quantities as sympy variables and expressions we can now check much more. We confirm formal equality of various measures by checking that their difference algebraically simplifies to zero.
# Confirm TPR == 1 - FNR
sympy(paste("simplify(",TPR-(1-FNR),")"))
## [1] "0"
# Confirm Recall == Sensitivity
sympy(paste("simplify(",Recall-Sensitivity,")"))
## [1] "0"
# Confirm PPV == Precision
sympy(paste("simplify(",PPV-Precision,")"))
## [1] "0"
We can also confirm non-identity by simplifying and checking an instance:
# Confirm Precision != Specificity
expr <- sympy(paste("simplify(",Precision-Specificity,")"))
print(expr)
## [1] "(FalsePositives*TruePositives - FalsePositives*TrueNegatives)/(FalsePositives*TrueNegatives + FalsePositives*TruePositives + TrueNegatives*TruePositives + FalsePositives**2)"
sub <- function(expr,
TruePositives,FalsePositives,FalseNegatives,TrueNegatives) {
eval(expr)
}
sub(parse(text=expr),
TruePositives=0,FalsePositives=1,FalseNegatives=0,TrueNegatives=1)
## [1] -0.5
Write the probability of a true (in-class) instance scoring higher than a false (not-in-class) instance, with a 1/2 point for ties, as Prob[score(true) > score(false)]. We can then confirm Prob[score(true) > score(false)] == BalancedAccuracy for hard or decision classifiers by scoring each true/false pair of examples as follows:
A D : True Positive and True Negative: Correct sorting 1 point
A B : True Positive and False Positive (same prediction "Positive", different outcomes): 1/2 point
C D : False Negative and True Negative (same prediction "Negative", different outcomes): 1/2 point
C B : False Negative and False Positive: Wrong order 0 points
Then ScoreTrueGTFalse == Prob[score(true) > score(false)] (with 1/2 point for ties) is:
ScoreTrueGTFalse = (1*A*D + 0.5*A*B + 0.5*C*D + 0*C*B)/((A+C)*(B+D))
Which we can confirm is equal to balanced accuracy.
sympy(paste("simplify(",ScoreTrueGTFalse-BalancedAccuracy,")"))
## [1] "0"
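The same identity can be confirmed arithmetically rather than symbolically. Here is a quick Python sketch checking it on randomly sampled counts:

```python
# Check ScoreTrueGTFalse == BalancedAccuracy on random confusion-matrix counts.
import random

random.seed(1)
for _ in range(100):
    A, C = random.randint(0, 20), random.randint(0, 20)  # among true examples
    B, D = random.randint(0, 20), random.randint(0, 20)  # among false examples
    if (A + C) == 0 or (B + D) == 0:
        continue  # scores undefined without both true and false examples
    score = (1.0 * A * D + 0.5 * A * B + 0.5 * C * D) / ((A + C) * (B + D))
    balanced_accuracy = (A / (A + C) + D / (B + D)) / 2
    assert abs(score - balanced_accuracy) < 1e-9
print("ScoreTrueGTFalse == BalancedAccuracy on all sampled counts")
```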
We can also confirm Prob[score(true) > score(false)] (with half point on ties) == AUC. We can compute the AUC (the area under the drawn ROC curve) of the above confusion matrix by referring to the following diagram.
Then we can check for general equality:
AUC = (1/2)*FPR*TPR + (1/2)*(1-FPR)*(1-TPR) + (1-FPR)*TPR
sympy(paste("simplify(",ScoreTrueGTFalse-AUC,")"))
## [1] "0"
This AUC score (with half point credit on ties) equivalence holds in general (see also More on ROC/AUC, though I got this wrong the first time).
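The single-point trapezoidal AUC formula above can be checked against the tie-corrected pair-ordering score numerically as well (a Python sketch; function names mine):

```python
# Compare the one-point trapezoidal AUC with the tie-corrected ordering score.
def auc_one_point(A, B, C, D):
    tpr, fpr = A / (A + C), B / (B + D)
    # triangle under (0,0)->(fpr,tpr), then trapezoid up to (1,1)
    return 0.5 * fpr * tpr + 0.5 * (1 - fpr) * (1 - tpr) + (1 - fpr) * tpr

def score_true_gt_false(A, B, C, D):
    return (A * D + 0.5 * A * B + 0.5 * C * D) / ((A + C) * (B + D))

for counts in [(7, 3, 2, 11), (1, 1, 1, 1), (5, 0, 2, 9)]:
    assert abs(auc_one_point(*counts) - score_true_gt_false(*counts)) < 1e-9
```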
We can show F1
is different than Balanced Accuracy by plotting results they differ on:
# Wikipedia https://en.wikipedia.org/wiki/F1_score
F1 = 2*Precision*Recall/(Precision+Recall)
F1 = sympy(paste("simplify(",F1,")"))
print(F1)
## [1] "2*TruePositives/(FalseNegatives + FalsePositives + 2*TruePositives)"
print(BalancedAccuracy)
## [1] "TrueNegatives/(2*(FalsePositives + TrueNegatives)) + TruePositives/(2*(FalseNegatives + TruePositives))"
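Before plotting, a single concrete discordant pair already shows the two measures cannot be monotone functions of each other. The specific counts below are mine, chosen for illustration (a Python sketch):

```python
# Confusion matrix X beats Y on F1 but loses on balanced accuracy.
def f1(TP, FP, FN, TN):
    return 2 * TP / (FN + FP + 2 * TP)

def balanced_accuracy(TP, FP, FN, TN):
    return TP / (2 * (TP + FN)) + TN / (2 * (FP + TN))

X = dict(TP=4, FP=4, FN=1, TN=1)
Y = dict(TP=2, FP=0, FN=3, TN=5)
assert f1(**X) > f1(**Y)                                # 8/13 > 4/7
assert balanced_accuracy(**X) < balanced_accuracy(**Y)  # 0.5 < 0.7
```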
# Show F1 and BalancedAccuracy do not always vary together (even for hard classifiers)
F1formula = parse(text=F1)
BAformula = parse(text=BalancedAccuracy)
frm = c()
for(TotTrue in 1:5) {
  for(TotFalse in 1:5) {
    for(TruePositives in 0:TotTrue) {
      for(TrueNegatives in 0:TotFalse) {
        FalsePositives = TotFalse-TrueNegatives
        FalseNegatives = TotTrue-TruePositives
        F1a <- sub(F1formula,
                   TruePositives=TruePositives,FalsePositives=FalsePositives,
                   FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives)
        BAa <- sub(BAformula,
                   TruePositives=TruePositives,FalsePositives=FalsePositives,
                   FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives)
        if((F1a<=0)&&(BAa>0.5)) {
          stop()
        }
        fi = data.frame(
          TotTrue=TotTrue,
          TotFalse=TotFalse,
          TruePositives=TruePositives,FalsePositives=FalsePositives,
          FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives,
          F1=F1a,BalancedAccuracy=BAa,
          stringsAsFactors = FALSE)
        frm = rbind(frm,fi) # bad n^2 accumulation
      }
    }
  }
}
ggplot(data=frm,aes(x=F1,y=BalancedAccuracy)) +
geom_point() +
ggtitle("F1 versus BalancedAccuracy/AUC")
F1 versus BalancedAccuracy/AUC
In various sciences, over 20 measures of “scoring correspondence” have been introduced over the years, driven by publication priority, symmetry concerns, and the desire to incorporate significance (“chance adjustments”) directly into the measure.
Each measure presumably exists because it avoids some flaw of all the others. However, the sheer number of them (in my opinion) triggers what I call “De Morgan’s objection”:
If I had before me a fly and an elephant, having never seen more than one such magnitude of either kind; and if the fly were to endeavor to persuade me that he was larger than the elephant, I might by possibility be placed in a difficulty. The apparently little creature might use such arguments about the effect of distance, and might appeal to such laws of sight and hearing as I, if unlearned in those things, might be unable wholly to reject. But if there were a thousand flies, all buzzing, to appearance, about the great creature; and, to a fly, declaring, each one for himself, that he was bigger than the quadruped; and all giving different and frequently contradictory reasons; and each one despising and opposing the reasons of the others—I should feel quite at my ease. I should certainly say, My little friends, the case of each one of you is destroyed by the rest.
(Augustus De Morgan, “A Budget of Paradoxes” 1872)
There is actually an excellent literature stream investigating which of these measures are roughly equivalent (say arbitrary monotone functions of each other) and which are different (leave aside which are even useful).
Two excellent guides to this rat hole include:
Ackerman, M., & Ben-David, S. (2008). “Measures of clustering quality: A working set of axioms for clustering.” Advances in Neural Information Processing Systems: Proceedings of the 2008 Conference.
Warrens, M. (2008). “On similarity coefficients for 2 × 2 tables and correction for chance.” Psychometrika, 73(3), 487–502.
The point is: you can not only get a publication trying to sort out this mess, you can actually do truly interesting work trying to relate these measures.
One can take finding relations and invariants much further as in “Lectures on Algebraic Statistics” Mathias Drton, Bernd Sturmfels, Seth Sullivant, 2008.
It is a bit much to hope to only need to know “one best measure” or to claim to be familiar (let alone expert) in all plausible measures. Instead, find a few common evaluation measures that work well and stick with them.
‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
‘vtreat’ is an R package that incorporates a number of transforms and simulated out of sample (cross-frame simulation) procedures that can:
‘vtreat’ can be used to prepare data for either regression or classification.
Please read on for what ‘vtreat’ does and what is new.
The primary function of ‘vtreat’ is re-coding of high-cardinality categorical variables, re-coding of missing data, and out-of sample estimation of variable effects and significances. You can use ‘vtreat’ as a pre-processor and use ‘vtreat::prepare’ as a powerful replacement for ‘stats::model.matrix’. Using ‘vtreat’ should get you quickly into the competitive ballpark of best performance on a real-world data problem (such as KDD2009) leaving you time to apply deeper domain knowledge and model tuning for even better results.
‘vtreat’ achieves this by using the assumption that you have a modeling “y” (or outcome to predict) throughout, and that all preparation and transformation should be designed to use knowledge of this “y” during training (and anticipate not having the “y” during test or application).
More simply: the purpose of ‘vtreat’ is to quickly take a messy real-world data frame similar to:
library('htmlTable')
library('vtreat')
dTrainC <- data.frame(x=c('a','a','a','b','b',NA,NA),
z=c(1,2,3,4,NA,6,NA),
y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE))
htmlTable(dTrainC)
|   | x  | z  | y     |
|---|----|----|-------|
| 1 | a  | 1  | FALSE |
| 2 | a  | 2  | FALSE |
| 3 | a  | 3  | TRUE  |
| 4 | b  | 4  | FALSE |
| 5 | b  | NA | TRUE  |
| 6 | NA | 6  | TRUE  |
| 7 | NA | NA | TRUE  |
And build a treatment plan:
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
The treatment plan can then be used to clean up the original data and also be applied to any future application or test data:
dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneSig=0.5)
nround <- function(x) { if(is.numeric(x)) { round(x,2) } else { x } }
htmlTable(data.frame(lapply(dTrainCTreated,nround)))
|   | x_lev_NA | x_lev_x.a | x_catP | x_catB | z_clean | z_isBAD | y     |
|---|----------|-----------|--------|--------|---------|---------|-------|
| 1 | 0        | 1         | 0.43   | -0.54  | 1       | 0       | FALSE |
| 2 | 0        | 1         | 0.43   | -0.54  | 2       | 0       | FALSE |
| 3 | 0        | 1         | 0.43   | -0.54  | 3       | 0       | TRUE  |
| 4 | 0        | 0         | 0.29   | -0.13  | 4       | 0       | FALSE |
| 5 | 0        | 0         | 0.29   | -0.13  | 3.2     | 1       | TRUE  |
| 6 | 1        | 0         | 0.29   | 0.56   | 6       | 0       | TRUE  |
| 7 | 1        | 0         | 0.29   | 0.56   | 3.2     | 1       | TRUE  |
‘vtreat’ is designed to be concise, yet implement substantial data preparation and cleaning.
This release concentrates on code-cleanup and convenience functions inspired by Nina Zumel’s recent article on y-aware PCA/PCR (my note why you should read this series is here). In particular we now have user facing functions and documentation on:
‘vtreat’ now has essentially two workflows:
We think analysts/data-scientists will be well served by learning both workflows and picking the workflow most appropriate to the data set at hand.
From feedback I am not sure everybody noticed that in addition to being easy and effective, the method is actually novel (we haven’t yet found an academic reference to it or seen it already in use after visiting numerous clients). Likely it has been applied before (as it is a simple method), but it is not currently considered a standard method (something we would like to change).
In this note I’ll discuss some of the context of y-aware scaling.
y-aware scaling is a transform that has been available as “scale mode” in the vtreat R package since before the first public release (Aug 7, 2014; derived from earlier proprietary work). It was always motivated by a “dimensional analysis” or “get the units consistent” argument. It is intended as the pre-processing step before operations that are metric sensitive, such as KNN classification and principal components regression. We didn’t really work on proving theorems about it, because in certain contexts it can be recognized as “the right thing to do.” It derives from considering input (or independent variables or columns) as single-variable models and the combining of such variables as a nested model or ensemble model construction (chapter 6 of Practical Data Science with R, Nina Zumel, John Mount; Manning 2014 was somewhat organized with this idea behind the scenes). Considering y (or the outcome to be modeled) during dimension reduction prior to predictive modeling is a natural concern, but it seems to be anathema in principal components analysis.
y-aware scaling is in fact simple (it involves multiplying by the slope coefficients from linear regressions for a regression problem or multiplying by the slope coefficient from a logistic regression for classification problems; this is different than multiplying by the outcome y which would not be available during the application phase of a predictive model). The fact that it is simple makes it a bit hard to accept that it is both effective and novel. We are not saying it is unprecedented, but it is certainly not center in the standard literature (despite being an easy and effective technique).
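The regression case described above can be sketched directly. Below is a minimal pure-Python illustration (function name is mine; vtreat's actual "scale mode" also handles classification outcomes, missing values, and categorical recoding): each centered column is multiplied by the slope of the single-variable linear regression of y on that column, putting all columns into y's units.

```python
# Sketch of y-aware scaling: center x, then multiply by slope of y ~ x.
def y_aware_scale_column(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    # slope of the single-variable regression y ~ x: cov(x, y) / var(x)
    slope = sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)
    return [slope * v for v in xc]

# A column linearly related to y gets rescaled into y's units:
y  = [1.0, 2.0, 3.0, 4.0]
x1 = [10.0, 20.0, 30.0, 40.0]  # y = x1/10, so the fitted slope is 0.1
print(y_aware_scale_column(x1, y))  # ~[-1.5, -0.5, 0.5, 1.5], i.e. centered y
```

A noise column uncorrelated with y would get a near-zero slope and so be shrunk toward zero, which is exactly the property that makes the transform useful before metric-sensitive methods.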
There is an extensive literature on scaling, filtering, transforming, and pre-conditioning data for principal components analysis (for example, see “Centering, scaling, and transformations: improving the biological information content of metabolomics data”, Robert A van den Berg, Huub CJ Hoefsloot, Johan A Westerhuis, Age K Smilde, and Mariët J van der Werf, BMC Genomics 2006, 7:142). However, these are all what we call x-only transforms.
When you consult references (such as The Elements of Statistical Learning, 2nd edition, Trevor Hastie, Robert Tibshirani, Jerome Friedman, Springer 2009; and Applied Predictive Modeling, Max Kuhn, Kjell Johnson, Springer 2013) you basically see only two y-sensitive principal components style techniques (in addition to recommendations to use regularized regression):
I would like to repeat (it is already implied in Nina’s article): y-aware scaling is not equivalent to either of these methods.
Supervised PCA simply prunes the variables by inspecting small regressions prior to the PCA steps. To my mind it makes our point, that principal components users do not consider using the outcome or y-variable in their data preparation, that in 2006 one could get a publication by encouraging this natural step and giving it a name. I’ll repeat: filtering and pruning variables is common in many forms of data analysis, so it is remarkable how much work was required to sell the idea of supervised PCA.
Partial Least Squares Regression is an interesting y-aware technique, but it is a different (and more complicated) technique than y-aware scaling. Here is an example (in R) showing the two methods having very different performance on (an admittedly artificial) problem: PLS.md.
In conclusion, I encourage you to take the time to read up on y-aware scaling and consider using it during your dimension reduction steps prior to predictive modeling.
After reading the article we have a few follow-up thoughts on the topic.
Our group has written on the use of differential privacy to improve machine learning algorithms (by slowing down the exhaustion of novelty in your data):
However, these are situations without competing interests: we are just trying to build a better model. What about the original application of differential privacy: trading modeling effectiveness against protecting those one has collected data on? Is un-audited differential privacy an effective protection, or is it a fig-leaf that merely checks off data privacy regulations?
A few of the points to ponder:
We’ll end with: we think the applications of differential privacy techniques to improving machine performance are still the most promising applications, as they don’t have the difficulty of trying to serve competing interests (modeling effectiveness versus privacy). A great example of this is the fascinating paper “The Ladder: A Reliable Leaderboard for Machine Learning Competitions” by Avrim Blum and Moritz Hardt. I’d like to think clever applications such as the preceding drive current interest in the topic of differential privacy (post 2015). But it looks like all anyone cares about is Apple’s announcement.
What I am trying to say: claiming the use of differential privacy should not be a “get out of regulation free card.” At best it is a tool that can be part of implementing privacy protection, and one that definitely requires ongoing detailed oversight and auditing.
Win-Vector LLC’s Dr. Nina Zumel has a three part series on Principal Components Regression that we think is well worth your time.
You can read her first article (part 1) here. Principal Components Regression (PCR) is the use of Principal Components Analysis (PCA) as a dimension reduction step prior to linear regression. It is one of the best known dimensionality reduction techniques and a staple procedure in many scientific fields. PCA is used because:
We often find ourselves having to remind readers that this last reason is not actually a positive. The standard derivation of PCA involves trotting out the math and showing the determination of eigenvector directions. It yields visually attractive diagrams such as the following.
Wikipedia: PCA
And this leads to a deficiency in much of the teaching of the method: glossing over the operational consequences and outcomes of applying the method. The mathematics is important to the extent it allows you to reason about the appropriateness of the method, the consequences of the transform, and the pitfalls of the technique. The mathematics is also critical to the correct implementation, but that is what one hopes is already supplied in a reliable analysis platform (such as R). Dr. Zumel uses the expressive and graphical power of R to work through the use of Principal Components Regression in an operational series of examples. She works through how Principal Components Regression is typically mis-applied and continues on to how to correctly apply it. Taking the extra time to work through the all too common errors allows her to demonstrate and quantify the benefits of correct technique. Dr. Zumel will follow part 1 with a shorter part 2 article demonstrating important "y-aware" techniques that squeeze much more modeling power out of your data in predictive analytic situations (which is what regression actually is). Some of the methods are already in the literature, but are still not used widely enough. We hope the demonstrated techniques and included references will give you a perspective to improve how you use or even teach Principal Components Regression. Please read on here.
In part 2 of her series on Principal Components Regression Dr. Nina Zumel illustrates so-called y-aware techniques. These often neglected methods use the fact that for predictive modeling problems we know the dependent variable, outcome or y, so we can use this during data preparation in addition to using it during modeling. Dr. Zumel shows the incorporation of y-aware preparation into Principal Components Analyses can capture more of the problem structure in fewer variables. Such methods include:
This recovers more domain structure and leads to better models. Using the foundation set in the first article Dr. Zumel quickly shows how to move from a traditional x-only analysis that fails to preserve a domain-specific relation of two variables to outcome to a y-aware analysis that preserves the relation. Or in other words how to move away from a middling result where different values of y (rendered as three colors) are hopelessly intermingled when plotted against the first two found latent variables as shown below.
Dr. Zumel shows how to perform a decisive analysis where y is somewhat sortable by each of the first two latent variables, and the first two latent variables capture complementary effects, making them good mutual candidates for further modeling (as shown below).
Click here (part 2 y-aware methods) for the discussion, examples, and references. Part 1 (x only methods) can be found here.
In her series on principal components analysis for regression in R Win-Vector LLC‘s Dr. Nina Zumel broke the demonstration down into the following pieces:
In the earlier parts Dr. Zumel demonstrates common poor practice versus best practice and quantifies the degree of available improvement. In part 3 she moves from the usual "pick the number of components by eyeballing it" non-advice and teaches decisive decision procedures. For picking the number of components to retain for analysis there are a number of standard techniques in the literature including:
Dr. Zumel shows that the last method (designing a formal statistical test) is particularly easy to encode as a permutation test in the y-aware setting (there is also an obvious similarly good bootstrap test). This is well-founded and pretty much state of the art. It is also a great example of why to use a scriptable analysis platform (such as R), as it is easy to wrap arbitrarily complex methods into functions and then directly perform empirical tests on them. This "broken stick" type test yields the following graph, which identifies five principal components as significant:
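To make the permutation-test idea concrete, here is a rough pure-Python sketch of the general pattern (my reconstruction of the idea, not Dr. Zumel's code; all names are mine): compare an observed fit statistic, here a variable's squared correlation with y, against its distribution when y is randomly shuffled, which destroys any real relation.

```python
# Sketch: permutation-based significance of a single variable's fit to y.
import random

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def permutation_p_value(x, y, n_perm=200, seed=0):
    rng = random.Random(seed)
    observed = r_squared(x, y)
    yp = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(yp)              # break any x/y relation
        if r_squared(x, yp) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one for a valid p-value

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(100)]
signal = [2 * v + rng.gauss(0, 1) for v in x]   # truly related to x
noise  = [rng.gauss(0, 1) for _ in range(100)]  # unrelated to x
print(permutation_p_value(x, signal), permutation_p_value(x, noise))
```

The same scheme applies whether the "variable" is an original column or a derived principal component, which is the pruning setting discussed next.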
However, Dr. Zumel goes on to show that in a supervised learning or regression setting we can further exploit the structure of the problem and replace the traditional component magnitude tests with simple model fit significance pruning. The significance method in this case gets the stronger result of finding the two principal components that encode the known even and odd loadings of the example problem:
In fact that is sort of her point: significance pruning either on the original variables or on the derived latent components is enough to give us the right answer. In general we get much better results when (in a supervised learning or regression situation) we use knowledge of the dependent variable (the "y" or outcome) and do all of the following:
The above will become much clearer and much more specific if you click here to read part 3.
Exploring Data Science gives you a free sample of important data science topics chosen from great Manning books. Each chapter was chosen by John Mount and Nina Zumel and includes a brief orientation/introduction. The topics are:
This 191 page e-book is free, but only officially licensed/available from manning.com. To get your free PDF click here and use Manning’s online shopping system. You will have to enter your email and other details, but Manning Publications is a reputable vendor well worth having an account with.
Please check it out! And please help us promote this fun offering by posting, sharing, and Tweeting.
Update: in addition to the excerpted chapters and new introductions the free e-book contains special discount codes for any of the books mentioned!!! So this is really something to consider if you want to deepen or broaden your data science knowledge.
geom_step is an interesting geom supplied by the R package ggplot2. It is an appropriate rendering option for financial market data, and we will show how and why to use it in this article.
Let’s take a simple example of plotting market data. In this case we are plotting the "ask price" (the publicly published price an item is available for purchase at a given time), the "bid price" (the publicly published price an item can be sold for at a given time), and "trades" (past purchases and sales).
Most markets maintain these "quoted" prices as an order book and the public ask price is always greater than the public bid price (else we would have a "crossed market"). We can also track recent transactions or trades. Here is some example (made-up) data.
print(quotes)
## quoteTime date askPrice bidPrice
## 1 2016-01-04 09:14:00 2016-01-04 10.81 10.69
## 2 2016-01-04 11:45:17 2016-01-04 11.09 10.68
## 3 2016-01-04 15:25:00 2016-01-04 12.32 12.03
## 4 2016-01-05 10:12:13 2016-01-05 14.33 13.69
## 5 2016-01-06 09:02:00 2016-01-06 17.17 16.20
## 6 2016-01-06 15:10:00 2016-01-06 18.86 18.35
## 7 2016-01-06 15:27:00 2016-01-06 20.89 20.32
print(trades)
## tradeTime date tradePrice quantity
## 1 2016-01-04 09:14:00 2016-01-04 10.81 600
## 2 2016-01-04 11:45:17 2016-01-04 10.68 500
## 6 2016-01-06 15:10:00 2016-01-06 18.35 200
## 7 2016-01-06 15:27:00 2016-01-06 20.89 200
Notice each revision of the book (notification of a bid price, ask price, or both) happens at a specific time. Ask and bid prices are good until they are revised or withdrawn.
There is some issue as to what is the "price" of a financial instrument (say in this case a stock).
Money only changes hands on trades- so past quotes that were never "hit" or traded against in some sense never happened (in fact this is becoming a problem called "flashing"). So market participants can somewhat manipulate bids and asks as long as they don’t cross. Asks and bids represent risk or a one-sided opinion on price but can not be trusted (especially when the "bid ask gap" is very large).
Trades cost fees and transfer money, so they are evidence of two parties agreeing on price for a moment. But all trades you know about are in the past. Just because somebody purchased some shares of IBM in the past for $120 a share doesn’t mean you can do the same. You could only make such a purchase if there is an appropriate ask price in the market (or you place your own limit order forming a bid that somebody else hits).
What I am trying to say is that the classic "ticker tape pattern" graph shown below, drawing only trades and connecting them with sloping lines, is not appropriate for plotting markets (especially when plotting high frequency or in-day data).
ggplot(data=trades,aes(x=tradeTime,y=tradePrice)) +
geom_line() + geom_point()
There is a lot wrong with such graphs. For example, avoid geom_smooth here, as its defaults use data from both the past and the future to perform the smoothing (instead one should use a trailing window, such as exponential smoothing).
(Side note: if anybody has some good code to make geom_smooth perform exponential smoothing in all cases, including grouping and facets, I would really like a copy. Right now I have to join in smoothed data as a new column, as I have never completely grokked all of the implementation interface requirements for new ggplot2 statistics in their full production complexity.)
If all that seems complicated, scary, unpleasant and technical: that is the right way to think. Markets are not safe, simple, or pleasant. They can be reasoned about and worked with, but it is wrong to think they are simple or easy.
An (unfortunately) more complicated (and slightly less legible) graph is needed to try and faithfully present the information. Since asks and bids are good until withdrawn or revised, we render them with a step shape (such as generated by ggplot2::geom_step), and since trades happen only at a single time (and are not a promise going forward), we render them with points. Such a graph is given below.
ggplot() +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice))) +
ylab('price') + xlab('time') + scale_y_log10(breaks=breaks)
The step functions propagate flat lines forward from quote revisions, correctly indicating what ask price and bid price were in effect at all times. Trades are shown as dots since they have no propagation. Each item drawn on the graph at a given time was actually known by that time (so a person or trading strategy would also have access to such information at that time).
Trades that occur nearer the ask price can be considered "buyer initiated" and trades that occur nearer the bid price can be considered "seller initiated", which we can indicate through color.
mids <- (lastKnownValue(NA,quotes$quoteTime,quotes$askPrice,trades$tradeTime)+
lastKnownValue(NA,quotes$quoteTime,quotes$bidPrice,trades$tradeTime))/2
trades$type <- ifelse(trades$tradePrice>=mids,'buy','sell')
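The article's lastKnownValue() helper isn't shown; it implements exactly the step-function semantics described above. Here is a sketch of an equivalent lookup in Python (name and signature are mine): for each query time, return the most recent observed value at or before it, or a default if none exists yet.

```python
# Sketch: step-function ("last known value") lookup for quote data.
import bisect

def last_known_value(default, times, values, query_times):
    pairs = sorted(zip(times, values))
    ts = [t for t, _ in pairs]
    vs = [v for _, v in pairs]
    out = []
    for q in query_times:
        i = bisect.bisect_right(ts, q)  # count of observations at or before q
        out.append(vs[i - 1] if i > 0 else default)
    return out

# e.g. ask prices revised at times 1, 3, 5; queried at times 0, 2, 3, 6:
asks = last_known_value(None, [1, 3, 5], [10.81, 11.09, 12.32], [0, 2, 3, 6])
print(asks)  # [None, 10.81, 11.09, 12.32]
```

The mid-price used for the buy/sell classification is then just the average of the last known ask and last known bid at each trade time.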
ggplot() +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice,color=type))) +
ylab('price') + xlab('time') + scale_y_log10(breaks=breaks) +
scale_color_brewer(palette = 'Dark2')
This is a good time to point out a problem in these graphs. We are mostly plotting times when the market is closed. Most of the space is wasted. In the graph below we indicate (fictitious) market hours by shading the "market open hours" to illustrate the issue.
print(openClose)
## date time what askPrice bidPrice
## 1 2016-01-04 2016-01-04 09:00:00 open NA NA
## 2 2016-01-04 2016-01-04 15:30:00 close 12.32 12.03
## 3 2016-01-05 2016-01-05 09:00:00 open 12.32 12.03
## 4 2016-01-05 2016-01-05 15:30:00 close 14.33 13.69
## 5 2016-01-06 2016-01-06 09:00:00 open 14.33 13.69
## 6 2016-01-06 2016-01-06 15:30:00 close 20.89 20.32
openClose %>% select(date,time,what) %>% spread(what,time) -> marketHours
ggplot() +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice,color=type))) +
geom_rect(data=marketHours,
mapping=aes(xmin=open,xmax=close,ymin=0,ymax=Inf),
fill='blue',alpha=0.3) +
ylab('price') + xlab('time') + scale_y_log10(breaks=breaks) +
scale_color_brewer(palette = 'Dark2')
The easiest way to fix this in ggplot2 would be to use facet_wrap, but this crashes (at least for ggplot2 version 2.1.0, current on CRAN 2016-06-03) with the very cryptic error message shown below.
ggplot() +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_step(data=quotes,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice,color=type))) +
facet_wrap(~date,scale='free_x') +
ylab('price') + xlab('time') + scale_color_brewer(palette = 'Dark2')
## Error in grid.Call.graphics(L_lines, x$x, x$y, index, x$arrow): invalid line type
Despite the message "invalid line type", the error is not caused by the user's selection of linetype. It is easier to see what is going on if we replace geom_step with geom_line, as we show below.
ggplot() +
geom_line(data=quotes,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_line(data=quotes,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice,color=type))) +
facet_wrap(~date,scale='free_x') +
ylab('price') + xlab('time') + scale_y_log10(breaks=breaks) +
scale_color_brewer(palette = 'Dark2')
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
The above graph now uses sloped lines to connect ask price and bid price revisions (giving the false impression that these intermediate prices were ever available, essentially "leaking information from the future" into the visual presentation). However, we get a graph and a more reasonable warning message: "geom_path: Each group consists of only one observation." There was only one quote revision on 2016-01-05, and since facet_wrap treats each facet as a sub-graph (and not as a portal into a single larger graph), days with fewer than 2 quote revisions have trouble drawing paths. This trouble causes the (deceptive) blank facet for 2016-01-05 if we are using simple sloped lines (geom_line) and seems to error out on the more complicated geom_step.
In my opinion geom_step should "fail a bit more gently" on this example (as geom_line already does). In any case the correct domain-specific fix is to regularize the data a bit by adding market open and close information. In many markets the open and closing prices are set by specific mechanisms (such as an opening auction and a closing volume- or time-weighted average). For our example we will just use the last known price (which we have already prepared).
openClose %>% mutate(quoteTime=time) %>%
bind_rows(quotes) %>%
arrange(time) %>%
select(date,askPrice,bidPrice,quoteTime) -> joinedData
ggplot() +
geom_step(data=joinedData,mapping=aes(x=quoteTime,y=askPrice),
linetype=2,color='#1b9e77',alpha=0.5) +
geom_step(data=joinedData,mapping=aes(x=quoteTime,y=bidPrice),
linetype=2,color='#d95f02',alpha=0.5) +
geom_point(data=trades,mapping=(aes(x=tradeTime,y=tradePrice,color=type))) +
facet_wrap(~date,scale='free_x') +
ylab('price') + xlab('time') + scale_y_log10(breaks=breaks) +
scale_color_brewer(palette = 'Dark2')
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_path).
The above graph is pretty good. In fact, easily producing a graph like this in R using dygraphs is currently an open issue.
In previous writings we have gone to great lengths to document, explain and motivate vtreat. That necessarily gets long and can make things feel unnecessarily complicated.
In this example we are going to show what building a predictive model using vtreat best practices looks like, assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, just show the small number of steps we advise routinely using. This is a simple schematic, not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but we want to show how small an effort is required to add vtreat to your predictive modeling practice.
First we set things up: load libraries, initialize parallel processing.
library('vtreat')
library('caret')
library('gbm')
library('doMC')
library('WVPlots') # see https://github.com/WinVector/WVPlots
# parallel for vtreat
ncores <- parallel::detectCores()
parallelCluster <- parallel::makeCluster(ncores)
# parallel for caret
registerDoMC(cores=ncores)
Then we load our data for analysis. We are going to build a model predicting income level from other demographic features. The data is taken from here, and you can perform all of the demonstrated steps if you download the contents of the example Git directory. Obviously this has a lot of moving parts (R, R Markdown, GitHub, R packages, devtools), but it is very easy to do a second time (the first time can involve a bit of learning and preparation).
# load data
# data from: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/
colnames <-
c(
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'capital-gain',
'capital-loss',
'hours-per-week',
'native-country',
'class'
)
dTrain <- read.table(
'adult.data.txt',
header = FALSE,
sep = ',',
strip.white = TRUE,
stringsAsFactors = FALSE,
na.strings = c('NA', '?', '')
)
colnames(dTrain) <- colnames
dTest <- read.table(
'adult.test.txt',
skip = 1,
header = FALSE,
sep = ',',
strip.white = TRUE,
stringsAsFactors = FALSE,
na.strings = c('NA', '?', '')
)
colnames(dTest) <- colnames
Now we use vtreat to prepare the data for analysis. The goal of vtreat is to ensure a ready-to-dance data frame in a statistically valid manner. We are respecting the test/train split and building our data preparation plan only on the training data (though we do apply it to the test data). This step helps with a huge number of potential problems through automated repairs.
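To make one such repair concrete, here is a base R sketch of missing-value handling in the spirit of what vtreat automates (an illustration, not vtreat's actual implementation): impute numeric NAs and record the missingness in an explicit indicator column.

```r
# Base R sketch of one vtreat-style repair: impute NAs with the column mean
# and keep an indicator column recording which values were missing.
x <- c(1, 2, NA, 4)
x_isMissing <- as.numeric(is.na(x))                   # c(0, 0, 1, 0)
x_clean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x) # NA replaced by 7/3
```

Keeping the indicator matters because, as discussed later in this note, missingness is often itself informative.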
# define problem
yName <- 'class'
yTarget <- '>50K'
varNames <- setdiff(colnames,yName)
# build variable encoding plan and prepare simulated out of sample
# training frame (cross-frame)
# http://www.win-vector.com/blog/2016/05/vtreat-cross-frames/
system.time({
cd <- vtreat::mkCrossFrameCExperiment(dTrain,varNames,yName,yTarget,
parallelCluster=parallelCluster)
scoreFrame <- cd$treatments$scoreFrame
dTrainTreated <- cd$crossFrame
# pick our variables
newVars <- scoreFrame$varName[scoreFrame$sig<1/nrow(scoreFrame)]
dTestTreated <- vtreat::prepare(cd$treatments,dTest,
pruneSig=NULL,varRestriction=newVars)
})
## user system elapsed
## 11.340 2.760 30.872
#print(newVars)
Now we train our model. In this case we are using the caret package to tune parameters.
# train our model using caret
system.time({
yForm <- as.formula(paste(yName,paste(newVars,collapse=' + '),sep=' ~ '))
# from: http://topepo.github.io/caret/training.html
fitControl <- trainControl(
method = "cv",
number = 3)
model <- train(yForm,
data = dTrainTreated,
method = "gbm",
trControl = fitControl,
verbose = FALSE)
print(model)
dTest$pred <- predict(model,newdata=dTestTreated,type='prob')[,yTarget]
})
## Stochastic Gradient Boosting
##
## 32561 samples
## 64 predictor
## 2 classes: '<=50K', '>50K'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 21707, 21708, 21707
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8476398 0.5083558
## 1 100 0.8556555 0.5561726
## 1 150 0.8577746 0.5699958
## 2 50 0.8560855 0.5606650
## 2 100 0.8593102 0.5810931
## 2 150 0.8625042 0.5930111
## 3 50 0.8593717 0.5789289
## 3 100 0.8649919 0.6017707
## 3 150 0.8660975 0.6073645
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## user system elapsed
## 61.908 2.227 36.850
Finally we take a look at the results on the held-out test data.
WVPlots::ROCPlot(dTest,'pred',yName,'predictions on test')
WVPlots::DoubleDensityPlot(dTest,'pred',yName,'predictions on test')
confusionMatrix <- table(truth=dTest[[yName]],pred=dTest$pred>=0.5)
print(confusionMatrix)
## pred
## truth FALSE TRUE
## <=50K. 11684 751
## >50K. 1406 2440
testAccuracy <- (confusionMatrix[1,1]+confusionMatrix[2,2])/sum(confusionMatrix)
testAccuracy
## [1] 0.8675143
Notice the achieved test accuracy is in the ballpark of what was reported for this dataset.
(From [adult.names description](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names) )
| Error Accuracy reported as follows, after removal of unknowns from
| train/test sets):
| C4.5 : 84.46+-0.30
| Naive-Bayes: 83.88+-0.30
| NBTree : 85.90+-0.28
We can also compare accuracy on the "complete cases":
dTestComplete <- dTest[complete.cases(dTest[,varNames]),]
confusionMatrixComplete <- table(truth=dTestComplete[[yName]],
pred=dTestComplete$pred>=0.5)
print(confusionMatrixComplete)
## pred
## truth FALSE TRUE
## <=50K. 10618 742
## >50K. 1331 2369
testAccuracyComplete <- (confusionMatrixComplete[1,1]+confusionMatrixComplete[2,2])/
sum(confusionMatrixComplete)
testAccuracyComplete
## [1] 0.8623506
# clean up
parallel::stopCluster(parallelCluster)
These two scores are within noise bounds of each other, but it is our experience that missingness is often actually informative, so in addition to imputing missing values you would like to preserve some notation indicating the missingness (which vtreat does in fact do).
And that is all there is to this example. I'd like to emphasize that the vtreat steps were only a few lines in one of the blocks of code. vtreat treatment can take some time, but it is usually bearable. By design it is easy to add vtreat to your predictive analytics projects.
The point is: we got competitive results on real world data, in a single try (using vtreat to prepare data and caret to tune parameters). The job of the data scientist is to actually work longer on a problem and do better. But having a good start helps.
The theory behind vtreat is fairly important to the correctness of our implementation, and we would love for you to read through some of it:
But operationally, please think of vtreat as just adding a couple of lines to your analysis scripts. Again, the raw R Markdown source can be found here and a rendered copy (with results and graphs) here.
Before starting the discussion, let’s quickly redo our y-aware PCA. Please refer to our previous post for a full discussion of this data set and this approach.
#
# make data
#
set.seed(23525)
dTrain <- mkData(1000)
dTest <- mkData(1000)
#
# design treatment plan
#
treatmentsN <- designTreatmentsN(dTrain,
setdiff(colnames(dTrain),'y'),'y',
verbose=FALSE)
#
# prepare the treated frames, with y-aware scaling
#
examplePruneSig = 1.0
dTrainNTreatedYScaled <- prepare(treatmentsN,dTrain,
pruneSig=examplePruneSig,scale=TRUE)
dTestNTreatedYScaled <- prepare(treatmentsN,dTest,
pruneSig=examplePruneSig,scale=TRUE)
#
# do the principal components analysis
#
vars <- setdiff(colnames(dTrainNTreatedYScaled),'y')
# prcomp defaults to scale. = FALSE, but we already
# scaled/centered in vtreat, which we don't want to lose.
dmTrain <- as.matrix(dTrainNTreatedYScaled[,vars])
dmTest <- as.matrix(dTestNTreatedYScaled[,vars])
princ <- prcomp(dmTrain, center = FALSE, scale. = FALSE)
If we examine the magnitudes of the resulting singular values, we see that we should use from two to five principal components for our analysis. In fact, as we showed in the previous post, the first two singular values accurately capture the two unobservable processes that contribute to y, and a linear model fit to these two components captures most of the explainable variance in the data, both on training and on hold-out data.
We picked the number of principal components to use by eye; but it’s tricky to implement code based on the strategy "look for a knee in the curve." So how might we automate picking the appropriate number of components in a reliable way?
Jackson (1993) and Peres-Neto, et al. (2005) are two excellent surveys and evaluations of the different published approaches to picking the number of components in standard PCA. Those methods include:
caret::preProcess.
The papers also cover other approaches, as well as different variations of the above.
Kabacoff (R in Action, 2nd Edition, 2015) suggests comparing the magnitudes of the singular values to those extracted from random matrices of the same shape as the original data. Let’s assume that the original data has k variables, and that PCA on the original data extracts the k singular values s_{i} and the k principal components PC_{i}. To pick the appropriate number of principal components: generate random matrices of the same shape as the data, extract their singular values r_{i}, and keep those principal components whose singular values s_{i} exceed the corresponding r_{i}.
The idea is that if there is more variation in a given direction than you would expect at random, then that direction is probably meaningful. If you assume that higher variance directions are more useful than lower variance directions (the usual assumption), then one handy variation is to find the first i such that s_{i} < r_{i}, and keep the first i-1 principal components.
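The "find the first i such that s_{i} < r_{i}" rule is easy to code directly; a minimal sketch (helper name hypothetical):

```r
# Keep the leading principal components whose singular values exceed their
# random-matrix thresholds; stop at the first one that falls below.
pickComponents <- function(s, r) {
  below <- which(s < r)
  if (length(below) == 0) length(s) else below[[1]] - 1
}
```

For example, with singular values c(5, 3, 1, 0.5) against a flat threshold of 2, the rule keeps the first 2 components.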
This approach is similar to what the authors of the survey papers cited above refer to as the broken-stick method. In their research, the broken-stick method was among the best performing approaches for a variety of simulated and real-world examples.
With the proper adjustment, all of the above heuristics work as well in the y-adjusted case as they do with traditional x-only PCA.
Since in our case we know y, we can — and should — take advantage of this information. We will use a variation of the broken-stick method, but rather than comparing our data to a random matrix, we will compare our data to alternative datasets where x has no relation to y. We can do this by randomly permuting the y values. This preserves the structure of x — that is, the correlations and relationships of the x variables to each other — but it changes the units of the problem, that is, the y-aware scaling. We are testing whether or not a given principal component appears more meaningful in a metric space induced by the true y than it does in a random metric space, one that preserves the distribution of y, but not the relationship of y to x.
You can read a more complete discussion of permutation tests and their application to variable selection (significance pruning) in this post.
In our example, we’ll use N=100, and rather than using the means of the singular values from our experiments as the thresholds, we’ll use the 98th percentiles. This represents a threshold value that a singular value induced in a random space will exceed only about a 1/(number of variables) (1/50 = 0.02) fraction of the time.
#
# Resample y, do y-aware PCA,
# and return the singular values
#
getResampledSV = function(data,yindices) {
# resample y
data$y = data$y[yindices]
# treatment plan
treatplan = vtreat::designTreatmentsN(data,
setdiff(colnames(data), 'y'),
'y', verbose=FALSE)
# y-aware scaling
dataTreat = vtreat::prepare(treatplan, data, pruneSig=1, scale=TRUE)
# PCA
vars = setdiff(colnames(dataTreat), 'y')
dmat = as.matrix(dataTreat[,vars])
princ = prcomp(dmat, center=FALSE, scale=FALSE)
# return the magnitudes of the singular values
princ$sdev
}
#
# Permute y, do y-aware PCA,
# and return the singular values
#
getPermutedSV = function(data) {
n = nrow(data)
getResampledSV(data,sample(n,n,replace=FALSE))
}
#
# Run the permutation tests and collect the outcomes
#
niter = 100 # should be >> nvars
nvars = ncol(dTrain)-1
# matrix: 1 column for each iter, nvars rows
svmat = vapply(1:niter, FUN=function(i) {getPermutedSV(dTrain)}, numeric(nvars))
rownames(svmat) = colnames(princ$rotation) # rows are principal components
colnames(svmat) = paste0('rep',1:niter) # each col is an iteration
# plot the distribution of values for the first singular value
# compare it to the actual first singular value
ggplot(as.data.frame(t(svmat)), aes(x=PC1)) +
geom_density() + geom_vline(xintercept=princ$sdev[[1]], color="red") +
ggtitle("Distribution of magnitudes of first singular value, permuted data")
Here we show the distribution of the magnitude of the first singular value on the permuted data, and compare it to the magnitude of the actual first singular value (the red vertical line). We see that the actual first singular value is far larger than the magnitude you would expect from data where x is not related to y. Let’s compare all the singular values to their permutation test thresholds. The dashed line is the mean value of each singular value from the permutation tests; the shaded area represents the 98th percentile.
# transpose svmat so we get one column for every principal component
# Get the mean and empirical confidence level of every singular value
as.data.frame(t(svmat)) %>% dplyr::summarize_each(funs(mean)) %>% as.numeric() -> pmean
confF <- function(x) as.numeric(quantile(x,1-1/nvars))
as.data.frame(t(svmat)) %>% dplyr::summarize_each(funs(confF)) %>% as.numeric() -> pupper
pdata = data.frame(pc=seq_len(length(pmean)), magnitude=pmean, upper=pupper)
# we will use the first place where the singular value falls
# below its threshold as the cutoff.
# Obviously there are multiple comparison issues on such a stopping rule,
# but for this example the signal is so strong we can ignore them.
below = which(princ$sdev < pdata$upper)
lastSV = below[[1]] - 1
This test suggests that we should use 5 principal components, which is consistent with what our eye sees. This is perhaps not the "correct" knee in the graph, but it is undoubtedly a knee.
Empirically estimating the quantiles from the permuted data so that we can threshold the non-informative singular values will have some undesirable bias and variance, especially if we do not perform enough experiment replications. This suggests that instead of estimating quantiles ad-hoc, we should use a systematic method: The Bootstrap. Bootstrap replication breaks the input to output association by re-sampling with replacement rather than using permutation, but comes with built-in methods to estimate bias-adjusted confidence intervals. The methods are fairly technical, and on this dataset the results are similar, so we don’t show them here, although the code is available in the R markdown document used to produce this note.
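The difference between the two resampling schemes comes down to the replace argument of sample; a minimal illustration:

```r
# Permutation vs bootstrap resampling of row indices
set.seed(42)
n <- 10
permIdx <- sample(n, n, replace = FALSE) # permutation: each row appears exactly once
bootIdx <- sample(n, n, replace = TRUE)  # bootstrap: rows may repeat or be omitted
```

The permutation destroys the x-to-y association while keeping each y value exactly once; the bootstrap instead re-samples rows with replacement, which is what enables the standard bias-adjusted confidence interval machinery.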
Alternatively, we can treat the principal components that we extracted via y-aware PCA simply as transformed variables — which is what they are — and significance prune them in the standard way. As our article on significance pruning discusses, we can estimate the significance of a variable by fitting a one variable model (in this case, a linear regression) and looking at that model’s significance value. You can pick the pruning threshold by considering the rate of false positives that you are willing to tolerate; as a rule of thumb, we suggest one over the number of variables.
In regular significance pruning, you would take any variable with estimated significance value lower than the threshold. Since in the PCR situation we presume that the variables are ordered from most to least useful, you can again look for the first position i where the variable appears insignificant, and use the first i-1 variables.
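The one-variable-model significance estimate can be sketched in base R with lm (a simplified stand-in for what vtreat computes; data here is synthetic):

```r
# Estimate the significance of a single variable by fitting a one-variable
# linear model and reading off the coefficient's p-value.
set.seed(5)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
model <- lm(y ~ x)
sig <- summary(model)$coefficients['x', 'Pr(>|t|)']
# a strongly related x yields a significance far below a 1/nvars threshold
```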
We’ll use vtreat to get the significance estimates for the principal components. We’ll use one over the number of variables (1/50 = 0.02) as the pruning threshold.
# get all the principal components
# not really a projection as we took all components!
projectedTrain <- as.data.frame(predict(princ,dTrainNTreatedYScaled),
stringsAsFactors = FALSE)
vars = colnames(projectedTrain)
projectedTrain$y = dTrainNTreatedYScaled$y
# designing the treatment plan for the transformed data
# produces a data frame of estimated significances
tplan = designTreatmentsN(projectedTrain, vars, 'y', verbose=FALSE)
threshold = 1/length(vars)
scoreFrame = tplan$scoreFrame
scoreFrame$accept = scoreFrame$sig < threshold
# pick the number of variables in the standard way:
# the number of variables that pass the significance prune
nPC = sum(scoreFrame$accept)
Significance pruning picks 2 principal components, again consistent with our visual assessment. This time, we picked the correct knee: as we saw in the previous post, the first two principal components were sufficient to describe the explainable structure of the problem.
Since one of the purposes of PCR/PCA is to discover the underlying structure in the data, it’s generally useful to examine the singular values and the variable loadings on the principal components. However, an analysis should also be repeatable, and hence automatable, and it’s not straightforward to automate something as vague as "look for a knee in the curve" when selecting the number of principal components to use. We’ve covered two ways to programmatically select the appropriate number of principal components in a predictive modeling context.
To conclude this entire series, here is our recommended best practice for principal components regression:
Thanks to Cyril Pernet, who blogs at NeuroImaging and Statistics, for requesting this follow-up post and pointing us to the Jackson reference.
Jackson, Donald A. "Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches", Ecology Vol 74, no. 8, 1993.
Kabacoff, Robert I. R In Action, 2nd edition, Manning, 2015.
Efron, Bradley and Robert J. Tibshirani. An Introduction to the Bootstrap, Chapman and Hall/CRC, 1998.
Peres-Neto, Pedro, Donald A. Jackson and Keith M. Somers. "How many principal components? Stopping rules for determining the number of non-trivial axes revisited", Computational Statistics & Data Analysis, Vol 49, no. 4, 2005.