Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?”
My concrete advice is:
 Read Nina Zumel’s excellent series on scoring classifiers.
 Keep notes.
 Settle on one or two metrics as you move from project to project. We prefer “AUC” early in a project (when you want a flexible score) and “deviance” late in a project (when you want a strict score).
 When working on practical problems, work with your business partners to find out which of precision/recall or sensitivity/specificity best matches their business needs. If you have time, show them and explain the ROC plot, and invite them to price and pick points along the ROC curve that best fit their business goals. Finance partners will rapidly recognize the ROC curve as “the efficient frontier” of classifier performance and be very comfortable working with this summary.
That being said, there always seems to be a bit of gamesmanship: somebody always brings up yet another score, often apparently in the hope that you have not heard of it. Some choices of measure signal your pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background), and some seem chosen in the hope of befuddling others.
The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.
Definitions
Let’s define our terms. We are going to work with “binary classification” problems. These are problems where we have example instances (also called rows) that are either “in the class” (we will call these instances “true”) or not (and we will call these instances “false”). A classifier is a function that, given the description of an instance, tries to determine if the instance is in the class or not. The classifier may either return a decision of “positive”/“negative” (indicating the classifier thinks the instance is in or out of the class) or a probability score denoting the estimated probability of being in the class.
Decision or Hard Classifiers
For decision based (or “hard”) classifiers (those returning only a positive/negative determination) the “confusion matrix” is a sufficient statistic in the sense it contains all of the information summarizing classifier quality. All other classification measures can be derived from it.
For a decision classifier (one that returns “positive” and “negative”, and not probabilities) the classifier’s performance is completely determined by four counts:
 The True Positive count, this is the number of items that are in the true class that the classifier declares to be positive.
 The True Negative count, this is the number of items that are in the false class that the classifier declares to be negative.
 The False Positive count, this is the number of items that are not in the true class that the classifier declares to be positive.
 The False Negative count, this is the number of items in the true class that the classifier declares to be negative.
Notice true and false are being used to indicate if the classifier is correct (and not the actual category of each item) in these terms. This is traditional nomenclature. The first two quantities are where the classifier is correct (positive corresponding to true and negative corresponding to false) and the second two quantities count instances where the classifier is incorrect.
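As a minimal sketch of how the four counts arise, assume we have a logical vector of actual outcomes and a logical vector of hard classifier decisions. The data and variable names below are made up for illustration and are not part of the code shared with this note.

```r
# Hypothetical example data: actual class membership and hard classifier decisions.
actual    <- c(TRUE, TRUE,  TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
predicted <- c(TRUE, FALSE, TRUE, FALSE, TRUE,  FALSE, FALSE, TRUE)

# Each count pairs the classifier's decision (positive/negative)
# with whether that decision was correct (true/false).
TruePositives  <- sum(predicted & actual)
TrueNegatives  <- sum(!predicted & !actual)
FalsePositives <- sum(predicted & !actual)
FalseNegatives <- sum(!predicted & actual)

counts <- c(TruePositives = TruePositives, FalsePositives = FalsePositives,
            FalseNegatives = FalseNegatives, TrueNegatives = TrueNegatives)
print(counts)
```

Every instance lands in exactly one of the four cells, so the counts always sum to the number of instances.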
It is traditional to arrange these quantities into a 2 by 2 table called the confusion matrix. If we define:
library('ggplot2')
library('caret')
## Loading required package: lattice
library('rSymPy')
## Loading required package: rJython
## Loading required package: rJava
## Loading required package: rjson
A = Var('TruePositives')
B = Var('FalsePositives')
C = Var('FalseNegatives')
D = Var('TrueNegatives')
(Note all code shared here.)
Then the caret R package defines the confusion matrix as follows (see help("confusionMatrix")):
             Reference
Predicted    Event    No Event
Event        A        B
No Event     C        D
Reference is “ground truth” or actual outcome. We will call examples that have true ground truth “true examples” (again, please don’t confuse this with “TrueNegatives”, which are “false examples” that are correctly scored as being false). We would prefer to have the classifier indicate columns instead of rows, but we will use the caret notation for consistency.
We can encode what we have written about these confusion matrix summaries as algebraic statements. Caret’s help("confusionMatrix") then gives us definitions of a number of common classifier scores:
# (A+C) and (B+D) are facts about the data, independent of classifier.
Sensitivity = A/(A+C)
Specificity = D/(B+D)
Prevalence = (A+C)/(A+B+C+D)
PPV = (Sensitivity * Prevalence)/((Sensitivity*Prevalence) + ((1-Specificity)*(1-Prevalence)))
NPV = (Specificity * (1-Prevalence))/(((1-Sensitivity)*Prevalence) + ((Specificity)*(1-Prevalence)))
DetectionRate = A/(A+B+C+D)
DetectionPrevalence = (A+B)/(A+B+C+D)
BalancedAccuracy = (Sensitivity+Specificity)/2
We can (from our notes) also define some more common metrics:
TPR = A/(A+C) # True Positive Rate
FPR = B/(B+D) # False Positive Rate
FNR = C/(A+C) # False Negative Rate
TNR = D/(B+D) # True Negative Rate
Recall = A/(A+C)
Precision = A/(A+B)
Accuracy = (A+D)/(A+B+C+D)
By writing everything down it becomes obvious that Sensitivity==TPR==Recall. That won’t stop somebody from complaining if you say “recall” when they prefer “sensitivity”, but that is how things are.
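A numeric spot check of some of these definitions, without any symbolic algebra, might look like the following sketch. The helper name classifier_metrics and the counts are assumptions made up for illustration; the formulas are the ones listed above.

```r
# Hypothetical helper: evaluate a few of the above formulas on numeric counts.
# A = TruePositives, B = FalsePositives, C = FalseNegatives, D = TrueNegatives
classifier_metrics <- function(A, B, C, D) {
  Sensitivity <- A/(A+C)
  Specificity <- D/(B+D)
  Prevalence  <- (A+C)/(A+B+C+D)
  PPV <- (Sensitivity*Prevalence) /
    ((Sensitivity*Prevalence) + ((1-Specificity)*(1-Prevalence)))
  list(Sensitivity = Sensitivity, Specificity = Specificity,
       TPR = A/(A+C), Recall = A/(A+C),
       PPV = PPV, Precision = A/(A+B),
       Accuracy = (A+D)/(A+B+C+D),
       BalancedAccuracy = (Sensitivity+Specificity)/2)
}

# Made-up counts: 50 true examples among 1000 instances.
m <- classifier_metrics(A = 30, B = 10, C = 20, D = 940)
print(unlist(m))
```

A single numeric instance cannot prove two formulas are always equal, but it is a cheap way to catch a transcription error before reaching for a symbolic check.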
By declaring all of these quantities as sympy variables and expressions we can now check much more. We confirm formal equality of various measures by checking that their difference algebraically simplifies to zero.
# Confirm TPR == 1 - FNR
sympy(paste("simplify(",TPR-(1-FNR),")"))
## [1] "0"
# Confirm Recall == Sensitivity
sympy(paste("simplify(",Recall-Sensitivity,")"))
## [1] "0"
# Confirm PPV == Precision
sympy(paste("simplify(",PPV-Precision,")"))
## [1] "0"
We can also confirm non-identity by simplifying and checking an instance:
# Confirm Precision != Specificity
expr <- sympy(paste("simplify(",Precision-Specificity,")"))
print(expr)
## [1] "(FalsePositives*TruePositives - FalsePositives*TrueNegatives)/(FalsePositives*TrueNegatives + FalsePositives*TruePositives + TrueNegatives*TruePositives + FalsePositives**2)"
# Evaluate a parsed expression at specific counts
sub <- function(expr,
                TruePositives,FalsePositives,FalseNegatives,TrueNegatives) {
  eval(expr)
}
sub(parse(text=expr),
    TruePositives=0,FalsePositives=1,FalseNegatives=0,TrueNegatives=1)
## [1] -0.5
More difficult checks
Balanced Accuracy
Write Prob[score(true)>score(false)] (with half point on ties) for the probability that a true (in-class) instance scores higher than a false (not-in-class) instance, with 1/2 point for ties. We can then confirm Prob[score(true)>score(false)] (with half point on ties) == BalancedAccuracy for hard or decision classifiers by scoring each pair of one true instance and one false instance as follows:
A D : True Positive and True Negative: Correct sorting 1 point
A B : True Positive and False Positive (same prediction "Positive", different outcomes): 1/2 point
C D : False Negative and True Negative (same prediction "Negative", different outcomes): 1/2 point
C B : False Negative and False Positive: Wrong order 0 points
Then ScoreTrueGTFalse == Prob[score(true)>score(false)] (with 1/2 point for ties) is:
ScoreTrueGTFalse = (1*A*D + 0.5*A*B + 0.5*C*D + 0*C*B)/((A+C)*(B+D))
Which we can confirm is equal to balanced accuracy.
sympy(paste("simplify(",ScoreTrueGTFalse-BalancedAccuracy,")"))
## [1] "0"
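The same equality can be checked by brute force on a single numeric instance. The sketch below assumes made-up counts (named tp, fp, fn, tn so as not to overwrite the symbolic A, B, C, D above) and encodes a hard classifier's decision as a score of 1 for “positive” and 0 for “negative”.

```r
# Hypothetical counts for a hard classifier.
tp <- 3; fp <- 1; fn <- 2; tn <- 4

# Hard classifier scores: 1 for "positive", 0 for "negative".
score_true  <- c(rep(1, tp), rep(0, fn))  # scores of in-class instances
score_false <- c(rep(1, fp), rep(0, tn))  # scores of not-in-class instances

# Compare every (true, false) pair: 1 point for correct order, 1/2 point for a tie.
diffs <- outer(score_true, score_false, "-")
pair_score <- mean((diffs > 0) + 0.5*(diffs == 0))

balanced_accuracy <- (tp/(tp + fn) + tn/(fp + tn))/2
print(c(pair_score = pair_score, balanced_accuracy = balanced_accuracy))
```

Here the pairs split into tp*tn correct orderings, tp*fp + fn*tn ties, and fn*fp wrong orderings, which is exactly the tally in the table above.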
AUC
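We can also confirm Prob[score(true)>score(false)] (with half point on ties) == AUC. We can compute the AUC (the area under the drawn curve) of the above confusion matrix by referring to the following diagram, in which the ROC curve of a hard classifier runs from (0,0) through the single point (FPR,TPR) to (1,1), splitting the area below it into two triangles and a rectangle.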
Then we can check for general equality:
AUC = (1/2)*FPR*TPR + (1/2)*(1-FPR)*(1-TPR) + (1-FPR)*TPR
sympy(paste("simplify(",ScoreTrueGTFalse-AUC,")"))
## [1] "0"
This AUC score (with half point credit on ties) equivalence holds in general (see also More on ROC/AUC, though I got this wrong the first time).
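To illustrate the general case on one example, the sketch below takes made-up scores from a probability-style classifier (including ties), computes the pairwise statistic directly, and compares it to the trapezoid-rule area under the ROC curve. The data and variable names are assumptions for illustration only.

```r
# Hypothetical scores from a probability-style classifier, with some ties.
y     <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
score <- c(0.9,  0.8,  0.8,   0.6,  0.4,   0.4,   0.3,  0.1)

# Pairwise statistic: 1 point if the true instance outscores the false one, 1/2 on ties.
diffs <- outer(score[y], score[!y], "-")
pair_score <- mean((diffs > 0) + 0.5*(diffs == 0))

# ROC curve: start at (0,0), then one point per distinct threshold,
# from the strictest threshold to the loosest.
thresholds <- sort(unique(score), decreasing = TRUE)
tpr <- c(0, sapply(thresholds, function(th) mean(score[y]  >= th)))
fpr <- c(0, sapply(thresholds, function(th) mean(score[!y] >= th)))

# Area under the piecewise-linear ROC curve (trapezoid rule).
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1))/2)
print(c(pair_score = pair_score, auc = auc))
```

Tied scores produce diagonal segments in the ROC curve, and the trapezoid under each diagonal segment is what gives tied pairs their half point.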
F1
We can show F1 is different from Balanced Accuracy by plotting results on which they differ:
# Wikipedia https://en.wikipedia.org/wiki/F1_score
F1 = 2*Precision*Recall/(Precision+Recall)
F1 = sympy(paste("simplify(",F1,")"))
print(F1)
## [1] "2*TruePositives/(FalseNegatives + FalsePositives + 2*TruePositives)"
print(BalancedAccuracy)
## [1] "TrueNegatives/(2*(FalsePositives + TrueNegatives)) + TruePositives/(2*(FalseNegatives + TruePositives))"
# Show F1 and BalancedAccuracy do not always vary together (even for hard classifiers)
F1formula = parse(text=F1)
BAformula = parse(text=BalancedAccuracy)
frm = c()
# Enumerate every confusion matrix with 1 to 5 true and 1 to 5 false examples
for(TotTrue in 1:5) {
  for(TotFalse in 1:5) {
    for(TruePositives in 0:TotTrue) {
      for(TrueNegatives in 0:TotFalse) {
        FalsePositives = TotFalse-TrueNegatives
        FalseNegatives = TotTrue-TruePositives
        F1a <- sub(F1formula,
                   TruePositives=TruePositives,FalsePositives=FalsePositives,
                   FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives)
        BAa <- sub(BAformula,
                   TruePositives=TruePositives,FalsePositives=FalsePositives,
                   FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives)
        if((F1a<=0)&&(BAa>0.5)) {
          stop()
        }
        fi = data.frame(
          TotTrue=TotTrue,
          TotFalse=TotFalse,
          TruePositives=TruePositives,FalsePositives=FalsePositives,
          FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives,
          F1=F1a,BalancedAccuracy=BAa,
          stringsAsFactors = FALSE)
        frm = rbind(frm,fi) # bad n^2 accumulation
      }
    }
  }
}
ggplot(data=frm,aes(x=F1,y=BalancedAccuracy)) +
  geom_point() +
  ggtitle("F1 versus BalancedAccuracy/AUC")
F1 versus BalancedAccuracy/AUC
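One concrete pair of confusion matrices on which the two scores disagree can be checked directly. The sketch below uses the simplified F1 and BalancedAccuracy formulas printed above; the helper names f1 and balanced_accuracy and the two matrices m1 and m2 are made up for illustration.

```r
# Hypothetical helpers implementing the simplified formulas printed above.
# tn is unused in f1 but kept so both helpers take the same arguments.
f1 <- function(tp, fp, fn, tn) 2*tp/(fn + fp + 2*tp)
balanced_accuracy <- function(tp, fp, fn, tn) (tp/(tp + fn) + tn/(fp + tn))/2

# Two made-up confusion matrices.
m1 <- list(tp = 5, fp = 5, fn = 0, tn = 0)  # declares everything positive
m2 <- list(tp = 1, fp = 2, fn = 0, tn = 3)

scores <- data.frame(
  matrix = c("m1", "m2"),
  F1 = c(do.call(f1, m1), do.call(f1, m2)),
  BalancedAccuracy = c(do.call(balanced_accuracy, m1),
                       do.call(balanced_accuracy, m2)))
print(scores)
```

In this example m1 has the higher F1 while m2 has the higher balanced accuracy, so neither score can be a monotone function of the other.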
Baroque measures
Over the years, various sciences have introduced more than 20 measures of “scoring correspondence” by playing games with publication priority, symmetry, and incorporating significance (“chance adjustments”) directly into the measure.
Each measure presumably exists because it avoids flaws of all of the others. However the sheer number of them (in my opinion) triggers what I call “De Morgan’s objection”:
If I had before me a fly and an elephant, having never seen more than one such magnitude of either kind; and if the fly were to endeavor to persuade me that he was larger than the elephant, I might by possibility be placed in a difficulty. The apparently little creature might use such arguments about the effect of distance, and might appeal to such laws of sight and hearing as I, if unlearned in those things, might be unable wholly to reject. But if there were a thousand flies, all buzzing, to appearance, about the great creature; and, to a fly, declaring, each one for himself, that he was bigger than the quadruped; and all giving different and frequently contradictory reasons; and each one despising and opposing the reasons of the others—I should feel quite at my ease. I should certainly say, My little friends, the case of each one of you is destroyed by the rest.
(Augustus De Morgan, “A Budget of Paradoxes” 1872)
There is actually an excellent literature stream investigating which of these measures are roughly equivalent (say arbitrary monotone functions of each other) and which are different (leave aside which are even useful).
Two excellent guides to this rat hole include:

Ackerman, M., & Ben-David, S. (2008). “Measures of clustering quality: A working set of axioms for clustering.” Advances in Neural Information Processing Systems: Proceedings of the 2008 Conference.

Warrens, M. (2008). “On similarity coefficients for 2×2 tables and correction for chance.” Psychometrika, 73(3), 487–502.
The point is: you not only can get a publication trying to sort out this mess, you can actually do truly interesting work trying to relate these measures.
Further directions
One can take finding relations and invariants much further as in “Lectures on Algebraic Statistics” Mathias Drton, Bernd Sturmfels, Seth Sullivant, 2008.
Conclusion
It is a bit much to hope to only need to know “one best measure” or to claim to be familiar (let alone expert) in all plausible measures. Instead, find a few common evaluation measures that work well and stick with them.
Thanks for the great article,
which evaluation metrics would you suggest for a highly imbalanced dataset (fraud detection) where, additionally, a different cost is associated with each class?
What about kappa or a cost based approach similar to http://albahnsen.com/CostSensitiveClassification/ or would you suggest something different?
Thanks for the question Georg,
What you are asking is really cutting to the central point. My response is long because it is an important point.
Cohen’s kappa (the measure I assume you are referring to) is actually one of the many “inter-rater agreement” measures in the references I included (there is also a multi-rater version called Fleiss’ kappa).
These measures were largely designed for measuring how well two raters agree with each other assuming neither is the ground truth. I have seen them mostly in situations where we assign something like categorization of products many times to many external scorers (like Mechanical Turk) and we are trying to determine if two or more scorers are behaving in a consistent manner. Cohen’s kappa is essentially something as simple as correlation or accuracy adjusted for the rates at which the scorers mark things positive. Changing the details of this changes the name of the measure (for instance, Wikipedia tells us Scott’s pi differs from Cohen’s kappa in how the chance rate is estimated). My point is these variations change the name of the score much faster than they change the actual utility of the score.
What Cohen’s kappa does for an unbalanced class (predicting a rare event) is: estimate something like overall correlation or accuracy (which is easy to get a high score on for such a class: just say the event never happens!) and then try to adjust the score for the fact that high accuracies are easy to achieve when we have unbalanced classes. This is likely most useful when scoring the same tagger across multiple data sets.
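As a rough illustration of that adjustment, here is a sketch that computes accuracy and Cohen’s kappa for two made-up classifiers on a rare event. The helper names cohen_kappa and accuracy and all of the counts are assumptions for illustration; the kappa formula used is the usual (observed agreement - chance agreement)/(1 - chance agreement).

```r
# Hypothetical helper: Cohen's kappa from the four confusion matrix counts.
cohen_kappa <- function(tp, fp, fn, tn) {
  n  <- tp + fp + fn + tn
  po <- (tp + tn)/n                       # observed agreement (accuracy)
  pe <- ((tp + fp)/n) * ((tp + fn)/n) +   # chance agreement on "positive"
        ((fn + tn)/n) * ((fp + tn)/n)     # chance agreement on "negative"
  (po - pe)/(1 - pe)
}
accuracy <- function(tp, fp, fn, tn) (tp + tn)/(tp + fp + fn + tn)

# Made-up rare-event data: 10 events in 1000 instances.
never <- list(tp = 0, fp = 0,  fn = 10, tn = 990)  # "the event never happens"
tries <- list(tp = 8, fp = 20, fn = 2,  tn = 970)  # finds most events, some false alarms

result <- data.frame(
  classifier = c("never", "tries"),
  accuracy = c(do.call(accuracy, never), do.call(accuracy, tries)),
  kappa = c(do.call(cohen_kappa, never), do.call(cohen_kappa, tries)))
print(result)
```

In this example the “never” classifier has the higher accuracy but a kappa of zero, while the classifier that actually finds events gets the higher kappa. Note that nothing in the calculation knows what a missed event costs.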
In my opinion it is a fallacy to insist general classifier utility evaluations can be as simple as a single number or a total order prior to introducing additional problem domain details. How rare the target class is doesn’t actually tell you the relative cost of false positives and false negatives (which is necessary additional domain knowledge). So adjustments based only on sample size or outcome distribution can never be sufficient. To be sure, my intuition that there is no total order comes from the more detailed world of scoring classifiers or probability classifiers, which induce a much more detailed ROC plot than simple hard or decision-based classifiers. But most current machine learning implementations tend to return such scores (neural nets, logistic regression, decision trees, random forests, gradient boosting, and even support vector machines).
I would suggest at least reporting both precision and recall, or both sensitivity and specificity. When I ran a research / data science group at Shopping.com (now an eBay company) we usually reported precision and recall for each and every category in our catalog (around 200 primary categories). So we definitely felt different errors should have very different prices.
Or (as a variation of what you mentioned): with your business partner put a price on counts on each cell of the confusion matrix and then pick the classifier that induces a confusion matrix maximizing price. This will be an interactive process as once the customer sees the consequences they will have feedback that may help them revise their price estimates. A similar thing can be done tracing along the ROC curve where (after putting population statistics back in) you can say for each point on the curve what confusion matrix would be derived and exactly what your inspection outcome would look like at each point.
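A minimal sketch of that pricing exercise, assuming made-up scores, outcomes, and per-cell prices (all hypothetical, as is the helper name value_at): for each candidate threshold we build the induced confusion matrix, price it, and keep the threshold with the best total.

```r
# Hypothetical scored examples (TRUE = in the class, e.g. fraud).
y     <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
score <- c(0.95, 0.80,  0.70, 0.55,  0.40,  0.35, 0.20,  0.05)

# Hypothetical prices per confusion matrix cell, agreed with the business partner.
price <- c(TruePositive = 100, FalsePositive = -10,
           FalseNegative = -200, TrueNegative = 0)

# Value of the confusion matrix induced by declaring "positive" when score >= threshold.
value_at <- function(threshold) {
  pred <- score >= threshold
  sum(pred & y)   * price[["TruePositive"]] +
  sum(pred & !y)  * price[["FalsePositive"]] +
  sum(!pred & y)  * price[["FalseNegative"]] +
  sum(!pred & !y) * price[["TrueNegative"]]
}

# Candidate thresholds: each observed score, plus Inf for "never declare positive".
thresholds <- c(sort(unique(score)), Inf)
values <- sapply(thresholds, value_at)
best <- thresholds[which.max(values)]
print(data.frame(threshold = thresholds, value = values))
```

With these prices a missed event is expensive, so the best threshold is a loose one that accepts several false positives. Changing the prices moves the chosen point along the ROC curve without retraining the classifier.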
So yes, I am in favor of cost-sensitive classification; I just tend to make the cost adjustments after building the classifier by working along the ROC curve. Methods such as rebalancing classes or stratified sampling can be critical in making standard methods run quickly or converge on very rare classes, but they are not always capable of completely reshaping classifier performance to match business goals (some notes here: http://www.win-vector.com/blog/2015/02/does-balancing-classes-improve-classifier-performance/ ).
John