vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
Very roughly: vtreat accepts an arbitrary "from the wild" data frame (with different column types, NAs, NaNs, and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames (all independent variables numeric, free of NAs, NaNs, infinities, and so on) ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations, such as random forest, and also bring a danger of statistical over-fitting) and leaves the analyst more time to incorporate domain-specific data preparation (as vtreat tries to handle as much of the common stuff as practical). For more of an overall description please see here.
We suggest all users update (and you will want to re-run any "design" steps instead of mixing "design" and "prepare" from two different versions of vtreat).
For what is new in version 0.5.27 please read on.
vtreat 0.5.27 is a maintenance release. User-visible improvements include:

- Changed the default of catScaling to FALSE. We still think working in logistic link-space is a great idea for classification problems; we are just not fully satisfied that un-regularized logistic regressions are the best way to get there (largely due to issues of separation and quasi-separation). In the meantime we think working in an expectation space is the safer (and now default) alternative.
- Use stats::chisq.test() instead of insisting on stats::fisher.test() for large counts. This calculation is used for level pruning and is only relevant if rareSig < 1 (the default is 1). We caution that setting rareSig < 1 remains a fairly expensive setting.
- We are trying to make significance estimation much more transparent; for example, we now return how many extra degrees of freedom are hidden by categorical variable re-encodings in a new score frame column called extraModelDegrees (found in designTreatments*()$scoreFrame).

The idea is that having data preparation as a re-usable library lets us research, document, optimize, and fine-tune a lot more details than would make sense on any one analysis project. The main design difference from other data preparation packages is that we emphasize "y-aware" (or outcome-aware) processing (using the training outcome to generate useful re-encodings of the data).
We have pre-rendered a lot of the package documentation, examples, and tutorials here.
My criticism of R's numeric summary() method is: it is unfaithful to numeric arguments (due to bad default behavior) and frankly it should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of trade-offs. summary() likely represents good work by high-ability researchers, and the sharp edges are due to historically necessary trade-offs.
The Big Lebowski, 1998.
Please read on for some context and my criticism.
Edit 8/25/2016: Martin Maechler generously committed a fix! Assuming this works out in testing it looks like we could see an improvement on this core function in April 2017. I really want to say “thank you” to Martin Maechler and the rest of the team for not only this, for all the things they do, and for putting up with me.
My group has been doing a lot more professional training lately. This is interesting because bright students really put a lot of interesting demands on how you organize and communicate. They want things that make sense (so they can learn them), that are powerful (so it is worth learning them), and that are regular (so they can compose them and move beyond what you are teaching). Students are less sympathetic to implementation history and unstated conventions, as new users tend not to benefit from them. Remember: a new R student is still deciding if they want to use R; to them it is new, so an instructor needs to defend R's current trade-offs (not its evolutionary path). We find it is best to point out both what is great in R and what isn't great (versus skipping such, or worse, trying to justify such portions).
Please keep this in mind when I demonstrate what goes wrong when one attempts to teach R's summary() function to the laity.
Suppose you had a list or vector of numbers in R. It would be useful to be able to produce and view some summaries or statistics about these numbers. The primary way to do this in R is to call the summary() method. Here is an example below:
numbers <- 1:7
print(numbers)
## [1] 1 2 3 4 5 6 7
summary(numbers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 2.5 4.0 4.0 5.5 7.0
From the names attached to the results you can get the meanings and move on. But the whole time you are hoping none of your students call summary() on a single number. Because if they do, they have a very good chance of seeing summary() fail. And now you have broken trust in R.
Let’s tack into the wind and demonstrate the failure:
summary(15555)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15560 15560 15560 15560 15560 15560
summary() is claiming the minimum value of the set of numbers c(15555) is 15560. Now this is a deliberately trivial example where we can see what is going on (it sure looks like presentation rounding). To make matters worse, this isn't just confusion generated during presentation: the actual values are wrong.
str(summary(15555))
## Classes 'summaryDefault', 'table' Named num [1:6] 15560 15560 15560 15560 15560 ...
## ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
summary(15555)[['Min.']] == min(15555)
## [1] FALSE
It may seem silly to expect the slots from a summary() call on a vector to be used in calculation (when we have direct functions such as quantile() and mean() for getting the same results), but using values from summaries of models is standard practice in R. The trivial linear model summary summary(lm(y~0,data.frame(y=15555))) shows rounded results (though it appears to hold accurate results and only round during presentation; use unclass() to inspect the actual values).
This is in fact a problem. You can say this is a consequence of the "default settings of summary()" and it is my fault for not changing those settings. But frankly it is quite fair to expect the default settings to be safe and sane.
Let us also appeal to authority:
The many computational steps between original data source and displayed results must all be truthful, or the effect of the analysis may be worthless, if not pernicious. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive.
John Chambers, Software for Data Analysis: Programming with R, Springer 2008.
The point is you are delegating work to your system. If it needlessly fails (no matter how trivially) when observed, how can you trust it when unobserved? John Chambers’ point is that trust is very expensive to build up, so you really don’t want to squander it.
I used to try to "lecture this away" as just being "rounding in the presentation for neatness." But this runs into two objections:

- If it were merely rounding for neat presentation, why not display in scientific notation such as 1.556e+4?
- If the result of summary() "is just presentation," wouldn't it be a string?

We are losing substitutability. We would love to be able to say to students that "summary() is a convenient shorthand and you can treat the following as equivalent":
summary(x)[['Min.']] == min(x)
summary(x)[['1st Qu.']] == quantile(x,0.25)
summary(x)[['Median']] == median(x)
summary(x)[['Mean']] == mean(x)
summary(x)[['3rd Qu.']] == quantile(x,0.75)
summary(x)[['Max.']] == max(x)
But the above isn't always the case. What we would like is for summary() to contain these values and get pretty printing by using the S3 or S4 object system to override the print() method. It is quite likely summary() predates these object systems, and so achieved pretty printing through rounding of values.
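The design just described can be sketched with a hypothetical toy class (the name neat_summary is made up for illustration): keep the exact values in the object, and round only inside the print() method.

```r
# Hypothetical sketch: exact values stored in the object,
# rounding confined to the print() method.
neat_summary <- function(x) {
  v <- c(Min. = min(x), Mean = mean(x), Max. = max(x))
  class(v) <- "neat_summary"
  v
}
print.neat_summary <- function(x, digits = 4, ...) {
  print(signif(unclass(x), digits))  # rounding happens only at display time
  invisible(x)
}
s <- neat_summary(15555)
s[["Min."]] == min(15555)  # TRUE: the stored slots remain exact
```

With this design the pretty display and the faithful values no longer conflict.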
We can take a look at the actual code and see what is happening. We are looking for a reason, not an excuse.
From help(summary) we see summary takes a digits option with default value digits = max(3, getOption("digits")-3) (let's not even get into why setting digits directly does one thing and the system default is shifted by 3). getOption("digits") returns 7 on my machine, so we see we are asking for four-digit rounding, which is consistent with what we saw. Digging through the dispatch rules we can eventually determine that for a numeric vector summary() eventually calls summary.default(). By calling print(summary.default) we can look at the code. The offending snippet is:
qq <- stats::quantile(object)
qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
After computing the quantiles, summary() then calls signif() to round the results. R isn't inaccurate; it just went out of its way to round the results.
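The offending step can be reproduced in isolation; a minimal sketch, assuming getOption("digits") is 7 as in the text:

```r
# signif() with summary()'s default digits = max(3, getOption("digits") - 3)
# (which is 4 under the assumed option value) reproduces the bad value.
digits <- max(3, 7 - 3)   # assuming getOption("digits") returns 7
signif(15555, digits)     # 15560, not 15555
```

This is exactly the rounding that then gets stored in the returned object rather than being confined to display.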
One reason this article is long is the behavior we are describing breaks expectations. So we end up having to document what is actually going on (a laborious process) instead of being able to rely on shared educated expectations. The whining is where actualities and expectations diverge.
summary() attempts to achieve neatness and legibility. This is a laudable goal, if achievable. Numeric analysis is not so simple that rounding could safely achieve such a goal.
It is well known that rounding is not a safe or faithful operation (it loses information, and can be catastrophic if naively applied in many stages of a complex calculation). Because it is obvious rounding is dangerous, sophisticated students are surprised that it defaults to "on" in common calculations without indication or warning (such as moving to scientific notation). summary() compounds this error by returning rounded values (instead of rounding only at print/presentation). As summary() is often a first view of data (along with print()) we encounter confusing, inconsistent situations where un-rounded values (presentation of original data) and rounded values are compared.
Of course, we can (and should) teach students to call mean(x) and quantile(x) rather than summary(x) when they want to reuse the summary statistics. But then we have to explain why. After seeing something like this it becomes an unfortunate additional teaching goal to convince students that more of R doesn't behave like summary().
For your convenience here they are in order:
Please check it out, and please do Tweet/share these tutorials.
parallel (please see here for example). This is, in our opinion, a necessary step before getting into clever notation and wrapping such as doParallel and foreach. Only then do the students have a sufficiently explicit interface to frame important questions about the semantics of parallel computing. Beginners really need a solid mental model of what services are being provided by their tools, and to test edge cases early.
One question that comes up over and over again is: "can you nest parLapply?"

The answer is "no." This is in fact an advanced topic, but it is one of the things that pops up when you start worrying about parallel programming. Please read on for why that is the right answer and how to work around it (simulate a "yes").
I don't think the above question is usually given sufficient consideration (nesting parallel operations can in fact make a lot of sense). You can't directly nest parLapply, but that is a different issue than whether one can invent a work-around. For example: a "yes" answer (really meaning there are work-arounds) can be found here. Again, this is a different question than "is there a way to nest foreach loops" (which is possible through the nesting operator %:%, which presumably handles working around the nesting issues in parLapply).
Let’s set up a concrete example, so we can ask and answer a precise question. Suppose we have a list of jobs (coming from an external source) that we will simulate with the code fragment below.
jobs <- list(1:3,5:10,100:200)
Notice the jobs have wildly diverging sizes; this is an important consideration.

Suppose the task we want to perform is to sum the square roots of the entries. The standard (non-parallel) calculation would look like the following.
worker1 <- function(x) {
sum(sqrt(x))
}
lapply(jobs,worker1)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021
For didactic purposes please pretend that the sum function is very expensive and the sqrt function is somewhat expensive.

If it were obvious we always had a great number of small sub-lists, we would want to use parallelization to make sure we are performing many sums at the same time. We would then parallelize over the first level as below.
clus <- parallel::makeCluster(4)
parallel::parLapplyLB(clus,jobs,worker1)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021
Notice that parallel::parLapplyLB uses almost the same calling convention as lapply and returns the exact same answer.

If it were obvious we had a single large sub-list, we would want to make sure we were always parallelizing the sqrt operations, so we would prefer to parallelize as follows:
mkWorker2 <- function(clus) {
force(clus)
function(x) {
xs <- parallel::parLapplyLB(clus,x,sqrt)
sum(as.numeric(xs))
}
}
worker2 <- mkWorker2(clus)
lapply(jobs,worker2)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021
(For the details of building functions and passing values to remote workers please see here.)
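As an aside on the force(clus) line in mkWorker2: R evaluates function arguments lazily, and force() simply evaluates the argument right away so the returned closure reliably captures its value. A minimal illustration (make_adder is a made-up name):

```r
# force() evaluates a lazily-passed argument now, so the closure we
# return is self-contained and safe to call later.
make_adder <- function(n) {
  force(n)              # capture n's value at construction time
  function(x) x + n
}
add2 <- make_adder(2)
add2(3)  # 5
```

The same pattern is what lets the workers in this article safely carry a reference to the cluster object.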
If we were not sure what structure we would encounter in the future, we would prefer to schedule all operations for possible parallel execution. This would minimize the number of idle resources and minimize the time to finish the jobs. Ideally that would look like the following (a nested use of parallel):
parallel::parLapplyLB(clus,jobs,worker2)
## Error in checkForRemoteErrors(val): 3 nodes produced errors; first error: invalid connection
Notice the above fails with an error. Wishing for flexible code is what beginners intuitively mean when they ask if you can nest parallel calls. They may not be able to explain it, but they are worried they don't have a good characterization of the work they are trying to parallelize over. They are not asking if things get magically faster by "parallelizing parallel."
It isn't too hard to find out the nature of the error: the communication connection socket file descriptors (con) are passed as integers to each machine, but they are not valid descriptors where they arrive (they are just integers). We can see this by looking at the structure of the cluster:
str(clus)
## List of 4
## $ :List of 3
## ..$ con :Classes 'sockconn', 'connection' atomic [1:1] 5
## .. .. ..- attr(*, "conn_id")=<externalptr>
## ..$ host: chr "localhost"
## ..$ rank: int 1
## ..- attr(*, "class")= chr "SOCKnode"
## $ :List of 3
## ..$ con :Classes 'sockconn', 'connection' atomic [1:1] 6
## .. .. ..- attr(*, "conn_id")=<externalptr>
## ..$ host: chr "localhost"
## ..$ rank: int 2
## ..- attr(*, "class")= chr "SOCKnode"
## $ :List of 3
## ..$ con :Classes 'sockconn', 'connection' atomic [1:1] 7
## .. .. ..- attr(*, "conn_id")=<externalptr>
## ..$ host: chr "localhost"
## ..$ rank: int 3
## ..- attr(*, "class")= chr "SOCKnode"
## $ :List of 3
## ..$ con :Classes 'sockconn', 'connection' atomic [1:1] 8
## .. .. ..- attr(*, "conn_id")=<externalptr>
## ..$ host: chr "localhost"
## ..$ rank: int 4
## ..- attr(*, "class")= chr "SOCKnode"
## - attr(*, "class")= chr [1:2] "SOCKcluster" "cluster"
mkWorker3 <- function(clus) {
force(clus)
function(x) {
as.character(clus)
}
}
worker3 <- mkWorker3(clus)
parallel::parLapplyLB(clus,jobs,worker3)
## [[1]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"
##
## [[2]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"
##
## [[3]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"
What we are getting wrong is: we can't share control of the cluster with each worker just by passing the cluster object around. This would require some central registry and call-back scheme (which is one of the things packages like foreach and doParallel accomplish when they "register a parallel back-end to use"). Base parallel depends more on explicit reference to the cluster data structure, so it isn't "idiomatic parLapply" to assume we can find "the parallel cluster" (there could in fact be more than one at the same time).
So what is the work around?
One work-around is to move to sophisticated wrappers (like doParallel or even future; also see here).

These fixes roughly split the calculation into two phases: one dedicated to the sqrt step and the second dedicated to the sum step (remember we are pretending both of these operations are expensive). We can directly demonstrate such a reorganization as follows.
library('magrittr')
mkWorker4a <- function(clus) {
force(clus)
function(x) {
as.numeric(parallel::parLapplyLB(clus,x,sqrt))
}
}
worker4a <- mkWorker4a(clus)
worker4b <- function(x) {
sum(x)
}
jobs %>%
lapply(X=.,FUN=worker4a) %>%
parallel::parLapplyLB(cl=clus,X=.,fun=worker4b)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021
The above depends on not too many of the sub-lists being short (and hiding opportunities for parallelism).
Another fix is (at the cost of time, effort, and space) to re-organize the calculation into two sequenced phases, each of which is parallel, but not nested. It is a bit involved, but we show how to do that below (using R's Reduce and split functions to reorganize the data, though one could also use so-called "tidyverse" methods). This fix is more general, but introduces reorganization overhead.
# Preparation 1: collect all items into one flat list
sqrtjobs <- as.list(Reduce(c,jobs))
# Phase 1: sqrt every item in parallel
sqrts <- parallel::parLapplyLB(clus,sqrtjobs,sqrt)
# Preparation 2: re-assemble new job list that needs only sums
lengths <- vapply(jobs,length,numeric(1))
pattern <- lapply(seq_len(length(lengths)),
function(i) {rep(i,lengths[[i]])})
pattern <- Reduce(c,pattern)
sumjobs <- split(sqrts,pattern)
sumjobs <- lapply(sumjobs,as.numeric)
names(sumjobs) <- names(jobs)
# Phase 2: sum all items in parallel
parallel::parLapplyLB(clus,sumjobs,sum)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021
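Under the assumption that lapply() can stand in for the parallel calls, the same flatten/split reorganization can be written more compactly with base R's lengths() and rep():

```r
# Compact sketch of the two-phase reorganization (lapply() stands in
# for parallel::parLapplyLB to keep the example self-contained).
jobs  <- list(1:3, 5:10, 100:200)
flat  <- unlist(jobs)                         # Preparation 1: one flat vector
sqrts <- lapply(flat, sqrt)                   # Phase 1: parallelizable map
idx   <- rep(seq_along(jobs), lengths(jobs))  # Preparation 2: group labels
sums  <- lapply(split(as.numeric(sqrts), idx), sum)  # Phase 2: parallelizable
sums[[1]]  # 4.146264
```

The indexing trick (rep() of the job number, lengths() of each job) replaces the hand-rolled pattern-building loop above.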
In conclusion: you can't directly nest parLapply, but you can usefully sequence through it.
parallel::stopCluster(clus)
In this article we will consider the latest entry of our mad "programming theory in R" series (see Some programming language theory in R, You don't need to understand pointers to program using R, Using closures as objects in R, and How and why to return functions in R): category theory!
In practice the programming language Haskell improved greatly when non-monadic I/O libraries were replaced by better monad inspired I/O libraries. But this also created the unfortunate false impression you had to understand monads to use Haskell (when in fact you only have to understand them to implement Haskell).
The fun side of Monads is flexibility in using them and occasionally saying (either formally or informally) “hey, x turns out to be a monad!”
This can vary from meaning:
Saying “x is a monad” is like singing in the shower, it is always more fun to say than to hear. I know I am certainly guilty of this in writing this article.
It turns out the magrittr pipe package in R obeys the monad axioms. So if you are a data scientist who has tried magrittr, you have already benefited from monadic design.

Obviously I am nowhere near the first to notice this, but it is something I wish to comment on here. It doesn't matter if this is core intent or a side-effect of good design, but it does give yet another reason to trust the package.

Let's load the magrittr package and spot-check the monad laws.
library('magrittr')
# Identity function,
ret <- function(x) { x }
# Note the example functions included here are not fully "standard non-standard eval"
# production hardened.
First we check that magrittr's %>% operator obeys the monad laws when using %>% as "bind" and ret as "return". For simplicity we would like to think of magrittr as a category over single-argument functions (though obviously magrittr works over more values than these, and a big part of the magrittr service is Currying code fragments into single-argument functions).
a <- 1:5 # our values
m <- 1:5 # more values
f <- sin # our first function
g <- cos # our other function
Check ret(a) %>% f == f(a):
ret(a) %>% f
## [1] 0.8414710 0.9092974 0.1411200 -0.7568025 -0.9589243
f(a)
## [1] 0.8414710 0.9092974 0.1411200 -0.7568025 -0.9589243
Check m %>% ret == m:
m %>% ret
## [1] 1 2 3 4 5
m
## [1] 1 2 3 4 5
Check (m %>% f) %>% g == m %>% (function(x) {f(x) %>% g}):
(m %>% f) %>% g
## [1] 0.6663667 0.6143003 0.9900591 0.7270351 0.5744009
m %>% (function(x) {f(x) %>% g})
## [1] 0.6663667 0.6143003 0.9900591 0.7270351 0.5744009
Let's go through those rather abstract axioms again and specialize them using our knowledge that ret is the identity.

- Axiom 1': a %>% f == f(a). Says: "piping a value into a function is the same as applying the function to the value." In general the original axiom says ret has to be faithful in the sense that %>% can recover enough information to compute f(a).
- Axiom 2': m %>% ret == m. Now implied by Axiom 1'. But in the cases where ret is not the identity this would tell us that %>% is faithful in the sense that it retains enough information about m that ret can re-build m.
- Axiom 3': (m %>% f) %>% g == m %>% (function(x) {f(x) %>% g}). There is a notational convention hidden in this statement: we assume m %>% f %>% g is to be read as (m %>% f) %>% g. The axiom plus the convention tell us we can consider piping as moving the value m through f and then through g. This is where we are really checking %>% is behaving like a pipe.

As a user we would like to be able to write "a %>% (f %>% g)" or "h <- f %>% g; a %>% h". That is: we would like to be able to save complex magrittr pipe sequences for re-use. We would like to have right-associativity (reified composition of operators) in addition to the left association we expect from magrittr being described as a "forward-pipe operator" (from magrittr's description).
It turns out that isn't one of the monad axioms, and we can't immediately do that:
h <- f %>% g
## Error in g(.): non-numeric argument to mathematical function
We can fix this one of two ways: by introducing a function or using a built-in “dot notation” (thank you to Professor Jenny Bryan for pointing this out to me.)
h <- function(x) { x %>% f %>% g }
a %>% h
## [1] 0.6663667 0.6143003 0.9900591 0.7270351 0.5744009
h <- . %>% f %>% g
a %>% h
## [1] 0.6663667 0.6143003 0.9900591 0.7270351 0.5744009
It turns out category theorists anticipated this problem and fix. The function-wrapping trick is essentially building the Kleisli category derived from our monad (see "Monads Made Difficult").

In principle we could implement a Kleisli arrow operator %>=>% in addition to the magrittr bind operator (%>%), which would allow code like the following (all four statements below would be equivalent):
`%>=>%` <- function(f,g) { function(x) {g(f(x))} }
a %>% (sin %>=>% cos %>=>% abs)
## [1] 0.6663667 0.6143003 0.9900591 0.7270351 0.5744009
a %>% ((sin %>=>% cos) %>=>% abs)
## [1] 0.6663667 0.6143003 0.9900591 0.7270351 0.5744009
a %>% (sin %>=>% (cos %>=>% abs))
## [1] 0.6663667 0.6143003 0.9900591 0.7270351 0.5744009
abs(cos(sin(a)))
## [1] 0.6663667 0.6143003 0.9900591 0.7270351 0.5744009
Note that the above %>=>% operator is not in fact the desired general Kleisli operator, as we haven't implemented the critical Currying services that the magrittr %>% operator supplies (these services would be what category theorists call the "endofunctor", which would map R functions to specialized savable R functions; definitions such as function(f,g) { . %>% f %>% g } won't work either, as we would need to capture unevaluated arguments inside the %>=>% function and cannot delegate that to interior %>% calls).
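For contrast, ordinary right-composable function composition (without magrittr's argument-capturing services) is easy to write with base R's Reduce(); compose_all below is a made-up helper name, a sketch rather than a replacement for the real Kleisli operator:

```r
# Compose any number of functions left-to-right into one reusable function.
compose_all <- function(...) {
  fs <- list(...)                                    # functions, in pipe order
  function(x) Reduce(function(acc, f) f(acc), fs, x) # thread x through each f
}
h <- compose_all(sin, cos, abs)    # a saved pipeline, built before any data
all(h(1:5) == abs(cos(sin(1:5))))  # TRUE
```

This gives the "save the pipeline for re-use" property, at the cost of accepting only plain single-argument functions.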
In the Kleisli category we would no longer need special monad axioms, they are replaced by the more common associativity and category axioms in the Kleisli category. One still has to prove you have a category, but it is a more standard task.
You can say standard imperative styles of analysis, which see operations as sequenced transient steps that mutate data, are "strict left-to-right associative" ways of thinking (the "((a %>% sin) %>% cos) %>% abs" form). Databases and standard data analyses in R are usually so organized.
The "general associativity" way of thinking (the "h <- sin %>=>% cos %>=>% abs; a %>% h" form) emphasizes the processing pipeline as a reusable entity and data as transient quantities that flow through the pipeline. Systems like Weka, LingPipe, and graphical data science tools such as Alpine Data workflow notebooks essentially represent processes in this manner.
The guarantee of full associativity is just a mathematical way of saying: you can mix these styles (data oriented or operator oriented) and you are guaranteed the same result in either case.
And there you have it, more category theory than a data scientist should need to worry about.
To make things easier here are links to the original three articles, which work through scores, significance, and include a glossary.
A lot of what Nina is presenting can be summed up in the diagram below (also by her). If in the diagram the first row is truth (say red disks are infected) which classifier is the better initial screen for infection? Should you prefer the model 1 80% accurate row or the model 2 70% accurate row? This example helps break dependence on “accuracy as the only true measure” and promote discussion of additional measures.
My concrete advice is:
That being said it always seems like there is a bit of gamesmanship in that somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Some choice of measure is signaling your pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background) and hoping to befuddle others.
The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.
Let's define our terms. We are going to work with "binary classification" problems. These are problems where we have example instances (also called rows) that are either "in the class" (we will call these instances "true") or not (we will call these instances "false"). A classifier is a function that, given the description of an instance, tries to determine if the instance is in the class or not. The classifier may either return a decision of "positive"/"negative" (indicating whether the classifier thinks the instance is in or out of the class) or a probability score denoting the estimated probability of being in the class.
For decision based (or “hard”) classifiers (those returning only a positive/negative determination) the “confusion matrix” is a sufficient statistic in the sense it contains all of the information summarizing classifier quality. All other classification measures can be derived from it.
For a decision classifier (one that returns “positive” and “negative”, and not probabilities) the classifier’s performance is completely determined by four counts:
Notice true and false are being used to indicate if the classifier is correct (and not the actual category of each item) in these terms. This is traditional nomenclature. The first two quantities are where the classifier is correct (positive corresponding to true and negative corresponding to false) and the second two quantities count instances where the classifier is incorrect.
It is traditional to arrange these quantities into a 2 by 2 table called the confusion matrix. If we define:
library('ggplot2')
library('caret')
## Loading required package: lattice
library('rSymPy')
## Loading required package: rJython
## Loading required package: rJava
## Loading required package: rjson
A = Var('TruePositives')
B = Var('FalsePositives')
C = Var('FalseNegatives')
D = Var('TrueNegatives')
(Note all code shared here.)
Then the caret R package defines the confusion matrix as follows (see help("confusionMatrix")):
Reference
Predicted Event No Event
Event A B
No Event C D
Reference is "ground truth" or actual outcome. We will call examples that have true ground truth "true examples" (again, please don't confuse this with "TrueNegatives", which are "false examples" that are correctly scored as being false). We would prefer to have the classifier indicate columns instead of rows, but we will use the caret notation for consistency.
We can encode what we have written about these confusion matrix summaries as algebraic statements. Caret’s help("confusionMatrix")
then gives us definitions of a number of common classifier scores:
# (A+C) and (B+D) are facts about the data, independent of classifier.
Sensitivity = A/(A+C)
Specificity = D/(B+D)
Prevalence = (A+C)/(A+B+C+D)
PPV = (Sensitivity * Prevalence)/((Sensitivity*Prevalence) + ((1-Specificity)*(1-Prevalence)))
NPV = (Specificity * (1-Prevalence))/(((1-Sensitivity)*Prevalence) + ((Specificity)*(1-Prevalence)))
DetectionRate = A/(A+B+C+D)
DetectionPrevalence = (A+B)/(A+B+C+D)
BalancedAccuracy = (Sensitivity+Specificity)/2
We can (from our notes) also define some more common metrics:
TPR = A/(A+C) # True Positive Rate
FPR = B/(B+D) # False Positive Rate
FNR = C/(A+C) # False Negative Rate
TNR = D/(B+D) # True Negative Rate
Recall = A/(A+C)
Precision = A/(A+B)
Accuracy = (A+D)/(A+B+C+D)
By writing everything down it becomes obvious that Sensitivity == TPR == Recall. That won't stop somebody from complaining if you say "recall" when they prefer "sensitivity", but that is how things are.
By declaring all of these quantities as sympy variables and expressions we can now check much more. We confirm formal equality of various measures by checking that their difference algebraically simplifies to zero.
# Confirm TPR == 1 - FNR
sympy(paste("simplify(",TPR-(1-FNR),")"))
## [1] "0"
# Confirm Recall == Sensitivity
sympy(paste("simplify(",Recall-Sensitivity,")"))
## [1] "0"
# Confirm PPV == Precision
sympy(paste("simplify(",PPV-Precision,")"))
## [1] "0"
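The same identities can also be spot-checked numerically without sympy, using made-up confusion-matrix counts:

```r
# Hypothetical counts (TP, FP, FN, TN) chosen only for illustration.
A <- 40; B <- 10; C <- 5; D <- 45
Sens <- A/(A + C); Spec <- D/(B + D); Prev <- (A + C)/(A + B + C + D)
# PPV from caret's formula versus Precision computed directly:
PPV <- (Sens*Prev)/((Sens*Prev) + ((1 - Spec)*(1 - Prev)))
abs(PPV - A/(A + B)) < 1e-12   # TRUE: PPV equals Precision on these counts
```

A numeric check on one instance is of course weaker than the algebraic simplification above, but it is a quick sanity test when you don't have a computer algebra system handy.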
We can also confirm non-identity by simplifying and checking an instance:
# Confirm Precision != Specificity
expr <- sympy(paste("simplify(",Precision-Specificity,")"))
print(expr)
## [1] "(FalsePositives*TruePositives - FalsePositives*TrueNegatives)/(FalsePositives*TrueNegatives + FalsePositives*TruePositives + TrueNegatives*TruePositives + FalsePositives**2)"
sub <- function(expr,
TruePositives,FalsePositives,FalseNegatives,TrueNegatives) {
eval(expr)
}
sub(parse(text=expr),
TruePositives=0,FalsePositives=1,FalseNegatives=0,TrueNegatives=1)
## [1] -0.5
Write the probability of a true (in-class) instance scoring higher than a false (not-in-class) instance (with a 1/2 point for ties) as Prob[score(true)>score(false)] (with half point on ties). We can then confirm Prob[score(true)>score(false)] (with half point on ties) == BalancedAccuracy for hard or decision classifiers by scoring each true/false instance pair as:
A D : True Positive and True Negative: correct sorting, 1 point
A B : True Positive and False Positive (same prediction "Positive", different outcomes): 1/2 point
C D : False Negative and True Negative (same prediction "Negative", different outcomes): 1/2 point
C B : False Negative and False Positive: wrong order, 0 points
Then ScoreTrueGTFalse ==
Prob[score(true)>score(false)] (with 1/2 point for ties)` is:
ScoreTrueGTFalse = (1*A*D + 0.5*A*B + 0.5*C*D + 0*C*B)/((A+C)*(B+D))
We can confirm this is equal to balanced accuracy:
sympy(paste("simplify(",ScoreTrueGTFalse-BalancedAccuracy,")"))
## [1] "0"
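As an extra numeric spot-check (an illustrative Python sketch, not part of the original analysis), the pair-scoring probability agrees with balanced accuracy on randomly drawn confusion matrices:

```python
# Numeric spot-check: ScoreTrueGTFalse equals balanced accuracy
# (TPR/2 + TNR/2) for every confusion matrix we try.
import random

random.seed(0)
for _ in range(1000):
    A, B = random.randint(1, 50), random.randint(1, 50)  # TP, FP
    C, D = random.randint(1, 50), random.randint(1, 50)  # FN, TN
    score = (1.0*A*D + 0.5*A*B + 0.5*C*D + 0.0*C*B) / ((A + C) * (B + D))
    balanced_accuracy = (A/(A + C) + D/(B + D)) / 2
    assert abs(score - balanced_accuracy) < 1e-12
```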
We can also confirm Prob[score(true)>score(false)] (with half point on ties) == AUC. We can compute the AUC (the area under the drawn curve) of the above confusion matrix by referring to the following diagram.
Then we can check for general equality:
AUC = (1/2)*FPR*TPR + (1/2)*(1-FPR)*(1-TPR) + (1-FPR)*TPR
sympy(paste("simplify(",ScoreTrueGTFalse-AUC,")"))
## [1] "0"
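The same kind of numeric spot-check (again an illustrative Python sketch) confirms the area formula: the two triangles under and above the point (FPR, TPR) on the two-segment ROC curve, plus the rectangle between them, sum to the pair-scoring probability.

```python
# Numeric spot-check: the three-region area formula for the
# hard-classifier ROC curve matches the pair-scoring probability.
import random

random.seed(1)
for _ in range(1000):
    A, B = random.randint(1, 50), random.randint(1, 50)  # TP, FP
    C, D = random.randint(1, 50), random.randint(1, 50)  # FN, TN
    TPR, FPR = A/(A + C), B/(B + D)
    auc = 0.5*FPR*TPR + 0.5*(1 - FPR)*(1 - TPR) + (1 - FPR)*TPR
    score = (A*D + 0.5*A*B + 0.5*C*D) / ((A + C) * (B + D))
    assert abs(auc - score) < 1e-12
```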
This AUC score (with half point credit on ties) equivalence holds in general (see also More on ROC/AUC, though I got this wrong the first time).
We can show F1 is different from BalancedAccuracy by plotting results where they differ:
# Wikipedia https://en.wikipedia.org/wiki/F1_score
F1 = 2*Precision*Recall/(Precision+Recall)
F1 = sympy(paste("simplify(",F1,")"))
print(F1)
## [1] "2*TruePositives/(FalseNegatives + FalsePositives + 2*TruePositives)"
print(BalancedAccuracy)
## [1] "TrueNegatives/(2*(FalsePositives + TrueNegatives)) + TruePositives/(2*(FalseNegatives + TruePositives))"
# Show F1 and BalancedAccuracy do not always vary together (even for hard classifiers)
F1formula = parse(text=F1)
BAformula = parse(text=BalancedAccuracy)
frm = c()
for(TotTrue in 1:5) {
for(TotFalse in 1:5) {
for(TruePositives in 0:TotTrue) {
for(TrueNegatives in 0:TotFalse) {
FalsePositives = TotFalse-TrueNegatives
FalseNegatives = TotTrue-TruePositives
F1a <- sub(F1formula,
TruePositives=TruePositives,FalsePositives=FalsePositives,
FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives)
BAa <- sub(BAformula,
TruePositives=TruePositives,FalsePositives=FalsePositives,
FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives)
if((F1a<=0)&&(BAa>0.5)) {
stop()
}
fi = data.frame(
TotTrue=TotTrue,
TotFalse=TotFalse,
TruePositives=TruePositives,FalsePositives=FalsePositives,
FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives,
F1=F1a,BalancedAccuracy=BAa,
stringsAsFactors = FALSE)
frm = rbind(frm,fi) # bad n^2 accumulation
}
}
}
}
ggplot(data=frm,aes(x=F1,y=BalancedAccuracy)) +
geom_point() +
  ggtitle("F1 versus BalancedAccuracy/AUC")
F1 versus BalancedAccuracy/AUC
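Concrete instances of the divergence are easy to exhibit. Here is a small illustrative sketch (Python rather than the post’s R, with confusion matrices chosen for illustration) of two matrices with the same F1 but different balanced accuracies:

```python
# Two hypothetical confusion matrices with identical F1 but different
# balanced accuracy, showing the measures are not monotone functions
# of each other.
def f1(tp, fp, fn, tn):
    # tn is unused by F1 -- one of the standard criticisms of the measure
    return 2*tp / (fn + fp + 2*tp)

def balanced_accuracy(tp, fp, fn, tn):
    return (tp/(tp + fn) + tn/(fp + tn)) / 2

m1 = dict(tp=1, fp=1, fn=0, tn=0)
m2 = dict(tp=1, fp=0, fn=1, tn=1)

assert abs(f1(**m1) - f1(**m2)) < 1e-12   # both F1 = 2/3
assert balanced_accuracy(**m1) == 0.5     # but balanced accuracies differ
assert balanced_accuracy(**m2) == 0.75
```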
In various sciences over the years, over 20 measures of “scoring correspondence” have been introduced, by playing games with publication priority, symmetry, and incorporating significance (“chance adjustments”) directly into the measure.
Each measure presumably exists because it avoids flaws of all of the others. However, the sheer number of them (in my opinion) triggers what I call “De Morgan’s objection”:
If I had before me a fly and an elephant, having never seen more than one such magnitude of either kind; and if the fly were to endeavor to persuade me that he was larger than the elephant, I might by possibility be placed in a difficulty. The apparently little creature might use such arguments about the effect of distance, and might appeal to such laws of sight and hearing as I, if unlearned in those things, might be unable wholly to reject. But if there were a thousand flies, all buzzing, to appearance, about the great creature; and, to a fly, declaring, each one for himself, that he was bigger than the quadruped; and all giving different and frequently contradictory reasons; and each one despising and opposing the reasons of the others—I should feel quite at my ease. I should certainly say, My little friends, the case of each one of you is destroyed by the rest.
(Augustus De Morgan, “A Budget of Paradoxes” 1872)
There is actually an excellent literature stream investigating which of these measures are roughly equivalent (say arbitrary monotone functions of each other) and which are different (leave aside which are even useful).
Two excellent guides to this rat hole include:
Ackerman, M., & Ben-David, S. (2008). “Measures of clustering quality: A working set of axioms for clustering.” Advances in Neural Information Processing Systems: Proceedings of the 2008 Conference.
Warrens, M. (2008). “On similarity coefficients for 2×2 tables and correction for chance.” Psychometrika, 73(3), 487–502.
The point is: you not only can get a publication trying to sort this mess, you can actually do truly interesting work trying to relate these measures.
One can take finding relations and invariants much further as in “Lectures on Algebraic Statistics” Mathias Drton, Bernd Sturmfels, Seth Sullivant, 2008.
It is a bit much to hope to only need to know “one best measure” or to claim to be familiar (let alone expert) in all plausible measures. Instead, find a few common evaluation measures that work well and stick with them.
]]>‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
‘vtreat’ is an R package that incorporates a number of transforms and simulated out-of-sample (cross-frame simulation) procedures for preparing messy real-world data for modeling.
‘vtreat’ can be used to prepare data for either regression or classification.
Please read on for what ‘vtreat’ does and what is new.
The primary function of ‘vtreat’ is re-coding of high-cardinality categorical variables, re-coding of missing data, and out-of-sample estimation of variable effects and significances. You can use ‘vtreat’ as a pre-processor and use ‘vtreat::prepare’ as a powerful replacement for ‘stats::model.matrix’. Using ‘vtreat’ should get you quickly into the competitive ballpark of best performance on a real-world data problem (such as KDD2009), leaving you time to apply deeper domain knowledge and model tuning for even better results.
‘vtreat’ achieves this by using the assumption that you have a modeling “y” (or outcome to predict) throughout, and that all preparation and transformation should be designed to use knowledge of this “y” during training (and anticipate not having the “y” during test or application).
More simply: the purpose of ‘vtreat’ is to quickly take a messy real-world data frame similar to:
library('htmlTable')
library('vtreat')
dTrainC <- data.frame(x=c('a','a','a','b','b',NA,NA),
z=c(1,2,3,4,NA,6,NA),
y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE))
htmlTable(dTrainC)
|   | x | z | y |
|---|---|---|---|
| 1 | a | 1 | FALSE |
| 2 | a | 2 | FALSE |
| 3 | a | 3 | TRUE |
| 4 | b | 4 | FALSE |
| 5 | b |   | TRUE |
| 6 |   | 6 | TRUE |
| 7 |   |   | TRUE |
And build a treatment plan:
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
The treatment plan can then be used to clean up the original data and also be applied to any future application or test data:
dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneSig=0.5)
nround <- function(x) { if(is.numeric(x)) { round(x,2) } else { x } }
htmlTable(data.frame(lapply(dTrainCTreated,nround)))
|   | x_lev_NA | x_lev_x.a | x_catP | x_catB | z_clean | z_isBAD | y |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0.43 | -0.54 | 1 | 0 | FALSE |
| 2 | 0 | 1 | 0.43 | -0.54 | 2 | 0 | FALSE |
| 3 | 0 | 1 | 0.43 | -0.54 | 3 | 0 | TRUE |
| 4 | 0 | 0 | 0.29 | -0.13 | 4 | 0 | FALSE |
| 5 | 0 | 0 | 0.29 | -0.13 | 3.2 | 1 | TRUE |
| 6 | 1 | 0 | 0.29 | 0.56 | 6 | 0 | TRUE |
| 7 | 1 | 0 | 0.29 | 0.56 | 3.2 | 1 | TRUE |
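Two of the simpler derived columns can be sketched outside of R. This illustrative Python snippet (not vtreat’s actual implementation; the x_catB impact code, for example, involves an out-of-sample effect estimate and is not shown) reproduces the z columns of the treated frame above:

```python
# Illustrative sketch of two of vtreat's simpler derived columns:
# z_clean (NA replaced by the training mean) and z_isBAD (NA indicator).
z = [1.0, 2.0, 3.0, 4.0, None, 6.0, None]  # z column of dTrainC, None for NA

observed = [v for v in z if v is not None]
z_mean = sum(observed) / len(observed)       # 3.2 on this data

z_clean = [v if v is not None else z_mean for v in z]
z_isBAD = [0 if v is not None else 1 for v in z]

assert z_clean == [1.0, 2.0, 3.0, 4.0, 3.2, 6.0, 3.2]
assert z_isBAD == [0, 0, 0, 0, 1, 0, 1]
```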
‘vtreat’ is designed to be concise, yet implement substantial data preparation and cleaning.
This release concentrates on code-cleanup and convenience functions inspired by Nina Zumel’s recent article on y-aware PCA/PCR (my note on why you should read this series is here). In particular we now have new user-facing functions and documentation.
‘vtreat’ now has essentially two workflows. We think analysts/data scientists will be well served by learning both workflows and picking the one most appropriate to the data set at hand.
]]>From feedback I am not sure everybody noticed that, in addition to being easy and effective, the method is actually novel (we haven’t yet found an academic reference to it, or seen it already in use at any of the numerous clients we have visited). Likely it has been applied before (as it is a simple method), but it is not currently considered a standard method (something we would like to change).
In this note I’ll discuss some of the context of y-aware scaling.
y-aware scaling is a transform that has been available as “scale mode” in the vtreat R package since before the first public release on Aug 7, 2014 (derived from earlier proprietary work). It was always motivated by a “dimensional analysis” or “get the units consistent” argument. It is intended as the pre-processing step before operations that are metric sensitive, such as KNN classification and principal components regression. We didn’t really work on proving theorems about it, because in certain contexts it can be recognized as “the right thing to do.” It derives from considering inputs (independent variables or columns) as single-variable models, and the combining of such variables as a nested model or ensemble model construction (chapter 6 of Practical Data Science with R, Nina Zumel, John Mount; Manning 2014 was somewhat organized with this idea behind the scenes). Considering y (the outcome to be modeled) during dimension reduction prior to predictive modeling is a natural concern, but it seems to be anathema in principal components analysis.
y-aware scaling is in fact simple (it involves multiplying by the slope coefficients from linear regressions for a regression problem or multiplying by the slope coefficient from a logistic regression for classification problems; this is different than multiplying by the outcome y which would not be available during the application phase of a predictive model). The fact that it is simple makes it a bit hard to accept that it is both effective and novel. We are not saying it is unprecedented, but it is certainly not center in the standard literature (despite being an easy and effective technique).
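For concreteness, here is an illustrative sketch of the transform for a regression problem (in Python rather than the package’s R; function and variable names are mine, not vtreat’s):

```python
# Illustrative sketch of y-aware scaling for a regression problem:
# each centered input column is multiplied by the slope of the
# single-variable linear regression of y on that column, so a unit
# change in any rescaled column corresponds to a unit change in
# predicted y (consistent "y units" across columns).
def y_aware_scale(x, y):
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    xc = [v - x_mean for v in x]
    slope = sum(a*(b - y_mean) for a, b in zip(xc, y)) / sum(a*a for a in xc)
    return [slope * v for v in xc]

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]       # here y = 2*x, so the fitted slope is 2
scaled = y_aware_scale(x, y)
assert scaled == [-3.0, -1.0, 1.0, 3.0]  # centered x times slope 2
```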
There is an extensive literature on scaling, filtering, transforming, and pre-conditioning data for principal components analysis (for example see “Centering, scaling, and transformations: improving the biological information content of metabolomics data”, Robert A van den Berg, Huub CJ Hoefsloot, Johan A Westerhuis, Age K Smilde, and Mariët J van der Werf, BMC Genomics 7:142, 2006). However, these are all what we call x-only transforms.
When you consult references (such as The Elements of Statistical Learning, 2nd edition, Trevor Hastie, Robert Tibshirani, Jerome Friedman, Springer 2009; and Applied Predictive Modeling, Max Kuhn, Kjell Johnson, Springer 2013) you basically see only two y-sensitive principal components style techniques (in addition to recommendations to use regularized regression):
I would like to repeat (it is already implied in Nina’s article): y-aware scaling is not equivalent to either of these methods.
Supervised PCA is simply pruning the variables by inspecting small regressions prior to the PCA steps. To my mind, the fact that in 2006 one could get a publication by encouraging this natural step and giving it a name makes our point: principal components users generally do not consider using the outcome or y-variable in their data preparation. I’ll repeat: filtering and pruning variables is common in many forms of data analysis, so it is remarkable how much work was required to sell the idea of supervised PCA.
Partial Least Squares Regression is an interesting y-aware technique, but it is a different (and more complicated) technique than y-aware scaling. Here is an example (in R) showing the two methods having very different performance on (an admittedly artificial) problem: PLS.md.
In conclusion, I encourage you to take the time to read up on y-aware scaling and consider using it during your dimension reduction steps prior to predictive modeling.
]]>After reading the article we have a few follow-up thoughts on the topic.
Our group has written on the use of differential privacy to improve machine learning algorithms (by slowing down the exhaustion of novelty in your data):
However, these are situations without competing interests: we are just trying to build a better model. What about the original application of differential privacy: trading modeling effectiveness against protecting those one has collected data on? Is un-audited differential privacy an effective protection, or is it a fig-leaf that merely checks off data privacy regulations?
A few of the points to ponder:
We’ll end with: we think the applications of differential privacy techniques to improving machine learning performance are still the most promising applications, as they don’t have the difficulty of trying to serve competing interests (modeling effectiveness versus privacy). A great example of this is the fascinating paper “The Ladder: A Reliable Leaderboard for Machine Learning Competitions” by Avrim Blum and Moritz Hardt. I’d like to think it is clever applications such as the preceding that drive current interest in the topic of differential privacy (post 2015). But it looks like all anyone cares about is Apple’s announcement.
What I am trying to say: claiming the use of differential privacy should not be a “get out of regulation free card.” At best it is a tool that can be part of implementing privacy protection, and one that definitely requires ongoing detailed oversight and auditing.
]]>