vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects.
Continue reading Upcoming data preparation and modeling article series
There are substantial differences between ad-hoc analyses (be they machine learning research, data science contests, or other demonstrations) and production-worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often, once they are correct, that is the last time they are run; the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.
Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.
“Character is what you are in the dark.”
I have found: to deliver production-worthy data science and predictive analytic systems, one has to develop per-team and per-project field-tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.
The p-value is a valid frequentist statistical concept that is much abused and misused in practice. In this article I would like to call out a few features of p-values that can cause problems in evaluating summaries.
Keep in mind: p-values are useful and routinely taught correctly in statistics, but very often misremembered or abused in practice.
From Hamilton’s Lectures on metaphysics and logic (1871).
Internet Archive Book Images
When trying to count rows using dplyr-controlled data structures (remote tbls such as Sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task is to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "
While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.
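As a rough illustration of translating one order to another, here is a minimal base R sketch. The label names, the probability matrix, and the variable names (`spark_order`, `r_order`, `probs`) are all made up for illustration; this is not the method used in the client project, and it does not call Spark at all.

```r
# Hypothetical label orders, for illustration only.
spark_order <- c("setosa", "virginica", "versicolor")  # order we assume the backend chose
r_order     <- c("setosa", "versicolor", "virginica")  # order the R user wants

# Made-up per-class probabilities, with columns in the backend's order.
probs <- matrix(c(0.7, 0.1, 0.2,
                  0.2, 0.5, 0.3),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, spark_order))

# match() gives, for each desired label, its position in the backend's order.
perm <- match(r_order, spark_order)
stopifnot(!anyNA(perm))  # every desired label must be present

# Re-order the columns into the order the R user expects.
probs_r <- probs[, perm, drop = FALSE]
print(colnames(probs_r))
```

Matching by label name rather than by position means the translation keeps working even if the backend picks a different order on a later run.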
suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#>  '0.7.2.9000'
packageVersion("sparklyr")
#>  '0.6.2'
packageVersion("dbplyr")
#>  '22.214.171.12400'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
dim(d)
#>  NA
ncol(d)
#>  NA
nrow(d)
#>  NA
This means user code or user analyses that depend on one of dim(), ncol(), or nrow() may break. nrow() used to return something other than NA, so older work may not be reproducible.
In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).
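One way to avoid depending on nrow() is to ask for the count through a dplyr summary instead. The sketch below is an assumption-laden illustration, not the approach taken in the original post: `count_rows` is a hypothetical helper name, and the example runs on a local data.frame standing in for a remote tbl. I would expect the same pipeline to be translated to a count query on remote sources, but that should be checked against the backend and package versions in use.

```r
suppressPackageStartupMessages(library("dplyr"))

# Hypothetical helper: count rows with a summary rather than nrow().
count_rows <- function(d) {
  d %>%
    summarise(n = n()) %>%  # one-row result holding the row count
    pull(n) %>%             # bring the single value back as a vector
    as.numeric()            # normalize the numeric type across backends
}

# Local stand-in for a remote tbl (illustration only).
d_local <- data.frame(x = 1:2)
print(count_rows(d_local))
```

Because the count is expressed as an ordinary dplyr pipeline, the calling code does not need to know whether the data is local or remote.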
Tron: fights for the users.
In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both dplyr and dbplyr users.
Continue reading Why to use the replyr R package
seplyr has a neat new feature: the function seplyr::expand_expr() which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string.
This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods.
Continue reading Neat New seplyr Feature: String Interpolation