Authors: John Mount and Nina Zumel.
The p-value is a valid frequentist statistical concept that is much abused and misused in practice. In this article I would like to call out a few features of p-values that can cause problems in evaluating summaries.
Keep in mind:
p-values are useful and routinely taught correctly in statistics, but very often mis-remembered or abused in practice.
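As a rough illustration of the point in the title (a p-value is not an effect size), here is a minimal simulation sketch in base R. The sample size, the 0.01 standard deviation shift, and all variable names are assumptions chosen for illustration and do not come from the article itself.

```r
# Illustrative simulation (assumed numbers, not from the article):
# a tiny effect can look "highly significant" when the sample is large.
set.seed(2017)
n <- 1000000                                  # observations per group (assumed)
control   <- rnorm(n, mean = 0,    sd = 1)
treatment <- rnorm(n, mean = 0.01, sd = 1)    # true shift: 0.01 standard deviations

# Welch two-sample t-test: the p-value measures evidence against "no difference"
test <- t.test(treatment, control)
p_value <- test$p.value

# Cohen's d: difference in means scaled by the pooled standard deviation
pooled_sd <- sqrt((var(treatment) + var(control)) / 2)
effect_size <- (mean(treatment) - mean(control)) / pooled_sd

print(p_value)       # statistical significance
print(effect_size)   # practical size of the difference
```

Reporting the effect size next to the p-value makes it harder to mistake "detectable" for "large".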
Continue reading Remember: p-values Are Not Effect Sizes
- Question: how hard is it to count rows using dplyr?
- Answer: surprisingly difficult.
When trying to count rows using dplyr-controlled data structures (remote tbls such as dbplyr structures), one is sailing between Scylla and Charybdis. The task is to avoid dplyr corner cases and irregularities (a few of which I attempt to document).
Continue reading It is Needlessly Difficult to Count Rows Using dplyr
While working on a large client project using Sparklyr and multinomial regression, we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.
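Translating one order to another amounts to computing a permutation. The following is a minimal base R sketch of that idea, not the method from the full article; the labels, scores, and variable names are all hypothetical.

```r
# Hypothetical outcome labels; neither ordering comes from the original post.
spark_order <- c("setosa", "virginica", "versicolor")  # order chosen by the modeling system (assumed)
r_order     <- c("setosa", "versicolor", "virginica")  # order the R user wants (assumed)

# perm[i] is the position in spark_order of the i-th label of r_order
perm <- match(r_order, spark_order)

# Hypothetical per-class scores, reported in spark_order
spark_scores <- c(0.7, 0.1, 0.2)

# Indexing by perm re-expresses the scores in r_order
r_scores <- spark_scores[perm]
names(r_scores) <- r_order

# The inverse permutation maps values in r_order back to spark_order
inv_perm <- order(perm)
round_trip <- unname(r_scores[inv_perm])
```

Here match() builds the permutation from the labels alone, and order() applied to a permutation returns its inverse, so the same two calls cover both directions of the translation.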
Providing good solutions to gaps like this is one of the things Win-Vector LLC does in both our consulting and training practices.
Continue reading Permutation Theory In Action
Recently I noticed that sparklyr had the following odd behavior:
#> [1] '0.7.2.9000'
#> [1] '0.6.2'
#> [1] '188.8.131.5200'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
#> [1] NA
#> [1] NA
#> [1] NA
This means user code or user analyses that depend on functions such as nrow() may break. nrow() used to return something other than NA, so older work may not be reproducible.
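One way to avoid depending on nrow() is to ask for the count through dplyr verbs. The sketch below is an assumption-laden illustration, not the replyr implementation: count_rows is a hypothetical helper name, it assumes dplyr 0.7.0 or later (for pull()), and it is demonstrated on a local data.frame standing in for a remote tbl.

```r
library(dplyr)

# Local stand-in for a remote tbl (assumption: the same verbs are
# translated by the backend for dbplyr or sparklyr tables).
d <- data.frame(x = 1:2)

# Hypothetical helper: count rows without calling nrow()
count_rows <- function(tbl) {
  tbl %>%
    summarize(n = n()) %>%   # one-row result holding the row count
    pull(n) %>%              # bring that single value into R
    as.numeric()             # normalize integer-like return types
}

n_rows <- count_rows(d)
print(n_rows)
```

Corner cases on remote sources are not addressed by this sketch and should be checked against the actual backend.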
In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).
Tron: fights for the users.
In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.
Continue reading Why to use the replyr R package
seplyr has a neat new feature: the function seplyr::expand_expr(), which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string.
This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods.
Continue reading Neat New seplyr Feature: String Interpolation
wrapr is an R package that supplies powerful tools for writing and debugging R code.
Continue reading wrapr: R Code Sweeteners
dplyr is one of the most popular R packages. It is powerful and important. But is it in fact easily comprehensible?
Continue reading Is dplyr Easily Comprehensible?
Somebody nice reached out and gave us this wonderful feedback on our new Supervised Learning in R: Regression (paid) video course.
Thanks for a wonderful course on DataCamp on Random forest. I was struggling with Xgboost earlier and Vtreat has made my life easy now :).
Continue reading Thank You For The Very Nice Comment