Posted on Categories Opinion, Statistics, TutorialsTags , , , 1 Comment on Remember: p-values Are Not Effect Sizes

Remember: p-values Are Not Effect Sizes

Authors: John Mount and Nina Zumel.

The p-value is a valid frequentist statistical concept that is much abused and mis-used in practice. In this article I would like to call out a few features of p-values that can cause problems in evaluating summaries.

Keep in mind: p-values are useful and routinely taught correctly in statistics, but very often mis-remembered or abused in practice.

14582827568 7a2f83011f
From Hamilton’s Lectures on metaphysics and logic (1871).
Internet Archive Book Images

Continue reading Remember: p-values Are Not Effect Sizes

Posted on Categories Opinion, Programming, StatisticsTags , , , , 1 Comment on It is Needlessly Difficult to Count Rows Using dplyr

It is Needlessly Difficult to Count Rows Using dplyr

  • Question: how hard is it to count rows using the R package dplyr?
  • Answer: surprisingly difficult.

When trying to count rows using dplyr or dplyr controlled data-structures (remote tbls such as Sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task being to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "dplyr inferno").



800px Johann Heinrich Füssli 054

Continue reading It is Needlessly Difficult to Count Rows Using dplyr

Posted on Categories data science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , Leave a comment on Permutation Theory In Action

Permutation Theory In Action

While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.

Providing good solutions to gaps like this is one of the thing Win-Vector LLC does both in our consulting and training practices.

Continue reading Permutation Theory In Action

Posted on Categories Coding, Opinion, Statistics, TutorialsTags , , , , Leave a comment on Why to use the replyr R package

Why to use the replyr R package

Recently I noticed that the R package sparklyr had the following odd behavior:

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'

sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))

dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA

This means user code or user analyses that depend on one of dim(), ncol() or nrow() possibly breaks. nrow() used to return something other than NA, so older work may not be reproducible.

In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).


Tron
Tron: fights for the users.

In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users. Continue reading Why to use the replyr R package

Posted on Categories Exciting Techniques, Programming, Statistics, TutorialsTags , , , 1 Comment on Neat New seplyr Feature: String Interpolation

Neat New seplyr Feature: String Interpolation

The R package seplyr has a neat new feature: the function seplyr::expand_expr() which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string.


Safety

This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods. Continue reading Neat New seplyr Feature: String Interpolation

Posted on Categories Programming, StatisticsTags , , , , Leave a comment on wrapr: R Code Sweeteners

wrapr: R Code Sweeteners

wrapr is an R package that supplies powerful tools for writing and debugging R code.

Wraprs

Continue reading wrapr: R Code Sweeteners

Posted on Categories Programming, Statistics, TutorialsTags , , , , , , 6 Comments on Some Neat New R Notations

Some Neat New R Notations

The R package wrapr supplies a few neat new coding notations.


abacus

An Abacus, which gives us the term “calculus.”

Continue reading Some Neat New R Notations

Posted on Categories Opinion, Programming, Rants, StatisticsTags , , 2 Comments on Is dplyr Easily Comprehensible?

Is dplyr Easily Comprehensible?

dplyr is one of the most popular R packages. It is powerful and important. But is it in fact easily comprehensible? Continue reading Is dplyr Easily Comprehensible?

Posted on Categories Administrativia, Opinion, StatisticsTags , , Leave a comment on Thank You For The Very Nice Comment

Thank You For The Very Nice Comment

Somebody nice reached out and gave us this wonderful feedback on our new Supervised Learning in R: Regression (paid) video course.

Thanks for a wonderful course on DataCamp on XGBoost and Random forest. I was struggling with Xgboost earlier and Vtreat has made my life easy now :).

Continue reading Thank You For The Very Nice Comment

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, StatisticsTags , 4 Comments on Supervised Learning in R: Regression

Supervised Learning in R: Regression

We are very excited to announce a new (paid) Win-Vector LLC video training course: Supervised Learning in R: Regression now available on DataCamp

Shield image course 3851 20170725 24872 3f982z Continue reading Supervised Learning in R: Regression