Categories: Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

My advice on dplyr::mutate()

There are substantial differences between ad-hoc analyses (machine learning research, data science contests, or other demonstrations) and production-worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run, and often the run in which they are first correct is also the last (the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.


“Character is what you are in the dark.”

John Whorfin quoting Dwight L. Moody.

I have found that to deliver production-worthy data science and predictive analytic systems, one has to develop per-team and per-project field-tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.

What I want to do is share a single small piece of Win-Vector LLC's current guidance on using the R package dplyr.

Categories: Opinion, Programming, Statistics

It is Needlessly Difficult to Count Rows Using dplyr

  • Question: how hard is it to count rows using the R package dplyr?
  • Answer: surprisingly difficult.

When trying to count rows using dplyr or dplyr-controlled data structures (remote tbls such as sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task is to avoid dplyr corner cases and irregularities (a few of which I attempt to document in this "dplyr inferno").
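As a minimal sketch of the issue, assuming a small local data frame `d` (a made-up example): `nrow()` works for in-memory data, but for remote tbls it typically returns `NA`, so counting through a dplyr verb that can be translated to the back end is usually the more portable choice.

```r
suppressPackageStartupMessages(library("dplyr"))

# Hypothetical example data; a remote tbl would be piped the same way.
d <- data.frame(x = c(1, 2, 3))

# Works for local data frames, but is not a reliable count for remote tbls.
n_local <- nrow(d)

# Counting through dplyr verbs, which can also run on a remote back end.
n_tally <- d %>% tally() %>% pull(n)
n_summarise <- d %>% summarise(n = n()) %>% pull(n)
```

All three values agree on local data; the differences only show up once the data lives somewhere else.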



[Image: painting by Johann Heinrich Füssli]


Categories: Opinion, Programming, Rants, Statistics

Is dplyr Easily Comprehensible?

dplyr is one of the most popular R packages. It is powerful and important. But is it in fact easily comprehensible?

Categories: Programming, Statistics

What is magrittr’s future in the tidyverse?

For many R users the magrittr pipe is a popular way to arrange computation and famously part of the tidyverse.


The tidyverse itself is a rapidly evolving, centrally controlled package collection. The tidyverse authors publicly appear to be interested in re-basing the tidyverse in terms of their new rlang/tidyeval package. So it is natural to wonder: what is the future of magrittr (a pre-rlang/tidyeval package) in the tidyverse?

Categories: Opinion, Programming, Statistics

Please Consider Using wrapr::let() for Replacement Tasks

From dplyr issue 2916.

The following appears to work.

suppressPackageStartupMessages(library("dplyr"))

COL <- "homeworld"
starwars %>%
  group_by(.data[[COL]]) %>%
  head(n=1)
## # A tibble: 1 x 14
## # Groups:   COL [1]
##             name height  mass hair_color skin_color eye_color birth_year
##            <chr>  <int> <dbl>      <chr>      <chr>     <chr>      <dbl>
## 1 Luke Skywalker    172    77      blond       fair      blue         19
## # ... with 7 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, COL <chr>

Though notice it reports the grouping is by "COL", not by "homeworld". Also the data set now has 14 columns, not the original 13 from the starwars data set.
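For comparison, here is a minimal sketch of the replacement-based approach named in the title, assuming the wrapr package is installed. `wrapr::let()` substitutes the stand-in name (here the hypothetical `GROUPCOL`) with the column name held in `COL` before the dplyr code runs, so the grouping should be reported under the real column name and no extra column should appear.

```r
suppressPackageStartupMessages(library("dplyr"))

COL <- "homeworld"

# GROUPCOL is a hypothetical placeholder name chosen for this sketch;
# let() rewrites it to the value of COL before evaluating the pipeline.
res <- wrapr::let(
  c(GROUPCOL = COL),
  starwars %>%
    group_by(GROUPCOL) %>%
    head(n = 1)
)

group_vars(res)               # grouping column name after substitution
ncol(res) == ncol(starwars)   # checks that no extra column was added
```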


Categories: Coding, data science, Opinion, Programming, Statistics, Tutorials

Non-Standard Evaluation and Function Composition in R

In this article we will discuss composing standard-evaluation interfaces (SE: parametric, referentially transparent, or “looks only at values”) and composing non-standard-evaluation interfaces (NSE) in R.

In R the package tidyeval/rlang is a tool for building domain specific languages intended to allow easier composition of NSE interfaces.

To use it you must know some of its structure and notation. Here are some details paraphrased from the major tidyeval/rlang client, the package dplyr (see vignette('programming', package = 'dplyr')).

  • ":=" is needed to make left-hand-side re-mapping possible (adding yet another "more than one assignment type operator running around" notation issue).
  • "!!" substitution requires parentheses to safely bind (so the notation is actually "(!! )", not "!!").
  • Left-hand-sides of expressions are names or strings, while right-hand-sides are quosures/expressions.
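The three points above can be sketched in a few lines, assuming dplyr and rlang are installed. The helper `add_double()` and its arguments are hypothetical names invented for this illustration, not part of either package.

```r
suppressPackageStartupMessages(library("dplyr"))

# Hypothetical helper: add a column named by `dest` holding twice column `src`.
add_double <- function(d, src, dest) {
  src_sym <- rlang::sym(src)  # right-hand side needs a symbol, not a string
  d %>%
    # `:=` permits `!!` on the left-hand side (a string is accepted there);
    # the parentheses keep `!!` bound to src_sym rather than to `src_sym * 2`.
    mutate(!!dest := (!!src_sym) * 2)
}

d <- data.frame(x = c(1, 2, 3))
res <- add_double(d, "x", "y")
```

Here the column names arrive as ordinary string values, which is the standard-evaluation style of interface the article sets against NSE.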


Categories: Opinion, Rants, Statistics, Tutorials

An easy way to accidentally inflate reported R-squared in linear regression models

Here is an absolutely horrible way to confuse yourself and get an inflated reported R-squared on a simple linear regression model in R.

We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here.
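The excerpt does not spell out the mistake. One well-known candidate, shown here purely as an assumption about what is meant, is removing the intercept from the model formula: for a model without an intercept, `summary.lm()` measures R-squared against zero rather than against the mean of the outcome. With a categorical variable the two fits below make identical predictions yet report very different R-squared values. The data is synthetic.

```r
set.seed(2017)

# Synthetic data: a two-level factor and an outcome far from zero.
d <- data.frame(x = factor(rep(c("a", "b"), 50)))
d$y <- 100 + ifelse(d$x == "a", 0, 1) + rnorm(100)

m_with <- lm(y ~ x, data = d)         # usual model, intercept included
m_without <- lm(y ~ 0 + x, data = d)  # intercept removed, one level per coefficient

r2_with <- summary(m_with)$r.squared
r2_without <- summary(m_without)$r.squared

# Same predictions from both encodings, despite the different R-squared.
same_fit <- isTRUE(all.equal(unname(fitted(m_with)), unname(fitted(m_without))))
```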

Categories: Coding, Opinion, Programming, Statistics

More on safe substitution in R

Let’s worry a bit about substitution in R. Substitution is very powerful, which means it can be both used and mis-used. However, that does not mean every use is unsafe or a mistake.
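As a small illustration of deliberate substitution, here is a sketch using base R's `bquote()`. The choice of the `mpg` column from the built-in `mtcars` data is only an example.

```r
# Column name held as a value (an arbitrary choice for illustration).
col <- as.name("mpg")

# bquote() substitutes the value of `col` into the expression at .(col).
expr <- bquote(mean(.(col)))

# The substituted expression is ordinary R code and can be inspected first.
expr_text <- deparse(expr)

# Evaluate the expression with the data frame's columns in scope.
val <- eval(expr, mtcars)
```

Being able to inspect the expression before it is evaluated is part of what makes this kind of substitution checkable.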


Categories: Opinion, Programming, Statistics

There is usually more than one way in R

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one– and preferably only one –obvious way to do it.

Frankly, in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions: str(), head(), and the tibble package's glimpse().
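A minimal sketch of the three functions on a made-up data frame; the `glimpse()` call is guarded because it assumes the tibble package is installed.

```r
# Hypothetical example data.
d <- data.frame(x = 1:3, y = c("a", "b", "c"))

str(d)                    # compact structure: column types and first values
first_rows <- head(d, 2)  # the first rows, returned as a data frame

# glimpse() comes from the tibble package; run it only if available.
if (requireNamespace("tibble", quietly = TRUE)) {
  tibble::glimpse(d)
}
```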

Categories: data science, Opinion, Statistics

R summary() got better!

Here is a really nice feature found in the current 3.4.0 version of R: summary() has become a lot more reasonable.

summary(15555)

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   15555   15555   15555   15555   15555   15555 

Please read on for some background.
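A hedged sketch of that background, assuming (as I understand the change) that before R 3.4.0 `summary()` rounded its results to four significant digits by default. `signif()` reproduces that rounding, which would turn 15555 into 15560, while in R 3.4.0 and later the stored summary values are left unrounded.

```r
x <- 15555

# Rounding to 4 significant digits, as older summary() results
# are assumed to have done by default.
old_style <- signif(x, 4)

# In R >= 3.4.0 the summary values themselves are not rounded.
s <- summary(x)
new_style <- unname(s[["Mean"]])
```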