Posted on Categories Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , 2 Comments on My advice on dplyr::mutate()

My advice on dplyr::mutate()

There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.


Vlcsnap 00887

“Character is what you are in the dark.”

John Whorfin quoting Dwight L. Moody.

I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.

What I want to do is share a single small piece of Win-Vector LLC‘s current guidance on using the R package dplyr. Continue reading My advice on dplyr::mutate()

Posted on Categories Opinion, Programming, StatisticsTags , , , , 1 Comment on It is Needlessly Difficult to Count Rows Using dplyr

It is Needlessly Difficult to Count Rows Using dplyr

  • Question: how hard is it to count rows using the R package dplyr?
  • Answer: surprisingly difficult.

When trying to count rows using dplyr or dplyr controlled data-structures (remote tbls such as Sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task being to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "dplyr inferno").



800px Johann Heinrich F├╝ssli 054

Continue reading It is Needlessly Difficult to Count Rows Using dplyr

Posted on Categories data science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , Leave a comment on Permutation Theory In Action

Permutation Theory In Action

While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.

Providing good solutions to gaps like this is one of the thing Win-Vector LLC does both in our consulting and training practices.

Continue reading Permutation Theory In Action

Posted on Categories Coding, Opinion, Statistics, TutorialsTags , , , , Leave a comment on Why to use the replyr R package

Why to use the replyr R package

Recently I noticed that the R package sparklyr had the following odd behavior:

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'

sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))

dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA

This means user code or user analyses that depend on one of dim(), ncol() or nrow() possibly breaks. nrow() used to return something other than NA, so older work may not be reproducible.

In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).


Tron
Tron: fights for the users.

In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users. Continue reading Why to use the replyr R package

Posted on Categories data science, Opinion, StatisticsTags , , , , , 2 Comments on Working With R and Big Data: Use Replyr

Working With R and Big Data: Use Replyr

In our latest R and Big Data article we discuss replyr.

Why replyr

replyr stands for REmote PLYing of big data for R.

Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark).

replyr allows users to work with Spark or database data similar to how they work with local data.frames. Some key capability gaps remedied by replyr include:

  • Summarizing data: replyr_summary().
  • Combining tables: replyr_union_all().
  • Binding tables by row: replyr_bind_rows().
  • Using the split/apply/combine pattern (dplyr::do()): replyr_split(), replyr::gapply().
  • Pivot/anti-pivot (gather/spread): replyr_moveValuesToRows()/ replyr_moveValuesToColumns().
  • Handle tracking.
  • A join controller.

You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with Spark and sparklyr much easier. Some of the above capabilities will likely come to the tidyverse, but the above implementations are build purely on top of dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).

Continue reading Working With R and Big Data: Use Replyr

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , , , , 1 Comment on Join Dependency Sorting

Join Dependency Sorting

In our latest installment of “R and big data” let’s again discuss the task of left joining many tables from a data warehouse using R and a system called "a join controller" (last discussed here).

One of the great advantages to specifying complicated sequences of operations in data (rather than in code) is: it is often easier to transform and extend data. Explicit rich data beats vague convention and complicated code.

Continue reading Join Dependency Sorting

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 4 Comments on Use a Join Controller to Document Your Work

Use a Join Controller to Document Your Work

This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one our big data courses).

Continue reading Use a Join Controller to Document Your Work

Posted on Categories Applications, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , ,

Managing intermediate results when using R/sparklyr

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.


NewImage
Continue reading Managing intermediate results when using R/sparklyr

Posted on Categories Opinion, Programming, StatisticsTags , , , , ,

There is usually more than one way in R

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one– and preferably only one –obvious way to do it.

Frankly in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions: str(), head(), and the tibble package‘s glimpse(). Continue reading There is usually more than one way in R

Posted on Categories StatisticsTags , , , ,

Summarizing big data in R

Our next "R and big data tip" is: summarizing big data.

We always say "if you are not looking at the data, you are not doing science"- and for big data you are very dependent on summaries (as you can’t actually look at everything).

Simple question: is there an easy way to summarize big data in R?

The answer is: yes, but we suggest you use the replyr package to do so.

Continue reading Summarizing big data in R