Posted on Categories Opinion, Programming, StatisticsTags , , , , 1 Comment on It is Needlessly Difficult to Count Rows Using dplyr

It is Needlessly Difficult to Count Rows Using dplyr

  • Question: how hard is it to count rows using the R package dplyr?
  • Answer: surprisingly difficult.

When trying to count rows using dplyr or dplyr controlled data-structures (remote tbls such as Sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task being to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "dplyr inferno").



800px Johann Heinrich F├╝ssli 054

Continue reading It is Needlessly Difficult to Count Rows Using dplyr

Posted on Categories Coding, Opinion, Statistics, TutorialsTags , , , , Leave a comment on Why to use the replyr R package

Why to use the replyr R package

Recently I noticed that the R package sparklyr had the following odd behavior:

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'

sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))

dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA

This means user code or user analyses that depend on one of dim(), ncol() or nrow() possibly breaks. nrow() used to return something other than NA, so older work may not be reproducible.

In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).


Tron
Tron: fights for the users.

In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users. Continue reading Why to use the replyr R package

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , , , 1 Comment on More documentation for Win-Vector R packages

More documentation for Win-Vector R packages

The Win-Vector public R packages now all have new pkgdown documentation sites! (And, a thank-you to Hadley Wickham for developing the pkgdown tool.)

Please check them out (hint: vtreat is our favorite).

NewImage Continue reading More documentation for Win-Vector R packages

Posted on Categories data science, Opinion, StatisticsTags , , , , , 2 Comments on Working With R and Big Data: Use Replyr

Working With R and Big Data: Use Replyr

In our latest R and Big Data article we discuss replyr.

Why replyr

replyr stands for REmote PLYing of big data for R.

Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark).

replyr allows users to work with Spark or database data similar to how they work with local data.frames. Some key capability gaps remedied by replyr include:

  • Summarizing data: replyr_summary().
  • Combining tables: replyr_union_all().
  • Binding tables by row: replyr_bind_rows().
  • Using the split/apply/combine pattern (dplyr::do()): replyr_split(), replyr::gapply().
  • Pivot/anti-pivot (gather/spread): replyr_moveValuesToRows()/ replyr_moveValuesToColumns().
  • Handle tracking.
  • A join controller.

You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with Spark and sparklyr much easier. Some of the above capabilities will likely come to the tidyverse, but the above implementations are build purely on top of dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).

Continue reading Working With R and Big Data: Use Replyr

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , , , , 1 Comment on Join Dependency Sorting

Join Dependency Sorting

In our latest installment of “R and big data” let’s again discuss the task of left joining many tables from a data warehouse using R and a system called "a join controller" (last discussed here).

One of the great advantages to specifying complicated sequences of operations in data (rather than in code) is: it is often easier to transform and extend data. Explicit rich data beats vague convention and complicated code.

Continue reading Join Dependency Sorting

Posted on Categories Coding, data science, Opinion, Programming, Statistics, TutorialsTags , , , , , , , , , , 10 Comments on Non-Standard Evaluation and Function Composition in R

Non-Standard Evaluation and Function Composition in R

In this article we will discuss composing standard-evaluation interfaces (SE: parametric, referentially transparent, or “looks only at values”) and composing non-standard-evaluation interfaces (NSE) in R.

In R the package tidyeval/rlang is a tool for building domain specific languages intended to allow easier composition of NSE interfaces.

To use it you must know some of its structure and notation. Here are some details paraphrased from the major tidyeval/rlang client, the package dplyr: vignette('programming', package = 'dplyr')).

  • ":=" is needed to make left-hand-side re-mapping possible (adding yet another "more than one assignment type operator running around" notation issue).
  • "!!" substitution requires parenthesis to safely bind (so the notation is actually "(!! )", not "!!").
  • Left-hand-sides of expressions are names or strings, while right-hand-sides are quosures/expressions.

Continue reading Non-Standard Evaluation and Function Composition in R

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 4 Comments on Use a Join Controller to Document Your Work

Use a Join Controller to Document Your Work

This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one our big data courses).

Continue reading Use a Join Controller to Document Your Work

Posted on Categories Coding, Opinion, Programming, StatisticsTags , , , 7 Comments on In defense of wrapr::let()

In defense of wrapr::let()

Saw this the other day:

Wraprvstidyeval

In defense of wrapr::let() (originally part of replyr, and still re-exported by that package) I would say:

  • let() was deliberately designed for a single real-world use case: working with data when you don’t know the column names when you are writing the code (i.e., the column names will come later in a variable). We can re-phrase that as: there is deliberately less to learn as let() is adapted to a need (instead of one having to adapt to let()).
  • The R community already has months of experience confirming let() working reliably in production while interacting with a number of different packages.
  • let() will continue to be a very specific, consistent, reliable, and relevant tool even after dpyr 0.6.* is released, and the community gains experience with rlang/tidyeval in production.

If rlang/tidyeval is your thing, by all means please use and teach it. But please continue to consider also using wrapr::let(). If you are trying to get something done quickly, or trying to share work with others: a “deeper theory” may not be the best choice.

An example follows. Continue reading In defense of wrapr::let()

Posted on Categories StatisticsTags , , , ,

Summarizing big data in R

Our next "R and big data tip" is: summarizing big data.

We always say "if you are not looking at the data, you are not doing science"- and for big data you are very dependent on summaries (as you can’t actually look at everything).

Simple question: is there an easy way to summarize big data in R?

The answer is: yes, but we suggest you use the replyr package to do so.

Continue reading Summarizing big data in R

Posted on Categories TutorialsTags , , 11 Comments on Programming over R

Programming over R

R is a very fluid language amenable to meta-programming, or alterations of the language itself. This has allowed the late user-driven introduction of a number of powerful features such as magrittr pipes, the foreach system, futures, data.table, and dplyr. Please read on for some small meta-programming effects we have been experimenting with.

NewImage Continue reading Programming over R