- Question: how hard is it to count rows using the dplyr package?
- Answer: surprisingly difficult.
When trying to count rows using dplyr controlled data-structures (remote tbls such as dbplyr structures) one is sailing between Scylla and Charybdis. The task is to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "
Continue reading It is Needlessly Difficult to Count Rows Using dplyr
Recently I noticed that the sparklyr package had the following odd behavior:
#>  '0.7.2.9000'
#>  '0.6.2'
#>  '126.96.36.19900'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
#>  NA
#>  NA
#>  NA
This means user code or user analyses that depend on functions such as nrow() may break. nrow() used to return something other than NA, so older work may not be reproducible.
In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).
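One way to count rows without relying on nrow() is to ask the remote system to do the counting. The following is only a minimal sketch under stated assumptions: an in-memory SQLite database stands in for Spark (so DBI, RSQLite, and dbplyr need to be installed), and the names con and n_rows are made up for illustration.

```r
library(dplyr)

# Assumption: an in-memory SQLite database stands in for the Spark connection.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d <- copy_to(con, data.frame(x = 1:2), name = "d")

# For a remote tbl the row count is not known without running a query,
# so nrow() may not return a usable number.
print(nrow(d))

# Ask the database to count, then bring the one-row result back into R.
n_rows <- d %>%
  summarize(n = n()) %>%
  collect() %>%
  pull(n) %>%
  as.numeric()
print(n_rows)

DBI::dbDisconnect(con)
```

The as.numeric() step is there because different back ends may return the count in different numeric types.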
Tron: fights for the users.
In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.
Continue reading Why to use the replyr R package
The Win-Vector public R packages now all have new pkgdown documentation sites! (And, a thank-you to Hadley Wickham for developing pkgdown.)
Please check them out (hint: vtreat is our favorite).
Continue reading More documentation for Win-Vector R packages
In our latest R and Big Data article we discuss replyr.
replyr stands for REmote PLYing of big data for R.
Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark).
replyr allows users to work with Spark or database data similar to how they work with local data.frames. Some key capability gaps remedied by replyr include:
- Summarizing data.
- Combining tables.
- Binding tables by row.
- Using the split/apply/combine pattern.
- Pivot/anti-pivot.
- Handle tracking.
- A join controller.
You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with sparklyr much easier. Some of the above capabilities will likely come to the tidyverse, but the above implementations are built purely on top of dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).
Continue reading Working With R and Big Data: Use Replyr
In our latest installment of “R and big data” let’s again discuss the task of left joining many tables from a data warehouse using R and a system called "a join controller" (last discussed here).
One of the great advantages to specifying complicated sequences of operations in data (rather than in code) is: it is often easier to transform and extend data. Explicit rich data beats vague convention and complicated code.
Continue reading Join Dependency Sorting
In this article we will discuss composing standard-evaluation interfaces (SE: parametric, referentially transparent, or “looks only at values”) and composing non-standard-evaluation interfaces (NSE) in R.
In R, the package rlang is a tool for building domain specific languages intended to allow easier composition of NSE interfaces.
To use it you must know some of its structure and notation. Here are some details paraphrased from the major rlang client, the package dplyr (vignette('programming', package = 'dplyr')):
- ":=" is needed to make left-hand-side re-mapping possible (adding yet another "more than one assignment type operator running around" notation issue).
- "!!" substitution requires parentheses to safely bind (so the notation is actually "(!! )", not "!!").
- Left-hand-sides of expressions are names or strings, while right-hand-sides are
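The notation points above can be seen in a small example. This is a minimal sketch, not code from the article: the data frame d and the variables new_name and src are hypothetical, and it assumes dplyr 0.7 or later with rlang installed.

```r
library(dplyr)

d <- data.frame(a = 1:3)

# The names are held in variables, as if they arrived as parameters.
new_name <- "b"          # left-hand side: a string
src <- rlang::sym("a")   # right-hand side: a symbol to splice in

# ":=" allows the left-hand side to be re-mapped with "!!", and the
# right-hand side uses the parenthesized "(!! )" form so it binds safely.
res <- d %>%
  mutate(!!new_name := (!!src) + 1)

print(res)
```

Here the string on the left names the new column, while the symbol on the right is spliced into the expression before mutate() evaluates it.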
Continue reading Non-Standard Evaluation and Function Composition in R
This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one of our big data courses).
Continue reading Use a Join Controller to Document Your Work
Saw this the other day:
In defense of wrapr::let() (originally part of replyr, and still re-exported by that package) I would say:
- let() was deliberately designed for a single real-world use case: working with data when you don’t know the column names when you are writing the code (i.e., the column names will come later in a variable). We can re-phrase that as: there is deliberately less to learn, as let() is adapted to a need (instead of one having to adapt to let()).
- The R community already has months of experience confirming let() working reliably in production while interacting with a number of different packages.
- let() will continue to be a very specific, consistent, reliable, and relevant tool even after dplyr 0.6.* is released, and the community gains experience with tidyeval in production.
If tidyeval is your thing, by all means please use and teach it. But please continue to consider also using wrapr::let(). If you are trying to get something done quickly, or trying to share work with others: a “deeper theory” may not be the best choice.
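To make the use case concrete, here is a minimal sketch (not the example from the full article) of let() working with a column whose name is only known through a variable. The data frame d, the placeholder COL, and the variable col_name are made up for illustration; the only assumption is that the wrapr package is installed.

```r
library(wrapr)

d <- data.frame(x = c(1, 2, 3))

# The column name arrives later, in a variable.
col_name <- "x"

# Write the code against the placeholder COL; let() substitutes the
# real column name for COL and then evaluates the expression.
total <- let(
  c(COL = col_name),
  sum(d$COL)
)

print(total)
```

The code inside let() reads like ordinary R written for a known column, which is the "less to learn" point made above.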
An example follows. Continue reading In defense of wrapr::let()
Our next "R and big data tip" is: summarizing big data.
We always say "if you are not looking at the data, you are not doing science", and for big data you are very dependent on summaries (as you can’t actually look at everything).
Simple question: is there an easy way to summarize big data in R?
The answer is: yes, but we suggest you use the replyr package to do so.
Continue reading Summarizing big data in R
R is a very fluid language amenable to meta-programming, or alterations of the language itself. This has allowed the late user-driven introduction of a number of powerful features such as magrittr pipes, the foreach system, futures, data.table, and dplyr. Please read on for some small meta-programming effects we have been experimenting with.
Continue reading Programming over R