In our last article we pointed out a dangerous silent result corruption we have seen when using the
dplyr package with databases.
To systematically avoid this result corruption we suggest breaking up your
dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any value in the same mutate in which it is formed). We consider these to be key precautions to take when using
dplyr with a database.
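As a minimal sketch of the suggested refactoring (the data frame `d` and the column names are made up for illustration), the block below writes the same calculation once as a single mutate with internal dependencies and once as a sequence of dependency-free mutates. It runs on a local data.frame, where both forms agree; the concern described above is with database back ends.

```r
library(dplyr)

# Hypothetical example data
d <- data.frame(x = c(1, 2, 3))

# One mutate() that re-uses values it has just formed:
# `y` is used to build `z`, and `y` is assigned twice.
chained <- d %>%
  mutate(y = x + 1, z = y * 2, y = y + 10)

# The same calculation split so that no mutate() uses a value
# it creates, and no mutate() assigns the same column twice.
staged <- d %>%
  mutate(y = x + 1) %>%
  mutate(z = y * 2) %>%
  mutate(y = y + 10)
```

The staged form is longer, but each step can be translated and checked on its own.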
We are also distributing free tools to do this automatically, along with a worked example of this solution.
A note to users of
dplyr with databases: you may benefit from inspecting and refactoring your code to eliminate value re-use inside
dplyr::mutate() statements.
Continue reading Please inspect your dplyr+database code
Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the
0.5.0 version of
seplyr (also now available on CRAN):
partition_mutate_qt(): these are query planners/optimizers that work over
dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of overly long in-mutate dependence chains, of too many mutate steps, and of incidental bugs; all explained in the linked tutorials).
if_else_device(): provides a
dplyr::mutate() based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as often seen in porting from SAS) to be directly and legibly translated into performant
dplyr::mutate() data flow code that works on Spark (via Sparklyr) and databases.
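To make the idea of a per-row conditional block concrete, here is a hand-written sketch in plain dplyr. It does not call if_else_device() (whose interface is not shown in this excerpt); the data and column names are invented for illustration. The condition is computed in its own mutate, and each branch's assignments are then guarded by it.

```r
library(dplyr)

# Hypothetical example data
d <- data.frame(x = c(1, 5, 10), a = 0, b = 0)

# Imperative intent, per row:
#   if (x > 4) { a <- x * 2 } else { b <- x + 1 }
res <- d %>%
  mutate(cond = x > 4) %>%          # step 1: form the condition
  mutate(
    a = ifelse(cond, x * 2, a),     # "then" branch assignment
    b = ifelse(!cond, x + 1, b)     # "else" branch assignment
  ) %>%
  select(-cond)                     # drop the temporary column
```

The second mutate uses no value formed inside that same mutate, which keeps it in line with the dependency-free style recommended above.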
Image by Jeff Kubina from Columbia, Maryland, CC BY-SA 2.0.
Continue reading Win-Vector LLC announces new “big data in R” tools
- Question: how hard is it to count rows using dplyr?
- Answer: surprisingly difficult.
When trying to count rows using
dplyr controlled data-structures (remote
tbls such as
dbplyr structures) one is sailing between Scylla and Charybdis. The task is to avoid
dplyr corner-cases and irregularities (a few of which I attempt to document in this "
Continue reading It is Needlessly Difficult to Count Rows Using dplyr
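For orientation, one common way to count rows that stays inside dplyr verbs is a summarize with n(). The sketch below uses a local data.frame (invented for illustration) so that it is self-contained. The same pipeline is the usual candidate for remote tables, where, as far as I know, nrow() on a lazy table returns NA. It does not address the corner cases the full article goes on to catalogue.

```r
library(dplyr)

# Hypothetical example data
d <- data.frame(x = c(1, 2, 3))

# Count rows with dplyr verbs only, then extract the single value.
n_rows <- d %>%
  summarize(n = n()) %>%
  pull(n)
```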
When I started writing about methods for better "parametric programming" interfaces for
dplyr users in December of 2016 I encountered three divisions in the audience:
- dplyr users who had such a need, and wanted such extensions.
- dplyr users who did not have such a need ("we always know the column names").
- dplyr users who found the then-current fairly complex "underscore" and lazyeval system sufficient for the task.
Needing name substitution is a problem an advanced full-time
R user can solve on their own. However, a part-time
R user would greatly benefit from a simple, reliable, readable, documented, and comprehensible packaged solution.
Continue reading Let’s Have Some Sympathy For The Part-time R User
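To show what "name substitution" means in practice, here is a small sketch in which the column names arrive as strings. It uses dplyr's `.data` pronoun, which is one way to do this in current dplyr and is not the packaged solution the article argues for. The helper `mean_by` and the example data are hypothetical.

```r
library(dplyr)

# Hypothetical helper: the grouping and value columns are named by strings.
mean_by <- function(d, group_col, value_col) {
  d %>%
    group_by(.data[[group_col]]) %>%
    summarize(mean_value = mean(.data[[value_col]]))
}

# Hypothetical example data
d <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 10),
                stringsAsFactors = FALSE)

res <- mean_by(d, "g", "v")
```

The point is only that the caller passes plain strings, so the same function works when the column names are not known until run time.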
seplyr is an
R package that makes it easy to program over dplyr.
To illustrate this we will work an example.
Continue reading Tutorial: Using seplyr to Program Over dplyr
The development version of my new
seplyr package is performing in practical applications with
dplyr 0.7.* much better than even I (the
seplyr package author) expected.
I think I have hit a very good set of trade-offs, and I have now spent significant time creating documentation and examples.
I wish there had been such a package weeks ago, and that I had started using this approach in my own client work at that time. If you are already a
dplyr user I strongly suggest trying
seplyr in your own analysis projects.
Please see here for details.
I have been writing a lot (too much) on
tidyeval lately. The reason is that major changes were recently announced. If you are going to use
dplyr well and correctly going forward you may need to understand some of the new issues (if you don’t use
dplyr you can safely skip all of this). I am trying to work out (publicly) how to best incorporate the new methods into:
- real world analyses,
- reusable packages,
- and teaching materials.
I think some of the apparent discomfort on my part comes from my feeling that
dplyr never really gave standard evaluation (SE) a fair chance. In my opinion:
dplyr is based strongly on non-standard evaluation (NSE, originally through
lazyeval and now through
tidyeval) more by taste and choice than by actual analyst benefit or need.
dplyr isn’t my package, so it isn’t my choice to make; but I can still have an informed opinion, which I will discuss below.
Continue reading dplyr 0.7 Made Simpler
For dplyr users, one of the promises of the new
tidyeval system is an improved ability to program over
dplyr itself, in particular to add new verbs that encapsulate previously compound steps into better self-documenting atomic steps.
Let’s take a look at this capability.
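As a rough sketch of what such a verb could look like, the function below wraps the compound group_by / summarize / ungroup sequence into one step. The name `group_summarize` and its signature are assumptions made for this illustration, and the grouping columns are passed as strings and spliced in with rlang's `syms()` and `!!!`.

```r
library(dplyr)

# Hypothetical verb: group, summarize, and ungroup as one atomic step.
# `group_cols` is a character vector of column names.
group_summarize <- function(.data, group_cols, ...) {
  .data %>%
    group_by(!!!rlang::syms(group_cols)) %>%
    summarize(...) %>%
    ungroup()
}

# Hypothetical example data
d <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 10),
                stringsAsFactors = FALSE)

res <- group_summarize(d, "g", total = sum(v))
```

Because the result is always ungrouped, later steps cannot be silently affected by leftover grouping.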
Continue reading Better Grouped Summaries in dplyr
In our latest R and Big Data article we discuss replyr.
replyr stands for REmote PLYing of big data for R.
Why should R users try
replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark).
replyr allows users to work with
Spark or database data similar to how they work with local
data.frames. Some key capability gaps remedied by replyr include:
- Summarizing data.
- Combining tables.
- Binding tables by row.
- Using the split/apply/combine pattern.
- Pivot/anti-pivot.
- Handle tracking.
- A join controller.
You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with
sparklyr much easier. Some of the above capabilities will likely come to the
tidyverse, but the above implementations are built purely on top of
dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).
Continue reading Working With R and Big Data: Use Replyr