To systematically avoid this result corruption, we suggest breaking up your
dplyr::mutate() statements so that they are dependency-free: never assign to the same column twice, and never use a column in the same mutate() in which it is created. We consider these key and critical precautions to take when using
dplyr with a database.
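As a minimal sketch of that advice (the columns v and w are hypothetical), the risky single mutate() can be split into dependency-free stages:

```r
library(dplyr)

d <- data.frame(x = 1:3)

# Risky on database backends: v is created and then used
# inside the same mutate().
# d %>% mutate(v = x + 1, w = v * 2)

# Safer: one derived column per mutate() stage, so no stage
# reads a column it also creates.
res <- d %>%
  mutate(v = x + 1) %>%
  mutate(w = v * 2)
```

On an in-memory data frame both forms agree; the split form is the one that translates reliably to SQL.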
A note to
dplyr-with-database users: you may benefit from inspecting and refactoring your code to eliminate column re-use inside
dplyr::mutate() statements.
We have been writing a lot on higher-order data transforms lately:
- Coordinatized Data: A Fluid Data Specification
- Data Wrangling at Scale
- Fluid Data
- Big Data Transforms.
What I want to do now is "write a bit more, so I finally feel I have been concise."
As part of our consulting practice, Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using
R and substantial data stores (relational databases such as
PostgreSQL, or big data systems such as Apache Spark).
Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher order data operators."
When trying to count rows using
dplyr-controlled data structures (remote
tbls, such as
dbplyr structures), one is sailing between Scylla and Charybdis: the task is to avoid
dplyr corner-cases and irregularities (a few of which I have attempted to document).
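As a sketch of one such irregularity (using an in-memory data frame as a stand-in for a remote tbl): nrow() on a remote tbl typically returns NA, so a portable row count is better computed inside the query itself:

```r
library(dplyr)

d <- data.frame(x = 1:5)  # stand-in for a remote tbl

# Computed in the query and returned as a one-cell result,
# this pattern behaves the same for local and remote sources.
n <- d %>% summarize(n = n()) %>% pull(n)
```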
While working on a large client project using
sparklyr and multinomial regression, we recently ran into a problem:
Apache Spark chooses the order of multinomial regression outcome targets, whereas
R users are used to choosing the order of the targets themselves (please see here for some details). So, to make things work as
R users expect, we need a way to translate one order to the other.
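One such translation (a sketch; the class labels and both orderings here are hypothetical) is just a permutation computed with match(), which can then reorder prediction columns into the user's order:

```r
# The order Spark happened to choose, and the order the R user wants.
spark_order <- c("setosa", "virginica", "versicolor")
user_order  <- c("setosa", "versicolor", "virginica")

# perm[i] gives the Spark column holding the i-th user-order class.
perm <- match(user_order, spark_order)

# Example: reorder columns of a probability matrix into user order.
probs <- matrix(c(0.7, 0.1, 0.2,
                  0.6, 0.3, 0.1),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, spark_order))
probs_user <- probs[, perm, drop = FALSE]
```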