Posted on Categories Coding, Programming, StatisticsTags , , Leave a comment on How to Avoid the dplyr Dependency Driven Result Corruption

How to Avoid the dplyr Dependency Driven Result Corruption

In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases.

To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any value in the same mutate it is formed). We consider these to be key and critical precautions to take when using dplyr with a database.

We would also like to point out we are also distributing free tools to do this automatically, and a worked example of this solution.

Posted on Categories Opinion, Programming, StatisticsTags , , 3 Comments on Please inspect your dplyr+database code

Please inspect your dplyr+database code

A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements. Continue reading Please inspect your dplyr+database code

Posted on Categories Coding, data science, Exciting Techniques, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , , , 1 Comment on Win-Vector LLC announces new “big data in R” tools

Win-Vector LLC announces new “big data in R” tools

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN):

  • partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of too long in-mutate dependence chains, of too many mutate steps, and incidental bugs; all explained in the linked tutorials).
  • if_else_device(): provides a dplyr::mutate() based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as often seen in porting from SAS) to be directly and legibly translated into performant dplyr::mutate() data flow code that works on Spark (via Sparklyr) and databases.


Blacksmith working

Image by Jeff Kubina from Columbia, Maryland – [1], CC BY-SA 2.0, Link

Continue reading Win-Vector LLC announces new “big data in R” tools

Posted on Categories Coding, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , 3 Comments on Vectorized Block ifelse in R

Vectorized Block ifelse in R

Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R.

From this experience we want to share how to simulate, in R with Apache Spark (via Sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure. Continue reading Vectorized Block ifelse in R

Posted on Categories data science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , 3 Comments on Arbitrary Data Transforms Using cdata

Arbitrary Data Transforms Using cdata

We have been writing a lot on higher-order data transforms lately:

Cdata

What I want to do now is "write a bit more, so I finally feel I have been concise."

Continue reading Arbitrary Data Transforms Using cdata

Posted on Categories Programming, Statistics, TutorialsTags , , , , , , 2 Comments on RStudio Keyboard Shortcuts for Pipes

RStudio Keyboard Shortcuts for Pipes

I have just released some simple RStudio add-ins that are great for creating keyboard shortcuts when working with pipes in R.

You can install the add-ins from here (which also includes both installation instructions and use instructions/examples).

RStudio Logo Blue Gradient

Wraprs BizarroPipe Logo

Posted on Categories Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , , Leave a comment on Data Wrangling at Scale

Data Wrangling at Scale

Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template).

Fd

Please check it out.

Posted on Categories Administrativia, Statistics, TutorialsTags , , , , , 8 Comments on Update on coordinatized or fluid data

Update on coordinatized or fluid data

We have just released a major update of the cdata R package to CRAN.

Cdata

If you work with R and data, now is the time to check out the cdata package. Continue reading Update on coordinatized or fluid data

Posted on Categories Coding, Opinion, Statistics, TutorialsTags , , , , Leave a comment on Let X=X in R

Let X=X in R

Our article "Let’s Have Some Sympathy For The Part-time R User" includes two points:

  • Sometimes you have to write parameterized or re-usable code.
  • The methods for doing this should be easy and legible.

The first point feels abstract, until you find yourself wanting to re-use code on new projects. As for the second point: I feel the wrapr package is the easiest, safest, most consistent, and most legible way to achieve maintainable code re-use in R.

In this article we will show how wrapr makes code-rewriting even easier with its new let x=x automation.


411gJqs4qlL

Let X=X

Continue reading Let X=X in R

Posted on Categories Coding, data science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , 1 Comment on Big Data Transforms

Big Data Transforms

As part of our consulting practice Win-Vector LLC has been helping a few clients stand-up advanced analytics and machine learning stacks using R and substantial data stores (such as relational database variants such as PostgreSQL or big data systems such as Spark).


IMG 6061 3

Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher order data operators."

Continue reading Big Data Transforms