Categories: Coding, Programming, Tutorials

Advisory on Multiple Assignment dplyr::mutate() on Databases

I currently advise R dplyr users to take care when using multiple assignment dplyr::mutate() commands on databases.



(image: Kingroyos, Creative Commons Attribution-Share Alike 3.0 Unported License)

In this note I exhibit a troublesome example, and a systematic solution.


Categories: Administrativia, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics

Getting started with seplyr

A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data, I think the seplyr package can be a valuable tool.



Categories: Coding, Programming, Statistics

How to Avoid the dplyr Dependency Driven Result Corruption

In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases.

To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements so that each one is dependency-free: it does not assign the same value twice, and it does not use any value in the same mutate() in which that value is formed. We consider these to be critical precautions to take when using dplyr with a database.

We would also like to point out that we are distributing free tools to do this automatically, along with a worked example of this solution.
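As a rough illustration of the dependency-free rule, here is a minimal base-R sketch. The helper partition_assignments() is hypothetical and written only for this note; it is not the packaged tool mentioned above, and it assumes each assignment is given as a string of R code in a named character vector. It greedily starts a new stage whenever a step would read or re-assign a name already assigned in the current stage.

```r
# Hypothetical helper (an assumption for illustration, not a seplyr or dplyr function):
# split an ordered, named character vector of assignments into stages such that
# no stage assigns a name twice or uses a name assigned earlier in the same stage.
partition_assignments <- function(exprs) {
  stages <- list()
  current <- character(0)
  for (i in seq_along(exprs)) {
    target <- names(exprs)[[i]]
    used <- all.vars(parse(text = exprs[[i]]))
    # Conflict: this step re-assigns or reads a name formed in the current stage.
    conflict <- (target %in% names(current)) || any(used %in% names(current))
    if (conflict) {
      stages[[length(stages) + 1]] <- current
      current <- character(0)
    }
    current <- c(current, exprs[i])
  }
  if (length(current) > 0) {
    stages[[length(stages) + 1]] <- current
  }
  stages
}

# Example: "b" depends on "a", and the last step re-assigns "a" using "c".
steps <- c(a = "x + 1", b = "a * 2", c = "x - 1", a = "a + c")
stages <- partition_assignments(steps)
print(stages)
```

Each resulting stage could then be issued as its own mutate(), so that no single mutate() contains an internal dependency.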

Categories: Coding, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

Vectorized Block ifelse in R

Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R.

From this experience we want to share how to simulate, in R with Apache Spark (via sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure.
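To make the idea concrete, here is a minimal sketch on a local data frame using only base R. This is an assumption-laden illustration of what a vectorized block if/else means (one row-wise condition controlling several column assignments), not the Spark-side method from the full article; the column names and the threshold are invented for the example.

```r
# Illustrative data; the columns x, a, and b are assumptions for this sketch.
d <- data.frame(x = c(1, 5, 10))

# Evaluate the block condition once per row...
cond <- d$x > 4

# ...then each assignment inside the "block" becomes one vectorized ifelse().
d$a <- ifelse(cond, "high", "low")  # if-branch value, else-branch value
d$b <- ifelse(cond, d$x * 2, d$x)   # same condition, different column

print(d)
```

Computing the condition once and reusing it keeps all assignments in the block consistent with each other.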

Categories: Exciting Techniques, Programming, Statistics, Tutorials

Neat New seplyr Feature: String Interpolation

The R package seplyr has a neat new feature: the function seplyr::expand_expr() which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string.



This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods.
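For readers who want a feel for the general idea without installing anything, here is a small base-R sketch. It is only an analogy: interpolate_names() is a hypothetical helper written for this note, it is not seplyr::expand_expr(), and it covers just one part of the behavior described above (replacing variables that hold column names, then returning the expression as a single string).

```r
# Hypothetical helper (an assumption, not the seplyr API): replace placeholder
# symbols in a quoted expression with the column names they stand for, and
# return the resulting expression as one string.
interpolate_names <- function(expr, mapping) {
  syms <- lapply(mapping, as.name)             # strings -> symbols
  filled <- do.call(substitute, list(expr, syms))
  paste(deparse(filled), collapse = " ")       # single string result
}

# "num" and "den" are placeholders; the column names are illustrative.
res <- interpolate_names(
  quote(num / den > 0.5),
  list(num = "Sepal.Length", den = "Sepal.Width")
)
print(res)
```

The returned string can then be parsed or passed along to any interface that accepts expressions as text.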

Categories: Programming, Statistics, Tutorials

Some Neat New R Notations

The R package wrapr supplies a few neat new coding notations.



An Abacus, which gives us the term “calculus.”


Categories: Opinion, Programming, Statistics

Let’s Have Some Sympathy For The Part-time R User

When I started writing about methods for better "parametric programming" interfaces for dplyr in December of 2016, I encountered three divisions in the audience of R dplyr users:

  • dplyr users who had such a need, and wanted such extensions.
  • dplyr users who did not have such a need ("we always know the column names").
  • dplyr users who found the then-current fairly complex "underscore" and lazyeval system sufficient for the task.

Needing name substitution is a problem an advanced full-time R user can solve on their own. However, a part-time R user would greatly benefit from a simple, reliable, readable, documented, and comprehensible packaged solution.
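To show what "name substitution" means in practice, here is a minimal base-R sketch in which the column names arrive as strings rather than being written into the code. The function mean_by() and the example columns are hypothetical, chosen only for illustration; this is not the packaged solution discussed in the full article.

```r
# Hypothetical helper: compute a grouped mean where both the grouping column
# and the value column are supplied as strings (names not known in advance).
mean_by <- function(d, group_col, value_col) {
  res <- tapply(d[[value_col]], d[[group_col]], mean)
  data.frame(
    group = names(res),
    mean = as.numeric(res),
    stringsAsFactors = FALSE
  )
}

# Illustrative data; "g" and "v" stand in for user-chosen column names.
d <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 10), stringsAsFactors = FALSE)
out <- mean_by(d, group_col = "g", value_col = "v")
print(out)
```

The same function works unchanged on any data frame, because the column names are parameters instead of hard-coded symbols.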

Categories: Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

More documentation for Win-Vector R packages

The Win-Vector public R packages now all have new pkgdown documentation sites! (And, a thank-you to Hadley Wickham for developing the pkgdown tool.)

Please check them out (hint: vtreat is our favorite).


Categories: Coding, data science, Opinion, Programming, Statistics, Tutorials

Tutorial: Using seplyr to Program Over dplyr

seplyr is an R package that makes it easy to program over dplyr 0.7.*.

To illustrate this we will work an example.


Categories: Administrativia, Exciting Techniques, Statistics, Tutorials

seplyr update

The development version of my new R package seplyr is performing in practical applications with dplyr 0.7.* much better than even I (the seplyr package author) expected.

I think I have hit a very good set of trade-offs, and I have now spent significant time creating documentation and examples.

I wish there had been such a package weeks ago, and that I had started using this approach in my own client work at that time. If you are already a dplyr user I strongly suggest trying seplyr in your own analysis projects.

Please see here for details.