Posted on Categories Administrativia, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Programming, StatisticsTags , , , , , , , , 4 Comments on Getting started with seplyr

Getting started with seplyr

A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working R and big data I think the seplyr package can be a valuable tool.


Safety
Continue reading Getting started with seplyr

Posted on Categories Coding, Programming, StatisticsTags , , Leave a comment on How to Avoid the dplyr Dependency Driven Result Corruption

How to Avoid the dplyr Dependency Driven Result Corruption

In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases.

To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any value in the same mutate it is formed). We consider these to be key and critical precautions to take when using dplyr with a database.

We would also like to point out we are also distributing free tools to do this automatically, and a worked example of this solution.

Posted on Categories Opinion, Programming, StatisticsTags , , 3 Comments on Please inspect your dplyr+database code

Please inspect your dplyr+database code

A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements. Continue reading Please inspect your dplyr+database code

Posted on Categories data science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , 3 Comments on Arbitrary Data Transforms Using cdata

Arbitrary Data Transforms Using cdata

We have been writing a lot on higher-order data transforms lately:

Cdata

What I want to do now is "write a bit more, so I finally feel I have been concise."

Continue reading Arbitrary Data Transforms Using cdata

Posted on Categories Programming, Statistics, TutorialsTags , , , , , , 2 Comments on RStudio Keyboard Shortcuts for Pipes

RStudio Keyboard Shortcuts for Pipes

I have just released some simple RStudio add-ins that are great for creating keyboard shortcuts when working with pipes in R.

You can install the add-ins from here (which also includes both installation instructions and use instructions/examples).

RStudio Logo Blue Gradient

Wraprs BizarroPipe Logo

Posted on Categories Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , ,

Data Wrangling at Scale

Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template).

Fd

Please check it out.

Posted on Categories Coding, data science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , 1 Comment on Big Data Transforms

Big Data Transforms

As part of our consulting practice Win-Vector LLC has been helping a few clients stand-up advanced analytics and machine learning stacks using R and substantial data stores (such as relational database variants such as PostgreSQL or big data systems such as Spark).


IMG 6061 3

Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher order data operators."

Continue reading Big Data Transforms

Posted on Categories Opinion, Programming, StatisticsTags , , , , 2 Comments on It is Needlessly Difficult to Count Rows Using dplyr

It is Needlessly Difficult to Count Rows Using dplyr

  • Question: how hard is it to count rows using the R package dplyr?
  • Answer: surprisingly difficult.

When trying to count rows using dplyr or dplyr controlled data-structures (remote tbls such as Sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task being to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "dplyr inferno").



800px Johann Heinrich F├╝ssli 054

Continue reading It is Needlessly Difficult to Count Rows Using dplyr

Posted on Categories data science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , ,

Permutation Theory In Action

While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.

Providing good solutions to gaps like this is one of the thing Win-Vector LLC does both in our consulting and training practices.

Continue reading Permutation Theory In Action