Posted on Categories Coding, Opinion, Pragmatic Data Science, Statistics, TutorialsTags , , , , , , , Leave a comment on R Tip: Think in Terms of Values

R Tip: Think in Terms of Values

R tip: first organize your tasks in terms of data, values, and desired transformation of values, not initially in terms of concrete functions or code.

I know I write a lot about coding in R. But it is in the service of supporting statistics, analysis, predictive analytics, and data science.

R without data is like going to the theater to watch the curtain go up and down.

(Adapted from Ben Katchor’s Julius Knipl, Real Estate Photographer: Stories, Little, Brown, and Company, 1996, page 72, “Excursionist Drama 2”.)

Usually you come to R to work with data. If you think and plan in terms of data and values (including introducing more data to control processing) you will usually work in much faster, explainable, and maintainable fashion.

Continue reading R Tip: Think in Terms of Values

Posted on Categories Coding, Opinion, Statistics, TutorialsTags , , , , , , , 1 Comment on R Tip: Use let() to Re-Map Names

R Tip: Use let() to Re-Map Names

Another R tip. Need to replace a name in some R code or make R code re-usable? Use wrapr::let().



Continue reading R Tip: Use let() to Re-Map Names

Posted on Categories Coding, Opinion, Statistics, TutorialsTags , , , , , , , 13 Comments on R Tip: Break up Function Nesting for Legibility

R Tip: Break up Function Nesting for Legibility

There are a number of easy ways to avoid illegible code nesting problems in R.

In this R tip we will expand upon the above statement with a simple example.

Continue reading R Tip: Break up Function Nesting for Legibility

Posted on Categories Coding, Statistics, TutorialsTags , , , , 1 Comment on R Tip: Make Arguments Explicit in magrittr/dplyr Pipelines

R Tip: Make Arguments Explicit in magrittr/dplyr Pipelines

I think this is the R Tip that is going to be the most controversial yet. Its potential pitfalls include: it is a style prescription (which makes it different than and less immediately useful than something of the nature of R Tip: Force Named Arguments), and it is heterodox (this is not how magrittr/dplyr is taught by the original authors, and not how it is commonly used). However, I have not been at all good at anticipating which tips get which sort of reception (and this valuable feedback, public and private, is part of what I get of this series).

On to the tip (which only applies if you are a magrittr pipeline user).

R tip: when using magrittr pipelines consider making them more explicit, and more readable (especially to novices) by using explicit dot-arguments throughout.

Continue reading R Tip: Make Arguments Explicit in magrittr/dplyr Pipelines

Posted on Categories Coding, data science, Programming, StatisticsTags , , , , , , , 12 Comments on Is 10,000 Cells Big?

Is 10,000 Cells Big?

Trick question: is a 10,000 cell numeric data.frame big or small?

In the era of "big data" 10,000 cells is minuscule. Such data could be fit on fewer than 1,000 punched cards (or less than half a box).


Punch card

The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later.

Continue reading Is 10,000 Cells Big?

Posted on Categories Exciting Techniques, Programming, Statistics, TutorialsTags , , , , , 4 Comments on Supercharge your R code with wrapr

Supercharge your R code with wrapr

I would like to demonstrate some helpful wrapr R notation tools that really neaten up your R code.


1968 AMX blown and tubbed e

Img: Christopher Ziemnowicz.

Continue reading Supercharge your R code with wrapr

Posted on Categories Coding, Programming, TutorialsTags , , 3 Comments on Advisory on Multiple Assignment dplyr::mutate() on Databases

Advisory on Multiple Assignment dplyr::mutate() on Databases

I currently advise R dplyr users to take care when using multiple assignment dplyr::mutate() commands on databases.


Unknown

(image: Kingroyos, Creative Commons Attribution-Share Alike 3.0 Unported License)

In this note I exhibit a troublesome example, and a systematic solution.

Continue reading Advisory on Multiple Assignment dplyr::mutate() on Databases

Posted on Categories Coding, Computer Science, data science, Opinion, Programming, Statistics, TutorialsTags , , , , 14 Comments on Base R can be Fast

Base R can be Fast

“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”

The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.

Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.

Unnamed chunk 2 1 Continue reading Base R can be Fast

Posted on Categories Administrativia, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Programming, StatisticsTags , , , , , , , , 4 Comments on Getting started with seplyr

Getting started with seplyr

A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working R and big data I think the seplyr package can be a valuable tool.


Safety
Continue reading Getting started with seplyr

Posted on Categories Coding, Programming, StatisticsTags , ,

How to Avoid the dplyr Dependency Driven Result Corruption

In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases.

To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any value in the same mutate it is formed). We consider these to be key and critical precautions to take when using dplyr with a database.

We would also like to point out we are also distributing free tools to do this automatically, and a worked example of this solution.