Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , , , , 1 Comment on Join Dependency Sorting

Join Dependency Sorting

In our latest installment of “R and big data” let’s again discuss the task of left joining many tables from a data warehouse using R and a system called "a join controller" (last discussed here).

One of the great advantages to specifying complicated sequences of operations in data (rather than in code) is: it is often easier to transform and extend data. Explicit rich data beats vague convention and complicated code.

Continue reading Join Dependency Sorting

Posted on Categories Coding, Programming, Statistics, TutorialsTags , , ,

wrapr Implementation Update

Introduction

The development version CRAN version of our R helper function wrapr::let() has switched from string-based substitution to abstract syntax tree based substitution (AST based substitution, or language based substitution).

Wraprs

I am looking for some feedback from wrapr::let() users already doing substantial work with wrapr::let(). If you are already using wrapr::let() please test if the current development version of wrapr works with your code. If you run into problems: I apologize, and please file a GitHub issue.

Continue reading wrapr Implementation Update

Posted on Categories Coding, data science, Opinion, Programming, Statistics, TutorialsTags , , , , , , , , , , 10 Comments on Non-Standard Evaluation and Function Composition in R

Non-Standard Evaluation and Function Composition in R

In this article we will discuss composing standard-evaluation interfaces (SE: parametric, referentially transparent, or “looks only at values”) and composing non-standard-evaluation interfaces (NSE) in R.

In R the package tidyeval/rlang is a tool for building domain specific languages intended to allow easier composition of NSE interfaces.

To use it you must know some of its structure and notation. Here are some details paraphrased from the major tidyeval/rlang client, the package dplyr: vignette('programming', package = 'dplyr')).

  • ":=" is needed to make left-hand-side re-mapping possible (adding yet another "more than one assignment type operator running around" notation issue).
  • "!!" substitution requires parenthesis to safely bind (so the notation is actually "(!! )", not "!!").
  • Left-hand-sides of expressions are names or strings, while right-hand-sides are quosures/expressions.

Continue reading Non-Standard Evaluation and Function Composition in R

Posted on Categories Opinion, Rants, Statistics, TutorialsTags , , 1 Comment on An easy way to accidentally inflate reported R-squared in linear regression models

An easy way to accidentally inflate reported R-squared in linear regression models

Here is an absolutely horrible way to confuse yourself and get an inflated reported R-squared on a simple linear regression model in R.

We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here. Continue reading An easy way to accidentally inflate reported R-squared in linear regression models

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 4 Comments on Use a Join Controller to Document Your Work

Use a Join Controller to Document Your Work

This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one our big data courses).

Continue reading Use a Join Controller to Document Your Work

Posted on Categories Applications, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , ,

Managing intermediate results when using R/sparklyr

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.


NewImage
Continue reading Managing intermediate results when using R/sparklyr

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , 1 Comment on Managing Spark data handles in R

Managing Spark data handles in R

When working with big data with R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data_frame.


5465544053 8b626a09c8 b

Please read on for our handy hints on keeping your data handles neat. Continue reading Managing Spark data handles in R

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , 5 Comments on New series: R and big data (concentrating on Spark and sparklyr)

New series: R and big data (concentrating on Spark and sparklyr)

Win-Vector LLC has recently been teaching how to use R with big data through Spark and sparklyr. We have also been helping clients become productive on R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big-data using R, share some hints, and even admit to some of the warts found in this combination of systems.

The ability to perform sophisticated analyses and modeling on “big data” with R is rapidly improving, and this is the time for businesses to invest in the technology. Win-Vector can be your key partner in methodology development and training (through our consulting and training practices).


We Can Do It

J. Howard Miller, 1943.

The field is exciting, rapidly evolving, and even a touch dangerous. We invite you to start using Spark through R and are starting a new series of articles tagged “R and big data” to help you produce production quality solutions quickly.

Please read on for a brief description of our new articles series: “R and big data.” Continue reading New series: R and big data (concentrating on Spark and sparklyr)

Posted on Categories Computer Science, Expository Writing, Programming, TutorialsTags , , , , , , 1 Comment on On indexing operators and composition

On indexing operators and composition

In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a very deep understanding of array indexing and the deep interpretation available when we realize indexing is an instance of function composition (or an example of permutation groups or semigroups: some very deep yet accessible pure mathematics).


NewImage

A permutation of indices

In this article I will be working hard to convince you a very fundamental true statement is in fact true: array indexing is associative; and to simultaneously convince you that you should still consider this amazing (as it is a very strong claim with very many consequences). Array indexing respecting associative transformations should not be a-priori intuitive to the general programmer, as array indexing code is rarely re-factored or transformed, so programmers tend to have little experience with the effect. Consider this article an exercise to build the experience to make this statement a posteriori obvious, and hence something you are more comfortable using and relying on.

R‘s array indexing notation is really powerful, so we will use it for our examples. This is going to be long (because I am trying to slow the exposition down enough to see all the steps and relations) and hard to follow without working examples (say with R), and working through the logic with pencil and a printout (math is not a spectator sport). I can’t keep all the steps in my head without paper, so I don’t really expect readers to keep all the steps in their heads without paper (though I have tried to organize the flow of this article and signal intent often enough to make this readable). Continue reading On indexing operators and composition

Posted on Categories Opinion, Statistics, TutorialsTags , , , 2 Comments on dplyr in Context

dplyr in Context

Introduction

Beginning R users often come to the false impression that the popular packages dplyr and tidyr are both all of R and sui generis inventions (in that they might be unprecedented and there might no other reasonable way to get the same effects in R). These packages and their conventions are high-value, but they are results of evolution and implement a style of programming that has been available in R for some time. They evolved in a context, and did not burst on the scene fully armored with spear in hand.

Continue reading dplyr in Context