Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, StatisticsTags , 4 Comments on Supervised Learning in R: Regression

Supervised Learning in R: Regression

We are very excited to announce a new (paid) Win-Vector LLC video training course: Supervised Learning in R: Regression now available on DataCamp

Shield image course 3851 20170725 24872 3f982z Continue reading Supervised Learning in R: Regression

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , , , 1 Comment on More documentation for Win-Vector R packages

More documentation for Win-Vector R packages

The Win-Vector public R packages now all have new pkgdown documentation sites! (And, a thank-you to Hadley Wickham for developing the pkgdown tool.)

Please check them out (hint: vtreat is our favorite).

NewImage Continue reading More documentation for Win-Vector R packages

Posted on Categories Coding, data science, Opinion, Programming, Statistics, TutorialsTags , , , , 13 Comments on Tutorial: Using seplyr to Program Over dplyr

Tutorial: Using seplyr to Program Over dplyr

seplyr is an R package that makes it easy to program over dplyr 0.7.*.

To illustrate this we will work an example.

Continue reading Tutorial: Using seplyr to Program Over dplyr

Posted on Categories data science, Opinion, Programming, Statistics, TutorialsTags , , , , 12 Comments on dplyr 0.7 Made Simpler

dplyr 0.7 Made Simpler

I have been writing a lot (too much) on the R topics dplyr/rlang/tidyeval lately. The reason is: major changes were recently announced. If you are going to use dplyr well and correctly going forward you may need to understand some of the new issues (if you don’t use dplyr you can safely skip all of this). I am trying to work out (publicly) how to best incorporate the new methods into:

  • real world analyses,
  • reusable packages,
  • and teaching materials.

I think some of the apparent discomfort on my part comes from my feeling that dplyr never really gave standard evaluation (SE) a fair chance. In my opinion: dplyr is based strongly on non-standard evaluation (NSE, originally through lazyeval and now through rlang/tidyeval) more by the taste and choice than by actual analyst benefit or need. dplyr isn’t my package, so it isn’t my choice to make; but I can still have an informed opinion, which I will discuss below.

Continue reading dplyr 0.7 Made Simpler

Posted on Categories data science, Statistics, TutorialsTags , , , 10 Comments on Better Grouped Summaries in dplyr

Better Grouped Summaries in dplyr

For R dplyr users one of the promises of the new rlang/tidyeval system is an improved ability to program over dplyr itself. In particular to add new verbs that encapsulate previously compound steps into better self-documenting atomic steps.

Let’s take a look at this capability.

Continue reading Better Grouped Summaries in dplyr

Posted on Categories data science, Opinion, StatisticsTags , , , , , 2 Comments on Working With R and Big Data: Use Replyr

Working With R and Big Data: Use Replyr

In our latest R and Big Data article we discuss replyr.

Why replyr

replyr stands for REmote PLYing of big data for R.

Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark).

replyr allows users to work with Spark or database data similar to how they work with local data.frames. Some key capability gaps remedied by replyr include:

  • Summarizing data: replyr_summary().
  • Combining tables: replyr_union_all().
  • Binding tables by row: replyr_bind_rows().
  • Using the split/apply/combine pattern (dplyr::do()): replyr_split(), replyr::gapply().
  • Pivot/anti-pivot (gather/spread): replyr_moveValuesToRows()/ replyr_moveValuesToColumns().
  • Handle tracking.
  • A join controller.

You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with Spark and sparklyr much easier. Some of the above capabilities will likely come to the tidyverse, but the above implementations are build purely on top of dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).

Continue reading Working With R and Big Data: Use Replyr

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , , , , 1 Comment on Join Dependency Sorting

Join Dependency Sorting

In our latest installment of “R and big data” let’s again discuss the task of left joining many tables from a data warehouse using R and a system called "a join controller" (last discussed here).

One of the great advantages to specifying complicated sequences of operations in data (rather than in code) is: it is often easier to transform and extend data. Explicit rich data beats vague convention and complicated code.

Continue reading Join Dependency Sorting

Posted on Categories Coding, data science, Opinion, Programming, Statistics, TutorialsTags , , , , , , , , , , 10 Comments on Non-Standard Evaluation and Function Composition in R

Non-Standard Evaluation and Function Composition in R

In this article we will discuss composing standard-evaluation interfaces (SE: parametric, referentially transparent, or “looks only at values”) and composing non-standard-evaluation interfaces (NSE) in R.

In R the package tidyeval/rlang is a tool for building domain specific languages intended to allow easier composition of NSE interfaces.

To use it you must know some of its structure and notation. Here are some details paraphrased from the major tidyeval/rlang client, the package dplyr: vignette('programming', package = 'dplyr')).

  • ":=" is needed to make left-hand-side re-mapping possible (adding yet another "more than one assignment type operator running around" notation issue).
  • "!!" substitution requires parenthesis to safely bind (so the notation is actually "(!! )", not "!!").
  • Left-hand-sides of expressions are names or strings, while right-hand-sides are quosures/expressions.

Continue reading Non-Standard Evaluation and Function Composition in R

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 4 Comments on Use a Join Controller to Document Your Work

Use a Join Controller to Document Your Work

This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one our big data courses).

Continue reading Use a Join Controller to Document Your Work

Posted on Categories data science, Opinion, StatisticsTags , , , Leave a comment on R summary() got better!

R summary() got better!

Here is a really nice feature found in the current 3.4.0 version of R: summary() has become a lot more reasonable.

summary(15555)

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   15555   15555   15555   15555   15555   15555 

Please read on for some background. Continue reading R summary() got better!