Thank You For The Very Nice Comment

Somebody nice reached out and gave us this wonderful feedback on our new Supervised Learning in R: Regression (paid) video course.

Thanks for a wonderful course on DataCamp on XGBoost and Random forest. I was struggling with Xgboost earlier and Vtreat has made my life easy now :).

Let’s Have Some Sympathy For The Part-time R User

When I started writing about methods for better "parametric programming" interfaces for dplyr for R users in December of 2016, I encountered three divisions in the audience:

  • dplyr users who had such a need, and wanted such extensions.
  • dplyr users who did not have such a need ("we always know the column names").
  • dplyr users who found the then-current fairly complex "underscore" and lazyeval system sufficient for the task.

Needing name substitution is a problem an advanced full-time R user can solve on their own. However, a part-time R user would greatly benefit from a simple, reliable, readable, documented, and comprehensible packaged solution.

Tutorial: Using seplyr to Program Over dplyr

seplyr is an R package that makes it easy to program over dplyr 0.7.*.

To illustrate this we will work an example.

dplyr 0.7 Made Simpler

I have been writing a lot (too much) on the R topics dplyr/rlang/tidyeval lately. The reason is: major changes were recently announced. If you are going to use dplyr well and correctly going forward you may need to understand some of the new issues (if you don’t use dplyr you can safely skip all of this). I am trying to work out (publicly) how to best incorporate the new methods into:

  • real world analyses,
  • reusable packages,
  • and teaching materials.

I think some of the apparent discomfort on my part comes from my feeling that dplyr never really gave standard evaluation (SE) a fair chance. In my opinion: dplyr is based strongly on non-standard evaluation (NSE, originally through lazyeval and now through rlang/tidyeval) more by taste and choice than by actual analyst benefit or need. dplyr isn’t my package, so it isn’t my choice to make; but I can still have an informed opinion, which I will discuss below.

In praise of syntactic sugar

There has been some talk of adding native pipe notation to R (for example here, here, and here). And even a tidyeval/rlang pipe here.

I think a critical aspect of such an extension would be to treat such a notation as syntactic sugar and not insist that such a pipe match magrittr semantics, or worse yet give a platform for authors to insert their own preferred ad-hoc semantics.
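As a minimal base-R sketch of what "pipe as syntactic sugar" could mean, the hypothetical operator `%then%` below rewrites `a %then% f(b)` into the ordinary call `f(a, b)` and evaluates that call in the caller's environment, adding no semantics of its own. It is an illustration only: it is not magrittr's implementation, and it is not any of the proposed native notations.

```r
# Hypothetical operator, for illustration only.
`%then%` <- function(lhs, rhs) {
  lhs_expr <- substitute(lhs)   # unevaluated left-hand side
  rhs_call <- substitute(rhs)   # unevaluated right-hand side, e.g. f(b)
  if (!is.call(rhs_call)) {
    stop("right-hand side must be a function call such as f()")
  }
  # Rebuild f(b) as f(a, b): the function, then lhs, then the original arguments.
  new_call <- as.call(c(list(rhs_call[[1]], lhs_expr), as.list(rhs_call)[-1]))
  eval(new_call, parent.frame())
}

piped  <- c(1, 4, 9) %then% sqrt() %then% sum()
direct <- sum(sqrt(c(1, 4, 9)))
print(identical(piped, direct))
```

Because the operator only rearranges the expression, the piped form and the nested-call form are the same computation.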

Working With R and Big Data: Use Replyr

In our latest R and Big Data article we discuss replyr.

Why replyr

replyr stands for REmote PLYing of big data for R.

Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark).

replyr allows users to work with Spark or database data similar to how they work with local data.frames. Some key capability gaps remedied by replyr include:

  • Summarizing data: replyr_summary().
  • Combining tables: replyr_union_all().
  • Binding tables by row: replyr_bind_rows().
  • Using the split/apply/combine pattern (dplyr::do()): replyr_split(), replyr::gapply().
  • Pivot/anti-pivot (gather/spread): replyr_moveValuesToRows(), replyr_moveValuesToColumns().
  • Handle tracking.
  • A join controller.

You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with Spark and sparklyr much easier. Some of the above capabilities will likely come to the tidyverse, but the above implementations are built purely on top of dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).

Using wrapr::let() with tidyeval

While going over some of the discussion related to my last post I came up with a really neat way to use wrapr::let() and rlang/tidyeval together.

Please read on to see the situation and example.

Please Consider Using wrapr::let() for Replacement Tasks

From dplyr issue 2916.

The following appears to work.

suppressPackageStartupMessages(library("dplyr"))

COL <- "homeworld"
starwars %>%
  group_by(.data[[COL]]) %>%
  head(n=1)
## # A tibble: 1 x 14
## # Groups:   COL [1]
##             name height  mass hair_color skin_color eye_color birth_year
##            <chr>  <int> <dbl>      <chr>      <chr>     <chr>      <dbl>
## 1 Luke Skywalker    172    77      blond       fair      blue         19
## # ... with 7 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, COL <chr>

Though notice it reports the grouping is by "COL", not by "homeworld". Also the data set now has 14 columns, not the original 13 from the starwars data set.
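The underlying task is name replacement: the code is written against a placeholder (COL) and the real column name is swapped in before evaluation. A minimal base-R sketch of that idea follows, using bquote() and a small made-up data frame rather than starwars. It only illustrates the replacement approach and is not how wrapr::let() or dplyr are implemented.

```r
# Small stand-in data frame (made up for this example).
d <- data.frame(
  homeworld = c("Tatooine", "Naboo", "Tatooine"),
  height    = c(172, 96, 202)
)

COL <- "homeworld"

# Splice the real column name into the expression before evaluating it.
expr <- bquote(aggregate(height ~ .(as.name(COL)), data = d, FUN = mean))
print(expr)

res <- eval(expr)
print(names(res))
```

Because the substitution happens before evaluation, the grouping column in the result is named homeworld rather than COL, and no extra column is added.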

Non-Standard Evaluation and Function Composition in R

In this article we will discuss composing standard-evaluation interfaces (SE: parametric, referentially transparent, or “looks only at values”) and composing non-standard-evaluation interfaces (NSE) in R.
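To make the distinction concrete, here is a small base-R sketch (not taken from the article) contrasting the two styles. subset() is a standard base-R NSE interface, while filter_above() is a hypothetical SE helper written only for this example.

```r
d <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# NSE interface: subset() captures the unevaluated expression `x > 1`
# and evaluates it with the columns of d in scope.
nse_result <- subset(d, x > 1)

# SE interface: the column is named by an ordinary string value,
# so the name can be stored in a variable and passed around.
col <- "x"
se_result <- d[d[[col]] > 1, , drop = FALSE]

# SE interfaces compose like ordinary functions (hypothetical helper).
filter_above <- function(data, column_name, threshold) {
  data[data[[column_name]] > threshold, , drop = FALSE]
}
composed <- filter_above(filter_above(d, "x", 1), "y", 20)
print(composed)
```

The SE helper can be called in a loop over column names or wrapped by another function without any quoting machinery. Wrapping the NSE call needs extra care to pass the captured expression through.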

In R the package tidyeval/rlang is a tool for building domain specific languages intended to allow easier composition of NSE interfaces.

To use it you must know some of its structure and notation. Here are some details paraphrased from the major tidyeval/rlang client, the package dplyr (vignette('programming', package = 'dplyr')):

  • ":=" is needed to make left-hand-side re-mapping possible (adding yet another "more than one assignment type operator running around" notation issue).
  • "!!" substitution requires parentheses to safely bind (so the notation is actually "(!! )", not "!!").
  • Left-hand-sides of expressions are names or strings, while right-hand-sides are quosures/expressions.
