The secret is out: Nina Zumel and I are busy working on Practical Data Science with R2, the second edition of our best-selling book on learning data science using the R language.
Our publisher, Manning, has a great slide deck describing the book (and a discount code!!!) here:
We also just got back our part-1 technical review for the new book. Here is a quote from the technical review we are particularly proud of:
The dot notation for base R and the dplyr package did make me stand up and think. Certain things suddenly made sense.
Continue reading Practical Data Science with R2
In August of 2003 Thomas Lumley added bquote() to R 1.8.1. This gave R users an explicit Lisp-style quasiquotation capability.
bquote() and quasiquotation are actually quite powerful. Professor Thomas Lumley should get, and should continue to receive, a lot of credit and thanks for introducing the concept into R.
bquote() is already powerful enough to build a version of dplyr 0.5.0 with quasiquotation semantics quite close (from a user perspective) to what is now claimed in
Let’s take a look at that.
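As a small illustration of the idea (a minimal base R sketch, not the code from the full article): bquote() quotes an expression but evaluates anything wrapped in .(), substituting the result into the quoted code. The variable names below are made up for the example.

```r
# bquote() quotes its argument, except for sub-expressions wrapped in .(),
# which are evaluated and spliced into the resulting expression.
threshold <- 5
expr <- bquote(x > .(threshold))   # builds the unevaluated call: x > 5

# The spliced expression can be evaluated later against whatever x is in scope.
x <- c(3, 7, 9)
result <- eval(expr)

# The same mechanism can splice in a symbol, for example a column name
# held in a variable (here a column of the built-in mtcars data set).
col <- as.name("mpg")
agg <- bquote(mean(.(col)))        # builds the unevaluated call: mean(mpg)
mean_mpg <- eval(agg, mtcars)

print(expr)
print(result)
print(agg)
print(mean_mpg)
```

The point of the sketch is that the value of threshold and the symbol in col are fixed at the time the expression is built, while the rest of the code stays unevaluated until eval() is called.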
Continue reading Quasiquotation in R via bquote()
In our wrapr pipe RJournal article we used piping into ggplot2 layers/geoms/items as an example.
Being able to use the same pipe operator for data processing steps and for ggplot2 layering is a question that comes up from time to time (for example: Why can’t ggplot2 use %>%?). In fact the primary ggplot2 package author wishes that magrittr piping was the composing notation for ggplot2 (though it is obviously too late to change).
There are some fundamental difficulties in trying to use the magrittr pipe in such a way. In particular, magrittr looks for its own pipe by name in un-evaluated code, and thus is difficult to engineer over (though it can be hacked around). The general concept is: pipe stages are usually functions or function calls, and ggplot2 components are objects (verbs versus nouns); at first these seem incompatible.
The wrapr dot-arrow-pipe was designed to handle such distinctions.
Let’s work an example.
Continue reading Piping into ggplot2
Saghir Bashir of ilustat recently shared a nice getting started with
In addition they were generous enough to link to Dirk Eddelbuettel’s later adaptation of the guide to use
This type of cooperation and user choice is what keeps the R community vital. Please encourage it. (Heck, please insist on it!)
According to a KDD poll fewer respondents (by rate) used only R in 2017 than in 2016. At the same time more respondents (by rate) used only Python in 2017 than in 2016.
Let’s take this as an excuse to take a quick look at what happens when we try a task in both systems.
Continue reading Running the Same Task in Python and R
Let’s take a quick look at a very important and common experimental problem: checking if the difference in success rates of two Binomial experiments is statistically significant. This can arise in A/B testing situations such as online advertising, sales, and manufacturing.
We already share a free video course on a Bayesian treatment of planning and evaluating A/B tests (including a free Shiny application). Let’s now take a look at the should-be-simple task of building a summary statistic that includes a classic frequentist significance.
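One plausible way to get such a summary in base R is sketched below. This is an assumption about a reasonable approach, not necessarily the calculation used in the full article, and the counts are invented purely for illustration. It uses stats::prop.test() (a test of equal proportions) alongside stats::fisher.test() (an exact test on the 2x2 table) so the two p-values can be compared.

```r
# Hypothetical A/B counts, invented purely for illustration.
ab <- data.frame(
  group     = c("A", "B"),
  successes = c(82, 110),
  trials    = c(1000, 1000)
)
ab$rate <- ab$successes / ab$trials

# Two-sample test of equal proportions
# (chi-squared approximation, continuity correction on by default).
prop_res <- prop.test(x = ab$successes, n = ab$trials)

# Fisher's exact test on the 2x2 table of successes and failures.
tab <- cbind(success = ab$successes, failure = ab$trials - ab$successes)
fisher_res <- fisher.test(tab)

# Collect the observed rates, their difference, and both p-values.
summary_stat <- list(
  rates       = setNames(ab$rate, ab$group),
  difference  = ab$rate[2] - ab$rate[1],
  p_prop_test = prop_res$p.value,
  p_fisher    = fisher_res$p.value
)
print(summary_stat)
```

Reporting the observed rates next to the p-value keeps the effect size visible; a small p-value alone says nothing about whether the difference is large enough to matter.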
Continue reading Quick Significance Calculations for A/B Tests in R
vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).
vtreat can now also effectively prepare data for multi-class classification or multinomial modeling.
Continue reading Modeling multi-category Outcomes With vtreat
Our interference from the environment issue was a bit subtle. But there are variations that can be a bit more insidious.
Please consider the following.
Continue reading A Better Example of the Confused By The Environment Issue
It is no great secret: I like value-oriented interfaces that preserve referential transparency. It is the side of the public debate I take in
"One of the most useful properties of expressions is that called by Quine referential transparency. In essence this means that if we wish to find the value of an expression which contains a sub-expression, the only thing we need to know about the sub-expression is its value."
Christopher Strachey, "Fundamental Concepts in Programming Languages", Higher-Order and Symbolic Computation, 13, 11–49, 2000, Kluwer Academic Publishers (lecture notes written by Christopher Strachey for the International Summer School in Computer Programming at Copenhagen in August, 1967).
Please read on for discussion of a subtle bug shared by a few popular non-standard evaluation interfaces.
Continue reading A Subtle Flaw in Some Popular R NSE Interfaces
I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem.
Please read on for a brief benchmark comparing these methods/solutions.
Continue reading Timing Column Indexing in R
We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.
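To make the question concrete, here is one base R approach (a sketch with a made-up data frame; the full article may present different solutions). It uses matrix indexing: a two-column matrix of (row, column) positions selects one cell per row.

```r
# Hypothetical example data: 'choice' names the column to read from in each row.
df <- data.frame(
  x      = c(1, 2, 3, 4),
  y      = c(10, 20, 30, 40),
  choice = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Keep only the value columns as a numeric matrix, so that no type
# coercion happens from mixing in the character 'choice' column.
vals <- as.matrix(df[, c("x", "y")])

# Build (row, column) index pairs and select one cell per row.
idx <- cbind(seq_len(nrow(df)), match(df$choice, colnames(vals)))
df$derived <- vals[idx]

print(df)
```

Restricting the matrix to the value columns matters: calling as.matrix() on the whole data frame would convert everything to character because of the choice column.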
Continue reading Using a Column as a Column Index