## Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

One of the concepts we teach both in Practical Data Science with R and in our theory of data shaping is the importance of identifying the roles of columns in your data.

For example, to think in terms of multi-row records it helps to identify:

• Which columns are keys (together identify rows or records).
• Which columns are data/payload (are considered free varying data).
• Which columns are "derived" (functions of the keys).

In this note we will show how to use some of these ideas to write safer data-wrangling code.
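As a minimal sketch of the idea, a pseudo-aggregator can carry a "derived" column through an aggregation while checking that it really is a function of the keys. The helper name `only_value` and the example data are hypothetical, and this uses plain base R rather than any particular package's API.

```r
# Hypothetical pseudo-aggregator: returns the single value shared by x,
# and stops with an error if x is not constant within the group.
only_value <- function(x) {
  u <- unique(x)
  if (length(u) != 1) {
    stop("expected exactly one distinct value, found ", length(u))
  }
  u
}

# Example data (made up): 'id' is the key, 'region' should be derived
# from 'id', and 'sales' is free-varying payload.
d <- data.frame(
  id     = c(1, 1, 2, 2),
  region = c("east", "east", "west", "west"),
  sales  = c(10, 20, 5, 7),
  stringsAsFactors = FALSE
)

# Aggregate the payload with sum(); carry the derived column with only_value().
totals  <- aggregate(d["sales"],  by = d["id"], FUN = sum)
regions <- aggregate(d["region"], by = d["id"], FUN = only_value)
result  <- merge(totals, regions, by = "id")
print(result)
```

If some `id` ever shows up with two different `region` values, the aggregation stops with an error instead of silently picking one of them.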


## Conway’s Game of Life in R: Or On the Importance of Vectorizing Your R Code

R is an interpreted programming language with vectorized data structures. This means a single R command can ask for very many arithmetic operations to be performed. This also means R computation can be fast. We will show an example of this using Conway’s Game of Life.
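To make the point concrete, here is a small sketch of one Game of Life generation written with whole-matrix operations. It is not necessarily the code from the full post; the function name `life_step` and the choice to treat cells beyond the edge as dead are assumptions made for this illustration.

```r
# One generation of Conway's Game of Life on a logical matrix.
# Assumption: cells outside the board are treated as dead.
life_step <- function(board) {
  nr <- nrow(board)
  nc <- ncol(board)
  # Pad with a border of zeros so every cell has eight neighbors.
  padded <- matrix(0L, nrow = nr + 2, ncol = nc + 2)
  padded[2:(nr + 1), 2:(nc + 1)] <- board
  # Sum the eight shifted copies of the board; each addition is a single
  # vectorized operation over the whole matrix.
  neighbors <- matrix(0L, nrow = nr, ncol = nc)
  for (dr in -1:1) {
    for (dc in -1:1) {
      if (dr != 0 || dc != 0) {
        neighbors <- neighbors +
          padded[(2:(nr + 1)) + dr, (2:(nc + 1)) + dc, drop = FALSE]
      }
    }
  }
  # A cell is alive next step with exactly 3 neighbors,
  # or if it is alive now and has exactly 2.
  (neighbors == 3) | (board & (neighbors == 2))
}

# A "blinker": three cells in a row that flip between horizontal and vertical.
blinker <- matrix(FALSE, nrow = 5, ncol = 5)
blinker[3, 2:4] <- TRUE
next_gen <- life_step(blinker)
print(next_gen)
```

The loops run only over the eight neighbor offsets, never over individual cells, so the per-cell arithmetic is left to R's vectorized matrix operations.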


## Scatterplot matrices (pair plots) with cdata and ggplot2

In my previous post, I showed how to use the `cdata` package along with `ggplot2`'s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use `cdata` to produce a `ggplot2` version of a scatterplot matrix, or pairs plot?

A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base graphics, you would use the `pairs()` function. Here is the base version of the pairs plot of the `iris` dataset:

```
pairs(iris[1:4],
      main = "Anderson's Iris Data -- 3 species",
      pch = 21,
      bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])
```

There are other ways to do this, too:

```
# not run

library(ggplot2)
library(GGally)
ggpairs(iris, columns = 1:4, aes(color = Species)) +
  ggtitle("Anderson's Iris Data -- 3 species")

library(lattice)
splom(iris[1:4],
      groups = iris$Species,
      main = "Anderson's Iris Data -- 3 species")
```

But I wanted to see if `cdata` was up to the task. So here we go….


## Designing Transforms for Data Reshaping with cdata

Authors: John Mount and Nina Zumel, 2018-10-25

As a follow-up to our previous post, this post goes a bit deeper into reasoning about data transforms using the `cdata` package. The `cdata` package demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.

`cdata` adheres to the so-called "Rule of Representation":

Fold knowledge into data, so program logic can be stupid and robust.

The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.

We showed in the last post how `cdata` takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?

Let’s discuss that using the example from the previous post: "plotting the `iris` data faceted".
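To give a feel for what a transform control table expresses, here is a base-R sketch. It is not `cdata`'s implementation or API: the helper `apply_control_table` and the output column names (`flower_part`, `Length`, `Width`) are hypothetical, chosen for this illustration. The first column of the control table names the new key column and its values; each remaining cell names the source column that supplies the value for that key and output column.

```r
# A control table as plain data: one row per output "block row".
control_table <- data.frame(
  flower_part = c("Petal", "Sepal"),
  Length      = c("Petal.Length", "Sepal.Length"),
  Width       = c("Petal.Width", "Sepal.Width"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: interpret the control table with deliberately simple logic.
apply_control_table <- function(data, control_table, columns_to_copy) {
  key_col    <- names(control_table)[1]
  value_cols <- names(control_table)[-1]
  blocks <- lapply(seq_len(nrow(control_table)), function(i) {
    # Start from the columns that are copied unchanged.
    block <- data[, columns_to_copy, drop = FALSE]
    # Add the new key column for this control-table row.
    block[[key_col]] <- control_table[[key_col]][i]
    # Fill each output column from the source column named in the table.
    for (v in value_cols) {
      block[[v]] <- data[[control_table[[v]][i]]]
    }
    block
  })
  do.call(rbind, blocks)
}

iris_long <- apply_control_table(iris, control_table, columns_to_copy = "Species")
print(head(iris_long))
```

All of the knowledge about the reshaping lives in `control_table`; the code that applies it stays simple, in the spirit of the Rule of Representation. `cdata`'s own conversion functions (for example `rowrecs_to_blocks()`) handle the general case, including the reverse direction.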


## Faceted Graphs with cdata and ggplot2

In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the `iris` data, like so:

I wanted a plot for petal dimensions and sepal dimensions, but I also felt that two plots took up too much space. So, I thought, why not make a faceted graph that shows both:

Except — which columns do I plot and what do I facet on?

```
head(iris)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
```

Here’s one way to create the plot I want, using the `cdata` package along with `ggplot2`.


## Piping into ggplot2

In our `wrapr` pipe RJournal article we used piping into `ggplot2` layers/geoms/items as an example.

Whether the same pipe operator can be used for both data processing steps and `ggplot2` layering is a question that comes up from time to time (for example: Why can't ggplot2 use %>%?). In fact the primary `ggplot2` package author wishes that `magrittr` piping was the composing notation for `ggplot2` (though it is obviously too late to change).

There are some fundamental difficulties in trying to use the `magrittr` pipe in such a way. In particular `magrittr` looks for its own pipe by name in un-evaluated code, and thus is difficult to engineer over (though it can be hacked around). The general concept is: pipe stages are usually functions or function calls, and `ggplot2` components are objects (verbs versus nouns); and at first these seem incompatible.

However, the `wrapr` dot-arrow-pipe was designed to handle such distinctions.

Let’s work an example.


## Some R Guides: tidyverse and data.table Versions

Saghir Bashir of ilustat recently shared a nice guide to getting started with `R` and the `tidyverse`.

In addition, they were generous enough to link to Dirk Eddelbuettel's later adaptation of the guide to use `data.table`.

This type of cooperation and user choice is what keeps the `R` community vital. Please encourage it. (Heck, please insist on it!)

## Quick Significance Calculations for A/B Tests in R

## Introduction

Let’s take a quick look at a very important and common experimental problem: checking if the difference in success rates of two Binomial experiments is statistically significant. This can arise in A/B testing situations such as online advertising, sales, and manufacturing.

We already share a free video course on a Bayesian treatment of planning and evaluating A/B tests (including a free Shiny application). Let's now take a look at the should-be-simple task of building a summary statistic that includes a classic frequentist significance.
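As a rough sketch of what such a summary can look like, using only base R's `stats` functions, the following compares the success rates of two arms with `fisher.test()` and `prop.test()`. The counts are made up for illustration, and this is not necessarily the calculation the rest of the post develops.

```r
# Hypothetical counts for the two arms of an A/B test.
successes <- c(A = 82, B = 55)
trials    <- c(A = 1000, B = 1000)

# Observed success rate in each arm.
rates <- successes / trials

# Fisher's exact test on the 2x2 table of successes and failures.
tab <- rbind(success = successes, failure = trials - successes)
ft  <- fisher.test(tab)

# Two-sample test of equal proportions (chi-squared, with R's default
# continuity correction).
pt <- prop.test(successes, trials)

# One small summary table: per-arm rates plus the two p-values.
summary_stat <- data.frame(
  arm         = names(successes),
  rate        = as.numeric(rates),
  fisher_p    = ft$p.value,
  prop_test_p = pt$p.value
)
print(summary_stat)
```

Both p-values address the same null hypothesis, that the two arms share a single underlying success rate. Reporting them next to the observed rates keeps the effect size and its significance in one place.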


## Modeling multi-category Outcomes With vtreat

`vtreat` is a powerful `R` package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition, `vtreat` can now effectively prepare data for multi-class classification or multinomial modeling.


## A Subtle Flaw in Some Popular R NSE Interfaces

It is no great secret: I like value-oriented interfaces that preserve referential transparency. It is the side of the public debate I take in `R` programming.

"One of the most useful properties of expressions is that called by Quine referential transparency. In essence this means that if we wish to find the value of an expression which contains a sub-expression, the only thing we need to know about the sub-expression is its value."

Please read on for discussion of a subtle bug shared by a few popular non-standard evaluation interfaces.
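As a general illustration of how non-standard evaluation can break referential transparency (not necessarily the specific bug discussed in the full post), consider base R's `subset()`. The data frame and variable names here are made up. Names in the condition are looked up in the data frame's columns first, so replacing a variable by its value can change the result.

```r
# Made-up data: note the column that happens to be called 'threshold'.
d <- data.frame(x = c(1, 5, 10), threshold = c(100, 100, 100))

# A variable in the calling environment with the same name.
threshold <- 4

# NSE interface: 'threshold' resolves to the column, not to the variable above.
nse_result <- subset(d, x > threshold)

# Substituting the variable's value (4) gives a different answer,
# so the expression is not referentially transparent.
literal_result <- subset(d, x > 4)

# Value-oriented interface: 'threshold' can only mean the variable.
value_result <- d[d$x > threshold, , drop = FALSE]

print(nrow(nse_result))
print(nrow(literal_result))
print(nrow(value_result))
```

In the value-oriented version the meaning of `threshold` does not depend on which columns the data frame happens to contain.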