Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, TutorialsTags , , Leave a comment on The blocks and rows theory of data shaping

The blocks and rows theory of data shaping

We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the cdata data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.

Rowrecs to blocks

Posted on Categories Programming, TutorialsTags , , 4 Comments on Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

One of the concepts we teach in both Practical Data Science with R and in our theory of data shaping is the importance of identifying the roles of columns in your data.

For example, to think in terms of multi-row records it helps to identify:

  • Which columns are keys (together identify rows or records).
  • Which columns are data/payload (are considered free varying data).
  • Which columns are "derived" (functions of the keys).

In this note we will show how to use some of these ideas to write safer data-wrangling code.

Continue reading Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

Posted on Categories Programming, TutorialsTags Leave a comment on Conway’s Game of Life in R: Or On the Importance of Vectorizing Your R Code

Conway’s Game of Life in R: Or On the Importance of Vectorizing Your R Code

R is an interpreted programming language with vectorized data structures. This means a single R command can ask for very many arithmetic operations to be performed. This also means R computation can be fast. We will show an example of this using Conway’s Game of Life.

Continue reading Conway’s Game of Life in R: Or On the Importance of Vectorizing Your R Code

Posted on Categories TutorialsTags , , 2 Comments on Scatterplot matrices (pair plots) with cdata and ggplot2

Scatterplot matrices (pair plots) with cdata and ggplot2

In my previous post, I showed how to use cdata package along with ggplot2‘s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot?

A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base plot, you would use the pairs() function. Here is the base version of the pairs plot of the iris dataset:

pairs(iris[1:4], 
      main = "Anderson's Iris Data -- 3 species",
      pch = 21, 
      bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])

Unnamed chunk 1 1

There are other ways to do this, too:

# not run

library(ggplot2)
library(GGally)
ggpairs(iris, columns=1:4, aes(color=Species)) + 
  ggtitle("Anderson's Iris Data -- 3 species")

library(lattice)
splom(iris[1:4], 
      groups=iris$Species, 
      main="Anderson's Iris Data -- 3 species")

But I wanted to see if cdata was up to the task. So here we go….

Continue reading Scatterplot matrices (pair plots) with cdata and ggplot2

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , Leave a comment on Designing Transforms for Data Reshaping with cdata

Designing Transforms for Data Reshaping with cdata

Authors: John Mount, and Nina Zumel 2018-10-25

As a followup to our previous post, this post goes a bit deeper into reasoning about data transforms using the cdata package. The cdata packages demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.

cdata adheres to the so-called "Rule of Representation":

Fold knowledge into data, so program logic can be stupid and robust.

The Art of Unix Programming, Erick S. Raymond, Addison-Wesley , 2003

The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.

We showed in the last post how cdata takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?

Let’s discuss that using the example from the previous post: "plotting the iris data faceted".

Continue reading Designing Transforms for Data Reshaping with cdata

Posted on Categories Programming, TutorialsTags , , 8 Comments on Faceted Graphs with cdata and ggplot2

Faceted Graphs with cdata and ggplot2

In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the iris data, like so:

Unnamed chunk 1 1

I wanted a plot for petal dimensions and sepal dimensions, but I also felt that two plots took up too much space. So, I thought, why not make a faceted graph that shows both:

Unnamed chunk 2 1

Except — which columns do I plot and what do I facet on?

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Here’s one way to create the plot I want, using the cdata package along with ggplot2.

Continue reading Faceted Graphs with cdata and ggplot2

Posted on Categories Coding, OpinionTags , , 2 Comments on Quasiquotation in R via bquote()

Quasiquotation in R via bquote()

In August of 2003 Thomas Lumley added bquote() to R 1.8.1. This gave R and R users an explicit Lisp-style quasiquotation capability. bquote() and quasiquotation are actually quite powerful. Professor Thomas Lumley should get, and should continue to receive, a lot of credit and thanks for introducing the concept into R.

In fact bquote() is already powerful enough to build a version of dplyr 0.5.0 with quasiquotation semantics quite close (from a user perspective) to what is now claimed in tidyeval/rlang.

Let’s take a look at that.

Continue reading Quasiquotation in R via bquote()

Posted on Categories Programming, TutorialsTags , , , , , Leave a comment on Piping into ggplot2

Piping into ggplot2

In our wrapr pipe RJournal article we used piping into ggplot2 layers/geoms/items as an example.

Being able to use the same pipe operator for data processing steps and for ggplot2 layering is a question that comes up from time to time (for example: Why can’t ggplot2 use %>%?). In fact the primary ggplot2 package author wishes that magrittr piping was the composing notation for ggplot2 (though it is obviously too late to change).

There are some fundamental difficulties in trying to use the magrittr pipe in such a way. In particular magrittr looks for its own pipe by name in un-evaluated code, and thus is difficult to engineer over (though it can be hacked around). The general concept is: pipe stages are usually functions or function calls, and ggplot2 components are objects (verbs versus nouns); and at first these seem incompatible.

However, the wrapr dot-arrow-pipe was designed to handle such distinctions.

Let’s work an example.

Continue reading Piping into ggplot2

Posted on Categories Opinion, TutorialsTags , , Leave a comment on Some R Guides: tidyverse and data.table Versions

Some R Guides: tidyverse and data.table Versions

Saghir Bashir of ilustat recently shared a nice getting started with R and tidyverse guide.

NewImage

In addition they were generous enough to link to Dirk Eddelbuette’s later adaption of the guide to use data.table.

NewImage

This type of cooperation and user choice is what keeps the R community vital. Please encourage it. (Heck, please insist on it!)

Posted on Categories Coding, OpinionTags , , 15 Comments on Running the Same Task in Python and R

Running the Same Task in Python and R

According to a KDD poll fewer respondents (by rate) used only R in 2017 than in 2016. At the same time more respondents (by rate) used only Python in 2017 than in 2016.

Let’s take this as an excuse to take a quick look at what happens when we try a task in both systems.

Continue reading Running the Same Task in Python and R