Categories: Programming, Tutorials

Quoting in R

Many R users appear to be big fans of "code capturing" or "non-standard evaluation" (NSE) interfaces. In this note we will discuss quoting and non-quoting interfaces in R.
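As a small illustration of the distinction (a sketch in base R, not taken from the full note): `subset()` captures its condition unevaluated and looks up `x` inside the data frame, while bracket indexing takes an ordinary value. The data frame `d` and the variable names are made up for the example.

```r
# Toy data, invented for this illustration.
d <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Quoting (NSE) interface: the expression `x > 1` is captured by subset()
# and evaluated with the columns of d in scope.
r_quoting <- subset(d, x > 1)

# Non-quoting (standard evaluation) interface: we build the logical
# vector ourselves and pass it in as a plain value.
r_value <- d[d$x > 1, , drop = FALSE]

# Doing the capture explicitly: quote() stores the code, eval() runs it
# using the data frame as the evaluation environment.
expr <- quote(x > 1)
r_explicit <- d[eval(expr, envir = d), , drop = FALSE]
```

All three select the same rows; they differ in who is responsible for looking up `x`.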

Continue reading Quoting in R

Categories: Opinion, Statistics, Tutorials

More on Bias Corrected Standard Deviation Estimates

This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments.

Continue reading More on Bias Corrected Standard Deviation Estimates

Categories: Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

How to de-Bias Standard Deviation Estimates

This note is about attempting to remove the bias introduced by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish that there is a bias, concentrate on why it is not important to remove it for reasonably sized samples, and (despite that) give a very complete bias management solution.
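To get a feel for the size of the bias, here is a minimal sketch (not the method of the full note) for the special case of 0/1 outcomes. For n draws with k ones, the Bessel-corrected sample standard deviation is sqrt(k(n-k)/(n(n-1))), so its expected value can be computed exactly by summing over the binomial distribution. The function names `expected_sample_sd()` and `true_sd()` are invented for this example.

```r
# Exact expected value of the (Bessel-corrected) sample standard deviation
# for n independent 0/1 draws with success probability p.
expected_sample_sd <- function(n, p) {
  k <- 0:n                                   # possible counts of ones
  sd_k <- sqrt(k * (n - k) / (n * (n - 1)))  # sample sd when k ones are seen
  sum(dbinom(k, size = n, prob = p) * sd_k)  # weight by probability of each k
}

# True standard deviation of a single 0/1 draw.
true_sd <- function(p) sqrt(p * (1 - p))

p <- 0.5
n_values <- c(5, 20, 100)
ratio <- sapply(n_values, function(n) expected_sample_sd(n, p) / true_sd(p))
print(data.frame(n = n_values, expected_over_true = ratio))
```

The ratio stays below 1 (the sample estimate is biased low) and moves toward 1 as n grows, which is the sense in which the bias matters little for reasonably sized samples.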

Continue reading How to de-Bias Standard Deviation Estimates

Categories: Statistics, Tutorials, Uncategorized

R tip: Make Your Results Clear with sigr

R is designed to make working with statistical models fast, succinct, and reliable.

For instance, building a model is a one-liner:

model <- lm(Petal.Length ~ Sepal.Length, data = iris)

And producing a detailed diagnostic summary of the model is also a one-liner:

summary(model)

# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = iris)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.47747 -0.59072 -0.00668  0.60484  2.49512 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
# Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.8678 on 148 degrees of freedom
# Multiple R-squared:   0.76,   Adjusted R-squared:  0.7583 
# F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

However, useful as the above is, it isn’t exactly presentation-ready. To formally report the R-squared of our model we would have to cut and paste this information from the summary. That is a needlessly laborious and possibly error-prone step.

With the sigr package this can be made much easier:

library("sigr")
Rsquared <- wrapFTest(model)
print(Rsquared)

# [1] "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."

And this formal summary can be directly rendered into many formats (LaTeX, HTML, Markdown, and ASCII).

render(Rsquared, format="html")

F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05).

sigr can help make your publication workflow much easier and more repeatable/reliable.
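For comparison, here is a rough base R sketch of pulling the same quantities out of the summary object by hand. It uses only the `r.squared` and `fstatistic` components of `summary()` on an `lm` fit; the formatting string is our own choice and is not meant to reproduce sigr's output exactly.

```r
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
s <- summary(model)

# R-squared and the F statistic (a named vector: value, numdf, dendf).
r2 <- s$r.squared
f <- s$fstatistic

# Upper-tail p-value of the F test.
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Hand-rolled report string (format chosen for this example only).
report <- sprintf("R2=%.2f, F(%d,%d)=%.1f, p=%.3g",
                  r2,
                  as.integer(f[["numdf"]]),
                  as.integer(f[["dendf"]]),
                  f[["value"]],
                  p_value)
print(report)
```

This works, but every report needs its own extraction and formatting code, which is the repetitive step a package like sigr is meant to take over.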

Categories: Programming, Tutorials

coalesce with wrapr

coalesce is a classic, useful SQL operator that picks the first non-NULL value in a sequence of values.

We thought we would share a nice version of it for picking non-NA values in R, with convenient infix operator notation: wrapr::coalesce(). Here is a short example of it in action:

library("wrapr")

NA %?% 0

# [1] 0
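The idea behind the operator can be sketched in a few lines of base R. This is only an illustration of the concept for plain vectors, not wrapr's implementation; `coalesce_na()` is a name made up for the example.

```r
# Element-wise coalesce for vectors: keep a where it is not NA,
# otherwise fall back to b (b is recycled if it is a single value).
coalesce_na <- function(a, b) {
  ifelse(is.na(a), b, a)
}

coalesce_na(c(1, NA, 3), 0)
# [1] 1 0 3
```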

A more substantial application is the following.

Continue reading coalesce with wrapr

Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Tutorials

The blocks and rows theory of data shaping

We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the cdata data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.

[Figure: Rowrecs to blocks]
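To make the two record shapes concrete, here is a small base R sketch (deliberately not using cdata, whose interface is described in the linked note). In "row record" form each record is a single row; in "block record" form the same record is spread over several rows. The column names `id`, `measure`, and `value` are invented for the example.

```r
# Row records: one row per record, one column per measurement.
row_recs <- data.frame(id = c(1, 2),
                       AUC = c(0.6, 0.7),
                       R2 = c(0.2, 0.3))

# Block records: one row per (record, measurement) pair.
measure_cols <- c("AUC", "R2")
blocks <- do.call(rbind, lapply(measure_cols, function(m) {
  data.frame(id = row_recs$id,
             measure = m,
             value = row_recs[[m]],
             stringsAsFactors = FALSE)
}))

# Sort so that the rows of each record sit together as a block.
blocks <- blocks[order(blocks$id, blocks$measure), ]
rownames(blocks) <- NULL
print(blocks)
```

Each record id now owns a two-row block, and the `measure` column says which original column each value came from.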

Categories: Programming, Tutorials

Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

One of the concepts we teach in both Practical Data Science with R and in our theory of data shaping is the importance of identifying the roles of columns in your data.

For example, to think in terms of multi-row records it helps to identify:

  • Which columns are keys (together identify rows or records).
  • Which columns are data/payload (are considered free varying data).
  • Which columns are "derived" (functions of the keys).

In this note we will show how to use some of these ideas to write safer data-wrangling code.
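The column roles above can already be checked with a few lines of base R. The sketch below is our own illustration under assumed names (`single_value()` and the toy data frame are hypothetical, not the functions from the full note): it checks that the key columns identify rows uniquely, and uses an aggregator that refuses to summarize a group unless the column really is constant within it.

```r
# Toy multi-row records: record_id + row_id are the keys,
# value is payload, record_label is derived from record_id.
d <- data.frame(record_id = c(1, 1, 2, 2),
                row_id = c("a", "b", "a", "b"),
                value = c(10, 20, 30, 40),
                record_label = c("x", "x", "y", "y"),
                stringsAsFactors = FALSE)

# Check 1: the key columns together identify each row.
keys <- c("record_id", "row_id")
keys_unique <- anyDuplicated(d[, keys, drop = FALSE]) == 0

# Check 2: a "pseudo-aggregator" that returns the single value of a
# group, and stops if the group contains more than one distinct value.
single_value <- function(x) {
  u <- unique(x)
  if (length(u) != 1) {
    stop("expected exactly one distinct value per group")
  }
  u
}

# record_label should be a function of record_id; this errors if not.
labels <- tapply(d$record_label, d$record_id, single_value)
print(keys_unique)
print(labels)
```

The point of the design is that a violated assumption produces an error at the aggregation step, instead of silently picking one of several conflicting values.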

Continue reading Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

Categories: Programming, Tutorials

Conway’s Game of Life in R: Or On the Importance of Vectorizing Your R Code

R is an interpreted programming language with vectorized data structures. This means a single R command can ask for very many arithmetic operations to be performed. This also means R computation can be fast. We will show an example of this using Conway’s Game of Life.
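As a rough sketch of what "vectorized" means here (our own illustration, not necessarily the code from the full note), one generation of Life can be computed by adding eight shifted copies of the board, so the per-cell arithmetic happens inside whole-matrix operations. Cells outside the board are assumed dead; `life_step()` is a name chosen for the example.

```r
# One Game of Life generation on a 0/1 integer matrix.
life_step <- function(m) {
  nr <- nrow(m)
  nc <- ncol(m)
  # Pad the board with a border of dead cells.
  padded <- matrix(0L, nrow = nr + 2, ncol = nc + 2)
  padded[2:(nr + 1), 2:(nc + 1)] <- m
  # Count live neighbors by summing the eight shifted copies of the board.
  neighbors <- matrix(0L, nrow = nr, ncol = nc)
  for (dr in -1:1) {
    for (dc in -1:1) {
      if (dr != 0 || dc != 0) {
        neighbors <- neighbors +
          padded[(2:(nr + 1)) + dr, (2:(nc + 1)) + dc, drop = FALSE]
      }
    }
  }
  # Birth on exactly 3 neighbors; survival on 2 or 3.
  (neighbors == 3 | (m == 1 & neighbors == 2)) * 1L
}

# A "blinker": three cells in a row, which flips between
# horizontal and vertical every generation.
blinker <- matrix(0L, nrow = 5, ncol = 5)
blinker[3, 2:4] <- 1L
next_gen <- life_step(blinker)
print(next_gen)
```

The only explicit loop runs over the eight neighbor directions, not over the cells, so the cost of interpretation does not grow with the size of the board.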

Continue reading Conway’s Game of Life in R: Or On the Importance of Vectorizing Your R Code

Categories: Tutorials

Scatterplot matrices (pair plots) with cdata and ggplot2

In my previous post, I showed how to use the cdata package along with ggplot2's faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot?

A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base graphics, you would use the pairs() function. Here is the base version of the pairs plot of the iris dataset:

pairs(iris[1:4], 
      main = "Anderson's Iris Data -- 3 species",
      pch = 21, 
      bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])


There are other ways to do this, too:

# not run

library(ggplot2)
library(GGally)
ggpairs(iris, columns=1:4, aes(color=Species)) + 
  ggtitle("Anderson's Iris Data -- 3 species")

library(lattice)
splom(iris[1:4], 
      groups=iris$Species, 
      main="Anderson's Iris Data -- 3 species")

But I wanted to see if cdata was up to the task. So here we go….
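To show the data shape a faceted pairs plot needs, here is a sketch that builds it by brute force in base R. This is a stand-in for illustration only, not the cdata transform the full post develops; the column names `xv`, `yv`, `x`, and `y` are our own choices. The plotting step is attempted only if ggplot2 is installed.

```r
# Every ordered pair of the four numeric iris columns.
vars <- names(iris)[1:4]
var_pairs <- expand.grid(xv = vars, yv = vars, stringsAsFactors = FALSE)

# One block of 150 rows per pair: which variables are plotted,
# their values, and the species for coloring.
long <- do.call(rbind, lapply(seq_len(nrow(var_pairs)), function(i) {
  data.frame(xv = var_pairs$xv[i],
             yv = var_pairs$yv[i],
             x = iris[[var_pairs$xv[i]]],
             y = iris[[var_pairs$yv[i]]],
             Species = iris$Species,
             stringsAsFactors = FALSE)
}))

# With the data in this shape, the matrix of panels is one facet_grid call.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = x, y = y, color = Species)) +
    geom_point() +
    facet_grid(yv ~ xv, scales = "free") +
    ggtitle("Anderson's Iris Data -- 3 species")
  # print(p) draws the plot
}
```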

Continue reading Scatterplot matrices (pair plots) with cdata and ggplot2

Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Tutorials

Designing Transforms for Data Reshaping with cdata

Authors: John Mount and Nina Zumel, 2018-10-25

As a followup to our previous post, this post goes a bit deeper into reasoning about data transforms using the cdata package. The cdata package demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.

cdata adheres to the so-called "Rule of Representation":

Fold knowledge into data, so program logic can be stupid and robust.

The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003

The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.

We showed in the last post how cdata takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?

Let’s discuss that using the example from the previous post: "plotting the iris data faceted".
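To give a flavor of "data controlling code", here is a sketch of what a control table for the iris example might look like, applied with a small base R helper. Both the table layout and the helper `apply_control()` are assumptions made for illustration; they are not cdata's API, and the full post should be consulted for the actual interface.

```r
# Assumed control table: the first column holds the new key values,
# the remaining columns name the source columns to read for each key.
control_table <- data.frame(flower_part = c("Petal", "Sepal"),
                            Length = c("Petal.Length", "Sepal.Length"),
                            Width = c("Petal.Width", "Sepal.Width"),
                            stringsAsFactors = FALSE)

# Hypothetical helper: build one block of rows per control-table row.
apply_control <- function(d, ct, copy_cols) {
  key_col <- names(ct)[1]
  value_cols <- names(ct)[-1]
  blocks <- lapply(seq_len(nrow(ct)), function(i) {
    out <- d[, copy_cols, drop = FALSE]   # columns carried along unchanged
    out[[key_col]] <- ct[[key_col]][i]    # label the block with its key
    for (v in value_cols) {
      out[[v]] <- d[[ct[[v]][i]]]         # look up the source column by name
    }
    out
  })
  do.call(rbind, blocks)
}

iris_long <- apply_control(iris, control_table, copy_cols = "Species")
str(iris_long)
```

The helper itself knows nothing about flowers; changing the reshaping means editing the table, not the code, which is the trade-off the Rule of Representation describes.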

Continue reading Designing Transforms for Data Reshaping with cdata