R users appear to be big fans of "code capturing" or "non-standard evaluation" (NSE) interfaces. In this note we will discuss quoting and non-quoting interfaces in R.
The above terms simply refer to interfaces where a name to be used is captured from the source code the user typed, and thus does not need quote marks. For example:
d <- data.frame(x = 1)
d$x
## [1] 1
Notice that both during data.frame creation and during column access, the column name is given without quotes.
This differs from using a standard value oriented interface as in the following:
d[["x"]]
## [1] 1
A natural reason for R users to look for automatic quoting is that it helps make working with columns in data.frames (R's primary data analysis structure) look much like working with variables in the environment. Without the quotes a column name looks very much like a variable name, and thinking of columns as variables is a useful mindset.
Another place implicit quoting shows up is with R's "combine" operator, where one can write either of the following.
c(a = "b")
## a
## "b"
c("a" = "b")
## a
## "b"
The wrapr package brings in a new function: qc(), or "quoting c()", that gives a very powerful and convenient way to elide quotes.
library(wrapr)
qc(a = b)
## a
## "b"
Notice quotes are not required on either side of the name assignment. Again, eliding quotes is not that big a deal, and not to everyone's taste. For example, I have never seen a Python user feel they are missing anything because they write {"a" : "b"} to construct their own named dictionary structure.
That being said, qc() is a very convenient and consistent notation if you do want to work in an NSE style.
For example, if it ever bothered you that a dplyr join takes the join column names as a character vector, you can use qc() to instead write:
dplyr::full_join(
iris, iris,
by = qc(Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width,
Species))
(Actually I very much like that the join takes the columns as a vector, as it is much easier to program over.) I feel the qc() grouping of the columns makes it easier for a reader to see which arguments are the column set than a use of ... would. Take, as an example, the following dplyr::group_by():
library(dplyr)
starwars %>%
group_by(homeworld, species, add = FALSE) %>%
summarize(mass = mean(mass, na.rm = TRUE))
## # A tibble: 58 x 3
## # Groups: homeworld [?]
## homeworld species mass
## <chr> <chr> <dbl>
## 1 Alderaan Human 64
## 2 Aleen Minor Aleena 15
## 3 Bespin Human 79
## 4 Bestine IV Human 110
## 5 Cato Neimoidia Neimodian 90
## 6 Cerea Cerean 82
## 7 Champala Chagrian NaN
## 8 Chandrila Human NaN
## 9 Concord Dawn Human 79
## 10 Corellia Human 78.5
## # ... with 48 more rows
When coming back to such code later, I find the following notation to be easier to read:
library(seplyr)
starwars %>%
group_by_se(qc(homeworld, species), add = FALSE) %>%
summarize(mass = mean(mass, na.rm = TRUE))
## # A tibble: 58 x 3
## # Groups: homeworld [?]
## homeworld species mass
## <chr> <chr> <dbl>
## 1 Alderaan Human 64
## 2 Aleen Minor Aleena 15
## 3 Bespin Human 79
## 4 Bestine IV Human 110
## 5 Cato Neimoidia Neimodian 90
## 6 Cerea Cerean 82
## 7 Champala Chagrian NaN
## 8 Chandrila Human NaN
## 9 Concord Dawn Human 79
## 10 Corellia Human 78.5
## # ... with 48 more rows
In the above we can clearly see which arguments to the grouping command are intended to be column names, and which are not.
qc() is a powerful NSE tool that annotates and contains where we are expecting quoting behavior. Some possible applications include the following.
# install many packages
install.packages(qc(testthat, knitr, rmarkdown, R.rsp))
# select columns
iris[, qc(Petal.Length, Petal.Width, Species)]
# control a for-loop
for(col in qc(Petal.Length, Petal.Width)) {
iris[[col]] <- sqrt(iris[[col]])
}
# control a vapply
vapply(qc(Petal.Length, Petal.Width),
function(col) {
sum(is.na(iris[[col]]))
}, numeric(1))
The idea is: with qc() the user can switch to name-capturing notation at will, with no prior arrangement needed in the functions or packages used. Also, the parentheses in qc() make for more legible code: a reader can see which arguments are being quoted and taken as a group.
As of wrapr 1.7.0, qc() incorporates bquote() functionality. bquote() is R's built-in quasi-quotation facility. It was added to R in August of 2003 by Thomas Lumley, and doesn't get as much attention as it deserves.
A quoting tool such as qc() becomes a quasi-quoting tool if we add a notation that signals we do not wish to quote. In R the standard notation for this is ".()" (Lisp uses a back-tick, the data.table package uses "..", and the rlang package uses "!!"). The bquote()-enabled version of qc() lets us write code such as the following.
library(wrapr)
extra_column = "Species"
qc(Petal.Length, Petal.Width, extra_column)
## [1] "Petal.Length" "Petal.Width" "extra_column"
qc(Petal.Length, Petal.Width, .(extra_column))
## [1] "Petal.Length" "Petal.Width" "Species"
Notice it is unambiguous what is going on above. The first qc() quotes all of its arguments into strings. The second works much the same, with the exception of names marked with .(). This ability to "break out" of, or turn off, quoting is convenient if we are working with a combination of values we wish to type in directly and others we wish to take from variables.
qc() allows substitution on the left-hand sides of assignments, if we use the alternate := notation for assignment (a convention put forward by data.table, and later adopted by dplyr).
library(wrapr)
left_name = "a"
right_value = "b"
qc(.(left_name) := .(right_value))
## a
## "b"
The wrapr package also exports an implementation for :=. So one could also write:
left_name := right_value
## a
## "b"
The hope is that the qc() and := operators are well behaved enough to commute, in the sense that the following two statements should return the same value.
library(wrapr)
qc(a := b, c := d)
## a c
## "b" "d"
qc(a, c) := qc(b, d)
## a c
## "b" "d"
The idea is: when there is a symmetry it is often evidence you are using the right concepts.
In conclusion: the goal of wrapr::qc() is to put a very regular and controllable quoting facility directly into the hands of the R user. This allows the R user to treat just about any R function or package as if the function or package itself implemented argument quoting and quasi-quotation capabilities.
For normal deviates there is, of course, a well-known scaling correction that returns an unbiased estimate for observed standard deviations.
The same source notes that it:
… provides an example where imposing the requirement for unbiased estimation might be seen as just adding inconvenience, with no real benefit.
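For normally distributed data that correction is the classic c4(n) factor: the expected value of the Bessel-corrected standard deviation is c4(n) times the true sigma, so dividing by c4(n) removes the bias. A minimal sketch (the helper name c4 is ours, added for illustration):

```r
# c4(n): the expected value of sd() on n i.i.d. normal observations,
# expressed as a fraction of the true sigma (a standard result).
c4 <- function(n) {
  sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
}
c4(5)
# [1] 0.9399856
# so for samples of size 5, sd(x) / c4(5) is an unbiased estimate of sigma
```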
Let's make a quick plot comparing the naive estimate of standard deviation ("forgetting to use n-1 in the denominator") and the Bessel-corrected estimate (the square root of the Bessel-corrected variance). It is well known that the naive estimate is biased down and under-estimates both the variance and the standard deviation. The Bessel correction deliberately inflates the variance estimate to get the expected value right (i.e., to remove the bias). However, as we can see in the following graph, for the standard deviation the correction is not enough: the square root of the Bessel-corrected variance is still systematically an under-estimate of the standard deviation.
We can show this graphically as follows.
The above graph portrays, for different sample sizes (n), the ratio of the expected values of the various estimates to the true value of the standard deviation (for observations from an i.i.d. normal random source). So an unbiased estimate would lie on the line y = 1.
Notice the Bessel-corrected estimate is still below the true value of the standard deviation, just less so than the naive estimate. So from the standard-deviation point of view the Bessel correction helps, but does not by itself remove the bias.
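We can confirm the direction of these biases with a quick Monte Carlo check (our own sketch, not the code that generated the graph): for samples of size 5 from a standard normal, both estimates average below the true sigma of 1, with the Bessel-corrected one closer.

```r
set.seed(2019)
n <- 5
m <- 100000
samples <- matrix(rnorm(n * m), nrow = m)  # m samples of size n, one per row
naive_sd  <- apply(samples, 1, function(x) sqrt(mean((x - mean(x))^2)))
bessel_sd <- apply(samples, 1, sd)         # sd() uses the n-1 denominator
c(naive_ratio = mean(naive_sd), bessel_ratio = mean(bessel_sd))
# both ratios come out below 1 (roughly 0.84 and 0.94 respectively)
```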
All work is shared here.
Everitt’s The Cambridge Dictionary of Statistics 2nd Edition, Cambridge University Press defines a “statistic” as:
A numeric characteristic of a sample.
Informally we can say a statistic is a summary. And this lets us say the field of statistics is the science of relating the observed summaries of samples to the corresponding unobserved summary of the total population or universe the samples are drawn from.
For example: we can take our universe to be the set of female adult crew members of the Titanic, and our observable to be if they survived or not. Then our universe or total population is the following:
20 survivors (we will code these as 1's)
3 fatalities (we will code these as 0's)
We can ask R to show us some common summaries: mean (which we will denote as “p”), variance, and standard_deviation.
universe = c(rep(1, 20), rep(0, 3))
p = mean(universe)
print(p)
# [1] 0.8695652

variance <- mean((universe - p)^2)
print(variance)
# [1] 0.1134216

standard_deviation <- sqrt(variance)
print(standard_deviation)
# [1] 0.3367812
Note, we deliberately did not call R's var() and sd() methods, as they both include Bessel's sample correction. For an entire universe (sometimes called a population) the sample correction is inappropriate, and leads to a wrong answer.
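The difference is exactly the n/(n-1) factor, which we can confirm directly (a quick check of our own):

```r
universe <- c(rep(1, 20), rep(0, 3))
n <- length(universe)
population_variance <- mean((universe - mean(universe))^2)
# var() applies Bessel's n/(n-1) correction, so on a complete
# universe it over-reports the variance by exactly that factor:
all.equal(var(universe), population_variance * n / (n - 1))
# [1] TRUE
```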
Python’s statistics library has a sensible approach. It presents un-corrected (population or universe versions) and corrected (sample versions) on equal footing:
The bias in question falls off at a rate of 1/n (where n is our sample size). So the bias issue loses what little gravity it may ever have had when working with big data. Most sources of noise fall off at the slower rate of 1/sqrt(n), so it is unlikely this bias is going to be the worst feature of your sample.
But let’s pretend the sample size correction indeed is an important point for a while.
Under the “no bias allowed” rubric: if it is so vitally important to bias-correct the variance estimate, would it not be equally critical to correct the standard deviation estimate?
The practical answer seems to be: no. The straightforward standard deviation estimate itself is biased (it has to be, as a consequence of Jensen’s inequality). And pretty much nobody cares, corrects it, or teaches how to correct it, as it just isn’t worth the trouble.
Let's convert that to a worked example. We are going to investigate the statistics of drawing samples of size 5 (uniformly with replacement) from our above universe or population. To see what is going on we will draw a great number of samples (though for a universe this small we could just as easily summarize over it directly).
Here is the distribution of observed sample variances for both the naive calculation (assuming the variance observed on the sample is representative of the universe it was drawn from) and the Bessel-corrected calculation (inflating the sample estimate of variance by n/(n-1), or 25% in this case). The idea is: the observable Bessel-corrected variances of samples converge, on average, to the unobserved un-corrected (or naive) variance of the population.
We generated the above graph by uniformly drawing samples (with replacement) of size 5 from the above universe. The experiment was repeated 100000 times and we are depicting the distribution of what variance estimates were seen as a density plot (only a few discrete values occur in small samples). The black dashed vertical line is the mean of the 100000 variance estimates, and the vertical red line and blocks structure indicates the true variance of the original universe or population.
Notice the Bessel corrected estimate of variance is itself a higher variance estimate: many of the adjusted observations are further away from the true value. The Bessel correction (in this case) didn’t make any one estimate better, it made the average over all possible estimates better. In this case it is inflating all the positive variance estimates to trade-off against the zero variance estimates.
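We can check the averaging claim directly (a sketch of the experiment, not the original plotting code): over many size-5 samples from the universe, the average Bessel-corrected variance lands near the true population variance, while the average naive variance comes out low by the factor (n-1)/n.

```r
set.seed(2018)
universe <- c(rep(1, 20), rep(0, 3))
true_var <- mean((universe - mean(universe))^2)   # about 0.1134
n <- 5
draws <- replicate(100000, sample(universe, n, replace = TRUE))
naive_var  <- apply(draws, 2, function(x) mean((x - mean(x))^2))
bessel_var <- apply(draws, 2, var)                # var() includes n/(n-1)
c(true = true_var, naive = mean(naive_var), bessel = mean(bessel_var))
# the Bessel-corrected mean is near the true variance;
# the naive mean is low by roughly the factor (n-1)/n = 0.8
```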
Now suppose we run the same experiment for standard deviation estimates.
The Bessel adjustment helped, but was nowhere near correcting the bias. Both estimates of standard deviation are severely downward biased.
Here is that experiment again, including an additional correction procedure (which we will call "joint scale corrected").
Notice the scale-corrected estimate is unbiased. We admit, if this were so massively important it would be taught more commonly. However, as standard deviation summaries are more common than variance summaries (example: summary.lm()), having an unbiased estimate for a standard deviation is probably more important than having an unbiased estimate for variance.
The scale-corrected estimate did raise the variance of the estimation procedure, but not much more than the Bessel correction did:

    estimation_method variance_of_estimate
1: scale_corrected_sd           0.05763954
2:           naive_sd           0.04554781
3:          Bessel_sd           0.05693477
How did we correct the standard deviation estimate?
It is quite simple:
For a given sample size n, the standard deviation estimate is essentially a lookup table that maps the pair n (the sample size) and k (the number of 1's seen) to an estimate. Call this table est(n,.). Remember, est(n,.) is just a table of n+1 numbers.
For a known p the true standard deviation is sqrt(p(1-p)) (assuming we exactly knew p). This means for any p we want the expected value of the estimate near this quantity: the sum over k of dbinom(k, n, p) * est(n, k) should be close to sqrt(p(1-p)), with dbinom(k, n, p) being the probability of drawing k 1's in n draws when 1's have a probability of p (this is a known quantity, a value we just copy in). More examples of using this methodology can be found here.
We then solve for values of est(n,.) that simultaneously make many of these check equations nearly true. We used a discrete set of p's with a Jeffreys-prior style example weighting. One could probably recast this as a calculus-of-variations problem over all p's and work closer to an analytic solution.
The concept is from signal processing: observation is an averaging filter, so we need to design an approximate inverse filter (something like an unsharp mask). The above inverse filter was specialized for the 0/1 binomial case, but one can use the above principles to design inverse filters for additional situations.
All of the above was easy in R and we got the following estimation table (here shown graphically with the un-adjusted and also Bessel adjusted estimation procedures).
In the above graph: n is the sample size, k is how many 1's are observed, and the y-axis is the estimate of the unknown true population standard deviation.
Notice the “joint scaled” solution wiggles around a bit.
As we have said, the above graph is the estimator. For samples of size 5 pick the estimator you want to use and the k corresponding to how many 1’s you saw in your sample: then the y height is the estimate you should use for population standard deviation.
In the next graph: p is the unknown true frequency of 1's, and the y-axis is the difference between the expected value of the estimated standard deviation and the true standard deviation.
Notice the joint scaled estimate reports a non-zero standard deviation even for all zeros or all ones samples. Also notice the joint estimate tries to stay near the desired difference of zero using a smooth curve that wiggles above and below the desired difference of zero.
The next graph is the same sort of presentation for ratios of estimated to true standard deviations.
We still see the Bessel corrected estimates are better than naive, but not as good as jointly adjusted.
And that is how you correct standard deviation estimates (at least for binomial experiments). We emphasize that nobody makes this correction in practice, and for big data and large samples the correction is entirely pointless. The variance correction is equally pointless, but since it is easier to perform it is usually added.
The lm() summary object does in fact carry the R-squared and F statistics, both in the printed form:
model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
(smod_lm <- summary(model_lm))
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.47747 -0.59072 -0.00668  0.60484  2.49512
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:  0.76,  Adjusted R-squared:  0.7583
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
and also in the summary() object:
c(R2 = smod_lm$r.squared, F = smod_lm$fstatistic[1])
##          R2     F.value
##   0.7599546 468.5501535
Note, though, that while the summary reports the model's significance, it does not carry it as a specific summary() object item. sigr::wrapFTest() is a convenient way to extract the model's R-squared and F statistic and simultaneously calculate the model significance, as is required by many scientific publications.
sigr is even more helpful for logistic regression, via glm(), which reports neither the model's chi-squared statistic nor its significance.
iris$isVersicolor <- iris$Species == "versicolor"
model_glm <- glm(
  isVersicolor ~ Sepal.Length + Sepal.Width,
  data = iris,
  family = binomial)
(smod_glm <- summary(model_glm))
##
## Call:
## glm(formula = isVersicolor ~ Sepal.Length + Sepal.Width, family = binomial,
##     data = iris)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9769  -0.8176  -0.4298   0.8855   2.0855
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)    8.0928     2.3893   3.387 0.000707 ***
## Sepal.Length   0.1294     0.2470   0.524 0.600247
## Sepal.Width   -3.2128     0.6385  -5.032 4.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 190.95  on 149  degrees of freedom
## Residual deviance: 151.65  on 147  degrees of freedom
## AIC: 157.65
##
## Number of Fisher Scoring iterations: 5
To get the significance of a logistic regression model, call sigr::wrapChiSqTest():
library(sigr)
(chi2Test <- wrapChiSqTest(model_glm))
## [1] "Chi-Square Test summary: pseudo-R2=0.21 (X2(2,N=150)=39, p<1e-05)."
Notice that the fit summary also reports a pseudo-R-squared. You can extract the values directly off the sigr object, as well:
str(chi2Test)
## List of 10
##  $ test          : chr "Chi-Square test"
##  $ df.null       : int 149
##  $ df.residual   : int 147
##  $ null.deviance : num 191
##  $ deviance      : num 152
##  $ pseudoR2      : num 0.206
##  $ pValue        : num 2.92e-09
##  $ sig           : num 2.92e-09
##  $ delta_deviance: num 39.3
##  $ delta_df      : int 2
##  - attr(*, "class")= chr [1:2] "sigr_chisqtest" "sigr_statistic"
And of course you can render the sigr object into one of several formats (Latex, html, markdown, and ascii) for direct inclusion in a report or publication.
render(chi2Test, format = "html")
Chi-Square Test summary: pseudo-R^{2}=0.21 (χ^{2}(2,N=150)=39, p<1e-05).
By the way, if you are interested, we give the explicit formula for calculating the significance of a logistic regression model in Practical Data Science with R.
For instance, building a model is a one-liner:
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
And producing a detailed diagnostic summary of the model is also a one-liner:
summary(model)
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = iris)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -2.47747 -0.59072 -0.00668  0.60484  2.49512
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
# Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.8678 on 148 degrees of freedom
# Multiple R-squared:  0.76,  Adjusted R-squared:  0.7583
# F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
However, useful as the above is: it isn’t exactly presentation ready. To formally report the R-squared of our model we would have to cut and paste this information from the summary. That is a needlessly laborious and possibly error-prone step.
With the sigr package this can be made much easier:
library("sigr") Rsquared <- wrapFTest(model) print(Rsquared) # [1] "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."
And this formal summary can be directly rendered into many formats (Latex, html, markdown, and ascii).
render(Rsquared, format="html")
F Test summary: (R^{2}=0.76, F(1,148)=468.6, p<1e-05).
sigr can help make your publication workflow much easier and more repeatable/reliable.
coalesce is a classic, useful SQL operator that picks the first non-NULL value in a sequence of values.
We thought we would share a nice version of it for picking non-NA values in R, with convenient operator infix notation: wrapr::coalesce(). Here is a short example of it in action:
library("wrapr") NA %?% 0 # [1] 0
A more substantial application is the following.
library("wrapr") d <- wrapr::build_frame( "more_precise_sensor", "cheaper_back_up_sensor" | 0.31 , NA | 0.41 , 0.5 | NA , 0.5 ) print(d) # more_precise_sensor cheaper_back_up_sensor # 1 0.31 NA # 2 0.41 0.50 # 3 NA 0.50 d$measurement <- d$more_precise_sensor %?% d$cheaper_back_up_sensor print(d) # more_precise_sensor cheaper_back_up_sensor measurement # 1 0.31 NA 0.31 # 2 0.41 0.50 0.41 # 3 NA 0.50 0.50]]>
cdata data transform tool. With that, and the theory of how to design transforms, we think we have a pretty complete description of the system.
For example, to think in terms of multi-row records it helps to identify:
In this note we will show how to use some of these ideas to write safer data-wrangling code.
In mathematics the statement "y is a function of x" merely means there is an ideal lookup table with which, if you knew the value of x, you could in principle know the value of y.
For example, in the following R data.frame, y is a function of x, as all rows that have the same value for x also have the same value for y.
d1 <- data.frame(
x = c("a", "a", "b", "b", "c"),
y = c(10, 10, 20, 20, 20),
z = c(1, 2, 1, 1, 1),
stringsAsFactors = FALSE)
print(d1)
## x y z
## 1 a 10 1
## 2 a 10 2
## 3 b 20 1
## 4 b 20 1
## 5 c 20 1
Notice if we know the value of x we then, in principle, know the value of y. In the same example z is not a function of x, as it does not have this property.
A more concrete example would be: user-name is a function of user-ID. If you know an individual's user-ID, then you also (if you have the right lookup table) know the individual's user-name.
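This property is easy to check mechanically. Here is a small helper of our own devising (not part of any package), applied to the d1 defined above:

```r
# check whether column y is functionally determined by column x:
# within each group of equal x values, y must take a single value
is_function_of <- function(d, x, y) {
  all(tapply(d[[y]], d[[x]], function(v) length(unique(v)) == 1))
}
is_function_of(d1, "x", "y")   # TRUE: equal x always gives equal y
is_function_of(d1, "x", "z")   # FALSE: x = "a" maps to both z = 1 and z = 2
```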
We first taught these concepts in the context of SQL and SQL grouped aggregation. In SQL, once you aggregate on one column, all other columns in your query must either be grouping columns or also aggregated. This is easiest to show in code; we will use the dplyr package for our example.
library("dplyr")
d1 %>%
group_by(x) %>%
summarize(y = max(y)) %>%
ungroup()
## # A tibble: 3 x 2
## x y
## <chr> <dbl>
## 1 a 10
## 2 b 20
## 3 c 20
Notice only grouping columns and columns passed through an aggregating calculation (such as max()) are passed through (the column z is not in the result). Now, because y is a function of x, no substantial aggregation is going on; we call this situation a "pseudo-aggregation", and we have taught this before. This is also why we made the seemingly strange choice of keeping the variable name y (instead of picking a new name such as max_y): we expect the y values coming out to be the same as the ones coming in, just with changes of length. Pseudo-aggregation (using the projection y[[1]]) was also used in the solutions of the column indexing problem.
Our wrapr package now supplies a special-case pseudo-aggregator (or, in a mathematical sense, a projection): psagg(). It works as follows.
library("wrapr")
d1 %>%
group_by(x) %>%
summarize(y = psagg(y)) %>%
ungroup()
## # A tibble: 3 x 2
## x y
## <chr> <dbl>
## 1 a 10
## 2 b 20
## 3 c 20
psagg() pretty much works the same as the earlier max(). However, it documents our belief that y is a function of x (that nothing interesting is going on for this column during aggregation). Where psagg() differs is when our assumption that y is a function of x is violated.
d2 <- data.frame(
x = c("a", "a", "b", "b", "c"),
y = c(10, 10, 20, 23, 20),
stringsAsFactors = FALSE)
print(d2)
## x y
## 1 a 10
## 2 a 10
## 3 b 20
## 4 b 23
## 5 c 20
d2 %>%
group_by(x) %>%
summarize(y = psagg(y)) %>%
ungroup()
## Error in summarise_impl(.data, dots): Evaluation error: wrapr::psagg argument values are varying.
The code caught that our assumption was false and raised an error. This sort of checking can save a lot of time and prevent erroneous results.
And that is part of what we teach:
psagg() also works well with data.table.
library("data.table")
as.data.table(d1)[
, .(y = psagg(y)), by = "x"]
## x y
## 1: a 10
## 2: b 20
## 3: c 20
as.data.table(d2)[
, .(y = psagg(y)), by = "x"]
## Error in psagg(y): wrapr::psagg argument values are varying
Of course we don't strictly need psagg(), as we could insert checks by hand (though this would become burdensome if we had many derived columns).
as.data.table(d1)[
, .(y = max(y),
y_was_const = min(y)==max(y)),
by = "x"]
## x y y_was_const
## 1: a 10 TRUE
## 2: b 20 TRUE
## 3: c 20 TRUE
as.data.table(d2)[
, .(y = max(y),
y_was_const = min(y)==max(y)),
by = "x"]
## x y y_was_const
## 1: a 10 TRUE
## 2: b 23 FALSE
## 3: c 20 TRUE
Unfortunately, this sort of checking does not currently work for dplyr.
packageVersion("dplyr")
## [1] '0.7.7'
d1 %>%
group_by(x) %>%
summarize(y = max(y),
y_was_const = min(y)==max(y)) %>%
ungroup()
## # A tibble: 3 x 3
## x y y_was_const
## <chr> <dbl> <lgl>
## 1 a 10 TRUE
## 2 b 20 TRUE
## 3 c 20 TRUE
d2 %>%
group_by(x) %>%
summarize(y = max(y),
y_was_const = min(y)==max(y)) %>%
ungroup()
## # A tibble: 3 x 3
## x y y_was_const
## <chr> <dbl> <lgl>
## 1 a 10 TRUE
## 2 b 23 TRUE
## 3 c 20 TRUE
Notice the per-group variation in y was not detected. This appears to be a dplyr un-caught result-corruption issue. I fully get that it is odd to run into an error during a checking step, but the checking step did not in fact introduce problems; it merely failed to catch them. We run checks like this because the data (possibly from an external source) may not be quite structured the way we were told it is.
If we don’t attempt to re-use the variable name we get the correct result.
d2 %>%
group_by(x) %>%
summarize(y_for_group = max(y),
y_was_const = min(y)==max(y)) %>%
ungroup()
## # A tibble: 3 x 3
## x y_for_group y_was_const
## <chr> <dbl> <lgl>
## 1 a 10 TRUE
## 2 b 23 FALSE
## 3 c 20 TRUE
However, being forced to rename variables during aggregation is a needless user burden.
Our seplyr package (a package that is a very thin adapter on top of dplyr) does issue a warning in the failing situation.
library("seplyr")
d2 %>%
group_by_se("x") %>%
summarize_nse(y = max(y),
y_was_const = min(y)==max(y)) %>%
ungroup()
## Warning in summarize_se(res, summarizeTerms, warn = summarize_nse_warn, :
## seplyr::summarize_se possibly confusing column name re-use c('y' =
## 'max(y)', 'y_was_const' = 'min(y) == max(y)')
## # A tibble: 3 x 3
## x y y_was_const
## <chr> <dbl> <lgl>
## 1 a 10 TRUE
## 2 b 23 TRUE
## 3 c 20 TRUE
The result is still wrong, but at least in this situation the user has a better chance at noticing and working around the issue.
The problem may not be common (it may or may not be in any of your code, or code you use), and is of course easy to avoid (once you know the nature of the issue).
Conway’s Game of Life is one of the most interesting examples of cellular automata. It is traditionally simulated on a rectangular grid (like a chessboard) and each cell is considered either live or dead. The rules of evolution are simple: the next life grid is computed as follows:
This rule can be implemented as scalar code in R:
# d is a matrix of logical values
life_step_scalar <- function(d) {
  nrow <- dim(d)[[1]]
  ncol <- dim(d)[[2]]
  dnext <- matrix(data = FALSE, nrow = nrow, ncol = ncol)
  for(i in seq_len(nrow)) {
    for(j in seq_len(ncol)) {
      pop <- 0
      if(i>1) {
        if(j>1) {
          pop <- pop + d[i-1, j-1]
        }
        pop <- pop + d[i-1, j]
        if(j<ncol) {
          pop <- pop + d[i-1, j+1]
        }
      }
      if(j>1) {
        pop <- pop + d[i, j-1]
      }
      if(j<ncol) {
        pop <- pop + d[i, j+1]
      }
      if(i<nrow) {
        if(j>1) {
          pop <- pop + d[i+1, j-1]
        }
        pop <- pop + d[i+1, j]
        if(j<ncol) {
          pop <- pop + d[i+1, j+1]
        }
      }
      dnext[i,j] <- (pop==3) || (d[i,j] && (pop>=2) && (pop<=3))
    }
  }
  dnext
}
We could probably speed the above code up by a factor of 2 to 4 by eliminating the if-statements, which requires writing 9 versions of the for-loops (depending on whether the index is at the right boundary, interior, or left boundary of its range). However, as we are about to see, this is not worth the effort.
A much faster implementation is the vector implementation found here.
life_step <- function(d) {
  # form the neighboring sums
  nrow <- dim(d)[[1]]
  ncol <- dim(d)[[2]]
  d_eu <- rbind(d[-1, , drop = FALSE], 0)
  d_ed <- rbind(0, d[-nrow, , drop = FALSE])
  d_le <- cbind(d[ , -1, drop = FALSE], 0)
  d_re <- cbind(0, d[ , -ncol, drop = FALSE])
  d_lu <- cbind(d_eu[ , -1, drop = FALSE], 0)
  d_ru <- cbind(0, d_eu[ , -ncol, drop = FALSE])
  d_ld <- cbind(d_ed[ , -1, drop = FALSE], 0)
  d_rd <- cbind(0, d_ed[ , -ncol, drop = FALSE])
  pop <- d_eu + d_ed + d_le + d_re + d_lu + d_ru + d_ld + d_rd
  d <- (pop==3) | (d & (pop>=2) & (pop<=3))
  d
}
The way this code works is: it builds 8 copies of the life-table, one shifting each neighboring cell into the current cell position. With these 8 new matrices the entire life forward evolution rule is computed, vectorized over all cells, using the expression "(pop==3) | (d & (pop>=2) & (pop<=3))". Notice the vectorized code is actually shorter: we handle the edge cases by zero-padding.
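The zero-padding shift trick can be seen in isolation on a tiny grid (our own illustration, with a hypothetical shift() helper): summing the eight shifted copies gives every cell its live-neighbor count in one vectorized pass.

```r
# shift matrix m by (di, dj), filling vacated cells with 0
shift <- function(m, di, dj) {
  nr <- nrow(m); nc <- ncol(m)
  out <- matrix(0, nr, nc)
  ri <- intersect(seq_len(nr), seq_len(nr) - di)
  ci <- intersect(seq_len(nc), seq_len(nc) - dj)
  out[ri + di, ci + dj] <- m[ri, ci]
  out
}

d <- matrix(c(0, 1, 0,
              1, 1, 0,
              0, 0, 0), nrow = 3, byrow = TRUE)

# sum the 8 shifted copies: pop[i, j] is the live-neighbor count of cell (i, j)
offsets <- expand.grid(di = -1:1, dj = -1:1)
offsets <- offsets[!(offsets$di == 0 & offsets$dj == 0), ]
pop <- Reduce(`+`, Map(function(i, j) shift(d, i, j), offsets$di, offsets$dj))
pop[2, 2]   # the center cell has 2 live neighbors
pop[1, 1]   # the top-left cell has 3 live neighbors
```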
The performance difference is substantial:
The vectorized code is about 10 times faster on average (details can be found here).
A simulation of this type produces figures such as the following:
Of course if you are serious about Conway’s Game of Life you would use specialized software (even in-browser JavaScript), and specialized algorithms (such as HashLife).
One objection is: the vectorized code uses more memory. To that I give the following famous quote:
The biggest difference between time and space is that you can’t reuse time.
-Merrick Furst
And that is our (toy) example of vectorizing code. Techniques such as these are why very fast and powerful code can in fact be written in R.
cdata package along with ggplot2's faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot?
A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base plot, you would use the pairs() function. Here is the base version of the pairs plot of the iris dataset:
pairs(iris[1:4],
main = "Anderson's Iris Data -- 3 species",
pch = 21,
bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])
There are other ways to do this, too:
# not run
library(ggplot2)
library(GGally)
ggpairs(iris, columns=1:4, aes(color=Species)) +
ggtitle("Anderson's Iris Data -- 3 species")
library(lattice)
splom(iris[1:4],
groups=iris$Species,
main="Anderson's Iris Data -- 3 species")
But I wanted to see if cdata was up to the task. So here we go….
First, load the packages:
library(ggplot2)
library(cdata)
To create the pairs plot in ggplot2, I need to reshape the data appropriately. For cdata, I need to specify what shape I want the data to be in, using a control table. See the last post for how the control table works. For this task, creating the control table is slightly more involved.
Here, I specify the variables I want to plot.
meas_vars <- colnames(iris)[1:4]
The function expand.grid() returns a data frame of all combinations of its arguments; in this case, I want all pairs of variables.
# the data.frame() call strips the attributes from
# the frame returned by expand.grid()
controlTable <- data.frame(expand.grid(meas_vars, meas_vars,
stringsAsFactors = FALSE))
# rename the columns
colnames(controlTable) <- c("x", "y")
# add the key column
controlTable <- cbind(
data.frame(pair_key = paste(controlTable[[1]], controlTable[[2]]),
stringsAsFactors = FALSE),
controlTable)
controlTable
## pair_key x y
## 1 Sepal.Length Sepal.Length Sepal.Length Sepal.Length
## 2 Sepal.Width Sepal.Length Sepal.Width Sepal.Length
## 3 Petal.Length Sepal.Length Petal.Length Sepal.Length
## 4 Petal.Width Sepal.Length Petal.Width Sepal.Length
## 5 Sepal.Length Sepal.Width Sepal.Length Sepal.Width
## 6 Sepal.Width Sepal.Width Sepal.Width Sepal.Width
## 7 Petal.Length Sepal.Width Petal.Length Sepal.Width
## 8 Petal.Width Sepal.Width Petal.Width Sepal.Width
## 9 Sepal.Length Petal.Length Sepal.Length Petal.Length
## 10 Sepal.Width Petal.Length Sepal.Width Petal.Length
## 11 Petal.Length Petal.Length Petal.Length Petal.Length
## 12 Petal.Width Petal.Length Petal.Width Petal.Length
## 13 Sepal.Length Petal.Width Sepal.Length Petal.Width
## 14 Sepal.Width Petal.Width Sepal.Width Petal.Width
## 15 Petal.Length Petal.Width Petal.Length Petal.Width
## 16 Petal.Width Petal.Width Petal.Width Petal.Width
The control table specifies that for every row of iris
, sixteen new rows get produced, one for each possible pair of variables. The column pair_key
will be the key column in the new data frame; there’s one key level for every possible pair of variables. The columns x
and y
will be the value columns in the new data frame — these will be the columns that we plot.
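To make the mechanics concrete, here is a base-R sketch (my own illustration, not the cdata implementation) of the expansion the control table describes, applied to just the first row of iris:

```r
# Base-R sketch of the expansion the control table specifies,
# applied to only the first row of iris (cdata does this for every row).
meas_vars <- colnames(iris)[1:4]
controlTable <- data.frame(expand.grid(meas_vars, meas_vars,
                                       stringsAsFactors = FALSE))
colnames(controlTable) <- c("x", "y")

row1 <- iris[1, ]
expanded <- data.frame(
  pair_key = paste(controlTable$x, controlTable$y),
  x = as.numeric(unlist(row1[controlTable$x])),
  y = as.numeric(unlist(row1[controlTable$y])),
  stringsAsFactors = FALSE)

nrow(expanded)  # one row of iris becomes sixteen rows
## [1] 16
head(expanded, 3)
```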
Now I can create the new data frame, using rowrecs_to_blocks()
. I’ll also carry along the Species
column to color the points in the plot.
iris_aug <- rowrecs_to_blocks(
iris,
controlTable,
columnsToCopy = "Species")
head(iris_aug)
## Species pair_key x y
## 1 setosa Sepal.Length Sepal.Length 5.1 5.1
## 2 setosa Sepal.Width Sepal.Length 3.5 5.1
## 3 setosa Petal.Length Sepal.Length 1.4 5.1
## 4 setosa Petal.Width Sepal.Length 0.2 5.1
## 5 setosa Sepal.Length Sepal.Width 5.1 3.5
## 6 setosa Sepal.Width Sepal.Width 3.5 3.5
Note that the data is now sixteen times larger, which I admit is perverse.
If I didn’t care about how the individual subplots were arranged, I’d be done: I’d plot y
vs x
, and facet_wrap
on pair_key
. But I want the subplots arranged in a grid. To do this I use facet_grid
, which will require two key columns:
splt <- strsplit(iris_aug$pair_key, split = " ", fixed = TRUE)
iris_aug$xv <- vapply(splt, function(si) si[[1]], character(1))
iris_aug$yv <- vapply(splt, function(si) si[[2]], character(1))
head(iris_aug)
## Species pair_key x y xv yv
## 1 setosa Sepal.Length Sepal.Length 5.1 5.1 Sepal.Length Sepal.Length
## 2 setosa Sepal.Width Sepal.Length 3.5 5.1 Sepal.Width Sepal.Length
## 3 setosa Petal.Length Sepal.Length 1.4 5.1 Petal.Length Sepal.Length
## 4 setosa Petal.Width Sepal.Length 0.2 5.1 Petal.Width Sepal.Length
## 5 setosa Sepal.Length Sepal.Width 5.1 3.5 Sepal.Length Sepal.Width
## 6 setosa Sepal.Width Sepal.Width 3.5 3.5 Sepal.Width Sepal.Width
And now I can produce the graph, using facet_grid
.
# reorder the key columns to be the same order
# as the base version above
iris_aug$xv <- factor(as.character(iris_aug$xv),
meas_vars)
iris_aug$yv <- factor(as.character(iris_aug$yv),
meas_vars)
ggplot(iris_aug, aes(x=x, y=y)) +
geom_point(aes(color=Species, shape=Species)) +
facet_grid(yv~xv, labeller = label_both, scales = "free") +
ggtitle("Anderson's Iris Data -- 3 species") +
scale_color_brewer(palette = "Dark2") +
ylab(NULL) +
xlab(NULL)
This pair plot has x = y
plots on the diagonals instead of the names of the variables, but you can confirm that it is otherwise the same as the pair plot produced by pairs()
.
Of course, calling pairs()
(or ggpairs()
, or splom()
) is a lot easier than all this, but now I’ve proven to myself that cdata
with ggplot2
can do the job. This version does have a few advantages. It comes with a legend by default, which is nice. And it’s not obvious how to change the color palette in ggpairs()
— I prefer the Brewer Dark2 palette, myself.
Luckily, this code is straightforward to wrap as a function, so if you like the cdata
version, I’ve now added the PairPlot()
function to WVPlots
. Now it’s a one-liner, too.
library(WVPlots)
PairPlot(iris,
colnames(iris)[1:4],
"Anderson's Iris Data -- 3 species",
group_var = "Species")