
More on sigr

If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form:

model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
(smod_lm <- summary(model_lm))
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

and also in the summary() object:

c(R2 = smod_lm$r.squared, F = smod_lm$fstatistic[1])

##          R2     F.value 
##   0.7599546 468.5501535

Note, though, that while the summary reports the model’s significance, it does not carry it as a specific summary() object item. sigr::wrapFTest() is a convenient way to extract the model’s R-squared and F statistic and simultaneously calculate the model significance, as is required by many scientific publications.
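For example, applied to the linear model above:

library(sigr)
(ft_lm <- wrapFTest(model_lm))

## [1] "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."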

sigr is even more helpful for logistic regression, via glm(), which reports neither the model’s chi-squared statistic nor its significance.

iris$isVersicolor <- iris$Species == "versicolor"

model_glm <- glm(
  isVersicolor ~ Sepal.Length + Sepal.Width,
  data = iris,
  family = binomial)

(smod_glm <- summary(model_glm))

## 
## Call:
## glm(formula = isVersicolor ~ Sepal.Length + Sepal.Width, family = binomial, 
##     data = iris)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9769  -0.8176  -0.4298   0.8855   2.0855  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    8.0928     2.3893   3.387 0.000707 ***
## Sepal.Length   0.1294     0.2470   0.524 0.600247    
## Sepal.Width   -3.2128     0.6385  -5.032 4.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.95  on 149  degrees of freedom
## Residual deviance: 151.65  on 147  degrees of freedom
## AIC: 157.65
## 
## Number of Fisher Scoring iterations: 5

To get the significance of a logistic regression model, call sigr::wrapChiSqTest():

library(sigr)
(chi2Test <- wrapChiSqTest(model_glm))

## [1] "Chi-Square Test summary: pseudo-R2=0.21 (X2(2,N=150)=39, p<1e-05)."

Notice that the fit summary also reports a pseudo-R-squared. You can extract the values directly off the sigr object, as well:

str(chi2Test)

## List of 10
##  $ test          : chr "Chi-Square test"
##  $ df.null       : int 149
##  $ df.residual   : int 147
##  $ null.deviance : num 191
##  $ deviance      : num 152
##  $ pseudoR2      : num 0.206
##  $ pValue        : num 2.92e-09
##  $ sig           : num 2.92e-09
##  $ delta_deviance: num 39.3
##  $ delta_df      : int 2
##  - attr(*, "class")= chr [1:2] "sigr_chisqtest" "sigr_statistic"
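The fields can then be used programmatically; for example, the model significance:

# the model significance, about 2.92e-09 as seen in str() above
chi2Test$pValue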

And of course you can render the sigr object into one of several formats (LaTeX, html, markdown, and ascii) for direct inclusion in a report or publication.

render(chi2Test, format = "html")

Chi-Square Test summary: pseudo-R2=0.21 (χ2(2,N=150)=39, p<1e-05).

By the way, if you are interested, we give the explicit formula for calculating the significance of a logistic regression model in Practical Data Science with R.
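As a preview, here is a minimal base-R sketch of the standard deviance-based calculation (consistent with the delta_deviance and delta_df fields sigr reported above):

# chi-squared statistic: the improvement in deviance over the null model
delta_deviance <- smod_glm$null.deviance - smod_glm$deviance
delta_df <- smod_glm$df.null - smod_glm$df.residual

# significance: upper tail of the chi-squared distribution
pchisq(delta_deviance, delta_df, lower.tail = FALSE)
# about 2.92e-09, matching chi2Test$pValue above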


R tip: Make Your Results Clear with sigr

R is designed to make working with statistical models fast, succinct, and reliable.

For instance building a model is a one-liner:

model <- lm(Petal.Length ~ Sepal.Length, data = iris)

And producing a detailed diagnostic summary of the model is also a one-liner:

summary(model)

# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = iris)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.47747 -0.59072 -0.00668  0.60484  2.49512 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
# Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 0.8678 on 148 degrees of freedom
# Multiple R-squared:   0.76,   Adjusted R-squared:  0.7583 
# F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

However, useful as the above is, it isn't exactly presentation-ready. To formally report the R-squared of our model we would have to cut and paste this information from the summary. That is a needlessly laborious and possibly error-prone step.

With the sigr package this can be made much easier:

library("sigr")
Rsquared <- wrapFTest(model)
print(Rsquared)

# [1] "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."

And this formal summary can be directly rendered into many formats (LaTeX, html, markdown, and ascii).

render(Rsquared, format="html")

F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05).

sigr can help make your publication workflow much easier and more repeatable/reliable.


coalesce with wrapr

coalesce is a classic and useful SQL operator that picks the first non-NULL value in a sequence of values.

We thought we would share a nice version of it for picking the first non-NA value in R, with convenient infix operator notation: wrapr::coalesce(), written %?%. Here is a short example of it in action:

library("wrapr")

NA %?% 0

# [1] 0
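
The operator works element-wise over vectors as well; a small sketch (assuming wrapr::coalesce()'s documented behavior of replacing NA positions with values from the right argument):

c(NA, 1, NA) %?% 0

# [1] 0 1 0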



R Tip: Give data.table a Try

If your R or dplyr work is taking what you consider to be too long (seconds instead of instant, minutes instead of seconds, hours instead of minutes, or a day instead of an hour), then try data.table.

For some tasks data.table is routinely faster than alternatives at pretty much all scales (example timings here).
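The syntax is compact once you get used to it; for instance, a grouped aggregation looks like this (a small sketch using the iris data for illustration):

library(data.table)

dt <- as.data.table(iris)
# mean petal length per species, computed by group
dt[, .(mean_petal_length = mean(Petal.Length)), by = Species]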

If your project is large (millions of rows, hundreds of columns) you really should rent an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.


R Tip: How to Pass a formula to lm

R tip: how to pass a formula to lm().

Often when modeling in R one wants to build up a formula outside of the modeling call. This allows the set of columns being used to be passed around as a vector of strings, and treated as data. Being able to treat controls (such as the set of variables to use) as manipulable values allows for very powerful automated modeling methods.
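A minimal base-R sketch of the idea (building the formula from character data with paste() and as.formula()):

outcome <- "Petal.Length"
variables <- c("Sepal.Length", "Sepal.Width")

# build the formula text from the column names, then convert it
f <- as.formula(paste(outcome, "~", paste(variables, collapse = " + ")))
format(f)
# [1] "Petal.Length ~ Sepal.Length + Sepal.Width"

model <- lm(f, data = iris)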



Speed up your R Work

Introduction

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel, which in turn uses wrapr::partition_tables to partition unrelated data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
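To give the flavor of the technique, here is a plain base-R sketch of the partition-and-parallelize idea (using the parallel package directly; this illustrates the concept, it is not the wrapr API itself):

library(parallel)

d <- data.frame(g = rep(1:4, each = 25), x = runif(100))

# partition the rows by the grouping column: each group stays together
parts <- split(d, d$g)

# distribute the pieces to worker processes, then re-assemble
cl <- makeCluster(2)
res <- parLapply(cl, parts, function(di) {
  di$sum_x <- sum(di$x)
  di
})
stopCluster(cl)
d_result <- do.call(rbind, res)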



seplyr 0.5.8 Now Available on CRAN

We are pleased to announce that seplyr version 0.5.8 is now available on CRAN.

seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now, with version 0.5.8) the tidyr package. The intent is to give the part-time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.
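
For a quick flavor of the interface, a small sketch (assuming seplyr's standard-evaluation variants, such as group_by_se() and summarize_se(), which take column names and expressions as character data):

library(seplyr)

grouping <- c("Species")   # the grouping columns, held as ordinary data

summarize_se(
  group_by_se(iris, grouping),
  c("mean_petal_length" := "mean(Petal.Length)"))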



R Tip: Be Wary of “…”

R Tip: be wary of "...".

The following code example contains an easy error in using the R function unique().

vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")
unique(vec1, vec2)
# [1] "a" "b" "c"

Notice none of the novel values from vec2 are present in the result. Our mistake was that we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice no error or warning was signaled: we used unique() incorrectly and nothing pointed this out to us. What compounded our error was R's "..." function signature feature.
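
The fix is to combine the vectors before calling unique(), or to use union(), which is designed for this:

unique(c(vec1, vec2))
# [1] "a" "b" "c" "d"

union(vec1, vec2)
# [1] "a" "b" "c" "d"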

In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign of carelessness, laziness, or weakness. I am well aware that every time I admit to making a mistake (and I have indeed made the above mistake) those who claim to never make mistakes have a laugh at my expense. Honestly, I feel the reason I see more mistakes is that I check a lot more.
