
# Category: Statistics

## PDSwR2 Free Excerpt and New Discount Code

## PDSwR2: New Chapters!

## Fully General Record Transforms with cdata

## Practical Data Science with R, 2nd Edition discount!

## What does it mean to write “vectorized” code in R?

## vtreat Variable Importance

## More on Bias Corrected Standard Deviation Estimates

## How to de-Bias Standard Deviation Estimates

## More on sigr

## R tip: Make Your Results Clear with sigr

Manning has a new discount code and a free excerpt of our book Practical Data Science with R, 2nd Edition: here.

This section is elementary, but things really pick up speed later on (also available in a paid preview).

We have two new chapters of *Practical Data Science with R, Second Edition* online and available for review!

The newly available chapters cover:

**Data Engineering And Data Shaping** – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, using data manipulation packages, and more.

**Choosing and Evaluating Models** – The chapter starts with exploring machine learning approaches and then moves to studying key model evaluation topics like mapping business problems to machine learning tasks, evaluating model quality, and how to explain model predictions.

If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of *Practical Data Science with R*, First Edition, as well as early access to chapter drafts of the second edition as we complete them.

For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.

One of the design goals of the `cdata` `R` package is that very powerful and arbitrary record transforms should be convenient and take only one or two steps. In fact, the goal is to take just about any record shape to any other in two steps: first convert to row-records, then re-block the data into arbitrary record shapes (please see here and here for the concepts).

But as with all general ideas, it is much easier to see what we mean by the above with a concrete example.
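To give a flavor of the two-step idea before the full example, here is a minimal sketch using base R's `reshape()` as a stand-in (the `cdata` operators themselves are not shown in this excerpt, and the data and column names below are invented for illustration):

```r
# "Block" records: one row per (student, subject) measurement.
d_blocks <- data.frame(
  student = c("a", "a", "b", "b"),
  subject = c("math", "art", "math", "art"),
  grade   = c(90, 80, 70, 60))

# Step 1: convert to row-records (one row per student).
d_rows <- reshape(d_blocks, direction = "wide",
                  idvar = "student", timevar = "subject")
# d_rows now has columns: student, grade.math, grade.art

# Drop reshape's bookkeeping attribute so the second call is independent.
attr(d_rows, "reshapeWide") <- NULL

# Step 2: re-block the row-records into a new shape (here, back to tall form).
d_new <- reshape(d_rows, direction = "long",
                 idvar = "student",
                 varying = c("grade.math", "grade.art"),
                 v.names = "grade", timevar = "subject",
                 times = c("math", "art"))
```

`cdata` generalizes this pattern with explicit control tables, so the source and destination record shapes can both be arbitrary.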

Please help share our news and this discount.

The second edition of our best-selling book *Practical Data Science with R*, by Zumel and Mount, is featured as deal of the day at Manning.

The second edition isn’t finished yet, but chapters 1 through 4 are available in the Manning Early Access Program (MEAP), and we have finished chapters 5 and 6 which are now in production at Manning (so they should be available soon). The authors are hard at work on chapters 7 and 8 right now.

The discount gets you half off. Also, the 2nd edition comes with a free e-copy of the first edition (so you can jump ahead).

Here are the details in Tweetable form:

Deal of the Day January 13: Half off Practical Data Science with R, Second Edition. Use code dotd011319au at http://bit.ly/2SKAxe9.

One often hears that `R` can not be fast (false), or more correctly that for fast code in `R` you may have to consider “vectorizing.”

A lot of knowledgeable `R` users are not comfortable with the term “vectorize”, and not really familiar with the method.

“Vectorize” is just a slightly high-handed way of saying: `R` naturally stores data in columns (or in column major order), so if you are not coding to that pattern you are fighting the language.

In this article we will make the above clear by working through a non-trivial example of writing vectorized code.
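As a small preview of the distinction (this toy example is our own, not the article's non-trivial one):

```r
x <- c(-2, -1, 0, 1, 2)

# Non-vectorized: an explicit loop over elements, fighting R's column layout.
clip_loop <- function(v) {
  for (i in seq_along(v)) {
    if (v[i] < 0) {
      v[i] <- 0
    }
  }
  v
}

# Vectorized: one operation applied to the whole column at once.
clip_vec <- function(v) {
  pmax(v, 0)
}

clip_vec(x)
# [1] 0 0 0 1 2
```

Both produce the same result, but the vectorized form pushes the per-element work down into `R`'s compiled internals.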


The `vtreat` package's purpose is to produce pure numeric `R` `data.frame`s that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing values re-encoded with indicators, and high-degree categorical variables re-encoded by effects codes or impact codes).

In this note we will discuss a small aspect of the `vtreat` package: variable screening.
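The underlying idea of significance-based variable screening can be sketched in base R (this is an illustrative simplification of ours, not `vtreat`'s actual implementation; the data and threshold here are invented):

```r
set.seed(2019)
n <- 200
d <- data.frame(
  x_signal = rnorm(n),   # related to the outcome
  x_noise  = rnorm(n))   # unrelated to the outcome
d$y <- d$x_signal + rnorm(n)

# Score a candidate variable by the F-test significance of a
# one-variable linear model predicting the outcome.
screen_var <- function(varname, data, outcome = "y") {
  f <- as.formula(paste(outcome, "~", varname))
  fs <- summary(lm(f, data = data))$fstatistic
  pf(fs[["value"]], fs[["numdf"]], fs[["dendf"]], lower.tail = FALSE)
}

sigs <- vapply(c("x_signal", "x_noise"), screen_var, numeric(1), data = d)

# Keep only variables whose significance clears a (user-chosen) threshold.
kept <- names(sigs)[sigs < 0.001]
```

The signal-carrying variable survives the screen; a pure-noise variable will almost always be filtered out at such a threshold.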

This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments.


This note is about attempting to remove the bias brought in by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is *not* important to remove it for reasonably sized samples, and (despite that) give a very complete bias management solution.
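For normally distributed data the bias has a classical closed form, E[sd(x)] = c4(n) σ, which we can compute directly (this standard formula is background we are adding here, not a quote from the note):

```r
# Classical small-sample correction factor for the standard deviation of
# normally distributed data: c4(n) = sqrt(2/(n-1)) * gamma(n/2) / gamma((n-1)/2).
c4 <- function(n) {
  # Use lgamma() to avoid overflow for larger n.
  sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))
}

c4(2)   # sqrt(2/pi), about 0.798
c4(10)  # about 0.973 -- the bias shrinks quickly as n grows

# A de-biased estimate divides the sample sd by c4(n):
x <- c(4.1, 5.2, 6.3, 5.0, 4.9)
sd_unbiased <- sd(x) / c4(length(x))
```

Note how close c4(n) already is to 1 for even modest n, which is why the correction is rarely important in practice.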


If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the `lm()` summary object does in fact carry the R-squared and F statistics, both in the printed form:

```r
model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
(smod_lm <- summary(model_lm))
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.47747 -0.59072 -0.00668  0.60484  2.49512
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:  0.76,  Adjusted R-squared:  0.7583
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
```

and also in the `summary()` object:

```r
c(R2 = smod_lm$r.squared, F = smod_lm$fstatistic[1])
##        R2   F.value
## 0.7599546 468.5501535
```

Note, though, that while the summary *reports* the model’s significance, it does not carry it as a specific `summary()` object item. `sigr::wrapFTest()` is a convenient way to extract the model’s R-squared and F statistic *and* simultaneously calculate the model significance, as is required by many scientific publications.
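For the curious, that significance can also be checked by hand from the F statistic the summary does store, using only base R's `pf()`:

```r
# Recompute the model significance from the summary's stored F statistic.
model_lm <- lm(Petal.Length ~ Sepal.Length, data = iris)
fs <- summary(model_lm)$fstatistic  # named vector: value, numdf, dendf
p_value <- pf(fs[["value"]], fs[["numdf"]], fs[["dendf"]],
              lower.tail = FALSE)
p_value < 1e-05  # agrees with the "p<1e-05" sigr reports
```

`sigr` packages exactly this sort of bookkeeping so you do not have to repeat it by hand.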

`sigr` is even more helpful for logistic regression, via `glm()`, which reports neither the model’s chi-squared statistic nor its significance.

```r
iris$isVersicolor <- iris$Species == "versicolor"
model_glm <- glm(
  isVersicolor ~ Sepal.Length + Sepal.Width,
  data = iris,
  family = binomial)
(smod_glm <- summary(model_glm))
##
## Call:
## glm(formula = isVersicolor ~ Sepal.Length + Sepal.Width, family = binomial,
##     data = iris)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9769  -0.8176  -0.4298   0.8855   2.0855
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)    8.0928     2.3893   3.387 0.000707 ***
## Sepal.Length   0.1294     0.2470   0.524 0.600247
## Sepal.Width   -3.2128     0.6385  -5.032 4.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 190.95  on 149  degrees of freedom
## Residual deviance: 151.65  on 147  degrees of freedom
## AIC: 157.65
##
## Number of Fisher Scoring iterations: 5
```

To get the significance of a logistic regression model, call `sigr::wrapChiSqTest()`:

```r
library(sigr)
(chi2Test <- wrapChiSqTest(model_glm))
## [1] "Chi-Square Test summary: pseudo-R2=0.21 (X2(2,N=150)=39, p<1e-05)."
```
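The chi-squared statistic here is just the drop in deviance between the null and fitted models, so the significance can also be checked by hand with base R's `pchisq()`:

```r
# Recompute the chi-squared significance from the glm summary.
iris$isVersicolor <- iris$Species == "versicolor"
model_glm <- glm(isVersicolor ~ Sepal.Length + Sepal.Width,
                 data = iris, family = binomial)
smod <- summary(model_glm)
delta_deviance <- smod$null.deviance - smod$deviance
delta_df <- smod$df.null - smod$df.residual
p_value <- pchisq(delta_deviance, df = delta_df, lower.tail = FALSE)
# delta_deviance is about 39.3 on 2 degrees of freedom,
# giving a p_value of about 2.9e-09, matching sigr's report.
```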

Notice that the fit summary also reports a pseudo-R-squared. You can extract the values directly off the `sigr` object, as well:

```r
str(chi2Test)
## List of 10
##  $ test          : chr "Chi-Square test"
##  $ df.null       : int 149
##  $ df.residual   : int 147
##  $ null.deviance : num 191
##  $ deviance      : num 152
##  $ pseudoR2      : num 0.206
##  $ pValue        : num 2.92e-09
##  $ sig           : num 2.92e-09
##  $ delta_deviance: num 39.3
##  $ delta_df      : int 2
##  - attr(*, "class")= chr [1:2] "sigr_chisqtest" "sigr_statistic"
```

And of course you can render the `sigr` object into one of several formats (LaTeX, HTML, markdown, and ASCII) for direct inclusion in a report or publication.

```r
render(chi2Test, format = "html")
```

**Chi-Square Test** summary: *pseudo-R^{2}*=0.21 (*X^{2}*(2, *N*=150)=39, *p*<1e-05).

By the way, if you are interested, we give the explicit formula for calculating the significance of a logistic regression model in *Practical Data Science with R*.

R is designed to make working with statistical models fast, succinct, and reliable.

For instance building a model is a one-liner:

```r
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
```

And producing a detailed diagnostic summary of the model is also a one-liner:

```r
summary(model)
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = iris)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -2.47747 -0.59072 -0.00668  0.60484  2.49512
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
# Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.8678 on 148 degrees of freedom
# Multiple R-squared:  0.76, Adjusted R-squared:  0.7583
# F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
```

However, useful as the above is, it isn’t exactly presentation-ready. To formally report the R-squared of our model we would have to cut and paste this information from the summary. That is a needlessly laborious and possibly error-prone step.

With the `sigr` package this can be made much easier:

```r
library("sigr")
Rsquared <- wrapFTest(model)
print(Rsquared)
# [1] "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."
```

And this formal summary can be directly rendered into many formats (LaTeX, HTML, markdown, and ASCII).

```r
render(Rsquared, format = "html")
```

**F Test** summary: (*R^{2}*=0.76, *F*(1,148)=468.6, *p*<1e-05).

`sigr` can help make your publication workflow much easier and more repeatable/reliable.