Quick Significance Calculations for A/B Tests in R

Introduction

Let’s take a quick look at a very important and common experimental problem: checking if the difference in success rates of two Binomial experiments is statistically significant. This can arise in A/B testing situations such as online advertising, sales, and manufacturing.

We already share a free video course on a Bayesian treatment of planning and evaluating A/B tests (including a free Shiny application). Let’s now take a look at the should-be-simple task of building a summary statistic that includes a classic frequentist significance.
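As a rough illustration of the kind of summary described above, here is a minimal sketch of a pooled two-proportion z-test. It is written in Python with only the standard library rather than in R, and it is an assumption about one reasonable frequentist summary, not the calculation used in the full post. The function name `two_proportion_z_test` and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-sided z-test for a difference in two Binomial success rates.

    Uses the normal approximation, so it is only reasonable when both
    groups have a fair number of successes and failures.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Under the null hypothesis both groups share one success rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("pooled rate is 0 or 1; the z-test is undefined")
    z = (rate_b - rate_a) / se
    # Two-sided normal tail: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "difference": rate_b - rate_a,
        "z": z,
        "p_value": p_value,
    }


# Made-up counts: 100 of 1000 conversions for A, 130 of 1000 for B.
summary = two_proportion_z_test(successes_a=100, n_a=1000,
                                successes_b=130, n_b=1000)
print(summary)
```

The summary reports the observed rates and their difference alongside the p-value, so the size of the effect is visible next to its significance.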

Modeling multi-category Outcomes With vtreat

vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition, vtreat can now effectively prepare data for multi-class classification or multinomial modeling.

Practical Data Science with R2

The secret is out: Nina Zumel and I are busy working on Practical Data Science with R2, the second edition of our best-selling book on learning data science using the R language.

Our publisher, Manning, has a great slide deck describing the book (and a discount code!!!) here:

Pdsr2s

We also just got back our part-1 technical review for the new book. Here is a quote from the technical review we are particularly proud of:

The dot notation for base R and the dplyr package did make me stand up and think. Certain things suddenly made sense.

More Practical Data Science with R Book News

Some more Practical Data Science with R news.

Practical Data Science with R is the book we wish we had when we started in data science. Practical Data Science with R, Second Edition is the revision of that book with the packages we wish had been available at that time (in particular vtreat, cdata, and wrapr). A second edition also lets us correct some omissions, such as not demonstrating data.table.

For your part: please help us get the word out about this book. Practical Data Science with R, Second Edition, R in Action, Second Edition, and Think Like a Data Scientist are Manning’s August 20th 2018 “Deal of the Day” (use code dotd082018au at https://www.manning.com/dotd).

For our part, we are busy revising chapters and setting up a new GitHub repository for examples, code, and other reader resources.

Announcing Practical Data Science with R, 2nd Edition

We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R!

Meta-packages, nails in CRAN’s coffin

Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking on the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.

For example: tidyverse advertises a popular R universe where the vital package data.table never existed.

And now tidymodels is shaping up to be a popular universe where our own package vtreat never existed, except possibly as a footnote to embed.

Users currently (with some luck) discover packages like ours and then (because they trust CRAN) feel able to try them. With popular walled gardens that becomes much less likely. It is one thing for a standard package to duplicate another package (it is actually hard to avoid, and how work legitimately competes), it is quite another for a big-brand meta-package to pre-pick winners (and losers).

All I can say is: please give vtreat a chance and a try. It is a package for preparing messy real-world data for predictive modeling. In addition to re-coding high cardinality categorical variables (into what we call effect-codes after Cohen, or impact-codes), it deals with missing values, can be parallelized, can be run on databases, and has years of production experience baked in.
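To make the idea of an impact-code concrete, here is a minimal sketch for a numeric outcome: each level of a categorical variable is replaced by the difference between the mean outcome for that level and the overall mean. This is a naive Python illustration under stated assumptions, not vtreat's implementation, which adds protections against over-fit that are omitted here. The function names `impact_code` and `apply_impact_code` and the example data are hypothetical.

```python
from collections import defaultdict


def impact_code(levels, outcomes):
    """Map each level to (mean outcome for that level) - (grand mean).

    Naive version: no smoothing and no cross-validation, so rare levels
    get noisy codes and the codes are over-fit to the data they came from.
    """
    grand_mean = sum(outcomes) / len(outcomes)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, outcomes):
        totals[level] += y
        counts[level] += 1
    return {level: totals[level] / counts[level] - grand_mean
            for level in totals}


def apply_impact_code(codes, levels):
    """Encode levels as numbers; levels never seen before get 0.0,
    meaning no evidence of a difference from the grand mean."""
    return [codes.get(level, 0.0) for level in levels]


# Made-up data: a categorical variable and a numeric outcome.
zip_codes = ["94110", "94110", "94607", "94607", "94607", "10001"]
sales = [10.0, 14.0, 4.0, 6.0, 5.0, 9.0]

codes = impact_code(zip_codes, sales)
encoded = apply_impact_code(codes, ["94607", "00000"])
print(codes)
print(encoded)
```

The single numeric column replaces what would otherwise be one indicator column per level, which is why this style of re-coding is attractive for high cardinality variables.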

Some places to start with vtreat:

John Mount speaking on rquery and rqdatatable

rquery and rqdatatable are new R packages for data wrangling, either at scale (in databases, or big data systems such as Apache Spark) or in-memory. The packages speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks.


Win-Vector LLC’s John Mount will be speaking on the rquery and rqdatatable packages at the East Bay R Language Beginners Group on Tuesday, August 7, 2018 (Oakland, CA).

Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

We here at Win-Vector LLC have some really big news, and we would like the R community’s help in sharing it.

vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark.

vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.

Thanks to the rquery package, this data preparation transform can now be directly applied to databases or big data systems such as PostgreSQL, Amazon Redshift, Apache Spark, or Google BigQuery. Or, thanks to the data.table and rqdatatable packages, even fast large in-memory transforms are possible.

We have some basic examples of the new vtreat capabilities here and here.

rqdatatable: rquery Powered by data.table

rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Spark. rqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the data.table package.

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).

Talking about clinical significance

In statistical work in the age of big data we often get hung up on differences that are statistically significant (reliable enough to show up again and again in repeated measurements), but clinically insignificant (visible in aggregation, but too small to make any real difference to individuals).

An example would be: a diet that changes individual weight by an ounce on average with a standard deviation of a pound. With a large enough population the diet is statistically significant. It could also be used to shave an ounce off a national average weight. But, for any one individual: this diet is largely pointless.

The concept is teachable, but we have always stumbled over the naming: “statistical significance” versus “practical clinical significance.”

I am suggesting trying the word “substantial” (and its antonym “insubstantial”) to describe if changes are physically small or large.

This comes down to having to remind people that “p-values are not effect sizes”. In this article we recommended reporting three statistics: a units-based effect size (such as expected delta pounds), a dimensionless effect size (such as Cohen’s d), and a measure of the reliability of the experiment (such as a statistical significance, which at best measures only one possible risk: re-sampling risk).
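The diet example above can be worked through to show the three statistics side by side. This is a minimal Python sketch, assuming each individual's weight change is measured directly (a one-sample design) and that a normal approximation is adequate. The function name `effect_summary` and the sample sizes are hypothetical.

```python
import math

OUNCES_PER_POUND = 16.0


def effect_summary(mean_change_oz, sd_change_oz, n):
    """Report a units-based effect size, Cohen's d, and a two-sided
    p-value for a mean per-person change measured on n individuals."""
    # Units-based effect size: expected change in pounds.
    delta_pounds = mean_change_oz / OUNCES_PER_POUND
    # Dimensionless effect size: mean change in units of standard deviations.
    cohens_d = mean_change_oz / sd_change_oz
    # One-sample z statistic: the mean divided by its standard error.
    z = cohens_d * math.sqrt(n)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {"delta_pounds": delta_pounds, "cohens_d": cohens_d,
            "z": z, "p_value": p_value}


# The diet: an average loss of one ounce, standard deviation of one pound.
small_study = effect_summary(-1.0, OUNCES_PER_POUND, n=100)
large_study = effect_summary(-1.0, OUNCES_PER_POUND, n=10000)
print(small_study)
print(large_study)
```

Both studies report the same effect sizes (one sixteenth of a pound, and a Cohen's d of one sixteenth in magnitude). Only the p-value changes with the population size, which is the sense in which a p-value is not an effect size.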

The merit is: if we don’t confound different meanings, we may be less confusing. A downside is: some of these measures are a bit technical to discuss. I’d be interested in hearing opinions and about teaching experiences along these distinctions.