Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags 5 Comments on PDSwR2: New Chapters!

PDSwR2: New Chapters!

We have two new chapters of Practical Data Science with R, Second Edition online and available for review!

NewImage

The newly available chapters cover:

Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.

Choosing and Evaluating Models – The chapter starts with exploring machine learning approaches and then moves to studying key model evaluation topics like mapping business problems to machine learning tasks, evaluating model quality, and how to explain model predictions.

If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.

For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.

Posted on Categories data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , Leave a comment on vtreat Variable Importance

vtreat Variable Importance

vtreat‘s purpose is to produce pure numeric R data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing-values re-encoded with indicators, and high-degree categorical re-encode by effects codes or impact codes).

In this note we will discuss a small aspect of the vtreat package: variable screening.

Continue reading vtreat Variable Importance

Posted on Categories Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , ,

How to de-Bias Standard Deviation Estimates

This note is about attempting to remove the bias brought in by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is not important to remove it for reasonable sized samples, and (despite that) give a very complete bias management solution.

Continue reading How to de-Bias Standard Deviation Estimates

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, TutorialsTags , ,

The blocks and rows theory of data shaping

We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the cdata data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.

Rowrecs to blocks

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags ,

Designing Transforms for Data Reshaping with cdata

Authors: John Mount, and Nina Zumel 2018-10-25

As a followup to our previous post, this post goes a bit deeper into reasoning about data transforms using the cdata package. The cdata packages demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.

cdata adheres to the so-called "Rule of Representation":

Fold knowledge into data, so program logic can be stupid and robust.

The Art of Unix Programming, Erick S. Raymond, Addison-Wesley , 2003

The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.

We showed in the last post how cdata takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?

Let’s discuss that using the example from the previous post: "plotting the iris data faceted".

Continue reading Designing Transforms for Data Reshaping with cdata

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Statistics, TutorialsTags , , 1 Comment on Quick Significance Calculations for A/B Tests in R

Quick Significance Calculations for A/B Tests in R

Introduction

Let’s take a quick look at a very important and common experimental problem: checking if the difference in success rates of two Binomial experiments is statistically significant. This can arise in A/B testing situations such as online advertising, sales, and manufacturing.

We already share a free video course on a Bayesian treatment of planning and evaluating A/B tests (including a free Shiny application). Let’s now take a look at the should be simple task of simply building a summary statistic that includes a classic frequentist significance.

Continue reading Quick Significance Calculations for A/B Tests in R

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , ,

Modeling multi-category Outcomes With vtreat

vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition vtreat and can now effectively prepare data for multi-class classification or multinomial modeling.

Continue reading Modeling multi-category Outcomes With vtreat

Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, TutorialsTags , , , , 7 Comments on R Tip: Give data.table a Try

R Tip: Give data.table a Try

If your R or dplyr work is taking what you consider to be a too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour) then try data.table.

For some tasks data.table is routinely faster than alternatives at pretty much all scales (example timings here).

If your project is large (millions of rows, hundreds of columns) you really should rent an an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.

Posted on Categories data science, Pragmatic Data Science, TutorialsTags , , 2 Comments on Timings of a Grouped Rank Filter Task

Timings of a Grouped Rank Filter Task

Introduction

This note shares an experiment comparing the performance of a number of data processing systems available in R. Our notional or example problem is finding the top ranking item per group (group defined by three string columns, and order defined by a single numeric column). This is a common and often needed task.

Continue reading Timings of a Grouped Rank Filter Task

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, StatisticsTags , 4 Comments on Announcing Practical Data Science with R, 2nd Edition

Announcing Practical Data Science with R, 2nd Edition

We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R!

NewImage

Continue reading Announcing Practical Data Science with R, 2nd Edition