
Upcoming speaking engagements

I have a couple of public appearances coming up soon.



cdata Update

The R package cdata now has version 0.7.0 available from CRAN.

cdata is a data manipulation package that subsumes many higher-order data manipulation operations, such as pivot/un-pivot, spread/gather, and cast/melt. Record-to-record transforms are specified by drawing a table that expresses the record structure (called the “control table”), which also links the key concepts of row-records and block-records.

What can be quickly specified and achieved using these concepts and notations is amazing and quite teachable. These transforms can be run in-memory or in remote database or big-data systems (such as Spark).
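To illustrate the idea without the package, here is the un-pivot direction (row-records to block-records) in plain base R, using made-up measurement columns; cdata specifies this same transform declaratively via the control table:

```r
# One record per row ("row records"): wide form.
d_wide <- data.frame(
  id  = c(1, 2),
  AUC = c(0.6, 0.8),
  R2  = c(0.2, 0.5)
)

# Un-pivot to "block records": one (measure, value) row per cell.
d_blocks <- data.frame(
  id      = rep(d_wide$id, each = 2),
  measure = rep(c("AUC", "R2"), times = nrow(d_wide)),
  value   = as.vector(t(as.matrix(d_wide[, c("AUC", "R2")])))
)
print(d_blocks)
# d_blocks has 4 rows: (1, AUC, 0.6), (1, R2, 0.2),
#                      (2, AUC, 0.8), (2, R2, 0.5)
```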

The concepts are taught in Nina Zumel’s excellent tutorial.



And in John Mount’s quick screencast/lecture (link, slides).

The 0.7.0 update adds local versions of the operators in addition to the Spark and database implementations. These methods should now be a bit safer for in-memory complex/annotated types such as dates and times.


Four Years of Practical Data Science with R

Four years ago today, authors Nina Zumel and John Mount received their author’s copies of Practical Data Science with R!




R Tip: Use the vtreat Package For Data Preparation

If you are working with predictive modeling or machine learning in R, this is the R tip that will save you the most time and deliver the biggest improvement in your results.

R Tip: Use the vtreat package for data preparation in predictive analytics and machine learning projects.


When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:

  • Missing, invalid, or out of range values.
  • Categorical variables with large sets of possible levels.
  • Novel categorical levels discovered during test, cross-validation, or model application/deployment.
  • Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
  • Nested model bias, which can poison results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real-world projects encounter all of these issues, which are often ignored, leading to degraded performance in production.

vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.

vtreat can fix or mitigate these domain independent issues much more reliably and much faster than by-hand ad-hoc methods.
This leaves the data scientist or analyst more time to research and apply critical domain dependent (or knowledge based) steps and checks.

If you are attempting high-value predictive modeling in R, you should try out vtreat and consider adding it to your workflow.
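A minimal sketch of the basic workflow (assuming vtreat is installed; the tiny data frame and column names here are invented for illustration, and the produced column names vary by vtreat version):

```r
library(vtreat)

# Small training frame exhibiting the problems listed above:
# a numeric with missing values, a categorical with a rare level.
d <- data.frame(
  x = c(1, 2, NA, 4, 5, NA),
  c = c("a", "a", "b", "b", "z", "a"),
  y = c(1, 1, 0, 1, 0, 0)
)

# Design a treatment plan for a numeric outcome.
plan <- designTreatmentsN(d, varlist = c("x", "c"),
                          outcomename = "y", verbose = FALSE)

# Apply the plan: all derived columns are numeric and free of NAs.
treated <- prepare(plan, d)
print(colnames(treated))
```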



Wanted: cdata Test Pilots

I need a few volunteers to please “test pilot” the development version of the R package cdata.

Jackie Cochran at 1938 Bendix Race
Jacqueline Cochran: at the time of her death, no other pilot held more speed, distance, or altitude records in aviation history than Cochran.



We Want to be Playing with a Moderate Number of Powerful Blocks

Many data scientists (and even statisticians) often suffer under one of the following misapprehensions:

  • They believe a technique doesn’t work in their current situation (when in fact it does), leading to useless precautions and missed opportunities.
  • They believe a technique does work in their current situation (when in fact it does not), leading to failed experiments or incorrect results.

I feel this happens less often if you are working with observable and composable tools of the proper scale. Somewhere between monolithic all-in-one systems and ad-hoc one-off coding lies a cognitive sweet spot where great work can be done.



Is 10,000 Cells Big?

Trick question: is a 10,000 cell numeric data.frame big or small?

In the era of "big data" 10,000 cells is minuscule. Such data could be fit on fewer than 1,000 punched cards (or less than half a box).



The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later.
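For scale, a quick base R check of what 10,000 numeric cells actually cost in memory:

```r
# A 100 row by 100 column numeric data.frame: 10,000 cells.
d <- as.data.frame(matrix(runif(10000), nrow = 100))

# Raw payload is 10,000 doubles * 8 bytes = 80,000 bytes,
# plus modest data.frame overhead.
print(object.size(d))
```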



Latest vtreat up on CRAN

There is a new version of the R package vtreat now up on CRAN.

vtreat is an essential data preparation system for predictive modeling that helps defend your predictive modeling work against real world data issues including:

  • High cardinality categorical variables
  • Rare levels (including new or novel levels during application) in categorical variables
  • Missing data (random or systematic)
  • Irrelevant variables/columns
  • Nested model bias and other over-fit issues
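The novel-level issue in particular is worth a quick sketch (assuming vtreat is installed; the tiny data frames here are invented for illustration). A treatment plan designed on training data handles application-time levels the model has never seen:

```r
library(vtreat)

# Train-time data: categorical x with levels "a" and "b" only.
d_train <- data.frame(x = c("a", "a", "b", "b"),
                      y = c(1, 0, 1, 0))
plan <- designTreatmentsN(d_train, varlist = "x",
                          outcomename = "y", verbose = FALSE)

# Application-time data contains the novel level "c":
# prepare() still returns purely numeric, NA-free columns
# instead of failing or producing NAs.
d_app <- data.frame(x = c("a", "c"), y = c(0, 1))
treated <- prepare(plan, d_app)
print(colnames(treated))
```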

vtreat also includes excellent, citable, documentation: vtreat: a data.frame Processor for Predictive Modeling.

For this release I want to thank everybody who generously donated their time to submit an issue or a pull request. In particular:

  • Vadim Khotilovich, who found and fixed a major performance problem in the y-stratified sampling.
  • Lawrence Wu, who has been donating documentation fixes.
  • Peter Hurford, who has been donating documentation fixes.

Data Reshaping with cdata

I’ve just shared a short webcast on data reshaping in R using the cdata package.

(link)

We also have two really nifty articles on the theory and methods:

Please give it a try!

This is the material I recently presented at the January 2017 BARUG Meetup.



Base R can be Fast

“Base R” (call it “Pure R” or “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that “packages written in C/C++ are (edit: ‘always’) faster than R code.”

The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.

Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.
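As a small self-contained illustration of the point (not one of the benchmarks from the post), compare a vectorized base R aggregation against a naive per-group loop:

```r
# Base R only: grouped sums two ways on a made-up data set.
set.seed(2018)
n <- 1e5
d <- data.frame(g = sample.int(100, n, replace = TRUE),
                v = runif(n))

# Vectorized base R: a single rowsum() call.
t_fast <- system.time(s1 <- rowsum(d$v, d$g))[["elapsed"]]

# Naive per-group loop: same answer, far more work.
t_slow <- system.time({
  gs <- sort(unique(d$g))
  s2 <- vapply(gs, function(g) sum(d$v[d$g == g]), numeric(1))
})[["elapsed"]]

stopifnot(isTRUE(all.equal(as.numeric(s1), s2)))
print(c(fast_seconds = t_fast, slow_seconds = t_slow))
```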
