Posted on Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , Leave a comment on Encoding categorical variables: one-hot and beyond

Encoding categorical variables: one-hot and beyond

(or: how to correctly use xgboost from R)

R has "one-hot" encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.

For example we can see evidence of one-hot encoding in the variable names chosen by a linear regression:

dTrain <-  data.frame(x= c('a','b','b', 'c'),
                      y= c(1, 2, 1, 2))
summary(lm(y~x, data= dTrain))
## 
## Call:
## lm(formula = y ~ x, data = dTrain)
## 
## Residuals:
##          1          2          3          4 
## -2.914e-16  5.000e-01 -5.000e-01  2.637e-16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.7071   1.414    0.392
## xb            0.5000     0.8660   0.577    0.667
## xc            1.0000     1.0000   1.000    0.500
## 
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:   -0.5 
## F-statistic:   0.5 on 2 and 1 DF,  p-value: 0.7071

Continue reading Encoding categorical variables: one-hot and beyond

Posted on Categories data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , Leave a comment on Teaching pivot / un-pivot

Teaching pivot / un-pivot

Authors: John Mount and Nina Zumel

Introduction

In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.

One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering“) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.

Please read on for our thoughts on teaching pivoting data. Continue reading Teaching pivot / un-pivot

Posted on Categories data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , , 1 Comment on Coordinatized Data: A Fluid Data Specification

Coordinatized Data: A Fluid Data Specification

Authors: John Mount and Nina Zumel.

Introduction

It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).

Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.

In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in R.

Continue reading Coordinatized Data: A Fluid Data Specification

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , Leave a comment on Practical Data Science with R: ACM SIGACT News Book Review and Discount!

Practical Data Science with R: ACM SIGACT News Book Review and Discount!

Our book Practical Data Science with R has just been reviewed in Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory (ACM SIGACT) News by Dr. Allan M. Miller (U.C. Berkeley)!


NewImage

The book is half off at Manning March 21st 2017 using the following code (please share/Tweet):

Deal of the Day March 21: Half off my book Practical Data Science with R. Use code dotd032117au at https://www.manning.com/dotd

Please read on for links and excerpts from the review. Continue reading Practical Data Science with R: ACM SIGACT News Book Review and Discount!

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine LearningTags , , , , Leave a comment on Practical Data Science with R errata update: Java SQLScrewdriver replaced by R procedures and article

Practical Data Science with R errata update: Java SQLScrewdriver replaced by R procedures and article

We have updated the errata for Practical Data Science with R to reflect that it is no longer worth the effort to use the Java version of SQLScrewdriver as described.

Screwdriver

We are very sorry for any confusion, trouble, or wasted effort bringing in Java software (something we are very familiar with, but forget not everybody uses) has caused readers. Also, database adapters for R have greatly improved, so we feel more confident depending on them alone. Practical Data Science with R remains an excellent book and a good resource to learn from that we are very proud of and fully support (hence errata). Continue reading Practical Data Science with R errata update: Java SQLScrewdriver replaced by R procedures and article

Posted on Categories Exciting Techniques, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , , , Leave a comment on Step-Debugging magrittr/dplyr Pipelines in R with wrapr and replyr

Step-Debugging magrittr/dplyr Pipelines in R with wrapr and replyr

In this screencast we demonstrate how to easily and effectively step-debug magrittr/dplyr pipelines in R using wrapr and replyr.



Continue reading Step-Debugging magrittr/dplyr Pipelines in R with wrapr and replyr

Posted on Categories Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on vtreat: prepare data

vtreat: prepare data

This article is on preparing data for modeling in R using vtreat.

Vtreat Continue reading vtreat: prepare data

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Programming, Statistics, TutorialsTags , , , , 16 Comments on The Zero Bug

The Zero Bug

I am going to write about an insidious statistical, data analysis, and presentation fallacy I call “the zero bug” and the habits you need to cultivate to avoid it.


The zero bug

The zero bug

Here is the zero bug in a nutshell: common data aggregation tools often can not “count to zero” from examples, and this causes problems. Please read on for what this means, the consequences, and how to avoid the problem. Continue reading The Zero Bug

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , , 3 Comments on A Theory of Nested Cross Simulation

A Theory of Nested Cross Simulation

[Reader’s Note. Some of our articles are applied and some of our articles are more theoretical. The following article is more theoretical, and requires fairly formal notation to even work through. However, it should be of interest as it touches on some of the fine points of cross-validation that are quite hard to perceive or discuss without the notational framework. We thought about including some “simplifying explanatory diagrams” but so many entities are being introduced and manipulated by the processes we are describing we found equation notation to be in fact cleaner than the diagrams we attempted and rejected.]

Please consider either of the following common predictive modeling tasks:

  • Picking hyper-parameters, fitting a model, and then evaluating the model.
  • Variable preparation/pruning, fitting a model, and then evaluating the model.

In each case you are building a pipeline where “y-aware” (or outcome aware) choices and transformations made at each stage affect later stages. This can introduce undesirable nested model bias and over-fitting.

Our current standard advice to avoid nested model bias is either:

  • Split your data into 3 or more disjoint pieces, such as separate variable preparation/pruning, model fitting, and model evaluation.
  • Reserve a test-set for evaluation and use “simulated out of sample data” or “cross-frame”/“cross simulation” techniques to simulate dividing data among the first two model construction stages.

The first practice is simple and computationally efficient, but statistically inefficient. This may not matter if you have a lot of data, as in “big data”. The second procedure is more statistically efficient, but is also more complicated and has some computational cost. For convenience the cross simulation method is supplied as a ready to go procedure in our R data cleaning and preparation package vtreat.

What would it look like if we insisted on using cross simulation or simulated out of sample techniques for all three (or more) stages? Please read on to find out.

CleanAllTheThings

Hyperbole and a Half copyright Allie Brosh (use allowed in some situations with attribution)

Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, StatisticsTags , , , , , 4 Comments on Data Preparation, Long Form and tl;dr Form

Data Preparation, Long Form and tl;dr Form

Data preparation and cleaning are some of the most important steps of predictive analytic and data science tasks. They are laborious, where most of the errors are made, your last line of defense against a wild data, and hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self contained or genteel as tweaking machine learning models or hyperparameter tuning; and that is one of the reasons data preparation represents such an important practical opportunity for improvement.


NewImage

Photo: NY – http://nyphotographic.com/, License: Creative Commons 3 – CC BY-SA 3.0

Our group is distributing a detailed writeup of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form. Continue reading Data Preparation, Long Form and tl;dr Form