Encoding categorical variables: one-hot and beyond

(or: how to correctly use xgboost from R)

R has "one-hot" encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.

For example, we can see evidence of one-hot encoding in the variable names chosen by a linear regression:

dTrain <-  data.frame(x= c('a','b','b', 'c'),
                      y= c(1, 2, 1, 2))
summary(lm(y~x, data= dTrain))
## 
## Call:
## lm(formula = y ~ x, data = dTrain)
## 
## Residuals:
##          1          2          3          4 
## -2.914e-16  5.000e-01 -5.000e-01  2.637e-16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.7071   1.414    0.392
## xb            0.5000     0.8660   0.577    0.667
## xc            1.0000     1.0000   1.000    0.500
## 
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:   -0.5 
## F-statistic:   0.5 on 2 and 1 DF,  p-value: 0.7071
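
The coefficient names `xb` and `xc` above come from indicator columns that R builds behind the scenes. As a minimal sketch (base R only; the variable names `mmTreatment` and `mmOneHot` are ours, chosen for illustration), `model.matrix()` can make those hidden columns visible:

```r
# Same toy data as in the lm() example above.
dTrain <- data.frame(x = c('a', 'b', 'b', 'c'),
                     y = c(1, 2, 1, 2),
                     stringsAsFactors = TRUE)

# The design matrix for y ~ x: an intercept plus one indicator
# column for each non-reference level of x.
mmTreatment <- model.matrix(y ~ x, data = dTrain)
print(colnames(mmTreatment))

# Dropping the intercept yields one indicator column per level,
# which is the usual meaning of a "one-hot" encoding.
mmOneHot <- model.matrix(~ 0 + x, data = dTrain)
print(mmOneHot)
```

Each row of `mmOneHot` has exactly one non-zero entry. An explicit numeric matrix of this kind is what one typically has to build by hand for modeling tools that do not accept factors directly.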

Teaching pivot / un-pivot

Authors: John Mount and Nina Zumel

Introduction

When teaching students to think in terms of coordinatized data, we find the hardest operations to teach are joins and pivots.

One thing we commented on is that moving data values into rows, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering”), is easy to explain, as the operation is a function that takes a single row and builds a group of new rows in an obvious manner. We commented that the inverse operation of moving data into columns, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”), is harder to explain, as it takes a specific group of rows and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations, we find pivoting can be usefully conceptualized as a simple single-row to single-row mapping followed by a grouped aggregation.
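
The factoring of pivot described above can be sketched in base R. This is an illustration under our own assumptions, not the implementation from the article: the data frame `thin`, its columns `id`, `measure`, and `value`, and the helper `pickValue` are all hypothetical.

```r
# Hypothetical thin (entity/attribute/value) data: one measurement per row.
thin <- data.frame(id = c(1, 1, 2, 2),
                   measure = c('height', 'weight', 'height', 'weight'),
                   value = c(1.7, 65, 1.8, 80),
                   stringsAsFactors = FALSE)

# Step 1: single-row to single-row mapping. Each thin row becomes one
# wide row holding its value in the column named by 'measure' and NA
# in every other measurement column.
measures <- sort(unique(thin$measure))
expanded <- data.frame(id = thin$id)
for (m in measures) {
  expanded[[m]] <- ifelse(thin$measure == m, thin$value, NA_real_)
}

# Step 2: grouped aggregation. Rows sharing an id are collapsed by
# keeping the first non-NA entry in each column.
pickValue <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) NA_real_ else v[[1]]
}
wide <- aggregate(expanded[measures],
                  by = list(id = expanded$id),
                  FUN = pickValue)
print(wide)
```

Step 1 never looks at more than one row at a time, so it is as easy to explain as un-pivoting. All of the cross-row work is isolated in the aggregation of step 2.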

Please read on for our thoughts on teaching pivoting data.

You can’t do that in statistics

There are a number of statistical principles that are perhaps more honored in the breach than in the observance. For fun I am going to name a few, and show why they are not always the “precision surgical knives of thought” one would hope for (working more like large hammers).

Coordinatized Data: A Fluid Data Specification

Authors: John Mount and Nina Zumel.

Introduction

It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion between row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).

Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.

In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in R.

Debugging Pipelines in R with Bizarro Pipe and Eager Assignment

This is a note on debugging magrittr pipelines in R using Bizarro Pipe and eager assignment.
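
As a minimal sketch of the idea (the excerpt does not reproduce the article's own examples, so the pipeline here is ours): the "Bizarro pipe" is the character sequence `->.;`, which is ordinary right-assignment into a variable named `.` followed by a statement separator. It needs no packages, and every stage is evaluated and stored as soon as its line runs.

```r
# The magrittr pipeline this mimics would read: 4 %>% sqrt() %>% exp()
# With the Bizarro pipe, each line is a complete statement that assigns
# its result to the variable named "." before the next line runs.
4 ->.;
sqrt(.) ->.;
exp(.) -> result
print(result)
```

Because each line stands on its own, one can run the steps one at a time in the console and inspect `.` after any of them, which is difficult to do in the middle of a single `%>%` expression.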


Datashader is a big deal

I recently got back from Strata West 2017 (where I ran a very well received workshop on R and Spark). One thing that really stood out for me at the exhibition hall was Bokeh plus datashader from Continuum Analytics.

I had the privilege of having Peter Wang himself demonstrate datashader for me and answer a few of my questions.

I am so excited about datashader capabilities I literally will not wait for the functionality to be exposed in R through rbokeh. I am going to leave my usual knitr/rmarkdown world and dust off Jupyter Notebook just to use datashader plotting. This is worth trying, even for diehard R users.

Practical Data Science with R: ACM SIGACT News Book Review and Discount!

Our book Practical Data Science with R has just been reviewed in Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory (ACM SIGACT) News by Dr. Allan M. Miller (U.C. Berkeley)!


The book is half off at Manning on March 21st 2017 using the following code (please share/Tweet):

Deal of the Day March 21: Half off my book Practical Data Science with R. Use code dotd032117au at https://www.manning.com/dotd

Please read on for links and excerpts from the review.

New screencast: using R and RStudio to install and experiment with Apache Spark

I have a new short screencast up: using R and RStudio to install and experiment with Apache Spark.

More material from my recent Strata workshop Modeling big data with R, sparklyr, and Apache Spark can be found here.

Some Win-Vector R packages

This post concludes our mini-series of Win-Vector open source R packages. We end with WVPlots, a collection of ready-made ggplot2 plots we find handy.

Please read on for a list of some of the Win-Vector LLC open-source R packages that we are pleased to share.

sigr: Simple Significance Reporting

sigr is a simple R package that conveniently formats a few statistics and their significance tests. This allows the analyst to use the correct test no matter what modeling package or procedure they use.
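
To make concrete the kind of reporting being described, here is a rough base R sketch that does not use sigr and is not meant to reproduce its functions or output format. It formats the overall F-test of a linear model fit to the built-in `cars` data; the report layout is a hypothetical one of our own.

```r
# Fit a simple linear model on the built-in 'cars' data set.
model <- lm(dist ~ speed, data = cars)
s <- summary(model)

# summary.lm stores the F statistic as a named vector:
# value, numdf (numerator df), dendf (denominator df).
f <- s$fstatistic
pValue <- pf(f[['value']], f[['numdf']], f[['dendf']], lower.tail = FALSE)

# Hypothetical one-line report layout, chosen for illustration only.
report <- sprintf("R-squared=%.2f, F(%d, %d)=%.1f, p=%.2g",
                  s$r.squared,
                  as.integer(f[['numdf']]),
                  as.integer(f[['dendf']]),
                  f[['value']],
                  pValue)
print(report)
```

Writing this by hand for every model type is tedious and easy to get subtly wrong, which is the gap a small reporting package is meant to fill.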
