Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

WVPlots 1.1.2 on CRAN

I have put a new release of the WVPlots package up on CRAN. This release adds palette and/or color controls to most of the plotting functions in the package.

WVPlots was originally a catch-all package of ggplot2 visualizations that we at Win-Vector tended to use repeatedly, and wanted to turn into “one-liners.” A consequence of this is that the older visualizations had our preferred color schemes hard-coded in. More recent additions to the package sometimes had palette or color controls, but not in a consistent way. Making color controls more consistent has been a “todo” for a while—one that I’d been putting off. A recent request from user Brice Richard (thanks Brice!) has pushed me to finally make the changes.

Most visualizations in the package that color-code by group now have a palette argument that takes the name of a Brewer palette for the graph; Dark2 is usually the default. To use the ggplot2 default palette, or to set an alternative palette such as viridis or a manually specified color scheme, set palette=NULL and add the scale you want to the returned plot. Here are some examples:


library(WVPlots)
library(ggplot2)  # needed for the scale_*_manual() calls used later

mpg = ggplot2::mpg
mpg$trans = gsub("\\(.*$", '', mpg$trans)
# default palette: Dark2 
DoubleDensityPlot(mpg, "cty", "trans", "City driving mpg by transmission type")


# set a different Brewer color palette
DoubleDensityPlot(mpg, "cty", "trans", 
                  "City driving mpg by transmission type",
                  palette = "Accent")


# set a custom palette
cmap = c("auto" = "#7b3294", "manual" = "#008837")

DoubleDensityPlot(mpg, "cty", "trans", 
                  "City driving mpg by transmission type",
                  palette=NULL) + 
  scale_color_manual(values=cmap) + 
  scale_fill_manual(values=cmap)


For other plots, the user can now specify the desired color for different elements of the graph.

title = "Count of cars by number of carburetors and cylinders"

# default fill: darkblue
ShadowPlot(mtcars, "carb", "cyl",
           title = title)


# specify fill
ShadowPlot(mtcars, "carb", "cyl",
           title = title,
           fillcolor = "#a6611a")


We hope that these changes make WVPlots even more useful to our users. For examples of several of the visualizations in WVPlots, see this example vignette. For the complete list of visualizations, see the reference page.

Categories: data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Tutorials

Advanced Data Reshaping in Python and R

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through the use of coordinatized data concepts (relying heavily on Codd’s “rule of access”).

The advantages of data_algebra and cdata are:

  • The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
  • The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
  • The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).
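As a rough illustration of the three points above, here is a minimal pure-Python sketch. It does not use the data_algebra or cdata APIs: the spec layout and the helper names (describe, to_long, to_wide) are assumptions invented for this example. The idea it tries to show is that the transform is held as plain data, can describe itself before it runs, and survives a round trip through JSON, which is one way it could be handed to another system.

```python
import json

# Hypothetical spec layout, invented for this sketch (not the cdata or
# data_algebra format). The transform is plain data: it names the id column,
# the new key/value columns, and which wide column feeds each key.
spec = {
    "id_column": "model",
    "key_column": "measure",
    "value_column": "value",
    "mapping": {"AUC": "AUC", "R2": "R2"},  # key value -> wide column name
}


def describe(spec):
    """Return a human-readable description of what the transform will do."""
    lines = [f"one output row per ({spec['id_column']}, {spec['key_column']}) pair"]
    for key, column in spec["mapping"].items():
        lines.append(
            f"  {spec['key_column']}={key!r}: {spec['value_column']} <- column {column!r}"
        )
    return "\n".join(lines)


def to_long(rows, spec):
    """Move each wide record into one long row per mapped column."""
    out = []
    for row in rows:
        for key, column in spec["mapping"].items():
            out.append({
                spec["id_column"]: row[spec["id_column"]],
                spec["key_column"]: key,
                spec["value_column"]: row[column],
            })
    return out


def to_wide(rows, spec):
    """Inverse of to_long: gather the long rows back into one record per id."""
    records = {}
    for row in rows:
        row_id = row[spec["id_column"]]
        record = records.setdefault(row_id, {spec["id_column"]: row_id})
        record[spec["mapping"][row[spec["key_column"]]]] = row[spec["value_column"]]
    return list(records.values())


wide = [
    {"model": "m1", "AUC": 0.6, "R2": 0.2},
    {"model": "m2", "AUC": 0.8, "R2": 0.3},
]

print(describe(spec))
long_rows = to_long(wide, spec)
round_trip = to_wide(long_rows, spec)

# Because the spec is data, it can be serialized and shared as-is.
shared_spec = json.loads(json.dumps(spec))
```

The real packages carry much richer record specifications; the sketch only shows why writing the transform down as data makes it printable, invertible, and portable.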


Categories: Administrativia, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Tutorials

New Getting Started with vtreat Documentation

Win-Vector LLC’s Dr. Nina Zumel has just released some new vtreat documentation.

vtreat is an all-in-one, single-step data preparation system that helps defend your machine learning algorithms from:

  • Missing values
  • Large cardinality categorical variables
  • Novel levels from categorical variables

I hoped she could get the Python vtreat documentation up to parity with the R vtreat documentation. But I think she really hit the ball out of the park, and went way past that.

The new documentation consists of three “getting started” guides. These guides deliberately overlap, so you don’t have to read them all. Just read the one suited to your problem and go.

The new guides:

Perhaps we can back-port the new guides to the R version at some point.

Categories: Administrativia, data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Tutorials

Introducing data_algebra

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases.

In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable).


Categories: Pragmatic Data Science, Pragmatic Machine Learning, Tutorials

What is vtreat?

vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner.

vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other input columns are possible explanatory variables (typically numeric or categorical/string-valued; these columns may have missing values) that the user later wants to use to predict “y”. In practice such an input DataFrame may not be immediately suitable for machine learning procedures, which often expect only numeric explanatory variables and may not tolerate missing values.

To solve this, vtreat builds a transformed DataFrame where all explanatory variable columns have been transformed into a number of numeric explanatory variable columns, without missing values. The vtreat implementation produces derived numeric columns that capture most of the information relating the explanatory columns to the specified “y” or dependent/outcome column through a number of numeric transforms (indicator variables, impact codes, prevalence codes, and more). This transformed DataFrame is suitable for a wide range of supervised learning methods, from linear regression through gradient boosted machines.
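To make the kinds of derived columns concrete, here is a toy pure-Python sketch. It is not the vtreat API, and the helper names (fit_plan, apply_plan), the "_NA_" marker, and the output column names are all assumptions made up for illustration. It shows a cleaned numeric column with a missing-value indicator, a prevalence code, and an impact-style code (per-level mean of y minus the grand mean), with previously unseen levels mapped to zero.

```python
from statistics import mean

NA_LEVEL = "_NA_"  # hypothetical marker: treat a missing category as its own level


def fit_plan(rows, num_col, cat_col, y_col):
    """Learn the statistics needed to re-encode one numeric and one categorical column."""
    observed = [r[num_col] for r in rows if r[num_col] is not None]
    grand_mean = mean(r[y_col] for r in rows)
    y_by_level = {}
    for r in rows:
        level = r[cat_col] if r[cat_col] is not None else NA_LEVEL
        y_by_level.setdefault(level, []).append(r[y_col])
    return {
        "num_col": num_col,
        "cat_col": cat_col,
        "num_fill": mean(observed),  # value used to fill missing numerics
        "prevalence": {k: len(v) / len(rows) for k, v in y_by_level.items()},
        "impact": {k: mean(v) - grand_mean for k, v in y_by_level.items()},
    }


def apply_plan(plan, row):
    """Turn one raw row into all-numeric columns with no missing values."""
    num_col, cat_col = plan["num_col"], plan["cat_col"]
    x = row[num_col]
    level = row[cat_col] if row[cat_col] is not None else NA_LEVEL
    return {
        f"{num_col}_clean": plan["num_fill"] if x is None else x,
        f"{num_col}_is_missing": 1.0 if x is None else 0.0,
        # Levels never seen during fitting get prevalence 0 and impact 0.
        f"{cat_col}_prevalence": plan["prevalence"].get(level, 0.0),
        f"{cat_col}_impact": plan["impact"].get(level, 0.0),
    }


train = [
    {"x": 1.0, "c": "a", "y": 1.0},
    {"x": None, "c": "b", "y": 3.0},
    {"x": 3.0, "c": "a", "y": 2.0},
    {"x": 5.0, "c": None, "y": 6.0},
]

plan = fit_plan(train, "x", "c", "y")
seen = apply_plan(plan, {"x": 2.0, "c": "a"})
novel = apply_plan(plan, {"x": None, "c": "zzz"})
```

This sketch leaves out the statistical safeguards that make the real package sound; in particular, the impact code here is estimated on the same rows it would later be applied to.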

The idea is: you can use vtreat to take a DataFrame of messy real-world data and easily, faithfully, reliably, and repeatably prepare it for machine learning using documented methods. Incorporating vtreat into your machine learning workflow lets you quickly work with very diverse structured data.

Worked examples can be found here.

For more detail please see here: arXiv:1611.09477 stat.AP (the documentation describes the R version; however, all of the examples can be found worked in Python here).

vtreat is available as a Python/Pandas package, and also as an R package.

(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)

Some operational examples can be found here.

Categories: Administrativia, Pragmatic Data Science

Speaking at BARUG

We will be speaking at the BARUG meeting on Tuesday, September 3, 2019. If you are in the Bay Area, please come see us.

Nina Zumel & John Mount
Practical Data Science with R

Practical Data Science with R (Zumel and Mount) was one of the first, and most widely read, books on the practice of doing data science using R. We have been working hard on an improved and revised 2nd edition of our book (coming out this fall). The book reflects more experience with data science, with teaching, and with R itself. We will talk about what direction we think the R community has been taking, how this affected the book, and what is new in the upcoming edition.

Categories: data science, Expository Writing, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Tutorials

Lord Kelvin, Data Scientist

In 1876 A. Légé & Co., 20 Cross Street, Hatton Gardens, London completed the first “tide calculating machine” for William Thomson (later Lord Kelvin) (ref).


Thomson’s (Lord Kelvin) First Tide Predicting Machine, 1876

The results were plotted on the paper cylinders, and one literally “turned the crank” to perform the calculations.

The tide calculating machine embodied ideas of Sir Isaac Newton and Pierre-Simon Laplace (ref), and could predict tide-driven water levels by means of wheels and gears.

The question is: can modern data science tools quickly forecast tides to similar accuracy?


Categories: Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning

A Kind Note That We Really Appreciate

The following really made my day.

I tell every data scientist I know about vtreat and urge them to read the paper.
Jason Wolosonovich

Jason, thanks for your support and thank you so much for taking the time to say this (and for your permission to quote you on this).

For those interested, the R version of vtreat can be found here, the paper can be found here, and the in-development Python/Pandas version of vtreat can be found (with examples) here.

Chapter 8 of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019 has a more operational discussion of vtreat (which itself uses concepts developed in chapter 4).

Categories: Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming

Big News: Porting vtreat to Python

We at Win-Vector LLC have some big news.

We are finally porting a streamlined version of our R vtreat variable preparation package to Python.

vtreat is a great system for preparing messy data for supervised machine learning.

The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their limit. In particular we have found the .fit_transform() pattern is a great way to express building up a cross-frame to avoid nested model bias (in this case .fit_transform() != .fit().transform()). There is some difference in how object-oriented APIs compose versus how functional APIs compose. We are making an effort to research how to make this an advantage, and not a liability.
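To show what .fit_transform() != .fit().transform() can look like, here is a small pure-Python sketch. It is not the vtreat implementation: the class ToyImpactCoder, its fold assignment (row index modulo the number of folds), and its impact-style code (per-level mean of y minus the grand mean) are assumptions chosen to keep the example short. The point is only that fit_transform() encodes each training row with statistics estimated from the other folds, while fit() followed by transform() encodes it with statistics that already saw that row's outcome.

```python
from statistics import mean


class ToyImpactCoder:
    """Hypothetical illustration of a cross-frame; not the vtreat API."""

    def __init__(self, n_folds=2):
        self.n_folds = n_folds
        self.codes_ = {}

    @staticmethod
    def _estimate_codes(levels, y):
        """Per-level mean of y minus the grand mean, on the rows given."""
        grand_mean = mean(y)
        y_by_level = {}
        for level, yi in zip(levels, y):
            y_by_level.setdefault(level, []).append(yi)
        return {level: mean(ys) - grand_mean for level, ys in y_by_level.items()}

    def fit(self, levels, y):
        # Codes from ALL rows: the right thing to apply to future data.
        self.codes_ = self._estimate_codes(levels, y)
        return self

    def transform(self, levels):
        return [self.codes_.get(level, 0.0) for level in levels]

    def fit_transform(self, levels, y):
        # Fit on all rows for later use, but return a cross-frame: each
        # training row is coded using only rows from the other folds.
        self.fit(levels, y)
        n = len(levels)
        out = [0.0] * n
        for fold in range(self.n_folds):
            train = [i for i in range(n) if i % self.n_folds != fold]
            codes = self._estimate_codes(
                [levels[i] for i in train], [y[i] for i in train]
            )
            for i in range(n):
                if i % self.n_folds == fold:
                    out[i] = codes.get(levels[i], 0.0)
        return out


levels = ["a", "a", "b", "b", "a", "b"]
y = [1.0, 2.0, 5.0, 7.0, 3.0, 6.0]

naive = ToyImpactCoder().fit(levels, y).transform(levels)
cross = ToyImpactCoder().fit_transform(levels, y)
print(naive)
print(cross)
```

Downstream models would be trained on the cross-frame, and new data would go through transform(), which uses the codes fit on all of the training rows.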

The new repository is here. And we have a non-trivial worked classification example. Next up is multinomial classification. After that a few validation suites to prove the two vtreat systems work similarly. And then we have some exciting new capabilities.

The first application is going to be a shortening and streamlining of our current four-day data science in Python course (while allowing more concrete examples!).

This also means data scientists who use both R and Python will have a few more tools that present similarly in each language.

Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

An Ad-hoc Method for Calibrating Uncalibrated Models

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more accurate predictions on individuals. In this article, we’ll demonstrate one ad-hoc method for calibrating an uncalibrated model with respect to specific grouping variables. This "polishing step" potentially returns a model that estimates certain rollups in an unbiased way, while retaining good performance on individual predictions.
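As a hedged illustration of what calibrating with respect to a grouping variable can mean, here is a minimal pure-Python sketch. It is not necessarily the method used in the article: it assumes a single grouping variable and a simple additive per-group offset (the mean residual within each group), and the function names are made up for this example. After the offsets are applied, the per-group totals of the predictions match the per-group totals of the outcomes on the data used to fit the offsets.

```python
from collections import defaultdict


def fit_group_offsets(groups, y, pred):
    """Mean residual (y - pred) per group: an assumed, very simple polish."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, yi, pi in zip(groups, y, pred):
        sums[g] += yi - pi
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


def polish(groups, pred, offsets):
    """Shift each prediction by its group's offset (0 for unseen groups)."""
    return [p + offsets.get(g, 0.0) for g, p in zip(groups, pred)]


def rollup(groups, values):
    """Total of values per group."""
    totals = defaultdict(float)
    for g, v in zip(groups, values):
        totals[g] += v
    return dict(totals)


groups = ["north", "north", "south", "south"]
y = [10.0, 14.0, 20.0, 24.0]
pred = [9.0, 13.0, 23.0, 25.0]  # stand-in for an uncalibrated model's output

offsets = fit_group_offsets(groups, y, pred)
polished = polish(groups, pred, offsets)

print(rollup(groups, pred))      # rollups before polishing
print(rollup(groups, polished))  # rollups after polishing
print(rollup(groups, y))         # actual rollups
```

The match is only guaranteed on the data the offsets were fit on; how well it carries over to new data, and what it costs in individual-level accuracy, would need to be checked on held-out data.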
