vtreat cross frames

Posted on Categories data science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on vtreat cross frames

vtreat cross frames

John Mount, Nina Zumel

2016-05-05

As a follow on to “On Nested Models” we work R examples demonstrating “cross validated training frames” (or “cross frames”) in vtreat.

Continue reading vtreat cross frames

On Nested Models

Posted on Categories Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 1 Comment on On Nested Models

We have been recently working on and presenting on nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested, but are easy to get wrong.

Continue reading On Nested Models

A bit on the F1 score floor

Posted on Categories Mathematics, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , Leave a comment on A bit on the F1 score floor

At Strata+Hadoop World “R Day” Tutorial, Tuesday, March 29 2016, San Jose, California we spent some time on classifier measures derived from the so-called “confusion matrix.”

We repeated our usual admonition to not use “accuracy itself” as a project quality goal (business people tend to ask for it as it is the word they are most familiar with, but it usually isn’t what they really want).


NewImage
One reason not to use accuracy: an example where a classifier that does nothing is “more accurate” than one that actually has some utility. (Figure credit Nina Zumel, slides here)

And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more).

Please read on to see what stood out. Continue reading A bit on the F1 score floor

WVPlots: example plots in R using ggplot2

Posted on Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , 3 Comments on WVPlots: example plots in R using ggplot2

Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. The idea is: we sacrifice some of the flexibility and composability inherent to ggplot2 in R for a menu of prescribed presentation solutions (which we are sharing on Github).

For example the plot below showing both an observed discrete empirical distribution (as stems) and a matching theoretical distribution (as bars) is a built in “one liner.”

NewImage

Please read on for some of the ideas and how to use this package. Continue reading WVPlots: example plots in R using ggplot2

More on preparing data

Posted on Categories Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , Leave a comment on More on preparing data

The Microsoft Data Science User Group just sponsored Dr. Nina Zumel‘s presentation “Preparing Data for Analysis Using R”. Microsoft saw Win-Vector LLC‘s ODSC West 2015 presentation “Prepping Data for Analysis using R” and generously offered to sponsor improving it and disseminating it to a wider audience.



Logo

We feel Nina really hit the ball out of the park with over 400 new live viewers. Read more for links to even more free materials! Continue reading More on preparing data

Win-Vector video courses: price/status changes

Posted on Categories Administrativia, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , 3 Comments on Win-Vector video courses: price/status changes

Win-Vector LLC has been offering a couple of online video courses on the topics of data science and A/B testing (both using R). These are high quality courses and well worth the money and time needed to work through them closely (with all materials distributed on GitHub).

Our current distributor is Udemy, which has just announced a unilateral change in pricing policy (March 2, 2016). This note is about the current status of these courses. Continue reading Win-Vector video courses: price/status changes

Databases in containers

Posted on Categories Coding, Exciting Techniques, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, RantsTags , , , , 5 Comments on Databases in containers

A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.

In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.

In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).


Containerized DB

NewImage

The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)

“Despite all my rage I am still just a rat in a cage!”

(image credit).

Continue reading Databases in containers

Neglected optimization topic: set diversity

Posted on Categories Applications, data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , 2 Comments on Neglected optimization topic: set diversity

The mathematical concept of set diversity is a somewhat neglected topic in current applied decision sciences and optimization. We take this opportunity to discuss the issue.

The problem

Consider the following problem: for a number of items U = {x_1, … x_n} pick a small set of them X = {x_i1, x_i2, ..., x_ik} such that there is a high probability one of the x in X is a “success.” By success I mean some standard business outcome such as making a sale (in the sense of any of: propensity, appetency, up selling, and uplift modeling), clicking an advertisement, adding an account, finding a new medicine, or learning something useful.

This is common in:

  • Search engines. The user is presented with a page consisting of “top results” with the hope that one of the results is what the user wanted.
  • Online advertising. The user is presented with a number of advertisements in enticements in the hope that one of them matches user taste.
  • Science. A number of molecules are simultaneously presented to biological assay hoping that at least one of them is a new drug candidate, or that the simultaneous set of measurements shows us where to experiment further.
  • Sensor/guard placement. Overlapping areas of coverage don’t make up for uncovered areas.
  • Machine learning method design. The random forest algorithm requires diversity among its sub-trees to work well. It tries to ensure by both per-tree variable selections and re-sampling (some of these issues discussed here).

In this note we will touch on key applications and some of the theory involved. While our group specializes in practical data science implementations, applications, and training, our researchers experience great joy when they can re-formulate a common problem using known theory/math and the reformulation is game changing (as it is in the case of set-scoring).


Rplot01

Minimal spanning trees, the basis of one set diversity metric.

Continue reading Neglected optimization topic: set diversity

Running R jobs quickly on many machines

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, ProgrammingTags , , 4 Comments on Running R jobs quickly on many machines

As we demonstrated in “A gentle introduction to parallel computing in R” one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs multiple CPUs/cores to running jobs multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.


NewImage
Colossus supercomputer : The Forbin Project

Continue reading Running R jobs quickly on many machines

Nina Zumel and John Mount part of R Day at Strata + Hadoop World in San Jose 2016

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags

Nina Zumel and I are honored to have been invited to be part of Strata + Hadoop World in San Jose 2016 R Day organized by RStudio and O’Reilly. Continue reading Nina Zumel and John Mount part of R Day at Strata + Hadoop World in San Jose 2016