vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).
In addition, vtreat can now effectively prepare data for multi-class classification or multinomial modeling.
Dr. Nina Zumel will be presenting “Myths of Data Science: Things you Should and Should Not Believe”,
Sunday, October 29, 2017
10:00 AM to 12:30 PM at the She Talks Data Meetup (Bay Area).
ODSC West 2017 is coming up soon. It is our favorite conference, and we will be giving both a workshop and a talk.
Thursday Nov 2 2017,
“Modeling big data with R, Sparklyr, and Apache Spark”,
Workshop/Training intermediate, 4 hours,
by Dr. John Mount (link).
Friday Nov 3 2017,
“Myths of Data Science: Things you Should and Should Not Believe”,
Data Science lecture beginner/intermediate, 45 minutes,
by Dr. Nina Zumel (link, length, abstract, and title to be corrected).
We really hope you can make these talks.
On the “R for big data” front we have some big news: the replyr package now implements pivot/un-pivot (or what tidyr calls spread/gather) for big data (databases and Sparklyr). This data shaping ability adds a lot of user power. We call the theory “coordinatized data” and the work practice “fluid data”.
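To make the pivot/un-pivot idea concrete, here is a minimal in-memory sketch in plain Python. It is not the replyr API: the function names `unpivot` and `pivot`, their parameters, and the toy table are all hypothetical, chosen only to show how a "wide" table and a "tall" key/value table carry the same information.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def unpivot(rows: Sequence[Row], id_cols: Sequence[str], value_cols: Sequence[str],
            key_name: str = "key", value_name: str = "value") -> List[Row]:
    """Wide to tall: emit one output row per (input row, value column) pair."""
    out: List[Row] = []
    for row in rows:
        for col in value_cols:
            rec: Row = {c: row[c] for c in id_cols}
            rec[key_name] = col          # which column the measurement came from
            rec[value_name] = row[col]   # the measurement itself
            out.append(rec)
    return out


def pivot(rows: Sequence[Row], id_cols: Sequence[str],
          key_name: str = "key", value_name: str = "value") -> List[Row]:
    """Tall to wide: collect key/value pairs that share the same id columns."""
    grouped: Dict[tuple, Row] = {}
    for row in rows:
        ident = tuple(row[c] for c in id_cols)
        rec = grouped.setdefault(ident, {c: row[c] for c in id_cols})
        rec[str(row[key_name])] = row[value_name]
    return list(grouped.values())


# Hypothetical toy data: two records, two measurement columns.
wide = [{"id": 1, "x": 10, "y": 20}, {"id": 2, "x": 30, "y": 40}]
tall = unpivot(wide, id_cols=["id"], value_cols=["x", "y"])
round_trip = pivot(tall, id_cols=["id"])
```

In this sketch `round_trip` recovers the original wide table, which is the sense in which the two layouts are interchangeable "coordinates" for the same data. A database or Spark implementation would express the same transformation as joins and aggregations rather than Python loops.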
While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.
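The translation itself amounts to a permutation lookup. The following is a small sketch in plain Python, not the code from the project: the helper names `order_translation` and `reorder`, the class labels, and the probability values are all made up for illustration.

```python
from typing import List, Sequence


def order_translation(source_order: Sequence[str], target_order: Sequence[str]) -> List[int]:
    """Return indices perm such that [source_order[i] for i in perm] == target_order."""
    if len(set(source_order)) != len(source_order):
        raise ValueError("labels in source_order must be distinct")
    if sorted(source_order) != sorted(target_order):
        raise ValueError("both orders must contain exactly the same labels")
    position = {label: i for i, label in enumerate(source_order)}
    return [position[label] for label in target_order]


def reorder(values: Sequence[float], perm: Sequence[int]) -> List[float]:
    """Apply the permutation to one vector of per-class values."""
    return [values[i] for i in perm]


# Hypothetical example: the order the fitting system chose vs. the order the user wants.
system_order = ["setosa", "virginica", "versicolor"]
user_order = ["setosa", "versicolor", "virginica"]

perm = order_translation(system_order, user_order)

# Per-class probabilities as reported in the system's order (made-up numbers).
probs_system = [0.7, 0.1, 0.2]
probs_user = reorder(probs_system, perm)
```

Here `perm` is `[0, 2, 1]` and `probs_user` is `[0.7, 0.2, 0.1]`. Computing the permutation once and then applying it to every prediction vector (or coefficient block) keeps the label bookkeeping in one place.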
In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal of this article is to set up terminology so we can state in one or two sentences why decision trees tend to work well in practice.
Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply results from differential privacy to machine learning.
In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et al.’s recent results can be used to improve the model fitting process.
The Voight-Kampff Test: Looking for a difference. Scene from Blade Runner