
Four Years of Practical Data Science with R

Four years ago today, we (authors Nina Zumel and John Mount) received our author's copies of Practical Data Science with R!




Plotting Deep Learning Model Performance Trajectories

I am excited to share a new deep learning model performance trajectory graph.

Here is an example, produced in R from a Keras training run and plotted with ggplot2:

[Figure: example deep learning model performance trajectory plot]
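For readers who want to build something similar, here is a minimal sketch of one way such a trajectory plot could be produced with ggplot2. The numbers in history_df below are synthetic stand-ins; in practice you would build that data frame from the metrics recorded in your Keras training history.

library(ggplot2)

# synthetic loss curves standing in for a real Keras training history
history_df <- data.frame(
  epoch  = rep(1:20, times = 2),
  metric = rep(c("training loss", "validation loss"), each = 20),
  value  = c(exp(-(1:20)/5) + 0.05,
             exp(-(1:20)/5) + 0.05 + (1:20)/200)
)

# plot both curves by epoch to see where validation performance starts to degrade
ggplot(history_df, aes(x = epoch, y = value, color = metric)) +
  geom_point() +
  geom_line() +
  ggtitle("model performance trajectory by epoch")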


Some Announcements

Some Announcements:

  • Dr. Nina Zumel will be presenting “Myths of Data Science: Things you Should and Should Not Believe”,
    Sunday, October 29, 2017
    10:00 AM to 12:30 PM at the She Talks Data Meetup (Bay Area).
  • ODSC West 2017 is soon. It is our favorite conference and we will be giving both a workshop and a talk.
    • Thursday Nov 2 2017,
      2:00 PM,
      Room T2,
      “Modeling big data with R, Sparklyr, and Apache Spark”,
      Workshop/Training intermediate, 4 hours,
      by Dr. John Mount (link).

    • Friday Nov 3 2017,
      4:15 PM,
      Room TR2
      “Myths of Data Science: Things you Should and Should Not Believe”,
      Data Science lecture beginner/intermediate, 45 minutes,
      by Dr. Nina Zumel (link, length, abstract, and title to be corrected).

    • We really hope you can make these talks.

  • On the “R for big data” front we have some big news: the replyr package now implements pivot/un-pivot (or what tidyr calls spread/gather) for big data (databases and Sparklyr). This data shaping ability adds a lot of user power. We call the theory “coordinatized data” and the work practice “fluid data”.
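The replyr calls themselves are not reproduced here; as a small in-memory illustration of the pivot/un-pivot (spread/gather) shaping being referred to, here is the same transform on a local data frame using tidyr. replyr's contribution is performing the analogous operation on remote big-data sources (databases and Sparklyr).

library(tidyr)

# long ("un-pivoted") form: one measurement per row (invented example data)
long <- data.frame(
  id    = c(1, 1, 2, 2),
  key   = c("height", "weight", "height", "weight"),
  value = c(1.7, 68, 1.8, 75)
)

# pivot: one row per id, one column per key
wide <- spread(long, key, value)
wide

# un-pivot: back to the long form
long2 <- gather(wide, key, value, height, weight)
long2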

Upcoming data preparation and modeling article series

I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.



vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytics projects. If you are an R user, we strongly suggest you incorporate vtreat into your projects.
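As a quick orientation (a minimal sketch on invented data, not part of the upcoming article series), typical vtreat usage designs a treatment plan on training data and then prepares data into a clean, all-numeric frame:

library(vtreat)

# toy training data: a categorical variable and a numeric variable, both with
# missing values, plus a numeric outcome (all invented for illustration)
d <- data.frame(
  x = c("a", "a", "b", "b", NA, "c"),
  z = c(1, 2, NA, 4, 5, 6),
  y = c(1, 2, 3, 4, 5, 6)
)

# design variable treatments for predicting the numeric outcome y
plan <- designTreatmentsN(d, varlist = c("x", "z"), outcomename = "y")

# apply the plan: the result has only numeric columns and no missing values
d_treated <- prepare(plan, d)
head(d_treated)

# per-variable diagnostics
plan$scoreFrame[, c("varName", "sig")]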


Permutation Theory In Action

While working on a large client project using Sparklyr and multinomial regression, we recently ran into a problem: Apache Spark chooses the order of the multinomial regression outcome targets, whereas R users are used to choosing that order themselves (please see here for some details). So, to make things work the way R users expect, we need a way to translate one ordering into another.
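As a small illustration of the re-ordering involved (a sketch with made-up labels, not the client code or the article's exact solution), the needed permutation can be computed with match() and then applied to, say, the columns of a predicted-probability matrix:

# the order chosen by the system versus the order the analyst wants
spark_order <- c("setosa", "virginica", "versicolor")
r_order     <- c("setosa", "versicolor", "virginica")

# perm[i] is the position in spark_order of the i-th label of r_order
perm <- match(r_order, spark_order)

# re-order the columns of a matrix of predicted class probabilities
probs <- matrix(c(0.7, 0.1, 0.2,
                  0.2, 0.5, 0.3),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, spark_order))
probs_r <- probs[, perm, drop = FALSE]
colnames(probs_r)  # "setosa" "versicolor" "virginica"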

Providing good solutions to gaps like this is one of the things Win-Vector LLC does in both our consulting and training practices.



Why do Decision Trees Work?

In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal of this article is to set up terminology so we can state in one or two sentences why decision trees tend to work well in practice.



Another note on differential privacy

I want to recommend an excellent article on the recent claimed use of differential privacy to actually preserve user privacy: “A Few Thoughts on Cryptographic Engineering” by Matthew Green.

After reading the article, we have a few follow-up thoughts on the topic.


Free gradient boosting lecture

We have always regretted that we didn't get to cover gradient boosting in Practical Data Science with R (Manning 2014). To try to make up for that, we are sharing (for free) our GBM lecture from our (paid) video course, Introduction to Data Science.


(link, all support material here).

Please help us get the word out by sharing/Tweeting!


A Simpler Explanation of Differential Privacy

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It's back in the news again, with exciting results from Cynthia Dwork et al. (see the references at the end of the article) that apply results from differential privacy to machine learning.

In this article we'll work through the definition of differential privacy and demonstrate how Dwork et al.'s recent results can be used to improve the model fitting process.
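As a concrete (and standard) illustration of the flavor of the definition, not taken from the article itself: the classic Laplace mechanism answers a counting query with Laplace noise of scale sensitivity/epsilon, so adding or removing any one record changes the output distribution by at most a factor of exp(epsilon).

# sample from a Laplace(0, scale) distribution (difference of two exponentials)
laplace_noise <- function(n, scale) {
  rexp(n, rate = 1/scale) - rexp(n, rate = 1/scale)
}

# epsilon-differentially-private count: a counting query has sensitivity 1,
# so Laplace noise of scale 1/epsilon suffices
private_count <- function(x, predicate, epsilon) {
  true_count <- sum(predicate(x))
  sensitivity <- 1
  true_count + laplace_noise(1, scale = sensitivity / epsilon)
}

set.seed(2016)
ages <- sample(18:90, 1000, replace = TRUE)  # invented data for illustration
private_count(ages, function(a) a >= 65, epsilon = 0.1)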

The Voight-Kampff Test: Looking for a difference. Scene from Blade Runner



How sure are you that large margin implies low VC dimension?

How sure are you that large margin implies low VC dimension (and good generalization error)? It is true. But even if you have taken a good course on machine learning, you may not have seen the actual proof (with all of its caveats and conditions). I worked through the literature proofs over the holiday, and it took a lot of notes to track what is really going on in the derivation of the support vector machine.
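For orientation, here is a commonly quoted form of the claim (my paraphrase, with conventions for radius versus diameter varying between sources; stating it carefully, with all of its conditions, is exactly what takes the work): for separating hyperplanes achieving margin at least Δ on data contained in a ball of radius R in d dimensions, the VC dimension h satisfies

h \le \min\left( \left\lceil \frac{R^2}{\Delta^2} \right\rceil,\; d \right) + 1

so a margin that is large relative to the data radius caps the capacity independently of the ambient dimension d.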


Figure: the standard SVM margin diagram, this time with some un-marked data added.