Posted on Categories Administrativia, data science, StatisticsTags , , , 1 Comment on Some Announcements

Some Announcements

Some Announcements:

  • Dr. Nina Zumel will be presenting “Myths of Data Science: Things you Should and Should Not Believe”,
    Sunday, October 29, 2017
    10:00 AM to 12:30 PM at the She Talks Data Meetup (Bay Area).
  • ODSC West 2017 is soon. It is our favorite conference and we will be giving both a workshop and a talk.
    • Thursday Nov 2 2017,
      2:00 PM,
      Room T2,
      “Modeling big data with R, Sparklyr, and Apache Spark”,
      Workshop/Training intermediate, 4 hours,
      by Dr. John Mount (link).

    • Friday Nov 3 2017,
      4:15 PM,
      Room TR2
      “Myths of Data Science: Things you Should and Should Not Believe”,
      Data Science lecture beginner/intermediate, 45 minutes,
      by Dr. Nina Zumel (link, length, abstract, and title to be corrected).

    • We really hope you can make these talks.

  • On the “R for big data” front we have some big news: the replyr package now implements pivot/un-pivot (or what tidyr calls spread/gather) for big data (databases and Sparklyr). This data shaping ability adds a lot of user power. We call the theory “coordinatized data” and the work practice “fluid data”.
Posted on Categories Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , 3 Comments on Upcoming data preparation and modeling article series

Upcoming data preparation and modeling article series

I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.


Vtreat

vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects. Continue reading Upcoming data preparation and modeling article series

Posted on Categories data science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , ,

Permutation Theory In Action

While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.

Providing good solutions to gaps like this is one of the thing Win-Vector LLC does both in our consulting and training practices.

Continue reading Permutation Theory In Action

Posted on Categories math programming, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , 3 Comments on Why do Decision Trees Work?

Why do Decision Trees Work?

In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal of this article is to set up terminology so we can state in one or two sentences why decision trees tend to work well in practice.

Continue reading Why do Decision Trees Work?

Posted on Categories OpinionTags , , ,

Another note on differential privacy

I want to recommend an excellent article on the recent claimed use of differential privacy to actually preserve user privacy: “A Few Thoughts on Cryptographic Engineering” by Matthew Green.

After reading the article we have a few follow-up thoughts on the topic. Continue reading Another note on differential privacy

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , ,

Free gradient boosting lecture

We have always regretted that we didn’t get to cover gradient boosting in Practical Data Science with R (Manning 2014). To try make up for that we are sharing (for free) our GBM lecture from our (paid) video course Introduction to Data Science.


(link, all support material here).

Please help us get the word out by sharing/Tweeting!

Posted on Categories data science, Exciting Techniques, Expository Writing, Statistics, Statistics To English Translation, TutorialsTags , , , , , , 6 Comments on A Simpler Explanation of Differential Privacy

A Simpler Explanation of Differential Privacy

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork, et. al. (see references at the end of the article) that apply results from differential privacy to machine learning.

In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et.al.’s recent results can be used to improve the model fitting process.

NewImage
The Voight-Kampff Test: Looking for a difference. Scene from Blade Runner

Continue reading A Simpler Explanation of Differential Privacy

Posted on Categories data science, Mathematics, OpinionTags , , , ,

How sure are you that large margin implies low VC dimension?

How sure are you that large margin implies low VC dimension (and good generalization error)? It is true. But even if you have taken a good course on machine learning you many have seen the actual proof (with all of the caveats and conditions). I worked through the literature proofs over the holiday and it took a lot of notes to track what is really going on in the derivation of the support vector machine.


Margin2
Figure: the standard SVM margin diagram, this time with some un-marked data added.
Continue reading How sure are you that large margin implies low VC dimension?

Posted on Categories Coding, Computer Science, data science, Expository Writing, math programming, Pragmatic Machine Learning, StatisticsTags , , , , , , , , , 4 Comments on The Geometry of Classifiers

The Geometry of Classifiers

As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado, et.al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI Machine Learning Repository. For fun, we decided to do a follow-up study, using their data and several classifier implementations from scikit-learn, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing if there is a “geometry” of classifiers: which classifiers produce predictions patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.

Continue reading The Geometry of Classifiers

Posted on Categories data science, Expository Writing, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , , 4 Comments on Data Science, Machine Learning, and Statistics: what is in a name?

Data Science, Machine Learning, and Statistics: what is in a name?

A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling of claiming novelty in putting old wine in new bottles. Rob Tibshirani nailed this type of distinction in is famous machine learning versus statistics glossary.

I’ve written about statistics v.s. machine learning , but I would like to explain why we (the authors of this blog) often use the term data science. Nina Zumel explained being a data scientist very well, I am going to take a swipe at explaining data science.

We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered parts of a larger end to end process. The process we are interested in is the deployment of useful data driven models into production. The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques and applying statistics criticisms. The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use. I tend to use it a lot, because while I love the tools and techniques our true loyalty is to the whole process (and I want to emphasize this to our readers).

The phrase “data science” as in use it today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid 1990’s (this group emphasized simulation based modeling of small molecules and their biological interactions, the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven from computer simulations). For some people data science is considered a new calling and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion. Continue reading Data Science, Machine Learning, and Statistics: what is in a name?