Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork, et. al. (see references at the end of the article) that apply results from differential privacy to machine learning.
In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et.al.’s recent results can be used to improve the model fitting process.
The Voight-Kampff Test: Looking for a difference. Scene from Blade Runner
How sure are you that large margin implies low VC dimension (and good generalization error)? It is true. But even if you have taken a good course on machine learning you many have seen the actual proof (with all of the caveats and conditions). I worked through the literature proofs over the holiday and it took a lot of notes to track what is really going on in the derivation of the support vector machine.
Figure: the standard SVM margin diagram, this time with some un-marked data added.
As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado, et.al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI Machine Learning Repository. For fun, we decided to do a follow-up study, using their data and several classifier implementations from scikit-learn, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing if there is a “geometry” of classifiers: which classifiers produce predictions patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.
A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling of claiming novelty in putting old wine in new bottles. Rob Tibshirani nailed this type of distinction in is famous machine learning versus statistics glossary.
We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered parts of a larger end to end process. The process we are interested in is the deployment of useful data driven models into production. The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques and applying statistics criticisms. The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use. I tend to use it a lot, because while I love the tools and techniques our true loyalty is to the whole process (and I want to emphasize this to our readers).
The phrase “data science” as in use it today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid 1990’s (this group emphasized simulation based modeling of small molecules and their biological interactions, the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven from computer simulations). For some people data science is considered a new calling and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion. Continue reading Data Science, Machine Learning, and Statistics: what is in a name?
Living in the age of big data we ask what to do when we have the good fortune to be presented with a huge amount of supervised training data? Most often at large scale we are presented with the un-supervised problems of characterization and information extraction; but some problem domains offer an almost limitless supply of supervised training data (such as using older data to build models that predict the near future). Having too much training data is a good problem to have and there are ways to use traditional methods (like logistic regression) at this scale. We present an “out of core” logistic regression implementation and a quick example in Apache Hadoop running on Amazon Elastic MapReduce. This presentation assumes familiarity with Unix style command lines, Java and Hadoop.
Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature. Continue reading A Personal Perspective on Machine Learning
We describe the “the local to global principle.” It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods. We have produced both a stand-alone PDF (more legible) and a HTML/blog form (more skimable). Continue reading The Local to Global Principle