Posted on Categories data science, Pragmatic Data Science, Programming, Statistics, TutorialsTags 7 Comments on Don’t use stats::aggregate()

Don’t use stats::aggregate()

When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().

Read on for our example. Continue reading Don’t use stats::aggregate()

Posted on Categories Pragmatic Data Science, Quantitative Finance, RantsTags , , 3 Comments on Who is allowed to call themselves a data scientist?

Who is allowed to call themselves a data scientist?

It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype-cycle, even the American Statistical Association was forced to admit that data science is actually a real thing and exists.

Gartner hype cycle (Wikipedia).

Given we agree data science exists, who is allowed to call themselves a data scientist? Continue reading Who is allowed to call themselves a data scientist?

Posted on Categories Administrativia, Practical Data ScienceTags , , , ,

Thank you Joseph Rickert!

A bit of text we are proud to steal from our good friend Joseph Rickert:

Then, for some very readable background material on SVMs I recommend section 13.4 of Applied Predictive Modeling and sections 9.3 and 9.4 of Practical Data Science with R by Nina Zumel and John Mount. You will be hard pressed to find an introduction to kernel methods and SVMs that is as clear and useful as this last reference.

For more on SVMs see the original article on the Revolution Analytics blog.

Posted on Categories Exciting Techniques, StatisticsTags , , 4 Comments on A simple differentially private-ish procedure

A simple differentially private-ish procedure

Authors: John Mount and Nina Zumel

Nina and I were noodling with some variations of differentially private machine learning, and think we have found a variation of a standard practice that is actually fairly efficient in establishing differential privacy a privacy condition (but, as commenters pointed out- not differential privacy).


Read on for the idea and a rough analysis. Continue reading A simple differentially private-ish procedure

Posted on Categories data science, Expository Writing, Opinion, Rants, Statistics, Statistics To English Translation, TutorialsTags , , , , ,

Baking priors

There remains a bit of a two-way snobbery that Frequentist statistics is what we teach (as so-called objective statistics remain the same no matter who works with them) and Bayesian statistics is what we do (as it tends to directly estimate posterior probabilities we are actually interested in). Nina Zumel hit the nail on the head when she wrote an article explaining the appropriateness of the type of statistical theory depends on the type of question you are trying to answer, not on your personal prejudices.

We will discuss a few more examples that have been in our mind, including one I am calling “baking priors.” This final example will demonstrate some of the advantages of allowing researchers to document their priors.

Thumb IMG 0539 1024
Figure 1: two loaves of bread.
Continue reading Baking priors

Posted on Categories Computer Science, Exciting Techniques, Opinion, Programming, RantsTags , 2 Comments on Thumbs up for Anaconda

Thumbs up for Anaconda

One of the things I like about R is: because it is not used for systems programming you can expect to install your own current version of R without interference from some system version of R that is deliberately being held back at some older version (for reasons of script compatibility). R is conveniently distributed as a single package (with automated install of additional libraries).

Want to do some data analysis? Install R, load your data, and go. You don’t expect to spend hours on system administration just to get back to your task.

Python, being a popular general purpose language does not have this advantage, but thanks to Anaconda from Continuum Analytics you can skip (or at least delegate) a lot of the system environment imposed pain. With Anaconda trying out Python packages (Jupyter, scikit-learn, pandas, numpy, sympy, cvxopt, bokeh, and more) becomes safe and pleasant. Continue reading Thumbs up for Anaconda

Posted on Categories Administrativia, Practical Data Science, StatisticsTags , ,

Some key Win-Vector serial data science articles

As readers have surely noticed the Win-Vector LLC blog isn’t a stream of short notes, but instead a collection of long technical articles. It is the only way we can properly treat topics of consequence.


What not everybody may have noticed is a number of these articles are serialized into series for deeper comprehension. The key series include:

  • Statistics to English translation.

    This series tries to find vibrant applications and explanations of standard good statistical practices, to make them more approachable to the non statistician.

  • Statistics as it should be.

    This series tries to cover cutting edge machine learning techniques, and then adapt and explain them in traditional statistical terms.

  • R as it is.

    This series tries to teach the statistical programming language R “warts and all” so we can see it as the versatile and powerful data science tool that it is.

To get a taste of what we are up to in our writing please checkout our blog highlights and these series. For deeper treatments of more operational topics also check out our book Practical Data Science with R.

Or if you have something particular you need solved consider engaging us at Win-Vector LLC for data science consulting and/or training.

Posted on Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , 3 Comments on Using differential privacy to reuse training data

Using differential privacy to reuse training data

Win-Vector LLC‘s Nina Zumel wrote a great article explaining differential privacy and demonstrating how to use it to enhance forward step-wise logistic regression (essentially reusing test data). This allowed her to reproduce results similar to the recent Science paper “The reusable holdout: Preserving validity in adaptive data analysis”. The technique essentially protects and reuses test data, allowing the series of adaptive decisions driving forward step-wise logistic regression to remain valid with respect to unseen future data. Without the differential privacy precaution these steps are not always sufficiently independent of each other to ensure good model generalization performance. Through differential privacy one gets safe reuse of test data across many adaptive queries, yielding more accurate estimates of out of sample performance, more robust choices, and resulting in a better model.

In this note I will discuss a specific related application: using differential privacy to reuse training data (or equivalently make training procedures more statistically efficient). I will also demonstrate similar effects using more familiar statistical techniques.

Continue reading Using differential privacy to reuse training data

Posted on Categories data science, Exciting Techniques, Expository Writing, Statistics, Statistics To English Translation, TutorialsTags , , , , , , 6 Comments on A Simpler Explanation of Differential Privacy

A Simpler Explanation of Differential Privacy

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork, et. al. (see references at the end of the article) that apply results from differential privacy to machine learning.

In this article we’ll work through the definition of differential privacy and demonstrate how Dwork’s recent results can be used to improve the model fitting process.

The Voight-Kampff Test: Looking for a difference. Scene from Blade Runner

Continue reading A Simpler Explanation of Differential Privacy