Posted on Categories Administrativia, Pragmatic Data Science, StatisticsTags , , 5 Comments on Practical Data Science with R: Release date announced

Practical Data Science with R: Release date announced

PracticalDataScienceCover

It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in all three formats: PDF, ePub, and Kindle.

If you haven’t yet, order it now!

(softbound 416 pages, black and white; includes access to color PDF, ePub and Kindle when available)

Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 2 Comments on Can a classifier that never says “yes” be useful?

Can a classifier that never says “yes” be useful?

Many data science projects and presentations are needlessly derailed by not having set shared business relevant quantitative expectations early on (for some advice see Setting expectations in data science projects). One of the most common issues is the common layman expectation of “perfect prediction” from classification projects. It is important to set expectations correctly so your partners know what you are actually working towards and do not consider late choices of criteria disappointments or “venue shopping.” Continue reading Can a classifier that never says “yes” be useful?

Posted on Categories Administrativia, Practical Data Science, StatisticsTags , , 1 Comment on Some statistics about the book

Some statistics about the book

The release date for Zumel, Mount “Practical Data Science with R” is getting close. I thought I would share a few statistics about what goes into this kind of book. Continue reading Some statistics about the book

Posted on Categories Administrativia, StatisticsTags , , , , , ,

The Statistics behind “Verification by Multiplicity”

There’s a new post up at the ninazumel.com blog that looks at the statistics of “verification by multiplicity” — the statistical technique that is behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope.

We normally don’t write about science here at Win-Vector, but we do sometimes examine the statistics and statistical methods behind scientific announcements and issues. NASA’s new technique is a cute and relatively straightforward (statistically speaking) approach.

From what I understand of the introduction to the paper, there are two ways to determine whether or not a planet candidate is really a planet: the first is to confirm the fact with additional measurements of the target star’s gravitational wobble, or by measurements of the transit times of the apparent planets across the face of the star. Getting sufficient measurements can take time. The other way is to “validate” the planet by showing that it’s highly unlikely that the sighting was a false positive. Specifically, the probability that the signal observed was caused by a planet should be at least 100 times larger than the probability that the signal is a false positive. The validation analysis is a Bayesian approach that considers various mechanisms that produce false positives, determines the probability that these various mechanisms could have produced the signal in question, and compares them to the probability that a planet produced the signal.

The basic idea behind verification by multiplicity is that planets are often clustered in multi-planet star systems, while false positive measurements (mistaken identification of potential planets) occur randomly. Putting this another way: if false positives are random, then they won’t tend to occur together near the same star. So if you observe a star with multiple “planet signals,” it’s unlikely that all the signals are false positives. We can use that observation to quantify how much more likely it is that a star with multiple candidates actually hosts a planet. The resulting probability can be used as an improved prior for the planet model when doing the statistical validation described above.

You can read the rest of the article here.