Data preparation and cleaning are some of the most important steps of predictive analytic and data science tasks. They are laborious, where most of the errors are made, your last line of defense against a wild data, and hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self contained or genteel as tweaking machine learning models or hyperparameter tuning; and that is one of the reasons data preparation represents such an important practical opportunity for improvement.
Our group is distributing a detailed writeup of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).
We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.
“Data Science” is obviously a trendy term making it way through the hype cycle. Either nobody is good enough to be a data scientist (unicorns) or everybody is too good to be a data scientist (or the truth is somewhere in the middle).
And there is a quarter that grumbles that we are merely talking about statistics under a new name (see here and here).
It has always been the case that advances in data engineering (such as punch cards, or data centers) make analysis practical at new scales (though I still suspect Map/Reduce was a plot designed to trick engineers into being excited about ETL and report generation).
In this spirit next week we will write about the sequential analysis solution for A/B-testing, invented in the 1940s by one of the greats of statistics and operations research: Abraham Wald (whom we have written about before).
It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype-cycle, even the American Statistical Association was forced to admit that data science is actually a real thing and exists.
There’s a new post up at the ninazumel.com blog that looks at the statistics of “verification by multiplicity” — the statistical technique that is behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope.
We normally don’t write about science here at Win-Vector, but we do sometimes examine the statistics and statistical methods behind scientific announcements and issues. NASA’s new technique is a cute and relatively straightforward (statistically speaking) approach.
From what I understand of the introduction to the paper, there are two ways to determine whether or not a planet candidate is really a planet: the first is to confirm the fact with additional measurements of the target star’s gravitational wobble, or by measurements of the transit times of the apparent planets across the face of the star. Getting sufficient measurements can take time. The other way is to “validate” the planet by showing that it’s highly unlikely that the sighting was a false positive. Specifically, the probability that the signal observed was caused by a planet should be at least 100 times larger than the probability that the signal is a false positive. The validation analysis is a Bayesian approach that considers various mechanisms that produce false positives, determines the probability that these various mechanisms could have produced the signal in question, and compares them to the probability that a planet produced the signal.
The basic idea behind verification by multiplicity is that planets are often clustered in multi-planet star systems, while false positive measurements (mistaken identification of potential planets) occur randomly. Putting this another way: if false positives are random, then they won’t tend to occur together near the same star. So if you observe a star with multiple “planet signals,” it’s unlikely that all the signals are false positives. We can use that observation to quantify how much more likely it is that a star with multiple candidates actually hosts a planet. The resulting probability can be used as an improved prior for the planet model when doing the statistical validation described above.