Posted on Categories data science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on Python Data Science Tip: Don’t use Default Cross Validation Settings

Python Data Science Tip: Don’t use Default Cross Validation Settings

Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don’t use the default cross validation settings. The default can default to a deterministic, and even ordered split, which is not in general what one wants or expects from a statistical point of view. From a software engineering point of view the defaults may be sensible as since they don’t touch the pseudo-random number generator they are repeatable, deterministic, and side-effect free.

This issue falls under “read the manual”, but it is always frustrating when the defaults are not sufficiently generous.

Continue reading Python Data Science Tip: Don’t use Default Cross Validation Settings

Posted on Categories Coding, Computer Science, data science, Expository Writing, math programming, Pragmatic Machine Learning, StatisticsTags , , , , , , , , , 4 Comments on The Geometry of Classifiers

The Geometry of Classifiers

As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado,, “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI Machine Learning Repository. For fun, we decided to do a follow-up study, using their data and several classifier implementations from scikit-learn, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing if there is a “geometry” of classifiers: which classifiers produce predictions patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.

Continue reading The Geometry of Classifiers