Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Statistics To English TranslationTags , , , , 9 Comments on How Do You Know if Your Data Has Signal?

How Do You Know if Your Data Has Signal?

NewImage
Image by Liz Sullivan, Creative Commons. Source: Wikimedia

An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure what are the true causes or predictors of the phenomenon you are interested in, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; and more importantly, they can lead to overfit models that generalize poorly on new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data, even when that data has no signal. We showed some examples of this previously in our “Bad Bayes” blog post.

In this latest “Statistics as it should be” article, we will look at a heuristic to help determine which of your input variables have signal. Continue reading How Do You Know if Your Data Has Signal?

Posted on Categories Opinion, Rants, StatisticsTags , , , 9 Comments on I was wrong about statistics

I was wrong about statistics

I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about some of the statistical mistakes I have made, as a mere working data scientist, and more of a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).


5317820857 d1f6a5b8a9 b
Used wrong (image Justin Baeder, some rights reserved).

Continue reading I was wrong about statistics