Categories: art, Opinion

A non-technical post and ask

This article is not on the usual technical topics of this blog, so you have my apology up front for that. And instead of trying to help you, we are asking for your help.

Nina Zumel has written a lot of important and helpful articles for this blog. I would call out in particular: her invention of and leadership in our Statistics to English category, clear writing on statistical significance, visualization and working as a data scientist. She has also written a bit more on the whole person: I Write, Therefore I Think and On Balance.

In this spirit I would like to call your attention to a KickStarter that is important to her and to all of us at Win-Vector LLC: the Non Stop Bhangra Documentary.

I am asking you to please consider promoting this KickStarter to anyone you know who cares about music, entertainment and culture in the San Francisco Bay Area, Indian culture, or the possibility of having some identity outside of professional work. Nina’s story is only one among many in an incredible collective of people who all give a lot of their time to share what has been called “infectious joy” with many (including local elementary and high schools). We would really like to see filmmaker Odell Hussey get the money to complete the documentary project he has been donating many hours to for years. This is exactly the kind of project KickStarter was designed for: finishing a larger work.

I ask that you consider supporting the Non Stop Bhangra Documentary. Please join us in supporting this amazing project.


Categories: data science, Statistics, Tutorials

A bit more on sample size

In our article What is a large enough random sample? we pointed out that if you wanted to measure a proportion to an accuracy “a” with chance of being wrong of “d” then one idea was to guarantee you had a sample size of at least:


This is the central question in designing opinion polls or running A/B tests. The estimate comes from a quick application of Hoeffding’s inequality, and because it has a simple form it is easy to see two things: accuracy is very expensive (to halve the size of the difference we are trying to measure we have to multiply the sample size by four), and confidence is cheap (increases in the required confidence or significance of a result cost only moderately in sample size).
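These two scaling claims can be checked numerically. The sketch below is in Python rather than R, and it assumes the two-sided form of Hoeffding's inequality, P(|p_hat - p| >= a) <= 2 exp(-2 n a^2), which rearranges to n >= ln(2/d) / (2 a^2). The earlier article may use a slightly different constant (the one-sided form drops the 2 inside the logarithm), and the function name `hoeffding_sample_size` is made up for this illustration.

```python
import math

def hoeffding_sample_size(a, d):
    """Sample size that suffices, under the two-sided Hoeffding bound
    (an assumption of this sketch), to estimate a proportion to within
    +/- a with probability of being wrong at most d."""
    if not (0 < a < 1 and 0 < d < 1):
        raise ValueError("a and d must both lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / d) / (2.0 * a * a))

base = hoeffding_sample_size(0.10, 0.05)    # 185
finer = hoeffding_sample_size(0.05, 0.05)   # half the error: 738, about 4x base
surer = hoeffding_sample_size(0.10, 0.005)  # a tenth the chance of error: 300

print(base, finer, surer)
```

Halving the error roughly quadruples the sample, while cutting the chance of being wrong by a factor of ten costs well under twice as many samples.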

However, for high-accuracy situations (when you are trying to measure two effects that are very close to each other) this estimate suggests a sample size that is larger than is strictly necessary (as we are using a bound, not an exact formula for the required sample size). As theorists or statisticians we like to err on the side of too large a sample (guaranteeing reliability), but somebody who is paying for each entry in a poll would want a smaller size.
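To get a feel for how much slack the bound carries, one can compare it with the exact binomial probability of missing by at least “a”. The following is only a rough Python sketch, not the R function from the article. It assumes the two-sided Hoeffding form, evaluates the exact probability at p = 0.5 (taken here as the hardest case, since the variance of a proportion is largest there), and uses illustrative function names.

```python
import math

def exact_miss_probability(n, a, p=0.5):
    """Exact P(|X/n - p| >= a) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for k in range(n + 1):
        # Small tolerance so boundary cases such as |0.4 - 0.5| count as >= 0.1
        # despite floating-point rounding.
        if abs(k / n - p) >= a - 1e-12:
            # Binomial pmf computed in log space to avoid overflow for large n.
            log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                       - math.lgamma(n - k + 1) + k * log_p + (n - k) * log_q)
            total += math.exp(log_pmf)
    return total

def hoeffding_bound(n, a):
    """Two-sided Hoeffding upper bound on the same probability."""
    return 2.0 * math.exp(-2.0 * n * a * a)

# n = 185 is ceil(ln(2/0.05) / (2 * 0.1^2)), the size the assumed
# two-sided bound asks for when a = 0.10 and d = 0.05.
n, a = 185, 0.10
exact = exact_miss_probability(n, a)
bound = hoeffding_bound(n, a)
print(f"n={n}: exact={exact:.4f}, Hoeffding bound={bound:.4f}")
```

At this sample size the exact miss probability comes out well below the bound, which is the gap a cost-conscious pollster would like to recover by choosing a smaller sample.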

This article shows a function that computes the exact size needed (using R).