Elon Musk’s writing about a Tesla battery fire reminded me of some of the math related to trying to estimate the rate of a rare event from a single occurrence of the event (plus many non-event occurrences). In this article we work through some of the ideas. Continue reading Estimating rates from a single occurrence of a rare event
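The full article's derivation is not part of this teaser, so the following is only a minimal sketch of one standard approach, assuming the number of events follows a Poisson distribution. It computes the exact (Garwood-style) two-sided interval for the expected count given a single observed event, then divides by an exposure. The function names and the `exposure` figure are hypothetical illustrations, not values from the article.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def bisect(f, lo, hi, iters=200):
    """Root of f on [lo, hi], assuming f changes sign exactly once."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_poisson_interval(k, alpha=0.05):
    """Two-sided exact interval for the expected count, given k observed events."""
    if k == 0:
        lower = 0.0
    else:
        # Lower bound: the rate at which seeing k or more events has probability alpha/2.
        lower = bisect(lambda lam: poisson_cdf(k - 1, lam) - (1 - alpha / 2), 0.0, float(k))
    # Upper bound: the rate at which seeing k or fewer events has probability alpha/2.
    upper = bisect(lambda lam: poisson_cdf(k, lam) - alpha / 2, float(k), k + 50.0)
    return lower, upper

# Hypothetical exposure (for example, units in service); not a figure from the article.
exposure = 1_000_000
low_count, high_count = exact_poisson_interval(1)
print(low_count / exposure, high_count / exposure)
```

With one observed event the interval on the expected count runs from roughly 0.025 to roughly 5.57, which shows how little a single occurrence pins down the underlying rate.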

## Some puzzles about boxes

This article is a break from data science, and is instead about the kind of problem you can try on the train. It is problem 70 in Bollobás’s “The Art of Mathematics” (though I forgot that and re-worked the problem crudely from memory when writing this article).

One of the many irritating things about airlines is that carry-on bag restrictions are often stated as “your maximum combined linear measurement (length + width + height) must not exceed 45 inches” when they really mean your bag must fit into a 14 inch by 9 inch by 22 inch box (so they may not actually accept a 43 inch by one inch by one inch pool spear as your carry-on). The “total linear measure” seems (at first glance) “gameable,” but can (through some hairy math) be seen to at least be self-consistent: it turns out you can’t put a box with a longer total linear measurement into a box with a smaller total linear measurement.

Let’s work out why this could be a problem and then why the measure works.
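As a quick numeric illustration of the gap between the two rules, here is a small Python sketch using the dimensions quoted above. The helper names are my own, and the fit check only considers placements with edges parallel to the box; the space diagonal is included because no straight edge longer than it can fit in any orientation.

```python
import math

def linear_measure(dims):
    """Airline-style total linear measure: length + width + height."""
    return sum(dims)

def fits_axis_aligned(item, box):
    """True if item fits inside box with edges parallel, trying the best axis matching."""
    return all(i <= b for i, b in zip(sorted(item), sorted(box)))

def space_diagonal(dims):
    """Longest straight segment that fits inside a box with these dimensions."""
    return math.sqrt(sum(d * d for d in dims))

sizer = (14, 9, 22)  # the box the airline really means
spear = (43, 1, 1)   # the pool spear that passes the stated rule

print(linear_measure(sizer), linear_measure(spear))  # both total 45 inches
print(fits_axis_aligned(spear, sizer))               # False
print(space_diagonal(sizer) < max(spear))            # True: too long even on the diagonal
```

So the spear satisfies the stated 45 inch rule while being far too long for the sizer box, which is exactly the sense in which the rule looks gameable.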

## Just spoke at The Berkeley R Language Beginner Study Group

Just spoke at The Berkeley R Language Beginner Study Group. Great audience, very bright. Here are my slides: StartR.pdf.

## Speaking at BARUG Wednesday, August 21, 2013

I’ll be talking at the “Official” BARUG meeting Wednesday, August 21, 2013. The RSVPs look full (sorry) but I wanted to post a thanks to the organizers for considering me. If things go well I’ll see if I can post the slides later (not sure if that is useful without detailed speaker’s notes).

## Practical Data Science with R, deal of the day Aug 1 2013

Deal of the Day August 1: Half off my book Practical Data Science with R. Use code dotd0801au at www.manning.com/zumel/

## What is “Practical Data Science with R”?

A bit about our upcoming book “Practical Data Science with R”. Nina and I share our current draft of the front matter from the book, a description that will help you decide if this is the book for you (we hope that it is). Or this could be the book that helps explain what you do to others.

## On Writing Our Book: A Little Philosophy

We recently got this question from a subscriber to our book:

… will you in any way describe what subject areas, backgrounds, courses etc. would help a non data scientist prepare themselves to at least understand at a deeper level why they techniques you will discuss work…and also understand the boundary conditions and limits of the models etc….. ?

[…] I would love to understand what I could review first to better prepare to extract the most from it.

It’s a good question, and it raises an interesting philosophical point. To read our book, it will of course help to know a little bit about statistics and probability, and to be familiar with R and/or with programming in general. But we do plan on introducing the necessary concepts as needed into our discussion, so we don’t consider these subjects to be “pre-requisites” in a strict sense.

Part of our reason for writing this book is to make reading about statistics/probability and machine learning easier. That is, we hope that if you read our book, other reference books and textbooks will make more sense, because we have given you a concrete context for the abstract concepts that the reference books cover.

So, my advice to our subscriber was to keep his references handy as he read our book, rather than trying to brush up on all the “pre-requisite” subjects first.

Of course, everyone learns differently, and we’d like to know what other readers think. What (if anything) would you consider “pre-requisites” to our book? What would you consider good companion references?

If you are subscribed to our book, please join the conversation, or post other comments on the *Practical Data Science with R* author’s forum. Your input will help us write a better book; we look forward to hearing from you.

## Practical Data Science with R news

We have some great news for “Practical Data Science with R”:

- We have started an announcement page to direct readers to the book, book forums, data, and the free preview chapter.

## Big News! “Practical Data Science with R” MEAP launched!

Nina Zumel and I (John Mount) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered the Manning Early Access Program (MEAP), which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.

Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and please ask them to also subscribe and forward).

## Bayesian and Frequentist Approaches: Ask the Right Question

It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you have the right question, then the right approach will naturally suggest itself to you. It could be a frequentist approach, it could be a Bayesian one, it could be both, even while solving the same problem.

Let’s take the example that Bayesians love to hate: significance testing, especially in clinical trial style experiments. Clinical trial experiments are designed to answer questions of the form “Does treatment X have a discernible effect on condition Y, on average?” To be specific, let’s use the question “Does drugX reduce hypertension, on average?” Assuming that your experiment does show a positive effect, the statistical significance tests that you run should check for the sorts of problems that John discussed in our previous article, Worry about correctness and repeatability, not p-values: What are the chances that an ineffective drug could produce the results that I saw? How likely is it that another researcher could replicate my results with the same size trial?
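To make the first of those questions concrete, here is a minimal Python sketch of an exact one-sided binomial test. Every number in it is a made-up assumption for illustration: a trial of 100 patients in which 60 improve, and an "ineffective drug" under which each patient improves with probability 0.5. The function name is likewise my own.

```python
from math import comb

def binomial_upper_tail(successes, trials, p_null):
    """P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical trial: 60 of 100 patients improve; an ineffective drug is
# assumed to give each patient a 0.5 chance of improving anyway.
p_value = binomial_upper_tail(60, 100, 0.5)
print(f"Chance an ineffective drug does at least this well: {p_value:.4f}")
```

Note that this is a statement about the trial as a whole (how surprising the result would be if the drug did nothing), not about any individual patient.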

We can argue about whether or not the question we are answering is the *correct* question — but given that it *is* the question, the procedure to answer it and to verify the statistical validity of the results is perfectly appropriate.

So what is the correct question? From your family doctor’s viewpoint, a clinical trial answers the question “If I prescribe drugX to all my hypertensive patients, will their blood pressure improve, on average?” That isn’t the question (hopefully) that your doctor actually asks, though possibly your insurance company does. Your doctor should be asking “If I prescribe drugX to *this patient*, the one sitting in my examination room, will the patient’s blood pressure improve?” There is only one patient, so there is no such thing as “on average.”

If your doctor has a master’s degree in statistics, the question might be phrased as “If I prescribe drugX to this patient, what is the posterior probability that the patient’s blood pressure will improve?” And that’s a Bayesian question.
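One simple way to attach a number to that kind of question is a Beta-Binomial model, sketched below in Python. This is an assumption-heavy illustration, not the method of the full article: it uses a uniform Beta(1, 1) prior, a hypothetical trial in which 60 of 100 treated patients improved, and it treats the patient in the examination room as exchangeable with the trial participants (no patient-specific information at all).

```python
from fractions import Fraction

def posterior_predictive_improve(improved, treated, prior_a=1, prior_b=1):
    """P(next patient improves) with a Beta(prior_a, prior_b) prior and binomial trial data.

    The posterior is Beta(prior_a + improved, prior_b + treated - improved), and the
    probability that the next patient improves is the mean of that posterior.
    """
    return Fraction(prior_a + improved, prior_a + prior_b + treated)

# Hypothetical trial data: 60 of 100 treated patients improved.
prob = posterior_predictive_improve(60, 100)
print(prob, float(prob))
```

A more realistic version would condition on what the doctor knows about this particular patient, which is where the Bayesian framing starts to earn its keep.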