Posted on Categories math programming, Programming, StatisticsTags , , , , , , , ,

Large Data Logistic Regression (with example Hadoop code)

Living in the age of big data we ask what to do when we have the good fortune to be presented with a huge amount of supervised training data? Most often at large scale we are presented with the un-supervised problems of characterization and information extraction; but some problem domains offer an almost limitless supply of supervised training data (such as using older data to build models that predict the near future). Having too much training data is a good problem to have and there are ways to use traditional methods (like logistic regression) at this scale. We present an “out of core” logistic regression implementation and a quick example in Apache Hadoop running on Amazon Elastic MapReduce. This presentation assumes familiarity with Unix style command lines, Java and Hadoop.

Continue reading Large Data Logistic Regression (with example Hadoop code)

Posted on Categories Applications, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , ,

Learn Logistic Regression (and beyond)

One of the current best tools in the machine learning toolbox is the 1930s statistical technique called logistic regression. We explain how to add professional quality logistic regression to your analytic repertoire and describe a bit beyond that. Continue reading Learn Logistic Regression (and beyond)

Posted on Categories Computer Science, Expository Writing, History, Opinion, StatisticsTags , , 7 Comments on A Personal Perspective on Machine Learning

A Personal Perspective on Machine Learning

Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature. Continue reading A Personal Perspective on Machine Learning

Posted on Categories History, StatisticsTags , , , , , ,

Deming, Wald and Boyd: cutting through the fog of analytics

This article is a quick appreciation of some of the statistical, analytic and philosphic techniques of Deming, Wald and Boyd. Many of these techniques have become pillars of modern industry through the sciences of statistics and operations research.
Continue reading Deming, Wald and Boyd: cutting through the fog of analytics

Posted on Categories Coding, Rants, StatisticsTags , , , 10 Comments on R annoyances

R annoyances

Readers returning to our blog will know that Win-Vector LLC is fairly “pro-R.” You can take that to mean “in favor or R” or “professionally using R” (both statements are true). Some days we really don’t feel that way. Continue reading R annoyances

Posted on Categories Applications, Expository Writing, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Statistics To English TranslationTags , , , , , ,

Living in A Lognormal World

Recently, we had a client come to us with (among other things) the following question:
Who is more valuable, Customer Type A, or Customer Type B?

This client already tracked the net profit and loss generated by every customer who used his services, and had begun to analyze his customers by group. He was especially interested in Customer Type A; his gut instinct told him that Type A customers were quite profitable compared to the others (Type B) and he wanted to back up this feeling with numbers.

He found that, on average, Type A customers generate about $92 profit per month, and Type B customers average about $115 per month (The data and figures that we are using in this discussion aren’t actual client data, of course, but a notional example). He also found that while Type A customers make up about 4% of the customer base, they generate less than 4% of the net profit per month. So Type A customers actually seem to be less profitable than Type B customers. Apparently, our client was mistaken.

Or was he? Continue reading Living in A Lognormal World

Posted on Categories Opinion, Rants, StatisticsTags , , , , 1 Comment on Relative returns: a banker versus trader paradox

Relative returns: a banker versus trader paradox

Quick Joke.

Q: What is the difference between a banker and a trader?
A: A banker will try and tell you a 10% loss followed by a 10% gain is breaking even.

Continue reading Relative returns: a banker versus trader paradox

Posted on Categories Applications, Expository Writing, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Statistics To English TranslationTags , , , ,

Statistics to English Translation, Part 2b: Calculating Significance

In the previous installment of the Statistics to English Translation, we discussed the technical meaning of the term ”significant”. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance in research papers: statements like “
$ (F(2, 864) = 6.6, p = 0.0014)$ ”.

As in the last article, we will concentrate on situations where we want to test the difference of means. You should read that previous article first, so you are familiar with the terminology that we use in this one.

A pdf version of this current article can be found here.
Continue reading Statistics to English Translation, Part 2b: Calculating Significance

Posted on Categories Rants, StatisticsTags , , 3 Comments on CRU graph yet again (with R)

CRU graph yet again (with R)

IowaHawk has a excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produced similarly bad results using R.
Continue reading CRU graph yet again (with R)

Posted on Categories Applications, Expository Writing, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Statistics To English TranslationTags , , , 4 Comments on Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’

Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’

In this installment of our ongoing Statistics to English Translation series1, we will look at the technical meaning of the term ”significant”. As you might expect, what it means in statistics is not exactly what it means in everyday language.

As always, a pdf version of this article is available as well. Continue reading Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’