This is a tutorial on how to try out a new package in R. The summary is: expect errors, search out errors and don’t start with the built in examples or real data.
Suppose you want to try out a novel statistical technique? A good fraction of the time R is your best bet for a first trial. Take as an example general additive models (“Generalized Additive Models,” Trevor J Hastie, Robert Tibshirani, Statistical Science (1986) vol. 1 (3) pp. 297-318); R has a package named “gam” written by Trevor Hastie himself. But, like most R packages, trying the package from the supplied documentation brings in unfamiliar data and concerns. It is best to start small and quickly test if the package itself is suitable to your needs. We give a quick outline of how to learn such a package and quickly find out if the package is for you.
Continue reading The cranky guide to trying R packages
Living in the age of big data we ask what to do when we have the good fortune to be presented with a huge amount of supervised training data? Most often at large scale we are presented with the un-supervised problems of characterization and information extraction; but some problem domains offer an almost limitless supply of supervised training data (such as using older data to build models that predict the near future). Having too much training data is a good problem to have and there are ways to use traditional methods (like logistic regression) at this scale. We present an “out of core” logistic regression implementation and a quick example in Apache Hadoop running on Amazon Elastic MapReduce. This presentation assumes familiarity with Unix style command lines, Java and Hadoop.
Continue reading Large Data Logistic Regression (with example Hadoop code)
One of the current best tools in the machine learning toolbox is the 1930s statistical technique called logistic regression. We explain how to add professional quality logistic regression to your analytic repertoire and describe a bit beyond that. Continue reading Learn Logistic Regression (and beyond)
Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature. Continue reading A Personal Perspective on Machine Learning
This article is a quick appreciation of some of the statistical, analytic and philosphic techniques of Deming, Wald and Boyd. Many of these techniques have become pillars of modern industry through the sciences of statistics and operations research.
Continue reading Deming, Wald and Boyd: cutting through the fog of analytics
Readers returning to our blog will know that Win-Vector LLC is fairly “pro-R.” You can take that to mean “in favor or R” or “professionally using R” (both statements are true). Some days we really don’t feel that way. Continue reading R annoyances
Recently, we had a client come to us with (among other things) the following question:
Who is more valuable, Customer Type A, or Customer Type B?
This client already tracked the net profit and loss generated by every customer who used his services, and had begun to analyze his customers by group. He was especially interested in Customer Type A; his gut instinct told him that Type A customers were quite profitable compared to the others (Type B) and he wanted to back up this feeling with numbers.
He found that, on average, Type A customers generate about $92 profit per month, and Type B customers average about $115 per month (The data and figures that we are using in this discussion aren’t actual client data, of course, but a notional example). He also found that while Type A customers make up about 4% of the customer base, they generate less than 4% of the net profit per month. So Type A customers actually seem to be less profitable than Type B customers. Apparently, our client was mistaken.
Or was he? Continue reading Living in A Lognormal World
Q: What is the difference between a banker and a trader?
A: A banker will try and tell you a 10% loss followed by a 10% gain is breaking even.
Continue reading Relative returns: a banker versus trader paradox
In the previous installment of the Statistics to English Translation, we discussed the technical meaning of the term ”significant”. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance in research papers: statements like “
As in the last article, we will concentrate on situations where we want to test the difference of means. You should read that previous article first, so you are familiar with the terminology that we use in this one.
A pdf version of this current article can be found here.
Continue reading Statistics to English Translation, Part 2b: Calculating Significance
IowaHawk has a excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produced similarly bad results using R.
Continue reading CRU graph yet again (with R)