Six Fundamental Methods to Generate a Random Variable

January 20th, 2012 1 comment

Introduction

To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them?

The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing methods (pseudorandom generators, entropy sources, Gibbs samplers, Metropolis–Hastings algorithm, Markov chain Monte Carlo methods, bootstrap methods and so on). Our thesis is: this diversity is supported by only a few fundamental methods. And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box “brand name” technique.

We will discuss the half dozen basic methods that all of these techniques are derived from. Read more…

Importance Sampling

January 1st, 2012 No comments

We describe briefly the powerful simulation tefchnique known as
“importance sampling.” Importance sampling is a technique that lets
you use numerical simulation to explore events that, at first look,
appear too rare to be reliably approximated numerically. The correctness
of importance sampling follows almost immediately from the definition
of a change of density. Like most mathematical techniques, importance
sampling brings in its own concerns and controls that were not obvious
in the original problem. To deal with these concerns (like picking
the re-weighting to use) we will largely appeal to the ideas from
“A Tutorial on the Cross-Entropy Method” Pieter-Tjerk de Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein, Annals of Operations Research, 2005 vol. 134 (1) pp. 19-67. Read more…

Why you can not to use statistics to dispute magic

December 10th, 2011 No comments

It is a subtle point that statistical modeling is different than model based science. However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded). It is often claimed that model based science is being done when in fact statistics is what is being done (for instance some of the unfortunate distractions of flawed reports related to the important question of the magnitude of plausible anthropogenic global warming).

Both model based science and statistics are wonderful fields, but it is important to not receive the results of one when you have paid for the other.

We will pointedly discuss one of the differences. Read more…

What to do when you run out of memory

December 6th, 2011 No comments

A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working “out of core” or what you should do when you run out of memory.

Early computers were most limited by their paltry memory sizes. von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single type written page (much more memory than the few hundred words available to the Eniac). The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.


IMG 0062

SDC 920 computer, Computer History Museum, Mountain View CA

Historically computer scientists have concentrated on streaming or online algorithms (that is algorithms that work with the data in the order it is available and use limited memory). For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort). The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce. So in one sense databases and Map Reduce different APIs on top of very related technologies (journaling, splitting and merging). Replicating data (or even delaying duplicate elimination) that is already “too large to handle” may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude, compare a 2 terabyte drive to an 8 gigabyte memory stick). Read more…

My Favorite Graphs

December 5th, 2011 5 comments

The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.

– William Cleveland, The Elements of Graphing Data, Chapter 2

In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.

I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.

Read more…

Correlation and R-Squared

November 21st, 2011 1 comment

What is R2? In the context of predictive models (usually linear regression), where y is the true outcome, and f is the model’s prediction, the definition that I see most often is:

4471BBA8-E9DB-4D30-A9AE-A74F8C773247.jpg

In words, R2 is a measure of how much of the variance in y is explained by the model, f.

Under “general conditions”, as Wikipedia says,
R2 is also the square of the correlation between the actual and predicted outcomes:

A4311540-8DFB-45FB-93F7-65E7B72AE6C8.jpg

I prefer the “squared correlation” definition, as it gets more directly at what is usually my primary concern: prediction. If R2 is close to one, then the model’s predictions mirror true outcome, tightly. If R2 is low, then either the model does not mirror true outcome, or it only mirrors it loosely: a “cloud” that — hopefully — is oriented in the right direction. Of course, looking at the graph always helps:

R2_compare.png

The question we will address here is : how do you get from R2 to correlation?

Read more…

An Appreciation of Locality Sensitive Hashing

November 21st, 2011 1 comment

We share our admiration for a set of results called “locality sensitive hashing” by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Read more…

“The Mythical Man Month” is still a good read

October 23rd, 2011 1 comment

Re-read Fred Brooks “The Mythical Man Month” over vacation.  Book remains insightful about computer science and project management. Read more…

Kernel Methods and Support Vector Machines de-Mystified

October 7th, 2011 2 comments

We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Read more…

Increase your productivity

September 24th, 2011 1 comment

I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting. The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior. Read more…