Archive

Author Archive

Ergodic Theory for Interested Computer Scientists

February 4th, 2012 No comments

We describe ergodic theory in modern notation accessible to interested computer scientists.

The ergodic theorem (http://en.wikipedia.org/wiki/Ergodic theory (link)) is an important principle of recurrence and averaging in dynamical systems. However, there are some inconsistent uses of the term, much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is often implied) and often the conclusion of the theory is mis-described as its premises.

By “interested computer scientists” we mean people who know math and work with probabilistic systems1, but know not to accept mathematical definitions without some justification (actually a good attitude for mathematicians also). Read more…

Six Fundamental Methods to Generate a Random Variable

January 20th, 2012 1 comment

Introduction

To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them?

The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing methods (pseudorandom generators, entropy sources, Gibbs samplers, Metropolis–Hastings algorithm, Markov chain Monte Carlo methods, bootstrap methods and so on). Our thesis is: this diversity is supported by only a few fundamental methods. And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box “brand name” technique.

We will discuss the half dozen basic methods that all of these techniques are derived from. Read more…

Importance Sampling

January 1st, 2012 No comments

We describe briefly the powerful simulation tefchnique known as
“importance sampling.” Importance sampling is a technique that lets
you use numerical simulation to explore events that, at first look,
appear too rare to be reliably approximated numerically. The correctness
of importance sampling follows almost immediately from the definition
of a change of density. Like most mathematical techniques, importance
sampling brings in its own concerns and controls that were not obvious
in the original problem. To deal with these concerns (like picking
the re-weighting to use) we will largely appeal to the ideas from
“A Tutorial on the Cross-Entropy Method” Pieter-Tjerk de Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein, Annals of Operations Research, 2005 vol. 134 (1) pp. 19-67. Read more…

Why you can not to use statistics to dispute magic

December 10th, 2011 No comments

It is a subtle point that statistical modeling is different than model based science. However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded). It is often claimed that model based science is being done when in fact statistics is what is being done (for instance some of the unfortunate distractions of flawed reports related to the important question of the magnitude of plausible anthropogenic global warming).

Both model based science and statistics are wonderful fields, but it is important to not receive the results of one when you have paid for the other.

We will pointedly discuss one of the differences. Read more…

What to do when you run out of memory

December 6th, 2011 No comments

A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working “out of core” or what you should do when you run out of memory.

Early computers were most limited by their paltry memory sizes. von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single type written page (much more memory than the few hundred words available to the Eniac). The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.


IMG 0062

SDC 920 computer, Computer History Museum, Mountain View CA

Historically computer scientists have concentrated on streaming or online algorithms (that is algorithms that work with the data in the order it is available and use limited memory). For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort). The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce. So in one sense databases and Map Reduce different APIs on top of very related technologies (journaling, splitting and merging). Replicating data (or even delaying duplicate elimination) that is already “too large to handle” may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude, compare a 2 terabyte drive to an 8 gigabyte memory stick). Read more…

An Appreciation of Locality Sensitive Hashing

November 21st, 2011 1 comment

We share our admiration for a set of results called “locality sensitive hashing” by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Read more…

“The Mythical Man Month” is still a good read

October 23rd, 2011 1 comment

Re-read Fred Brooks “The Mythical Man Month” over vacation.  Book remains insightful about computer science and project management. Read more…

Kernel Methods and Support Vector Machines de-Mystified

October 7th, 2011 2 comments

We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Read more…

Increase your productivity

September 24th, 2011 1 comment

I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting. The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior. Read more…

The equivalence of logistic regression and maximum entropy models

September 23rd, 2011 Comments off

Nina Zumel recently gave a very clear explanation of logistic regression ( The Simpler Derivation of Logistic Regression ). In particular she called out the central role of log-odds ratios and demonstrated how the “deviance” (that mysterious
quantity reported by fitting packages) is both a term in “the pseudo-R^2″ (so directly measures goodness of fit) and is the quantity that is actually optimized during the fitting procedure. One great point of the writeup was how simple everything is once you start thinking in terms of derivatives (and that it isn’t so much the functional form of the sigmoid that is special but its relation to its own derivative that is special).

We adapt these presentation ideas to make explicit the well known equivalence of logistic regression and maximum entropy models. Read more…