Archive

Archive for the ‘Computer Science’ Category

Ergodic Theory for Interested Computer Scientists

February 4th, 2012 No comments

We describe ergodic theory in modern notation accessible to interested computer scientists.

The ergodic theorem (http://en.wikipedia.org/wiki/Ergodic theory (link)) is an important principle of recurrence and averaging in dynamical systems. However, there are some inconsistent uses of the term, much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is often implied) and often the conclusion of the theory is mis-described as its premises.

By “interested computer scientists” we mean people who know math and work with probabilistic systems1, but know not to accept mathematical definitions without some justification (actually a good attitude for mathematicians also). Read more…

Six Fundamental Methods to Generate a Random Variable

January 20th, 2012 1 comment

Introduction

To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them?

The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing methods (pseudorandom generators, entropy sources, Gibbs samplers, Metropolis–Hastings algorithm, Markov chain Monte Carlo methods, bootstrap methods and so on). Our thesis is: this diversity is supported by only a few fundamental methods. And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box “brand name” technique.

We will discuss the half dozen basic methods that all of these techniques are derived from. Read more…

What to do when you run out of memory

December 6th, 2011 No comments

A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working “out of core” or what you should do when you run out of memory.

Early computers were most limited by their paltry memory sizes. von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single type written page (much more memory than the few hundred words available to the Eniac). The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.


IMG 0062

SDC 920 computer, Computer History Museum, Mountain View CA

Historically computer scientists have concentrated on streaming or online algorithms (that is algorithms that work with the data in the order it is available and use limited memory). For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort). The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce. So in one sense databases and Map Reduce different APIs on top of very related technologies (journaling, splitting and merging). Replicating data (or even delaying duplicate elimination) that is already “too large to handle” may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude, compare a 2 terabyte drive to an 8 gigabyte memory stick). Read more…

An Appreciation of Locality Sensitive Hashing

November 21st, 2011 1 comment

We share our admiration for a set of results called “locality sensitive hashing” by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Read more…

“The Mythical Man Month” is still a good read

October 23rd, 2011 1 comment

Re-read Fred Brooks “The Mythical Man Month” over vacation.  Book remains insightful about computer science and project management. Read more…

Programmers Should Know R

August 6th, 2011 1 comment

Programmers should definitely know how to use R. I don’t mean they should switch from their current language to R, but they should think of R as a handy tool during development. Read more…

Automatic Detection of Potential Deadlock

June 4th, 2011 Comments off

We would like to share a programming article we wrote on the automatic detection of potential deadlock. Read more…

Brevity is a Virtue

April 27th, 2011 2 comments

Our friends at Dataspora have a nice article on the more modern Map Reduce languages. A very good read and clearly a lot of thought went into preparing it. Read more…

Categories: Computer Science, Programming Tags:

A Personal Perspective on Machine Learning

October 31st, 2010 7 comments

Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature. Read more…

What Did Theorists Do Before The Age Of Big Data?

August 2nd, 2010 1 comment

We have been living in the age of “big data” for some time now. This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: “The Unreasonable Effectiveness of Data” Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)). But I have gotten to thinking about the period before this. The period before we had easy access to so much data, before most computation was aggregation and before we accepted numerical analysis style convergence as “efficient.” A small problem I needed to solve (as part of a bigger project) reminded me what theoretical computer scientists did then: we worried about provable worst case efficiency.

Read more…