It recently hit me that I see unit tests as a form of penance (in addition to being a great tool for specification and test-driven development). If you fix a bug and don’t add a unit test, I suspect you are not actually sorry. Continue reading
Recently Heroku was accused of using random queue routing while claiming to supply something similar to shortest queue routing (see: James Somers – Heroku’s Ugly Secret and more discussion at Hacker News: Heroku’s Ugly Secret). If this is true, it is pretty bad. I like randomized algorithms and I like queueing theory, but you need to work through proofs or at least simulations when playing with queues. You don’t want to pick an arbitrary algorithm and claim it works “due to randomness.” We will show a very quick example where randomized routing is very bad with near certainty. Just because things are “random” doesn’t mean you can’t or shouldn’t characterize them. Continue reading
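To make the routing comparison concrete, here is a minimal simulation sketch. It is not the example from the full article: the server count, the load, the job-size mix and the names `simulate`, `p_arrival` and `p_slow` are all assumptions chosen for illustration. Each server works through a FIFO queue one tick at a time, and both policies see the same arrival stream.

```python
import random
from collections import deque

def simulate(policy, n_servers=10, n_ticks=5000, p_arrival=0.4, p_slow=0.1, seed=0):
    """Return the mean number of ticks a job waits before its service starts.

    policy is "random" (pick any server) or "shortest" (pick the server
    with the fewest queued jobs). All parameters are illustrative.
    """
    arrivals = random.Random(seed)      # drives arrivals and job sizes
    router = random.Random(seed + 1)    # drives random routing only
    queues = [deque() for _ in range(n_servers)]
    total_wait = 0
    n_jobs = 0
    for _ in range(n_ticks):
        # Up to n_servers arrivals per tick, each with probability p_arrival.
        for _ in range(n_servers):
            if arrivals.random() >= p_arrival:
                continue
            # Most jobs take 1 tick; a few slow ones take 11 ticks.
            service = 11 if arrivals.random() < p_slow else 1
            if policy == "random":
                i = router.randrange(n_servers)
            else:
                i = min(range(n_servers), key=lambda j: len(queues[j]))
            # Work already queued ahead of this job is its waiting time.
            total_wait += sum(queues[i])
            n_jobs += 1
            queues[i].append(service)
        # Every busy server does one tick of work on the job at its head.
        for q in queues:
            if q:
                q[0] -= 1
                if q[0] == 0:
                    q.popleft()
    return total_wait / n_jobs

if __name__ == "__main__":
    for policy in ("random", "shortest"):
        print(policy, round(simulate(policy), 2))
```

With these assumed settings the servers are about 80% utilized, and a slow job under random routing delays every job that lands behind it, while shortest-queue routing steers new work elsewhere. Changing `p_slow` or the slow job length shows how sensitive random routing is to job-size variance.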
From time to time we work on projects that would benefit from a free, lightweight, pure Java linear programming library. That is, a library unencumbered by a bad license, available cheaply, without an infinite amount of file format and interop cruft, and available in Java (without binary blobs and JNI linkages). There are a few such libraries, but none have repeatably, efficiently and reliably met our needs. So we have re-packaged an older one of our own for release under the Apache 2.0 license. This code will have its own rough edges (not having been used widely in production), but I still feel it fills an important gap. This article is a brief introduction to our WVLPSolver Java library. Continue reading
We have added a worked example to the README of our experimental logistic regression code.
The Logistic codebase is designed to support experimentation on variations of logistic regression including:
- A pure Java implementation (thus directly usable in Java server environments).
- A simple multinomial implementation (that allows more than two possible result categories).
- The ability to work with data sets too large for memory, directly from files or database tables.
- A demonstration of the steps needed to use standard Newton-Raphson in Hadoop.
- The ability to work with arbitrarily large categorical inputs.
- Explicit L2 model regularization.
- Safe optimization methods (like conjugate gradient, line-search and majorization) for situations where the standard iteratively reweighted least squares/Newton-Raphson method fails.
- An overall framework for quickly trying implementation experiments (as opposed to novel usage experiments).
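As a rough illustration of two items in the list above (Newton-Raphson and explicit L2 regularization), here is a minimal sketch for a single input variable plus an intercept. It is written in Python for brevity and is not taken from the Logistic codebase; the names `fit_newton`, `gradient` and `lam` are assumptions. It takes plain Newton steps with no step-size safeguard, which is the setting where the safer methods mentioned in the list become useful.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient(b0, b1, xs, ys, lam):
    """Gradient of the L2-penalized negative log-likelihood."""
    g0, g1 = lam * b0, lam * b1
    for x, y in zip(xs, ys):
        r = sigmoid(b0 + b1 * x) - y
        g0 += r
        g1 += r * x
    return g0, g1

def fit_newton(xs, ys, lam=0.1, steps=25):
    """Fit p(y=1|x) = sigmoid(b0 + b1*x) with unsafeguarded Newton steps."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0, g1 = gradient(b0, b1, xs, ys, lam)
        # Hessian: sum of p(1-p) * [1, x][1, x]^T, plus lam on the diagonal.
        h00 = h11 = lam
        h01 = 0.0
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system H d = g in closed form and step by -d.
        det = h00 * h11 - h01 * h01
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1

if __name__ == "__main__":
    xs = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]
    ys = [0, 0, 1, 0, 0, 1, 1, 1]
    print(fit_newton(xs, ys, lam=0.1))
```

The penalty term keeps the Hessian positive definite, so the 2x2 solve is always well defined, and it keeps the coefficients finite even when the two classes are perfectly separable.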
What we mean by this code being “experimental” is that it has capabilities that many standard implementations do not. In fact most of the items in the above list are not usually made available to the logistic regression user. But our project is also stand-alone and not as well integrated into existing workflows as standard production systems. Before trying our code you may want to try R or Mahout. Continue reading
I am going to come out and say it: I am emotionally done with 32-bit machines and operating systems. My sympathy for them is at an end.
I know that ARM is still 32-bit, but in that case you get something big back in exchange: the ability to deploy on smartphones and tablets. For PCs and servers, 32-bit addressing’s time is long past, yet we still have to code for and regularly run into these machines and operating systems. The time/space savings of 32-bit representations are nothing compared to the loss of capability in sticking with that architecture and the wasted effort in coding around it. My work is largely data analysis in a server environment, and it is just getting ridiculous to not be able to always assume at least a 64-bit machine. Continue reading
A lot of people consider the static typing found in languages such as C, C++, ML, Java and Scala as needless hairshirtism. They consider the dynamic typing of languages like Lisp, Scheme, Perl, Ruby and Python as a critical advantage (ignoring other features of these languages and other efforts at generic programming such as the STL).
I strongly disagree. I find the pain of having to type or read through extra declarations is small (especially if you know how to copy-paste or use a modern IDE). And certainly much smaller than the pain of the dynamic language driven anti-patterns of: lurking bugs, harder debugging and more difficult maintenance. Debugging is one of the most expensive steps in software development, so you want to incur less of it (even if it is at the expense of more typing). To be sure, there is significant cost associated with static typing (I confess: I had to read the book and post a question on Stack Overflow to design the type interfaces in Automatic Differentiation with Scala; but this is up-front design effort that has ongoing benefits, not hidden debugging debt).
There is, of course, no a priori reason anybody should immediately care whether I like dynamic typing. What I mean by saying this is that I have some experience and observations about problems with dynamic typing that I feel can help others.
I will point out a couple of example bugs that just keep giving. Maybe you think you are too careful to ever make one of these mistakes, but somebody in your group surely will. And a type checking compiler finding a possible bug early is the cheapest way to deal with a bug (and static types themselves are only a stepping stone for even deeper static code analysis). Continue reading
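As a small hypothetical illustration of the kind of lurking bug a type checker can catch (this is not one of the examples from the full article, and the names `largest_reading` and `raw` are assumptions), consider values read from a text file that are never converted from strings:

```python
from typing import List

def largest_reading(readings: List[int]) -> int:
    """Return the largest numeric reading."""
    return max(readings)

# Values parsed from a text file arrive as strings unless converted.
raw = ["9", "10", "8"]

# Runs without error, but compares strings lexicographically.
wrong = max(raw)

# Converting first gives the intended numeric comparison.
right = largest_reading([int(r) for r in raw])
```

Nothing fails at run time in the `wrong` case; the program just quietly reports the wrong maximum. A static checker such as mypy would reject a call like `largest_reading(raw)` before the code ever ran, because `raw` is a list of strings rather than a list of integers.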
We describe ergodic theory in modern notation accessible to interested computer scientists.
The ergodic theorem (see http://en.wikipedia.org/wiki/Ergodic_theory) is an important principle of recurrence and averaging in dynamical systems. However, there are some inconsistent uses of the term, much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is often implied) and often the conclusion of the theory is mis-described as its premises.
By “interested computer scientists” we mean people who know math and work with probabilistic systems, but know not to accept mathematical definitions without some justification (actually a good attitude for mathematicians also). Continue reading
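For orientation, one standard form of the pointwise (Birkhoff) ergodic theorem can be written as follows. The notation is an assumption and may differ from the full article: $(X, \mu)$ is a probability space, $T : X \to X$ is a measurable map that preserves $\mu$ and is ergodic, and $f$ is a $\mu$-integrable function.

$$
\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} f\left(T^{k} x\right) = \int_X f \, d\mu \qquad \text{for } \mu\text{-almost every } x \in X .
$$

Note that $T$ is a deterministic map: the only place probability enters is in the choice of the starting point $x$. The equality of time averages (left) and space averages (right) is the conclusion of the theorem, not one of its premises.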
To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them?
The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing methods (pseudorandom generators, entropy sources, Gibbs samplers, Metropolis–Hastings algorithm, Markov chain Monte Carlo methods, bootstrap methods and so on). Our thesis is: this diversity is supported by only a few fundamental methods. And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box “brand name” technique.
We will discuss the half dozen basic methods that all of these techniques are derived from. Continue reading
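As a sketch of what “simple composable mechanisms” can look like, here are two widely used building blocks: inverse transform sampling and rejection sampling. They are not necessarily the methods the full article lists, and the function names and target distributions are assumptions chosen for illustration. Both are built on nothing more than a uniform random source.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse transform sampling: push a uniform draw through the inverse CDF.

    For the exponential distribution the CDF is F(x) = 1 - exp(-rate * x),
    so the inverse is -log(1 - u) / rate.
    """
    u = rng.random()              # uniform on [0, 1)
    return -math.log(1.0 - u) / rate

def sample_linear_density(rng):
    """Rejection sampling from the density f(x) = 2x on [0, 1].

    Propose x uniformly on [0, 1] and accept it with probability
    f(x) / 2 = x, where 2 is the maximum of f on the interval.
    """
    while True:
        x = rng.random()
        if rng.random() < x:
            return x

if __name__ == "__main__":
    rng = random.Random(0)
    n = 20000
    print(sum(sample_exponential(2.0, rng) for _ in range(n)) / n)
    print(sum(sample_linear_density(rng) for _ in range(n)) / n)
```

The sample means should land near the theoretical values of 1/2 for the exponential with rate 2 and 2/3 for the linear density, which is a cheap sanity check on any sampler.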
A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working “out of core” or what you should do when you run out of memory.
Early computers were most limited by their paltry memory sizes. Von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single typewritten page (much more memory than the few hundred words available to the ENIAC). The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.
SDC 920 computer, Computer History Museum, Mountain View CA
Historically, computer scientists have concentrated on streaming or online algorithms (that is, algorithms that work with the data in the order it is available and use limited memory). For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort). The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and MapReduce. So in one sense databases and MapReduce are different APIs on top of very related technologies (journaling, splitting and merging). Replicating data (or even delaying duplicate elimination) that is already “too large to handle” may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude, compare a 2 terabyte drive to an 8 gigabyte memory stick). Continue reading
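The splitting and merging idea can be sketched as a small external merge sort. This is a minimal illustration under assumed names (`external_sort`, `chunk_size`) rather than code from the full article: sorted runs are written to temporary files (the scatter step), then streamed back and merged (the gather step), so only one chunk plus one line per run needs to be in memory at a time.

```python
import heapq
import os
import tempfile

def _read_run(handle):
    # Stream integers back from one sorted run, one line at a time.
    for line in handle:
        yield int(line)

def external_sort(values, chunk_size):
    """Yield the integers in values in sorted order.

    At most chunk_size items are held in memory while building runs.
    """
    paths = []
    chunk = []

    def flush():
        # Sort the current chunk and write it out as one run on disk.
        if chunk:
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            with run:
                run.writelines(f"{v}\n" for v in chunk)
            paths.append(run.name)
            chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            flush()
    flush()

    handles = [open(p) for p in paths]
    try:
        # heapq.merge lazily merges already-sorted inputs.
        yield from heapq.merge(*(_read_run(h) for h in handles))
    finally:
        for h in handles:
            h.close()
        for p in paths:
            os.remove(p)

if __name__ == "__main__":
    print(list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3)))
```

A production version would also bound the number of runs merged at once and use a more compact on-disk format, but the structure is the same: temporarily spend plentiful secondary storage to work around scarce primary storage.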
We share our admiration for a set of results called “locality sensitive hashing” by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Continue reading
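To give a flavor of the idea, here is a minimal sketch of one well-known locality sensitive hashing family, random hyperplane hashing for cosine similarity. It is not necessarily the simplified example from the full article, and the function names and test vectors are assumptions. Each hash bit records which side of a random hyperplane a vector falls on, so vectors separated by a small angle agree on most bits.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian coordinates."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return [
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    ]

def agreement(sig_a, sig_b):
    """Fraction of bit positions where the two signatures match."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

if __name__ == "__main__":
    planes = make_hyperplanes(n_bits=256, dim=3)
    a = [1.0, 0.0, 0.0]
    near = [1.0, 0.1, 0.0]   # small angle to a
    far = [0.0, 1.0, 0.0]    # orthogonal to a
    print(agreement(signature(a, planes), signature(near, planes)))
    print(agreement(signature(a, planes), signature(far, planes)))
```

For this family the probability that a single bit matches is 1 minus the angle between the vectors divided by pi, so the agreement fraction is a direct estimate of angular similarity. Grouping bits into buckets then lets similar items collide far more often than dissimilar ones.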