Living in the age of big data, we ask: what should we do when we have the good fortune to be presented with a huge amount of supervised training data? Most often at large scale we are presented with the unsupervised problems of characterization and information extraction, but some problem domains offer an almost limitless supply of supervised training data (such as using older data to build models that predict the near future). Having too much training data is a good problem to have, and there are ways to use traditional methods (like logistic regression) at this scale. We present an “out of core” logistic regression implementation and a quick example in Apache Hadoop running on Amazon Elastic MapReduce. This presentation assumes familiarity with Unix-style command lines, Java and Hadoop.
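To make the “out of core” idea concrete, here is a minimal Python sketch. It is not the Hadoop implementation the article presents; it only illustrates one reason logistic regression scales: the gradient of the log-likelihood is a sum over rows, so it can be accumulated chunk by chunk without ever holding the full data set in memory. The names `chunk_gradient`, `fit` and `read_chunks`, the plain gradient-ascent update, and the step size are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chunk_gradient(weights, chunk):
    """Gradient of the log-likelihood over one chunk of (features, label) rows.

    This plays the role of a "map" step: it only needs the rows in its chunk.
    """
    grad = [0.0] * len(weights)
    for features, label in chunk:
        p = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad[j] += (label - p) * x
    return grad

def fit(read_chunks, n_features, steps=200, rate=0.1):
    """Gradient ascent where each pass streams over the data one chunk at a time.

    read_chunks is a hypothetical callable returning a fresh iterator of chunks
    (for example, one chunk per file block); only one chunk is in memory at once.
    """
    weights = [0.0] * n_features
    for _ in range(steps):
        total = [0.0] * n_features
        for chunk in read_chunks():
            # "Reduce" step: per-chunk gradients simply add up.
            partial = chunk_gradient(weights, chunk)
            total = [t + g for t, g in zip(total, partial)]
        weights = [w + rate * t for w, t in zip(weights, total)]
    return weights

# Tiny illustrative data set: features are [1 (intercept), x], label is 0 or 1.
rows = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, -1.0], 1), ([1.0, 0.0], 0),
        ([1.0, 0.0], 1), ([1.0, 1.0], 1), ([1.0, 1.0], 0), ([1.0, 2.0], 1)]

def read_chunks(size=3):
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

weights = fit(read_chunks, n_features=2)
```

Because the per-chunk gradients are independent, the inner loop is the part that a framework such as Hadoop could distribute across machines; a production implementation would likely also use a better optimizer than fixed-step gradient ascent.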
One of the current best tools in the machine learning toolbox is the 1930s statistical technique called logistic regression. We explain how to add professional quality logistic regression to your analytic repertoire and describe a bit beyond that. Continue reading Learn Logistic Regression (and beyond)
Having a bit of history as both a user of machine learning and a researcher in the field, I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature. Continue reading A Personal Perspective on Machine Learning
StarCraft and StarCraft II are very popular real time strategy games. The core of these games is the mining of resources, and conversion of those resources into specialized military units. Idealized fighting and predator/prey relations have long been analyzed in terms of differential equations. We use the differential equation formalism (in particular Lanchester’s equations of 1916) to discuss expected game outcomes and how, in principle, one can derive a StarCraft strategy that complements search, simulation or more classic artificial intelligence techniques.
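As a rough illustration of the differential equation formalism, the sketch below integrates Lanchester's square law, dA/dt = -beta*B and dB/dt = -alpha*A, and compares the result with the closed-form prediction that follows from the conserved quantity alpha*A^2 - beta*B^2. Choosing the square law (aimed fire) is an assumption here, as are the function names and the kill-rate values; the article itself may use a different variant or parameterization.

```python
import math

def lanchester_square(a0, b0, alpha, beta, dt=1e-4):
    """Euler-integrate dA/dt = -beta*B, dB/dt = -alpha*A until one side reaches zero.

    alpha: rate at which each A unit destroys B units.
    beta:  rate at which each B unit destroys A units.
    Returns the surviving strengths (a, b), clamped at zero.
    """
    a, b = float(a0), float(b0)
    while a > 0.0 and b > 0.0:
        # Simultaneous update: both right-hand sides use the old values.
        a, b = a - beta * b * dt, b - alpha * a * dt
    return max(a, 0.0), max(b, 0.0)

def predicted_survivors(a0, b0, alpha, beta):
    """Closed-form survivors from the invariant alpha*A^2 - beta*B^2."""
    invariant = alpha * a0 ** 2 - beta * b0 ** 2
    if invariant >= 0.0:
        return math.sqrt(invariant / alpha), 0.0
    return 0.0, math.sqrt(-invariant / beta)

# Illustrative numbers only: 100 units against 80 units of equal quality.
simulated = lanchester_square(100, 80, alpha=0.1, beta=0.1)
predicted = predicted_survivors(100, 80, alpha=0.1, beta=0.1)
```

With equal unit quality, the larger army wins with sqrt(100^2 - 80^2) = 60 units left, not 20: under this model fighting strength grows with the square of army size, which is why concentrating forces matters.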
Ambitious analytics projects have a tangible risk of failure. Uncertainty breeds anxiety. There are known techniques to lower the uncertainty, guarantee failure and shift the blame onto others. We outline a few proven methods of analytics sabotage and their application. In honor of Stephen Potter we call this activity “statsmanship,” which we define as pursuing the goal of making your analytics group cry.
Continue reading Statsmanship: Failure Through Analytics Sabotage
Fast Portfolio re-Balancing as a Fractional Linear Program is an example of the kind of work we have done encoding client problems (in this case optimal portfolio selection) as optimization problems (so we can use purchased software to solve them). It's a bit mathy, but we are excited we got permission to share this. Continue reading Fast Portfolio re-Balancing as a Fractional Linear Program
We have been living in the age of “big data” for some time now. This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: “The Unreasonable Effectiveness of Data” Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)). But I have gotten to thinking about the period before this. The period before we had easy access to so much data, before most computation was aggregation and before we accepted numerical analysis style convergence as “efficient.” A small problem I needed to solve (as part of a bigger project) reminded me what theoretical computer scientists did then: we worried about provable worst case efficiency.
We extend the ideas from Automatic Differentiation with Scala to include reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients. Continue reading Gradients via Reverse Accumulation
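For readers who have not seen reverse accumulation, here is a minimal sketch in Python. It is not the Scala code from the article; the `Var` class and `backward` function are hypothetical names used only to show the idea. Each operation records its inputs and local partial derivatives, and one backward sweep from the output then yields the derivative with respect to every input at once.

```python
class Var:
    """A value in the computation graph plus links back to its inputs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (input Var, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__


def _lift(x):
    """Wrap plain numbers as constant Vars."""
    return x if isinstance(x, Var) else Var(float(x))


def backward(output):
    """Propagate d(output)/d(node) from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # appended only after all of its inputs

    visit(output)
    output.grad = 1.0
    # Reversed order guarantees a node is processed after all of its consumers.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


x, y = Var(2.0), Var(3.0)
f = x * y + x * x + 3 * y   # f = xy + x^2 + 3y
backward(f)
# x.grad now holds df/dx = y + 2x, and y.grad holds df/dy = x + 3.
```

The speed-up comes from the fact that a single backward pass costs about as much as evaluating the function, regardless of how many inputs there are, whereas forward accumulation needs one pass per input.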
This article is a worked-out exercise in applying the Scala type system to solve a small scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion. Continue reading Automatic Differentiation with Scala
Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX) and Windows (NT4, 2000, XP, Vista and 7) for quite a while, I have seen a lot of different software tools. I would like to quickly exhibit my “must have” list. These are the packages that I find to be the single “must have” offerings in a number of categories. I have avoided some categories (such as editors, email programs, programming languages, IDEs, photo editors, backup solutions, databases, database tools and web tools) where I have no feeling of having seen a single absolute best offering.
The spirit of the list is to pick items such that: if you disagree with an item in this list then either you are wrong or you know something I would really like to hear about.