A big congratulations to Win-Vector LLC‘s Dr. Nina Zumel for authoring and teaching portions of EMC‘s new Data Science and Big Data Analytics training and certification program. A big congratulations to EMC, EMC Education Services and Greenplum for creating a great training course. Finally a huge thank you to EMC, EMC Education Services and Greenplum for inviting Win-Vector LLC to contribute to this great project.
How is it even possible to set expectations and launch data science projects?
Data science projects vary from “executive dashboards” through “automate what my analysts are already doing well” to “here is some data, we would like some magic.” That is you may be called to produce visualizations, analytics, data mining, statistics, machine learning, method research or method invention. Given the wide range of wants, diverse data sources, required levels of innovation and methods it often feels like you can not even set goals for data science projects.
Many of these projects either fail or become open ended (become unmanageable).
As an alternative we describe some of our methods for setting quantifiable goals and front-loading risk in data science projects. Continue reading Setting expectations in data science projects
Much of the data that the analyst uses exhibits extraordinary range. For example: incomes, company sizes, popularity of books and any “winner takes all process”; (see: Living in A Lognormal World). Tukey recommended the logarithm as an important “stabilizing transform” (a transform that brings data into a more usable form prior to generating exploratory statistics, analysis or modeling). One benefit of such transforms is: data that is normal (or Gaussian) meets more of the stated expectations of common modeling methods like least squares linear regression. So data from distributions like the lognormal is well served by a
log() transformation (that transforms the data closer to Gaussian) prior to analysis. However, not all data is appropriate for a log-transform (such as data with zero or negative values). We discuss a simple transform that we call a signed pseudo logarithm that is particularly appropriate to signed wide-range data (such as profit and loss). Continue reading Modeling Trick: the Signed Pseudo Logarithm
A lot of people consider the static typing found in languages such as C, C++, ML, Java and Scala as needless hairshirtism. They consider the dynamic typing of languages like Lisp, Scheme, Perl, Ruby and Python as a critical advantage (ignoring other features of these languages and other efforts at generic programming such as the STL).
I strongly disagree. I find the pain of having to type or read through extra declarations is small (especially if you know how to copy-paste or use a modern IDE). And certainly much smaller than the pain of the dynamic language driven anti-patterns of: lurking bugs, harder debugging and more difficult maintenance. Debugging is one of the most expensive steps in software development- so you want incur less of it (even if it is at the expense of more typing). To be sure, there is significant cost associated with static typing (I confess: I had to read the book and post a question on Stack Overflow to design the type interfaces in Automatic Differentiation with Scala; but this is up-front design effort that has ongoing benefits, not hidden debugging debt).
There is, of course, no prior reason anybody should immediately care if I do or do not like dynamic typing. What I mean by saying this is I have some experience and observations about problems with dynamic typing that I feel can help others.
I will point out a couple of example bugs that just keep giving. Maybe you think you are too careful to ever make one of these mistakes, but somebody in your group surely will. And a type checking compiler finding a possible bug early is the cheapest way to deal with a bug (and static types themselves are only a stepping stone for even deeper static code analysis). Continue reading Why I don’t like Dynamic Typing
We describe ergodic theory in modern notation accessible to interested computer scientists.
The ergodic theorem (http://en.wikipedia.org/wiki/Ergodic theory (link)) is an important principle of recurrence and averaging in dynamical systems. However, there are some inconsistent uses of the term, much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is often implied) and often the conclusion of the theory is mis-described as its premises.
By “interested computer scientists” we mean people who know math and work with probabilistic systems1, but know not to accept mathematical definitions without some justification (actually a good attitude for mathematicians also). Continue reading Ergodic Theory for Interested Computer Scientists
To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them?
The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing methods (pseudorandom generators, entropy sources, Gibbs samplers, Metropolis–Hastings algorithm, Markov chain Monte Carlo methods, bootstrap methods and so on). Our thesis is: this diversity is supported by only a few fundamental methods. And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box “brand name” technique.
We will discuss the half dozen basic methods that all of these techniques are derived from. Continue reading Six Fundamental Methods to Generate a Random Variable
We describe briefly the powerful simulation technique known as “importance sampling.” Importance sampling is a technique that allows you to use numerical simulation to explore events that, at first look, appear too rare to be reliably approximated numerically. The correctness of importance sampling follows almost immediately from the definition of a change of density. Like most mathematical techniques, importance sampling brings in its own concerns and controls that were not obvious in the original problem. To deal with these concerns (like picking the re-weighting to use) we will largely appeal to the ideas from “A Tutorial on the Cross-Entropy Method” Pieter-Tjerk de Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein, Annals of Operations Research, 2005 vol. 134 (1) pp. 19-67. Continue reading Importance Sampling
It is a subtle point that statistical modeling is different than model based science. However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded). It is often claimed that model based science is being done when in fact statistics is what is being done (for instance some of the unfortunate distractions of flawed reports related to the important question of the magnitude of plausible anthropogenic global warming).
Both model based science and statistics are wonderful fields, but it is important to not receive the results of one when you have paid for the other.
We will pointedly discuss one of the differences. Continue reading Why you can not to use statistics to dispute magic
A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working “out of core” or what you should do when you run out of memory.
Early computers were most limited by their paltry memory sizes. von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single type written page (much more memory than the few hundred words available to the Eniac). The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.
SDC 920 computer, Computer History Museum, Mountain View CA
Historically computer scientists have concentrated on streaming or online algorithms (that is algorithms that work with the data in the order it is available and use limited memory). For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort). The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce. So in one sense databases and Map Reduce different APIs on top of very related technologies (journaling, splitting and merging). Replicating data (or even delaying duplicate elimination) that is already “too large to handle” may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude, compare a 2 terabyte drive to an 8 gigabyte memory stick). Continue reading What to do when you run out of memory