One thing I have observed in multiple software engineering and data science projects is inconvenient steps get skipped. This is negative and happens despite rules, best intentions and effort. Recently I have noticed a positive re-formulation of this in that project quality increases rapidly when you make it take more effort to do the wrong thing. Continue reading Make it more effort to do the wrong thing
Given the range of wants, diverse data sources, required innovation and methods it often feels like data science projects are immune to planning, scoping and tracking. Without a system to break a data science project into smaller observable components you greatly increase your risk of failure. As a followup to the statistical ideas we shared in setting expectations in data science projects we share a few project planning ideas from software engineering. Continue reading Data science project planning
A bit more on the ROC/AUC
The receiver operating characteristic curve (or ROC) is one of the standard methods to evaluate a scoring system. Nina Zumel has described its application, but I would like to call out some additional details. In my opinion while the ROC is a useful tool, the “area under the curve” (AUC) summary often read off it is not as intuitive and interpretable as one would hope or some writers assert.
Check out Nina Zumel’s latest article: On Balance.
XCOM: Enemy Unknown is a turn based video game where the player choses among actions (for example shooting an alien) that are labeled with a declared probability of success.
Image copyright Firaxis Games
A lot of gamers, after missing a 80% chance of success shot, start asking if the game’s pseudo random number generator is fair. Is the game really rolling the dice as stated, or is it cheating? Of course the matching question is: are player memories at all fair; would they remember the other 4 out of 5 times they made such a shot?
This article is intended as an introduction to the methods you would use to test such a question (be it in a video game, in science, or in a business application such as measuring advertisement conversion). There are already some interesting articles on collecting and analyzing XCOM data and finding and characterizing the actual pseudo random generator code in the game, and discussing the importance of repeatable pseudo-random results. But we want to add a discussion pointed a bit more at analysis technique in general. We emphasize methods that are efficient in their use of data. This is a statistical term meaning that a maximal amount of learning is gained from the data. In particular we do not recommend data binning as a first choice for analysis as it cuts down on sample size and thus is not the most efficient estimation technique.
I know “officially” data scientists all always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain.
I would like to make a plea to my fellow data scientists to stop using Excel-like formats for informal data exchange and become much stricter in producing and insisting on truly open machine readable files. Open files are those in an open format (not proprietary like Microsoft Excel) and machine readable in this case means readable by a very simple program (preferring simple escaping strategies to complicated quoting strategies). A lot of commonly preferred formats surprisingly do not meet these conditions: for example Microsoft Excel, XML and quoted CSV all fail the test. A few formats that do meet these conditions: SQL dumps, JSON and what I call “strong TSV.” I will illustrate some of the difficulty in using ad-hoc formats in R and suggest work-arounds. Continue reading Please stop using Excel-like formats to exchange data
From time to time we work on projects that would benefit from a free lightweight pure Java linear programming library. That is a library unencumbered by a bad license, available cheaply, without an infinite amount of file format and interop cruft and available in Java (without binary blobs and JNI linkages). There are a few such libraries, but none have repeatably, efficiently and reliably met our needs. So we have re-packaged an older one of our own for release under the Apache 2.0 license. This code will have its own rough edges (not having been used widely in production), but I still feel fills an important gap. This article is brief introduction to our WVLPSolver Java library. Continue reading Yet Another Java Linear Programming Library
von Neumann and Morgenstern’s “Theory of Games and Economic Behavior” is the famous basis for game theory. One of the central accomplishments is the rigorous proof that comparative “preference methods” over fairly complicated “event spaces” are no more expressive than numeric (real number valued) utilities. That is: for a very wide class of event spaces and comparison functions “>” there is a utility function u() such that:
a > b (“>” representing the arbitrary comparison or preference for the event space) if and only if u(a) > u(b) (this time “>” representing the standard order on the reals).
However, an active reading of sections 1 through 3 and even the 2nd edition’s axiomatic appendix shows that the concept of “events” (what preferences and utilities are defined over) is deliberately left undefined. There is math and objects and spaces, but not all of them are explicitly defined in term of known structures (are they points in R^n, sets, multi-sets, sums over sets or what?). The word “event” is used early in the book and not in the index. Axiomatic treatments often rely on intentionally leaving ground-concepts undefined, but we are going to work a concrete example through von Neumann and Morgenstern to try and illustrate a bit more of the required intuition and deep nature of their formal notions of events and utility. I also will illustrate how, at least in discussion, von Neuman and Morgenstern may have held on to a naive “single outcome” intuition of events and a naive “direct dollars” intuition of utility despite erecting a theory carefully designed to support much more structure. This is possible because they never have to calculate in the general event space: they prove access to the preference allows them to construct the utility funciton u() and then work over the real numbers. Sections 1 through 3 are designed to eliminate the need for a theory of preference or utility and allow von Neuman and Morgenstern to work with real numbers (while achieving full generality). They never need to make the translations explicit, because soon after showing the translations are possible they assume they have already been applied. Continue reading Working an example of von Neumann and Morgenstern utility
We have added a worked example to the README of our experimental logistic regression code.
The Logistic codebase is designed to support experimentation on variations of logistic regression including:
- A pure Java implementation (thus directly usable in Java server environments).
- A simple multinomial implementation (that allows more than two possible result categories).
- The ability to work with too large for memory data-sets and directly from files or database tables.
- A demonstration of the steps needed to use standard Newton-Raphson in Hadoop.
- Ability to work with arbitrarily large categorical inputs.
- Provide explicit L2 model regularization.
- Implement safe optimization methods (like conjugate gradient, line-search and majorization) for situations where the standard Iteratively-re-Weighted-Least-Squares/Newton-Raphson fails.
- Provide an overall framework to quickly try implementation experiments (as opposed to novel usage experiments).
What we mean by this code being “experimental” is that it has capabilities that many standard implementations do not. In fact most of the items in the above list are not usually made available to the logistic regression user. But our project is also stand-alone and not as well integrated into existing workflows as standard production systems. Before trying our code you may want to try R or Mahout. Continue reading Added worked example to logistic regression project
Check out: I Write, Therefore I Think