<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Map Reduce</title>
	<atom:link href="http://www.win-vector.com/blog/tag/map-reduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>What to do when you run out of memory</title>
		<link>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-to-do-when-you-run-out-of-memory</link>
		<comments>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 12:25:59 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Additive Combinatorics]]></category>
		<category><![CDATA[GNU sort]]></category>
		<category><![CDATA[H2]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Map Reduce]]></category>
		<category><![CDATA[Out of core]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1892</guid>
		<description><![CDATA[A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working &#8220;out of core&#8221; or what you should do when you run out of memory. Early computers were most limited by their paltry memory sizes. von Neumann himself [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory.  We present here some general strategies for working &#8220;out of core&#8221; or what you should do when you run out of memory.</p>
<p>Early computers were most limited by their paltry memory sizes.  von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single typewritten page (much more memory than the few hundred words available to the <a target="_blank" href="http://en.wikipedia.org/wiki/ENIAC">ENIAC</a>).   The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.</p>
<p><center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/IMG_0062.jpg" alt="IMG 0062" title="IMG_0062.JPG" border="0" height="300" /></p>
<p/>
SDC 920 computer, Computer History Museum, Mountain View CA<br />
</center></p>
<p>Historically computer scientists have concentrated on streaming or online algorithms (that is algorithms that work with the data in the order it is available and use limited memory).  For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort).  The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce.  So in one sense databases and Map Reduce are different APIs on top of very related technologies (journaling, splitting and merging).  Replicating data (or even delaying duplicate elimination) that is already &#8220;too large to handle&#8221; may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude, compare a 2 terabyte drive to an 8 gigabyte memory stick).<span id="more-1892"></span>In our web age, the typical big data problems are inverting indices (for fast search lookup) and computing term frequencies (for <a target="_blank" href="http://en.wikipedia.org/wiki/Okapi_BM25">TF/IDF scoring</a> or for things like <a target="_blank" href="http://en.wikipedia.org/wiki/Naive_bayes">Naive Bayes classifiers</a>).  Since these are over-worked examples we will use a mathematical problem from <a href="http://terrytao.wordpress.com/books/additive-combinatorics/">&#8220;Additive Combinatorics&#8221;, Terence Tao, Van Vu, (ISBN-13: 9780521853866; ISBN-10: 0521853869)</a>.</p>
<p>We take one problem from the field of additive combinatorics: sum sets.   For two sets of integers A = {a_1, &#8230;, a_s} and B = {b_1, &#8230;, b_t} the sum set is defined as the set (without repetition) A + B = { a_i + b_j | i = 1,&#8230;,s, j = 1,&#8230;,t }.   For sets of integers the size of A+B (denoted as |A+B|) can vary from |A| + |B| &#8211; 1 to |A| * |B| depending on the relations between the numbers in A and B (or the structure of A and B).  If instead of working with integers we work with integers <a target="_blank" href="http://en.wikipedia.org/wiki/Modular_arithmetic">modulo p</a> where p is a prime number (or equivalently we treat all numbers as remainders of division by p) then by the Cauchy-Davenport inequality we have |A + B| &ge; min(|A|+|B|-1,p) (so essentially the same result, except when we run out of possible integers modulo p).</p>
<p>For example we would say (working modulo 19) that [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18].   In fact there are 19 pairs of sets that add up to  [0, 1, 10, 11, 12, 14, 15, 16, 18] ( for instance [5, 6, 9, 10] + [5, 6, 9, 10] is another such pair).  Just to move forward assume we were interested in determining how many ways a set can be written as the sum of a pair of sets (each of size 4).  For a given sum result we might try search or <a target="_blank" href="http://en.wikipedia.org/wiki/Integer_programming">integer programming</a> to find all possible summands.  However, if we want the statistics on all sums simultaneously, we can work much quicker and without need for big gun mathematics.</p>
<p>The straightforward solution in this case is a bit of code like:</p>
<p><code></p>
<pre>
for set A from all possible sets of 4 integers from 0 to 18
    for set B from all possible sets of 4 integers from 0 to 18
        let set C = A + B modulo 19
        use set C as a key and add the pair (A,B) to the list associated with C
for all key sets C tracked above
     compute the size of the list of summand pairs found for C
print how many result sets C have a given number of summand pairs
</pre>
<p></code></p>
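<p>As a rough illustration, the pseudocode above could be sketched in Java as follows.  This is not the example program from the repository: the class name <code>SumSetSketch</code> and its helper methods are hypothetical, and the counts are kept in an ordinary in-memory map (exactly the bookkeeping that stops fitting in memory at the modulo 19, size 4 scale), so it is only meant for small cases.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of the forward bookkeeping; names are illustrative only.
public class SumSetSketch {
    // Sum set A + B modulo p, sorted and without repetition.
    public static List<Integer> sumSet(List<Integer> a, List<Integer> b, int p) {
        TreeSet<Integer> c = new TreeSet<>();
        for (int x : a) {
            for (int y : b) {
                c.add((x + y) % p);
            }
        }
        return new ArrayList<>(c);
    }

    // All k-element subsets of {0, ..., p-1}, each listed in increasing order.
    public static List<List<Integer>> subsets(int p, int k) {
        List<List<Integer>> out = new ArrayList<>();
        build(0, p, k, new ArrayList<>(), out);
        return out;
    }

    private static void build(int start, int p, int k,
                              List<Integer> cur, List<List<Integer>> out) {
        if (cur.size() == k) {
            out.add(new ArrayList<>(cur));
            return;
        }
        for (int v = start; v < p; v++) {
            cur.add(v);
            build(v + 1, p, k, cur, out);
            cur.remove(cur.size() - 1); // remove by index: undo the last choice
        }
    }

    // For each result set C, count the ordered pairs (A,B) with A + B = C.
    public static Map<List<Integer>, Integer> countSummandPairs(int p, int k) {
        Map<List<Integer>, Integer> counts = new HashMap<>();
        List<List<Integer>> sets = subsets(p, k);
        for (List<Integer> a : sets) {
            for (List<Integer> b : sets) {
                counts.merge(sumSet(a, b, p), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(
            sumSet(Arrays.asList(0, 1, 4, 5), Arrays.asList(10, 11, 14, 15), 19));
        Map<List<Integer>, Integer> counts = countSummandPairs(7, 2);
        System.out.println(counts.size() + " distinct result sets modulo 7");
    }
}
```

<p>The first line printed by <code>main</code> is <code>[0, 1, 10, 11, 12, 14, 15, 16, 18]</code>, matching the modulo 19 example above.</p>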
<p>The relations between each result set C and its summand pairs (A,B) can be collected by any bit of Java code implementing a small interface: just call <code>insertReln(C,(A,B))</code> to store the relations and then <code>entries()</code> to get them back.  The interface that declares the needed methods is given below:</p>
<p><code></p>
<pre>
public interface RelnCollector&lt;A,B&gt; {
	void insertReln(A a, B b) throws IOException;
	Iterable&lt;Map.Entry&lt;A,Iterable&lt;B&gt;&gt;&gt; entries() throws IOException, InterruptedException;
	void close() throws IOException;
}
</pre>
<p></code></p>
<p>An in-memory relation collector is trivially implemented by a nested map adjusted to declare the above interface, as we see in the next code snippet:</p>
<pre>
public final class InMemoryRelnCollector&lt;A,B&gt;
	implements RelnCollector&lt;A,B&gt; {
	private final DataAdapter&lt;A&gt; adapterA;
	private final DataAdapter&lt;B&gt; adapterB;
	private Map&lt;A,Iterable&lt;B&gt;&gt; atoBs;

	public InMemoryRelnCollector(final DataAdapter&lt;A&gt; adapterA,
		final DataAdapter&lt;B&gt; adapterB) {
		this.adapterA = adapterA;
		this.adapterB = adapterB;
		atoBs = new TreeMap&lt;A,Iterable&lt;B&gt;&gt;(this.adapterA);
	}

	@Override
	public void insertReln(final A a, final B b) {
		Set&lt;B&gt; set = (Set&lt;B&gt;) atoBs.get(a);
		if(null==set) {
			set = new TreeSet&lt;B&gt;(adapterB);
			atoBs.put(a,set);
		}
		if(!set.contains(b)) {
			set.add(b);
		}
	}

	@Override
	public Iterable&lt;Map.Entry&lt;A,Iterable&lt;B&gt;&gt;&gt; entries() {
		return atoBs.entrySet();
	}

	@Override
	public void close() {
		atoBs = null;
	}
}
</pre>
<p>The great savings in time is that we work from summands to result sums (but keep many lists of summand pairs indexed by result set).  Thus we don&#8217;t have to figure out how to invert the sum operation (as we do our bookkeeping forward).  However, this very bookkeeping may overwhelm us.  As we can see below, a Java implementation of the above procedure runs out of memory when trying to characterize which sets of integers modulo 19 can be split into two sets of size four (and how many ways each such set can be split).  However, this was with the deliberately small default allocation of memory available to Java processes (so for this particular instance we could avoid trouble by allocating more memory; we ran out of allocation, not system memory).  What happens when we don&#8217;t manage memory is illustrated below:</p>
<pre>
Start	com.winvector.consolidate.impl.InMemoryRelnCollector
	Tue Dec 06 10:04:38 PST 2011
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.TreeMap.put(TreeMap.java:554)
	at java.util.TreeSet.add(TreeSet.java:238)
	at com.winvector.consolidate.example.AdditiveSets.sum(AdditiveSets.java:25)
	at com.winvector.consolidate.example.AdditiveSets.main(AdditiveSets.java:55)
</pre>
<p>An out of core solution can solve the entire problem without needing any additional system memory (just some disk space which is still of a much greater size than primary memory).  The complete calculated result is given below:</p>
<pre>
Examining sums of 4 integers chosen from 0 through 18 modulo 19.
Start	com.winvector.consolidate.impl.FileRelnCollector
	Tue Dec 06 09:54:20 PST 2011
	Inserted 15023376 relations.
 [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 1, 15, 16] + [0, 14, 15, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 3, 4, 18] + [11, 12, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 14, 15, 18] + [0, 1, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [1, 2, 5, 6] + [9, 10, 13, 14] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [1, 2, 16, 17] + [13, 14, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [2, 3, 6, 7] + [8, 9, 12, 13] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [2, 3, 17, 18] + [12, 13, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [3, 4, 7, 8] + [7, 8, 11, 12] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [4, 5, 8, 9] + [6, 7, 10, 11] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [5, 6, 9, 10] + [5, 6, 9, 10] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [6, 7, 10, 11] + [4, 5, 8, 9] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [7, 8, 11, 12] + [3, 4, 7, 8] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [8, 9, 12, 13] + [2, 3, 6, 7] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [9, 10, 13, 14] + [1, 2, 5, 6] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [10, 11, 14, 15] + [0, 1, 4, 5] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [11, 12, 15, 16] + [0, 3, 4, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [12, 13, 16, 17] + [2, 3, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [13, 14, 17, 18] + [1, 2, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
	Examined 128820 sums and 15023376 summands.
	found 3705 sums with 19 distinct summands
	found 39900 sums with 38 distinct summands
	found 26847 sums with 76 distinct summands
	found 22230 sums with 114 distinct summands
	found 10602 sums with 152 distinct summands
	found 8892 sums with 190 distinct summands
	found 2736 sums with 228 distinct summands
	found 5016 sums with 266 distinct summands
	found 2736 sums with 304 distinct summands
	found 1710 sums with 342 distinct summands
	found 171 sums with 361 distinct summands
	found 1710 sums with 380 distinct summands
	found 855 sums with 418 distinct summands
	found 342 sums with 456 distinct summands
	found 342 sums with 532 distinct summands
	found 342 sums with 570 distinct summands
	found 171 sums with 722 distinct summands
	found 171 sums with 760 distinct summands
	found 171 sums with 912 distinct summands
	found 171 sums with 1026 distinct summands
Done:	com.winvector.consolidate.impl.FileRelnCollector
   elapsed time: 618473MS
   Tue Dec 06 10:04:38 PST 2011
</pre>
<p>We performed the calculation by using a different implementation of <code>RelnCollector</code> called <code>FileRelnCollector</code>.  What this implementation does is write relations to a file as they are made available.  That is, the <code>FileRelnCollector</code> implementation of <code>insertReln</code> is literally a <code>println()</code>.  Something not much more complicated than the following:</p>
<p><code></p>
<pre>
	@Override
	public void insertReln(final A a, final B b) {
		System.out.println("" + a + "\t" + b);
	}
</pre>
<p></code></p>
<p>The heavy lifting is done when <code>entries()</code> is called.  When the entries are wanted the <code>FileRelnCollector</code> calls <a target="_blank" href="http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html">GNU sort</a> on the saved file to get all the results ordered by result sum (instead of by summand).  GNU sort can sort files larger than memory by a split and merge strategy involving temporary files.  We provide such  <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/FileRelnCollector.java">a file plus GNU sort based implementation of RelnCollector</a>.  </p>
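<p>Once the file is sorted by key, recovering the relations only requires walking it once and cutting it into runs of equal keys.  The following is a minimal sketch of that grouping step, not the linked <code>FileRelnCollector</code> code: the class name <code>SortedRunGrouper</code> is hypothetical, and we assume each line has the form key, tab, value and that the lines have already been sorted externally.  Only one run of values is held in memory at a time.</p>

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical sketch: group consecutive "key<TAB>value" lines that share a key.
public class SortedRunGrouper {
    // Assumes the lines are already sorted by key (for example by an external sort).
    public static void forEachGroup(Iterator<String> sortedLines,
                                    BiConsumer<String, List<String>> sink) {
        String currentKey = null;
        List<String> values = new ArrayList<>();
        while (sortedLines.hasNext()) {
            String line = sortedLines.next();
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip lines without a key/value separator
            }
            String key = line.substring(0, tab);
            String value = line.substring(tab + 1);
            if (currentKey != null && !key.equals(currentKey)) {
                sink.accept(currentKey, values); // the previous run is complete
                values = new ArrayList<>();
            }
            currentKey = key;
            values.add(value);
        }
        if (currentKey != null) {
            sink.accept(currentKey, values); // flush the final run
        }
    }

    public static void main(String[] args) {
        String sorted = "c1\ta1\nc1\ta2\nc2\ta3\n";
        Iterator<String> lines =
            new BufferedReader(new StringReader(sorted)).lines().iterator();
        forEachGroup(lines,
            (key, vals) -> System.out.println(key + " has " + vals.size() + " value(s)"));
    }
}
```

<p>Passing a callback rather than returning a map keeps the memory needed proportional to the largest single group instead of to the whole file.</p>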
<p>Note that this runtime can be deceptively low.  If running on a machine with a modern operating system and enough memory the file being used as "external storage" actually gets cached into memory (and gets near memory speed performance).  To get a reliable timing you need to test a problem of the size you are interested in on the size machine you are going to deploy on (not on a larger machine).</p>
<p>For better or worse this method should seem familiar as a lot of science has been done using the Unix text tools (sort, join and a few more).  This is also the basis of Map Reduce and we demonstrate a <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/MapReduceRelnCollector.java">Hadoop implementation of RelnCollector</a> as well.  Or we can link up with the other technology designed for beyond memory size data manipulation and get <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/DBRelnCollector.java">a database based implementation of RelnCollector</a>.  </p>
<p>In all cases the implementations we call depend on journaling (in the sense of keeping a sequential log of operations to be done instead of immediately performing the operations), scattering (splitting into multiple temp files and structures) and merging (combining data from multiple ordered files).  We could write our own code to perform all of these operations (obviating any need for GNU sort, Hadoop or a database), but it is much less code to do as we have here and write an adapter to use existing implementations.</p>
<p>The sum-set example is deliberately artificial.  More common examples are, as we mentioned, index inversion and term frequency calculation.  All of our example code is available here: <a target="_blank" href="https://github.com/WinVector/OutOfCore">https://github.com/WinVector/OutOfCore</a> including JUnit tests and an <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/example/AdditiveSets.java">example program</a>.  The code depends on libraries for <a target="_blank" href="http://www.junit.org/">JUnit 4.10</a>, <a target="_blank" href="http://www.h2database.com/html/main.html">h2 database</a> and <a target="_blank" href="http://hadoop.apache.org/mapreduce/releases.html">Hadoop 0.21.0</a> for the various implementations.</p>
<p>The main trick is basing your code on a very thin storage abstraction (like the <code>RelnCollector</code> interface, instead of explicitly known data structures) and then using this abstraction to hide all of the details away from the rest of your code (keeping complexity at a manageable level).  The two things to avoid are either infecting your code with too much knowledge of your storage plans (i.e. pushing implementation details into your important code to "speed things up") or being forced to re-design your entire project to fit within some framework (like re-writing all of your code as a database stored procedure or an explicit Hadoop map/reduce pair as this over-commits you to one technology).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Large Data Logistic Regression (with example Hadoop code)</title>
		<link>http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=large-data-logistic-regression-with-example-hadoop-code</link>
		<comments>http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/#comments</comments>
		<pubDate>Mon, 27 Dec 2010 00:00:59 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Amazon Elastic MapReduce]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[EC2]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Map Reduce]]></category>
		<category><![CDATA[S3]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1607</guid>
		<description><![CDATA[Living in the age of big data we ask what to do when we have the good fortune to be presented with a huge amount of supervised training data? Most often at large scale we are presented with the un-supervised problems of characterization and information extraction; but some problem domains offer an almost limitless supply [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Living in the <a target="_blank" href="http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/">age of big data</a> we ask what to do when we have the good fortune to be presented with a huge amount of supervised training data?  Most often at large scale we are presented with the un-supervised problems of <a target="_blank" href="http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/">characterization and information extraction</a>; but some problem domains offer an almost limitless supply of supervised training data (such as using older data to build models that predict the near future).  Having too much training data is a good problem to have and there are ways to use traditional methods (like logistic regression) at this scale.  We present an &#8220;out of core&#8221; logistic regression implementation and a quick example in <a target="_blank" href="http://hadoop.apache.org/">Apache Hadoop</a> running on <a target="_blank" href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic MapReduce</a>.  This presentation assumes familiarity with Unix style command lines, Java and Hadoop.<span id="more-1607"></span>Apache Hadoop already has a machine learning infrastructure named <a target="_blank" href="http://mahout.apache.org/">Mahout</a>.   While Mahout seems to concentrate more on unsupervised methods (like clustering, nearest neighbor and recommender systems) it does already include a <a target="_blank" href="https://cwiki.apache.org/MAHOUT/logistic-regression.html">logistic regression package</a>.   This package uses a learning method called &#8220;Stochastic Gradient Descent&#8221;, which is in a sense the perceptron update algorithm updated for the new millennium.  
This method is fast in most cases but differs from the traditional methods of solving a logistic regression, which are based on Fisher Scoring or the Newton/Raphson Method (see &#8220;Categorical Data Analysis,&#8221; Alan Agresti, 1990 and  <a target="_blank" href="http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&#038;language=en">Paul Komarek&#8217;s thesis &#8220;Logistic Regression for Data Mining and High-Dimensional Classification&#8221;</a>).  Fisher Scoring remains interesting in that it parallelizes in exactly the manner described in &#8220;Map-Reduce for Machine Learning on Multicore,&#8221; Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, Yuan Yuan Yu, Gary Bradski, Andrew Y Ng, Kunle Olukotun, NIPS 2006.</p>
<p>Stochastic gradient descent is in fact an appropriate method for big data.  For example: if our model complexity is held constant and our data set size is allowed to grow; then stochastic gradient descent will achieve its convergence condition before it even completes a single random order traversal of the data.  However, stochastic gradient descent has a control called the learning rate and one can easily imagine a series of problems that require the learning rate to be set arbitrarily slow.  For example a data set formed as the union of very many &#8220;typical&#8221; examples where a given variable is independent of the outcome and a small minority of &#8220;special&#8221; examples where the same variable helps influence the outcome presents a problem.  Training on the &#8220;typical&#8221; examples causes the stochastic gradient descent method to perform a random walk on the given variable coefficient.  So the learning rate must be slow enough that the expected drift does not swamp out the rare contributions from the &#8220;special&#8221; examples (meaning the learning rate must slow roughly proportionally to the square root of the ratio of the typical to special examples).</p>
<p>Not too much must be made of artificial problems designed to slow stochastic gradient descent.  The traditional Fisher scoring (or the Newton/Raphson method) can simply be killed by specifying a problem with a great number of levels for categorical variables.  In this case traditional methods have to solve a linear system that can in fact be much larger than the entire data set (causing representation, work and numeric stability problems).  So it takes little imagination to design problems that kill the traditional methods.  Other intermediate complexity methods (like conjugate gradient) avoid the storage size problem; but can require many more passes through the training data.</p>
<p>There is a common situation where Fisher scoring makes good sense: you are trying to fit a relatively simple model to an enormous amount of data (often to predict a rare event).  One could sub-sample the training data to shrink the scale of the problem- but this is a case of the analyst being forced to accede to poor tools.  What one would naturally want is a training method that can fit reasonable sized models (that is models with a reasonable number of variables and levels) onto enormous data sets.  The software package <a target="_blank" href="http://cran.r-project.org/">R</a> can work with fairly large data sets (in the gigabytes range) and has some parallel flavors, but R is mostly an in-memory system.  It is appropriate to want a direct method that &#8220;works out of core&#8221; (i.e. in the terabytes and petabytes ranges), parallelizes to hundreds of machines (using current typical infrastructure- like a Hadoop cluster) and is exact (without additional parameters like learning rate).  </p>
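<p>To make the out of core property concrete, here is a minimal sketch of one Fisher scoring (Newton) step for logistic regression.  It is not the code shipped in the jar: the class name <code>FisherScoringSketch</code> and the row layout (features followed by a 0/1 outcome in the last position) are assumptions made for illustration, and regularization is omitted.  The point is that a pass over the data only accumulates a gradient vector and a square information matrix, so memory use depends on the number of variables and not on the number of rows.</p>

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of one Fisher scoring pass; not the Win-Vector implementation.
public class FisherScoringSketch {
    // Each row holds d feature values followed by the outcome (0 or 1).
    public static double[] step(Iterable<double[]> rows, double[] beta) {
        int d = beta.length;
        double[] grad = new double[d];     // sum of (y - p) * x
        double[][] info = new double[d][d]; // sum of p * (1 - p) * x * x^T
        for (double[] row : rows) {
            double z = 0.0;
            for (int j = 0; j < d; j++) {
                z += beta[j] * row[j];
            }
            double p = 1.0 / (1.0 + Math.exp(-z));
            double y = row[d];
            double w = p * (1.0 - p);
            for (int j = 0; j < d; j++) {
                grad[j] += (y - p) * row[j];
                for (int k = 0; k < d; k++) {
                    info[j][k] += w * row[j] * row[k];
                }
            }
        }
        double[] delta = solve(info, grad);
        double[] next = new double[d];
        for (int j = 0; j < d; j++) {
            next[j] = beta[j] + delta[j];
        }
        return next;
    }

    // Solve a * x = b by Gaussian elimination with partial pivoting.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        double[][] m = new double[n][n + 1];
        for (int i = 0; i < n; i++) {
            System.arraycopy(a[i], 0, m[i], 0, n);
            m[i][n] = b[i];
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = m[col];
            m[col] = m[pivot];
            m[pivot] = tmp;
            for (int r = col + 1; r < n; r++) {
                double f = m[r][col] / m[col][col];
                for (int c = col; c <= n; c++) {
                    m[r][c] -= f * m[col][c];
                }
            }
        }
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double s = m[i][n];
            for (int c = i + 1; c < n; c++) {
                s -= m[i][c] * x[c];
            }
            x[i] = s / m[i][i];
        }
        return x;
    }

    public static void main(String[] args) {
        // Intercept-only model: three TRUE outcomes and one FALSE outcome.
        List<double[]> rows = Arrays.asList(
            new double[] {1, 1}, new double[] {1, 1},
            new double[] {1, 1}, new double[] {1, 0});
        double[] beta = {0.0};
        for (int pass = 0; pass < 10; pass++) {
            beta = step(rows, beta); // each pass re-reads the data once
        }
        System.out.println(Arrays.toString(beta));
    }
}
```

<p>In this toy case the fitted intercept approaches log(3), the log-odds of three TRUE outcomes to one FALSE.  In a real out of core setting the <code>Iterable</code> would re-read rows from disk on each pass instead of holding them in a list.</p>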
<p>We demonstrate here an example implementation in Java for both single machine &#8220;out of core&#8221; training (allowing filesystem sized datasets) and MapReduce style parallelism (allowing even larger scale).  The method also includes the problem regularization steps discussed in our recent <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">logistic regression article</a>.  The code (packaged in: <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/WinVectorLogistic.Hadoop0.20.2.jar" title="WinVectorLogistic.Hadoop0.20.2.jar">WinVectorLogistic.Hadoop0.20.2.jar</a> ) is being distributed under the GNU Affero General Public License version 3.  This is an open source license that (roughly) requires (among other things) redistribution of source code of systems linked against the licensed project to anyone receiving a compiled version or using the system as a network service.  The license also promises no warranty or implied fitness.  The distribution is a standalone runnable Jar (source code and license inside the jar) and is the minimal object required to run on Hadoop (which is itself a Java project).    More advanced versions of the library (with better linear algebra libraries, better problem slice control, unit tests, JDBC bindings and with different license arrangements) can be arranged from the code owners: <a target="_blank" href="http://www.win-vector.com/">Win-Vector LLC</a>.  This jar was built for Apache Hadoop version 0.20.2 (the latest version Amazon Elastic Map Reduce runs at this time) and we use as many of the newer interfaces as possible (so the code will run against the current Hadoop 0.21.0 if re-built against Hadoop 0.21.0, the jar can not switch versions without being re-built due to how Hadoop calls methods).</p>
<p>For our example we will work on a small data set.  The code is designed to pass through data directly from disk, storing only the Fisher structures- which require storage proportional to the square of the number of variables and levels but independent of the number of data rows.   The data format is what we call &#8220;naive TSV&#8221; or &#8220;naive tab separated values.&#8221;  This is a file where each line has exactly the same number of values (separated by tabs) and the first line of the file is the header line naming each column.  This is compatible with Microsoft Excel and R with the proviso that this file format does not allow any sort of escapes, quoting or multiple line fields.  Our data set is taken from the <a target="_blank" href="http://archive.ics.uci.edu/ml/">UCI machine learning database</a> ( <a  target="_blank" href="http://archive.ics.uci.edu/ml/machine-learning-databases/car/">data</a>, <a target="_blank" href="http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names">description</a> )  and converted into the naive TSV format (split into training and testing subsets: <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/uciCarTrain.tsv" title="uciCarTrain.tsv">uciCarTrain.tsv</a>, <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/uciCarTest.tsv" title="uciCarTest.tsv">uciCarTest.tsv</a>).</p>
<p>The first few lines of the training file are given here:</p>
<pre>
buying	maintenance	doors	persons	lug_boot	safety	rating
vhigh	vhigh	2	2	small	med	FALSE
vhigh	vhigh	2	2	med	low	FALSE
vhigh	vhigh	2	2	med	med	FALSE
</pre>
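<p>Because the naive TSV format has no quoting or escapes, a reader for it can be very small.  The sketch below is an illustration only (the class name <code>NaiveTsvSketch</code> is hypothetical and this is not the reader inside the jar): it treats the first line as the header, splits every following line on tabs, rejects rows with the wrong number of fields, and hands rows to a callback one at a time so the whole file never has to be in memory.</p>

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of a streaming reader for the naive TSV format.
public class NaiveTsvSketch {
    public static void forEachRow(Iterator<String> lines,
                                  Consumer<Map<String, String>> sink) {
        if (!lines.hasNext()) {
            return; // empty input: no header, no rows
        }
        // The limit of -1 keeps trailing empty fields instead of dropping them.
        String[] header = lines.next().split("\t", -1);
        while (lines.hasNext()) {
            String[] fields = lines.next().split("\t", -1);
            if (fields.length != header.length) {
                throw new IllegalArgumentException(
                    "expected " + header.length + " fields, saw " + fields.length);
            }
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                row.put(header[i], fields[i]);
            }
            sink.accept(row);
        }
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList(
            "buying\tmaintenance\tdoors\tpersons\tlug_boot\tsafety\trating",
            "vhigh\tvhigh\t2\t2\tsmall\tmed\tFALSE",
            "vhigh\tvhigh\t2\t2\tmed\tlow\tFALSE").iterator();
        forEachRow(lines,
            row -> System.out.println(row.get("safety") + " -> " + row.get("rating")));
    }
}
```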
<p>The first experiment is to use the Java program standalone (without Hadoop) to train a model.  The method used is Fisher scoring by multiple passes over the data file.  Only the Fisher structures are stored in memory- so in principle the data set could be arbitrarily large.  To run the logistic training program download the files WinVectorLogistic.Hadoop0.20.2.jar and uciCarTrain.tsv .  You will also need some libraries ( commons-logging-*.jar and commons-logging-api-*.jar , and sometimes  hadoop-*-core.jar and log4j-*.jar ) from the appropriate <a target="_blank" href="http://hadoop.apache.org/">Hadoop distribution</a>.  Before running the code you can examine the source (and re-build the project using an IDE like <a target="_blank" href="http://www.eclipse.org/">Eclipse</a>) by extracting the code in an empty directory using the Java jar command:</p>
<pre>
jar xvf WinVectorLogistic.Hadoop0.20.2.jar
</pre>
<p>To run the code type at the command line (all in a single line, we have inserted line breaks for clarity, we are also assuming you are using a Unix style shell on Linux, OSX or Cygwin on Windows):</p>
<pre>
java -cp WinVectorLogistic.Hadoop0.20.2.jar:commons-logging-1.0.4.jar:commons-logging-api-1.0.4.jar
   com.winvector.logistic.demo.LogisticTrain
   file:uciCarTrain.tsv "rating ~ buying + maintenance + doors + persons + lug_boot + safety" model.ser
</pre>
<p>The portion of interest is the last three arguments:</p>
<ul>
<li>file:uciCarTrain.tsv :  The URI pointing to the file containing the training data.</li>
<li> &#8220;rating ~ buying + maintenance + doors + persons + lug_boot + safety&#8221; : The formula specifying that rating will be predicted as a function of  buying, maintenance, doors, persons, lug_boot  and safety.</li>
<li>model.ser :  Where to write the Java Serialized model result.</li>
</ul>
<p>After that we can run the scoring procedure on the held-out test data:</p>
<pre>
java -cp WinVectorLogistic.Hadoop0.20.2.jar:commons-logging-1.0.4.jar:commons-logging-api-1.0.4.jar
   com.winvector.logistic.demo.LogisticScore
   model.ser file:uciCarTest.tsv scored.tsv
</pre>
<p>In this case the last three arguments are:</p>
<ul>
<li>model.ser :  Where to read the Java Serialized model from.</li>
<li>file:uciCarTest.tsv : The URI pointing to the file to make predictions for.</li>
<li>scored.tsv : Where to write the predictions to.</li>
</ul>
<p>The first few lines of the result file are:</p>
<pre>
predict.rating.FALSE	predict.rating.TRUE	buying	maintenance	doors	persons	lug_boot	safety	rating
0.9999999999999392	6.091299561082107E-14	vhigh	vhigh	2	2	small	low	FALSE
0.9999999824028766	1.759712345446162E-8	vhigh	vhigh	2	2	small	high	FALSE
</pre>
<p>These lines are just lines from the file uciCarTest.tsv (same format as uciCarTrain.tsv) copied over with the addition of the first two columns that show the modeled probabilities of rating acceptable being FALSE or TRUE.  The accuracy of the prediction is computed and written into the run log if the data had the rating outcomes in it (otherwise we just get a file of predictions, which is the usual application of machine learning).</p>
<p>The details of running the Hadoop versions of the same process depend on the configuration of your Hadoop environment.  Just unpacking the 0.20.2 version of Hadoop will let you try the single-machine version of the MapReduce Logistic Regression process (which will be much slower than the standalone Java version).  To run the training step the Hadoop command line is as follows (notice this time we do not have to specify the logging jars as they are part of the Hadoop environment):</p>
<pre>
hadoop-0.20.2/bin/hadoop jar WinVectorLogistic.Hadoop0.20.2.jar
   logistictrain
   uciCarTrain.tsv "rating ~ buying + maintenance + doors + persons + lug_boot + safety" model.ser
</pre>
<p>And the scoring procedure is below:</p>
<pre>
hadoop-0.20.2/bin/hadoop jar WinVectorLogistic.Hadoop0.20.2.jar
   logisticscore
   model.ser uciCarTest.tsv scoredDir
</pre>
<p>The only operational differences are that the results are written into the file scoredDir/part-r-00000 (as is Hadoop convention) instead of scored.tsv (and an extra &#8220;offset&#8221; column is also included) and data is handled in Files (to allow Hadoop Paths to be formed) instead of URIs.   The Hadoop training and test steps are able to run in this manner because we have constructed WinVectorLogistic.Hadoop0.20.2.jar as an executable jar file with the class com.winvector.logistic.demo.DemoDriver as the class to execute.  This class uses the standard org.apache.hadoop.util.ProgramDriver pattern to run our jobs under the org.apache.hadoop.util.Tool interface.  This means that the standard Hadoop generic flags for specifying cluster configuration will be respected.</p>
<p>The big benefit of all of this packaging is: if this command is run on a large Hadoop cluster (instead of on a single machine) then the input file could be split up and processed in parallel on many machines.   The easiest way to do this is to use Amazon.com&#8217;s <a target="_blank" href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic MapReduce</a>.  This service (used in conjunction with S3 storage and EC2 virtual machines) allows the immediate remote provisioning and execution on a version 0.20.* Hadoop cluster.  To demonstrate this service we created a new S3 Bucket named wvlogistic.  Into wvlogistic we copied our jar of our code compiled against Hadoop 0.20.2 APIs ( WinVectorLogistic.Hadoop0.20.2.jar ) and a moderate sized synthetic training data set ( bigProb.tsv,  created by running: java -cp WinVectorLogistic.Hadoop0.20.2.jar com.winvector.logistic.demo.BigExample bigProb.tsv ).  Once this has been set up (and you have signed up for the Amazon Elastic MapReduce credentials) you can run the training procedure from the <a target="_blank" href="https://console.aws.amazon.com/elasticmapreduce/home">Amazon web UI</a>.  In five steps (following the directions found in <a href="http://aws.amazon.com/articles/3938">Tutorial: How to Create and Debug an Amazon Elastic MapReduce Job Flow</a> ) the job can be configured and launched.</p>
<p>First: press &#8220;Create New Job Flow&#8221; and choose a job name, check &#8220;Run your own application&#8221; and select &#8220;Custom Jar&#8221;.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep1.png" alt="MRExStep1.png" border="0" width="700" /></p>
<p>Step 1/5<br />
</center></p>
<p>Second: specify the location of the jar in your Bucket and give the command line arguments (prepending S3 paths with &#8220;s3n://&#8221;).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep2.png" alt="MRExStep2.png" border="0" width="700"  /></p>
<p>Step 2/5<br />
</center></p>
<p>Third: select the type and number of machine instances you want, run without an EC2 key pair, enable logging and send the log back to your S3 bucket.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep3.png" alt="MRExStep3.png" border="0" width="700"  /></p>
<p>Step 3/5<br />
</center></p>
<p>Fourth: add the default bootstrap action of configuring the Hadoop cluster.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep4.png" alt="MRExStep4.png" border="0" width="700"  /></p>
<p>Step 4/5<br />
</center></p>
<p>Fifth: confirm and launch the job.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep5.png" alt="MRExStep5.png" border="0" width="700"  /></p>
<p>Step 5/5<br />
</center></p>
<p>When the job completes, transfer the result ( bigModel.ser )  back to your local system and you have your new MapReduce-produced logistic model.    We can confirm and use the model locally with a Java command similar to our earlier examples:</p>
<pre>
java -cp WinVectorLogistic.Hadoop0.20.2.jar:commons-logging-1.0.4.jar:commons-logging-api-1.0.4.jar:hadoop-0.20.2-core.jar:log4j-1.2.15.jar
   com.winvector.logistic.demo.LogisticScore
   bigModel.ser bigProb.tsv bigScored.tsv
</pre>
<p>Be aware that at this tens of megabytes scale  there is no advantage in running on a Hadoop cluster (versus using the stand-alone program).  At moderate scale parallelism may not even be attempted (due to block size) and the costs of data motion can overcome the benefit of parallel scans.   The biggest gain is being able to train many models from many gigabytes of data on a single machine without sub-sampling.  While we have the ability to build a logistic model at &#8220;web scale&#8221; (terabytes or petabytes of data) you would not want to use the MapReduce calling pattern until you had a web-scale amount of training data.</p>
<p>The point of this exercise was to take a solid implementation of  <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">regularized logistic regression</a> and use the decomposition into the &#8220;Statistical Query Model&#8221;  (as suggested in the NIPS paper &#8220;Map-Reduce for Machine Learning on Multicore&#8221;) to quickly get an intermediate-sophistication machine learning method (more sophisticated than Naive Bayes, less sophisticated than Kernelized Support Vector Machines) working at large (beyond RAM) scale.  Briefly: most of the technique is in an interface that considers the mis-fit, gradient of mis-fit and Hessian of mis-fit as a linear (summable) function over the data.  Or in the &#8220;book&#8217;s worth of preparation so we can write the result in one line&#8221; paradigm: all of the machinery we have been discussing is support so that the following summable interface (part of the source code we are distributing) can be used to do all of the work:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/LinearContribution.png" alt="LinearContribution.png" border="0" width="956" height="286" /></p>
<p>Summable Interface<br />
</center></p>
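<p>To make the &#8220;summable&#8221; idea concrete, here is a minimal sketch written for this note.  It is not the interface from the jar: the names SummableSketch, Contribution, observe and add are hypothetical, and we assume the mis-fit is the negative log-likelihood of a two-class logistic model.  Each data row adds its own term to the mis-fit, the gradient and the Hessian, so partial sums computed on separate shards of the data can simply be added together.</p>
<pre>
public class SummableSketch {
    // Hypothetical accumulator (not the jar's interface): every field is a plain sum over rows.
    static final class Contribution {
        final int dim;
        double misfit;               // assumed here: negative log-likelihood
        final double[] gradient;     // gradient of the mis-fit
        final double[][] hessian;    // Hessian of the mis-fit

        Contribution(int dim) {
            this.dim = dim;
            this.gradient = new double[dim];
            this.hessian = new double[dim][dim];
        }

        // Add one row's contribution, evaluated at the current coefficients beta.
        void observe(double[] x, boolean y, double[] beta) {
            double dot = 0.0;
            for (int i = 0; i != dim; i++) {
                dot += beta[i] * x[i];
            }
            double p = 1.0 / (1.0 + Math.exp(-dot));   // modeled probability of TRUE
            misfit += y ? -Math.log(p) : -Math.log(1.0 - p);
            double residual = p - (y ? 1.0 : 0.0);
            for (int i = 0; i != dim; i++) {
                gradient[i] += residual * x[i];
                for (int j = 0; j != dim; j++) {
                    hessian[i][j] += p * (1.0 - p) * x[i] * x[j];
                }
            }
        }

        // Merge another partial sum into this one (the "reduce" step).
        void add(Contribution other) {
            misfit += other.misfit;
            for (int i = 0; i != dim; i++) {
                gradient[i] += other.gradient[i];
                for (int j = 0; j != dim; j++) {
                    hessian[i][j] += other.hessian[i][j];
                }
            }
        }
    }

    public static void main(String[] args) {
        // Made-up toy data: a constant column plus one variable.
        double[][] xs = {{1.0, 0.5}, {1.0, -1.0}, {1.0, 2.0}, {1.0, 0.0}};
        boolean[] ys = {true, false, true, false};
        double[] beta = {0.1, -0.2};
        Contribution whole = new Contribution(2);
        Contribution left = new Contribution(2);
        Contribution right = new Contribution(2);
        for (int r = 0; r != xs.length; r++) {
            whole.observe(xs[r], ys[r], beta);
            (r % 2 == 0 ? left : right).observe(xs[r], ys[r], beta);
        }
        left.add(right);
        // The two totals agree up to floating point rounding.
        System.out.println(Math.abs(whole.misfit - left.misfit));
    }
}
</pre>
<p>With the totals in hand, one pass of Fisher scoring (which for logistic regression is the same as a Newton step) moves beta to beta minus the inverse Hessian times the gradient.  Only the small summed structures ever need to sit in memory or travel between machines.</p>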
<p>Of course once you have the framework up that makes one non-trivial task easy you have likely made many other non-trivial tasks easy.</p>
<p>We hope this demonstration and examining the source code in our WinVectorLogistic.Hadoop0.20.2.jar will help you find ways to tackle your large data machine learning problems.</p>
<hr/>
<p>Code License:</p>
<blockquote><p>
Packages com.winvector.*, extra.winvector.*<br />
	     Code for performing logistic regression on Hadoop.<br />
	     Copyright (C) Win Vector LLC 2010 (contact: John Mount jmount@win-vector.com).<br />
	     Distributed under GNU Affero General Public License version 3 (2007, see http://www.gnu.org/licenses/agpl.html ).<br />
	       This program is free software: you can redistribute it and/or modify<br />
	       it under the terms of the GNU Affero General Public License as<br />
	       published by the Free Software Foundation, only version 3 of the<br />
	       License.<br />
	       This program is distributed in the hope that it will be useful,<br />
	       but WITHOUT ANY WARRANTY; without even the implied warranty of<br />
	       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the<br />
	       GNU Affero General Public License for more details.<br />
	       You should have received a copy of the GNU Affero General Public License<br />
	       along with this program.  If not, see &lt;http://www.gnu.org/licenses/&gt;.<br />
	    (Source code in jar, see also http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/ )
</p></blockquote>
<hr/>
Note Dec-15-2011:  We have moved the code distribution to <a target="_blank" href="https://github.com/WinVector/SQL-Screwdriver">github.com/WinVector/SQL-Screwdriver</a> .  We have fixed some major bugs in the supplied optimizers and moved com.winvector.logistic.LogisticScore and com.winvector.logistic.LogisticTrain from freeform arguments to Apache CLI.  The new command lines need flags as shown below:</p>
<pre>
usage: com.winvector.logistic.LogisticTrain
 -formula &lt;arg&gt;      formula to fit
 -inmemory           if set data is held in memory during training
 -resultSer &lt;arg&gt;    (optional) file to write seriazlized results to
 -resultTSV &lt;arg&gt;    (optional) file to write TSV results to
 -trainClass &lt;arg&gt;   (optional) alternate class to use for training
 -trainHDL &lt;arg&gt;     XML file to get JDBC connection to training data
                     table
 -trainTBL &lt;arg&gt;     table to use from database for training data
 -trainURI &lt;arg&gt;     URI to get training TSV data from
</pre>
<pre>
usage: com.winvector.logistic.LogisticScore
 -dataHDL &lt;arg&gt;      XML file to get JDBC connection to scoring data table
 -dataTBL &lt;arg&gt;      table to use from database for scoring data
 -dataURI &lt;arg&gt;      URI to get scoring data from
 -modelFile &lt;arg&gt;    file to read serialized model from
 -resultFile &lt;arg&gt;   file to write results to
</pre>
<hr/>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Map Reduce: A Good Idea</title>
		<link>http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=map-reduce-a-good-idea</link>
		<comments>http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/#comments</comments>
		<pubDate>Sun, 25 Jan 2009 20:32:20 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[External Sorting]]></category>
		<category><![CDATA[Map Reduce]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=30</guid>
		<description><![CDATA[Some time ago I subscribed to The Database Column because it would be fun to see what these incredible people wanted to discuss. We owe much of our current database technology to Professor Stonebraker and Vertica sounds like an incredible product. And I definitely want to continue to subscribe. However, the reading experience is marred [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/' rel='bookmark' title='&#8220;The Mythical Man Month&#8221; is still a good read'>&#8220;The Mythical Man Month&#8221; is still a good read</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/brevity-is-a-virtue/' rel='bookmark' title='Brevity is a Virtue'>Brevity is a Virtue</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Some time ago I subscribed to The <a href="http://www.databasecolumn.com/">Database Column</a>  because it would be fun to see what these incredible people wanted to discuss.  We owe much of our current database technology to Professor Stonebraker and Vertica sounds like an incredible product.  And I definitely want to continue to subscribe.</p>
<p>However, the reading experience is marred by some flaw in their RSS system that keeps marking the article <a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html">&#8220;MapReduce: A major step backwards&#8221;</a> as a new article.  This causes the article to appear in my RSS reader every few weeks as &#8220;new.&#8221;  This wouldn&#8217;t bother me too much except that the article runs so counter to experience that it is itself offensive.<br />
<span id="more-30"></span><br />
I have no desire to defend Google (the home of MapReduce)- they don&#8217;t need it and are clearly laughing all the way to the bank.  However the points used to kick at MapReduce are so broad and so devalue practitioner experience that they are insulting.  I find the individual arguments offensive and wish to stand against them.  I am not that concerned about the conclusion, use MapReduce or don&#8217;t.  For some things MapReduce is a good tool and for some things it is not.</p>
<p>Let&#8217;s limit ourselves to the 5 primary complaints from the article.  The article (verbatim) says MapReduce is:</p>
<blockquote><p>
1. A giant step backward in the programming paradigm for large-scale data intensive applications.</p>
<p>2. A sub-optimal implementation, in that it uses brute force instead of indexing.</p>
<p>3. Not novel at all &#8212; it represents a specific implementation of well known techniques developed nearly 25 years ago.</p>
<p>4. Missing most of the features that are routinely included in current DBMS.</p>
<p>5. Incompatible with all of the tools DBMS users have come to depend on.
</p></blockquote>
<p>Now let us comment:</p>
<p>1. <strong>&#8220;A giant step backward in the programming paradigm for large-scale data intensive applications.&#8221;</strong>  </p>
<p>Actually, no.  </p>
<blockquote><p>
MapReduce represents a continuity in a stream of ideas that made UNIX great: composable transient tools.  Not everything is a database or data warehouse.  A lot of the grungy UNIX tools (like sort, sed, awk, join) have often been combined to do large scale (at the time) research because they all worked &#8220;out of core&#8221; fairly well.  This makes for a horrible baling-wire set-up.  However, it often handles problems of a size much larger than would have been possible on the hardware at the time.</p>
<p>In addition the author trots out the  &#8220;it&#8217;s Codasyl all over again&#8221; argument.  This argument refers to the ongoing pain and expense derived  from binding algorithmic details too close to the data representation.  In earlier writing it was a fantastic point that warned that the up and coming object oriented databases were going to be the same nasty pointer chasing nightmares that hierarchical databases had been.  I can see why an author might feel that just saying &#8220;it&#8217;s Codasyl&#8221; could win any argument.
</p></blockquote>
<p>2. <strong> &#8220;A sub-optimal implementation, in that it uses brute force instead of indexing.&#8221; </strong> </p>
<p>MapReduce does not use brute force.</p>
<blockquote><p>
MapReduce uses the idea (one that goes back to merge sort) that parallel traversals (that is: running through two lists in the same order synchronously) are a very powerful technique that can, among other things, produce indices.  MapReduce is so efficient that it has been shown to be competitive with the best large scale sorting algorithms on their home-turf: sorting.</p>
<p>MapReduce looks brutish because it drops a lot of popular design features.  One such feature is trying to speed things up through local caching and combining.  However, on the data on which MapReduce is commonly used (free form written text) it is a provable property of the data that local caching is an ineffective complication (due to the heavy-tailedness of the data).  So many of the graceful features missing from MapReduce are actually no help on the types of data it is used on.  There is a certain grace in leaving only the features that are actually helping.
</p></blockquote>
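<p>As a small illustration of the parallel traversal idea (a sketch written for this note, not code from any MapReduce implementation; the class and method names are hypothetical), here two key-sorted lists are walked in step.  At each step only the cursor holding the smaller key advances, so every element is visited once and no index or random access is needed.  We assume both inputs are already sorted in the same order.</p>
<pre>
import java.util.Arrays;

public class MergeJoinSketch {
    // Walk two key-sorted arrays in step and return the keys present in both.
    static String[] commonKeys(String[] left, String[] right) {
        String[] out = new String[Math.min(left.length, right.length)];
        int n = 0;
        int i = 0;
        int j = 0;
        while (i != left.length) {
            if (j == right.length) {
                break;                       // right side exhausted
            }
            int order = Integer.signum(left[i].compareTo(right[j]));
            if (order == 0) {                // keys match: emit and advance both cursors
                out[n++] = left[i];
                i++;
                j++;
            } else if (order == -1) {        // left key is smaller: advance left only
                i++;
            } else {                         // right key is smaller: advance right only
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        String[] left = {"apple", "cat", "dog", "zebra"};
        String[] right = {"bat", "cat", "zebra"};
        System.out.println(Arrays.toString(commonKeys(left, right)));   // prints [cat, zebra]
    }
}
</pre>
<p>The same single synchronized pass works when the two lists are files far larger than memory, which is why sorting first and merging second is such a durable out of core pattern.</p>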
<p>3. <strong>&#8220;Not novel at all &#8212; it represents a specific implementation of well known techniques developed nearly 25 years ago.&#8221;</strong></p>
<p> A nasty attack.</p>
<blockquote><p>
MapReduce is a good explanation of some merging techniques that have been common knowledge for quite a while.  This is a legitimate expository goal: explaining something everybody already &#8220;knows&#8221; better.  In fact this is very hard to do and considered a legitimate accomplishment in many fields (for example Rota stated it was a legitimate goal in mathematics).  I myself looked at some of my own older code for dealing with very large data sets after reading the MapReduce paper.  I saw that the paper was describing what I was already doing (splitting the data into streams for later re-joining) and explaining it so well that it was now a method and no longer a hack.  When a paper successfully teaches you about something you already &#8220;know&#8221; it is a good work.</p>
<p>The attack is also inaccurate: the ideas are not 25 years old, they are closer to 120 years old.<br />
We could easily trace the lineage of MapReduce back to Hollerith style sorting machines that pre-date general purpose  computers (i.e. going back to before 1889) .  MapReduce refers back to a time when all computation was performed by what we now call external sorting and tabulation.  These 19th century technologies may seem archaic but they were developed in a world similar to ours: a world where the amount of data is in excess of your conveniently reconfigurable computational resources.
</p></blockquote>
<p>4. <strong> &#8220;Missing most of the features that are routinely included in current DBMS.&#8221; </strong></p>
<p> Unfortunate.</p>
<blockquote><p>
I miss a lot of those features.</p>
<p>However, because MapReduce is such a lean technique I have seen engineers implement their own MapReduce systems in a day (to solve a problem they are working on).  That is they are successfully sorting, joining, indexing and summarizing hundreds of gigabytes of data on a consumer PC within a couple of days of being asked to.  This is from scratch after reading the MapReduce paper.</p>
<p>The &#8220;make versus buy&#8221; decision should not always come out &#8220;make.&#8221;  But it is not wise to artificially bloat up requirements so that the decision can only be &#8220;buy.&#8221;
</p></blockquote>
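<p>To give a feel for how lean the core technique is, here is a toy, single-machine sketch written for this note (hypothetical names, everything held in memory, so it is only an illustration of the shape of the computation and not an out of core system).  The map step emits one key per word, the sort plays the role of the shuffle, and the reduce step counts each run of equal keys.</p>
<pre>
import java.util.Arrays;

public class TinyMapReduceSketch {
    // Map: one key per word.  Shuffle: sort the keys.  Reduce: count each run of equal keys.
    static String[] wordCounts(String[] lines) {
        StringBuilder joined = new StringBuilder();
        for (String line : lines) {
            joined.append(line.toLowerCase()).append(' ');
        }
        String text = joined.toString().trim();
        if (text.isEmpty()) {
            return new String[0];
        }
        String[] keys = text.split("\\s+");    // map
        Arrays.sort(keys);                     // shuffle
        String[] out = new String[keys.length];
        int n = 0;
        int runStart = 0;
        for (int i = 1; i != keys.length + 1; i++) {
            // A run ends at the end of the data or when the key changes.
            if (i == keys.length || !keys[i].equals(keys[runStart])) {
                out[n++] = keys[runStart] + "\t" + (i - runStart);   // reduce
                runStart = i;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        for (String row : wordCounts(new String[] {"the cat", "The dog the end"})) {
            System.out.println(row);
        }
    }
}
</pre>
<p>A system meant for hundreds of gigabytes would replace the in-memory sort with an external sort over files, but the map, sort and reduce steps keep the same shape.</p>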
<p>5. <strong>&#8220;Incompatible with all of the tools DBMS users have come to depend on.&#8221; </strong></p>
<p> Good.</p>
<blockquote><p>
Frankly for a lot of analytic practitioners many DBMS systems and tools have become expensive obstacles in the way of getting results.  Yes, we  enjoy humiliating an interview candidate that does not know all of the Codd normal forms (or can&#8217;t remember which of the alphabet soups of OLTP or OLAP is the &#8220;good one&#8221; ) as much as the next person.  But to many of us a lot of these tools and procedures are more obstacles than solutions.</p>
<p>This may sound nasty, but if it were not the case why would companies like Vertica be producing radical new database tools?  The fact is existing DBMS tools were designed for a different type and scale of data than we regularly see on the web (and column oriented database designers seem to share this view).  The situation is so bad that &#8220;roach motel&#8221; is a common analyst&#8217;s slang for &#8220;data warehouse&#8221; (derived from: &#8220;data checks in but it never checks out&#8221;).
</p></blockquote>
<p>This isn&#8217;t meant to be a hagiography of MapReduce, but given that MapReduce has paid the bills I feel it deserves a small show of respect along the lines of &#8220;dance with the one who brung you.&#8221;</p>
<p>MapReduce is not a panacea.  One of the tasks I have hated most in my career was maintaining a seven step MapReduce based system.  I would love to have avoided all the detail fiddling that set-up required.  However, the system paid our bills by performing a calculation that was beyond the scale of simpler methods and it would have been unaffordable to buy a solution.  </p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/' rel='bookmark' title='&#8220;The Mythical Man Month&#8221; is still a good read'>&#8220;The Mythical Man Month&#8221; is still a good read</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/brevity-is-a-virtue/' rel='bookmark' title='Brevity is a Virtue'>Brevity is a Virtue</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

