<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Programming</title>
	<atom:link href="http://www.win-vector.com/blog/category/programming/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>What to do when you run out of memory</title>
		<link>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-to-do-when-you-run-out-of-memory</link>
		<comments>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 12:25:59 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Additive Combinatorics]]></category>
		<category><![CDATA[GNU sort]]></category>
		<category><![CDATA[H2]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Map Reduce]]></category>
		<category><![CDATA[Out of core]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1892</guid>
		<description><![CDATA[A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working &#8220;out of core&#8221; or what you should do when you run out of memory. Early computers were most limited by their paltry memory sizes. von Neumann himself [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory.  We present here some general strategies for working &#8220;out of core&#8221; or what you should do when you run out of memory.</p>
<p>Early computers were most limited by their paltry memory sizes.  von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single typewritten page (much more memory than the few hundred words available to the <a target="_blank" href="http://en.wikipedia.org/wiki/ENIAC">ENIAC</a>).   The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.</p>
<p><center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/IMG_0062.jpg" alt="IMG 0062" title="IMG_0062.JPG" border="0" height="300" /></p>
<p/>
SDS 920 computer, Computer History Museum, Mountain View CA<br />
</center></p>
<p>Historically computer scientists have concentrated on streaming or online algorithms (that is, algorithms that work with the data in the order it becomes available and use limited memory).  For many problems we have found this an insufficient model, and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort).  The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce.  So in one sense databases and Map Reduce are different APIs on top of closely related technologies (journaling, splitting and merging).  Replicating data (or even delaying duplicate elimination) that is already &#8220;too large to handle&#8221; may seem counterintuitive, but it exploits the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude; compare a 2 terabyte drive to an 8 gigabyte memory stick).<span id="more-1892"></span>In our web age, the typical big data problems are inverting indices (for fast search lookup) and computing term frequencies (for <a target="_blank" href="http://en.wikipedia.org/wiki/Okapi_BM25">TF/IDF scoring</a> or for things like <a target="_blank" href="http://en.wikipedia.org/wiki/Naive_bayes">Naive Bayes classifiers</a>).  Since these are over-worked examples we will use a mathematical problem from <a href="http://terrytao.wordpress.com/books/additive-combinatorics/">&#8220;Additive Combinatorics&#8221;, Terence Tao, Van Vu, (ISBN-13: 9780521853866; ISBN-10: 0521853869)</a>.</p>
<p>We take one problem from the field of additive combinatorics: sum sets.   For two sets of integers A = {a_1, &#8230;, a_s} and B = {b_1, &#8230;, b_t} the sum set is defined as the set (without repetition) A + B = { a_i + b_j | i = 1,&#8230;,s, j = 1,&#8230;,t }.   For sets of integers the size of A+B (denoted as |A+B|) can vary from |A| + |B| &#8211; 1 to |A| * |B| depending on the relations between the numbers in A and B (or the structure of A and B).  If instead of working with integers we work with integers <a target="_blank" href="http://en.wikipedia.org/wiki/Modular_arithmetic">modulo p</a> where p is a prime number (or equivalently we treat all numbers as remainders of division by p) then by the Cauchy-Davenport inequality we have |A + B| &ge; min(|A|+|B|-1,p) (so essentially the same lower bound, except when we run out of possible integers modulo p).</p>
<p>For example we would say (working modulo 19) that [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18].   In fact there are 19 pairs of sets that add up to  [0, 1, 10, 11, 12, 14, 15, 16, 18] ( for instance [5, 6, 9, 10] + [5, 6, 9, 10] is another such pair).  Just to move forward assume we were interested in determining how many ways a set can be written as the sum of a pair of sets (each of size 4).  For a given sum result we might try search or <a target="_blank" href="http://en.wikipedia.org/wiki/Integer_programming">integer programming</a> to find all possible summands.  However, if we want the statistics on all sums simultaneously, we can work much quicker and without need for big gun mathematics.</p>
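<p>As a quick check of the example above, here is a minimal sketch of the sum set operation modulo p.  The class and method names (<code>SumSetDemo</code>, <code>sumSet</code>) are made up for this illustration and are not part of the example project; the sketch assumes all inputs are non-negative.</p>

```java
import java.util.TreeSet;

final class SumSetDemo {
    /** Returns A + B modulo p as a sorted set (repeated sums collapse). */
    static TreeSet<Integer> sumSet(final int[] a, final int[] b, final int p) {
        final TreeSet<Integer> c = new TreeSet<Integer>();
        for (final int ai : a) {
            for (final int bj : b) {
                c.add((ai + bj) % p); // assumes ai, bj >= 0
            }
        }
        return c;
    }

    public static void main(final String[] args) {
        final int p = 19;
        final int[] a = {0, 1, 4, 5};
        final int[] b = {10, 11, 14, 15};
        final TreeSet<Integer> c = sumSet(a, b, p);
        // prints [0, 1, 10, 11, 12, 14, 15, 16, 18]
        System.out.println(c);
        // Cauchy-Davenport lower bound: min(|A| + |B| - 1, p)
        final int bound = Math.min(a.length + b.length - 1, p);
        // prints 9 >= 7
        System.out.println(c.size() + " >= " + bound);
    }
}
```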
<p>The straightforward solution in this case is a bit of code like:</p>
<p><code></p>
<pre>
for set A from all possible sets of 4 integers from 0 to 18
    for set B from all possible sets of 4 integers from 0 to 18
        let set C = A + B modulo 19
        use set C as a key and add the pair (A,B) to the list associated with C
for all key sets C tracked above
     compute the size of the list of summand pairs found for C
print how many result sets C have a given number of summand pairs
</pre>
<p></code></p>
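<p>To make the pseudocode concrete, here is a small runnable sketch of the same bookkeeping for a toy instance (sets of size 2 modulo 7).  It is an illustration under simplifying assumptions, not the project's code: sets are encoded as bitmasks, only a count is kept per result set C instead of the full list of pairs (A,B), and the names <code>SmallSumSetCensus</code>, <code>sumMask</code> and <code>census</code> are hypothetical.  The brute-force loops over all bitmasks are only sensible for small p.</p>

```java
import java.util.Map;
import java.util.TreeMap;

final class SmallSumSetCensus {
    /** Sum set modulo p of two subsets of {0..p-1}, each encoded as a bitmask. */
    static int sumMask(final int a, final int b, final int p) {
        int c = 0;
        for (int i = 0; i < p; ++i) {
            if ((a & (1 << i)) == 0) continue;
            for (int j = 0; j < p; ++j) {
                if ((b & (1 << j)) != 0) {
                    c |= 1 << ((i + j) % p);
                }
            }
        }
        return c;
    }

    /** Maps each result set C to the number of ordered pairs (A,B), |A|=|B|=k, with A+B=C. */
    static Map<Integer, Integer> census(final int p, final int k) {
        final Map<Integer, Integer> counts = new TreeMap<Integer, Integer>();
        for (int a = 0; a < (1 << p); ++a) {
            if (Integer.bitCount(a) != k) continue; // keep only sets of size k
            for (int b = 0; b < (1 << p); ++b) {
                if (Integer.bitCount(b) != k) continue;
                final int c = sumMask(a, b, p);
                final Integer old = counts.get(c);
                counts.put(c, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(final String[] args) {
        final Map<Integer, Integer> counts = census(7, 2);
        int total = 0;
        for (final int n : counts.values()) {
            total += n;
        }
        System.out.println(counts.size() + " distinct sums from " + total + " summand pairs");
    }
}
```

<p>For p = 7 and k = 2 there are 21 * 21 = 441 ordered pairs in total.  Every per-sum count is a multiple of p, because shifting A by t and B by -t leaves A + B unchanged.</p>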
<p>The relations (each result set C together with a summand pair (A,B) that produces it) can be collected by any bit of Java code implementing the small interface below (just call <code>insertReln(C,(A,B))</code> to store the relations and then <code>entries()</code> to get them back):</p>
<p><code></p>
<pre>
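import java.io.IOException;
import java.util.Map;

/** Collects (a,b) relations and hands them back grouped by a. */
public interface RelnCollector&lt;A,B&gt; {
	void insertReln(A a, B b) throws IOException;
	Iterable&lt;Map.Entry&lt;A,Iterable&lt;B&gt;&gt;&gt; entries() throws IOException, InterruptedException;
	void close() throws IOException;
}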
</pre>
<p></code></p>
<p>An in-memory relation collector is trivially implemented by a nested map (a map from each key to the set of values seen for it) adapted to implement the above interface, as we see in the next code snippet:</p>
<pre>
public final class InMemoryRelnCollector&lt;A,B&gt;
	implements RelnCollector&lt;A,B&gt; {
	private final DataAdapter&lt;A&gt; adapterA;
	private final DataAdapter&lt;B&gt; adapterB;
	private Map&lt;A,Iterable&lt;B&gt;&gt; atoBs;

	public InMemoryRelnCollector(final DataAdapter&lt;A&gt; adapterA,
		final DataAdapter&lt;B&gt; adapterB) {
		this.adapterA = adapterA;
		this.adapterB = adapterB;
		atoBs = new TreeMap&lt;A,Iterable&lt;B&gt;&gt;(this.adapterA);
	}

	@Override
	public void insertReln(final A a, final B b) {
		Set&lt;B&gt; set = (Set&lt;B&gt;) atoBs.get(a);
		if(null==set) {
			set = new TreeSet&lt;B&gt;(adapterB);
			atoBs.put(a,set);
		}
		set.add(b); // a Set silently ignores duplicates
	}

	@Override
	public Iterable&lt;Map.Entry&lt;A,Iterable&lt;B&gt;&gt;&gt; entries() {
		return atoBs.entrySet();
	}

	@Override
	public void close() {
		atoBs = null;
	}
}
</pre>
<p>The great savings in time comes from working forward from summands to result sums (keeping the lists of summand pairs indexed by result set).  Thus we don&#8217;t have to figure out how to invert the sum operation (as we do our bookkeeping forward).  However, this very bookkeeping may overwhelm us.  As we can see below, a Java implementation of the above procedure runs out of memory when trying to characterize which sets of integers modulo 19 can be written as the sum of two sets of size four (and in how many ways each such set can be written).  However, this was with the deliberately small default allocation of memory available to Java processes (so for this particular instance we could avoid trouble by allocating more memory; we ran out of heap allocation, not system memory).  What happens when we don&#8217;t manage memory is illustrated below:</p>
<pre>
Start	com.winvector.consolidate.impl.InMemoryRelnCollector
	Tue Dec 06 10:04:38 PST 2011
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.TreeMap.put(TreeMap.java:554)
	at java.util.TreeSet.add(TreeSet.java:238)
	at com.winvector.consolidate.example.AdditiveSets.sum(AdditiveSets.java:25)
	at com.winvector.consolidate.example.AdditiveSets.main(AdditiveSets.java:55)
</pre>
<p>An out of core solution can solve the entire problem without needing any additional system memory (just some disk space which is still of a much greater size than primary memory).  The complete calculated result is given below:</p>
<pre>
Examining sums of 4 integers chosen from 0 through 18 modulo 19.
Start	com.winvector.consolidate.impl.FileRelnCollector
	Tue Dec 06 09:54:20 PST 2011
	Inserted 15023376 relations.
 [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 1, 15, 16] + [0, 14, 15, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 3, 4, 18] + [11, 12, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 14, 15, 18] + [0, 1, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [1, 2, 5, 6] + [9, 10, 13, 14] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [1, 2, 16, 17] + [13, 14, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [2, 3, 6, 7] + [8, 9, 12, 13] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [2, 3, 17, 18] + [12, 13, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [3, 4, 7, 8] + [7, 8, 11, 12] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [4, 5, 8, 9] + [6, 7, 10, 11] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [5, 6, 9, 10] + [5, 6, 9, 10] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [6, 7, 10, 11] + [4, 5, 8, 9] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [7, 8, 11, 12] + [3, 4, 7, 8] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [8, 9, 12, 13] + [2, 3, 6, 7] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [9, 10, 13, 14] + [1, 2, 5, 6] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [10, 11, 14, 15] + [0, 1, 4, 5] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [11, 12, 15, 16] + [0, 3, 4, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [12, 13, 16, 17] + [2, 3, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [13, 14, 17, 18] + [1, 2, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
	Examined 128820 sums and 15023376 summands.
	found 3705 sums with 19 distinct summands
	found 39900 sums with 38 distinct summands
	found 26847 sums with 76 distinct summands
	found 22230 sums with 114 distinct summands
	found 10602 sums with 152 distinct summands
	found 8892 sums with 190 distinct summands
	found 2736 sums with 228 distinct summands
	found 5016 sums with 266 distinct summands
	found 2736 sums with 304 distinct summands
	found 1710 sums with 342 distinct summands
	found 171 sums with 361 distinct summands
	found 1710 sums with 380 distinct summands
	found 855 sums with 418 distinct summands
	found 342 sums with 456 distinct summands
	found 342 sums with 532 distinct summands
	found 342 sums with 570 distinct summands
	found 171 sums with 722 distinct summands
	found 171 sums with 760 distinct summands
	found 171 sums with 912 distinct summands
	found 171 sums with 1026 distinct summands
Done:	com.winvector.consolidate.impl.FileRelnCollector
   elapsed time: 618473MS
   Tue Dec 06 10:04:38 PST 2011
</pre>
<p>We performed the calculation by using a different implementation of <code>RelnCollector</code> called <code>FileRelnCollector</code>.  What this implementation does is write relations to a file as they are made available.  That is, <code>FileRelnCollector</code>&#8217;s implementation of <code>insertReln</code> is literally a <code>println()</code>.  Something not much more complicated than the following:</p>
<p><code></p>
<pre>
	@Override
	public void insertReln(final A a, final B b) {
		System.out.println("" + a + "\t" + b);
	}
</pre>
<p></code></p>
<p>The heavy lifting is done when <code>entries()</code> is called.  When the entries are wanted the <code>FileRelnCollector</code> calls <a target="_blank" href="http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html">GNU sort</a> on the saved file to get all the results ordered by result sum (instead of by summand).  GNU sort can sort files larger than memory by a split and merge strategy involving temporary files.  We provide such  <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/FileRelnCollector.java">a file plus GNU sort based implementation of RelnCollector</a>.  </p>
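<p>Once the file is sorted by result sum, all the relations for one sum sit on adjacent lines, so a single pass with constant memory per key is enough to get the statistics.  The following is a minimal sketch of that pass, not the code in <code>FileRelnCollector</code>: it assumes each line has the form key, tab, value (as written by the <code>println()</code> above), that the input is already sorted by key, and that duplicate lines have already been removed.  The names <code>SortedRunStats</code> and <code>runLengthHistogram</code> are hypothetical.</p>

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

final class SortedRunStats {
    /**
     * Reads lines of the form key TAB value, already sorted by key, and
     * returns a histogram: run length -> number of keys with that many lines.
     * Only the current key and the (small) histogram are held in memory.
     */
    static Map<Integer, Integer> runLengthHistogram(final BufferedReader in) throws IOException {
        final Map<Integer, Integer> histogram = new TreeMap<Integer, Integer>();
        String currentKey = null;
        int run = 0;
        String line;
        while ((line = in.readLine()) != null) {
            final int tab = line.indexOf('\t');
            final String key = tab < 0 ? line : line.substring(0, tab);
            if (!key.equals(currentKey)) {
                if (currentKey != null) {
                    bump(histogram, run); // previous key's run is complete
                }
                currentKey = key;
                run = 0;
            }
            ++run;
        }
        if (currentKey != null) {
            bump(histogram, run); // flush the final run
        }
        return histogram;
    }

    private static void bump(final Map<Integer, Integer> histogram, final int run) {
        final Integer old = histogram.get(run);
        histogram.put(run, old == null ? 1 : old + 1);
    }

    public static void main(final String[] args) throws IOException {
        final String sorted = "c1\ta\nc1\tb\nc2\ta\nc3\ta\nc3\tb\n";
        // prints {1=1, 2=2}: one key with 1 line, two keys with 2 lines
        System.out.println(runLengthHistogram(new BufferedReader(new StringReader(sorted))));
    }
}
```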
<p>Note that this runtime can be deceptively low.  If you are running on a machine with a modern operating system and enough memory, the file being used as "external storage" actually gets cached into memory (and gets near memory speed performance).  To get a reliable timing you need to test a problem of the size you are interested in on a machine of the size you are going to deploy on (not on a larger machine).</p>
<p>For better or worse this method should seem familiar as a lot of science has been done using the Unix text tools (sort, join and a few more).  This is also the basis of Map Reduce and we demonstrate a <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/MapReduceRelnCollector.java">Hadoop implementation of RelnCollector</a> as well.  Or we can link up with the other technology designed for beyond memory size data manipulation and get <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/DBRelnCollector.java">a database based implementation of RelnCollector</a>.  </p>
<p>In all cases the implementations we call depend on journaling (in the sense of keeping a sequential log of operations to be done instead of immediately performing the operations), scattering (splitting into multiple temp files and structures) and merging (combining data from multiple ordered files).  We could write our own code to perform all of these operations (obviating any need for GNU sort, Hadoop or a database), but it is much less code to do as we have here and write an adapter to use existing implementations.</p>
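<p>To give a feel for the merging step, here is a minimal sketch of a k-way merge of already sorted runs using a priority queue.  It illustrates the general technique and is not how GNU sort, Hadoop or H2 implement it.  The names <code>MergeSortedRuns</code>, <code>Head</code> and <code>merge</code> are hypothetical, and the runs are in-memory lists only to keep the example self-contained (in an out of core setting each run would be an iterator over a temp file and the output would be written out rather than collected).</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class MergeSortedRuns {
    /** The next unread element of one run, plus the rest of that run. */
    private static final class Head implements Comparable<Head> {
        final String value;
        final Iterator<String> rest;

        Head(final String value, final Iterator<String> rest) {
            this.value = value;
            this.rest = rest;
        }

        @Override
        public int compareTo(final Head o) {
            return value.compareTo(o.value);
        }
    }

    /** Merges runs that are each already sorted into one sorted list. */
    static List<String> merge(final List<? extends Iterable<String>> runs) {
        final PriorityQueue<Head> heap = new PriorityQueue<Head>();
        for (final Iterable<String> run : runs) {
            final Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        final List<String> out = new ArrayList<String>(); // a real merge would stream to disk
        while (!heap.isEmpty()) {
            final Head h = heap.poll(); // smallest pending element across all runs
            out.add(h.value);
            if (h.rest.hasNext()) {
                heap.add(new Head(h.rest.next(), h.rest));
            }
        }
        return out;
    }

    public static void main(final String[] args) {
        final List<String> merged = merge(Arrays.asList(
                Arrays.asList("a", "d"),
                Arrays.asList("b", "c"),
                Arrays.asList("e")));
        // prints [a, b, c, d, e]
        System.out.println(merged);
    }
}
```

<p>Only one element per run is held in the heap at any time, which is why the merge works no matter how large the runs are.</p>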
<p>The sum-set example is deliberately artificial.  More common examples are, as we mentioned, index inversion and term frequency calculation.  All of our example code is available here: <a target="_blank" href="https://github.com/WinVector/OutOfCore">https://github.com/WinVector/OutOfCore</a> including JUnit tests and an <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/example/AdditiveSets.java">example program</a>.  The various implementations depend on the <a target="_blank" href="http://www.junit.org/">JUnit 4.10</a>, <a target="_blank" href="http://www.h2database.com/html/main.html">H2 database</a> and <a target="_blank" href="http://hadoop.apache.org/mapreduce/releases.html">Hadoop 0.21.0</a> libraries.</p>
<p>The main trick is basing your code on a very thin storage abstraction (like the <code>RelnCollector</code> interface, instead of explicitly known data structures) and then using this abstraction to hide all of the details away from the rest of your code (keeping complexity at a manageable level).  The two things to avoid are either infecting your code with too much knowledge of your storage plans (i.e. pushing implementation details into your important code to "speed things up") or being forced to re-design your entire project to fit within some framework (like re-writing all of your code as a database stored procedure or an explicit Hadoop map/reduce pair as this over-commits you to one technology).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;The Mythical Man Month&#8221; is still a good read</title>
		<link>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-mythical-man-month-is-still-a-good-read</link>
		<comments>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/#comments</comments>
		<pubDate>Sun, 23 Oct 2011 18:57:52 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Computers]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Architects]]></category>
		<category><![CDATA[Mythical Man Month]]></category>
		<category><![CDATA[SAGE]]></category>
		<category><![CDATA[WIMP]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1834</guid>
		<description><![CDATA[Re-read Fred Brooks&#8217; &#8220;The Mythical Man Month&#8221; over vacation.  Book remains insightful about computer science and project management. My spin on some points: System architects once were the people who said &#8220;no&#8221; to features to maintain design consistency and coherency. Now architects are the people who buy and bring in external frameworks and technologies (killing any [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Re-read Fred Brooks&#8217; &#8220;The Mythical Man Month&#8221; over vacation.  Book remains insightful about computer science and project management.<span id="more-1834"></span> My spin on some points:</p>
<ul>
<li>
System architects once were the people who said &#8220;no&#8221; to features to maintain design consistency and coherency.  Now architects are the people who buy and bring in external frameworks and technologies (killing any chance of consistency or coherency).  Kind of like the Fahrenheit 451 quote &#8220;I remember firemen used to fight fires.&#8221;
</li>
<li>
By far the thing that aged the worst was the reverence for the WIMP (windows, icons, menus, pointing) paradigm.  At this point I think we can argue that WIMP codified a lot of provably bad decisions: desktops, icons, menus and a mouse that sits outside the visual field.  Maybe some of the ideas prior to WIMP (like SAGE&#8217;s light-pens) or after WIMP (application launcher noun-verb theories like Quicksilver, search, touch pads, full screen apps, versioning and not forcing the user to adapt to the file storage abstraction) are actually much more fundamental.  I think we all were seduced by the 1968 Engelbart demo but forget that the Semi Automated Ground Environment had been a production-deployed, multi-user, information-sharing, direct (light pen) point-and-click system since 1959.</p>
<p><center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/IMG_0064.jpg" alt="SAGE station" title="IMG_0064.JPG" border="0" width="600" height="450" /></p>
<p>SAGE station, Computer History Museum- Mountain View, CA</p>
<p></center>
</li>
<li>
Most everything else ages very well.  The discussions of the pain of having to work &#8220;out of core&#8221; remain relevant as this is what we now call &#8220;big data&#8221; (though in Brooks&#8217; time this pain extended to documentation, source code and binaries, all of which were too big to hold in memory or even in machine accessible format in the time of the IBM System/360).  </p>
<p>Though in the old days- &#8220;out of core&#8221; meant punched cards, punched tape, magnetic tape or very slow hard disks (which were a new luxury for the period Brooks writes about).<br />
<center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/IMG_0062.jpg" alt="IMG 0062" title="IMG_0062.JPG" border="0" width="450" height="600" /></p>
<p>SDS 920 with built in tape-drive, Computer History Museum- Mountain View, CA</p>
<p></center>
</li>
<li>
Linkers were among the biggest problems in the 1960s and remain so now (though we now call it late binding, jars, shared libraries and APIs).  At one point Brooks throws up his hands and says that it would be faster to just re-compile everything than to deal with some relocating linkers.
</li>
<li>
Brooks definitely advocates and anticipates things like developer wikis (though he had to use microfiche as the computers of his day didn&#8217;t have enough storage to manage their own documentation).
</li>
<li>
&#8220;Literate Programming&#8221; is clearly anticipated.
</li>
<li>
Version control procedures are definitely written about, but Brooks seems not to anticipate version control software.
</li>
</ul>
<p>Overall: very well written and still interesting and relevant.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Programmers Should Know R</title>
		<link>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=programmers-should-know-r</link>
		<comments>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/#comments</comments>
		<pubDate>Sat, 06 Aug 2011 15:29:22 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[diagnosis]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1711</guid>
		<description><![CDATA[Programmers should definitely know how to use R. I don&#8217;t mean they should switch from their current language to R, but they should think of R as a handy tool during development. Again and again I find myself working with Java code like the following. public class SomeBigProject1 { public static double logStirlingApproximation(final int n) { [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Survive R'>Survive R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Programmers should definitely know how to use <a target="_blank" href="http://cran.r-project.org/">R</a>.  I don&#8217;t mean they should switch from their current language to R, but they should think of R as a handy tool during development.<span id="more-1711"></span> Again and again I find myself working with Java code like the following.</p>
<style type="text/css">
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
pre { line-height: 125%; }
body .hll { background-color: #ffffcc }
body  { background: #f8f8f8; }
body .c { color: #408080; font-style: italic } /* Comment */
body .err { border: 1px solid #FF0000 } /* Error */
body .k { color: #008000; font-weight: bold } /* Keyword */
body .o { color: #666666 } /* Operator */
body .cm { color: #408080; font-style: italic } /* Comment.Multiline */
body .cp { color: #BC7A00 } /* Comment.Preproc */
body .c1 { color: #408080; font-style: italic } /* Comment.Single */
body .cs { color: #408080; font-style: italic } /* Comment.Special */
body .gd { color: #A00000 } /* Generic.Deleted */
body .ge { font-style: italic } /* Generic.Emph */
body .gr { color: #FF0000 } /* Generic.Error */
body .gh { color: #000080; font-weight: bold } /* Generic.Heading */
body .gi { color: #00A000 } /* Generic.Inserted */
body .go { color: #808080 } /* Generic.Output */
body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
body .gs { font-weight: bold } /* Generic.Strong */
body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
body .gt { color: #0040D0 } /* Generic.Traceback */
body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
body .kp { color: #008000 } /* Keyword.Pseudo */
body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
body .kt { color: #B00040 } /* Keyword.Type */
body .m { color: #666666 } /* Literal.Number */
body .s { color: #BA2121 } /* Literal.String */
body .na { color: #7D9029 } /* Name.Attribute */
body .nb { color: #008000 } /* Name.Builtin */
body .nc { color: #0000FF; font-weight: bold } /* Name.Class */
body .no { color: #880000 } /* Name.Constant */
body .nd { color: #AA22FF } /* Name.Decorator */
body .ni { color: #999999; font-weight: bold } /* Name.Entity */
body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
body .nf { color: #0000FF } /* Name.Function */
body .nl { color: #A0A000 } /* Name.Label */
body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
body .nt { color: #008000; font-weight: bold } /* Name.Tag */
body .nv { color: #19177C } /* Name.Variable */
body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
body .w { color: #bbbbbb } /* Text.Whitespace */
body .mf { color: #666666 } /* Literal.Number.Float */
body .mh { color: #666666 } /* Literal.Number.Hex */
body .mi { color: #666666 } /* Literal.Number.Integer */
body .mo { color: #666666 } /* Literal.Number.Oct */
body .sb { color: #BA2121 } /* Literal.String.Backtick */
body .sc { color: #BA2121 } /* Literal.String.Char */
body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
body .s2 { color: #BA2121 } /* Literal.String.Double */
body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
body .sh { color: #BA2121 } /* Literal.String.Heredoc */
body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
body .sx { color: #008000 } /* Literal.String.Other */
body .sr { color: #BB6688 } /* Literal.String.Regex */
body .s1 { color: #BA2121 } /* Literal.String.Single */
body .ss { color: #19177C } /* Literal.String.Symbol */
body .bp { color: #008000 } /* Name.Builtin.Pseudo */
body .vc { color: #19177C } /* Name.Variable.Class */
body .vg { color: #19177C } /* Name.Variable.Global */
body .vi { color: #19177C } /* Name.Variable.Instance */
body .il { color: #666666 } /* Literal.Number.Integer.Long */
</style>
<div class="highlight">
<pre><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeBigProject1</span> <span class="o">{</span>
	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">double</span> <span class="nf">logStirlingApproximation</span><span class="o">(</span><span class="kd">final</span> <span class="kt">int</span> <span class="n">n</span><span class="o">)</span> <span class="o">{</span>
		<span class="k">return</span> <span class="n">n</span><span class="o">*(</span><span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="n">n</span><span class="o">)-</span><span class="mi">1</span><span class="o">)</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="mi">2</span><span class="o">*</span><span class="n">Math</span><span class="o">.</span><span class="na">PI</span><span class="o">*</span><span class="n">n</span><span class="o">);</span>
	<span class="o">}</span>

	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">double</span> <span class="nf">logFactorial</span><span class="o">(</span><span class="kd">final</span> <span class="kt">int</span> <span class="n">n</span><span class="o">)</span> <span class="o">{</span>
		<span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="n">n</span><span class="o">;</span><span class="n">i</span><span class="o">&gt;</span><span class="mi">1</span><span class="o">;--</span><span class="n">i</span><span class="o">)</span> <span class="o">{</span>
			<span class="n">r</span> <span class="o">+=</span> <span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="n">i</span><span class="o">);</span>
		<span class="o">}</span>
		<span class="k">return</span> <span class="n">r</span><span class="o">;</span>
	<span class="o">}</span>

	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
		<span class="kt">int</span> <span class="n">nbad</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">n</span><span class="o">=</span><span class="mi">1000</span><span class="o">;</span><span class="n">n</span><span class="o">&lt;</span><span class="mi">10000</span><span class="o">;++</span><span class="n">n</span><span class="o">)</span> <span class="o">{</span>
			<span class="k">if</span><span class="o">(</span><span class="n">Math</span><span class="o">.</span><span class="na">abs</span><span class="o">(</span><span class="n">logFactorial</span><span class="o">(</span><span class="n">n</span><span class="o">)-</span><span class="n">logStirlingApproximation</span><span class="o">(</span><span class="n">n</span><span class="o">))&gt;=</span><span class="mf">1.0</span><span class="n">e</span><span class="o">-</span><span class="mi">5</span><span class="o">)</span> <span class="o">{</span>
				<span class="o">++</span><span class="n">nbad</span><span class="o">;</span>
			<span class="o">}</span>
		<span class="o">}</span>
		<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;nbad: &quot;</span> <span class="o">+</span> <span class="n">nbad</span><span class="o">);</span>
	<span class="o">}</span>
<span class="o">}</span>
</pre>
</div>
<p>Imagine that this is some humongous project to use <a target="_blank" href="http://en.wikipedia.org/wiki/Stirling's_approximation">Stirling&#8217;s Approximation</a> as a replacement for factorial.  All the code up until main is great.  But the unfortunate developer has hard-coded an acceptance test into <code>main()</code>.  If they run their big project all they get out is:</p>
<pre>
nbad: 7334
</pre>
<p>The developer needs to re-code and re-build to diagnose the failure, tweak their acceptance criteria or add more measurements.</p>
<p>I strongly recommend a different work pattern.  Instead of bringing criteria into the code, bring the data out:</p>
<div class="highlight">
<pre><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeBigProject2</span> <span class="o">{</span>
	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
		<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;n&quot;</span>
				<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="s">&quot;logFactorial&quot;</span>
				<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="s">&quot;logStirlingApproximation&quot;</span><span class="o">);</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">n</span><span class="o">=</span><span class="mi">1000</span><span class="o">;</span><span class="n">n</span><span class="o">&lt;</span><span class="mi">10000</span><span class="o">;++</span><span class="n">n</span><span class="o">)</span> <span class="o">{</span>
			<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">String</span><span class="o">.</span><span class="na">valueOf</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
					<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="n">SomeBigProject1</span><span class="o">.</span><span class="na">logFactorial</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
					<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="n">SomeBigProject1</span><span class="o">.</span><span class="na">logStirlingApproximation</span><span class="o">(</span><span class="n">n</span><span class="o">));</span>
		<span class="o">}</span>
	<span class="o">}</span>
<span class="o">}</span>
</pre>
</div>
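<p>A dump like this stays machine readable only if each record stays on one line with a fixed number of tabs.  The numeric columns above are safe, but free-text columns need cleaning first.  The following is a minimal sketch of one way to do that; the class name <code>TsvFields</code> and its methods are hypothetical helpers for illustration, not part of the project above.</p>

```java
final class TsvFields {
    /** Collapse every run of whitespace (tabs, line breaks, spaces) to one space. */
    static String clean(final String field) {
        if (field == null) {
            return "";
        }
        return field.replaceAll("\\s+", " ");
    }

    /** Clean each field and join the results into one tab-separated record. */
    static String row(final String... fields) {
        final StringBuilder b = new StringBuilder();
        for (int i = 0; i < fields.length; ++i) {
            if (i > 0) {
                b.append('\t');
            }
            b.append(clean(fields[i]));
        }
        return b.toString();
    }

    public static void main(final String[] args) {
        System.out.println(row("id", "note"));
        // The embedded line break and tab are flattened to single spaces.
        System.out.println(row("17", "first line\nsecond\tline"));
    }
}
```

<p>Cleaning at print time keeps the reading side simple: the reader can then be told to expect no quoting and no escapes at all.</p>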
<p>Capture the output of <code>SomeBigProject2</code> in a file named &#8220;data.tsv&#8221; and both Microsoft Excel and R can open it.  Naturally I prefer to use R (so that is what I will demonstrate).  To read the results into R, start R and type in a command like the following:</p>
<pre>
 &gt; d &lt;- read.table('data.tsv',
        header=T,sep='\t',quote='',as.is=T,
        stringsAsFactors=F,comment.char='',allowEscapes=F)
</pre>
<p>Most of the arguments control the style of file R is to expect (what the field separator is, whether to expect escapes and quotes, and so on).  The settings I suggest here are the &#8220;ultra hardened&#8221; settings.  If you make sure none of your fields have a tab or line-break in them when you print, then it is guaranteed R can read the data (no matter what wacky symbols are in it).  On the Java side that usually means making sure any varying text fields are run through <code>.replaceAll("\\s+"," ")</code> &#8220;just in case.&#8221; At this point you can already look at your data with the <code>summary()</code> command:</p>
<pre>
 &gt; summary(d)
</pre>
<pre>
       n         logFactorial   logStirlingApproximation
 Min.   :1000   Min.   : 5912   Min.   : 5912
 1st Qu.:3250   1st Qu.:23034   1st Qu.:23034
 Median :5500   Median :41870   Median :41870
 Mean   :5500   Mean   :42536   Mean   :42536
 3rd Qu.:7749   3rd Qu.:61653   3rd Qu.:61653
 Max.   :9999   Max.   :82100   Max.   :82100
</pre>
<p>This immediately hints that you should have been thinking in terms of relative error instead of absolute error (since insisting on high absolute accuracy on large results does not always make sense).</p>
<p>You also have access to standard statistical measures of agreement like correlation: </p>
<pre>
 &gt; with(d,cor(logFactorial,logStirlingApproximation))
</pre>
<pre>
result: 1
</pre>
<p>You can see where your failures were:</p>
<pre>
 &gt; library(ggplot2)
 &gt; d$bad &lt;- with(d,abs(logFactorial-logStirlingApproximation)&gt;=1.0e-5)
 &gt; ggplot(d) + geom_point(aes(x=n,y=bad))
</pre>
<p>Yields the graph:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/bad.png" alt="bad.png" border="0" width="525" height="525" /><br />
</center></p>
<p>You can see all your failures are in the initial interval.  You can then drill in:</p>
<pre>
 &gt; ggplot(d) + geom_point(aes(x=n,y=logFactorial-logStirlingApproximation)) +
                scale_y_log10()
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/diff.png" alt="diff.png" border="0" width="525" height="525" /><br />
</center></p>
<p>And here we see some things (that are in general true for Stirling&#8217;s approximation):</p>
<ol>
<li>It is very accurate.</li>
<li>It is always an underestimate.</li>
<li>It gets better as n gets larger.</li>
</ol>
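<p>Observations like these suggest judging the approximation by relative error rather than absolute error.  In the spirit of bringing the data out, one option is to add a relative error column to the dump instead of hard-coding a new threshold.  The following is a minimal sketch under that assumption; the class name <code>RelativeErrorDump</code>, the helper <code>relativeError</code> and the extra column name are illustrative choices, and the two log functions are copied from <code>SomeBigProject1</code> so the example is self-contained.</p>

```java
final class RelativeErrorDump {
    static double logStirlingApproximation(final int n) {
        return n * (Math.log(n) - 1) + 0.5 * Math.log(2 * Math.PI * n);
    }

    static double logFactorial(final int n) {
        double r = 0.0;
        for (int i = n; i > 1; --i) {
            r += Math.log(i);
        }
        return r;
    }

    /** |exact - approx| / |exact|; assumes exact is non-zero. */
    static double relativeError(final double exact, final double approx) {
        return Math.abs(exact - approx) / Math.abs(exact);
    }

    public static void main(final String[] args) {
        System.out.println("n"
                + "\t" + "logFactorial"
                + "\t" + "logStirlingApproximation"
                + "\t" + "relativeError");
        for (int n = 1000; n < 10000; ++n) {
            final double exact = logFactorial(n);
            final double approx = logStirlingApproximation(n);
            System.out.println(String.valueOf(n)
                    + "\t" + exact
                    + "\t" + approx
                    + "\t" + relativeError(exact, approx));
        }
    }
}
```

<p>Any acceptance threshold on the new column can then be chosen (and revised) in R, without rebuilding the Java code.</p>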
<p>Essentially by poking around with graphs in R you can figure out the nature of your errors (telling you what to fix) and generate findings that tell you how to fix your criteria (perhaps your code is working- but your test wasn&#8217;t sensible).  The &#8220;dump everything and then use R&#8221; technique is also particularly good for generating reports on code timings using either <code>geom_histogram</code> or <code>geom_density</code>. </p>
<p>For example, if we had data with a field <code>runTimeMS</code> then it is a simple one-liner to get a plot like the following:</p>
<pre>
 &gt; ggplot(t) + geom_density(aes(x=runTimeMS))
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/timing.png" alt="timing.png" border="0" width="525" height="525" /><br />
</center></p>
<p>From this graph we can immediately see:</p>
<ol>
<li>Most of our run-times are very fast.</li>
<li>We have a heavy right-tail (evidence of &#8220;contagion&#8221; or one slow-down causing others, like CPU or IO contention).</li>
<li>Data is truncated at 100MS (could be something &#8220;censoring&#8221; the measurement, an exception being thrown or an abort).</li>
<li>There is a spike at 30MS (something is true and slow for some subset of the data that isn&#8217;t present in the majority).</li>
</ol>
<p>This is a lot more than would be seen in a mean-only or mean and standard deviation summary.  We may even be seeing signs of two different bugs (the truncation and the spike).</p>
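<p>Producing a timing dump of this kind from Java takes only a few lines.  The following is a minimal sketch, not the code that produced the graph above: the class name <code>TimingDump</code>, the <code>trial</code> column and the stand-in workload are all assumptions made for illustration.</p>

```java
final class TimingDump {
    /** Wall-clock time of one run of task, in milliseconds. */
    static double timeMs(final Runnable task) {
        final long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1.0e6;
    }

    /** Header line naming the two tab-separated columns. */
    static String header() {
        return "trial" + "\t" + "runTimeMS";
    }

    public static void main(final String[] args) {
        // Hypothetical workload standing in for the code being measured.
        final Runnable task = () -> {
            double s = 0.0;
            for (int i = 1; i <= 100000; ++i) {
                s += Math.sqrt(i);
            }
            if (s < 0.0) { // never true; keeps the loop from being optimized away
                System.err.println(s);
            }
        };
        System.out.println(header());
        for (int trial = 0; trial < 1000; ++trial) {
            System.out.println(trial + "\t" + timeMs(task));
        }
    }
}
```

<p>Recording one row per run (rather than a running mean) is what makes features like tails, truncation and spikes visible later.</p>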
<p>In all cases the key is to dump a lot of data in machine readable form and then come back to analyze it.  This is far more flexible than hoping to code in the right summaries and then further hoping the summaries don&#8217;t miss something important (or that you at least get a chance to notice if they do miss something).  Being able to do exploratory statistics on dumps from your code (both results and timing) gives you incredible measurement, tuning and debugging powers.   The scriptability of R means any later analysis is as easy as cut and paste.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Survive R'>Survive R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Automatic Detection of Potential Deadlock</title>
		<link>http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=automatic-detection-of-potential-deadlock</link>
		<comments>http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/#comments</comments>
		<pubDate>Sat, 04 Jun 2011 16:55:52 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[cycle detection]]></category>
		<category><![CDATA[deadlock]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1664</guid>
		<description><![CDATA[We would like to share a programming article we wrote on the automatic detection of potential deadlock.The article touches on some fun issues: multithreaded programming, graph algorithms. It was also back when I was considering the bipartite graph as a fundamental basis for data structures (instead of lists, arrays or maps). Related posts: Automatic Differentiation [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/' rel='bookmark' title='Programs reduced to statistics'>Programs reduced to statistics</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We would like to share a programming article we wrote on the <a target="_blank" href="http://www.mzlabs.com/JMPubs/Automatic%20Detection%20of%20Potential%20Deadlock-Mount.pdf">automatic detection of potential deadlock</a>.<span id="more-1664"></span> The article touches on some fun issues: multithreaded programming, graph algorithms.  It was also back when I was considering the bipartite graph as a fundamental basis for data structures (instead of lists, arrays or maps).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/' rel='bookmark' title='Programs reduced to statistics'>Programs reduced to statistics</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Brevity is a Virtue</title>
		<link>http://www.win-vector.com/blog/2011/04/brevity-is-a-virtue/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=brevity-is-a-virtue</link>
		<comments>http://www.win-vector.com/blog/2011/04/brevity-is-a-virtue/#comments</comments>
		<pubDate>Wed, 27 Apr 2011 14:58:33 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1652</guid>
		<description><![CDATA[Our friends at Dataspora have a nice article on the more modern Map Reduce languages. A very good read and clearly a lot of thought went into preparing it.In passing we are rightfully taken to task for hiding a huge glob of code in a tar file that few people are likely to open. Using [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/' rel='bookmark' title='What to do when you run out of memory'>What to do when you run out of memory</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
<li><a href='http://www.win-vector.com/blog/2011/08/programmers-should-know-r/' rel='bookmark' title='Programmers Should Know R'>Programmers Should Know R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Our friends at <a target="_blank" href="http://www.dataspora.com/">Dataspora</a> have a nice <a target="_blank" href="http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/">article on the more modern Map Reduce languages</a>.  A very good read and clearly a lot of thought went into preparing it.<span id="more-1652"></span> In passing we are rightfully taken to task for hiding a huge glob of code in a tar file that few people are likely to open.   Using higher order tools could indeed make the code smaller.  Perhaps small enough that we could share it in a more readable format.  It is a good point and our only answer to it is that we at Win-Vector LLC see ourselves as tool builders delivering complete tools that perform well defined tasks (like a logistic regression) so that most people do not have to open the tar file (but they can if they need to).  That is: we believe in higher order language tools, and we supply some of them.  We also, however, like to minimize external dependencies so that our code can run on more systems.</p>
<p>Back to the tar file issue.  We had been meaning to get our code up on github or some other public source control system.  Instead we have <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/WinVectorLogisticHadoopHTML/list.html">HTMLified it</a> (with some cross reference links, it still isn&#8217;t pretty).</p>
<p>And Antonio Piccolboni, thanks for the great article.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/' rel='bookmark' title='What to do when you run out of memory'>What to do when you run out of memory</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
<li><a href='http://www.win-vector.com/blog/2011/08/programmers-should-know-r/' rel='bookmark' title='Programmers Should Know R'>Programmers Should Know R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/04/brevity-is-a-virtue/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>SQL Screwdriver</title>
		<link>http://www.win-vector.com/blog/2011/01/sql-screwdriver/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=sql-screwdriver</link>
		<comments>http://www.win-vector.com/blog/2011/01/sql-screwdriver/#comments</comments>
		<pubDate>Tue, 18 Jan 2011 05:29:51 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[H2]]></category>
		<category><![CDATA[Medium Scale Data]]></category>
		<category><![CDATA[No DB]]></category>
		<category><![CDATA[SQL Screwdriver]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1630</guid>
		<description><![CDATA[We discuss a &#8220;medium scale data&#8221; technique that we call &#8220;SQL Screwdriver.&#8221; Previously we discussed some of the issues of large scale data analytics. A lot of the work done at the MapReduce scale is necessarily limited to mere aggregation and report generation. But what of medium scale? That is data too large to perform [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/' rel='bookmark' title='What to do when you run out of memory'>What to do when you run out of memory</a></li>
<li><a href='http://www.win-vector.com/blog/2011/08/programmers-should-know-r/' rel='bookmark' title='Programmers Should Know R'>Programmers Should Know R</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/brevity-is-a-virtue/' rel='bookmark' title='Brevity is a Virtue'>Brevity is a Virtue</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We discuss a &#8220;medium scale data&#8221; technique that we call &#8220;SQL Screwdriver.&#8221;</p>
<p>Previously we discussed some of the issues of <a target="_blank" href="http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/">large scale data analytics</a>.  A lot of the work done at the MapReduce scale is necessarily limited to mere aggregation and report generation.  But what of medium scale?  That is data too large to perform all steps in your favorite tool (<a target="_blank" href="http://cran.r-project.org/">R</a>, Excel or something else) but small enough that you are expected to produce sophisticated models, decisions and analysis.  At this scale, if properly prepared, you don&#8217;t need large scale tools and their limitations.  With extra preparation you can continue to use your preferred tools.  We call this the realm of medium scale data and discuss a preparation tool style we call &#8220;screwdriver&#8221; (as opposed to larger hammers).</p>
<p>We stand the <a target="_blank" href="http://en.wikipedia.org/wiki/NoSQL">&#8220;no SQL&#8221;</a> movement on its head and discuss the beneficial use of SQL without a server (as opposed to their vision of a key-value store without SQL). Database servers can be a nuisance- but that is not enough reason to give up the power of relational query languages.<br />
<span id="more-1630"></span><br />
One of the tenets of the <a target="_blank" href="http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf">MAD analytics</a> movement is that you have to constantly move new data towards your decision problem.  This wisdom is compatible with the machine learning rule that 90% of your effort is spent on feature design and another 90% is needed for data-tubing.  As an example: suppose you are attempting to predict or model the probability that a candidate will make a given purchase or take out a given loan.  You might want to merge what you know about candidates (both past candidates which form your training data and future candidates that you are trying to characterize) with other data sources before you start your machine learning process.  An example &#8220;other data source&#8221; is the <a target="_blank" href="http://www.census.gov/geo/ZCTA/zcta.html">Census ZCTA arranged data</a> which is aggregated census data (age, income, education and so on) keyed by ZCTA (ZIP Code Tabulation Areas). If you had a table of per-person data called &#8220;people&#8221; then what you want is to merge the ZCTA data into this table as additional columns.   By far the most reasonable way to express this is known as a &#8220;join&#8221;.  The SQL expression for such a join looks like the following:</p>
<pre>
SELECT people.*, zctaSummaries.*
      FROM people LEFT JOIN ( zctaSummaries )
      ON ( people.ZIPCODE = zctaSummaries.ZCTA )
</pre>
<p>This is the SQL way of saying &#8220;make a new table where each row is a row from my people data with the ZCTA data appended as additional columns.&#8221;  Notice, and this is the one grace of SQL, that we do not have to specify how this assembly is to be done (no loops, variables or explicit sorting).  This is what we want.  What we don&#8217;t want is the pain of setting up a persistent database server just so we can run some SQL.  A server involves processes listening on ports, passwords, provisioning, maintenance and so on.  The idea is we should only specify how we want our data prepared for analysis (or de-normalized in database terms), we shouldn&#8217;t have to specify how it is done or manage a server just to get it done.</p>
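<p>For contrast, here is roughly the hand-written assembly that the SQL statement spares us.  This is only an in-memory illustration of what a left join computes, not how a database engine executes it; the class name <code>LeftJoinSketch</code>, the row layout and the sample values are made up for the example, and it assumes each key appears at most once on the right-hand side.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LeftJoinSketch {
    /**
     * For each left row, append the right-hand value whose key equals
     * row[keyIndex]. Unmatched left rows are kept and padded with "NULL",
     * which is what distinguishes a LEFT JOIN from an inner join.
     */
    static List<String> leftJoin(final List<String[]> left, final int keyIndex,
                                 final Map<String, String> rightByKey) {
        final List<String> out = new ArrayList<String>();
        for (final String[] row : left) {
            final String match = rightByKey.get(row[keyIndex]);
            out.add(String.join("\t", row) + "\t" + (match == null ? "NULL" : match));
        }
        return out;
    }

    public static void main(final String[] args) {
        // Made-up sample rows: (name, ZIPCODE).
        final List<String[]> people = new ArrayList<String[]>();
        people.add(new String[] {"alice", "94110"});
        people.add(new String[] {"bob", "00000"});
        // Made-up summary keyed by ZCTA.
        final Map<String, String> zctaSummaries = new HashMap<String, String>();
        zctaSummaries.put("94110", "exampleSummary");
        for (final String line : leftJoin(people, 1, zctaSummaries)) {
            System.out.println(line);
        }
    }
}
```

<p>Even this toy version has to choose an index structure and a loop order, and it holds everything in memory.  The SQL form leaves all of those decisions to the engine.</p>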
<p>Luckily there are a number of no-server databases that implement SQL.  In particular we call out <a target="_blank" href="http://www.h2database.com/html/main.html">H2</a>.  H2 is a pure Java SQL engine, so any place we can run Java we can run H2.  And one of the better graphical database clients (<a target="_blank" href="http://squirrel-sql.sourceforge.net/">SQuirreL SQL</a>) is both free and compatible with H2.  So not only can you run your SQL- you can explore your data interactively!</p>
<p>To go further with this example you need a few minor database tools we are releasing under the <a target="_blank" href="http://www.gnu.org/licenses/agpl.html">GPL3 Affero License</a>. The latest copy of <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/WinVectorLogistic.Hadoop0.20.2.jar" title="WinVectorLogistic.Hadoop0.20.2.jar">WinVectorLogistic.Hadoop0.20.2.jar</a> contains both compiled Java classes and Java 6 compatible source code.  This jar includes the tools that talk to H2 (or any other JDBC compatible database) that embody our SQL Screwdriver idea.  To complete the tutorial you will need to download the WinVector Logistic jar, an H2 distribution (contains h2-1.2.147.jar) and the pre-processed ZCTA data (<a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/zctaSummaries.tsv" title="zctaSummaries.tsv">zctaSummaries.tsv</a>).</p>
<p>Next we build a Java XML style properties file describing our database.  In our case we leave user and password blank, specify use of the H2 embedded driver and name the file that will be the backing store for our small database (in this case H2TestDB).</p>
<pre>
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd"&gt;
&lt;properties&gt;
  &lt;comment&gt;testdb&lt;/comment&gt;
  &lt;entry key="user"&gt;&lt;/entry&gt;
  &lt;entry key="password"&gt;&lt;/entry&gt;
  &lt;entry key="driver"&gt;org.h2.Driver&lt;/entry&gt;
  &lt;entry key="url"&gt;jdbc:h2:H2TestDB/H2DB
         ;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0&lt;/entry&gt;
&lt;/properties&gt;
</pre>
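<p>A file in this format can be read with the standard <code>java.util.Properties.loadFromXML</code> method.  The following is a minimal sketch of doing so; it parses an in-memory copy of the XML (with the URL written on one line) rather than a file, the class name <code>DbPropertiesSketch</code> is hypothetical, and it does not show how the Win-Vector tools themselves consume the file.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class DbPropertiesSketch {
    // In-memory copy of the properties file, URL on a single line.
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <comment>testdb</comment>\n"
            + "  <entry key=\"user\"></entry>\n"
            + "  <entry key=\"password\"></entry>\n"
            + "  <entry key=\"driver\">org.h2.Driver</entry>\n"
            + "  <entry key=\"url\">jdbc:h2:H2TestDB/H2DB;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0</entry>\n"
            + "</properties>\n";

    /** Parse XML-style Java properties from a string. */
    static Properties parse(final String xml) {
        final Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            // Malformed XML surfaces here as InvalidPropertiesFormatException.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(final String[] args) {
        final Properties props = parse(XML);
        System.out.println("driver: " + props.getProperty("driver"));
        System.out.println("url: " + props.getProperty("url"));
    }
}
```

<p>The <code>driver</code> and <code>url</code> values are the usual inputs to a JDBC connection, which is why the same settings can be reused by any other Java program that talks to the database.</p>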
<p>Once the properties file is in place we can transfer the tab-separated data into our database by running the following Java command:</p>
<pre>
  java -cp WinVectorLogistic.Hadoop0.20.2.jar:h2-1.2.147.jar
     com.winvector.db.LoadTable
     file:h2Test.xml t file:zctaSummaries.tsv zctaSummaries
</pre>
<p>The four final arguments are the location of the XML file we just created, &#8220;t&#8221; to denote tab separated data (&#8220;|&#8221; for pipe separated), the location of the ZCTA tab separated data and what name to give the new table in the database.  Once this has been run the contents of the ZCTA table are in the database.  To see that we start up Squirrel SQL, configure Squirrel SQL&#8217;s H2 driver to point to our h2-1.2.147.jar and configure a database alias identical to our XML file.  One of the great merits of Squirrel SQL is it uses JDBC configuration just like any other Java program- so if you can get it to run in some other program you can get Squirrel SQL to work (and vice versa).</p>
<p>Once Squirrel SQL is up we can examine our table by running a simple select:</p>
<pre>
   select * from zctaSummaries
</pre>
<p>And we see our data:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/01/SquirrelSQL3.png" alt="SquirrelSQL3.png" border="0" width="759" height="455" /><br />
</center></p>
<p>We can now use another small tool to dump our freshly joined data into a tab separated format ready for use by an analysis tool (like R):</p>
<pre>
java -cp WinVectorLogistic.Hadoop0.20.2.jar:h2-1.2.147.jar
   com.winvector.db.DBDump
   file:h2Test.xml
   "SELECT people.*, zctaSummaries.*
       FROM people LEFT JOIN ( zctaSummaries )
       ON ( people.ZIPCODE = zctaSummaries.ZCTA )"
   mergedData.tsv
</pre>
<p>The last three arguments being: the database definition XML again, the exact query we want and the name of a file to write tab separated results into.  At this point we are done.  The file &#8220;mergedData.tsv&#8221; can be moved into R for modeling and analysis.  We can, for neatness or security, now dispose of the database file if we wish.  We have deliberately avoided the built-in bulk table import and export tools as they tend to be finicky and database implementation dependent (our screwdriver tools can be used to move in and out of persistent databases like MySQL just by specifying the correct JDBC driver, URL and driver jars).  But the tools are not as important as the attitude of using powerful relational tools (e.g. SQL) in a batch manner on transient data stores (very different than <a target="_blank" href="http://en.wikipedia.org/wiki/Online_transaction_processing">OLTP</a> and <a target="_blank" href="http://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a> which have high maintenance costs and tend to trap data).</p>
<p>What we have done (with minimal preparation) is: brought the full relational power of SQL to perform the joins required to bring new data into an analysis.  We can execute arbitrary SQL (much more powerful than Unix command line &#8220;join&#8221; and typical R table manipulation tools) and quickly get our data organized for machine learning analysis.  We can work on datasets larger than machine memory if needed and have not incurred the cost of configuring or getting access to a server.</p>
<p>Instead of &#8220;no SQL&#8221; we say: &#8220;no server&#8221; (which is appropriate in the medium sized data regime so common in predictive analytics).</p>
<hr/>
<p>Note Dec-15-2011:  We have moved the code distribution to <a target="_blank" href="https://github.com/WinVector/SQL-Screwdriver">github.com/WinVector/SQL-Screwdriver</a> </p>
<hr/>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/' rel='bookmark' title='What to do when you run out of memory'>What to do when you run out of memory</a></li>
<li><a href='http://www.win-vector.com/blog/2011/08/programmers-should-know-r/' rel='bookmark' title='Programmers Should Know R'>Programmers Should Know R</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/brevity-is-a-virtue/' rel='bookmark' title='Brevity is a Virtue'>Brevity is a Virtue</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/01/sql-screwdriver/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Large Data Logistic Regression (with example Hadoop code)</title>
		<link>http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=large-data-logistic-regression-with-example-hadoop-code</link>
		<comments>http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/#comments</comments>
		<pubDate>Mon, 27 Dec 2010 00:00:59 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Amazon Elastic MapReduce]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[EC2]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Map Reduce]]></category>
		<category><![CDATA[S3]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1607</guid>
		<description><![CDATA[Living in the age of big data we ask what to do when we have the good fortune to be presented with a huge amount of supervised training data? Most often at large scale we are presented with the un-supervised problems of characterization and information extraction; but some problem domains offer an almost limitless supply [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Living in the <a target="_blank" href="http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/">age of big data</a> we ask what to do when we have the good fortune to be presented with a huge amount of supervised training data?  Most often at large scale we are presented with the un-supervised problems of <a target="_blank" href="http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/">characterization and information extraction</a>; but some problem domains offer an almost limitless supply of supervised training data (such as using older data to build models that predict the near future).  Having too much training data is a good problem to have and there are ways to use traditional methods (like logistic regression) at this scale.  We present an &#8220;out of core&#8221; logistic regression implementation and a quick example in <a target="_blank" href="http://hadoop.apache.org/">Apache Hadoop</a> running on <a target="_blank" href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic MapReduce</a>.  This presentation assumes familiarity with Unix style command lines, Java and Hadoop.<span id="more-1607"></span>Apache Hadoop already has a machine learning infrastructure named <a target="_blank" href="http://mahout.apache.org/">Mahout</a>.   While Mahout seems to concentrate more on unsupervised methods (like clustering, nearest neighbor and recommender systems) it does already include a <a target="_blank" href="https://cwiki.apache.org/MAHOUT/logistic-regression.html">logistic regression package</a>.   This package uses a learning method called &#8220;Stochastic Gradient Descent&#8221;, which is in a sense the perceptron update algorithm updated for the new millennium.  
This method is fast in most cases but differs from the traditional methods of solving a logistic regression, which are based on Fisher scoring or the Newton-Raphson method (see &#8220;Categorical Data Analysis,&#8221; Alan Agresti, 1990 and <a target="_blank" href="http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&#038;language=en">Paul Komarek&#8217;s thesis &#8220;Logistic Regression for Data Mining and High-Dimensional Classification&#8221;</a>).  Fisher scoring remains interesting in that it parallelizes in exactly the manner described in &#8220;Map-Reduce for Machine Learning on Multicore,&#8221; Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, Yuan Yuan Yu, Gary Bradski, Andrew Y Ng, Kunle Olukotun, NIPS 2006.</p>
<p>Stochastic gradient descent is in fact an appropriate method for big data.  For example: if our model complexity is held constant and our data set size is allowed to grow, then stochastic gradient descent will achieve its convergence condition before it even completes a single random order traversal of the data.  However, stochastic gradient descent has a control called the learning rate and one can easily imagine a series of problems that require the learning rate to be set arbitrarily slow.  For example, a data set formed as the union of very many &#8220;typical&#8221; examples where a given variable is independent of the outcome and a small minority of &#8220;special&#8221; examples where the same variable helps influence the outcome presents a problem.  Training on the &#8220;typical&#8221; examples causes the stochastic gradient descent method to perform a random walk on the given variable coefficient.  So the learning rate must be slow enough that the expected drift does not swamp out the rare contributions from the &#8220;special&#8221; examples (meaning the learning rate must slow roughly proportionally to the square root of the ratio of the typical to special examples).</p>
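<p>To make the role of the learning rate concrete, here is a minimal Java sketch of a single stochastic gradient descent update for a logistic model.  This is our own illustration and not Mahout's implementation; the class SgdStepSketch and its method names are hypothetical.  Each example moves every coefficient by the learning rate times the residual times the variable value, so on the &#8220;typical&#8221; examples the size of each random-walk step is set directly by the learning rate.</p>
<pre>
/** Hypothetical sketch of one stochastic gradient descent update (not Mahout's code). */
final class SgdStepSketch {
    /** Logistic (sigmoid) function. */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Update the coefficients w in place using one example (x, y). */
    static void step(double[] w, double[] x, boolean y, double learningRate) {
        double z = 0.0;
        for (int i = 0; i != w.length; ++i) {
            z += w[i] * x[i];
        }
        // residual = observed outcome minus modeled probability
        double residual = (y ? 1.0 : 0.0) - sigmoid(z);
        for (int i = 0; i != w.length; ++i) {
            w[i] += learningRate * residual * x[i];
        }
    }
}
</pre>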
<p>Not too much should be made of artificial problems designed to slow stochastic gradient descent.  The traditional Fisher scoring (or the Newton-Raphson method) can simply be killed by specifying a problem with a great number of levels for categorical variables.  In this case traditional methods have to solve a linear system that can in fact be much larger than the entire data set (causing representation, work and numeric stability problems).  So it takes little imagination to design problems that kill the traditional methods.  Other intermediate complexity methods (like conjugate gradient) avoid the storage size problem, but can require many more passes through the training data.</p>
<p>There is a common situation where Fisher scoring makes good sense: you are trying to fit a relatively simple model to an enormous amount of data (often to predict a rare event).  One could sub-sample the training data to shrink the scale of the problem- but this is a case of the analyst being forced to accede to poor tools.  What one would naturally want is a training method that can fit reasonable sized models (that is models with a reasonable number of variables and levels) onto enormous data sets.  The software package <a target="_blank" href="http://cran.r-project.org/">R</a> can work with fairly large data sets (in the gigabytes range) and has some parallel flavors, but R is mostly an in-memory system.  It is appropriate to want a direct method that both &#8220;works out of core&#8221; (i.e. in the terabytes and petabytes ranges), parallelizes to hundreds of machines (using current typical infrastructure- like a Hadoop cluster) and is exact (without additional parameters like learning rate).  </p>
<p>We demonstrate here an example implementation in Java for both single machine &#8220;out of core&#8221; training (allowing filesystem sized datasets) and MapReduce style parallelism (allowing even larger scale).  The method also includes the problem regularization steps discussed in our recent <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">logistic regression article</a>.  The code (packaged in: <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/WinVectorLogistic.Hadoop0.20.2.jar" title="WinVectorLogistic.Hadoop0.20.2.jar">WinVectorLogistic.Hadoop0.20.2.jar</a> ) is being distributed under the GNU Affero General Public License version 3.  This is an open source license that (roughly) requires (among other things) redistribution of source code of systems linked against the licensed project to anyone receiving a compiled version or using the system as a network service.  The license also promises no warranty or implied fitness.  The distribution is a standalone runnable Jar (source code and license inside the jar) and is the minimal object required to run on Hadoop (which is itself a Java project).    More advanced versions of the library (with better linear algebra libraries, better problem slice control, unit tests, JDBC bindings and with different license arrangements) can be arranged from the code owners: <a target="_blank" href="http://www.win-vector.com/">Win-Vector LLC</a>.  This jar was built for Apache Hadoop version 0.20.2 (the latest version Amazon Elastic Map Reduce runs at this time) and we use as many of the newer interfaces as possible (so the code will run against the current Hadoop 0.21.0 if re-built against Hadoop 0.21.0, the jar can not switch versions without being re-built due to how Hadoop calls methods).</p>
<p>For our example we will work on a small data set.  The code is designed to pass through data directly from disk, storing only the Fisher structures, which require storage proportional to the square of the number of variables and levels but independent of the number of data rows.   The data format is what we call &#8220;naive TSV&#8221; or &#8220;naive tab separated values.&#8221;  This is a file where each line has exactly the same number of values (separated by tabs) and the first line of the file is the header line naming each column.  This is compatible with Microsoft Excel and R with the proviso that this file format does not allow any sort of escapes, quoting or multiple line fields.  Our data set is taken from the <a target="_blank" href="http://archive.ics.uci.edu/ml/">UCI machine learning database</a> ( <a  target="_blank" href="http://archive.ics.uci.edu/ml/machine-learning-databases/car/">data</a>, <a target="_blank" href="http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names">description</a> )  and converted into the naive TSV format (split into training and testing subsets: <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/uciCarTrain.tsv" title="uciCarTrain.tsv">uciCarTrain.tsv</a>, <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/uciCarTest.tsv" title="uciCarTest.tsv">uciCarTest.tsv</a>).</p>
<p>The first few lines of the training file are given here:</p>
<pre>
buying	maintenance	doors	persons	lug_boot	safety	rating
vhigh	vhigh	2	2	small	med	FALSE
vhigh	vhigh	2	2	med	low	FALSE
vhigh	vhigh	2	2	med	med	FALSE
</pre>
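<p>As a rough sketch of what reading the naive TSV format involves (this is not the parser inside WinVectorLogistic.Hadoop0.20.2.jar; the class NaiveTsvSketch and its methods are hypothetical names used only for illustration), the following Java streams the data one line at a time, splits on tabs with no quoting or escapes, and checks each row against the header width:</p>
<pre>
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a reader for the naive TSV convention (not the jar's parser). */
final class NaiveTsvSketch {
    /** Split on tabs only; the limit of -1 keeps trailing empty fields. */
    static String[] splitLine(String line) {
        return line.split("\t", -1);
    }

    /** Position of a named column in the header, for example "rating". */
    static int columnIndex(String[] header, String name) {
        for (int i = 0; i != header.length; ++i) {
            if (header[i].equals(name)) {
                return i;
            }
        }
        throw new IllegalArgumentException("no column named " + name);
    }

    /** Stream the data one line at a time, checking each row against the header width. */
    static long countRows(BufferedReader in) {
        try {
            String headerLine = in.readLine();
            if (headerLine == null) {
                throw new IllegalArgumentException("missing header line");
            }
            int width = splitLine(headerLine).length;
            long rows = 0;
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                if (splitLine(line).length != width) {
                    throw new IllegalArgumentException("row " + (rows + 1) + " does not match the header width");
                }
                ++rows;
            }
            return rows;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
</pre>
<p>Only one line is held in memory at a time, which is the property that lets a pass over the data ignore the number of rows.</p>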
<p>The first experiment is to use the Java program standalone (without Hadoop) to train a model.  The method used is Fisher scoring by multiple passes over the data file.  Only the Fisher structures are stored in memory, so in principle the data set could be arbitrarily large.  To run the logistic training program download the files WinVectorLogistic.Hadoop0.20.2.jar and uciCarTrain.tsv .  You will also need some libraries ( commons-logging-*.jar and commons-logging-api-*.jar , and sometimes  hadoop-*-core.jar and log4j-*.jar ) from the appropriate <a target="_blank" href="http://hadoop.apache.org/">Hadoop distribution</a>.  Before running the code you can examine the source (and re-build the project using an IDE like <a target="_blank" href="http://www.eclipse.org/">Eclipse</a>) by extracting the code in an empty directory using the Java jar command:</p>
<pre>
jar xvf WinVectorLogistic.Hadoop0.20.2.jar
</pre>
<p>To run the code type at the command line (all in a single line, we have inserted line breaks for clarity, we are also assuming you are using a Unix style shell on Linux, OSX or Cygwin on Windows):</p>
<pre>
java -cp WinVectorLogistic.Hadoop0.20.2.jar:commons-logging-1.0.4.jar:commons-logging-api-1.0.4.jar
   com.winvector.logistic.demo.LogisticTrain
   file:uciCarTrain.tsv "rating ~ buying + maintenance + doors + persons + lug_boot + safety" model.ser
</pre>
<p>The portion of interest is the last three arguments:</p>
<ul>
<li>file:uciCarTrain.tsv :  The URI pointing to the file containing the training data.</li>
<li> &#8220;rating ~ buying + maintenance + doors + persons + lug_boot + safety&#8221; : The formula specifying that rating will be predicted as a function of  buying, maintenance, doors, persons, lug_boot  and safety.</li>
<li>model.ser :  Where to write the Java Serialized model result.</li>
</ul>
<p>After that we can run the scoring procedure on the held-out test data:</p>
<pre>
java -cp WinVectorLogistic.Hadoop0.20.2.jar:commons-logging-1.0.4.jar:commons-logging-api-1.0.4.jar
   com.winvector.logistic.demo.LogisticScore
   model.ser file:uciCarTest.tsv scored.tsv
</pre>
<p>In this case the last three arguments are:</p>
<ul>
<li>model.ser :  Where to read the Java Serialized model from.</li>
<li>file:uciCarTest.tsv : The URI pointing to the file to make predictions for.</li>
<li>scored.tsv : Where to write the predictions to.</li>
</ul>
<p>The first few lines of the result file are:</p>
<pre>
predict.rating.FALSE	predict.rating.TRUE	buying	maintenance	doors	persons	lug_boot	safety	rating
0.9999999999999392	6.091299561082107E-14	vhigh	vhigh	2	2	small	low	FALSE
0.9999999824028766	1.759712345446162E-8	vhigh	vhigh	2	2	small	high	FALSE
</pre>
<p>These lines are just lines from the file uciCarTest.tsv (same format as uciCarTrain.tsv) copied over with the addition of the first two columns that show the modeled probabilities of rating acceptable being FALSE or TRUE.  The accuracy of the prediction is computed and written into the runlog if the data had the rating outcomes in it (else we just get a file of predictions, which is the usual application of machine learning).</p>
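<p>The jar's own accuracy calculation is not reproduced here.  As a small illustration of how the scored file can be checked by hand, the Java sketch below assumes a row counts as correct when the outcome with the larger modeled probability matches the recorded rating (ties go to TRUE); the class AccuracySketch and that decision rule are our assumptions for illustration, and the lines are held in memory only because the sample is small.</p>
<pre>
/** Hypothetical sketch: accuracy of a small sample of scored naive TSV lines. */
final class AccuracySketch {
    /** Position of a named column in the header. */
    static int indexOf(String[] header, String name) {
        for (int i = 0; i != header.length; ++i) {
            if (header[i].equals(name)) {
                return i;
            }
        }
        throw new IllegalArgumentException("no column named " + name);
    }

    /** Fraction of rows (after the header line) whose more probable outcome matches rating. */
    static double accuracy(String[] scoredLines) {
        String[] header = scoredLines[0].split("\t", -1);
        int pTrueCol = indexOf(header, "predict.rating.TRUE");
        int pFalseCol = indexOf(header, "predict.rating.FALSE");
        int ratingCol = indexOf(header, "rating");
        int correct = 0;
        for (int r = 1; r != scoredLines.length; ++r) {
            String[] fields = scoredLines[r].split("\t", -1);
            double pTrue = Double.parseDouble(fields[pTrueCol]);
            double pFalse = Double.parseDouble(fields[pFalseCol]);
            // assumed decision rule: predict TRUE when it is at least as probable as FALSE
            boolean predicted = Math.max(pTrue, pFalse) == pTrue;
            boolean actual = Boolean.parseBoolean(fields[ratingCol]);
            if (predicted == actual) {
                ++correct;
            }
        }
        return correct / (double) (scoredLines.length - 1);
    }
}
</pre>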
<p>The details of running the Hadoop versions of the same process depend on the configuration of your Hadoop environment.  Just unpacking the 0.20.2 version of Hadoop will let you try the single-machine version of the MapReduce Logistic Regression process (which will be much slower than the standalone Java version).  To run the training step the Hadoop command line is as follows (notice this time we do not have to specify the logging jars as they are part of the Hadoop environment):</p>
<pre>
hadoop-0.20.2/bin/hadoop jar WinVectorLogistic.Hadoop0.20.2.jar
   logistictrain
   uciCarTrain.tsv "rating ~ buying + maintenance + doors + persons + lug_boot + safety" model.ser
</pre>
<p>And the scoring procedure is below:</p>
<pre>
hadoop-0.20.2/bin/hadoop jar WinVectorLogistic.Hadoop0.20.2.jar
   logisticscore
   model.ser uciCarTest.tsv scoredDir
</pre>
<p>The only operational differences are that the results are written into the file scoredDir/part-r-00000 (as is Hadoop convention) instead of scored.tsv (and an extra &#8220;offset&#8221; column is also included) and data is handled in Files (to allow Hadoop Paths to be formed) instead of URIs.   The Hadoop training and test steps are able to run in this manner because we have constructed WinVectorLogistic.Hadoop0.20.2.jar as an executable jar file with the class com.winvector.logistic.demo.DemoDriver as the class to execute.  This class uses the standard org.apache.hadoop.util.ProgramDriver pattern to run our jobs under the org.apache.hadoop.util.Tool interface.  This means that the standard Hadoop generic flags for specifying cluster configuration will be respected.</p>
<p>The big benefit of all of this packaging is: if this command is run on a large Hadoop cluster (instead of on a single machine) then the input file could be split up and processed in parallel on many machines.   The easiest way to do this is to use Amazon.com&#8217;s <a target="_blank" href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic MapReduce</a>.  This service (used in conjunction with S3 storage and EC2 virtual machines) allows the immediate remote provisioning and execution on a version 0.20.* Hadoop cluster.  To demonstrate this service we created a new S3 Bucket named wvlogistic.  Into wvlogistic we copied a jar of our code compiled against Hadoop 0.20.2 APIs ( WinVectorLogistic.Hadoop0.20.2.jar ) and a moderate sized synthetic training data set ( bigProb.tsv,  created by running: java -cp WinVectorLogistic.Hadoop0.20.2.jar com.winvector.logistic.demo.BigExample bigProb.tsv ).  Once this has been set up (and you have signed up for the Amazon Elastic MapReduce credentials) you can run the training procedure from the <a target="_blank" href="https://console.aws.amazon.com/elasticmapreduce/home">Amazon web UI</a>.  In five steps (following the directions found in <a href="http://aws.amazon.com/articles/3938">Tutorial: How to Create and Debug an Amazon Elastic MapReduce Job Flow</a> ) the job can be configured and launched.</p>
<p>First: press &#8220;Create New Job Flow&#8221; and choose a job name, check &#8220;Run your own application&#8221; and select &#8220;Custom Jar&#8221;.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep1.png" alt="MRExStep1.png" border="0" width="700" /></p>
<p>Step 1/5<br />
</center></p>
<p>Second: specify the location of the jar in your Bucket and give the command line arguments (prepending S3 paths with &#8220;s3n://&#8221;).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep2.png" alt="MRExStep2.png" border="0" width="700"  /></p>
<p>Step 2/5<br />
</center></p>
<p>Third: select the type and number of machine instances you want, run without an EC2 key pair, enable logging and send the log back to your S3 bucket.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep3.png" alt="MRExStep3.png" border="0" width="700"  /></p>
<p>Step 3/5<br />
</center></p>
<p>Fourth: add the default bootstrap action of configuring the Hadoop cluster.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep4.png" alt="MRExStep4.png" border="0" width="700"  /></p>
<p>Step 4/5<br />
</center></p>
<p>Fifth: confirm and launch the job.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep5.png" alt="MRExStep5.png" border="0" width="700"  /></p>
<p>Step 5/5<br />
</center></p>
<p>When the job completes, transfer the result ( bigModel.ser ) back to your local system and you have your new MapReduce-produced logistic model.  We can confirm and use the model locally with a Java command similar to our earlier examples:</p>
<pre>
java -cp WinVectorLogistic.Hadoop0.20.2.jar:commons-logging-1.0.4.jar:commons-logging-api-1.0.4.jar:hadoop-0.20.2-core.jar:log4j-1.2.15.jar
   com.winvector.logistic.demo.LogisticScore
   bigModel.ser bigProb.tsv bigScored.tsv
</pre>
<p>Be aware that at this tens of megabytes scale  there is no advantage in running on a Hadoop cluster (versus using the stand-alone program).  At moderate scale parallelism may not even be attempted (due to block size) and the costs of data motion can overcome the benefit of parallel scans.   The biggest gain is being able to train many models from many gigabytes of data on a single machine without sub-sampling.  While we have the ability to build a logistic model at &#8220;web scale&#8221; (terabytes or petabytes of data) you would not want to use the MapReduce calling pattern until you had a web-scale amount of training data.</p>
<p>The point of this exercise was to take a solid implementation of regularized logistic regression (see our <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">regularized logistic regression article</a>) and use the decomposition into the &#8220;Statistical Query Model&#8221; (as suggested in the NIPS paper &#8220;Map-Reduce for Machine Learning on Multicore&#8221;) to quickly get an intermediate sophistication machine learning method (more sophisticated than Naive Bayes, less sophisticated than Kernelized Support Vector Machines) working at large (beyond RAM) scale.  Briefly: most of the technique is in an interface that considers the mis-fit, gradient of mis-fit and Hessian of mis-fit as linear (summable) functions over the data.  Or in the &#8220;book&#8217;s worth of preparation so we can write the result in one line&#8221; paradigm: all of the machinery we have been discussing is support so that the following summable interface (part of the source code we are distributing) can be used to do all of the work:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/LinearContribution.png" alt="LinearContribution.png" border="0" width="956" height="286" /></p>
<p>Summable Interface<br />
</center></p>
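<p>The interface pictured above lives in the jar's source and is not reproduced here.  As a rough illustration of the idea only, the Java sketch below accumulates the mis-fit (deviance), its gradient and its Hessian as plain sums over rows, merges partial sums from different data slices, and then takes one Newton step.  The class SummableSketch, its fields and its method names are hypothetical, and the simple ridge penalty is our stand-in assumption for the regularization discussed in the earlier article.</p>
<pre>
/** Hypothetical sketch of summable logistic regression terms (not the interface in the jar). */
final class SummableSketch {
    final int dim;
    double deviance;           // sum over rows of -2*log(probability given to the observed outcome)
    final double[] gradient;   // sum over rows of (y - p) * x
    final double[][] hessian;  // sum over rows of p*(1-p) * x*x^T (negated Hessian of the log-likelihood)

    SummableSketch(int dim) {
        this.dim = dim;
        this.gradient = new double[dim];
        this.hessian = new double[dim][dim];
    }

    /** Map side: add one data row's contribution at the current coefficients w. */
    void addRow(double[] w, double[] x, boolean y) {
        double z = 0.0;
        for (int i = 0; i != dim; ++i) {
            z += w[i] * x[i];
        }
        double p = 1.0 / (1.0 + Math.exp(-z));
        deviance += -2.0 * Math.log(y ? p : 1.0 - p);
        double residual = (y ? 1.0 : 0.0) - p;
        for (int i = 0; i != dim; ++i) {
            gradient[i] += residual * x[i];
            for (int j = 0; j != dim; ++j) {
                hessian[i][j] += p * (1.0 - p) * x[i] * x[j];
            }
        }
    }

    /** Reduce side: fold in a partial sum computed over another slice of the data. */
    void add(SummableSketch other) {
        deviance += other.deviance;
        for (int i = 0; i != dim; ++i) {
            gradient[i] += other.gradient[i];
            for (int j = 0; j != dim; ++j) {
                hessian[i][j] += other.hessian[i][j];
            }
        }
    }

    /** One Newton step on the ridge-penalized problem: solve (H + ridge*I) d = g - ridge*w. */
    double[] newtonStep(double[] w, double ridge) {
        double[][] a = new double[dim][dim + 1];
        for (int i = 0; i != dim; ++i) {
            for (int j = 0; j != dim; ++j) {
                a[i][j] = hessian[i][j] + (i == j ? ridge : 0.0);
            }
            a[i][dim] = gradient[i] - ridge * w[i];
        }
        // Gaussian elimination without pivoting; acceptable in this sketch because
        // the matrix is symmetric positive definite when ridge is positive.
        for (int k = 0; k != dim; ++k) {
            for (int i = k + 1; i != dim; ++i) {
                double f = a[i][k] / a[k][k];
                for (int j = k; j != dim + 1; ++j) {
                    a[i][j] -= f * a[k][j];
                }
            }
        }
        double[] next = new double[dim];
        for (int i = dim - 1; i != -1; --i) {
            double s = a[i][dim];
            for (int j = i + 1; j != dim; ++j) {
                s -= a[i][j] * next[j];
            }
            next[i] = s / a[i][i];   // holds the step d[i] for now
        }
        for (int i = 0; i != dim; ++i) {
            next[i] += w[i];         // new coefficients are w + d
        }
        return next;
    }
}
</pre>
<p>Because add only sums partial results, each mapper could call addRow over its own slice of the data and a reducer could combine the pieces before taking the step; the pass would then be repeated with the new coefficients until the deviance stops improving.  Storage is proportional to the square of the number of coefficients and does not grow with the number of rows.</p>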
<p>Of course once you have the framework up that makes one non-trivial task easy you have likely made many other non-trivial tasks easy.</p>
<p>We hope this demonstration and examining the source code in our WinVectorLogistic.Hadoop0.20.2.jar will help you find ways to tackle your large data machine learning problems.</p>
<hr/>
<p>Code License:</p>
<blockquote><p>
Packages com.winvector.*, extra.winvector.*<br />
	     Code for performing logistic regression on Hadoop.<br />
	     Copyright (C) Win Vector LLC 2010 (contact: John Mount jmount@win-vector.com).<br />
	     Distributed under GNU Affero General Public License version 3 (2007, see http://www.gnu.org/licenses/agpl.html ).<br />
	       This program is free software: you can redistribute it and/or modify<br />
	       it under the terms of the GNU Affero General Public License as<br />
	       published by the Free Software Foundation, only version 3 of the<br />
	       License.<br />
	       This program is distributed in the hope that it will be useful,<br />
	       but WITHOUT ANY WARRANTY; without even the implied warranty of<br />
	       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the<br />
	       GNU Affero General Public License for more details.<br />
	       You should have received a copy of the GNU Affero General Public License<br />
	       along with this program.  If not, see <http://www.gnu.org/licenses/>.<br />
	    (Source code in jar, see also http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/ )
</p></blockquote>
<hr/>
Note Dec-15-2011:  We have moved the code distribution to <a target="_blank" href="https://github.com/WinVector/SQL-Screwdriver">github.com/WinVector/SQL-Screwdriver</a> .  We have fixed some major bugs in the supplied optimizers and moved com.winvector.logistic.LogisticScore and com.winvector.logistic.LogisticTrain from freeform arguments to Apache CLI.  The new command lines need flags as shown below:</p>
<pre>
usage: com.winvector.logistic.LogisticTrain
 -formula &lt;arg&gt;      formula to fit
 -inmemory           if set data is held in memory during training
 -resultSer &lt;arg&gt;    (optional) file to write seriazlized results to
 -resultTSV &lt;arg&gt;    (optional) file to write TSV results to
 -trainClass &lt;arg&gt;   (optional) alternate class to use for training
 -trainHDL &lt;arg&gt;     XML file to get JDBC connection to training data
                     table
 -trainTBL &lt;arg&gt;     table to use from database for training data
 -trainURI &lt;arg&gt;     URI to get training TSV data from
</pre>
<pre>
usage: com.winvector.logistic.LogisticScore
 -dataHDL &lt;arg&gt;      XML file to get JDBC connection to scoring data table
 -dataTBL &lt;arg&gt;      table to use from database for scoring data
 -dataURI &lt;arg&gt;      URI to get scoring data from
 -modelFile &lt;arg&gt;    file to read serialized model from
 -resultFile &lt;arg&gt;   file to write results to
</pre>
<hr/>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Gradients via Reverse Accumulation</title>
		<link>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=gradients-via-reverse-accumulation</link>
		<comments>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 00:00:04 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Automatic Differentiation]]></category>
		<category><![CDATA[Conjugate Gradient]]></category>
		<category><![CDATA[Gradient]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Reverse Accumulation]]></category>
		<category><![CDATA[Scala]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1493</guid>
		<description><![CDATA[We extend the ideas from Automatic Differentiation with Scala to include reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients. As the tables, diagrams and equations do not translate well into HTML, our full article is available here in PDF: [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/' rel='bookmark' title='Automatic Detection of Potential Deadlock'>Automatic Detection of Potential Deadlock</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We extend the ideas from <a target="ext" href="http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/">Automatic Differentiation with Scala</a> to include <em>reverse accumulation</em>.  Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients.<span id="more-1493"></span><br />
As the tables, diagrams and equations do not translate well into HTML, our full article is available here in PDF: <a href="http://www.win-vector.com/dfiles/ReverseAccumulation.pdf">http://www.win-vector.com/dfiles/ReverseAccumulation.pdf</a>.</p>
<p>The purpose of our article is to explain reverse accumulation automatic differentiation clearly (and to release some sample code and timing results).  A side effect of the article is to make sense of the following two diagrams:</p>
<p>If the following is a picture of standard or forward differentiation:</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/cutFwd.png" alt="cutFwd.png" border="0" width="408" height="677" /></p>
<p>then the following is a picture of reverse accumulation:</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/cutRev.png" alt="cutRev.png" border="0" width="487" height="739" /></p>
<hr/>
Example code now distributed from: <a target="_blank" href="https://github.com/WinVector/AutoDiff">github.com/WinVector/AutoDiff</a>.</p>
<hr/>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/' rel='bookmark' title='Automatic Detection of Potential Deadlock'>Automatic Detection of Potential Deadlock</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Automatic Differentiation with Scala</title>
		<link>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=automatic-differentiation-with-scala</link>
		<comments>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/#comments</comments>
		<pubDate>Tue, 15 Jun 2010 04:19:20 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Automatic Differentiation]]></category>
		<category><![CDATA[Conjugate Gradient]]></category>
		<category><![CDATA[Dual Numbers]]></category>
		<category><![CDATA[Geometric Median]]></category>
		<category><![CDATA[Numeric Methods]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Scala]]></category>
		<category><![CDATA[Steiner Tree]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1481</guid>
		<description><![CDATA[This article is a worked-out exercise in applying the Scala type system to solve a small scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion. Usually we work using a combination of databases, Java, optimization libraries and analysis suites (like R). The reason is that, [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
<li><a href='http://www.win-vector.com/blog/2011/08/programmers-should-know-r/' rel='bookmark' title='Programmers Should Know R'>Programmers Should Know R</a></li>
<li><a href='http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/' rel='bookmark' title='Automatic Detection of Potential Deadlock'>Automatic Detection of Potential Deadlock</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>This article is a worked-out exercise in applying the <a href="http://www.scala-lang.org/" target="ext">Scala</a> type system to solve a small scale optimization problem.    For this article we supply <a href="http://www.win-vector.com/dfiles/ScalaDiff.jar">complete Scala source code</a> (under a GPLv3 license) and some design discussion.<span id="more-1481"></span>Usually we work using a combination of databases, Java, optimization libraries and analysis suites (like R).  The reason is that, for our typical problems, Java hits a sweet spot of trading off runtime performance against ease of development and maintenance.  In the tens of gigabytes range (data sets larger than the Wikipedia but smaller than the Web) Java outperforms the scripting languages (Ruby, Python &#8230;) and is much easier to develop in and document than C++.  This sweet spot is both subjective and situational: if the tasks were smaller and in a services framework Python is a better choice, if performance is paramount then C or C++ (with the STL) and Hadoop are a better choice, if pre-built statistical libraries are needed then R becomes a better choice.  For the type of problem we present here Scala is a very good choice.</p>
<style type="text/css">
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
pre { line-height: 125%; }
body .hll { background-color: #ffffcc }
body  { background: #f8f8f8; }
body .c { color: #408080; font-style: italic } /* Comment */
body .err { border: 1px solid #FF0000 } /* Error */
body .k { color: #008000; font-weight: bold } /* Keyword */
body .o { color: #666666 } /* Operator */
body .cm { color: #408080; font-style: italic } /* Comment.Multiline */
body .cp { color: #BC7A00 } /* Comment.Preproc */
body .c1 { color: #408080; font-style: italic } /* Comment.Single */
body .cs { color: #408080; font-style: italic } /* Comment.Special */
body .gd { color: #A00000 } /* Generic.Deleted */
body .ge { font-style: italic } /* Generic.Emph */
body .gr { color: #FF0000 } /* Generic.Error */
body .gh { color: #000080; font-weight: bold } /* Generic.Heading */
body .gi { color: #00A000 } /* Generic.Inserted */
body .go { color: #808080 } /* Generic.Output */
body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
body .gs { font-weight: bold } /* Generic.Strong */
body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
body .gt { color: #0040D0 } /* Generic.Traceback */
body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
body .kp { color: #008000 } /* Keyword.Pseudo */
body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
body .kt { color: #B00040 } /* Keyword.Type */
body .m { color: #666666 } /* Literal.Number */
body .s { color: #BA2121 } /* Literal.String */
body .na { color: #7D9029 } /* Name.Attribute */
body .nb { color: #008000 } /* Name.Builtin */
body .nc { color: #0000FF; font-weight: bold } /* Name.Class */
body .no { color: #880000 } /* Name.Constant */
body .nd { color: #AA22FF } /* Name.Decorator */
body .ni { color: #999999; font-weight: bold } /* Name.Entity */
body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
body .nf { color: #0000FF } /* Name.Function */
body .nl { color: #A0A000 } /* Name.Label */
body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
body .nt { color: #008000; font-weight: bold } /* Name.Tag */
body .nv { color: #19177C } /* Name.Variable */
body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
body .w { color: #bbbbbb } /* Text.Whitespace */
body .mf { color: #666666 } /* Literal.Number.Float */
body .mh { color: #666666 } /* Literal.Number.Hex */
body .mi { color: #666666 } /* Literal.Number.Integer */
body .mo { color: #666666 } /* Literal.Number.Oct */
body .sb { color: #BA2121 } /* Literal.String.Backtick */
body .sc { color: #BA2121 } /* Literal.String.Char */
body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
body .s2 { color: #BA2121 } /* Literal.String.Double */
body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
body .sh { color: #BA2121 } /* Literal.String.Heredoc */
body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
body .sx { color: #008000 } /* Literal.String.Other */
body .sr { color: #BB6688 } /* Literal.String.Regex */
body .s1 { color: #BA2121 } /* Literal.String.Single */
body .ss { color: #19177C } /* Literal.String.Symbol */
body .bp { color: #008000 } /* Name.Builtin.Pseudo */
body .vc { color: #19177C } /* Name.Variable.Class */
body .vg { color: #19177C } /* Name.Variable.Global */
body .vi { color: #19177C } /* Name.Variable.Instance */
body .il { color: #666666 } /* Literal.Number.Integer.Long */
 </style>
<h2>Our Example Problem</h2>
<p>Our small scale problem is this:  we have a number of target points on a map and we want to pick a central point to <em>directly</em> connect to all of these points with wire.  Our goal is to minimize the total amount of wire used.  This problem is called the <a href="http://en.wikipedia.org/wiki/Geometric_median" ref="ext">&#8220;Geometric Median&#8221;</a>.  So we are trying to find a point that minimizes the sum of distances from our chosen center to every target point. If we were trying to minimize the sum of squared distances from our chosen center to every target point the answer would be obvious: the average or mean (which by Hooke&#8217;s law is also the point where a set of identical springs would relax to).  The mean is in fact a fairly good guess, but you can do better (which could be important if the &#8220;wire&#8221; is expensive, such as cutting irrigation or drainage ditches).  For example given the three target points (20,0), (-1,-1) and (-1,1) the optimal point is (-0.42,0) not the mean (6,0) and the choice of optimal point represents an over 19% savings in total wiring distance (see figure).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/points.png" alt="points.png" border="0" width="525" height="525" /><br />
</center></p>
<p>This is a substantial saving in cost.  </p>
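<p>The figures quoted above are easy to check by hand.  The following is a small sketch (the object and method names are ours, chosen for illustration, and are not part of the distributed source) that computes the total wiring distance at the mean and at the approximate optimal point:</p>
<div class="highlight">
<pre>object WiringCost {
  // Target points from the example above.
  val targets: Seq[(Double, Double)] = Seq((20.0, 0.0), (-1.0, -1.0), (-1.0, 1.0))

  // Total straight-line wire needed to connect center c to every target.
  def cost(c: (Double, Double)): Double =
    targets.map { case (x, y) =&gt; math.hypot(x - c._1, y - c._2) }.sum

  // Coordinate-wise mean of the targets.
  val mean: (Double, Double) =
    (targets.map(_._1).sum / targets.size, targets.map(_._2).sum / targets.size)

  def main(args: Array[String]): Unit = {
    val atMean = cost(mean)            // about 28.14
    val atBest = cost((-0.42, 0.0))    // about 22.73
    println(f"mean $atMean%.2f, best $atBest%.2f, saving ${100 * (atMean - atBest) / atMean}%.1f%%")
  }
}
</pre>
</div>
<p>The relative saving works out to a little over 19%, in agreement with the figure.</p>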
<p>The problem changes as we consider variations.  If indirect connections (such as routing one point through another, which may or may not be possible for reasons of capacity or safety) and multiple new centers are allowed  we then have an instance of the <a href="http://en.wikipedia.org/wiki/Steiner_tree_problem" ref="ext">Steiner Tree Problem</a> which is harder  to solve (since it is known to be NP complete).  If no new centers are allowed (all routing must be between pre-existing target points) then we have a Spanning Tree Problem- which admits very quick solutions.</p>
<p>We bring up the geometric median as a mere example.  We don&#8217;t intend for our code to solve only the geometric median problem and we don&#8217;t intend to touch on the literature of specialized methods for solving the geometric median problem.  Instead we are trying to demonstrate how quickly you can develop prototype solutions if you have a few good tools (like various optimizers) available in your toolkit.  Numeric optimizers may sound exotic, but they often are the kind of thing you want to experiment with and link directly into your code.</p>
<h2>Optimization as General Tool</h2>
<p>Now that we have the example problem we can describe a solution strategy.  In this case the solution uses code &#8220;we wished we had lying around&#8221; before we started on the problem.  We will pretend we have the tools we want ready to solve our problem and then we will pay our debt and build the required tools.  The issue is that there is not an obvious closed form for the solution of the geometric median problem.  So we are forced to work a bit harder.  In this case harder means we need to solve an optimization problem.  Consider the contour plot of the total wiring cost as a function of where we choose to place our center.  Our optimal point (-0.42,0) had wiring cost of 22.73 and the contour plot given here shows concentric regions of solution positions with higher cost.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/contour.png" alt="contour.png" border="0" width="525" height="525" /><br />
</center></p>
<p>In general it is unwise to throw an optimizer at an arbitrary problem and hope to find the globally best solution.  But in this case (and in many similar situations) we can prove that a simple local optimizer will in fact find the unique best solution.  This is a property of the problem not of the optimizer.  The concentric regions shown in the contour plot have a very nice shape: they are <a href="http://en.wikipedia.org/wiki/Convex_set" ref="ext">convex</a>.   That is: they have no intrusions- for any two points drawn from one of these shapes the straight line segment between these points stays inside the given shape.  We don&#8217;t have to depend on observation- we can actually prove this is always the case for this problem.  The wiring cost from a proposed center to any single target point is a <a href="http://en.wikipedia.org/wiki/Convex_function" ref="ext">convex function</a> of where we choose to place our center (a convex function is a function whose graph never reaches above the secant line drawn between any two points on its graph).  The total wiring cost is just the sum of the wiring costs to each target point.  And to finish: the sum of a collection of convex functions is itself a convex function.  Since the region enclosed by any contour of a convex function is itself convex, we have proven the statement.</p>
<p>But how does this help us?  There is a standard technique to find &#8220;local minima&#8221; of a function by inspecting a function for places where the gradient is zero (points where there is no obvious down hill direction on the contour plot).  This technique usually can only be guaranteed to find local minima (places where no small change improves your situation).  In general there is no guarantee that the local minimum you find is in fact the global minimum (the best possible solution), except when you are dealing with a convex function.  When a function is convex then all of the local minima are always grouped together into a single convex connected shape (if not, a line drawn between two remote minima would violate the convexity definition).  And if the function is never flat then this set is a single unique point: the unique best solution.  Our inspection technique will be a gradient driven optimizer- that is an optimizer that when the gradient is non-zero improves its objective by running down hill and halts when the gradient is zero.</p>
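<p>To make &#8220;run down hill and halt when the gradient is zero&#8221; concrete, here is a minimal sketch of plain gradient descent with a step-halving rule on the three point example.  This is deliberately <em>not</em> the conjugate gradient method used in the rest of the article; the names, tolerance and step rule are illustrative assumptions, and the sketch does not handle a center that lands exactly on a target point (where the gradient is undefined).</p>
<div class="highlight">
<pre>object DownhillSketch {
  val targets: Array[Array[Double]] =
    Array(Array(20.0, 0.0), Array(-1.0, 1.0), Array(-1.0, -1.0))

  // Sum of Euclidean distances from p to each target.
  def cost(p: Array[Double]): Double =
    targets.map(t =&gt; math.hypot(p(0) - t(0), p(1) - t(1))).sum

  // Analytic gradient: each target contributes the unit vector pointing from it to p.
  def gradient(p: Array[Double]): Array[Double] = {
    val g = Array(0.0, 0.0)
    for (t &lt;- targets) {
      val d = math.hypot(p(0) - t(0), p(1) - t(1))
      g(0) += (p(0) - t(0)) / d
      g(1) += (p(1) - t(1)) / d
    }
    g
  }

  // Step against the gradient, halving the step until the cost improves;
  // stop when the gradient is (nearly) zero or no improving step is found.
  def descend(start: Array[Double], tol: Double = 1.0e-6, maxIter: Int = 10000): Array[Double] = {
    var p = start.clone()
    var g = gradient(p)
    var iter = 0
    while (iter &lt; maxIter &amp;&amp; math.hypot(g(0), g(1)) &gt; tol) {
      var step = 1.0
      var q = Array(p(0) - step * g(0), p(1) - step * g(1))
      while (cost(q) &gt;= cost(p) &amp;&amp; step &gt; 1.0e-12) {
        step /= 2
        q = Array(p(0) - step * g(0), p(1) - step * g(1))
      }
      if (cost(q) &lt; cost(p)) p = q else iter = maxIter // no progress: give up
      g = gradient(p)
      iter += 1
    }
    p
  }
}
</pre>
</div>
<p>Started from the mean, <code>DownhillSketch.descend(Array(6.0, 0.0))</code> should settle near (-0.42,0).  Because the cost is convex, the point where the descent stalls is the global optimum and not merely a local one.</p>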
<p>The stated function to minimize is to sum the distance from our proposed center to each target point.  We can write this as the sum of the distances:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/dist1.png" alt="dist1.png" border="0" width="309" height="81" /><br />
</center></p>
<p>( <img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/euclid1.png" alt="euclid1.png" border="0" width="119" height="37" /> which is the traditional Euclidean or L2 distance).  This function actually has one subtle flaw that we will deal with in the appendix (see: Fixing Smoothness).</p>
<h2>Using Scala to Apply the Optimization Solution</h2>
<p>To find our optimal center placement using Scala we first write our cost or objective as a Scala function:</p>
<div class="highlight">
<pre>    <span class="k">val</span> <span class="n">dat</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]]</span> <span class="o">=</span> <span class="nc">Array</span><span class="o">(</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="mi">20</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">,</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">)</span>
    <span class="o">)</span>

    <span class="k">def</span> <span class="n">fx</span><span class="o">(</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="kt">Double</span> <span class="o">=</span> <span class="o">{</span>
      <span class="k">val</span> <span class="n">dim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
      <span class="k">val</span> <span class="n">npoint</span> <span class="k">=</span> <span class="n">dat</span><span class="o">.</span><span class="n">length</span>
      <span class="k">var</span> <span class="n">total</span> <span class="k">=</span> <span class="mf">0.0</span>
      <span class="k">for</span><span class="o">(</span><span class="n">k</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">npoint</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
        <span class="k">var</span> <span class="n">term</span> <span class="k">=</span> <span class="mf">0.0</span>
        <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">dim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
          <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">-</span> <span class="n">dat</span><span class="o">(</span><span class="n">k</span><span class="o">)(</span><span class="n">i</span><span class="o">)</span>
          <span class="n">term</span> <span class="k">=</span> <span class="n">term</span> <span class="o">+</span> <span class="n">diff</span><span class="o">*</span><span class="n">diff</span>
        <span class="o">}</span>
        <span class="n">total</span> <span class="k">=</span> <span class="n">total</span> <span class="o">+</span> <span class="n">scala</span><span class="o">.</span><span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="o">(</span><span class="n">term</span><span class="o">)</span>
      <span class="o">}</span>
      <span class="n">total</span>
    <span class="o">}</span>
</pre>
</div>
<p>Scala is succinct and it is a great convenience to have a function definition capture data from its environment.   What we would like to do is generate an initial guess at the solution (we use the mean as our initial guess) and then call an optimizer (in this case a conjugate gradient optimizer) to do all the work:</p>
<div class="highlight">
<pre> <span class="k">val</span> <span class="n">p0</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="o">=</span> <span class="n">mean</span><span class="o">(</span><span class="n">dat</span><span class="o">)</span>
 <span class="k">val</span> <span class="o">(</span><span class="n">pF</span><span class="o">,</span><span class="n">fpF</span><span class="o">)</span> <span class="k">=</span> <span class="nc">CG</span><span class="o">.</span><span class="n">minimize</span><span class="o">(</span><span class="n">fx</span><span class="o">,</span><span class="n">p0</span><span class="o">)</span>
</pre>
</div>
<p>At this point we would be done, except the conjugate gradient method (which is superior to gradient descent and many of the non-gradient methods) requires a gradient.<br />
We could provide a numeric estimate of the gradient by the following divided difference method:</p>
<div class="highlight">
<pre>  <span class="k">def</span> <span class="n">gradientD</span><span class="o">(</span><span class="n">f</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span><span class="k">=&gt;</span><span class="kt">Double</span><span class="o">,</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="o">=</span> <span class="o">{</span>
    <span class="k">val</span> <span class="n">xdim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
    <span class="k">val</span> <span class="n">p2</span> <span class="k">=</span> <span class="n">copy</span><span class="o">(</span><span class="n">p</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">base</span> <span class="k">=</span> <span class="n">f</span><span class="o">(</span><span class="n">p2</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">ret</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">](</span><span class="n">xdim</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">delta</span> <span class="k">=</span> <span class="mf">1.0e-6</span>
    <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">xdim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
      <span class="n">p2</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">+</span> <span class="n">delta</span>
      <span class="k">val</span> <span class="n">fplus</span> <span class="k">=</span> <span class="n">f</span><span class="o">(</span><span class="n">p2</span><span class="o">)</span>
      <span class="n">p2</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span>
      <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="o">(</span><span class="n">fplus</span><span class="o">-</span><span class="n">base</span><span class="o">)/</span><span class="n">delta</span>
      <span class="n">ret</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">diff</span>
    <span class="o">}</span>
    <span class="n">ret</span>
  <span class="o">}</span>
</pre>
</div>
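<p>One practical caveat with any divided difference is the choice of delta: too large a step gives a poor approximation of the derivative, while too small a step loses most of its digits to floating point rounding.  The following one variable sketch (the names are ours, for illustration only) shows the effect on a function whose derivative we know exactly:</p>
<div class="highlight">
<pre>object StepSizeSketch {
  // Forward divided difference estimate of f'(x), the one variable analogue of gradientD.
  def forwardDiff(f: Double =&gt; Double, x: Double, delta: Double): Double =
    (f(x + delta) - f(x)) / delta

  def main(args: Array[String]): Unit = {
    val exact = 0.5 // derivative of sqrt(x) at x = 1
    for (delta &lt;- Seq(1.0e-2, 1.0e-6, 1.0e-15)) {
      val est = forwardDiff(x =&gt; math.sqrt(x), 1.0, delta)
      println(s"delta=$delta estimate=$est error=${math.abs(est - exact)}")
    }
  }
}
</pre>
</div>
<p>The middle step size gives by far the smallest error of the three; the smallest step is off in the first or second significant digit.</p>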
<p>This numeric divided difference method often outperforms non-derivative optimization methods (like Powell&#8217;s Method and the Nelder-Mead Amoeba method).  But the technique can run into numeric difficulties.   We can remedy this if we are willing to write our function in a slightly more general way.   If we re-encode our function in a generic manner we can use <a href="http://en.wikipedia.org/wiki/Automatic_differentiation" target="ext">automatic differentiation</a>  (not to be confused with numeric differentiation or with symbolic differentiation) to produce a reliable gradient for optimization.  What we need to do is re-write our function to work over an abstract field of numbers instead of only the machine supplied doubles.  In fact what we need to do is specify a generic function that will work over any field, with the field to be determined later.  The code to do this in Scala is very similar to the non-generic code:</p>
<div class="highlight">
<pre>   <span class="k">val</span> <span class="n">genericFx</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">VectorFN</span> <span class="o">{</span>
      <span class="k">def</span> <span class="n">apply</span><span class="o">[</span><span class="kt">Y</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">Y</span><span class="o">]](</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Y</span><span class="o">])</span><span class="k">:</span><span class="kt">Y</span> <span class="o">=</span> <span class="o">{</span>
        <span class="k">val</span> <span class="n">field</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="mi">0</span><span class="o">).</span><span class="n">field</span>
        <span class="k">val</span> <span class="n">dim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
        <span class="k">val</span> <span class="n">npoint</span> <span class="k">=</span> <span class="n">dat</span><span class="o">.</span><span class="n">length</span>
        <span class="k">var</span> <span class="n">total</span> <span class="k">=</span> <span class="n">field</span><span class="o">.</span><span class="n">zero</span>
        <span class="k">for</span><span class="o">(</span><span class="n">k</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">npoint</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
          <span class="k">var</span> <span class="n">term</span> <span class="k">=</span> <span class="n">field</span><span class="o">.</span><span class="n">zero</span>
          <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">dim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
            <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">-</span> <span class="n">field</span><span class="o">.</span><span class="n">inject</span><span class="o">(</span><span class="n">dat</span><span class="o">(</span><span class="n">k</span><span class="o">)(</span><span class="n">i</span><span class="o">))</span>
            <span class="n">term</span> <span class="k">=</span> <span class="n">term</span> <span class="o">+</span> <span class="n">diff</span><span class="o">*</span><span class="n">diff</span>
          <span class="o">}</span>
          <span class="n">total</span> <span class="k">=</span> <span class="n">total</span> <span class="o">+</span> <span class="n">smoothSQRT</span><span class="o">(</span><span class="n">term</span><span class="o">)</span>
        <span class="o">}</span>
        <span class="n">total</span>
      <span class="o">}</span>
    <span class="o">}</span>
</pre>
</div>
<p>Notice that the code is very similar to the &#8220;def fx()&#8221; code.  The key differences are that we had to define genericFx as extending a trait (a type of Scala interface) called VectorFN and inside this trait extension we defined a parameterized function named apply().  apply() is a generic function that is willing to work over any type Y where Y is at least of type NumberBase[Y] (we will get more into what that means in a moment).  The difference in notation is that while the Scala function <em>syntax</em> can not specify a generic function with free type parameters (the incompletely specified Y) the Scala <em>semantics</em> are strong enough to implement this.  In fact standard function definitions (such as &#8220;def fx()&#8221;) are just syntactic sugar for extending the Scala built-in <a href="http://www.scala-lang.org/docu/files/api/scala/Function1.html" target="ext">Function1 trait</a>.  With a generic objective function in hand all we need is conjugate gradient code that is expecting a VectorFN (and willing to call apply() instead of just using naked function parentheses) and some type NumberBase[Y] that can compute gradients for us.  The Scala compiler can specialize our genericFx() into one version for quick calculation and another for gradients.  How this is done is what we will discuss next.  From our point of view our problem is solved with the following one line of code:</p>
<div class="highlight">
<pre><span class="k">val</span> <span class="o">(</span><span class="n">pF</span><span class="o">,</span><span class="n">fpF</span><span class="o">)</span> <span class="k">=</span> <span class="nc">CG</span><span class="o">.</span><span class="n">minimize</span><span class="o">(</span><span class="n">genericFx</span><span class="o">,</span><span class="n">p0</span><span class="o">)</span>
</pre>
</div>
<p>This should always be your goal- build sufficient preparation so your last step is an &#8220;obvious one liner.&#8221;</p>
<h2>What Tools we Wish we Had Lying Around</h2>
<p>We supply in our example some workable conjugate gradient code, but that is standard so we will not discuss it.  What is of interest (and facilitated by Scala&#8217;s parametrized type system) is the implementation of <a href="http://en.wikipedia.org/wiki/Dual_number" target="ext">dual numbers</a> as a framework to supply automatic differentiation.  An implementation of dual numbers as a NumberBase[DualNumber] type is the core of our demonstration.</p>
<p>Dual numbers are an algebraic structure written as pairs of real numbers &#8220;(a,b)&#8221;.  The arithmetic table for dual numbers is given below:</p>
<table>
<tr>
<td>(a,b) + (c,d)</td>
<td>=</td>
<td>((a+c) , (b+d))</td>
</tr>
<tr>
<td>(a,b) &#8211; (c,d)</td>
<td>=</td>
<td>((a-c) , (b-d))</td>
</tr>
<tr>
<td>(a,b) * (c,d)</td>
<td>=</td>
<td>((a*c) , (a*d+b*c))</td>
</tr>
<tr>
<td>(a,b) / (c,d)</td>
<td>=</td>
<td>((a/c) , ((b*c-a*d)/(c*c)))</td>
</tr>
</table>
<p>In a dual number (a,b) &#8220;a&#8221; is the &#8220;large&#8221; or &#8220;standard&#8221; part of the number.  You can check from the arithmetic table that the pair of dual numbers (a,0) and (c,0) behave just as we would expect the real numbers a and c to behave.  In the dual number (a,b) &#8220;b&#8221; is the &#8220;small&#8221; or &#8220;ideal&#8221; portion of the number.  From the multiplication rule above  we can observe two rules: (0,b) * (c,0) = (0,b*c) (something small times anything else is small) and (0,b)*(0,d) = (0,0) (two small things become zero when multiplied).  Essentially the dual numbers are carrying around the first two terms of a Taylor series: we get as a result both the function value and the function derivative.  For a function f() over the real numbers we extend f() to work over the dual numbers by defining: f((a,b)) = (f(a),b f&#8217;(a)) (which is consistent with the previously defined arithmetic). We can check that the dual numbers obey the usual laws of arithmetic (associative, commutative, distributive, identities and, for numbers with a non-zero standard part, inverses).  The punchline is that over the dual numbers the divided difference estimate of f&#8217;(x) (the derivative of f() evaluated at x)  is in fact exact in the sense that f((x,1)) = (f(x),f&#8217;(x)) (or f((x,0)+(0,1)) &#8211; f((x,0)) = (0, f&#8217;(x))).  Implementing the DualNumber class is little more than transcribing the above arithmetic table into Scala.</p>
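<p>As a rough illustration of how little code this takes, here is a stand-alone sketch of a dual number type.  It is a hypothetical minimal version written for this discussion (the names Dual and DualSketch are ours) and is not the DualNumber class from the distributed source; division uses the usual quotient rule and the square root is extended by the f((a,b)) = (f(a),b f&#8217;(a)) rule.</p>
<div class="highlight">
<pre>// Hypothetical minimal dual number; not the DualNumber class from the distributed source.
final case class Dual(a: Double, b: Double) {
  def +(that: Dual): Dual = Dual(a + that.a, b + that.b)
  def -(that: Dual): Dual = Dual(a - that.a, b - that.b)
  def *(that: Dual): Dual = Dual(a * that.a, a * that.b + b * that.a)
  // Requires that.a != 0.
  def /(that: Dual): Dual =
    Dual(a / that.a, (b * that.a - a * that.b) / (that.a * that.a))
  // Extension rule f((a,b)) = (f(a), b*f'(a)) applied to the square root; requires a &gt; 0.
  def sqrt: Dual = { val r = math.sqrt(a); Dual(r, b / (2.0 * r)) }
}

object DualSketch {
  // Evaluate f at (x,1) and read the derivative off the second component.
  def derivative(f: Dual =&gt; Dual, x: Double): Double = f(Dual(x, 1.0)).b

  def main(args: Array[String]): Unit = {
    val one = Dual(1.0, 0.0)
    println(derivative(x =&gt; x * x * x, 2.0))          // d/dx x^3 at 2 is 3*2^2 = 12
    println(derivative(x =&gt; one / x, 2.0))            // d/dx 1/x at 2 is -1/4
    println(derivative(x =&gt; (x * x + one).sqrt, 0.0)) // d/dx sqrt(x^2+1) at 0 is 0
  }
}
</pre>
</div>
<p>No step size appears anywhere: the derivative is carried along exactly by the arithmetic rules, which is what separates this from the divided difference estimate.</p>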
<p>We have already seen how to write code that uses NumberBase[Y] types (genericFx() itself is an example).  A more complicated example is the CG.minimize() code which not only accepts a generic function (in the form of VectorFN) but then specializes it to NumberBase[DualNumber] to compute gradients and also specializes to NumberBase[MDouble] for quick calculation during line searches (MDouble is just an adapter for machine Doubles, used for speed).  The ability to re-specialize a function is one of the advantages of a parameterized type system.  The DualNumbers are an example of forward automatic differentiation.  We could also use the same object framework to capture a representation of the computation path and apply more sophisticated methods such as reverse automatic differentiation. </p>
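<p>The re-specialization idea can be sketched in a few lines.  Everything below is a hypothetical stand-in written for illustration (Num plays the role of NumberBase, PlainNum of MDouble, DualNum of DualNumber and GenericFn of VectorFN); the real types in the distributed source are richer.  The point is only that one generic definition is evaluated at two different number types:</p>
<div class="highlight">
<pre>// Hypothetical stand-ins for the article's NumberBase, MDouble, DualNumber and VectorFN.
trait Num[Y &lt;: Num[Y]] {
  def +(that: Y): Y
  def *(that: Y): Y
  def inject(v: Double): Y // lift a plain constant into this number type
}

final case class PlainNum(v: Double) extends Num[PlainNum] {
  def +(that: PlainNum): PlainNum = PlainNum(v + that.v)
  def *(that: PlainNum): PlainNum = PlainNum(v * that.v)
  def inject(c: Double): PlainNum = PlainNum(c)
}

final case class DualNum(a: Double, b: Double) extends Num[DualNum] {
  def +(that: DualNum): DualNum = DualNum(a + that.a, b + that.b)
  def *(that: DualNum): DualNum = DualNum(a * that.a, a * that.b + b * that.a)
  def inject(c: Double): DualNum = DualNum(c, 0.0)
}

// A function written once, usable at any number type.
trait GenericFn {
  def apply[Y &lt;: Num[Y]](p: Array[Y]): Y
}

object RespecializeSketch {
  // f(x, y) = x*x*y + 3
  val f: GenericFn = new GenericFn {
    def apply[Y &lt;: Num[Y]](p: Array[Y]): Y =
      p(0) * p(0) * p(1) + p(0).inject(3.0)
  }

  // Fast path: plain doubles, value only.
  def value(x: Array[Double]): Double = f(x.map(PlainNum(_))).v

  // Gradient path: seed one coordinate at a time with derivative part 1.
  def gradient(x: Array[Double]): Array[Double] =
    x.indices.map { i =&gt;
      val seeded = x.indices.map(j =&gt; DualNum(x(j), if (i == j) 1.0 else 0.0)).toArray
      f(seeded).b
    }.toArray
}
</pre>
</div>
<p>At the point (2,5) the value is 23 and the gradient is (20,4), matching 2xy and x*x.  Note that an expression mixing a PlainNum with a DualNum does not compile, since each operator only accepts its own type.</p>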
<p>We give a link to a jar containing <a href="http://www.win-vector.com/dfiles/ScalaDiff.jar">complete Scala source code</a> including this example, the DualNumber implementation, a conjugate gradient implementation and some JUnit tests (all under a GPLv3 license) and will go on to describe some of the design decisions.  The code is the bulky part of this work, so we will move on to discuss something more compact: types.</p>
<h2>Types</h2>
<p>If code is ever beautiful it is only when it is succinct.  Among the most succinct forms of code are individual type signatures and interfaces (though the indiscriminate repetition of type signatures is rightly considered ugly bloat, which Scala works to avoid).   Since we are distributing complete source we will describe only types and method signatures.  The entry points to the code are the JUnit tests (organized in the ScalaDiff/test source directory and depending on JUnit, which is not included) and the demo program in ScalaDiff/src/demo/Demo.scala.</p>
<p>To be a usable arithmetic type (like DualNumber or MDouble) you must extend the following parameterized abstract class:</p>
<div class="highlight">
<pre><span class="k">abstract</span> <span class="k">class</span> <span class="nc">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]]</span> <span class="o">{</span>
  <span class="c">// basic arithmetic</span>
  <span class="k">def</span> <span class="o">+</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">-</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">unary_-</span><span class="o">()</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">*</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">/</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>  <span class="c">// that not equal to zero</span>
  <span class="c">// more complicated</span>
  <span class="k">def</span> <span class="n">pow</span><span class="o">(</span><span class="n">that</span><span class="k">:</span><span class="kt">Double</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">exp</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">log</span><span class="k">:</span><span class="kt">NUMBERTYPE</span> <span class="c">// this is positive</span>
  <span class="c">// comparison functions</span>
  <span class="k">def</span> <span class="o">&gt;</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&gt;=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">==</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">!=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&lt;</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&lt;=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="c">// utility</span>
  <span class="k">def</span> <span class="n">field</span><span class="k">:</span><span class="kt">Field</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]</span>
<span class="o">}</span>
</pre>
</div>
<p>In particular DualNumber extends NumberBase[DualNumber].  This deliberate circular reference has a big purpose: it allows publicly visible covariant return types (returning nearly the exact type we really are instead of a base type).  This allows us to have strict type arguments so that trying to add an MDouble to a DualNumber is a type error (even though they both extend the same base class).  The automatic differentiation technique encapsulated in the DualNumber class only works if all of the calculation is in the DualNumber types, and this strict type enforcement allows the compiler to help prevent results sneaking in and out through other types.  All of the methods on NumberBase are obviously related to arithmetic except the field() method.  This method gives us access to a Field object which is responsible for carrying around the runtime type information (this is a common problem in Java and Scala: some type information known at compile time, such as the choice of template types, is not easily accessed at runtime).  The Field class is as follows:</p>
<div class="highlight">
<pre><span class="k">abstract</span> <span class="k">class</span> <span class="nc">Field</span> <span class="o">[</span><span class="kt">NUMBERTYPE</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]]</span> <span class="o">{</span>
  <span class="k">def</span> <span class="n">zero</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>            <span class="kt">//</span> <span class="kt">return</span> <span class="kt">canonical</span> <span class="kt">zero</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">one</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>             <span class="kt">//</span> <span class="kt">return</span> <span class="kt">canonical</span> <span class="kt">one</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">inject</span><span class="o">(</span><span class="n">v</span><span class="k">:</span><span class="kt">Double</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>  <span class="kt">//</span> <span class="kt">return</span> <span class="kt">canonical</span> <span class="kt">representation</span> <span class="kt">of</span> <span class="kt">number</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">project</span><span class="o">(</span><span class="n">v</span><span class="k">:</span><span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Double</span> <span class="kt">//</span> <span class="kt">return</span> <span class="kt">standard-number</span> <span class="kt">represented</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">array</span><span class="o">(</span><span class="n">n</span><span class="k">:</span><span class="kt">Int</span><span class="o">)</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]</span> <span class="kt">//</span> <span class="kt">return</span> <span class="kt">an</span> <span class="kt">array</span> <span class="kt">of</span> <span class="kt">this</span> <span class="k">type</span>
<span class="o">}</span>
</pre>
</div>
<p>The Field class is where we have factories for numbers (zero, one, arrays, injection from standard Doubles) and casting (projection back to standard Doubles).</p>
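<p>To make the circular bound concrete, here is a minimal sketch (not the code from the project).  It assumes cut-down versions of NumberBase, Field and MDouble with only a few methods; the names FieldSketch, FieldMDouble and sumSquares are hypothetical and exist only for this illustration.  It shows generic code building and combining numbers purely through a Field, without knowing the concrete number type:</p>
<div class="highlight">
<pre>object FieldSketch {
  // Cut-down stand-in for NumberBase: the bound forces "that" to be our own type.
  abstract class NumberBase[NUMBERTYPE &lt;: NumberBase[NUMBERTYPE]] {
    def +(that: NUMBERTYPE): NUMBERTYPE
    def *(that: NUMBERTYPE): NUMBERTYPE
    def field: Field[NUMBERTYPE]
  }

  // Cut-down stand-in for Field: factories plus projection back to Double.
  abstract class Field[NUMBERTYPE &lt;: NumberBase[NUMBERTYPE]] {
    def zero: NUMBERTYPE
    def inject(v: Double): NUMBERTYPE
    def project(x: NUMBERTYPE): Double
  }

  // Wrapper for machine Doubles.
  final class MDouble(val v: Double) extends NumberBase[MDouble] {
    def +(that: MDouble): MDouble = new MDouble(v + that.v)
    def *(that: MDouble): MDouble = new MDouble(v * that.v)
    def field: Field[MDouble] = FieldMDouble
  }

  // Hypothetical Field instance for MDouble.
  object FieldMDouble extends Field[MDouble] {
    def zero: MDouble = new MDouble(0.0)
    def inject(v: Double): MDouble = new MDouble(v)
    def project(x: MDouble): Double = x.v
  }

  // Generic code: sums squares using only the Field and NumberBase operations.
  def sumSquares[Y &lt;: NumberBase[Y]](field: Field[Y], xs: Array[Double]): Double =
    field.project(xs.foldLeft(field.zero) { (acc, x) =&gt;
      val y = field.inject(x)
      acc + y * y
    })

  def main(args: Array[String]): Unit = {
    println(sumSquares(FieldMDouble, Array(1.0, 2.0, 3.0))) // prints 14.0
  }
}
</pre>
</div>
<p>In this sketch an expression that mixes an MDouble with some other NumberBase subtype does not compile, because each + and * only accepts an argument of the receiver&#8217;s own type.</p>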
<p>With these types defined we can actually read intent off some of the method signatures.  </p>
<p>For example our conjugate gradient optimizer is accessed through the following method signature:</p>
<div class="highlight">
<pre> <span class="k">def</span> <span class="n">minimize</span><span class="o">(</span><span class="n">fn</span><span class="k">:</span><span class="kt">VectorFN</span><span class="o">,</span><span class="n">x0</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="o">(</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],</span><span class="kt">Double</span><span class="o">)</span> <span class="c">// return x,f(x)</span>
</pre>
</div>
<p>The above can be read as: CG.minimize() requires a VectorFN (our trait representing single argument functions with a free type parameter) and an initial point (in standard Doubles).  The code will then return a pair of the optimum point and the function evaluated at the optimum point.  From the type signature we can see that CG.minimize() expects to re-specialize the function &#8220;fn&#8221; to types of its own choosing (else it could have accepted a parameterized argument instead of our custom trait) and will handle all up-conversion and down-conversion between machine Doubles and NumberBase[Y] values itself.  This sort of type information is hard to express (let alone enforce) in a dynamically typed language.</p>
<p>A slightly more complicated example is the lineMinD() method:</p>
<div class="highlight">
<pre><span class="k">def</span> <span class="n">lineMinD</span><span class="o">[</span><span class="kt">Y&lt;:NumberBase</span><span class="o">[</span><span class="kt">Y</span><span class="o">]](</span><span class="n">field</span><span class="k">:</span><span class="kt">Field</span><span class="o">[</span><span class="kt">Y</span><span class="o">],
 </span><span class="n">f</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Y</span><span class="o">]</span><span class="k">=&gt;</span><span class="kt">Y</span><span class="o">,
 </span><span class="n">xm</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],
 </span><span class="n">di</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="o">(</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],</span><span class="kt">Double</span><span class="o">)</span>
</pre>
</div>
<p>Notice it is willing to work with any type-parameterized function (which means it is willing to let the caller pick the actual type of NumberBase[Y] and work with that).  Most callers will call with Y=MDouble (the wrapper for machine Doubles) and lineMinD() will then work with that (without ever really knowing the actual underlying type).</p>
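<p>The difference between the two signatures can be sketched as follows.  This is an illustration under stated assumptions, not the project&#8217;s code: NumberBase is reduced to + and *, Dual is a bare-bones dual number, and the shape of VectorFN (a trait whose apply method carries the type parameter) as well as the helper valueAndPartial0 are hypothetical.  Because the type parameter lives on the method, the callee can evaluate the same function once over MDouble and once over Dual:</p>
<div class="highlight">
<pre>object VectorFNSketch {
  // Minimal stand-in for the number base type; only + and * are modeled.
  trait NumberBase[Y &lt;: NumberBase[Y]] {
    def +(that: Y): Y
    def *(that: Y): Y
  }

  // Wrapper for machine Doubles.
  final case class MDouble(v: Double) extends NumberBase[MDouble] {
    def +(that: MDouble): MDouble = MDouble(v + that.v)
    def *(that: MDouble): MDouble = MDouble(v * that.v)
  }

  // Bare-bones dual number: value v plus infinitesimal coefficient d.
  final case class Dual(v: Double, d: Double) extends NumberBase[Dual] {
    def +(that: Dual): Dual = Dual(v + that.v, d + that.d)
    def *(that: Dual): Dual = Dual(v * that.v, v * that.d + d * that.v)
  }

  // Assumed shape of VectorFN: the type parameter is on the method,
  // so whoever calls apply chooses Y.
  trait VectorFN {
    def apply[Y &lt;: NumberBase[Y]](x: Array[Y]): Y
  }

  // Example objective: f(x) = x0*x0 + x0*x1
  object F extends VectorFN {
    def apply[Y &lt;: NumberBase[Y]](x: Array[Y]): Y = x(0) * x(0) + x(0) * x(1)
  }

  // Optimizer-side helper: re-specializes fn to two number types of its choosing.
  def valueAndPartial0(fn: VectorFN, x0: Array[Double]): (Double, Double) = {
    val value = fn(x0.map(MDouble(_))).v
    val seeded = x0.zipWithIndex.map { case (xi, i) =&gt;
      Dual(xi, if (i == 0) 1.0 else 0.0) // seed the first coordinate only
    }
    (value, fn(seeded).d)
  }

  def main(args: Array[String]): Unit = {
    println(valueAndPartial0(F, Array(3.0, 2.0))) // prints (15.0,8.0)
  }
}
</pre>
</div>
<p>A plain function value of type Array[Y]=&gt;Y is already fixed to one Y by whoever built it, which is why a signature in the style of lineMinD() also takes the matching Field[Y].</p>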
<p>A lot of fans of dynamic languages consider type systems to be mere hairshirt penance.  But that is not so.  Broken type systems (like Java&#8217;s collections before generic type parameters, implemented by erasure, were introduced in Java 1.5) are indeed more trouble than they are worth.  Working type systems (like C++ Templates/STL, Java 1.5+ and Scala) allow you to solve problems (and enforce decisions) during the design phase (which is much, much cheaper than during the deployment phase).  You can&#8217;t set your types in stone (you are likely going to have them subtly wrong for the first few iterations).  You must be willing to think like a &#8220;language lawyer&#8221; to find out what parts of your work can be specified and enforced in the language type system.  To use an analogy: static types are your blueprint or your underpainting.</p>
<h2>Tests</h2>
<p>One argument against static types is that you can get much of their benefit from unit tests.  My opinion is you never have enough unit tests, so putting more pressure on your test suite is not wise.   Static types plus tests are strictly more powerful than static types alone or tests alone. </p>
<p>Even for this toy-scale example project we have included a JUnit test set to pursue a number of goals:</p>
<ul>
<li>Confirm our number implementations (DualNumber and MDouble) correctly model machine Doubles (perform parallel calculations and compare).</li>
<li>Confirm DualNumber obeys expected laws of algebra composition and cancellation <em>including the portions that can not be modeled in machine Doubles</em>.</li>
<li>Confirm DualNumbers compute gradients.</li>
<li>Confirm operations of optimizers and optimizer components.</li>
</ul>
<p>Many of these tests are related, but they don&#8217;t all imply each other and they give different perspectives on the errors they catch.  For example no amount of parallel computation between DualNumbers and machine Doubles is going to confirm the infinitesimal portion of the DualNumber is propagating correctly (since this is not a property of machine Doubles).  So we add extra tests that expect DualNumber to obey algebraic relations like: a*(b+c) = a*b + a*c.  It is then another step to confirm that whatever the DualNumbers calculate is not only self-consistent, but also models a truncated Taylor Series or differentiation.</p>
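<p>The last two kinds of check can be sketched as below.  This is a self-contained illustration using plain assert rather than JUnit; the Dual class, the tolerance and the test values are all hypothetical stand-ins and are not taken from the project&#8217;s test suite:</p>
<div class="highlight">
<pre>object DualLawsSketch {
  // Bare-bones dual number: value v plus infinitesimal coefficient d.
  final case class Dual(v: Double, d: Double) {
    def +(that: Dual): Dual = Dual(v + that.v, d + that.d)
    def *(that: Dual): Dual = Dual(v * that.v, v * that.d + d * that.v)
  }

  // Compare both the standard part and the infinitesimal part.
  def close(a: Dual, b: Dual, tol: Double = 1.0e-12): Boolean =
    math.abs(a.v - b.v) &lt;= tol &amp;&amp; math.abs(a.d - b.d) &lt;= tol

  // Algebraic law: a*(b+c) = a*b + a*c, including the infinitesimal portion.
  def distributes(a: Dual, b: Dual, c: Dual): Boolean =
    close(a * (b + c), a * b + a * c)

  // f(x) = x^3 + x, whose derivative 3x^2 + 1 we know independently.
  def f(x: Dual): Dual = x * x * x + x

  def main(args: Array[String]): Unit = {
    assert(distributes(Dual(2.0, 1.0), Dual(3.0, 0.5), Dual(-1.0, 0.25)))
    // Seed d = 1 so the infinitesimal part of the result is f'(x).
    val r = f(Dual(2.0, 1.0))
    assert(close(r, Dual(10.0, 13.0)))
    println("dual number checks passed")
  }
}
</pre>
</div>
<p>The first assertion only shows self-consistency; the second ties the infinitesimal part to a derivative computed by hand, which is the separate step described above.</p>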
<h2>Conclusion</h2>
<p>We hope we have demonstrated how the complexity of a mathematical programming problem can be managed by breaking the problem into an objective function that is separate from the optimizer (allowing the optimizer to be both good and hidden) and by using a static type system (such as Scala&#8217;s) to help enforce required properties of a calculation (such as all numbers being routed through a required representation).  With these sorts of tools available many formerly hard problems (that are often, unfortunately, solved by over-specifying direct inefficient iterative improvement techniques) become &#8220;if I can write a reasonable objective function this may already be solved by an optimizer in my library.&#8221;  The more of these tools you have (either in your code or in your reference library) the more of these problems become easy (this is the topic of my earlier paper: <a href="http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/">The Local to Global Principle</a>).</p>
<h2>Appendix: Fixing Smoothness</h2>
<p>Our chosen example objective function is very nice (i.e. convex) but it has a small (but correctable) problem.  The derivative or gradient has some jump discontinuities that could cause an optimizer to exit prematurely (not at the global optimum).  Consider the simple form of this for wiring a center to a single point at the origin (even in 1 dimension).  The wiring cost function sqrt(x*x) has the cost graph shown here.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/abs.png" alt="abs.png" border="0" width="525" height="525" /><br />
</center></p>
<p>This is convex, but its derivative is not smooth, as we see in the included graph of the derivative of sqrt(x*x).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/dabs.png" alt="dabs.png" border="0" width="525" height="525" /><br />
</center></p>
<p>So: in this case if the optimizer stops at one of the target points we can&#8217;t be sure that it stopped at the global optimum (it may have stopped due to the discontinuity in the gradient).  For some simple problems the optimum is necessarily at a target point.  For example on the number line take the target points 0,1 and x.  As long as x&ge;0 and x&le;1 the optimum placement will be x itself.</p>
<p>One way to defend against this is to use some sort of smoothed version of sqrt() that essentially decreases a little faster near the origin.  Our cost function becomes:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/cost2.png" alt="cost2.png" border="0" width="237" height="55" /><br />
</center><br />
where s() is our suitable approximation of the sqrt() function.  Two candidates are s(x) = (x+tau)^(1/2) and s(x) = x^(1/2 + tau); where tau is a small constant.  As long as tau is greater than zero we have no derivative discontinuity in s(x^2) and convexity is preserved (even made a bit stricter).  Other ways to deal with this include adding additional coordinates to the problem and small perturbations on these coordinates.  Finally, a point found by optimizing with respect to s(x) can be &#8220;polished&#8221; by re-starting the optimization at the first found solution and using sqrt(x) as the new objective (if the original point is not near any of the target points).</p>
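<p>For the first candidate, s(x) = (x+tau)^(1/2), the effect on the one-dimensional cost can be sketched numerically.  The value tau = 1.0e-6 and all names below are assumptions made for illustration only; the derivative formula is just the chain rule applied to sqrt(x*x + tau):</p>
<div class="highlight">
<pre>object SmoothSketch {
  val tau = 1.0e-6 // hypothetical smoothing constant; any small tau &gt; 0 works

  // Smoothed square root and the resulting 1-dimensional wiring cost s(x*x).
  def s(x: Double): Double = math.sqrt(x + tau)
  def cost(x: Double): Double = s(x * x)

  // d/dx sqrt(x*x + tau) = x / sqrt(x*x + tau): continuous through x = 0.
  def dCost(x: Double): Double = x / math.sqrt(x * x + tau)

  // Derivative of the unsmoothed cost sqrt(x*x): jumps from -1 to +1 at 0.
  def dAbs(x: Double): Double = math.signum(x)

  def main(args: Array[String]): Unit = {
    for (x &lt;- Seq(-1.0, -1.0e-9, 0.0, 1.0e-9, 1.0)) {
      println(s"x=$x  smoothed slope=${dCost(x)}  original slope=${dAbs(x)}")
    }
  }
}
</pre>
</div>
<p>Just either side of the origin the original slope is -1 or +1 while the smoothed slope stays near 0; away from the origin the two slopes nearly agree.</p>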
<hr/>
<p>Example code now distributed from: <a target="_blank" href="https://github.com/WinVector/AutoDiff">github.com/WinVector/AutoDiff</a>.</p>
<hr/>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
<li><a href='http://www.win-vector.com/blog/2011/08/programmers-should-know-r/' rel='bookmark' title='Programmers Should Know R'>Programmers Should Know R</a></li>
<li><a href='http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/' rel='bookmark' title='Automatic Detection of Potential Deadlock'>Automatic Detection of Potential Deadlock</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Must Have Software</title>
		<link>http://www.win-vector.com/blog/2010/05/must-have-software/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=must-have-software</link>
		<comments>http://www.win-vector.com/blog/2010/05/must-have-software/#comments</comments>
		<pubDate>Fri, 28 May 2010 17:26:07 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computers]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Excel]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[GnuPG]]></category>
		<category><![CDATA[Keynote]]></category>
		<category><![CDATA[Latex]]></category>
		<category><![CDATA[Must Have Software]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[TrueCrypt]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1461</guid>
		<description><![CDATA[Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX), Windows (NT4, 2000, XP, Vista and 7) for quite a while I have seen a lot of different software tools. I would like to quickly exhibit my &#8220;must have&#8221; list. These are the packages that I find to be the single &#8220;must have offerings&#8221; in [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/' rel='bookmark' title='&#8220;The Mythical Man Month&#8221; is still a good read'>&#8220;The Mythical Man Month&#8221; is still a good read</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/microsoft-store-again/' rel='bookmark' title='Microsoft Store Again'>Microsoft Store Again</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX) and Windows (NT4, 2000, XP, Vista and 7) for quite a while I have seen a lot of different software tools.  I would like to quickly exhibit my &#8220;must have&#8221; list.  These are the packages that I find to be the single &#8220;must have offerings&#8221; in a number of categories.  I have avoided some categories (such as editors, email programs, programming languages, IDEs, photo editors, backup solutions, databases, database tools and web tools) where I have no feeling of having seen a single absolute best offering.</p>
<p>The spirit of the list is to pick items such that: if you disagree with an item in this list then either you are wrong or you know something I would really like to hear about.</p>
<p><span id="more-1461"></span></p>
<dl>
<dt><strong>Encryption, disk images: <a href="http://www.truecrypt.org/" target="ext">TrueCrypt</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>TrueCrypt can create portable encrypted virtual disks (files that can be mounted as a disk on any operating system).</dd>
<dd></dd>
<dt><strong>Encryption, files: <a href="http://www.gnupg.org/" target="ext">GnuPG</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>GnuPG is the tool to use to encrypt files for email.</dd>
<dd></dd>
<dt><strong>Presentation: <a href="http://www.apple.com/iwork/keynote/" target="ext">Apple Keynote</a> (commercial: OSX)</strong></dt>
<dd>Keynote is not quite as friendly as Microsoft PowerPoint, but it quickly produces beautiful presentations.</dd>
<dt><strong>Reference Library: <a href="http://mekentosj.com/papers/" target="ext">Papers</a> (commercial: OSX)</strong></dt>
<dd>&#8220;iTunes for PDF.&#8221;  Manage thousands of PDFs and references, annotate with meta-data, place papers into multiple project folders.  An interesting runner-up is <a href="http://bibdesk.sourceforge.net/" target="ext">BibDesk</a> (open source: OSX).</dd>
<dt><strong>Spreadsheet: <a href="http://office.microsoft.com/en-gb/excel/default.aspx" target="ext">Microsoft Excel</a> (commercial: Windows, OSX)</strong></dt>
<dd>Open Office and Google Docs are getting better every day, but neither comes close to Microsoft Excel in functionality and versatility of user interface.  If you are on a platform that supports Excel, working regularly with spreadsheets and using something other than Excel: it really means that you do not value your time.</dd>
<dt><strong>Statistics Software: <a href="http://www.r-project.org/" target="ext">R</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>R is rapidly becoming the platform of choice for statisticians and is (with the addition of lattice and ggplot2) the best way to produce graphs.  R has a fairly nasty programming language, but has so many statistical operations available that it can not be avoided.</dd>
<dt><strong>Technical Documentation: <a href="http://www.tug.org/" target="ext">LaTeX</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>It may seem antiquated but TeX/LaTeX is still far more powerful than the &#8220;WYSIWYG&#8221; pretenders.  The separation of presentation from specification, automatic management of references and tables of contents, and being able to include PDFs from external files (which get refreshed when you re-build the document) are all lifesavers.</dd>
<dt><strong>Version Control: <a href="http://git-scm.com/" target="ext">git</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>Just about the only version control system that: doesn&#8217;t damage the data you are trying to manage by adding dot-files into all of the directories, can routinely handle large files and can work productively without a network connection.  <a href="http://www.perforce.com/" target="ext">Perforce</a> is a powerful central-server commercial option (with the ability to have central policies, control and review).
</dd>
</dl>
<p></p>
<p>I look forward to learning which of my choices are considered poor and what your must-haves are.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/' rel='bookmark' title='&#8220;The Mythical Man Month&#8221; is still a good read'>&#8220;The Mythical Man Month&#8221; is still a good read</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/microsoft-store-again/' rel='bookmark' title='Microsoft Store Again'>Microsoft Store Again</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/05/must-have-software/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

