<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Computer Science</title>
	<atom:link href="http://www.win-vector.com/blog/category/computer-science/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Thu, 29 Jul 2010 17:09:49 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Automatic Differentiation with Scala</title>
		<link>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=automatic-differentiation-with-scala</link>
		<comments>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/#comments</comments>
		<pubDate>Tue, 15 Jun 2010 04:19:20 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Automatic Differentiation]]></category>
		<category><![CDATA[Conjugate Gradient]]></category>
		<category><![CDATA[Dual Numbers]]></category>
		<category><![CDATA[Geometric Median]]></category>
		<category><![CDATA[Numeric Methods]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Scala]]></category>
		<category><![CDATA[Steiner Tree]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1481</guid>
		<description><![CDATA[This article is a worked-out exercise in applying the Scala type system to solve a small scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion. Usually we work using a combination of databases, Java, optimization libraries and analysis suites (like R). The reason is [...]
]]></description>
			<content:encoded><![CDATA[<p>This article is a worked-out exercise in applying the <a href="http://www.scala-lang.org/" target="ext">Scala</a> type system to solve a small scale optimization problem.    For this article we supply <a href="http://www.win-vector.com/dfiles/ScalaDiff.jar">complete Scala source code</a> (under a GPLv3 license) and some design discussion.<span id="more-1481"></span><br />
Usually we work using a combination of databases, Java, optimization libraries and analysis suites (like R).  The reason is that, for our typical problems, Java hits a sweet spot of trading off runtime performance against ease of development and maintenance.  In the tens of gigabytes range (data sets larger than the Wikipedia but smaller than the Web) Java outperforms the scripting languages (Ruby, Python &#8230;) and is much easier to develop in and document than C++.  This sweet spot is both subjective and situational: if the tasks were smaller and in a services framework then Python would be a better choice, if performance is paramount then C or C++ (with the STL) and Hadoop are a better choice, and if pre-built statistical libraries are needed then R becomes a better choice.  For the type of problem we present here Scala is a very good choice.</p>
<style type="text/css">
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
pre { line-height: 125%; }
body .hll { background-color: #ffffcc }
body  { background: #f8f8f8; }
body .c { color: #408080; font-style: italic } /* Comment */
body .err { border: 1px solid #FF0000 } /* Error */
body .k { color: #008000; font-weight: bold } /* Keyword */
body .o { color: #666666 } /* Operator */
body .cm { color: #408080; font-style: italic } /* Comment.Multiline */
body .cp { color: #BC7A00 } /* Comment.Preproc */
body .c1 { color: #408080; font-style: italic } /* Comment.Single */
body .cs { color: #408080; font-style: italic } /* Comment.Special */
body .gd { color: #A00000 } /* Generic.Deleted */
body .ge { font-style: italic } /* Generic.Emph */
body .gr { color: #FF0000 } /* Generic.Error */
body .gh { color: #000080; font-weight: bold } /* Generic.Heading */
body .gi { color: #00A000 } /* Generic.Inserted */
body .go { color: #808080 } /* Generic.Output */
body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
body .gs { font-weight: bold } /* Generic.Strong */
body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
body .gt { color: #0040D0 } /* Generic.Traceback */
body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
body .kp { color: #008000 } /* Keyword.Pseudo */
body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
body .kt { color: #B00040 } /* Keyword.Type */
body .m { color: #666666 } /* Literal.Number */
body .s { color: #BA2121 } /* Literal.String */
body .na { color: #7D9029 } /* Name.Attribute */
body .nb { color: #008000 } /* Name.Builtin */
body .nc { color: #0000FF; font-weight: bold } /* Name.Class */
body .no { color: #880000 } /* Name.Constant */
body .nd { color: #AA22FF } /* Name.Decorator */
body .ni { color: #999999; font-weight: bold } /* Name.Entity */
body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
body .nf { color: #0000FF } /* Name.Function */
body .nl { color: #A0A000 } /* Name.Label */
body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
body .nt { color: #008000; font-weight: bold } /* Name.Tag */
body .nv { color: #19177C } /* Name.Variable */
body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
body .w { color: #bbbbbb } /* Text.Whitespace */
body .mf { color: #666666 } /* Literal.Number.Float */
body .mh { color: #666666 } /* Literal.Number.Hex */
body .mi { color: #666666 } /* Literal.Number.Integer */
body .mo { color: #666666 } /* Literal.Number.Oct */
body .sb { color: #BA2121 } /* Literal.String.Backtick */
body .sc { color: #BA2121 } /* Literal.String.Char */
body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
body .s2 { color: #BA2121 } /* Literal.String.Double */
body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
body .sh { color: #BA2121 } /* Literal.String.Heredoc */
body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
body .sx { color: #008000 } /* Literal.String.Other */
body .sr { color: #BB6688 } /* Literal.String.Regex */
body .s1 { color: #BA2121 } /* Literal.String.Single */
body .ss { color: #19177C } /* Literal.String.Symbol */
body .bp { color: #008000 } /* Name.Builtin.Pseudo */
body .vc { color: #19177C } /* Name.Variable.Class */
body .vg { color: #19177C } /* Name.Variable.Global */
body .vi { color: #19177C } /* Name.Variable.Instance */
body .il { color: #666666 } /* Literal.Number.Integer.Long */
 </style>
<h2>Our Example Problem</h2>
<p>Our small scale problem is this:  we have a number of target points on a map and we want to pick a central point to <em>directly</em> connect to all of these points with wire.  Our goal is to minimize the total amount of wire used.  This problem is called the <a href="http://en.wikipedia.org/wiki/Geometric_median" ref="ext">&#8220;Geometric Median&#8221;</a>.  So we are trying to find a point that minimizes the sum of distances from our chosen center to every target point. If we were trying to minimize the sum of squared distances from our chosen center to every target point the answer would be obvious: the average or mean (which by Hooke&#8217;s law is also the point where a set of identical springs would relax to).  The mean is in fact a fairly good guess, but you can do better (which could be important if the &#8220;wire&#8221; is expensive, such as cutting irrigation or drainage ditches).  For example, given the three target points (20,0), (-1,-1) and (-1,1) the optimal point is (-0.42,0), not the mean (6,0), and the choice of optimal point represents an over 19% savings in total wiring distance (see figure).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/points.png" alt="points.png" border="0" width="525" height="525" /><br />
</center></p>
<p>This is a substantial saving in cost.  </p>
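<p>The 19% figure is easy to check by hand.  The following is a small sketch (not part of the distributed source; the names WireCost, totalWire and savingVsMean are ours, chosen for illustration) that sums the straight-line distances from a proposed center to the three target points and compares the mean against the rounded optimal point (-0.42,0).</p>

```scala
object WireCost {
  // Target points from the example in the text.
  val targets: Array[Array[Double]] = Array(
    Array(20.0, 0.0),
    Array(-1.0, -1.0),
    Array(-1.0, 1.0)
  )

  // Total straight-line (Euclidean) wire length from center c to every target.
  def totalWire(c: Array[Double]): Double =
    targets.map { t =>
      math.sqrt(c.indices.map(i => (c(i) - t(i)) * (c(i) - t(i))).sum)
    }.sum

  // Coordinate-wise mean of the targets.
  def mean: Array[Double] =
    Array.tabulate(targets(0).length)(i => targets.map(_(i)).sum / targets.length)

  // Fraction of wire saved by placing the center at c instead of at the mean.
  def savingVsMean(c: Array[Double]): Double = {
    val atMean = totalWire(mean)
    (atMean - totalWire(c)) / atMean
  }
}
```

<p>Calling WireCost.savingVsMean(Array(-0.42, 0.0)) gives a fraction a little above 0.19, matching the savings quoted above.</p>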
<p>The problem changes as we consider variations.  If indirect connections (such as routing one point through another, which may or may not be possible for reasons of capacity or safety) and multiple new centers are allowed we then have an instance of the <a href="http://en.wikipedia.org/wiki/Steiner_tree_problem" ref="ext">Steiner Tree Problem</a> which is harder to solve (it is known to be NP-hard).  If no new centers are allowed (all routing must be between pre-existing target points) then we have a Minimum Spanning Tree problem, which admits very quick solutions.</p>
<p>We bring up the geometric median as a mere example.  We don&#8217;t intend for our code to solve only the geometric median problem and we don&#8217;t intend to touch on the literature of specialized methods for solving the geometric median problem.  Instead we are trying to demonstrate the speed with which you can develop prototype solutions if you have a few good tools (like various optimizers) available in your toolkit.  Numeric optimizers may sound exotic, but they often are the kind of thing you want to experiment with and link directly into your code.</p>
<h2>Optimization as General Tool</h2>
<p>Now that we have the example problem we can describe a solution strategy.  In this case the solution uses code &#8220;we wished we had lying around&#8221; before we started on the problem.  We will pretend we have the tools we want ready to solve our problem and then we will pay our debt and build the required tools.  The issue is that there is not an obvious closed form for the solution of the geometric median problem.  So we are forced to work a bit harder.  In this case harder means we need to solve an optimization problem.  Consider the contour plot of the total wiring cost as a function of where we choose to place our center.  Our optimal point (-0.42,0) has a wiring cost of 22.73 and the contour plot given here shows concentric regions of solution positions with higher cost.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/contour.png" alt="contour.png" border="0" width="525" height="525" /><br />
</center></p>
<p>In general it is unwise to throw an optimizer at an arbitrary problem and hope to find the globally best solution.  But in this case (and in many similar situations) we can prove that a simple local optimizer will in fact find the unique best solution.  This is a property of the problem, not of the optimizer.  The concentric regions shown in the contour plot have a very nice shape: they are <a href="http://en.wikipedia.org/wiki/Convex_set" ref="ext">convex</a>.   That is, they have no intrusions: for any two points drawn from one of these shapes the straight line segment between these points stays inside the given shape.  We don&#8217;t have to depend on observation: we can actually prove this is always the case for this problem.  The wiring cost from a proposed center to any single target point is a <a href="http://en.wikipedia.org/wiki/Convex_function" ref="ext">convex function</a> of where we choose to place our center (a convex function is a function whose graph never reaches above the secant line drawn between any two points on its graph).  The total wiring cost is just the sum of the wiring costs to each target point.  And to finish: the sum of a collection of convex functions is itself a convex function.  Since every region enclosed by a contour of a convex function is itself convex, we have proven the statement.</p>
<p>But how does this help us?  There is a standard technique to find &#8220;local minima&#8221; of a function by inspecting a function for places where the gradient is zero (points where there is no obvious downhill direction on the contour plot).  This technique usually can only be guaranteed to find local minima (places where no small change improves your situation).  In general there is no guarantee that the local minimum you find is in fact the global minimum (the best possible solution), except when you are dealing with a convex function.  When a function is convex then all of the local minima are always grouped together into a single convex connected shape (if not, a line drawn between two remote minima would violate the convexity definition).  And if the function is never flat then this set is a single unique point: the unique best solution.  Our inspection technique will be a gradient driven optimizer, that is, an optimizer that improves its objective by running downhill while the gradient is non-zero and halts when the gradient is zero.</p>
<p>The stated function to minimize is to sum the distance from our proposed center to each target point.  We can write this as the sum of the distances:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/dist1.png" alt="dist1.png" border="0" width="309" height="81" /><br />
</center></p>
<p>( <img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/euclid1.png" alt="euclid1.png" border="0" width="119" height="37" /> which is the traditional Euclidean or L2 distance).  This function actually has one subtle flaw that we will deal with in the appendix (see: Fixing Smoothness).</p>
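<p>To make the idea of a gradient driven optimizer concrete, here is a minimal sketch of plain gradient descent with a step-halving line search on the sum-of-distances objective.  This is deliberately the simplest such method and is <em>not</em> the conjugate gradient code used in the rest of the article; the object name DescentSketch and its parameters are ours, and the hand-written gradient assumes the current point never lands exactly on a target point.</p>

```scala
object DescentSketch {
  val targets: Array[Array[Double]] =
    Array(Array(20.0, 0.0), Array(-1.0, -1.0), Array(-1.0, 1.0))

  def dist(p: Array[Double], t: Array[Double]): Double =
    math.sqrt(p.indices.map(i => (p(i) - t(i)) * (p(i) - t(i))).sum)

  // Objective: total distance from p to every target.
  def cost(p: Array[Double]): Double = targets.map(t => dist(p, t)).sum

  // Hand-derived gradient: each target contributes the unit vector from it to p.
  // Assumes p does not coincide with a target (that distance would be zero).
  def gradient(p: Array[Double]): Array[Double] =
    Array.tabulate(p.length)(i => targets.map(t => (p(i) - t(i)) / dist(p, t)).sum)

  // Run downhill until the gradient is (nearly) zero.
  def minimize(p0: Array[Double], tol: Double = 1.0e-6, maxIter: Int = 10000): Array[Double] = {
    var p = p0.clone()
    var iter = 0
    var done = false
    while (!done && iter < maxIter) {
      val g = gradient(p)
      val gNorm2 = g.map(x => x * x).sum
      if (math.sqrt(gNorm2) < tol) done = true
      else {
        val c0 = cost(p)
        def trial(s: Double): Array[Double] =
          Array.tabulate(p.length)(i => p(i) - s * g(i))
        // Halve the step until the cost drops by a sufficient amount.
        var step = 1.0
        while (step > 1.0e-16 && cost(trial(step)) > c0 - 0.5 * step * gNorm2) step /= 2.0
        if (step <= 1.0e-16) done = true else p = trial(step)
      }
      iter += 1
    }
    p
  }
}
```

<p>Started from the mean (6,0) this walks along the x-axis toward the optimal point near (-0.42,0).  Because the objective is convex, the place where it stops is the global minimum and not merely a local one.</p>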
<h2>Using Scala to Apply the Optimization Solution</h2>
<p>To find our optimal center placement using Scala we first write our cost or objective as a Scala function:</p>
<div class="highlight">
<pre>    <span class="k">val</span> <span class="n">dat</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]]</span> <span class="o">=</span> <span class="nc">Array</span><span class="o">(</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="mi">20</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">,</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">)</span>
    <span class="o">)</span>

    <span class="k">def</span> <span class="n">fx</span><span class="o">(</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="kt">Double</span> <span class="o">=</span> <span class="o">{</span>
      <span class="k">val</span> <span class="n">dim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
      <span class="k">val</span> <span class="n">npoint</span> <span class="k">=</span> <span class="n">dat</span><span class="o">.</span><span class="n">length</span>
      <span class="k">var</span> <span class="n">total</span> <span class="k">=</span> <span class="mf">0.0</span>
      <span class="k">for</span><span class="o">(</span><span class="n">k</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">npoint</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
        <span class="k">var</span> <span class="n">term</span> <span class="k">=</span> <span class="mf">0.0</span>
        <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">dim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
          <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">-</span> <span class="n">dat</span><span class="o">(</span><span class="n">k</span><span class="o">)(</span><span class="n">i</span><span class="o">)</span>
          <span class="n">term</span> <span class="k">=</span> <span class="n">term</span> <span class="o">+</span> <span class="n">diff</span><span class="o">*</span><span class="n">diff</span>
        <span class="o">}</span>
        <span class="n">total</span> <span class="k">=</span> <span class="n">total</span> <span class="o">+</span> <span class="n">scala</span><span class="o">.</span><span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="o">(</span><span class="n">term</span><span class="o">)</span>
      <span class="o">}</span>
      <span class="n">total</span>
    <span class="o">}</span>
</pre>
</div>
<p>Scala is succinct and it is a great convenience to have a function definition capture data from its environment.   What we would like to do is generate an initial guess at the solution (we use the mean as our initial guess) and then call an optimizer (in this case a conjugate gradient optimizer) to do all the work:</p>
<div class="highlight">
<pre> <span class="k">val</span> <span class="n">p0</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="o">=</span> <span class="n">mean</span><span class="o">(</span><span class="n">dat</span><span class="o">)</span>
 <span class="k">val</span> <span class="o">(</span><span class="n">pF</span><span class="o">,</span><span class="n">fpF</span><span class="o">)</span> <span class="k">=</span> <span class="nc">CG</span><span class="o">.</span><span class="n">minimize</span><span class="o">(</span><span class="n">fx</span><span class="o">,</span><span class="n">p0</span><span class="o">)</span>
</pre>
</div>
<p>At this point we would be done, except the conjugate gradient method (which is superior to gradient descent and many of the non-gradient methods) requires a gradient.<br />
We could provide a numeric estimate of the gradient by the following divided difference method:</p>
<div class="highlight">
<pre>  <span class="k">def</span> <span class="n">gradientD</span><span class="o">(</span><span class="n">f</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span><span class="k">=&gt;</span><span class="kt">Double</span><span class="o">,</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="o">=</span> <span class="o">{</span>
    <span class="k">val</span> <span class="n">xdim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
    <span class="k">val</span> <span class="n">p2</span> <span class="k">=</span> <span class="n">copy</span><span class="o">(</span><span class="n">p</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">base</span> <span class="k">=</span> <span class="n">f</span><span class="o">(</span><span class="n">p2</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">ret</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">](</span><span class="n">xdim</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">delta</span> <span class="k">=</span> <span class="mf">1.0e-6</span>
    <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">xdim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
      <span class="n">p2</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">+</span> <span class="n">delta</span>
      <span class="k">val</span> <span class="n">fplus</span> <span class="k">=</span> <span class="n">f</span><span class="o">(</span><span class="n">p2</span><span class="o">)</span>
      <span class="n">p2</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span>
      <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="o">(</span><span class="n">fplus</span><span class="o">-</span><span class="n">base</span><span class="o">)/</span><span class="n">delta</span>
      <span class="n">ret</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">diff</span>
    <span class="o">}</span>
    <span class="n">ret</span>
  <span class="o">}</span>
</pre>
</div>
<p>This numeric divided difference method often outperforms non-derivative optimization methods (like Powell&#8217;s Method and the Nelder-Mead Amoeba method).  But the technique can run into numeric difficulties: too large a delta gives a biased estimate, while too small a delta loses the difference to floating-point rounding.   We can remedy this if we are willing to write our function in a slightly more general way.   If we re-encode our function in a generic manner we can use <a href="http://en.wikipedia.org/wiki/Automatic_differentiation" target="ext">automatic differentiation</a>  (not to be confused with numeric differentiation or with symbolic differentiation) to produce a reliable gradient for optimization.  What we need to do is re-write our function to work over an abstract field of numbers instead of only the machine supplied doubles.  In fact what we need to do is specify a generic function that will work over any field, with the field to be determined later.  The code to do this in Scala is very similar to the non-generic code:</p>
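<p>As a quick illustration of how a divided difference can go wrong, consider the following sketch (the object DividedDifference and its methods are ours, for illustration only).  It estimates the derivative of x*x at x = 1, where the exact answer is 2, using three different step sizes.</p>

```scala
object DividedDifference {
  // One-sided divided difference estimate of f'(x).
  def estimate(f: Double => Double, x: Double, delta: Double): Double =
    (f(x + delta) - f(x)) / delta

  // Test function; its exact derivative at x is 2*x.
  def square(x: Double): Double = x * x

  // A large step is biased, a moderate step is close, and a tiny step
  // collapses because 1.0 + 1.0e-17 rounds back to 1.0 in double precision.
  val coarse: Double = estimate(square, 1.0, 0.1)
  val good: Double = estimate(square, 1.0, 1.0e-6)
  val tooSmall: Double = estimate(square, 1.0, 1.0e-17)
}
```

<p>The coarse estimate is off by about 0.1, the moderate step is accurate to roughly six digits, and the smallest step returns zero because the perturbation is lost to rounding.  Picking a good delta depends on the scale of both x and f, which is exactly the tuning automatic differentiation lets us avoid.</p>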
<div class="highlight">
<pre>   <span class="k">val</span> <span class="n">genericFx</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">VectorFN</span> <span class="o">{</span>
      <span class="k">def</span> <span class="n">apply</span><span class="o">[</span><span class="kt">Y</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">Y</span><span class="o">]](</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Y</span><span class="o">])</span><span class="k">:</span><span class="kt">Y</span> <span class="o">=</span> <span class="o">{</span>
        <span class="k">val</span> <span class="n">field</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="mi">0</span><span class="o">).</span><span class="n">field</span>
        <span class="k">val</span> <span class="n">dim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
        <span class="k">val</span> <span class="n">npoint</span> <span class="k">=</span> <span class="n">dat</span><span class="o">.</span><span class="n">length</span>
        <span class="k">var</span> <span class="n">total</span> <span class="k">=</span> <span class="n">field</span><span class="o">.</span><span class="n">zero</span>
        <span class="k">for</span><span class="o">(</span><span class="n">k</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">npoint</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
          <span class="k">var</span> <span class="n">term</span> <span class="k">=</span> <span class="n">field</span><span class="o">.</span><span class="n">zero</span>
          <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">dim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
            <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">-</span> <span class="n">field</span><span class="o">.</span><span class="n">inject</span><span class="o">(</span><span class="n">dat</span><span class="o">(</span><span class="n">k</span><span class="o">)(</span><span class="n">i</span><span class="o">))</span>
            <span class="n">term</span> <span class="k">=</span> <span class="n">term</span> <span class="o">+</span> <span class="n">diff</span><span class="o">*</span><span class="n">diff</span>
          <span class="o">}</span>
          <span class="n">total</span> <span class="k">=</span> <span class="n">total</span> <span class="o">+</span> <span class="n">smoothSQRT</span><span class="o">(</span><span class="n">term</span><span class="o">)</span>
        <span class="o">}</span>
        <span class="n">total</span>
      <span class="o">}</span>
    <span class="o">}</span>
</pre>
</div>
<p>Notice that this code is very similar to the &#8220;def fx()&#8221; code.  The key differences are that we had to define genericFx as extending a trait (a type of Scala interface) called VectorFN and inside this trait extension we defined a parameterized function named apply().  apply() is a generic function that is willing to work over any type Y where Y is at least of type NumberBase[Y] (we will get more into what that means in a moment).  The difference in notation arises because, while the Scala function <em>syntax</em> cannot specify a generic function with free type parameters (the incompletely specified Y), the Scala <em>semantics</em> are strong enough to implement this.  In fact standard function values (such as what &#8220;def fx()&#8221; becomes when it is passed as an argument) are just instances of the Scala built-in <a href="http://www.scala-lang.org/docu/files/api/scala/Function1.html" target="ext">Function1 trait</a>.  With a generic objective function in hand all we need is conjugate gradient code that is expecting a VectorFN (and willing to call apply() instead of just using naked function parentheses) and some type NumberBase[Y] that can compute gradients for us.  The Scala compiler can specialize our genericFx() into one version for quick calculation and another for gradients.  How this is done is what we will discuss next.  From our point of view our problem is solved with the following one line of code:</p>
<div class="highlight">
<pre><span class="k">val</span> <span class="o">(</span><span class="n">pF</span><span class="o">,</span><span class="n">fpF</span><span class="o">)</span> <span class="k">=</span> <span class="nc">CG</span><span class="o">.</span><span class="n">minimize</span><span class="o">(</span><span class="n">genericFx</span><span class="o">,</span><span class="n">p0</span><span class="o">)</span>
</pre>
</div>
<p>This should always be your goal: build sufficient preparation so your last step is an &#8220;obvious one liner.&#8221;</p>
<h2>What Tools we Wish we Had Lying Around</h2>
<p>We supply in our example some workable conjugate gradient code, but that is standard so we will not discuss it.  What is of interest (and facilitated by Scala&#8217;s parametrized type system) is the implementation of <a href="http://en.wikipedia.org/wiki/Dual_number" target="ext">dual numbers</a> as a framework to supply automatic differentiation.  An implementation of dual numbers as a NumberBase[DualNumber] type is the core of our demonstration.</p>
<p>Dual numbers are an algebraic structure written as pairs of real numbers &#8220;(a,b)&#8221;.  The arithmetic table for dual numbers is given below:</p>
<table>
<tr>
<td>(a,b) + (c,d)</td>
<td>=</td>
<td>((a+c) , (b+d))</td>
</tr>
<tr>
<td>(a,b) &#8211; (c,d)</td>
<td>=</td>
<td>((a-c) , (b-d))</td>
</tr>
<tr>
<td>(a,b) * (c,d)</td>
<td>=</td>
<td>((a*c) , (a*d+b*c))</td>
</tr>
<tr>
<td>(a,b) / (c,d)</td>
<td>=</td>
<td>((a/c) , ((b*c-a*d)/(c*c)))</td>
</tr>
</table>
<p>In a dual number (a,b) &#8220;a&#8221; is the &#8220;large&#8221; or &#8220;standard&#8221; part of the number.  You can check from the arithmetic table that the pair of dual numbers (a,0) and (c,0) behave just as we would expect the real numbers a and c to behave.  In the dual number (a,b) &#8220;b&#8221; is the &#8220;small&#8221; or &#8220;ideal&#8221; portion of the number.  From the multiplication rule above we can observe two rules: (0,b) * (c,0) = (0,b*c) (something small times anything else is small) and (0,b)*(0,d) = (0,0) (two small things become zero when multiplied).  Essentially the dual numbers are carrying around the first two terms of a Taylor series: we get as a result both the function value and the function derivative.  For a function f() over the real numbers we extend f() to work over the dual numbers by defining: f((a,b)) = (f(a),b f&#8217;(a)) (which is consistent with the previously defined arithmetic). We can check that the dual numbers obey the usual laws of arithmetic (associative, commutative, distributive and identities, with multiplicative inverses for every number whose standard part is non-zero).  The punchline is that over the dual numbers the divided difference estimate of f&#8217;(x) (the derivative of f() evaluated at x)  is in fact exact in the sense that f((x,1)) = (f(x),f&#8217;(x)) (or f((x,0)+(0,1)) &#8211; f((x,0)) = (0, f&#8217;(x))).  Implementing the DualNumber class is little more than transcribing the above arithmetic table into Scala.</p>
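<p>To show how little is involved, here is a minimal stand-alone sketch of such a transcription.  It is not the DualNumber class from the distributed jar; the names Dual and DualDemo are ours, and the test function is an arbitrary example.  Note that the quotient rule divides by the square of the divisor&#8217;s standard part.</p>

```scala
// A minimal dual number: `a` is the standard part, `b` the ideal (derivative) part.
final case class Dual(a: Double, b: Double) {
  def +(that: Dual): Dual = Dual(a + that.a, b + that.b)
  def -(that: Dual): Dual = Dual(a - that.a, b - that.b)
  def *(that: Dual): Dual = Dual(a * that.a, a * that.b + b * that.a)
  def /(that: Dual): Dual =
    Dual(a / that.a, (b * that.a - a * that.b) / (that.a * that.a))
}

object DualDemo {
  // f(x) = x^3 + x/(x+1), written using only the four arithmetic operations.
  def f(x: Dual): Dual = x * x * x + x / (x + Dual(1.0, 0.0))

  // Seed the ideal part with 1 and read off (f(x), f'(x)) from the result.
  def valueAndDerivative(x: Double): (Double, Double) = {
    val r = f(Dual(x, 1.0))
    (r.a, r.b)
  }
}
```

<p>For example DualDemo.valueAndDerivative(2.0) returns a pair close to (8 + 2/3, 12 + 1/9), which agrees with differentiating f by hand: f&#8217;(x) = 3x*x + 1/((x+1)*(x+1)).  No step size was chosen and no symbolic formula for f&#8217; was ever written down.</p>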
<p>We have already seen how to write code that uses NumberBase[Y] types (genericFx() itself is an example).  A more complicated example is the CG.minimize() code which not only accepts a generic function (in the form of VectorFN) but then specializes it to NumberBase[DualNumber] to compute gradients and also specializes to NumberBase[MDouble] for quick calculation during line searches (MDouble is just an adapter for machine Doubles, used for speed).  The ability to re-specialize a function is one of the advantages of a parameterized type system.  The DualNumbers are an example of forward automatic differentiation.  We could also use the same object framework to capture a representation of the computation path and apply more sophisticated methods such as reverse automatic differentiation. </p>
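<p>The re-specialization idea can be sketched in a few lines.  The code below is a simplified stand-in and not the NumberBase, MDouble, DualNumber or CG code from the jar: the trait Num, the classes Plain and D, and the object Respecialize are hypothetical names of ours.  It also uses a plain square root where the article uses smoothSQRT, so it assumes the evaluation point never coincides with a target point.</p>

```scala
// Hypothetical minimal number interface (a stand-in for the article's NumberBase).
trait Num[T <: Num[T]] {
  def +(that: T): T
  def -(that: T): T
  def *(that: T): T
  def sqrt: T
  def lift(x: Double): T // embed a constant into the same number system
}

// Plain doubles: used when only the function value is needed.
final case class Plain(v: Double) extends Num[Plain] {
  def +(that: Plain): Plain = Plain(v + that.v)
  def -(that: Plain): Plain = Plain(v - that.v)
  def *(that: Plain): Plain = Plain(v * that.v)
  def sqrt: Plain = Plain(math.sqrt(v))
  def lift(x: Double): Plain = Plain(x)
}

// Dual numbers: value plus derivative with respect to the seeded coordinate.
final case class D(a: Double, b: Double) extends Num[D] {
  def +(that: D): D = D(a + that.a, b + that.b)
  def -(that: D): D = D(a - that.a, b - that.b)
  def *(that: D): D = D(a * that.a, a * that.b + b * that.a)
  def sqrt: D = { val r = math.sqrt(a); D(r, b / (2.0 * r)) }
  def lift(x: Double): D = D(x, 0.0)
}

object Respecialize {
  val targets: Array[Array[Double]] =
    Array(Array(20.0, 0.0), Array(-1.0, -1.0), Array(-1.0, 1.0))

  // One generic definition of the sum-of-distances objective.
  def cost[T <: Num[T]](p: Array[T]): T = {
    val zero = p(0).lift(0.0)
    targets.foldLeft(zero) { (total, t) =>
      val sumSq = p.indices
        .map { i => val d = p(i) - p(i).lift(t(i)); d * d }
        .foldLeft(zero)(_ + _)
      total + sumSq.sqrt
    }
  }

  // Specialization 1: quick evaluation over plain doubles.
  def value(p: Array[Double]): Double = cost(p.map(Plain(_))).v

  // Specialization 2: gradient, one forward pass per coordinate, seeding
  // coordinate i with ideal part 1 and every other coordinate with 0.
  def gradient(p: Array[Double]): Array[Double] =
    Array.tabulate(p.length) { i =>
      val seeded = Array.tabulate(p.length)(j => D(p(j), if (i == j) 1.0 else 0.0))
      cost(seeded).b
    }
}
```

<p>The objective is written once, and the caller decides which arithmetic to run it over.  The cost of this forward scheme is one evaluation of the objective per coordinate of the gradient, which is part of the motivation for the reverse methods mentioned above.</p>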
<p>We give a link to a jar containing <a href="http://www.win-vector.com/dfiles/ScalaDiff.jar">complete Scala source code</a> including this example, the DualNumber implementation, a conjugate gradient implementation and some JUnit tests (all under a GPLv3 license) and will go on to describe some of the design decisions.  The code is the bulky part of this work, so we will move on to discuss something more compact: types.</p>
<h2>Types</h2>
<p>If code is ever beautiful it is only when it is succinct.  Among the most succinct forms of code are individual type signatures and interfaces (though the indiscriminate repetition of type signatures is rightly considered ugly bloat, which Scala works to avoid).  Since we are distributing complete source we will describe only types and method signatures.  The entry points to the code are the JUnit tests (organized in the ScalaDiff/test source directory and depending on JUnit, which is not included) and the demo program in ScalaDiff/src/demo/Demo.scala.</p>
<p>To be a usable arithmetic type (like DualNumber or MDouble) you must extend the following parameterized abstract class:</p>
<div class="highlight">
<pre><span class="k">abstract</span> <span class="k">class</span> <span class="nc">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]]</span> <span class="o">{</span>
  <span class="c">// basic arithmetic</span>
  <span class="k">def</span> <span class="o">+</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">-</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">unary_-</span><span class="o">()</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">*</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">/</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>  <span class="c">// that not equal to zero</span>
  <span class="c">// more complicated</span>
  <span class="k">def</span> <span class="n">pow</span><span class="o">(</span><span class="n">that</span><span class="k">:</span><span class="kt">Double</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">exp</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">log</span><span class="k">:</span><span class="kt">NUMBERTYPE</span> <span class="c">// this is positive</span>
  <span class="c">// comparison functions</span>
  <span class="k">def</span> <span class="o">&gt;</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&gt;=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">==</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">!=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&lt;</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&lt;=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="c">// utility</span>
  <span class="k">def</span> <span class="n">field</span><span class="k">:</span><span class="kt">Field</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]</span>
<span class="o">}</span>
</pre>
</div>
<p>In particular DualNumber extends NumberBase[DualNumber].  This deliberate circular reference has a big purpose: it allows publicly visible covariant return types (returning nearly the exact type we really are instead of a base type).  It also allows us to have strict argument types, so that trying to add an MDouble to a DualNumber is a type error (even though they both extend the same base class).  The automatic differentiation technique encapsulated in the DualNumber class only works if all of the calculation is in the DualNumber type, and this strict type enforcement allows the compiler to help prevent results sneaking in and out through other types.  All of the methods on NumberBase are obviously related to arithmetic except the field() method.  This method gives us access to a Field object which is responsible for carrying around the runtime type information (this is a common problem in Java and Scala: some type information known at compile time, such as the choice of type parameters, is not easily accessed at runtime).  The Field class is as follows:</p>
<div class="highlight">
<pre><span class="k">abstract</span> <span class="k">class</span> <span class="nc">Field</span> <span class="o">[</span><span class="kt">NUMBERTYPE</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]]</span> <span class="o">{</span>
  <span class="k">def</span> <span class="n">zero</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>            <span class="c">// return canonical zero in field</span>
  <span class="k">def</span> <span class="n">one</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>             <span class="c">// return canonical one in field</span>
  <span class="k">def</span> <span class="n">inject</span><span class="o">(</span><span class="n">v</span><span class="k">:</span><span class="kt">Double</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>  <span class="c">// return canonical representation of number in field</span>
  <span class="k">def</span> <span class="n">project</span><span class="o">(</span><span class="n">v</span><span class="k">:</span><span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Double</span> <span class="c">// return standard-number represented in field</span>
  <span class="k">def</span> <span class="n">array</span><span class="o">(</span><span class="n">n</span><span class="k">:</span><span class="kt">Int</span><span class="o">)</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]</span> <span class="c">// return an array of this type</span>
<span class="o">}</span>
</pre>
</div>
<p>The Field class is where we keep the factories for numbers (zero, one, arrays, injection from standard Doubles) and the casts (projection back to standard Doubles).</p>
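<p>As a rough illustration of why the Field object earns its keep, here is a cut-down sketch (two arithmetic operations only, and four of the five Field methods).  The class bodies, the MDoubleField object and the sumOfSquares() function are assumptions made up for this example and are not copied from the jar.  The point to notice is that the generic code never names a concrete number type, yet it can still allocate arrays and constants by asking the Field.</p>
<div class="highlight">
<pre>// Cut-down, hypothetical versions of the two abstract classes shown above.
abstract class NumberBase[N &lt;: NumberBase[N]] {
  def +(that: N): N
  def *(that: N): N
  def field: Field[N]
}
abstract class Field[N &lt;: NumberBase[N]] {
  def zero: N
  def inject(v: Double): N
  def project(v: N): Double
  def array(n: Int): Array[N]
}

// A stand-in for the article's MDouble adapter (assumed implementation).
final class MDouble(val value: Double) extends NumberBase[MDouble] {
  def +(that: MDouble): MDouble = new MDouble(value + that.value)
  def *(that: MDouble): MDouble = new MDouble(value * that.value)
  def field: Field[MDouble] = MDoubleField
}
object MDoubleField extends Field[MDouble] {
  val zero: MDouble = new MDouble(0.0)
  def inject(v: Double): MDouble = new MDouble(v)
  def project(v: MDouble): Double = v.value
  // The concrete field knows its element type, so it can build the array.
  def array(n: Int): Array[MDouble] = new Array[MDouble](n)
}

object FieldDemo {
  // Generic code: it only ever sees Y and the Field[Y] it was handed.
  def sumOfSquares[Y &lt;: NumberBase[Y]](field: Field[Y], xs: Array[Double]): Double = {
    val ys: Array[Y] = field.array(xs.length)
    for (i &lt;- xs.indices) ys(i) = field.inject(xs(i))
    var acc: Y = field.zero
    for (y &lt;- ys) acc = acc + y * y
    field.project(acc)
  }
  val total: Double = sumOfSquares(MDoubleField, Array(1.0, 2.0, 3.0)) // 14.0
}
</pre>
</div>
<p>Without the Field argument the line that allocates ys would have no way to know what kind of array to build, which is exactly the missing runtime type information described above.</p>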
<p>With these types defined we can actually read intent off some of the method signatures.  </p>
<p>For example our conjugate gradient optimizer is accessed through the following method signature:</p>
<div class="highlight">
<pre> <span class="k">def</span> <span class="n">minimize</span><span class="o">(</span><span class="n">fn</span><span class="k">:</span><span class="kt">VectorFN</span><span class="o">,</span><span class="n">x0</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="o">(</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],</span><span class="kt">Double</span><span class="o">)</span> <span class="c">// return x,f(x)</span>
</pre>
</div>
<p>The above can be read as: CG.minimize() requires a VectorFN (our trait representing single argument functions with a free type parameter) and an initial point (in standard Doubles).  The code will then return a pair of the optimum point and the function evaluated at the optimum point.  From the type signature we can see that CG.minimize() expects to re-specialize the function &#8220;fn&#8221; to types of its own choosing (else it could have accepted a parameterized argument instead of our custom trait) and will handle all up-conversion and down-conversion between machine Doubles and NumberBase[Y] values itself.  This sort of type information is hard to express (let alone enforce) in a dynamically typed language.</p>
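<p>The re-specialization idea can be sketched in a few lines.  Everything below is a stand-in written for this illustration: the classes reuse the article's names but keep only + and *, and the shape of the VectorFN trait (a single apply method carrying its own type parameter) is my assumption about what such a trait needs to look like, not a copy of the real one.</p>
<div class="highlight">
<pre>// Cut-down, hypothetical stand-ins for the article's types (only + and * kept).
abstract class NumberBase[N &lt;: NumberBase[N]] {
  def +(that: N): N
  def *(that: N): N
}
final case class MDouble(v: Double) extends NumberBase[MDouble] {
  def +(that: MDouble): MDouble = MDouble(v + that.v)
  def *(that: MDouble): MDouble = MDouble(v * that.v)
}
final case class DualNumber(std: Double, eps: Double) extends NumberBase[DualNumber] {
  def +(that: DualNumber): DualNumber = DualNumber(std + that.std, eps + that.eps)
  def *(that: DualNumber): DualNumber =
    DualNumber(std * that.std, std * that.eps + eps * that.std)
}

// Assumed shape of a VectorFN-like trait: the type parameter sits on the method,
// so whoever receives the function (not whoever wrote it) chooses Y.
trait VectorFN {
  def apply[Y &lt;: NumberBase[Y]](x: Array[Y]): Y
}

object Respecialize {
  // f(x0, x1) = x0*x0 + x0*x1, written once for every number type
  val fn: VectorFN = new VectorFN {
    def apply[Y &lt;: NumberBase[Y]](x: Array[Y]): Y = x(0) * x(0) + x(0) * x(1)
  }

  // Plain value: specialize to MDouble.
  def value(f: VectorFN, x: Array[Double]): Double =
    f(x.map(v =&gt; MDouble(v))).v

  // Partial derivative in coordinate i: specialize the same f to DualNumber.
  def partial(f: VectorFN, x: Array[Double], i: Int): Double =
    f(Array.tabulate(x.length)(j =&gt; DualNumber(x(j), if (j == i) 1.0 else 0.0))).eps

  val p = Array(2.0, 5.0)
  val fAtP: Double = value(fn, p)       // 2*2 + 2*5 = 14.0
  val dfdx0: Double = partial(fn, p, 0) // 2*x0 + x1 = 9.0
  val dfdx1: Double = partial(fn, p, 1) // x0 = 2.0
  // MDouble(1.0) + DualNumber(1.0, 0.0) would be rejected by the compiler.
}
</pre>
</div>
<p>A plain function value of type Array[Y]=&gt;Y is pinned to one Y by its caller, which is why an optimizer that wants both fast values and exact gradients has to ask for something like the trait instead.</p>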
<p>A slightly more complicated example is the lineMinD() method:</p>
<div class="highlight">
<pre><span class="k">def</span> <span class="n">lineMinD</span><span class="o">[</span><span class="kt">Y&lt;:NumberBase</span><span class="o">[</span><span class="kt">Y</span><span class="o">]](</span><span class="n">field</span><span class="k">:</span><span class="kt">Field</span><span class="o">[</span><span class="kt">Y</span><span class="o">],
 </span><span class="n">f</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Y</span><span class="o">]</span><span class="k">=&gt;</span><span class="kt">Y</span><span class="o">,
 </span><span class="n">xm</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],
 </span><span class="n">di</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="o">(</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],</span><span class="kt">Double</span><span class="o">)</span>
</pre>
</div>
<p>Notice it is willing to work with any type parameterized function (which means it is willing to let the caller pick the actual type of NumberBase[Y] and work with that).  Most callers will call with Y=MDouble (the wrapper for machine Doubles) and lineMinD() will then work with that (without ever really knowing the actual underlying type).</p>
<p>A lot of fans of dynamic languages consider type systems to be mere hairshirt penance.  But that is not so.  Broken type systems (like Java&#8217;s collections before generics, which are implemented by erasure, were introduced in Java 1.5) are indeed more trouble than they are worth.  Working type systems (like C++ Templates/STL, Java 1.5+ and Scala) allow you to solve problems (and enforce decisions) during the design phase (which is much, much cheaper than during the deployment phase).  You can&#8217;t set your types in stone (you are likely going to have them subtly wrong for the first few iterations).  You must be willing to think like a &#8220;language lawyer&#8221; to find out what parts of your work can be specified and enforced in the language type system.  To use an analogy: static types are your blueprint or your underpainting.</p>
<h2>Tests</h2>
<p>One argument against static types is that you can get much of their benefit from unit tests.  My opinion is you never have enough unit tests, so putting more pressure on your test suite is not wise.   Static types plus tests are strictly more powerful than static types alone or tests alone. </p>
<p>Even for this toy-scale example project we have included a JUnit test set to pursue a number of goals:</p>
<ul>
<li>Confirm our number implementations (DualNumber and MDouble) correctly model machine Doubles (perform parallel calculations and compare).</li>
<li>Confirm DualNumber obeys expected laws of algebra composition and cancellation <em>including the portions that can not be modeled in machine Doubles</em>.</li>
<li>Confirm DualNumbers compute gradients.</li>
<li>Confirm operations of optimizers and optimizer components.</li>
</ul>
<p>Many of these tests are related, but they don&#8217;t all imply each other and they give different perspectives on the errors they catch.  For example no amount of parallel computation between DualNumbers and machine Doubles is going to confirm the infinitesimal portion of the DualNumber is propagating correctly (since this is not a property of machine Doubles).  So we add extra tests that expect DualNumber to obey algebraic relations like: a*(b+c) = a*b + a*c.  It is then another step to confirm that whatever the DualNumbers calculate is not only self-consistent, but also models a truncated Taylor Series or differentiation.</p>
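<p>The two kinds of check described above can be sketched as follows.  This is not taken from the JUnit suite in the jar: the two-operation Dual class, the sample values and the tolerances are all assumptions picked for the illustration, and plain Boolean values stand in for JUnit assertions so the fragment needs no test library.</p>
<div class="highlight">
<pre>// Hypothetical two-operation dual number, redefined here so the sketch stands alone.
final case class Dual(std: Double, eps: Double) {
  def +(that: Dual): Dual = Dual(std + that.std, eps + that.eps)
  def *(that: Dual): Dual = Dual(std * that.std, std * that.eps + eps * that.std)
}

object DualLawCheck {
  val a = Dual(2.0, 0.5)
  val b = Dual(-1.0, 4.0)
  val c = Dual(3.0, -2.0)
  // Distributive law, infinitesimal part included.  These small values keep the
  // floating point arithmetic exact, so == is safe for this particular check.
  val distributes: Boolean = a * (b + c) == a * b + a * c

  // Self-consistency is not enough: compare the eps part to a central difference.
  def g(x: Dual): Dual = x * x * x
  def gStd(x: Double): Double = x * x * x
  val h = 1.0e-4
  val numeric: Double = (gStd(1.5 + h) - gStd(1.5 - h)) / (2.0 * h)
  val exact: Double = g(Dual(1.5, 1.0)).eps // 3 * 1.5 * 1.5 = 6.75
  val agrees: Boolean = math.abs(numeric - exact) &lt; 1.0e-6
}
</pre>
</div>
<p>The first check could pass for an arithmetic that is consistent but wrong; the second ties the small part back to an actual derivative.</p>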
<h2>Conclusion</h2>
<p>We hope we have demonstrated how the complexity of a mathematical programming problem can be managed by breaking the problem into an objective function that is separate from the optimizer (allowing the optimizer to be both good and hidden) and by using a static type system (such as Scala&#8217;s) to help enforce required properties of a calculation (such as all numbers being routed through a required representation).  With these sorts of tools available many formerly hard problems (that are often, unfortunately, solved by over-specifying direct inefficient iterative improvement techniques) become &#8220;if I can write a reasonable objective function this may already be solved by an optimizer in my library.&#8221;  The more of these tools you have (either in your code or in your reference library) the more of these problems become easy (this is the topic of my earlier paper: <a href="http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/">The Local to Global Principle</a>).</p>
<h2>Appendix: Fixing Smoothness</h2>
<p>Our chosen example objective function is very nice (i.e. convex) but it has a small (but correctable) problem.  The derivative or gradient has some jump discontinuities that could cause an optimizer to exit prematurely (not at the global optimum).  Consider the simple form of this for wiring a center to a single point at the origin (even in 1 dimension).  The wiring cost function sqrt(x*x) has a cost graph as shown here.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/abs.png" alt="abs.png" border="0" width="525" height="525" /><br />
</center></p>
<p>This is convex, but the derivative is not smooth (it jumps at the origin), as we see in the included graph of the derivative of sqrt(x*x).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/dabs.png" alt="dabs.png" border="0" width="525" height="525" /><br />
</center></p>
<p>So: in this case if the optimizer stops at one of the target points we can&#8217;t be sure that it stopped at the global optimum (it may have stopped due to the discontinuity in the gradient).  For some simple problems the optimum is necessarily at a target point.  For example on the number line take the target points 0,1 and x.  As long as x&ge;0 and x&le;1 the optimum placement will be x itself.</p>
<p>One way to defend against this is to use some sort of smoothed version of sqrt() that essentially decreases a little faster near the origin.  Our cost function becomes:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/cost2.png" alt="cost2.png" border="0" width="237" height="55" /><br />
</center><br />
where s() is our suitable approximation of the sqrt() function.  Two candidates are s(x) = (x+tau)^(1/2) and s(x) = x^(1/2 + tau); where tau is a small constant.  As long as tau is greater than zero we have no derivative discontinuity in s(x^2) and convexity is preserved (even made a bit stricter).  Other ways to deal with this include adding additional coordinates to the problem and small perturbations on these coordinates.  Finally, a point found by optimizing with respect to s(x) can be &#8220;polished&#8221; by re-starting the optimization at the first found solution and using sqrt(x) as the new objective (if the original point is not near any of the target points).</p>
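<p>For the one dimensional example the effect of the first candidate is easy to check numerically.  In the sketch below the value of tau and all of the function names are assumptions chosen for illustration (the text above only asks for a small positive constant), and the derivatives are written out by hand instead of coming from DualNumbers.</p>
<div class="highlight">
<pre>object SmoothCost {
  val tau: Double = 1.0e-6 // assumed value; any small positive constant will do

  // Original cost sqrt(x*x) = |x| and its derivative away from the origin
  def cost(x: Double): Double = math.sqrt(x * x)
  def dCost(x: Double): Double = math.signum(x)

  // Smoothed cost s(x*x) with s(x) = (x + tau)^(1/2), and its derivative
  def smoothCost(x: Double): Double = math.sqrt(x * x + tau)
  def dSmoothCost(x: Double): Double = x / math.sqrt(x * x + tau)

  // Change in slope when stepping across the origin
  val jump: Double = dCost(1.0e-9) - dCost(-1.0e-9)                   // 2.0
  val smoothJump: Double = dSmoothCost(1.0e-9) - dSmoothCost(-1.0e-9) // roughly 2e-6
}
</pre>
</div>
<p>The smoothed slope passes through zero at the origin instead of leaping from -1 to +1, and away from the origin the two costs differ by very little (at x = 1 the difference is about tau/2).</p>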


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Permanent Link: Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Postel&#8217;s Law: Not Sure Who To Be Angry With</title>
		<link>http://www.win-vector.com/blog/2010/02/postels-law-not-sure-who-to-be-angry-with/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=postels-law-not-sure-who-to-be-angry-with</link>
		<comments>http://www.win-vector.com/blog/2010/02/postels-law-not-sure-who-to-be-angry-with/#comments</comments>
		<pubDate>Fri, 26 Feb 2010 00:38:55 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[Postel's Law]]></category>
		<category><![CDATA[Unit Testing]]></category>
		<category><![CDATA[Worse is Better]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1394</guid>
		<description><![CDATA[One of my research interests is finding the principles that underlie the management of information, complexity and uncertainty. When something as simple as a web-form is called &#8220;technology&#8221; it is time to step back and examine your principles. One principle I am not sure about is Postel&#8217;s law. It doesn&#8217;t hold often enough to be relied [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Permanent Link: Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
<li><a href='http://www.win-vector.com/blog/2008/10/something-i-dont-get-about-business-and-bailouts/' rel='bookmark' title='Permanent Link: Something I don&#8217;t get about business and bailouts'>Something I don&#8217;t get about business and bailouts</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>One of my research interests is finding the principles that underlie the management of information, complexity and uncertainty.  When something as simple as a web-form is called &#8220;technology&#8221; it is time to step back and examine your principles.  One principle I am not sure about is Postel&#8217;s law.  It doesn&#8217;t hold often enough to be relied on and when it fails I am not sure who to be angry with.<span id="more-1394"></span></p>
<p>Postel&#8217;s Law (also called The Robustness Principle) comes from RFC 761 &#8220;Transmission Control Protocol&#8221; in 1980 and is: &#8220;Be conservative in what you do; be liberal in what you accept from others.&#8221; (Side note: RFC is the now ironic acronym used to describe Internet standards- the letters stood for &#8220;request for comments&#8221;).</p>
<p>This idea probably worked best where it started- in the TCP/IP world which is where a lot of hairy details of how computers network are handled.  When your goal is the basic establishment of transient communications- success is measured by getting information through and not unnecessarily triggering a failure.  It may be okay to tolerate mistakes here- because they don&#8217;t live long anyway.</p>
<p>Unfortunately, the law works less well other places where it is applied.  The law is a downright hazard in dealing with archiving meaningful data (instead of managing transient signaling protocols). Sometimes the cost of obeying the law far outweighs the potential benefit.</p>
<p>A common arena for Postel&#8217;s law is now HTML, the markup language used to represent the content we view in web-browsers.  In this arena Postel&#8217;s law has had two consequences: one good, one bad.</p>
<p>The good: almost anyone can create a working web-page or even a web-site because modern browsers have been designed to paper around almost every common HTML mistake.  It has been pointed out that this ease of creation and &#8220;worse is better&#8221; (a deep principle due to Richard P. Gabriel, see <a target='ext' href="http://en.wikipedia.org/wiki/Worse_is_better">wikipedia: worse is better</a>) has been one of the reasons that HTML out-competed and killed many other ideas.  Philip Greenspun&#8217;s famous <a target='ext' href="http://philip.greenspun.com/panda/html">story</a> of a 10-year-old building a web site to get his mother medical attention happened in the sloppy world of HTML and could not have happened in the straitjacket of RDF (Resource Description Framework: the darling of the semantic web).  I would not wish having to actually read or adhere to the incredibly long and irrelevant standards from w3.org (where the ratio of value to pedantry goes to zero) on an enemy.  The web is only interesting due to its content and much of its content was only possible due to the low barrier to entry.</p>
<p>The bad: to read HTML you almost have to re-create the entire history of web-browsers.  This is a history of many hostile competitors (Microsoft, Netscape, Opera, WebKit, Mozilla, Google) and billions of dollars.  Reproducing a significant fraction of this history is a significant (and useless) expense.  For the most part I use a permissive parsing library like TagSoup or HTMLTidy but even these miss some things that browsers accept and are far more complicated than the task truly justifies.</p>
<p>Even worse are the cases of XML and RDF.  These are often used for archival storage of semantic data.  That is, you may need to read and understand (not just display) data in XML for a long time.  To be liberal in what you accept you have to again master a long set of useless complications (DTDs, namespaces, incredibly inept character encodings and escapes) and still get burned by improperly encoded XML (that &#8220;used to work&#8221; because the bugs in the emitted XML matched the bugs in a library that is now out of date).</p>
<p>It is clear in the case of HTML and XML that Postel&#8217;s law&#8217;s cost is too high for what it delivers.  Or at least half of the law is too expensive: no amount of being generous in what we accept makes up for the original data not having been produced correctly (not being &#8220;conservative in what they do&#8221; and not having checked that at the time it was created).  Some of this is that the producers of the data have no way of telling they are not being &#8220;conservative in what they do&#8221; because the &#8220;generous in what they accept&#8221; libraries they use to debug don&#8217;t tell them they are emitting bad data.  And let&#8217;s be honest- most systems are not designed for correctness, they are instead debugged until they seem to work.  I would say that in fact HTML is not an example of the power of Postel&#8217;s law but of the pernicious influence of &#8220;worse is better.&#8221; Computer science has not risen to the level of &#8220;software engineering;&#8221; we are still a horrible &#8220;fit to finish&#8221; industry.</p>
<p>Frankly for many things we need a simpler &#8220;fail early&#8221; discipline.  Tools need to be better and standards need to be simpler so that if you write something that is wrong it is easy to see why it is wrong and easy to fix it.  Postel&#8217;s law has helped hide the negative impacts of complicated standards; we need to push the cost of complications back on to standards committees.  The need to be &#8220;generous in what you accept&#8221; overly favors large, rich entrenched players who have had the time and resources to incrementally invest in papering around every common mistake.</p>
<p>However, I am not sure if we can throw out half of Postel&#8217;s law or even if we want to.  When Postel&#8217;s law fails it is not clear who to be mad at.</p>
<p>Sun, to kick somebody who is already down, was famous for making elaborate frameworks that correctly and brutally implement many details of RFCs.  Sun&#8217;s Java includes huge frameworks for XML, UTF8 and email that scrupulously implement page after page of useless standard documentation but fail in the wild due to not being &#8220;generous in what they accept.&#8221;  For example Sun&#8217;s GlassFish (which got listed as one of four or five important assets during Sun&#8217;s various acquisition talks, much like the fact that a car has cup-holders somehow always gets mentioned in spec sheets) is an &#8220;open source production-quality enterprise software application server.&#8221;  A supposedly major component of GlassFish is its email component, which is a huge unwieldy framework that implements many of the email related RFCs and protocols including IMAP.  Unfortunately for all its hugeness it can not reliably read email folder names from one of the biggest IMAP servers: Google Mail.  Google Mail includes &#8220;against standard&#8221; characters in the protocol and crashes the GlassFish software.</p>
<p>And here is where Postel&#8217;s law fails us: under Postel&#8217;s law both sides are at fault (Sun for failing to be generous in what they accepted, Google for failing to be conservative in what they did).  We can&#8217;t assign only one villain.  We have no prescription of who to ask for a fix.  Postel&#8217;s law seems useful in that if either Google or Sun had followed it the two systems would work.  But the law doesn&#8217;t pick one side to assign blame and help us to efficiently diagnose and fix the problem.  It becomes difficult to find the critical bugs when they are masked by a sea of &#8220;acceptable&#8221; bugs.  Take a contrary example: the simpler law &#8220;implement the standard or fix the standard&#8221; would clearly assign blame to GMail.</p>
<p>Similar pain is encountered in Java&#8217;s handling of character encodings like UTF8.  It is hard to move up the stack of artificial intelligence (from words, to concepts, to ideas, to reasoning to consciousness) when you can&#8217;t even reliably transcribe characters.  When faced with bad character sequences (a common occurrence on the web) there is no practical way to get Java to &#8220;mostly parse it.&#8221;  Java library and framework authors seem to take a perverse joy in throwing a program-killing exception (it does not matter if you catch it, the library has already stopped doing what you wanted) because they are concerned that a diacritical mark was not properly encoded (web browsers, on the other hand, lose the mark or show some sort of damage near the mistake and blunder on).  And here is where the frustration sets in: how can you make applications that are generous in what they accept when the libraries and frameworks are overly proud and picky?  This, at first, seems like an argument for Postel&#8217;s law- if everybody else (especially the library authors) were generous in what they accepted your life could be easy.  That is certainly one possibility- but I argue it often becomes a matter of semantics to assign blame where there is no pre-existing specification or performance agreement.  In the end you will waste more time dealing with errors that should never have made it to you than the time you save emitting the odd error of your own.</p>
<p>The unit testing people have a somewhat better idea: fail early, fail at the factory where it is cheap to fix.  Don&#8217;t   litter all of your code with indecisive statements like:</p>
<pre>
  Set&lt;String&gt; matches = computeMatches();
  if( matches!=null ) {
     for(String match: matches) {
         ...
     }
  }
</pre>
<p>Instead: write a unit test to document your expectation that the empty set is expressed in a single consistent way:</p>
<pre>
   Set&lt;String&gt; matches = computeMatches();
   assertNotNull(matches);
</pre>
<p>And from then on write more confident code:</p>
<pre>
  for(String match: computeMatches()) {
         ...
  }
</pre>
<p>This may seem overly optimistic and overly strict- but I have a point.  One of the few good principles in computer science (and perhaps one of computer science&#8217;s contributions to knowledge; computers are a huge contribution to society- but they were made by engineers) is composition.  A plan for getting from A to B followed by (or composed with) a plan for getting from B to C is a plan for getting from A to C.  That works when a correct plan for getting from A to B is composed with a correct plan for getting from B to C.  If instead each of the plans &#8220;is mostly right if the piece after is so nice to fix up a few mistakes&#8221; you really don&#8217;t know what you have.  You may have nothing.</p>
<p>That is my complaint- you can&#8217;t put an a priori bound on how expensive attempting to allow both sides of Postel&#8217;s law will be.  You would like others to paper over your mistakes, but it is becoming too expensive to paper over the mistakes of others.  In the end Postel&#8217;s law is of little help when cleaning up the inevitable mess.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Permanent Link: Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
<li><a href='http://www.win-vector.com/blog/2008/10/something-i-dont-get-about-business-and-bailouts/' rel='bookmark' title='Permanent Link: Something I don&#8217;t get about business and bailouts'>Something I don&#8217;t get about business and bailouts</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/02/postels-law-not-sure-who-to-be-angry-with/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Local to Global Principle</title>
		<link>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=the-local-to-global-principle</link>
		<comments>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 16:37:53 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Dynamic Programming]]></category>
		<category><![CDATA[Local to Global]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[PageRank]]></category>
		<category><![CDATA[Problem Solving]]></category>
		<category><![CDATA[Speech Recognition]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1123</guid>
		<description><![CDATA[We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it [...]


]]></description>
			<content:encoded><![CDATA[<p>We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods.  We have produced both a stand-alone <a href="http://www.win-vector.com/dfiles/LocalToGlobal.pdf">PDF</a> (more legible) and an HTML/blog form (more skimmable).<br />
<span id="more-1123"></span></p>
<h1 align="center">The Local to Global Principle</h1>
<p align="center"><strong>John Mount<a name="tex2html3" href="#foot21" id="tex2html3"><sup>1</sup></a></strong></p>
<p></p>
<p align="center"><b>Date:</b> November 11, 2009</p>
<hr />
<h3>Abstract:</h3>
<div>We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods.</div>
<p></p>
<h2><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">Contents</a></h2>
<p><!--Table of Contents--></p>
<ul>
<li><a name="tex2html32" href="#SECTION00020000000000000000" id="tex2html32">Introduction</a></li>
<li><a name="tex2html33" href="#SECTION00030000000000000000" id="tex2html33">The Examples</a>
<ul>
<li><a name="tex2html34" href="#SECTION00031000000000000000" id="tex2html34">Web Page Link Analysis</a></li>
<li><a name="tex2html35" href="#SECTION00032000000000000000" id="tex2html35">Natural Language Processing</a></li>
<li><a name="tex2html36" href="#SECTION00033000000000000000" id="tex2html36">Machine Learning</a></li>
</ul>
<p></li>
<li><a name="tex2html37" href="#SECTION00040000000000000000" id="tex2html37">Some Methods</a>
<ul>
<li><a name="tex2html38" href="#SECTION00041000000000000000" id="tex2html38">Local Methods</a></li>
<li><a name="tex2html39" href="#SECTION00042000000000000000" id="tex2html39">Globalization Methods</a></li>
</ul>
<p></li>
<li><a name="tex2html40" href="#SECTION00050000000000000000" id="tex2html40">Conclusion</a></li>
<li><a name="tex2html41" href="#SECTION00060000000000000000" id="tex2html41">Bibliography</a></li>
<li><a name="tex2html42" href="#SECTION00070000000000000000" id="tex2html42">Acknowledgement</a></li>
</ul>
<p><!--End of Table of Contents--></p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">Introduction</a></h1>
<p><font>A common vain hope of computer scientists and algorithm designers is that a domain expert has already &#8220;boiled down&#8221; a problem to a precise, but unsolved, algorithmic core. On this point the mathematician Gian-Carlo Rota wrote:</font></p>
<blockquote><p><font>One of the rarest mathematical talents is the talent for applied mathematics, for picking out of a maze of experimental data the two or three parameters that are relevant, and to discard all other data. This talent is rare. It is taught only at the shop level.[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;A Mathematician&#8217;s Gossip&#8221;]</font></p></blockquote>
<p><font>We describe a useful tool for designing algorithmic applications and solutions which we call &#8220;the local to global principle.&#8221; The local to global principle is the method of deriving applications and solutions by specifying &#8220;local&#8221; (and deliberately myopic) heuristics, critiques and methods followed by using a powerful general method to &#8220;globalize&#8221; this specification into a complete solution.</font></p>
<p><font>There are many important problem solving prescriptions and methods of thought already systematically described and taught:</font></p>
<ul>
<li>Bacon&#8217;s &#8220;New Organon&#8221; and Mill&#8217;s principles of inductive logic.[<a href="#Mill">Mil02</a>]</li>
<li>Feynman&#8217;s genius method.[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;Ten Lessons I Wish I Had Been Taught&#8221;]</li>
<li>Reductionism (top down and bottom up).</li>
<li>Divide and conquer.[<a href="#IntroductionToAlgorithms">CLRS09</a>]</li>
<li>Forward deduction, backwards induction.</li>
<li>Root Cause Analysis.</li>
<li>Polya&#8217;s heuristic and conjecture and prove patterns [<a href="#citeulike:679515">Pol71</a>,<a href="#Polya1">Pol54a</a>,<a href="#Polya2">Pol54b</a>]</li>
<li>Doron Zeilberger&#8217;s &#8220;Method of Undetermined Generalization and Specialization.&#8221; [<a href="#Zeilberger:1995p277">Zei95</a>]</li>
<li>Zbigniew Michalewicz and David B. Fogel&#8217;s presentation of evolutionary algorithms.[<a href="#HTSMH">MF00</a>]</li>
</ul>
<p><font>The local to global principle is more of an organizational pattern than a &#8220;computer aided technique&#8221; as no one specific species of software or family of notation is required.</font></p>
<p><font>The local to global principle can be identified in a number of previous important applications, but it is not currently an identified principle.<a name="tex2html4" href="#foot244" id="tex2html4"><sup>2</sup></a> The principle is very general, so any succinct description of it is going to be painfully vague. Instead, we explain the principle by discussing some example applications and methods.  For each of our example applications we deliberately use a different globalization technique. The effective algorithmist or practitioner must in fact come to each problem already familiar with a reasonably large set of already known local and global techniques, so we conclude with some appropriate fields of study and preparation.</font></p>
<p><font>The local to global principle is divided into two parts: local encoding of the problem followed by a globalization step that uses the encoding. The guiding feature of local encodings is that they are usually easy to compute from the data at hand. Any extension that looks like enumeration, search or optimization is best left to the global step. The local step is essentially the translation of your problem into an abstract language that is ready for the globalization step. In contrast globalization methods are often &#8220;off the shelf&#8221; in that once you abstract and encode the particulars of your problem you can look for pre-existing useful methods or software to finish your solution. The idea of globalization is to find a best overall or global compromise between competing local criteria. The local step does not so much have to avoid conflicts but instead &#8220;price them.&#8221; There is also an important trade-off that sophisticated local techniques allow the use of simpler globalization methods and more powerful globalization methods allow the use of simpler local techniques.</font></p>
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">The Examples</a></h1>
<p><font>To demonstrate the breadth of the local to global principle we choose a diverse collection of example applications: web page link analysis, natural language processing and machine learning. For each example application we will set up the problem, introduce a reasonable set of local criteria and pick an appropriate globalization technique. We will favor finishing each example without describing the globalization technique in detail, as this would distract from our point and is best left to the given references. These examples are previously solved problems; our contribution is demonstrating the shared underlying principle.</font></p>
<h2><a name="SECTION00031000000000000000" id="SECTION00031000000000000000">Web Page Link Analysis</a></h2>
<p><font>For our first example application we demonstrate web page link analysis in the form of the famous PageRank score.[<a href="#Page:1998p2689">PBMW98</a>]</font></p>
<p><font>One of the many good ideas leading up to the early Google search engine was the design of a non-text based measure of importance or interestingness of web pages. A search engine that could fold &#8220;interestingness&#8221; or popularity into its notion of relevance could better sort important pages into the search user&#8217;s view. When the web got so large that there were many pages that were exact matches to any common user query, popularity became a critical consideration. A link based notion of popularity exploits what is important about the web (the link structure, for example see [<a href="#Kleinberg:1997p32">Kle97</a>]) and avoids having to depend on a lot of natural language understanding technology. This technique also uses authority outside of the given page, so has some hope of being resistant (though not immune) to web-spam.</font></p>
<p><font>Taken all at once, designing a score of page importance is a daunting task. However, by working in stages (as the local to global principle prescribes) we can quickly derive interesting scores including the famous PageRank score. We start with the idea that popularity (or the amount of web traffic a page receives) is (loosely) correlated with importance. So for our first approximation step we decide to try to estimate popularity (or web traffic) and use this estimate as our importance score. Accurately estimating web traffic is itself a hard problem and a big industry (just a few of the major companies involved in this are: Google/Urchin, Quantcast, Nielsen, comScore, Alexa, Hitwise and LookSmart). For our second approximation step we are going to try to estimate popularity from the link structure<a name="tex2html6" href="#foot43" id="tex2html6"><sup>4</sup></a> of the web (using no other measurements or historic data) and use this as our score. This link based estimate is unlikely to completely reproduce real web surfing patterns, but it is very interesting in its own right and has been proven in the market to be a useful score.</font></p>
<p><font>Now the problem is to try to estimate the popularity of a web page from the link structure of the web. We claim: we can generate a useful (but not necessarily accurate) estimate of web traffic from the web&#8217;s link structure alone. Consider Figure&nbsp;<a href="#fig:Links1">1</a> where we have a universe of three web pages A, B and C that link to each other in the pattern illustrated by what is called a graph<a name="tex2html7" href="#foot45" id="tex2html7"><sup>5</sup></a>.</font></p>
<div align="center"><a name="fig:Links1" id="fig:Links1"></a><a name="50"></a></p>
<table>
<caption align="bottom"><strong>Figure 1:</strong> A set of Mutually Linked Web Pages</caption>
<tr>
<td>
<div align="center"><img width="300" height="436" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/Links1.png" alt="Image Links1"></div>
</td>
</tr>
</table>
</div>
<p><font>In Figure&nbsp;<a href="#fig:Links1">1</a> we can consider each link to another page as evidence the other page is interesting or popular. One idea is to simulate a very simple web surfer who clicks on the links on a page uniformly at random. This is called &#8220;the random surfer model&#8221; and even a model this simple allows us to read some useful information from the link structure of the web. For instance, we could ask what fraction of their time the random surfer spends on each web page, with an eye to the idea that the pages the random surfer visits more often are the more important ones. Let <img width="35" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg2.png" alt="$ p(A)$"> denote the proportion of time the random web surfer spends on page A (and define <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg3.png" alt="$ p(B)$"> and <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg4.png" alt="$ p(C)$"> similarly). While we do not know any of <!-- MATH<br />
 $p(A), p(B)$<br />
 --><br />
<img width="76" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg5.png" alt="$ p(A), p(B)$"> or <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg4.png" alt="$ p(C)$"> we can derive some relationships between them by inspecting the link graph:</font></p>
<p></p>
<div align="center"><!-- MATH<br />
 \begin{eqnarray*}<br />
p(A) &#038; = &#038; \frac{1}{2} P(B) + P(C) \\<br />
p(B) &#038; = &#038; \frac{1}{2} P(A) \\<br />
p(C) &#038; = &#038; \frac{1}{2} P(A) + \frac{1}{2} P(B) .<br />
\end{eqnarray*}<br />
 --></p>
<table cellpadding="0" align="center" width="100%">
<tr valign="middle">
<td nowrap align="right"><img width="35" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg6.png" alt="$\displaystyle p(A)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="109" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg8.png" alt="$\displaystyle \frac{1}{2} P(B) + P(C)$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg9.png" alt="$\displaystyle p(B)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="52" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg10.png" alt="$\displaystyle \frac{1}{2} P(A)$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg11.png" alt="$\displaystyle p(C)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="125" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg12.png" alt="$\displaystyle \frac{1}{2} P(A) + \frac{1}{2} P(B) .$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><font>The first equation is just reading from the graph that: all visits on page-A must come from pages B and C, half of the visitors on page-B continue on to A and all of the visitors on page-C continue on to A. The second and third equations are the appropriate summaries of how traffic is routed to pages B and C. We can insist that <!-- MATH<br />
 $P(A) + P(B)<br />
+ P(C) = 1$<br />
 --><br />
<img width="183" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg13.png" alt="$ P(A) + P(B) + P(C) = 1$"> as we want these numbers to represent the fraction of time the random web surfer spends on each page. A more sophisticated model would add more features<a name="tex2html9" href="#foot245" id="tex2html9"><sup>6</sup></a> to get a more useful result.</font></p>
<p><font>It turns out we have already encoded enough local rules to completely determine <!-- MATH<br />
 $P(A), P(B)$<br />
 --><br />
<img width="85" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg14.png" alt="$ P(A), P(B)$"> and <img width="40" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg15.png" alt="$ P(C)$"> . In this example application an algorithmist already familiar with linear algebra&nbsp;[<a href="#Strang">Str76</a>] would recognize these local conditions as &#8220;a system of linear equations.&#8221; Solving even web-scale systems of linear equations is considered easy with modern techniques and modern computers. For our small example the solution is: <!-- MATH<br />
 $p(A) = \frac{4}{9}$<br />
 --><br />
<img width="67" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg16.png" alt="$ p(A) = \frac{4}{9}$"> , <!-- MATH<br />
 $p(B) = \frac{2}{9}$<br />
 --><br />
<img width="68" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg17.png" alt="$ p(B) = \frac{2}{9}$"> , and <!-- MATH<br />
 $p(C) = \frac{3}{9}$<br />
 --><br />
<img width="67" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg18.png" alt="$ p(C) = \frac{3}{9}$"> . The role of the local steps was to reduce a new problem (estimating the importance or popularity of a web page from the link structure) to something with <em>already known</em> techniques (like solving a linear system as illustrated in Figure&nbsp;<a href="#fig:LinAlg">2</a>).</font></p>
<div align="center"><a name="fig:LinAlg" id="fig:LinAlg"></a><a name="79"></a></p>
<table>
<caption align="bottom"><strong>Figure 2:</strong> Linear Algebra Solution: As Taught in School</caption>
<tr>
<td>
<div align="center"><img width="400" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LinAlg.jpg" alt="Image LinAlg"></div>
</td>
</tr>
</table>
</div>
<p><font>So page-A is the most important page by the PageRank measure.</font></p>
<p><font>In this example application the local step was setting up the system of linear equations (which are easy to derive from the web link graph) and the global step was solving the entire system for the final scores (which were not obvious). You spend most of your time encoding the problem and then use a known technique (in this case solving a linear system) to finish the solution.</font></p>
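<p><font>As a minimal sketch of this split, the three-page example can be worked in a few lines of Python. The <code>links</code> table is our reading of the three equations above (the figure itself is only an image), and the variable names and the choice of exact Gauss-Jordan elimination are illustrative assumptions, not the method of any production search engine.</font></p>
<pre>
```python
from fractions import Fraction as F

# Assumed encoding of Figure 1: each page maps to the pages it links to.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
pages = sorted(links)
n = len(pages)

# Local step: one balance equation per page,
# p(page) - sum over linking pages of p(source) / outdegree(source) = 0.
rows = []
for target in pages:
    row = [F(0)] * (n + 1)          # n coefficients plus the right-hand side
    row[pages.index(target)] += 1
    for source in pages:
        if target in links[source]:
            row[pages.index(source)] -= F(1, len(links[source]))
    rows.append(row)

# The balance equations are redundant by one, so replace the last one
# with the normalization p(A) + p(B) + p(C) = 1.
rows[-1] = [F(1)] * n + [F(1)]

# Global step: Gauss-Jordan elimination in exact rational arithmetic.
for col in range(n):
    pivot = next(r for r in range(col, n) if rows[r][col] != 0)
    rows[col], rows[pivot] = rows[pivot], rows[col]
    rows[col] = [x / rows[col][col] for x in rows[col]]
    for r in range(n):
        if r != col:
            factor = rows[r][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]

score = {page: rows[i][n] for i, page in enumerate(pages)}
print(score)
```
</pre>
<p><font>Under these assumptions the sketch reproduces the scores 4/9, 2/9 and 3/9 quoted above. Only the loop that builds <code>rows</code> knows anything about web pages; the elimination loop is a generic solver that could be swapped for any off-the-shelf linear algebra routine.</font></p>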
<h2><a name="SECTION00032000000000000000" id="SECTION00032000000000000000">Natural Language Processing</a></h2>
<p><font>Our next example application is natural language processing&nbsp;[<a href="#CharniakBook">Cha96</a>,<a href="#Charniak:1997p1484">Cha97</a>]. Speech recognition (the alignment or transcription of recognized intelligible segments of sound to written text) is an important problem in natural language processing. An example problem is the need to find the most likely text matching a sequence of sounds such as is shown in Figure&nbsp;<a href="#fig:SoundSeq1">3</a>.</font></p>
<div align="center"><a name="fig:SoundSeq1" id="fig:SoundSeq1"></a><a name="89"></a></p>
<table>
<caption align="bottom"><strong>Figure 3:</strong> A Sequence of Sounds</caption>
<tr>
<td>
<div align="center"><img width="500" height="69" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq1.png" alt="Image SoundSeq1"></div>
</td>
</tr>
</table>
</div>
<p><font>Consider Figure&nbsp;<a href="#fig:SoundSeq3">4</a> (which shows a bad transcription) and Figure&nbsp;<a href="#fig:SoundSeq2">5</a> (which shows a good transcription).</font></p>
<div align="center"><a name="fig:SoundSeq3" id="fig:SoundSeq3"></a><a name="98"></a></p>
<table>
<caption align="bottom"><strong>Figure 4:</strong> A Bad Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq3.png" alt="Image SoundSeq3"></div>
</td>
</tr>
</table>
</div>
<div align="center"><a name="fig:SoundSeq2" id="fig:SoundSeq2"></a><a name="105"></a></p>
<table>
<caption align="bottom"><strong>Figure 5:</strong> A Good Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq2.png" alt="Image SoundSeq2"></div>
</td>
</tr>
</table>
</div>
<p><font>Our claim: we can (given access to training data, and this is the age of data&nbsp;[<a href="#Halevy:2009p2327">HNP09</a>]) solve this problem with a local step that is a set of simple criticisms of proposed transcriptions. A good starting point is a database of previous sounds to text transcriptions. This database allows the construction of a series of tables that give the historic frequency (or probability) of all of the following:</font></p>
<ul>
<li>Prior probability of each sound</li>
<li>Probability of each sound given the immediately previous sound</li>
<li>Prior probability of each word</li>
<li>Probability of each word given the immediately previous word</li>
<li>Which combinations of word fragments are legitimate words</li>
<li>Probability of each sound being assigned to each word fragment (syllables, phonemes and so on).</li>
</ul>
<p><font>These tables encode a &#8220;speech model&#8221; (the rules involving sounds only), a language model (the rules involving text or words only) and the linkage between the two models. These models are deliberately simple in that they capture only local interactions (like probability of a word given the word before it) but no long range interactions (like subject predicate agreement).</font></p>
<p><font>Each box, nested box and arrow on our diagram represents one possible local critique. For each item in our diagram (again, the boxes and arrows) we can use our tables to assign a goodness or plausibility score. For instance bad word to word transitions (like &#8220;won&#8221; <!-- MATH<br />
 $\rightarrow$<br />
 --><br />
<img width="19" height="13" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg19.png" alt="$ \rightarrow$"> &#8220;won&#8221;) will be rare in our historic tables, so just looking up probabilities from the tables (or, better, using the logarithms of probabilities) gives us a &#8220;plausibility score&#8221; that prefers known patterns of language. Then a score for the overall transcription can be derived by multiplying all of the local scores together (or, equivalently, adding their logarithms). These local scores (though simple) already have encoded enough evidence to prefer the good transcription to the bad transcription <em>without</em> requiring any deep knowledge of speech, text or the meaning of the text. This is because the bad transcription has a series of obvious flaws such as: unlikely sound to word fragment assignments and unlikely word to word transitions.</font></p>
<div align="center"><a name="fig:SoundSeqPartial" id="fig:SoundSeqPartial"></a><a name="116"></a></p>
<table>
<caption align="bottom"><strong>Figure 6:</strong> Naively Extending a Partial Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeqPartial.png" alt="Image SoundSeqPartial"></div>
</td>
</tr>
</table>
</div>
<p><font>For example consider Figure&nbsp;<a href="#fig:SoundSeqPartial">6</a> where a naive solver is considering the word &#8220;one&#8221; as the third word to fill in. The <em>only</em> local critiques the solver needs to consider are:</font></p>
<ul>
<li>how likely the word &#8220;one&#8221; is in general (call this <img width="49" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg20.png" alt="$ P[one]$"> )</li>
<li>how likely the word &#8220;one&#8221; is to follow the word &#8220;nine&#8221; (call this <!-- MATH<br />
 $P[one | nine]$<br />
 --><br />
<img width="86" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg21.png" alt="$ P[one \vert nine]$"> )</li>
<li>how likely the letter sequence &#8220;o&#8221; is given the sound &#8220;w&#601;&#8221; (call this <!-- MATH<br />
 $P[o | \text{w\textschwa}]$<br />
 --><br />
<img width="55" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg24.png" alt="$P[o \vert \text{w\textschwa}]$"> )</li>
<li>how likely the letter sequence &#8220;ne&#8221; is given the sound &#8220;n&#8221; (call this <!-- MATH<br />
 $P[ne | \text{n}]$<br />
 --><br />
<img width="41" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg25.png" alt="$ P[ne \vert$">&nbsp; &nbsp;n<img width="7" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg23.png" alt="$ ]$"> ).</li>
</ul>
<p><font>So the local plausibility of the fill-in word &#8220;one&#8221; is: <!-- MATH<br />
 $P[one]<br />
\times P[one | nine] \times P[o | \text{w\textschwa}] \times P[ne |<br />
\text{n}]$<br />
 --><br />
<img width="292" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg28.png" alt="$P[one] \times P[one \vert nine] \times P[o \vert \text{w\textschwa}] \times P[ne \vert \text{n}]$"> . We will call this the critique of &#8220;one&#8221; in position 3 and write it as <!-- MATH<br />
 $C_3(w_2,one)$<br />
 --><br />
<img width="84" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg29.png" alt="$ C_3(w_2,one)$"> where <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg30.png" alt="$ w_2$"> is the word known to be in position 2. Similarly we can generate all of the possible critiques <img width="53" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg31.png" alt="$ C_1(w_1)$"> , <!-- MATH<br />
 $C_2(w_1,w_2)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg32.png" alt="$ C_2(w_1,w_2)$"> , <!-- MATH<br />
 $C_3(w_2,w_3)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg33.png" alt="$ C_3(w_2,w_3)$"> , <!-- MATH<br />
 $C_4(w_3,w_4)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg34.png" alt="$ C_4(w_3,w_4)$"> and the overall critique of a sequence <!-- MATH<br />
 $w_1 \; w_2 \; w_3 \; w_4$<br />
 --><br />
<img width="77" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg35.png" alt="$ w_1 \; w_2 \; w_3 \; w_4$"> : <!-- MATH<br />
 $C_1(w_1)<br />
\times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$<br />
 --><br />
<img width="336" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg36.png" alt="$ C_1(w_1) \times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$"> from our pre-computed tables of probabilities. Notice for all of these critiques only the immediately previous word and the nearby sounds were used to determine the plausibility of the word we are attempting to fit in. Instead of using these critiques to directly fill in a possible solution (or using search) we will package up these critiques (in the form of the <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> ) and pass them on to a powerful separate globalization step called Dynamic Programming&nbsp;[<a href="#DynamicProgramming">Bel57</a>].</font></p>
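<p><font>To make the packaging of critiques concrete, here is a small Python sketch. Every probability in it is made up for illustration, the table names are hypothetical, and the per-fragment spelling probabilities are assumed to be already multiplied together for each word. It works with logarithms, so the overall critique becomes a sum of local terms rather than a product.</font></p>
<pre>
```python
import math

# All numbers below are made up for illustration; real tables would be
# estimated from a database of previous sound-to-text transcriptions.
P_WORD = {"nine": 0.02, "one": 0.03, "won": 0.01}
P_WORD_GIVEN_PREVIOUS = {
    ("nine", "one"): 0.20, ("nine", "won"): 0.01,
    ("one", "one"): 0.10, ("one", "won"): 0.02,
    ("won", "one"): 0.05, ("won", "won"): 0.001,
}
# Probability of a word's spelling given the sounds heard at a position
# (the per-fragment probabilities are assumed to be already multiplied).
P_SPELLING_GIVEN_SOUNDS = {
    (1, "nine"): 0.9,
    (2, "one"): 0.4, (2, "won"): 0.6,
    (3, "one"): 0.4, (3, "won"): 0.6,
}

def log_critique(position, previous_word, word):
    """Log of C_i: uses only the word, its predecessor and nearby sounds."""
    total = math.log(P_WORD[word])
    total += math.log(P_SPELLING_GIVEN_SOUNDS[(position, word)])
    if previous_word is not None:
        total += math.log(P_WORD_GIVEN_PREVIOUS[(previous_word, word)])
    return total

def log_score(words):
    """Log of the overall critique: the sum of the local log critiques."""
    previous_words = [None] + list(words[:-1])
    return sum(
        log_critique(i, previous, word)
        for i, (previous, word) in enumerate(zip(previous_words, words), start=1)
    )

good = log_score(["nine", "one", "one"])
bad = log_score(["nine", "won", "won"])
print(good, bad)
```
</pre>
<p><font>With these invented tables the sequence containing the rare &#8220;won&#8221; to &#8220;won&#8221; transition scores lower, even though each &#8220;won&#8221; fits its sounds slightly better on its own. Note that <code>log_critique</code> never looks further back than one word, which is the locality the globalization step relies on.</font></p>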
<p><font>The globalization or finding of a best overall transcription is not trivial even though our score is simple. This is because the overall <em>best</em> sequence could depend on clever non-local fill-ins (like deliberately picking a less likely first word to allow a later favored transition to a fantastically good third word). Dynamic Programming does not fill in the transcription from left to right, but instead uses a table of scores derived from the left to right arrows and the <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> . In our example Dynamic Programming consists of building a table of information as shown in Figure&nbsp;<a href="#fig:DynBackFill">7</a>. Let <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> represent the word position we are looking at (so <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> ranges from 1 to 4) and let <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> be a variable that ranges over every word in the dictionary.
Our table is indexed by <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> and <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> and when filled in <img width="51" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg40.png" alt="$ T(i,w)$"> stores the highest &#8220;plausibility score&#8221; of any partial sequence of words in which words 1 through <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> have been filled in and the <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> -th word is <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> .</font></p>
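<p><font>One standard way to fill such a table, offered here as an assumption consistent with the definition of T(i,w) rather than as the exact procedure of the cited references, is the recurrence T(1,w) = C_1(w) and T(i,w) = max over v of T(i-1,v) &#215; C_i(v,w). The Python sketch below applies it to a three-word toy problem; the word list, the <code>FIT</code> and <code>FOLLOWS</code> tables and all of their numbers are hypothetical.</font></p>
<pre>
```python
WORDS = ["nine", "one", "won"]
LENGTH = 3

# Hypothetical scores, made up for illustration.
FIT = {  # how well each word matches the sounds at each position
    1: {"nine": 0.9, "one": 0.05, "won": 0.05},
    2: {"nine": 0.1, "one": 0.4, "won": 0.5},
    3: {"nine": 0.1, "one": 0.4, "won": 0.5},
}
FOLLOWS = {  # how plausible each word-to-word transition is
    ("nine", "nine"): 0.05, ("nine", "one"): 0.30, ("nine", "won"): 0.01,
    ("one", "nine"): 0.10, ("one", "one"): 0.20, ("one", "won"): 0.02,
    ("won", "nine"): 0.05, ("won", "one"): 0.10, ("won", "won"): 0.001,
}

def critique(i, previous, word):
    """Stand-in for C_i; C_1 ignores the (absent) previous word."""
    if i == 1:
        return FIT[1][word]
    return FOLLOWS[(previous, word)] * FIT[i][word]

# T[(i, w)]: best score of any fill-in of positions 1..i that ends in word w.
T = {(1, w): critique(1, None, w) for w in WORDS}
for i in range(2, LENGTH + 1):
    for w in WORDS:
        T[(i, w)] = max(T[(i - 1, v)] * critique(i, v, w) for v in WORDS)

# The best overall score is the largest entry in the last column.
best_last_word = max(WORDS, key=lambda w: T[(LENGTH, w)])
print(best_last_word, T[(LENGTH, best_last_word)])
```
</pre>
<p><font>Each column is computed from the previous column only, so the work grows with the number of positions times the square of the dictionary size, instead of with the number of possible word sequences.</font></p>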
<div align="center"><a name="fig:DynBackFill" id="fig:DynBackFill"></a><a name="134"></a></p>
<table>
<caption align="bottom"><strong>Figure 7:</strong> Dynamic Programming: Back Chaining in <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> for a Solution</caption>
<tr>
<td>
<div align="center"><img width="300" height="298" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/DynTableBackFill.png" alt="Image DynTableBackFill"></div>
</td>
</tr>
</table>
</div>
<p><font>If we already had this magic table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> we could find a best possible sequence by &#8220;back chaining.&#8221; We start by finding a fourth word (<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg41.png" alt="$ w_4$"> ) such that <img width="61" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg42.png" alt="$ T(4,w_4)$"> is maximal (in this case &#8220;one&#8221;). We then find a best third word (<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg43.png" alt="$ w_3$"> ) by enumerating all words and picking <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg43.png" alt="$ w_3$"> such that <!-- MATH<br />
 $T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$<br />
 --><br />
<img width="234" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg44.png" alt="$ T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$"> . We continue backward until we have found words <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg30.png" alt="$ w_2$"> and <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg45.png" alt="$ w_1$"> to get a complete best sequence. Notice that we work from right to left (backwards) and, except for the starting step, we pick each word to match the calculation we are trying to un-roll, not to be the maximal entry in the column. For instance we pick <!-- MATH<br />
 $w_1 = dial$<br />
 --><br />
<img width="70" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg46.png" alt="$ w_1 = dial$"> not because it has the highest score in its column (it does not), but because <!-- MATH<br />
 $T(1,dial) C_2(dial,nine)<br />
C_3(nine,one) C_4(one,one) = T(4,one)$<br />
 --><br />
<img width="433" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg47.png" alt="$ T(1,dial) C_2(dial,nine) C_3(nine,one) C_4(one,one) = T(4,one)$"> is the maximal complete chain.</font></p>
<p><font>Of course, we don&#8217;t start with the table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> already filled in, so we need a procedure to build it. This procedure is the heart of the Dynamic Programming method (for more examples see: &#8220;Introduction to Algorithms&#8221;&nbsp;[<a href="#IntroductionToAlgorithms">CLRS09</a>]). Notice that <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg48.png" alt="$ T(1,w)$"> can be filled in for all <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> just by plugging in words and computing the critiques <img width="46" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg49.png" alt="$ C_1(w)$"> (i.e. <!-- MATH<br />
 $T(1,w) = C_1(w)$<br />
 --><br />
<img width="118" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg50.png" alt="$ T(1,w) = C_1(w)$"> ). Once all the <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg48.png" alt="$ T(1,w)$"> are filled in we can fill in the <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg51.png" alt="$ T(2,w)$"> with the general (and slightly trickier) formula:</font></p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w)<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="249" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg52.png" alt="$\displaystyle T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w) $"></div>
<p><font>as we illustrate for <img width="74" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg53.png" alt="$ T(2,nine)$"> in Figure&nbsp;<a href="#fig:DynTable">8</a>.</font></p>
<div align="center"><a name="fig:DynTable" id="fig:DynTable"></a><a name="145"></a></p>
<table>
<caption align="bottom"><strong>Figure 8:</strong> Dynamic Programming: Building the Table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"></caption>
<tr>
<td>
<div align="center"><img width="400" height="261" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/DynTableCalculate.png" alt="Image DynTableCalculate"></div>
</td>
</tr>
</table>
</div>
<p><font>The magic of the Dynamic Programming technique is: by being careful to not store too much in the table <img width="51" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg40.png" alt="$ T(i,w)$"> we avoid an explosion in record keeping that would render the method inefficient. Dynamic Programming exploits the small dependence structure encoded in <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> (each box in our diagram depending on only a few arrows) and, as we have shown, can find &#8220;clever&#8221; solutions (such as taking a sub-optimal first word to get better transitions into preferred later words). For those who want more detail on solving this problem we recommend [<a href="#CharniakBook">Cha96</a>] (as our goal is not to fully explain Dynamic Programming, but to demonstrate how it could be applied to the transcription problem as a pre-packaged globalizer).</font></p>
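<p><font>The table-building recurrence and the back chaining step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-word dictionary, the scores in <code>FIRST</code> and <code>TRANSITIONS</code>, and the helper names are all hypothetical (they are not the values behind the figures), and the toy critique ignores the word position even though the real critiques depend on it.</font></p>

```python
# Hypothetical toy dictionary and critique scores (made up for illustration).
WORDS = ["dial", "nine", "one"]
FIRST = {"dial": 0.45, "nine": 0.5, "one": 0.05}           # plays the role of C_1(w)
TRANSITIONS = {("dial", "nine"): 0.9, ("nine", "one"): 0.8, ("one", "one"): 0.7}


def first_score(w):
    return FIRST[w]


def transition_score(position, v, w):
    # Plays the role of C_position(v, w); this toy version ignores the position.
    return TRANSITIONS.get((v, w), 0.1)


def build_table(words, first_score, transition_score, length):
    """table[k][w] holds T(k+1, w): the best score of a partial sequence ending in w."""
    table = [{w: first_score(w) for w in words}]
    for position in range(2, length + 1):
        prev = table[-1]
        # T(i+1, w) = max over v of T(i, v) * C_{i+1}(v, w)
        table.append({
            w: max(prev[v] * transition_score(position, v, w) for v in words)
            for w in words
        })
    return table


def back_chain(table, words, transition_score):
    """Recover a best sequence by working from the last column back to the first."""
    current = max(words, key=lambda w: table[-1][w])
    sequence = [current]
    for position in range(len(table), 1, -1):
        column = table[position - 2]  # holds T(position - 1, .)
        # The argmax is a predecessor v with T(position-1, v) * C(v, current) = T(position, current).
        current = max(words, key=lambda v: column[v] * transition_score(position, v, current))
        sequence.append(current)
    sequence.reverse()
    return sequence


table = build_table(WORDS, first_score, transition_score, length=4)
best = back_chain(table, WORDS, transition_score)
print(best)  # ['dial', 'nine', 'one', 'one']
```

<p><font>In this toy data &#8220;nine&#8221; has the highest first-word score, yet the best complete chain starts with &#8220;dial&#8221;, which is the kind of non-local choice the table makes cheap to find. Taking the argmax over predecessors avoids testing floating point products for exact equality.</font></p>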
<p><font>In this example the local step was the graph link based critiques and the globalization step was Dynamic Programming. The separation of concerns between the local scoring step and the globalizing step is a strength of the local to global principle.</font></p>
<h2><a name="SECTION00033000000000000000" id="SECTION00033000000000000000">Machine Learning</a></h2>
<p><font>Our final example application is machine learning. Machine learning is loosely defined as the study of computer programs that adapt or learn from data. Thomas Mitchell helps distinguish this activity as a specialty of artificial intelligence that concentrates on &#8220;well-posed learning problems.&#8221;&nbsp;[<a href="#MitchellML">Mit97</a>] Trevor Hastie, Robert Tibshirani, and Jerome Friedman emphasize the relation to statistics (versus more traditional symbolic AI)&nbsp;[<a href="#TibHat">TH09</a>]. A simple demonstration can be found in [<a href="#MLArt">Mou09b</a>].</font></p>
<p><font>Machine learning is perhaps the strongest example of the local to global principle, and our treatment of it is inspired by the work of Kristin P. Bennett and Emilio Parrado-Hernandez&nbsp;[<a href="#Bennett:2006p400">BPH06</a>]. In hindsight many machine learning algorithms (each of which has had a turn at being &#8220;the most exciting breakthrough ever&#8221; for a while) can be seen as the pairing of a performance criterion (which we call a local criterion as it applies to one specific set of parameter values at a time) and an optimization method (what we have been calling the globalization step). The work of Bennett and Parrado-Hernandez calls this distinction out and shows how it is not productive to present machine learning systems as unique named monolithic units, but instead to consider how to break them into an objective function and an optimizer. This allows both the choice of better optimizers (such as replacing the inferior gradient descent method wherever it occurs) and explicit control of important concepts such as hypothesis regularization and over-fitting (which some algorithms claim to achieve by deliberately exiting early from an inferior optimizer).</font></p>
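<p><font>A minimal sketch of this objective/optimizer split, in Python, is given below. The toy data set, the one-parameter logistic model, the step size and the iteration counts are all assumptions made for illustration; the point is only that the local criterion is written once and two different globalizers are swapped in without touching it.</font></p>

```python
import math

# Hypothetical toy data: (feature x, label y in {0, 1}). It is deliberately
# not separable, so the optimal weight is finite.
DATA = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(w):
    """Local criterion: negative log-likelihood of P[y=1|x] = sigmoid(w * x).

    Returns the value together with its first and second derivatives in w.
    """
    value = grad = hess = 0.0
    for x, y in DATA:
        p = sigmoid(w * x)
        value -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        grad += (p - y) * x
        hess += p * (1.0 - p) * x * x
    return value, grad, hess


def gradient_descent(objective, w=0.0, step=0.1, iterations=2000):
    """Globalizer 1: fixed-step gradient descent (uses only the first derivative)."""
    for _ in range(iterations):
        _, grad, _ = objective(w)
        w -= step * grad
    return w


def newton(objective, w=0.0, iterations=20):
    """Globalizer 2: Newton's method (uses first and second derivatives)."""
    for _ in range(iterations):
        _, grad, hess = objective(w)
        w -= grad / hess
    return w


w_gd = gradient_descent(logistic_loss)
w_newton = newton(logistic_loss)
print(w_gd, w_newton)  # the two optimizers agree on the same optimum
```

<p><font>Both optimizers are judged against the same condition (the derivative of the objective is zero at the optimum), so either can be replaced by something better without redefining what the model is.</font></p>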
<p><font>At the &#8220;30,000 foot level&#8221; we can build a table of common machine learning techniques and name what is commonly used to implement their local and global steps. When a machine learning algorithm is defined by what conditions are meant to be true at the optimum we are no longer bound by details of the original implementation and can examine, fix, and improve the components.<a name="tex2html17" href="#foot154" id="tex2html17"><sup>7</sup></a> Table&nbsp;<a href="#fig:MachineLearning">1</a> is a crude summary of a wide selection of machine learning algorithms that may be more likely to offend everybody than just offend somebody. But this is also the point: it is the algorithmist&#8217;s job to think fluidly (beyond given names and provenances) and to invent scaffolding to convert partial analogies into practical correspondences.</font></p>
<p></p>
<div align="center"><a name="190"></a></p>
<table>
<caption><strong>Table 1:</strong> Various Machine Learning Techniques</caption>
<tr>
<td>
<div align="center">
<table cellpadding="3" border="1" align="center">
<tr>
<td align="left" valign="top" width="180"><font size="-1">Machine Learning Method</font></td>
<td align="left" valign="top" width="144"><font size="-1">Local Criterion</font></td>
<td align="left" valign="top" width="144"><font size="-1">Globalization Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Linear Regression [<a href="#Breiman:1997p1133">BF97</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Linear Discriminant Analysis [<a href="#Fisher:1936p2576">Fis36</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Logistic Regression [<a href="#Komarek:2008p1742">Kom08</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">logit penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Newton&#8217;s Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Perceptron [<a href="#Beigel:1991p1027">BRS91</a>] [<a href="#Blum:2002p1867">BD02</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">error rate</font></td>
<td align="left" valign="top" width="144"><font size="-1">error based update</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Naive Bayes [<a href="#Maron:2000p2553">MK00</a>] [<a href="#Maron:1961p2566">Mar61</a>] [<a href="#Lewis:1998p105">Lew98</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">frequency tables</font></td>
<td align="left" valign="top" width="144"><font size="-1">arithmetic</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Nearest Neighbor [<a href="#Ailon:2006p872">AC06</a>] [<a href="#Indyk:1999p166">IM99</a>] [<a href="#Andoni:2006p52">AI06</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">Kernel Methods</font></td>
<td align="left" valign="top" width="144"><font size="-1">enumeration,<br />
projection</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Decision Trees [<a href="#bfso:1984">BFSO84</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">information theory</font></td>
<td align="left" valign="top" width="144"><font size="-1">partitioning</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">clustering [<a href="#Cilibrasi:2005p8">CV05</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">partitioning</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">MaxEnt [<a href="#Grunwald:2000p108">Gru00</a>] [<a href="#Grunwald:2004p739">GD04</a>] [<a href="#Skilling:1988p780">Ski88</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">entropy penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Newton&#8217;s Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Neural Net with Back Propagation [<a href="#NNCPE">Hus99</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">sigmoid penalty function</font></td>
<td align="left" valign="top" width="144"><font size="-1">Automatic Differentiation,<br />
steepest descent</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Winnow [<a href="#Kivinen:1995p1836">KWA95</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">error rate</font></td>
<td align="left" valign="top" width="144"><font size="-1">multiplicative error based update</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Boosting [<a href="#Freund:1999p1015">FS99</a>] [<a href="#Breiman:2000p1134">Bre00</a>] [<a href="#Collins:2002p1008">CSS02</a>] [<a href="#Trevisan:2008p2166">TTV08</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">weighted errors,<br />
data re-weighting</font></td>
<td align="left" valign="top" width="144"><font size="-1">Conjugate Gradient</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">HMM [<a href="#Kristjansson:2004p545">KCVM04</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">probability penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Gibbs Sampler</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Latent Dirichlet Allocation [<a href="#Blei:2003p1063">BNJ03</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">KL divergence</font></td>
<td align="left" valign="top" width="144"><font size="-1">Variational Methods</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Support Vector Machine [<a href="#Joachims:1998p406">Joa98</a>] [<a href="#SVMBook">STC00</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">L1 Margin,<br />
Kernel Methods</font></td>
<td align="left" valign="top" width="144"><font size="-1">Quadratic Optimization</font></td>
</tr>
</table>
</div>
<p><a name="fig:MachineLearning" id="fig:MachineLearning"></a></td>
</tr>
</table>
</div>
<p></p>
<p><font>This table is a necessarily crude summary. For example: notice that several known techniques can not even be distinguished from each other by the local and global columns of the table.</font></p>
<p><font>There are a few points we would like to make. Back propagation was considered unique to Neural Nets for quite a while because it was so entwined with the technique that it was not recognized as the simple application of Automatic Differentiation&nbsp;[<a href="#Rall:1996p2473">RC96</a>] that it is. Support Vector Machines (SVM) are remarkable for their uniformly very good choice of component methods (maximum L1 margin objective regularization, Kernel Methods&nbsp;[<a href="#KernBook">STC04</a>] and sophisticated optimization methods&nbsp;[<a href="#Joachims:2006p403">Joa06</a>]). Many of the machine learning methods that SVM outperforms become competitive again when they adopt some of SVM&#8217;s technologies (especially using kernel methods to produce synthetic features).</font></p>
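<p><font>To make the back propagation remark concrete, here is a minimal sketch of reverse-mode Automatic Differentiation in Python, applied to a single sigmoid unit with a squared error penalty. The <code>Var</code> class, the <code>backward</code> function and the numeric values are hypothetical names and data chosen for this illustration; real systems are far more elaborate.</font></p>

```python
import math


class Var:
    """A scalar that records how it was computed so derivatives can flow backward."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input Var, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))


def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x.value))
    return Var(s, ((x, s * (1.0 - s)),))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad, visiting nodes in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # the chain rule, one edge at a time


# One sigmoid unit: prediction = sigmoid(w * x + b), penalty = (prediction - target)^2
w, b = Var(0.5), Var(-1.0)
x, target = Var(2.0), Var(1.0)
error = sigmoid(w * x + b) - target
loss = error * error
backward(loss)
print(loss.value, w.grad, b.grad)
```

<p><font>The backward sweep is nothing more than the chain rule applied to the recorded computation, which is why the same mechanism serves any differentiable objective and not only neural nets.</font></p>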
<p><font>Beyond these points we invoke a &#8220;globalizers are pre-packaged&#8221; principle and leave the discussion of machine learning and optimization to our reference: [<a href="#Bennett:2006p400">BPH06</a>]. In this example the local step is a per-example score or penalty and the globalization step is optimization.</font></p>
<h1><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Some Methods</a></h1>
<p><font>The application of the local to global principle is similar to the Feynman &#8220;genius method.&#8221; Feynman&#8217;s method is to always have in mind a list of problems and a list of solution methods. The genius step is: anytime you see a new problem or a new solution method to immediately try it against every item from the complementary list.&nbsp;[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;Ten Lessons I Wish I Had Been Taught&#8221;] This deliberate retention and activity greatly increases your problem solving ability. The power of the local to global principle is itself proportional to the number of local methods times the number of globalization strategies. Of course, to even start: the practitioner must already have available a number of candidate local and globalization methods. We list some methods and some guidance on variation and invention.</font></p>
<h2><a name="SECTION00041000000000000000" id="SECTION00041000000000000000">Local Methods</a></h2>
<p><font><img width="100" height="100" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/nails.jpg" alt="Image nails"> Good sources of ideas and analogies for local methods include:</font></p>
<ul>
<li>Introduce a Graph Structure
<p>A graph structure is a network of nodes connected by edges. Use of graphs was demonstrated both in the natural language processing and web page link analysis examples. We can dress up how we solved these problems and say we used a &#8220;Hidden Markov Model&#8221;, but the real power was we encoded our problem in a simple graph. Some problems (especially those from logic or those involving time) are essentially solved once they are translated out of their original form and into graph notation (for an example see: [<a href="#Mount:2000p360">Mou00</a>]).</p>
</li>
<li>Appeal to Physical Conservation Laws
<p>A good example of a physical law is Kirchhoff&#8217;s law or conservation of flow. All of the web page link analysis&#8217;s equations were derived by saying that the attention at a node is essentially the sum of attentions from other nodes (more sophisticated versions of the analysis actually do create and destroy flow, but they do it in a principled way).</p>
</li>
<li>Encode the Problem into an Objective Function
<p>This method is essentially your declaration that you intend to use an optimizer for the globalization step. In operations research this specific technique has long been the practice (with no disrespect: a very productive part of operations research has been translating different problems into linear programs so the simplex method can be applied, for an example see [<a href="#TradeArt">Mou09a</a>]).</p>
</li>
<li>Gradient Like Computations
<p>Includes Gradients, Secants, Lagrangians and other ideas from calculus. Gradients can drive optimizer based globalizers, and techniques like Lagrangians are often powerful enough to use mere inspection as the globalization step.</p>
</li>
<li>Violation Driven Updates
<p>This method is particularly effective when your problem is not amenable to continuous optimization. A good example is the Lin-Kernighan heuristic for solving the traveling salesman problem&nbsp;[<a href="#Lin:1973p2739">LK73</a>]. This heuristic looks at subsets of the problem and suggests improving &#8220;surgeries&#8221; (until no more such improvements are possible).</p>
</li>
<li>Introduction of Symbols
<p>Often, as with the web page link analysis example, you can not specify specific values for the unknowns, but you can specify relationships. You often can then solve for the symbols or introduce additional conditions and use an optimizer to complete the solution (see for example the maximum entropy method as described in [<a href="#Skilling:1988p780">Ski88</a>]).</p>
</li>
<li>Over Specification
<p>If we anticipate using a global step like search, enumeration, summation or integration then over specification is a good local idea.</p>
<p>For example: consider computing the probability that a fair coin flipped 10 times comes up with heads exactly 3 times. The easiest way to perform this calculation is to specify exactly which 3 flips come up heads (the local over-specification) and then sum over all choices of 3 out of 10 flips (the global step). In mathematical notation this is:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
P[\text{exactly 3 heads out of 10 flips}] = \binom{10}{3} 2^{-10} \approx 0.117<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="20" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg54.png" alt="$\displaystyle P[$">exactly 3 heads out of 10 flips<img width="157" height="54" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg55.png" alt="$\displaystyle ] = \binom{10}{3} 2^{-10} \approx 0.117 $"></div>
<p>or just under 12%.</p>
</li>
<li>Under Specification
<p>One of the core principles of Dynamic Programming is to forget as much as possible about partial solutions, keeping only partial solution cost and just enough information to extend partial solutions. If you anticipate using something like Dynamic Programming as your globalization step then your goal should be to under specify.</p>
</li>
<li>Tables
<p>A key step of the natural language processing example was the use of tables of past experience to determine which sounds likely corresponded to which words, which words likely followed each other (and so on). Encoding domain knowledge or expertise as probability tables is a very effective problem solving strategy (especially if the globalization strategy is going to be search or Dynamic Programming). In natural language processing examples tables and statistics are <em>much</em> easier to manage than comprehensive rules or grammars.</p>
</li>
<li>Set up as Ranking or Machine Learning Problem
<p>This tactic is especially appropriate if your solution success metric is counts, frequencies or probabilities (instead of having to always be correct or always be optimal).</p>
</li>
</ul>
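<p><font>The over specification example from the list above (exactly 3 heads in 10 flips of a fair coin) can be checked directly. The short Python sketch below is an illustration, not part of the original derivation: it sums the probability of each fully specified outcome over all choices of which flips are heads, and compares that with brute-force enumeration and with the closed form. The variable names are our own.</font></p>

```python
from itertools import combinations, product
from math import comb

FLIPS, HEADS = 10, 3

# Local over-specification: the probability of one fully specified sequence of flips.
per_outcome = 0.5 ** FLIPS

# Global step: sum over every choice of which 3 of the 10 flips are heads.
by_choice = sum(per_outcome for _ in combinations(range(FLIPS), HEADS))

# Brute-force check: enumerate all 2^10 sequences and keep those with exactly 3 heads.
by_enumeration = sum(
    per_outcome
    for outcome in product("HT", repeat=FLIPS)
    if outcome.count("H") == HEADS
)

closed_form = comb(FLIPS, HEADS) * per_outcome
print(by_choice, by_enumeration, closed_form)  # 0.1171875 0.1171875 0.1171875
```

<p><font>The local step never has to reason about &#8220;exactly 3&#8221;; that condition is handled entirely by what the global step chooses to sum over.</font></p>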
<h2><a name="SECTION00042000000000000000" id="SECTION00042000000000000000">Globalization Methods</a></h2>
<p><font><img width="100" height="100" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/hammer.jpg" alt="Image hammer"> The universe of possible globalization methods is very diverse (in particular globalization is not always optimization).</font></p>
<ul>
<li>Search / Enumeration
<p>Search can be slow, but it is always an option to consider. If your problem translates naturally into a graph structure or your solutions are naturally seen as being composed of small pieces, search should be considered. One of the big advantages of using the local phase to formally encode your problem&#8217;s structure and putting search off to the global phase is: you can use advanced search techniques. Once you are freed from your specific problem details it becomes much easier to consider search techniques like branch and bound, A*, game theoretic search and general speed-up techniques like hashing and caching.</p>
</li>
<li>Dynamic Programming
<p>If your problem has a bit more structure (in that partial solutions summarize and compose easily) then you can likely replace search with Dynamic Programming. The advantage is that Dynamic Programming typically offers an incredible speed up when compared to search.</p>
</li>
<li>Optimization
<p>If your problem is continuous (involves numbers instead of discrete or categorical decisions), can be encoded as a reasonable objective function (linear, positive definite quadratic) and has reasonable constraints (linear or convex) then you can immediately apply an optimizer as your globalization step. Typical optimization methods include: conjugate gradient, Newton methods, quasi Newton methods, linear programming and quadratic programming.</p>
</li>
<li>Combinatorial Optimization
<p>If your problem includes &#8220;discrete variables&#8221; (that is, variables that take on one of a fixed set of values instead of values from a numeric range) then you may not be able to apply standard optimization techniques. At this point you may want to use more expensive combinatorial optimization techniques like integer linear programming or constraint satisfaction.</p>
</li>
<li>Fixed Point Methods / Iteration
<p>Fixed point methods are based on the idea: &#8220;incrementally improve until there is no incremental improvement possible.&#8221; If the problem is continuous this is similar to steepest descent. If the problem is discrete then this is similar to the Lin-Kernighan heuristic.</p>
</li>
<li>Linear Algebra
<p>The web page link analysis and optimization examples were essentially solved once we reduced them to linear algebra. If you can write your problem as a linear relationship between unknowns or as the fixed-point of a linear operator (i.e. an <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg56.png" alt="$ x$"> such that <img width="54" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg57.png" alt="$ A x = x$"> ) then you can immediately use linear algebra to solve the problem at very large scale (e.g. web scale).</p>
</li>
<li>Sampling / Problem Kernels
<p>A very successful line of attack on large problems is to reduce to a smaller problem containing most of the essential difficulty. David Karger has produced a number of effective algorithms for graph cuts and flows using a theory of sampling&nbsp;[<a href="#Karger:1998p556">Kar98</a>]. Rod Downey and M. Fellows have demonstrated an effective theory of &#8220;problem kernels&#8221; that finds solutions by focusing on smaller sub-problems (on which we can afford to use more expensive procedures)&nbsp;[<a href="#DF98">DF98</a>].</p>
</li>
<li>Amortized Analysis / Economic Mechanism Methods
<p>Daniel Sleator and Robert Tarjan&#8217;s ideas of amortized analysis&nbsp;[<a href="#Sleator:1985p168">ST85</a>] allow approximation schemes similar to problem kernels. The method is to approximately optimize by pairing a bunch of unavoidable large penalties (conditions we can&#8217;t meet) with some accounting credits (say bonuses from other conditions we are meeting very well). We then isolate these paired items and optimize the rest of the problem exactly. The technique often works by showing the approximation can not be too bad because, due to the pairing of large penalties to good credits, there can not be too many large penalties. An informal example is: if it is impossible to pick someplace where all of an office will eat for lunch, perhaps you can solve the problem by paying one person to accept a restaurant they do not like (if the removal of their objection opens up a venue that is acceptable to everybody else).</p>
</li>
<li>Relaxation / Homotopic methods
<p>These methods involve changing hard constraints to soft penalties (so allowing the constraints to be violated, but at a slowly increasing cost). After such a relaxation the homotopic (or continuous deformation) method is to increase the cost of violation and re-solve to try and get a trajectory of better and better nearly acceptable solutions that point to a possible overall solution.</p>
</li>
</ul>
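<p><font>As one concrete instance of the fixed point and linear algebra items in the list above, the following Python sketch finds a vector x with A x = x by repeated application of the operator. The three-node link matrix is a hypothetical example (each column says how a node splits its attention among the nodes it links to), and the tolerance and iteration limit are arbitrary choices; plain iteration like this is only one of several ways to solve such a system.</font></p>

```python
# Hypothetical 3-node link graph: node 0 links to nodes 1 and 2, node 1 links
# to node 2, node 2 links to node 0. LINKS[i][j] is the share of node j's
# attention passed to node i, so every column sums to 1.
LINKS = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]


def apply(matrix, vector):
    """Matrix-vector product using plain lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def fixed_point(matrix, tolerance=1e-12, max_iterations=10_000):
    """Iterate x <- A x from a uniform start until x stops changing."""
    n = len(matrix)
    x = [1.0 / n] * n
    for _ in range(max_iterations):
        nxt = apply(matrix, x)
        if max(abs(a - b) for a, b in zip(nxt, x)) < tolerance:
            return nxt
        x = nxt
    raise RuntimeError("iteration did not converge")


attention = fixed_point(LINKS)
print(attention)  # approximately [0.4, 0.2, 0.4]
```

<p><font>The local step only wrote down the conservation relationship between neighboring nodes; the globalizer turns those relationships into one consistent set of values for the whole graph.</font></p>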
<h1><a name="SECTION00050000000000000000" id="SECTION00050000000000000000">Conclusion</a></h1>
<p><font>The purpose of this article has been to make more visible an idea we call the local to global principle. This principle is an organizing tool useful both in designing and analyzing a wide variety of applications. Essentially the whole point of this writeup is to set up enough framework to quickly write down a table of advice such as Table&nbsp;<a href="#fig:ProblemTable">2</a> (and for such a table to mean something).</font></p>
<p></p>
<div align="center"><a name="227"></a></p>
<table>
<caption><strong>Table 2:</strong> Various Applications, Local Steps and Global Steps</caption>
<tr>
<td>
<div align="center">
<table cellpadding="3" border="1" align="center">
<tr>
<td align="left"><font size="-1">Example</font></td>
<td align="left"><font size="-1">Local Step</font></td>
<td align="left"><font size="-1">Global Step</font></td>
</tr>
<tr>
<td align="left"><font size="-1">speech transcription</font></td>
<td align="left"><font size="-1">tables</font></td>
<td align="left"><font size="-1">Dynamic Programming</font></td>
</tr>
<tr>
<td align="left"><font size="-1">PageRank</font></td>
<td align="left"><font size="-1">graph structure, linear equations</font></td>
<td align="left"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left"><font size="-1">machine learning</font></td>
<td align="left"><font size="-1">objective function</font></td>
<td align="left"><font size="-1">optimization</font></td>
</tr>
</table>
</div>
<p><a name="fig:ProblemTable" id="fig:ProblemTable"></a></td>
</tr>
</table>
</div>
<p></p>
<p><font>The principle is not universal; not everything can be fit into such a table. For example the local to global decoupling is <em>not</em> a feature of the famous EM algorithm&nbsp;[<a href="#Dempster:1977p761">DLR77</a>], which depends on mixing predictions and corrections.</font></p>
<p><font>To conclude: the recipe is as follows. If you come to a problem with a large shopping bag of possible ways to build local criteria and powerful globalization procedures then you stand a very good chance of solving the problem quickly. Also, if you keep the local to global principle in mind you are more likely to identify and retain potential local tricks and globalizers when you see them and thus have a larger more nimble set of tools available to solve problems when the time comes.</font></p>
<h2><a name="SECTION00060000000000000000" id="SECTION00060000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Ailon:2006p872" id="Ailon:2006p872">AC06</a></dt>
<dd>Nir Ailon and Bernard Chazelle, <i>Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform</i>, STOC (2006).</dd>
<dt><a name="Andoni:2006p52" id="Andoni:2006p52">AI06</a></dt>
<dd>Alexandr Andoni and Piotr Indyk, <i>Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions</i>.</dd>
<dt><a name="Blum:2002p1867" id="Blum:2002p1867">BD02</a></dt>
<dd>Avrim Blum and John Dunagan, <i>Smoothed analysis of the perceptron algorithm for linear programming</i>, SODA (2002), 11.</dd>
<dt><a name="DynamicProgramming" id="DynamicProgramming">Bel57</a></dt>
<dd>Richard Bellman, <i>Dynamic programming</i>, Princeton University Press, 1957.</dd>
<dt><a name="Breiman:1997p1133" id="Breiman:1997p1133">BF97</a></dt>
<dd>Leo Breiman and Jerome&nbsp;H Friedman, <i>Predicting multivariate responses in multiple linear regression</i>, Journal of the Royal Statistical Society, Series B (Methodological) <b>59</b> (1997), no.&nbsp;1, 3-54.</dd>
<dt><a name="bfso:1984" id="bfso:1984">BFSO84</a></dt>
<dd>Leo Breiman, Jerome Friedman, Charles&nbsp;J. Stone, and R.&nbsp;A. Olshen, <i>Classification and regression trees</i>, Chapman &amp; Hall/CRC, January 1984.</dd>
<dt><a name="Blei:2003p1063" id="Blei:2003p1063">BNJ03</a></dt>
<dd>David&nbsp;M Blei, Andrew&nbsp;Y Ng, and Michael&nbsp;I Jordan, <i>Latent dirichlet allocation</i>, Journal of Machine Learning Research <b>3</b> (2003), 993-1022.</dd>
<dt><a name="Bennett:2006p400" id="Bennett:2006p400">BPH06</a></dt>
<dd>Kristin&nbsp;P. Bennett and Emilio Parrado-Hernandez, <i>The interplay of optimization and machine learning research</i>, Journal of Machine Learning Research <b>7</b> (2006), 1265-1281.</dd>
<dt><a name="Breiman:2000p1134" id="Breiman:2000p1134">Bre00</a></dt>
<dd>Leo Breiman, <i>Special invited paper. additive logistic regression: A statistical view of boosting: Discussion</i>, Ann. Statist. <b>28</b> (2000), no.&nbsp;2, 374-377.</dd>
<dt><a name="Beigel:1991p1027" id="Beigel:1991p1027">BRS91</a></dt>
<dd>R&nbsp;Beigel, N&nbsp;Reingold, and D&nbsp;Spielman, <i>The perceptron strikes back</i>, Structure in Complexity Theory Conference <b>6</b> (1991), 286-291.</dd>
<dt><a name="CharniakBook" id="CharniakBook">Cha96</a></dt>
<dd>Eugene Charniak, <i>Statistical language learning</i>, MIT Press, 1996.</dd>
<dt><a name="Charniak:1997p1484" id="Charniak:1997p1484">Cha97</a></dt>
<dd>Eugene Charniak, <i>Statistical techniques for natural language parsing</i>, AI Magazine <b>18</b> (1997), no.&nbsp;4, 33-44.</dd>
<dt><a name="IntroductionToAlgorithms" id="IntroductionToAlgorithms">CLRS09</a></dt>
<dd>Thomas&nbsp;H. Cormen, Charles&nbsp;E. Leiserson, Ronald&nbsp;L. Rivest, and Clifford Stein, <i>Introduction to algorithms</i>, MIT Press, 2009.</dd>
<dt><a name="Collins:2002p1008" id="Collins:2002p1008">CSS02</a></dt>
<dd>Michael Collins, Robert&nbsp;E Schapire, and Yoram Singer, <i>Logistic regression, adaboost and bregman distances</i>, Machine Learning <b>48</b> (2002), no.&nbsp;1/2/3, 30.</dd>
<dt><a name="Cilibrasi:2005p8" id="Cilibrasi:2005p8">CV05</a></dt>
<dd>Rudi Cilibrasi and Paul&nbsp;M.B Vitanyi, <i>Clustering by compression</i>, IEEE Transactions on Information Theory <b>51</b> (2005), no.&nbsp;4, 1523-1545.</dd>
<dt><a name="DF98" id="DF98">DF98</a></dt>
<dd>Rod&nbsp;G. Downey and M.&nbsp;R. Fellows, <i>Parameterized complexity</i>, Monographs in Computer Science, Springer, November 1998.</dd>
<dt><a name="Dempster:1977p761" id="Dempster:1977p761">DLR77</a></dt>
<dd>A&nbsp;P Dempster, N&nbsp;M Laird, and D&nbsp;B Rubin, <i>Maximum likelihood from incomplete data via the em algorithm</i>, Journal of the Royal Statistical Society, Series B (Methodological) <b>39</b> (1977), no.&nbsp;1, 1-38.</dd>
<dt><a name="Fisher:1936p2576" id="Fisher:1936p2576">Fis36</a></dt>
<dd>Ronald&nbsp;A Fisher, <i>The use of multiple measurements in taxonomic problems</i>, Annals of Eugenics <b>7</b> (1936), 179-188.</dd>
<dt><a name="Freund:1999p1015" id="Freund:1999p1015">FS99</a></dt>
<dd>Yoav Freund and Robert&nbsp;E Schapire, <i>A short introduction to boosting</i>, Journal of Japanese Society for Artificial Intelligence <b>14</b> (1999), no.&nbsp;5, 771-780.</dd>
<dt><a name="Grunwald:2004p739" id="Grunwald:2004p739">GD04</a></dt>
<dd>Peter&nbsp;D Grunwald and A&nbsp;Philip Dawid, <i>Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory</i>, Ann. Statist. <b>32</b> (2004), no.&nbsp;4, 1367-1433.</dd>
<dt><a name="Grunwald:2000p108" id="Grunwald:2000p108">Gru00</a></dt>
<dd>PD&nbsp;Grunwald, <i>Maximum entropy and the glasses you are looking through</i>, Conference on Uncertainty in Artificial Intelligence (2000), 238-246.</dd>
<dt><a name="Halevy:2009p2327" id="Halevy:2009p2327">HNP09</a></dt>
<dd>Alon Halevy, Peter Norvig, and Fernando Pereira, <i>The unreasonable effectiveness of data</i>, IEEE Intelligent Systems (2009).</dd>
<dt><a name="NNCPE" id="NNCPE">Hus99</a></dt>
<dd>Dirk Husmeier, <i>Neural networks for conditional probability estimation</i>, Springer, 1999.</dd>
<dt><a name="Indyk:1999p166" id="Indyk:1999p166">IM99</a></dt>
<dd>Piotr Indyk and Rajeev Motwani, <i>Approximate nearest neighbors: Towards removing the curse of dimensionality</i>.</dd>
<dt><a name="Joachims:1998p406" id="Joachims:1998p406">Joa98</a></dt>
<dd>Thorsten Joachims, <i>Making large-scale svm learning practical</i>, Advances in Kernel Methods &#8211; Support Vector Learning (1998).</dd>
<dt><a name="Joachims:2006p403" id="Joachims:2006p403">Joa06</a></dt>
<dd>Thorsten Joachims, <i>Training linear svms in linear time</i>, KDD (2006).</dd>
<dt><a name="Karger:1998p556" id="Karger:1998p556">Kar98</a></dt>
<dd>David&nbsp;R Karger, <i>Randomization in graph optimization problems: A survey</i>, Optima: Mathematical Programming Society Newsletter <b>58</b> (1998).</dd>
<dt><a name="Kristjansson:2004p545" id="Kristjansson:2004p545">KCVM04</a></dt>
<dd>Trausti Kristjansson, Aron Culotta, Paul Viola, and Andrew&nbsp;Kachites McCallum, <i>Interactive information extraction with constrained conditional random fields</i>, AAAI (2004).</dd>
<dt><a name="Kleinberg:1997p32" id="Kleinberg:1997p32">Kle97</a></dt>
<dd>Jon&nbsp;M Kleinberg, <i>Authoritative sources in a hyperlinked environment</i>, ACM SIAM Symposium on Discrete Algorithms (1997).</dd>
<dt><a name="Komarek:2008p1742" id="Komarek:2008p1742">Kom08</a></dt>
<dd>Paul Komarek, <i>Logistic regression for data mining and high-dimensional classification</i>, CMU CS Thesis (2008), 138.</dd>
<dt><a name="Kivinen:1995p1836" id="Kivinen:1995p1836">KWA95</a></dt>
<dd>J&nbsp;Kivinen, Manfred&nbsp;K Warmuth, and P&nbsp;Auer, <i>The perceptron algorithm vs. winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant</i>, COLT (1995), 289-296.</dd>
<dt><a name="Lewis:1998p105" id="Lewis:1998p105">Lew98</a></dt>
<dd>David&nbsp;D Lewis, <i>Naive (bayes) at forty: The independence assumption in information retrieval</i>, find journal (1998).</dd>
<dt><a name="Lin:1973p2739" id="Lin:1973p2739">LK73</a></dt>
<dd>S&nbsp;Lin and BW&nbsp;Kernighan, <i>An effective heuristic algorithm for the traveling-salesman problem</i>, Operations Research (1973), 498-516.</dd>
<dt><a name="Maron:1961p2566" id="Maron:1961p2566">Mar61</a></dt>
<dd>M&nbsp;E Maron, <i>Automatic indexing: An experimental inquiry</i>, RAND Technical Report (1961), 404-417.</dd>
<dt><a name="HTSMH" id="HTSMH">MF00</a></dt>
<dd>Zbigniew Michalewicz and David&nbsp;B. Fogel, <i>How to solve it: Modern heuristics</i>, Springer, 2000.</dd>
<dt><a name="Mill" id="Mill">Mil02</a></dt>
<dd>John&nbsp;Stuart Mill, <i>A system of logic</i>, University Press of the Pacific, 2002.</dd>
<dt><a name="MitchellML" id="MitchellML">Mit97</a></dt>
<dd>Thomas Mitchell, <i>Machine learning</i>, McGraw-Hill, 1997.</dd>
<dt><a name="Maron:2000p2553" id="Maron:2000p2553">MK00</a></dt>
<dd>M&nbsp;E Maron and J&nbsp;L Kuhns, <i>On relevance, probabilistic indexing and information retrieval</i>, 1960 (2000), 1-29.</dd>
<dt><a name="Mount:2000p360" id="Mount:2000p360">Mou00</a></dt>
<dd>John&nbsp;A Mount, <i>Automatic detection of potential deadlock</i>, Dr. Dobbs Journal (2000).</dd>
<dt><a name="TradeArt" id="TradeArt">Mou09a</a></dt>
<dd>John Mount, <i>Automatic generation and testing of un-rolls for profitable technical trades</i>, <a href="http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/">http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/</a>, 2009.</dd>
<dt><a name="MLArt" id="MLArt">Mou09b</a></dt>
<dd>John Mount, <i>A demonstration of data mining</i>, <a href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/</a>, 2009.</dd>
<dt><a name="Page:1998p2689" id="Page:1998p2689">PBMW98</a></dt>
<dd>Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, <i>The pagerank citation ranking: Bringing order to the web</i>, <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768</a> (1998).</dd>
<dt><a name="Polya1" id="Polya1">Pol54a</a></dt>
<dd>G.&nbsp;Polya, <i>Induction and analogy in mathematics</i>, Princeton University Press, 1954.</dd>
<dt><a name="Polya2" id="Polya2">Pol54b</a></dt>
<dd>G.&nbsp;Polya, <i>Patterns of plausible inference</i>, Princeton University Press, 1954.</dd>
<dt><a name="citeulike:679515" id="citeulike:679515">Pol71</a></dt>
<dd>G.&nbsp;Polya, <i>How to solve it</i>, Princeton University Press, November 1971.</dd>
<dt><a name="Rall:1996p2473" id="Rall:1996p2473">RC96</a></dt>
<dd>Louis&nbsp;B Rall and George&nbsp;F Corliss, <i>An introduction to automatic differentiation</i>, SIAM: Computational Differentiation: Techniques, Applications and Tools (1996), 1-18.</dd>
<dt><a name="IndiscreteThoughts" id="IndiscreteThoughts">Rot97</a></dt>
<dd>Gian-Carlo Rota, <i>Indiscrete thoughts</i>, Birkhauser, 1997.</dd>
<dt><a name="Skilling:1988p780" id="Skilling:1988p780">Ski88</a></dt>
<dd>John Skilling, <i>The axioms of maximum entropy</i>, Maximum Entropy and Bayesian Methods in Science and Engineering <b>1</b> (1988), 173-187.</dd>
<dt><a name="Sleator:1985p168" id="Sleator:1985p168">ST85</a></dt>
<dd>Daniel&nbsp;Dominic Sleator and Robert&nbsp;Endre Tarjan, <i>Amortized efficiency of list update and paging rules</i>, Communications of the ACM <b>28</b> (1985), no.&nbsp;2.</dd>
<dt><a name="SVMBook" id="SVMBook">STC00</a></dt>
<dd>John Shawe-Taylor and Nello Cristianini, <i>Support vector machines</i>, Cambridge University Press, 2000.</dd>
<dt><a name="KernBook" id="KernBook">STC04</a></dt>
<dd>John Shawe-Taylor and Nello Cristianini, <i>Kernel methods for pattern analysis</i>, Cambridge University Press, 2004.</dd>
<dt><a name="Strang" id="Strang">Str76</a></dt>
<dd>Gilbert Strang, <i>Linear algebra and its applications</i>, Academic Press, Inc., 1976.</dd>
<dt><a name="TibHat" id="TibHat">TH09</a></dt>
<dd>Trevor&nbsp;Hastie, Robert&nbsp;Tibshirani, and Jerome&nbsp;Friedman, <i>The elements of statistical learning: Data mining, inference and prediction</i>, Springer, 2009.</dd>
<dt><a name="Trevisan:2008p2166" id="Trevisan:2008p2166">TTV08</a></dt>
<dd>Luca Trevisan, Madhur Tulsiani, and Salil Vadhan, <i>Regularity, boosting, and efficiently simulating every high-entropy distribution</i>, Electronic Colloquium on Computational Complexity (2008), 18.</dd>
<dt><a name="Zeilberger:1995p277" id="Zeilberger:1995p277">Zei95</a></dt>
<dd>Doron Zeilberger, <i>The method of undetermined generalization and specialization illustrated with fred galvin&#8217;s amazing proof of the dinitz conjecture</i>, <a href="http://arxiv.org/abs/math/9506215">http://arxiv.org/abs/math/9506215</a>, 1995.</dd>
</dl>
<h1><a name="SECTION00070000000000000000" id="SECTION00070000000000000000">Acknowledgement</a></h1>
<p><font><font>A thank you to readers who supplied help and comments on earlier drafts.</font></font></p>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot21" id="foot21">&#8230; Mount</a><a href="#tex2html3"><sup>1</sup></a></dt>
<dd>email: <tt><a name="tex2html1" href="mailto:jmount@win-vector.com" id="tex2html1">mailto:jmount@win-vector.com</a></tt> web: <tt><a name="tex2html2" href="http://www.win-vector.com/" id="tex2html2">http://www.win-vector.com/</a></tt></dd>
<dt><a name="foot244" id="foot244">&#8230; principle.</a><a href="#tex2html4"><sup>2</sup></a></dt>
<dd>The pre-existing practice that comes closest to the local to global principle is found in operations research where encoding a problem to be solved by an optimizer is a central technique. We claim the natural statement of the local to global principle is more general than <font><em>always</em> encoding constraints for a particular optimizer (in particular globalization is not always optimization).</font></dd>
<dt><font><a name="foot43" id="foot43">&#8230; structure</a><a href="#tex2html6"><sup>4</sup></a></font></dt>
<dd><font>By &#8220;link structure&#8221; we mean which web pages link to which other web pages.</font></dd>
<dt><font><a name="foot45" id="foot45">&#8230; graph</a><a href="#tex2html7"><sup>5</sup></a></font></dt>
<dd><font>Remember, a graph is a diagram consisting of nodes and edges (here depicted as arrows).</font></dd>
<dt><font><a name="foot245" id="foot245">&#8230; features</a><a href="#tex2html9"><sup>6</sup></a></font></dt>
<dd><font>For example the model could account for:</font></p>
<ul>
<li>surfers entering and leaving the model</li>
<li>link odds that vary where they are on a page</li>
<li>surfers staying on a page proportional to how much text is on the page</li>
<li>matching known traffic and click behavior where we have such data.</li>
</ul>
<p><font>For simplicity we will just stick with the given example.</font></dd>
<dt><font><a name="foot154" id="foot154">&#8230; components.</a><a href="#tex2html17"><sup>7</sup></a></font></dt>
<dd><font>When a system is named and defined as an exact set of procedures the system can, by definition, not be improved. This is because with any change in procedure we have a new system that no longer matches the original definition and therefore requires a new name.</font></dd>
</dl>
<p><font><br /></font></p>
<hr />
<address><font>John Mount 2009-11-11</font></address>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Permanent Link: Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On The Hysteria Over &#8220;The Cloud&#8221;</title>
		<link>http://www.win-vector.com/blog/2009/08/on-the-hysteria-over-the-cloud/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=on-the-hysteria-over-the-cloud</link>
		<comments>http://www.win-vector.com/blog/2009/08/on-the-hysteria-over-the-cloud/#comments</comments>
		<pubDate>Thu, 13 Aug 2009 23:12:53 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Data Centers]]></category>
		<category><![CDATA[Mainframes]]></category>
		<category><![CDATA[PC Revolution]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=237</guid>
		<description><![CDATA[On The Hysteria Over &#8220;The Cloud&#8221; The frenzy of anticipation and opinion about &#8220;The Cloud&#8221; is so intense and so pointless it becomes &#8220;parody proof.&#8221; It is as Jerry Holkins and Mike Krahulik wrote (regarding a different situation): It&#8217;s like trying to make fun of a clown. What, are you going to make fun of [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/02/postels-law-not-sure-who-to-be-angry-with/' rel='bookmark' title='Permanent Link: Postel&#8217;s Law: Not Sure Who To Be Angry With'>Postel&#8217;s Law: Not Sure Who To Be Angry With</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='Permanent Link: I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>On The Hysteria Over &#8220;The Cloud&#8221;<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/180px-Lenticular_Cloud_in_Wyoming_0034b.jpg" alt="180px-Lenticular_Cloud_in_Wyoming_0034b.jpg" border="0" width="180" height="120" /><br />
</center></p>
<p />
The frenzy of anticipation and opinion about &#8220;The Cloud&#8221; is so intense and so pointless it becomes &#8220;parody proof.&#8221;<br />
<span id="more-237"></span>It is as Jerry Holkins and Mike Krahulik wrote (regarding a different situation):</p>
<blockquote><p>
It&#8217;s like trying to make fun of a clown.  What, are you going to make fun of his tiny car?  His floppy shoes? It just doesn&#8217;t work.
</p></blockquote>
<p />
I would like to point out that (by computer science standards) the cloud is not new and has for some time been considered inevitable.</p>
<p />
But what is &#8220;The Cloud?&#8221; What the cloud is depends a bit on what conversation you are being drawn into.  If the conversation is about computing then the cloud is remote computers, software and services like Wikipedia, GMail, SalesForce.com, Google Docs, Amazon EC2/S3 and Google App Engine.  If the conversation is about human interaction then the cloud is ecosystems like Facebook, Twitter and RSS.  Each of these is a facet of important longer term trends, but for individual companies and technologies the pendulum is about as fast on the down-swing as it was on the up-swing.  At this time we can safely declare a number of recent important players dead: Friendster, AltaVista, WSDL, Usenet, IRC and Web2.0.  </p>
<p />
<p>It is true that the network itself is more useful than the computer, but this idea is not new to our third millennium.  The current people getting rich promoting this idea did not invent this idea, they grew up in its shadow.  The early big thinkers on computers had big plans.  Plans much larger than Tetris, payroll processing, COBOL and punched cards.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/Hollerith_punched_card.jpg" alt="Hollerith_punched_card.jpg" border="0" width="434" height="246" /><br />
</center></p>
<p />
<p>Take the article <cite>&#8220;As We May Think&#8221; (by Vannevar Bush, The Atlantic Monthly (1945))</cite>.  In it Vannevar Bush writes:
<p />
<blockquote><p>
Consider a future device for individual use, which is a sort of mechanized private file and library.  It needs a name, and, to coin one at random, &#8220;memex&#8221; will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.
</p></blockquote>
<p /> At first this sounds like nothing more than <cite>&#8220;Danny Dunn and the Homework Machine&#8221; (by Jay Williams (1964), Scholastic Press)</cite><br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/DannyDunnHomeworkMachine.jpg" alt="DannyDunnHomeworkMachine.jpg" border="0" width="240" height="240" />.<br />
</center></p>
<p /> But in his essay Vannevar Bush uses the phrase &#8220;it can presumably be operated from a distance&#8221; and ends his essay with a long section of how many professions would benefit from a Memex (we show here only one):
<p />
<blockquote><p>
Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified. The lawyer has at his touch the associated opinions and decisions of his whole experience, and of the experience of friends and authorities.
</p></blockquote>
<p /> Obviously we are reading this with a modern eye, but here we have the antecedents of hypertext and the Wikipedia.
<p />
<p>We can trace this thread further forward to <cite>&#8220;Augmenting Human Intellect: A Conceptual Framework&#8221; (by Douglas C Engelbart (1962))</cite> and the famous <a href="http://sloan.stanford.edu/MouseSite/1968Demo.html">1968 demo</a>.</p>
<p>And we can further trace the ideas passing through: <cite> &#8220;Literary Machines: The report on, and of, Project Xanadu concerning word processing, electronic publishing, hypertext, thinkertoys, tomorrow&#8217;s intellectual revolution, and certain other topics including knowledge, education and freedom&#8221; (by Ted Nelson (1981), Mindful Press, Sausalito, California.) </cite>  </p>
<p />
<p>These works were all about knowledge engineering, information storage, networking and communication.  There was an extreme urgency in these works.  Both Engelbart and Nelson felt we had a limited window to gain the ability to organize the world&#8217;s information before some catastrophic error or misunderstanding eliminated us all.  This feeling of urgency and doom came from another exciting application of real time networked computers: <a href="http://en.wikipedia.org/wiki/Semi_Automatic_Ground_Environment">SAGE</a>.  SAGE was the &#8220;Semi Automatic Ground Environment&#8221; first made operational in 1959.  It involved networked computers, light pen based operator terminals and was the system that the United States had ready to fight World War III.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/File-SAGE_control_room.png" alt="File-SAGE_control_room.png" border="0" width="180" height="232" /><br />
</center></p>
<p>This was the era of near infinite budgets, block sized computer complexes, massive mainframes and IT priesthoods that ran the whole show.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/180px-Sage_typical_building.jpg" alt="180px-Sage_typical_building.jpg" border="0" width="180" height="136" /><br />
</center></p>
<p>The inevitable march was on. Some large fraction of the GDP would be forever dedicated to building and maintaining monument sized networked computing facilities.  Your degree of relevance and power in society would be directly determined by how close you could get to these facilities.  Then something happened and distracted everyone.  The distraction was so immediate and so complete that by the time the inevitable march restarted (block sized Google data centers and a <a href="http://green.yahoo.com/blog/ecogeek/1125/yahoo-data-center-will-be-powered-by-niagara-falls.html"> proposed Yahoo data center to be built attached to Niagara Falls</a>) everyone thought it was a new thing.</p>
<p>What happened was the 1958 demonstrations of successful integrated circuits.  This and the transistor started an era of micro-miniaturization that took the world by storm.  By 1971 Intel had released a single chip CPU (the 4004) as a commercial product.  This chip implemented the core of a computer in a fingertip size package that contained 2300 transistors.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/180px-Intel_4004.jpg" alt="180px-Intel_4004.jpg" border="0" width="180" height="173" /><br />
</center></p>
<p>From here on everything was desktop calculators, pocket calculators and digital watches.  And then the personal computer and the personal computer revolution hit.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/180px-Popular_Electronics_Cover_Jan_1975.jpg" alt="180px-Popular_Electronics_Cover_Jan_1975.jpg" border="0" width="180" height="240" /><br />
</center></p>
<p>IBM kicked the PC revolution into high gear when they pushed into the market in 1981.  The personal computer was a supreme distraction that pulled attention away from the monolithic computers for fifteen years.  And for a long while networking and shared information were both nearly forgotten. Computers were for spreadsheets, desktop publishing and other non-networked tasks.
<p />
<p>However, out of public view the monolithic network continued to develop.  The Internet was started as ARPAnet and grew connecting universities and defense contractors from 1969 through now.  The messaging formats (it is inappropriate to use the more common term &#8220;technology&#8221; to describe HTTP and HTML) we call &#8220;The World Wide Web&#8221; were invented (without much fanfare) in 1989.  Netscape was founded in 1994 and made the World Wide Web and Internet available to the PC.  And then the Internet hit like a tsunami.  Electronic commerce and speculation funded the initial burst.  Then on-line advertising took over and we are back to building new encyclopedias, tracking everyone and once again building city block sized computers (now called data centers).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/google_data_center_lenoir.jpg" alt="google_data_center_lenoir.jpg" border="0" width="512" height="355" /><br />
</center></p>
<p>Once again we are being told our data is too important to be locked in our desk (or PC) and everything is migrating back to the mainframe (now called &#8220;the cloud&#8221;).
<p />
<p>Will the cycle reverse?  If applications are moving into the cloud now will they ever move back out?
<p />
<p>Moore&#8217;s law has a way of shrinking things (a current smart phone outperforms many early mainframes, super computers and data centers).  Will individual PCs once again be more important than the network?  Some of the more useful parts of the Internet (like the Wikipedia) are small enough to put on current PCs.  The data centers and networks will not go away any time soon, but excitement and attention could move on to something else.  Devices that you could carry everywhere and that have intermittent or expensive connections to the Internet might have an advantage in being able to cache some of the Internet.  And excitement follows what is new, so a stable pervasive cloud would likely be taken for granted (like roads, power, telephone and other utilities).
<p />
<p>Another thing that could migrate applications back out of the cloud (assuming they migrate in) is if access to the user becomes too important to delegate to the cloud.  eCommerce applications take user access when they can get it, but many other applications may depend more on immediate access to the user than on grabbing fresh data from the network.  For example a pacemaker is likely to run most of its application from an embedded computer- this computer might talk to the cloud when it can, but the application will be designed to stand alone as long as possible.
<p />
<p>In the end evangelizing the coming triumph of factory scale computing and networking is pointless.  It is already here and has no great need for cheerleaders.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/02/postels-law-not-sure-who-to-be-angry-with/' rel='bookmark' title='Permanent Link: Postel&#8217;s Law: Not Sure Who To Be Angry With'>Postel&#8217;s Law: Not Sure Who To Be Angry With</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='Permanent Link: I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/on-the-hysteria-over-the-cloud/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>What is &#8220;Genetic Art?&#8221;</title>
		<link>http://www.win-vector.com/blog/2009/06/what-is-genetic-art/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=what-is-genetic-art</link>
		<comments>http://www.win-vector.com/blog/2009/06/what-is-genetic-art/#comments</comments>
		<pubDate>Tue, 02 Jun 2009 05:17:57 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[History]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[art]]></category>
		<category><![CDATA[genetic art]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=125</guid>
		<description><![CDATA[What is &#8220;genetic art?&#8221; My answer to this is http://www.geneticart.org (redirects to http://www.mzlabs.com), but this requires some explanation. The quick answer is this is genetic art: The longer answer is that a number of times different forms of algorithmic art have been invented. Algorithmic art is art generated by mathematical procedures. Such art is similar [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/04/algorithmic-movie-with-texture/' rel='bookmark' title='Permanent Link: Algorithmic Movie (with texture)'>Algorithmic Movie (with texture)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/01/relative-returns-a-banker-versus-trader-paradox/' rel='bookmark' title='Permanent Link: Relative returns: a banker versus trader paradox'>Relative returns: a banker versus trader paradox</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>What is &#8220;genetic art?&#8221;  My answer to this is <a href="http://www.geneticart.org">http://www.geneticart.org</a>  (redirects to <a href="http://www.mzlabs.com">http://www.mzlabs.com</a>), but this requires some explanation.<span id="more-125"></span><br />
The quick answer is this is genetic art:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/06/pic1.png" alt="pic1.png" border="0" width="600"  /><br />
</center></p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/06/pic2.png" alt="pic2.png" border="0" width="600" /><br />
</center></p>
<p>The longer answer is that different forms of algorithmic art have been invented a number of times.  Algorithmic art is art generated by mathematical procedures.  Such art is similar to earlier mechanical and kinetic art forms.  One branch of mathematics often used to generate such art is called &#8220;fractals.&#8221;  We looked somewhere else for our inspiration (our art is not strictly fractal in nature).  What we worked on we called &#8220;genetic art&#8221; to emphasize the role of encoding and re-combination in the works.</p>
<p>In the early 90&#8242;s Karl Sims presented a number of art installations based on at least three interesting ideas: </p>
<ul>
<li>Transforming images</li>
<li>Evolving combinations of transforms</li>
<li>Direct participation</li>
</ul>
<p>(see: Karl Sims. Artificial Evolution for Computer Graphics. Proceedings of SIGGRAPH 1991 and <a href="http://www.karlsims.com/genetic-images.html">Karl Sims&#8217; homepage</a>).</p>
<p>The part that caught a number of people&#8217;s imaginations was the evolution aspect.  Karl Sims defined a method of combining transformations of original source images.  He then allowed people to manipulate his art installations and &#8220;vote&#8221; on art they liked best.  The more popular pieces were combined (or bred) to create newer works that were then put up against criticism.  After many breedings (or generations) the combinations of transforms were quite complicated and a number of unexpected images were created.</p>
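<p>The vote-then-breed loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Karl Sims&#8217; actual method: here a work is a hypothetical list of named transforms, breeding is a simple one-point crossover, and the names <tt>TRANSFORMS</tt>, <tt>random_genome</tt>, <tt>breed</tt> and <tt>next_generation</tt> are invented for this example.</p>

```python
import random

# Hypothetical primitive transforms; the names are illustrative only.
TRANSFORMS = ["blur", "warp", "invert", "rotate", "tint"]


def random_genome(rng, length=4):
    """A genome is a list of transform names, applied in order."""
    return [rng.choice(TRANSFORMS) for _ in range(length)]


def breed(parent_a, parent_b, rng):
    """One-point crossover: a prefix of one parent joined to a suffix of the other."""
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:]


def next_generation(population, votes, rng):
    """Keep the two most-voted genomes and fill the rest with their children."""
    ranked = sorted(range(len(population)), key=lambda i: votes[i], reverse=True)
    best, second = population[ranked[0]], population[ranked[1]]
    children = [breed(best, second, rng) for _ in range(len(population) - 2)]
    return [best, second] + children
```

<p>Repeating <tt>next_generation</tt> with fresh votes each round is what lets the combinations of transforms grow more elaborate over many generations; a fuller system would presumably also mutate genomes and breed more than two parents.</p>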
<p>At CMU Shumeet Baluja, Dean Pomerleau and Todd Jochem were interested both in the evolutionary aspects of the art and also seeing if a machine could learn to model user tastes (see Shumeet Baluja, Dean Pomerleau and Todd Jochem. Simulating User&#8217;s Preferences: Towards Automated Artificial Evolution for Computer Generated Images. Technical Report CMU-CS-93-198. Carnegie Mellon University. Pittsburgh, PA. October 1993. ).  They built a much simpler art system that combined primitive elements (elements closer to brush strokes than to original pictures) and tried to learn user preferences for complex pictures.  </p>
<p>Figures of this era looked much like this (well, better than this; this one comes from a scan of a black and white printing of the paper):</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/06/evolve.png" alt="evolve.png" border="0" width="209" height="205" />.<br />
</center></p>
<p>Scott Neal Reilly built a new, more simplified system; and with Michael Witbrock put the whole thing on the Web.  This was an unimaginably primitive time on the Web. Cutting edge interaction was sites like &#8220;Blue Dog Can Count.&#8221;  The Mac had no forms capable browser and Amazon.com was still a year away from launching.  An interactive art exhibition running directly on the Web (and manipulated by anybody) was a significant step forward.</p>
<p>Michael Witbrock was influenced by the stories of  <a href="http://en.wikipedia.org/wiki/Heikegani">Heikegani Crabs</a> and Alan Turing&#8217;s 1952 paper &#8220;The Chemical Basis of Morphogenesis&#8221; (which theorized how simple systems could develop textures).</p>
<p>At this point I (John Mount) got interested in the project and felt that much more could be done with how such systems handled color.  The art had been simplified to primitive elements that one could think of as brushes (really more like gradients) but the art was essentially grey-scale with a false-color map applied at the last step.  Karl Sims had made transformations on images his primitive operations; I wanted my primitive operation to be transformations on color.</p>
<p>Being a math-nerd I chose to encode color inside a mathematical system called &#8220;Quaternions&#8221; (see Ebbinghaus et al. Numbers, Springer-Verlag, Second Edition, 1988).  Colors are often represented as three brightness terms- for example intensity of red, intensity of green and intensity of blue.  The Quaternions were discovered by Sir William Rowan Hamilton in 1843 (see <a href="http://en.wikipedia.org/wiki/Quaternion">Wikipedia: Quaternion</a>).  Sir Hamilton was trying to solve the problem of encoding positions in space in a nice structure and was so excited by his discovery that he carved his fundamental formula for them into the Brougham Bridge the night he had his breakthrough.  Quaternions are represented as four standard numbers (so they have enough &#8220;slots&#8221; to encode a position in Sir Hamilton&#8217;s case or in our case a color) and they themselves behave a lot like individual numbers.  There are rules for adding, subtracting, multiplying and even dividing Quaternions.  This means you can write formulas over them and these formulas are now directly manipulating colors (instead of manipulating geometry or intensities as in the earlier systems).</p>
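<p>To make the &#8220;four slots plus arithmetic&#8221; idea concrete, here is a minimal Python sketch of Quaternion addition, subtraction, multiplication and division.  The arithmetic is the standard Hamilton product; the class name <code>Quaternion</code> and the <code>to_rgb</code> color mapping are assumptions made for illustration and are not taken from the original genetic art system.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quaternion:
    """A quaternion w + x*i + y*j + z*k stored as four ordinary numbers."""
    w: float
    x: float
    y: float
    z: float

    def __add__(self, other):
        return Quaternion(self.w + other.w, self.x + other.x,
                          self.y + other.y, self.z + other.z)

    def __sub__(self, other):
        return Quaternion(self.w - other.w, self.x - other.x,
                          self.y - other.y, self.z - other.z)

    def __mul__(self, other):
        # Hamilton product, which follows from i*i = j*j = k*k = i*j*k = -1.
        a1, b1, c1, d1 = self.w, self.x, self.y, self.z
        a2, b2, c2, d2 = other.w, other.x, other.y, other.z
        return Quaternion(
            a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
            a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
            a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
            a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
        )

    def conjugate(self):
        return Quaternion(self.w, -self.x, -self.y, -self.z)

    def norm_squared(self):
        return self.w ** 2 + self.x ** 2 + self.y ** 2 + self.z ** 2

    def __truediv__(self, other):
        # Right division: self times the inverse of other,
        # where the inverse is conjugate(other) / norm_squared(other).
        n = other.norm_squared()
        c = other.conjugate()
        return self * Quaternion(c.w / n, c.x / n, c.y / n, c.z / n)

    def to_rgb(self):
        # Hypothetical color mapping (an assumption, not the original
        # system's): clamp the i, j, k parts into [0, 1] as red, green, blue.
        def clamp(v):
            return min(1.0, max(0.0, v))
        return (clamp(self.x), clamp(self.y), clamp(self.z))
```

<p>Multiplication is not commutative here: with <code>i = Quaternion(0, 1, 0, 0)</code> and <code>j = Quaternion(0, 0, 1, 0)</code>, <code>i * j</code> and <code>j * i</code> differ in sign, which is part of what makes formulas over Quaternions produce less predictable pictures than formulas over ordinary numbers.</p>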
<p>So, as with Neal Reilly&#8217;s system, we represented all of our transformations as formulas and represented &#8220;breeding&#8221; as ripping a bit of one formula out and combining it with another.  For instance these two rather uninteresting color gradients were represented by the formulas:<br />
<center></p>
<table>
<tr>
<td> ( x &#8211; i y ) : </td>
<td>
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/06/x-iy.png" alt="x_iy.png" border="0" width="150" height="100" />
</td>
</tr>
<tr>
<td> ( x &#8211; i y &#8211; j x &#8211; k y ) : </td>
<td>
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/06/x-iy-jx-ky.png" alt="x_iy_jx_ky.png" border="0" width="150" height="100" />.
</td>
</tr>
</table>
<p></center></p>
<p>One of our arithmetic operations was named &#8220;mod&#8221; and we could use it to combine the two items into a more complicated formula and somewhat more interesting picture:</p>
<p><center></p>
<table>
<tr>
<td>( mod ( x &#8211; i y) ( x &#8211; i y &#8211; j x &#8211; k y ) )  : </td>
<td>
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/06/mod.png" alt="mod.png" border="0" width="150" height="100" />.
</td>
</tr>
</table>
<p></center></p>
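<p>The &#8220;ripping a bit of one formula out and combining it with another&#8221; step can be sketched as a subtree swap on expression trees.  This is only an illustration under assumptions: the nested-tuple encoding and the helper names <code>subtrees</code>, <code>replace_at</code> and <code>breed</code> are hypothetical and are not the original program&#8217;s representation.</p>

```python
import random

# Formulas are nested tuples (operator, arg, ...) with strings as leaves.
# For example ("-", "x", ("*", "i", "y")) stands for ( x - i y ).


def subtrees(formula, path=()):
    """Yield (path, subtree) for every node in the formula tree."""
    yield path, formula
    if isinstance(formula, tuple):
        for index, child in enumerate(formula[1:], start=1):
            yield from subtrees(child, path + (index,))


def replace_at(formula, path, new_subtree):
    """Return a copy of formula with the node at path replaced."""
    if not path:
        return new_subtree
    head, rest = path[0], path[1:]
    replaced = replace_at(formula[head], rest, new_subtree)
    return formula[:head] + (replaced,) + formula[head + 1:]


def breed(parent_a, parent_b, rng):
    """Rip a random piece out of parent_b and splice it into parent_a."""
    target_path, _ = rng.choice(list(subtrees(parent_a)))
    _, donor = rng.choice(list(subtrees(parent_b)))
    return replace_at(parent_a, target_path, donor)


parent_a = ("-", "x", ("*", "i", "y"))
parent_b = ("-", ("-", ("-", "x", ("*", "i", "y")), ("*", "j", "x")), ("*", "k", "y"))
child = breed(parent_a, parent_b, random.Random(0))
```

<p>Because every subtree is itself a valid formula, any child produced this way is still a valid formula, so it can always be rendered as a picture and offered up for the next round of voting.</p>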
<p>After enough generations of selection and breeding the formulas get long and complicated (luckily nobody but the machines has to look at them) and the pictures get interesting:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/06/silver.png" alt="silver.png" border="0" width="300" height="200" />.<br />
</center></p>
<p>Michael Witbrock and Scott Neal Reilly supplied an updated web interface. And at this point we got our 15 minutes of fame:</p>
<p>From Wired 3.01 January 1995 p. 147  Kristin Spence&#8217;s &#8220;Net Surf&#8221; column:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/06/wired301.png" alt="Wired3.01.png" border="0" width="443" height="252" />.<br />
</center></p>
<p>We also made large prints (using a parallel computation system named &#8220;WAX&#8221; by Peter Stout) and set up an exhibition in a Pittsburgh coffee house.  The development work was largely done on a machine with a black and white monitor (later we got access to a grey scale monitor) so it really was a treat that color was able to fend for itself.</p>
<p>On an unrelated track in 1991 another CMU student, <a href="http://scottdraves.com/history.html">Scott Draves</a>, was pursuing a serious project and building art based on iterated function systems (more related to fractals than our work) and in 1999 added some genetic ideas and released the Electric Sheep client/server oriented screen saver (inspired by SETI@home).</p>
<p>Recently we have gotten back to some more of Sims&#8217; ideas and allowed more geometric transformations and real source images, such as incorporating some of Dover&#8217;s royalty free Japanese textile designs:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/06/texture.png" alt="texture.png" border="0" width="300" height="200" />.<br />
</center></p>
<p>The original system was group interactive and an incredible hit (and had its own Zephyr discussion instance- the then equivalent of Twitter).  The current demo is stand alone and server-free, so it is a single player game.</p>
<p>All the user has to do is point a Java 1.4 (or better) capable browser at:  <a href="http://mzlabs.com/MZLabsJM/page4/page22/page22.html">Genetic Art Program</a>.  Click (and hold) on the &#8220;Action Menu&#8221; of any of the sub-windows and select &#8220;take over left selection&#8221; on one picture that interests you and &#8220;take over right selection&#8221; on another picture that interests you.  You will notice doing this copies the pictures to the left and right of the &#8220;breed pictures&#8221; button.  Now press &#8220;breed pictures&#8221; as many times as you want.  Each time you press it you will get a new picture built by combining elements of the two small pictures.  At any time you can have some other picture you like take over a breeding position (again by using its action menu).  And you can also scroll the strip of five pictures around and click on them to introduce previous favorites from the earlier server version of the program into your space of opportunities.</p>
<p>We were originally a bit uncomfortable calling the work &#8220;Art&#8221; but a number of the images have made significant impressions on us and others- so perhaps it qualifies.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/04/algorithmic-movie-with-texture/' rel='bookmark' title='Permanent Link: Algorithmic Movie (with texture)'>Algorithmic Movie (with texture)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/01/relative-returns-a-banker-versus-trader-paradox/' rel='bookmark' title='Permanent Link: Relative returns: a banker versus trader paradox'>Relative returns: a banker versus trader paradox</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/06/what-is-genetic-art/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Programs reduced to statistics</title>
		<link>http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=programs-reduced-to-statistics</link>
		<comments>http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/#comments</comments>
		<pubDate>Sun, 31 May 2009 16:24:58 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[computer languages]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=110</guid>
		<description><![CDATA[An interesting article on programming languages by Guillaume Marceau is making the rounds: The speed, size and dependability of programming languages. The article points out very clearly what some of the differences in major programming languages are. The author uses benchmarking and graphs in an interesting way. I have had a soft spot for this [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='Permanent Link: I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>An interesting article on programming languages by Guillaume Marceau is making the rounds:<br />
<a href="http://gmarceau.qc.ca/blog/2009/05/speed-size-and-dependability-of.html"> The speed, size and dependability of programming languages</a>. The article points out very clearly what some of the differences in major programming languages are.  The author uses benchmarking and graphs in an interesting way.<br />
<span id="more-110"></span></p>
<p>I have had a soft spot for this kind of study ever since I read: Donald E. Knuth: An Empirical Study of FORTRAN Programs. Softw., Pract. Exper. 1(2): 105-133 (1971).  In that article Knuth admits to breaking into people&#8217;s accounts to collect statistics on what evil people were feeding into the FORTRAN compiler.</p>
<p>Let&#8217;s look at the gestalt of a few popular programming languages using button-sized excerpts from Marceau&#8217;s article:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/05/compplots.png" alt="compPlots.png" border="0" width="536" height="113" />.<br />
</center></p>
<p>To build these graphs 19 challenge problems were implemented in 72 programming languages.  Each square is a programming language; the x-axis is run time and the y-axis is code size (large is bad on both of these).  Each line segment connects the code size and run time of one example program run to the centroid of all such runs for the language.  We all know code size is not a very good stand-in for programming difficulty (compare C, a merely primitive language, to C++, an outright programmer-hostile language), but the pictures actually tell a credible story.</p>
<ul>
<li>
GCC (or C) is very very fast but takes a lot of code (its graph is a vertical bar running up and down the left).
</li>
<li>
Java mostly works like C, but every once in a while lets you down on performance (this is leaving out that Java is far safer than C and far more wasteful of memory).
</li>
<li>
Javascript and Ruby have such bad implementations that their centers are off the graph (this brings up a point the original authors well understand- you cannot benchmark a language, only a specific run of a specific program using a specific language implementation).
</li>
<li>
Perl and Erlang have similar run time performance (though they sit at completely opposite poles of elegance; elegance is not plotted on the graph).
</li>
<li>
Ruby&#8217;s implementation makes Python look fast.
</li>
<li>
OCaml lives up to its reputation of being simultaneously very expressive and efficient (but expressive power is not a direct measure of ease of use, think of APL).
</li>
</ul>
<p>The benchmarking depends on people donating example programs and the problem types are heavily biased towards the puzzle area (where C, Java and OCaml excel) and not to the &#8220;it&#8217;s a one-liner because it is already done in a framework&#8221; area (Perl, Python, Ruby).</p>
<p>For all the problems inherent in such a study I think it is actually interesting what a little quantitative data lets us think about.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='Permanent Link: I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Map Reduce: A Good Idea</title>
		<link>http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=map-reduce-a-good-idea</link>
		<comments>http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/#comments</comments>
		<pubDate>Sun, 25 Jan 2009 20:32:20 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[External Sorting]]></category>
		<category><![CDATA[Map Reduce]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=30</guid>
		<description><![CDATA[Some time ago I subscribed to The Database Column because it would be fun to see what these incredible people wanted to discuss. We owe much of our current database technology to Professor Stonebraker and Vertica sounds like an incredible product. And I definitely want to continue to subscribe. However, the reading experience is marred [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Permanent Link: Must Have Software'>Must Have Software</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Permanent Link: Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Some time ago I subscribed to The <a href="http://www.databasecolumn.com/">Database Column</a>  because it would be fun to see what these incredible people wanted to discuss.  We owe much of our current database technology to Professor Stonebraker and Vertica sounds like an incredible product.  And I definitely want to continue to subscribe.</p>
<p>However, the reading experience is marred by some flaw in their RSS system that keeps marking the article <a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html">&#8220;MapReduce: A major step backwards&#8221;</a> as a new article.  This causes the article to appear in my RSS reader every few weeks as &#8220;new.&#8221;  This wouldn&#8217;t bother me too much except that the article runs so counter to experience that it is itself offensive.<br />
<span id="more-30"></span><br />
I have no desire to defend Google (the home of MapReduce)- they don&#8217;t need it and are clearly laughing all the way to the bank.  However the points used to kick at MapReduce are so broad and so devalue practitioner experience that they are insulting.  I find the individual arguments offensive and wish to stand against them.  I am not that concerned about the conclusion, use MapReduce or don&#8217;t.  For some things MapReduce is a good tool and for some things it is not.</p>
<p>Let&#8217;s limit ourselves to the 5 primary complaints from the article.  The article (verbatim) says MapReduce is:</p>
<blockquote><p>
1. A giant step backward in the programming paradigm for large-scale data intensive applications.</p>
<p>2. A sub-optimal implementation, in that it uses brute force instead of indexing.</p>
<p>3. Not novel at all &#8212; it represents a specific implementation of well known techniques developed nearly 25 years ago.</p>
<p>4. Missing most of the features that are routinely included in current DBMS.</p>
<p>5. Incompatible with all of the tools DBMS users have come to depend on.
</p></blockquote>
<p>Now let us comment:</p>
<p>1. <strong>&#8220;A giant step backward in the programming paradigm for large-scale data intensive applications.&#8221;</strong>  </p>
<p>Actually, no.  </p>
<blockquote><p>
MapReduce represents a continuity in a stream of ideas that made UNIX great: composable transient tools.  Not everything is a database or data warehouse.  A lot of the grungy UNIX tools (like sort, sed, awk, join) have often been combined to do large scale (at the time) research because they all worked &#8220;out of core&#8221; fairly well.  This makes for a horrible baling-wire set-up.  However, it often handles problems of a size much larger than would have been possible on the hardware at the time.</p>
<p>In addition the author trots out the  &#8220;it&#8217;s Codasyl all over again&#8221; argument.  This argument refers to the ongoing pain and expense derived  from binding algorithmic details too close to the data representation.  In earlier writing it was a fantastic point that warned that the up and coming object oriented databases were going to be the same nasty pointer chasing nightmares that hierarchical databases had been.  I can see why an author might feel that just saying &#8220;it&#8217;s Codasyl&#8221; could win any argument.
</p></blockquote>
<p>2. <strong> &#8220;A sub-optimal implementation, in that it uses brute force instead of indexing.&#8221; </strong> </p>
<p>MapReduce does not use brute force.</p>
<blockquote><p>
MapReduce uses the idea (one that goes back to merge sort) that parallel traversals (that is: running through two lists in the same order synchronously) are a very powerful technique that can, among other things, produce indices.  MapReduce is so efficient that it has been shown to be competitive with the best large scale sorting algorithms on their home-turf: sorting.</p>
<p>MapReduce looks brutish because it drops a lot of popular design features.  One such feature is trying to speed things up through local caching and combining.  However, on the data that MapReduce is commonly used on (free form written text) it is a provable property of the data that local caching is an ineffective complication (due to the heavy-tailedness of the data).  So many of the graceful features missing from MapReduce are actually no help on the types of data it is used on.  There is a certain grace in leaving only the features that are actually helping.
</p></blockquote>
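<p>As a small illustration of the parallel traversal idea in point 2, here is a Python sketch that merges several already-sorted runs of (term, document id) pairs into an index.  The function name <code>build_index</code> and the sample data are assumptions for illustration; this is not code from the MapReduce paper or from Google.</p>

```python
import heapq
from itertools import groupby
from operator import itemgetter


def build_index(sorted_runs):
    """Merge sorted runs of (term, doc_id) pairs into (term, [doc_id, ...]).

    heapq.merge walks all the runs in step (a parallel traversal), holding
    only one pending pair per run.  Equal terms are adjacent in the merged
    stream, so groupby can collect them without any lookup structure.
    """
    merged = heapq.merge(*sorted_runs)
    for term, pairs in groupby(merged, key=itemgetter(0)):
        yield term, [doc_id for _, doc_id in pairs]


runs = [
    [("apple", 1), ("pear", 1)],
    [("apple", 2), ("fig", 2)],
]
index = list(build_index(runs))
# index == [("apple", [1, 2]), ("fig", [2]), ("pear", [1])]
```

<p>Nothing here looks anything up at random: the index falls out of walking sorted streams in the same order, which is the sense in which merging is not brute force.</p>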
<p>3. <strong>&#8220;Not novel at all &#8212; it represents a specific implementation of well known techniques developed nearly 25 years ago.&#8221;</strong></p>
<p> A nasty attack.</p>
<blockquote><p>
MapReduce is a good explanation of some merging techniques that have been common knowledge for quite a while.  This is a legitimate expository goal: explaining something everybody already &#8220;knows&#8221; better.  In fact this is very hard to do and considered a legitimate accomplishment in many fields (for example Rota stated it was a legitimate goal in mathematics).  I myself looked at some of my own older code for dealing with very large data sets after reading the MapReduce paper.  I saw that the paper was describing what I was already doing (splitting the data into streams for later re-joining) and explaining it so well that it was now a method and no longer a hack.  When a paper successfully teaches you about something you already &#8220;know&#8221; it is good work.</p>
<p>The attack is also inaccurate- the ideas are not 25 years old, they are closer to 120 years old.<br />
We could easily trace the lineage of MapReduce back to Hollerith style sorting machines that pre-date general purpose computers (i.e. going back to before 1889).  MapReduce refers back to a time when all computation was performed by what we now call external sorting and tabulation.  These 19th century technologies may seem archaic but they were developed in a world similar to ours: a world where the amount of data is in excess of your conveniently reconfigurable computational resources.
</p></blockquote>
<p>4. <strong> &#8220;Missing most of the features that are routinely included in current DBMS.&#8221; </strong></p>
<p> Unfortunate.</p>
<blockquote><p>
I miss a lot of those features.</p>
<p>However, because MapReduce is such a lean technique I have seen engineers implement their own MapReduce systems in a day (to solve a problem they are working on).  That is they are successfully sorting, joining, indexing and summarizing hundreds of gigabytes of data on a consumer PC within a couple of days of being asked to.  This is from scratch after reading the MapReduce paper.</p>
<p>The &#8220;make versus buy&#8221; decision should not always come out &#8220;make.&#8221;  But it is not wise to artificially bloat up requirements so that the decision can only be &#8220;buy.&#8221;
</p></blockquote>
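<p>To show how lean the technique is, here is a toy Python sketch of the map, sort and reduce steps.  It is an assumption-laden illustration: the names <code>map_reduce</code>, <code>word_mapper</code> and <code>count_reducer</code> are made up, and the in-memory <code>sort</code> stands in for the external (on-disk) sort a real system would need for hundreds of gigabytes.</p>

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, sort emitted pairs by key, reduce each group."""
    pairs = [pair for record in records for pair in mapper(record)]
    # Stand-in for the shuffle: sorting brings equal keys next to each other.
    pairs.sort(key=itemgetter(0))
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]


def word_mapper(line):
    """Emit (word, 1) for every word on the line."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


result = map_reduce(["a b a", "b c"], word_mapper, count_reducer)
# result == [("a", 2), ("b", 2), ("c", 1)]
```

<p>Swapping the in-memory sort for sorted runs on disk plus a merge changes the scale the sketch can handle but not its shape, which is why a from-scratch version in a day or two is plausible.</p>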
<p>5. <strong>&#8220;Incompatible with all of the tools DBMS users have come to depend on.&#8221; </strong></p>
<p> Good.</p>
<blockquote><p>
Frankly for a lot of analytic practitioners many DBMS systems and tools have become expensive obstacles in the way of getting results.  Yes, we enjoy humiliating an interview candidate that does not know all of the Codd normal forms (or can&#8217;t remember which of the alphabet soups of OLTP or OLAP is the &#8220;good one&#8221;) as much as the next person.  But to many of us a lot of these tools and procedures are more obstacles than solutions.</p>
<p>This may sound nasty, but if this were not the case why would companies like Vertica be producing radical new database tools?  The fact is existing DBMS tools were designed for a different type and scale of data than we regularly see on the web (and column oriented database designers seem to share this view).  The situation is so bad that &#8220;roach motel&#8221; is a common analyst&#8217;s slang for &#8220;data warehouse&#8221; (derived from: &#8220;data checks in but it never checks out&#8221;).
</p></blockquote>
<p>This isn&#8217;t meant to be a hagiography of MapReduce, but given that MapReduce has paid the bills I feel it deserves a small show of respect along the lines of &#8220;dance with the one who brung you.&#8221;</p>
<p>MapReduce is not a panacea.  One of the tasks I have hated most in my career was maintaining a seven step MapReduce based system.  I would love to have avoided all the detail fiddling that set-up required.  However, the system paid our bills by performing a calculation that was beyond the scale of simpler methods and it would have been unaffordable to buy a solution.  </p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Permanent Link: Must Have Software'>Must Have Software</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Permanent Link: Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>I know, I am the one being a jerk</title>
		<link>http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=i-know-i-am-the-one-being-a-jerk</link>
		<comments>http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/#comments</comments>
		<pubDate>Sat, 26 Apr 2008 17:52:58 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[Online Libraries]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=16</guid>
		<description><![CDATA[The other day&#8217;s blog post and a recent Andrew Binstock interview of Donald Knuth made me think more about how the ACM is really not serving the interests of computer science. Here is a question from the interview: Andrew: One of the few projects of yours that hasn’t been embraced by a widespread community is [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/' rel='bookmark' title='Permanent Link: Programs reduced to statistics'>Programs reduced to statistics</a></li>
<li><a href='http://www.win-vector.com/blog/2008/10/something-i-dont-get-about-business-and-bailouts/' rel='bookmark' title='Permanent Link: Something I don&#8217;t get about business and bailouts'>Something I don&#8217;t get about business and bailouts</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Permanent Link: Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>The other day&#8217;s <a href="http://www.win-vector.com/blog/2008/04/sorting-in-anger/">blog post</a> and a recent <a href="http://www.informit.com/articles/article.aspx?p=1193856">Andrew Binstock interview of Donald Knuth</a> made me think more about how the ACM is really not serving the interests of computer science.  <span id="more-16"></span></p>
<p>Here is a question from the interview:</p>
<blockquote><p><strong>Andrew: One of the few projects of yours that hasn’t been embraced by a widespread community is literate programming. What are your thoughts about why <a href="http://www.literateprogramming.com/">literate programming</a> didn’t catch on? And is there anything you’d have done differently in retrospect regarding literate programming?</strong></p></blockquote>
<p>Professor Knuth had a good and interesting answer, which I will not go into here.  Also, it was a good question- Literate Programming is a good idea, yet we have only seen weak imitations of it like Doxygen and JavaDoc (which automatically document the syntactic structure of code instead of really helping the programmer become an author and explain their meaning and intent).</p>
<p>Mr. Binstock even includes a link to a site promoting the concept.  Let&#8217;s, as the kids these days say, &#8220;click through&#8221; and see what awaits us.</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2008/04/literateprogramming1.gif" border="0" alt="Invitation to Lean an Idea" width="816" height="381" /></p>
<p>This is great.  Literate programming is definitely Web 2.0 (rounded corners, use of light pastels and cool gradients).  This is now, this is modern, sign me on.  The site even has links to original articles by the masters:</p>
<pre>Literate Programming - CACM Series
	Programming Pearls: Literate Programming, CACM (May 1986)
	Programming Pearls: A Literate Program, CACM (June 1986)
	Programming Pearls: Abstract Data Types, CACM (April 1987)
	Announcing Literate Programming, CACM (July 1987)
	LP: Processing Transactions, CACM (December 1987)
	LP: Expanding Generalized Regular Expressions, CACM (December 1988)
	LP: A File Difference Program, CACM (June 1989)
	LP: Weaving a Language-Independent WEB, CACM (September 1989)
	LP: An Assessment, CACM (March 1990)
	The Literate-Programming Paradigm
	Donald Knuth. "Literate Programming (1984)" in Literate Programming. CSLI, 1992, pg. 99.</pre>
<p>Let&#8217;s click through and see how the Association for Computing Machinery helps disseminate, guide and educate:</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2008/04/literateprogramming2.gif" border="0" alt="Retraction of the Invitation" width="674" height="612" /></p>
<p>Oh, maybe this is part of why Literate Programming hasn&#8217;t been embraced: the whole purpose of Literate Programming is lost when you keep it a secret.</p>
<p>I am sure I have been a paid ACM member from time to time, but I don&#8217;t remember the online credentials and they have probably lapsed by now.  I tried applying for the free temporary credential (the online form ended up not sending me anything- ACM not so good with the computers).  I can afford to pay (yet again) to re-join ACM but why would I want to give my money to support an organization so far from my (and common) academic values?</p>
<p>So in conclusion:</p>
<ul>
<li>Sorry Professor Knuth, you remain one of my heroes, but I&#8217;ll have to get to Literate Programming a bit later.  I would say that the marketing campaign behind Literate Programming has excessive &#8220;breakage.&#8221;</li>
<li>ACM: that was a funny joke,  great head-fake, impeccable comic timing, good fun and I certainly learned something.  Oh, and I will see you in hell.</li>
</ul>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/' rel='bookmark' title='Permanent Link: Programs reduced to statistics'>Programs reduced to statistics</a></li>
<li><a href='http://www.win-vector.com/blog/2008/10/something-i-dont-get-about-business-and-bailouts/' rel='bookmark' title='Permanent Link: Something I don&#8217;t get about business and bailouts'>Something I don&#8217;t get about business and bailouts</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Permanent Link: Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Sorting Used in Anger</title>
		<link>http://www.win-vector.com/blog/2008/04/sorting-in-anger/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=sorting-in-anger</link>
		<comments>http://www.win-vector.com/blog/2008/04/sorting-in-anger/#comments</comments>
		<pubDate>Thu, 24 Apr 2008 07:17:14 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Sorting in Anger]]></category>
		<category><![CDATA[Sorting With Anger]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=13</guid>
		<description><![CDATA[“Sorting Used in Anger” (A rambling glimpse into the mind of a theorist) Author: John Mount 4-24-2008 The other day I had a bit of time to kill before an appointment. Luck was with me: there was a nearby bookstore and I was able to pass some of the time skimming through a book called [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Permanent Link: Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/' rel='bookmark' title='Permanent Link: Programs reduced to statistics'>Programs reduced to statistics</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/thievery-considered-harmful/' rel='bookmark' title='Permanent Link: Thievery considered harmful'>Thievery considered harmful</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>“Sorting Used in Anger” (A rambling glimpse into the mind of a theorist)<br />
Author: John Mount<br />
4-24-2008</p>
<p>The other day I had a bit of time to kill before an appointment.  Luck was with me: there was a nearby bookstore and I was able to pass some of the time skimming through a book called “Beautiful Code.”  Everything started out fun and nostalgic.  The book title reminded me of “The Art of Computer Programming”  (a book that probably did as much through the grace of its title as it did through its incredible contents to attract minds into theoretical computer science).  One of the chapters of “Beautiful Code” was by Jon Bentley (a hero of sharp reasoning and clever coding) and as I flipped to the chapter my day was ruined.  There it was: Quicksort, an algorithm I have a long love-hate relationship with.</p>
<p><span id="more-13"></span></p>
<p>Bentley’s code (combining his skill with a really good idea Bentley credits to Nico Lomuto) is indeed a masterpiece:</p>
<pre>void quicksort(int l, int u)
{   int i, m;
    if (l &gt;= u) return;
    swap(l, randint(l, u));
    m = l;
    for (i = l+1; i &lt;= u; i++)
        if (x[i] &lt; x[l])
            swap(++m, i);
    swap(l, m);
    quicksort(l, m-1);
    quicksort(m+1, u);
}</pre>
<p> </p>
<p>How can I have such strong feelings for something so abstract and small as this algorithm?   Part of the answer is that I had just the other day written some Quicksort code that, until I flipped open the book, I had been quite proud of.  The rest of the answer is that Quicksort and I have a history.</p>
<p>Coding is all about compromises.  Some of the greats in our field work to teach us simplicity and purity but are often accused of being idealists.  Not so: their point was that there will be more than enough compromise and lessening of your vision as your project progresses.  So to have anything left you must start with a lot (which we usually do not).  C. A. R. Hoare himself (the inventor of Quicksort) worked hard to build frameworks that were able to prove small programs like Quicksort functioned correctly.  This effort is often criticized as being impractical for large software systems.  But you cannot build large systems without a library of well understood reliable components.  Why tolerate any unnecessary flaws when you will certainly face plenty of necessary flaws?</p>
<p>How I have managed and negotiated compromises has been a major source of pride in my work.  I have pushed hard to move “nimble activities” such as research and prototyping away from looser languages (like Perl) into safer and stricter (though somewhat more tedious to work with) languages like Java.</p>
<p>This isn’t a pure position; it represents compromises and has consequences.  Java itself has a number of unnecessary flaws.  One flaw is that Java is very inefficient at storing huge collections of small objects.  That is: Java is bad at a task that underlies a lot of the basic record keeping needed for much of the research I do (machine learning and data mining).  I had, until opening “Beautiful Code,” been feeling very good about a compromise I had recently made.  To avoid giving up on Java (and moving to an even more tedious language like C++) I figured out how to store all of my many, many small bits of information without creating a great number of objects.  By introducing complicated code I could stay in Java and keep many of the advantages of the language.  But the price of the compromise was that my data lost access to a number of essential Java-supplied services, such as sorting and lookup.</p>
<p>The solution was to re-implement sorting.  Re-implementing sorting in this day and age (especially in a language such as Java that supplies some great sorted and ordered data structures) seems stupid.  But none of the Java built-in functions are willing to sort a large collection of data without creating a great many objects (which is prohibitively wasteful).  If I had code that was able to sort my “poor man’s records” directly I would have a complete solution.  The most expedient sort procedure to code up is Quicksort (which tends to run very well and does not need a lot of extra space for record keeping).  So I coded up what I thought was a pretty clean and clever Quicksort for my problem.  Had I more time I would have preferred to use Mergesort, which has better guarantees than Quicksort (Quicksort is usually very fast, Mergesort is always fast).</p>
<p>Seeing Bentley’s code I realized my code was “bloated and flabby.”  What I did in four separate ways he did in one.  My Quicksort was quick and correct, but not perfect.  I began to think that perhaps more of my reasoning was flabby and expedient.  Should I have moved the whole project out of Java?  Should I have implemented Mergesort?  Why did my code seem so filled with fret and worry?</p>
<p>Then I saw it.  Bentley’s example was wrong.  The code was so unworried that it failed to anticipate an important possibility.  When you are really thinking hard about an algorithm you can often simulate it in your head.  Bentley’s clear writing made thinking easier.  Suppose you gave Bentley’s Quicksort code a set of items to sort that were all identical.  Then the “if”-statement would never be true and the variable “m” would never be increased.  At the end of each run you would always have “m” pointing to the left-most position of the array (despite the clever choice of pivot made earlier in the code).  The Quicksort procedure depends on calling itself to re-sort shorter arrays, but in this case the re-sort would be on arrays only a single item shorter than the original.  This means Quicksort would be doing about the same amount of work over and over again, instead of working on smaller and smaller pieces.  Quicksort would be very very slow: the total work grows with the square of the number of items.</p>
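<p>The all-identical case can be checked with a small experiment.  What follows is only a sketch under assumptions of my own: the array is passed in explicitly, the pivot is simply the first item (the random pivot swap is left out because it changes nothing when every key is the same), and the names “quicksort_single,” “comparisons” and “comparisons_on_equal_keys” are made up for this illustration.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Counts key comparisons so the cost of a run can be observed. */
static long comparisons;

static void swap_at(int *x, int a, int b) {
    int t = x[a];
    x[a] = x[b];
    x[b] = t;
}

/* Single-index partition in the style of the listing above.  The pivot is
   simply x[l]; the random pivot swap is omitted (an assumption of this
   sketch) because it makes no difference when all keys are equal. */
static void quicksort_single(int *x, int l, int u) {
    if (l >= u) return;
    int m = l;
    for (int i = l + 1; i <= u; i++) {
        comparisons++;
        if (x[i] < x[l])
            swap_at(x, ++m, i);
    }
    swap_at(x, l, m);
    quicksort_single(x, l, m - 1);
    quicksort_single(x, m + 1, u);
}

/* Sorts n copies of one key and returns the number of comparisons made,
   or -1 if the array could not be allocated. */
long comparisons_on_equal_keys(int n) {
    assert(n >= 1);
    int *x = malloc((size_t)n * sizeof *x);
    if (x == NULL) return -1;
    for (int i = 0; i < n; i++) x[i] = 7;
    comparisons = 0;
    quicksort_single(x, 0, n - 1);
    free(x);
    return comparisons;
}
```

<p>With 1000 identical keys the count comes to 1000 × 999 / 2 = 499,500 comparisons, where an evenly splitting sort would need on the order of 10,000.</p>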
<p>Why do I notice this?  Part of it is my training: theoretical computer scientists are trained to work examples and trained to worry.  The other part is that I had been burned by this in the past.  Years ago I had been with a group doing very interesting biotech research on what at the time were moderately large data sets.  One of the clever steps was that the group had reduced a lot of the research steps to standard computer science techniques like sorting and joining (much like Google’s famous MapReduce reduces complicated computations to primitives).   Then the standard “commercial package” failed us.  The built-in sorting function was often very very slow.  This often happened if there were a lot of repeated values in the data we were trying to sort.</p>
<p>Years ago we fixed the problem by implementing our own Mergesort.  Mergesort is a bit harder to implement than Quicksort but there are no “special situations” that make Mergesort slow down.  Quicksort itself can, at the loss of some elegance, be coded in such a way to avoid any problems with duplicate entries (as I had recently done).  However, there is always an unlucky data set that Quicksort will be slow on.  The unlucky data set is rare, but it can happen (unlike with Mergesort).</p>
<p>Most implementations of Quicksort have this flaw (nearly guaranteed slowness when there are many duplicate keys in the data to sort).  One thing to look for is how Quicksort calls itself: does it use a single mid-value variable to represent the partition (like the “m-1” and “m+1” above), or does it use two variables to try to suppress values equal to the pivot?  For instance the code found at <a href="http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Sort/Quick/">http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Sort/Quick/</a> (by Lloyd Allison, who attributes it to Niklaus Wirth’s 1986 book “Algorithms and Data Structures”) performs some careful accounting to try to avoid allowing duplicate elements into the split.  In addition to the code that does this and the comments documenting what is being attempted, the key thing to look for is the use of two separate variables (“right” and “left” in addition to “lo” and “hi”) in the recursive call:</p>
<pre>quicksort(a, lo, right);// divide and conquer
quicksort(a, left,  hi);</pre>
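<p>To make the two-variable idea concrete, here is a minimal sketch of one way to do it: a three-way partition that gathers every key equal to the pivot in the middle and recurses only on the strictly smaller and strictly larger keys.  This is not the code from the page linked above; the names “lt,” “gt,” “steps” and “sort_three_way” are assumptions made for this illustration.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Counts loop steps (one per key examined) so the cost can be observed. */
static long steps;

static void swap_keys(int *x, int a, int b) {
    int t = x[a];
    x[a] = x[b];
    x[b] = t;
}

/* Invariant inside the loop:
     x[lo..lt-1] <  pivot
     x[lt..i-1]  == pivot
     x[gt+1..hi] >  pivot */
static void quicksort_three_way(int *x, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = x[lo + (hi - lo) / 2];
    int lt = lo, i = lo, gt = hi;
    while (i <= gt) {
        steps++;
        if (x[i] < pivot)      swap_keys(x, lt++, i++);
        else if (x[i] > pivot) swap_keys(x, i, gt--);
        else                   i++;
    }
    quicksort_three_way(x, lo, lt - 1);   /* keys below the pivot */
    quicksort_three_way(x, gt + 1, hi);   /* keys above the pivot */
}

/* Sorts x[0..n-1] in place and resets the step counter first. */
void sort_three_way(int *x, int n) {
    assert(n >= 0);
    steps = 0;
    quicksort_three_way(x, 0, n - 1);
}

/* Sorts n copies of one key and returns the number of loop steps taken,
   or -1 if the array could not be allocated. */
long steps_on_equal_keys(int n) {
    assert(n >= 1);
    int *x = malloc((size_t)n * sizeof *x);
    if (x == NULL) return -1;
    for (int i = 0; i < n; i++) x[i] = 7;
    sort_three_way(x, n);
    free(x);
    return steps;
}
```

<p>On 1000 identical keys this version examines each key once and stops, because both recursive calls receive empty ranges.  It is still Quicksort, though: an unlucky arrangement of distinct keys can still make it slow.</p>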
<p>One thing that always fundamentally bugged me about Quicksort: why settle for a procedure that “almost always works quickly” when there are other procedures that “always work quickly?”   Why would someone like C. A. R. Hoare promote a needless compromise?</p>
<p>And that is just it: it isn’t needless.  We live in the real world with bounded resources; compromise is inevitable.  If you want to sort a single list and guarantee it will be done quickly you really should be using a provably good sorting procedure (like Mergesort).  If, however, you are asked to solve many sorting problems then a procedure like Quicksort is so often faster than others that even when you add in the time lost on the rare bad situations (which you will eventually run into) you will still be (on average) ahead for using Quicksort.  It is a compromise having to pick a “most appropriate method” instead of a “best method” (because “best” varies by situation) but it is not an unnecessary compromise.  Seeing and managing these trade-offs is the essence of design and programming.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Permanent Link: Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/' rel='bookmark' title='Permanent Link: Programs reduced to statistics'>Programs reduced to statistics</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/thievery-considered-harmful/' rel='bookmark' title='Permanent Link: Thievery considered harmful'>Thievery considered harmful</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2008/04/sorting-in-anger/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Hello World: An Instance Of Rhetoric in Computer Science</title>
		<link>http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=hello-world-an-instance-rhetoric-in-computer-science</link>
		<comments>http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/#comments</comments>
		<pubDate>Wed, 20 Feb 2008 02:00:13 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[History]]></category>
		<category><![CDATA[Hello World]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/2008/02/19/hello-world-an-instance-rhetoric-in-computer-science/</guid>
		<description><![CDATA[Hello World: An Instance Of Rhetoric in Computer Science John Mount: jmount@mzlabs.com February 19, 2008 Computer scientists have usually dodged questions of intent, purpose or meaning. While there are theories that assign deep mathematical meaning to computer programs[13] we computer scientists usually avoid discussion of meaning and talk more about utility and benefit. Discussions of [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/' rel='bookmark' title='Permanent Link: Programs reduced to statistics'>Programs reduced to statistics</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='Permanent Link: I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Hello World: An Instance Of Rhetoric in Computer Science<br />
John Mount: jmount@mzlabs.com</p>
<p />
<p>February 19, 2008</p>
<p />
<p>Computer scientists have usually dodged questions of intent, purpose or meaning. While there are theories that assign deep mathematical meaning to computer programs[13] we computer scientists usually avoid discussion of meaning and talk more about utility and benefit. Discussions of the rhetorical meaning of programs are even less common. However, there is a famous computer program that has a clean and important rhetorical point. This program is called “hello world” and its entire action is to write out the phrase “hello world.” The action is simple but the “hello world” program actually has a fairly significant purpose and meaning.</p>
<p />
<p>I would like to briefly trace the known history of “hello world” and show how the rhetorical message it presents differs from the rhetoric embodied in earlier programs. In this sense we can trace a change in the message computer scientists felt they needed to communicate (most likely due to changes in the outside world).</p>
<p><span id="more-4"></span></p>
<p />
<p>Since the late 1970’s it has been a tradition in computer science, when facing a new system, to start by writing the traditional “first program”, which is called “hello world.” The history of this tradition is not well documented, but WikiBooks claims the origin is in Brian Kernighan’s 1974 tutorial for the computer language “B.”[14, 8]</p>
<p />
<p>The program itself is as follows:</p>
<p />
<pre>
main( ) {
  extrn a, b, c;
  putchar(a); putchar(b); putchar(c); putchar('!*n');
}
a 'hell';
b 'o, w';
c 'orld';
</pre>
<p />
<p>From the original context in [8] we can say this is not a pure “hello world” program. The purpose of this program is to illustrate a few functions of the language (“extrn” variables, use of multiple lines and so on) and not to test if the system is running.<br />
The most famous (and a pure) example of “hello world” is found in Brian Kernighan and Dennis Ritchie’s famous 1978 C book[9]. The code looks like the following:</p>
<p />
<pre>
main() {
        printf("hello, world\n");
}
</pre>
<p />
<p>Equally famous are the early versions in “BASIC”:</p>
<p />
<pre>
10 print "Hello, World!"
</pre>
<p />
<p>In both of these examples the purpose of “hello world” is clear: it is a trivial computer program that does nothing but print a single line of text. If this can not be made to work then nothing can be made to work. “hello world” is in fact a somewhat confrontational program. The author is saying “it isn’t obvious your computer system will work, so I am not going to invest a lot of time in it until I see it can at least print one line of text.”</p>
<p />
<p>Writing “hello world” as the first program on a new system was certainly a well known tradition by the late 1970’s. We can ask: did this tradition show up late or early in the history of computer science? </p>
<p />
<p>The first modern computer is commonly credited to John von Neumann in 1945. This attribution of “first” was made before a lot of previously secret information on the work at Bletchley Park and on Konrad Zuse’s work in Germany was well known, but von Neumann’s work was definitely the known foundation that later work built on. Donald Knuth credits von Neumann with the first modern computer program[10]. In any case we can securely place the invention of the modern computer in the early to mid 1940’s. It seems that “hello world” was not written at that time because it made entirely the wrong rhetorical point.</p>
<p />
<p>In fact von Neumann’s first program (in 1945) was a sorting algorithm (proving that computers could at least replace tabulators); it was important to prove not only that the computer could turn on but that it could do something.</p>
<p />
<p>The 1957 description of FORTRAN starts with a program that solves for roots of an equation.[4] John Backus himself says that prior to 1954 “almost all programming was done in machine language or assembly language”[3] (which would make “hello world” an unlikely first program as numeric operations are typically much more succinct than string manipulations in these languages).</p>
<p />
<p>The first examples in the 1958 UNIVAC Math-Matic programming manual are (predictably) equations involving trigonometric functions.[5]</p>
<p />
<p>In 1959 the ALGOL 58 standard [2] had no concrete example programs; the purpose of this document was to show that program syntax could be specified and to discuss the techniques of specification (so a runnable example did not serve their argument).<br />
John McCarthy’s paper on recursive functions (LISP) starts with examples of partial functions and conditionals (to get to terminating recursion as fast as possible).[11] The Lisp1 manual [6] concentrates on the translation of information into data-structures (an important point) and the Lisp1.5 manual [12] is written for the famous “bottom of page 13” moment where “eval” (essentially the semantic core of LISP) is defined in a few lines of LISP code.</p>
<p />
<p>The computer language “BASIC” seems almost designed to support “hello world.” Instead the 1964 BASIC manual[7] starts with a lively discussion of what a computer program is (as a process) and the first example program is a highly useful program that solves simultaneous equations. The point again being that useful work can be done.</p>
<p />
<p>Even as late as 1972 “hello world” does not seem to be the obvious message. For example the popular BASIC tutorial called “My Computer Likes Me, When I Speak in BASIC”[1] does indeed start with a simple program that only writes a line of text. But the line it writes is “MY HUMAN UNDERSTANDS ME.” This is making a very different point than the point made by “hello world.”</p>
<p />
<p>While I have skipped a number of languages of the period (Autocode, COBOL, APL, SNOBOL, PL/1, Logo, BCPL, Forth, Smalltalk &#8230;) we can see the general trend: first programs had to make a point. By the dates it seems that once invented the “hello world” tradition took off and spread very quickly. Writing such a program serves a useful purpose as a test so the practice probably spread quickly once the rhetoric of the earlier programs was no longer needed. That is, “hello world” became popular once computer scientists no longer felt that society needed to be persuaded of the ultimate utility of computer systems.</p>
<p />
<p>With the increasing complexity of modern systems “hello world” is a more important test than ever. “hello world” is sometimes called the hardest application to deploy. The idea is that once you learn your lessons from deploying it then deploying a second, more sophisticated, application seems relatively easy. I often find that I can learn the true nature of a system by deploying “hello world.” If writing and deploying “hello world” is a sufficiently unpleasant task in a system then it is likely that every other task in that system will be similarly unpleasant.</p>
<p />
<p>In the end “hello world” serves the same purpose that it always has (testing if a system in fact works) and it stands as a rhetorical marker signifying that we are now living long after the skirmishes of the computer revolution.</p>
<p />
<p><strong>References<br />
</strong></p>
<p />
<p>[1]	Albrecht, B. My Computer Likes Me, When I Speak in BASIC. Dymax, 1972.</p>
<p />
<p>[2]	Backus, J. W. The syntax and semantics of the proposed international algebraic language of the Zurich ACM-GAMM conference.</p>
<p />
<p>[3]	Backus, J. W. The history of FORTRAN I, II and III. ACM SIGPLAN Notices 13, 8 (Aug 1978).</p>
<p />
<p>[4]	Backus, J. W., Beeber, R. J., Best, S., Goldberg, R., Haibt, L. M., Herrick, H. L., Nelson, R. A., Sayre, D., Sheridan, P. B., Stern, H., Ziller, I., Hughes, R. A., and Nutt, R. The FORTRAN automatic coding system. Proceedings of the Western Joint Computer Conference (Feb 1957).</p>
<p />
<p>[5]	Corp., R. R. U. UNIVAC math-matic programming system. Sperry Rand, 1958.</p>
<p />
<p>[6]	Fox, P. Lisp 1 programmers manual. 165.</p>
<p />
<p>[7]	Kemeny, J., and Kurtz, T. Basic.</p>
<p />
<p>[8]	Kernighan, B. W. A tutorial introduction to the language B.</p>
<p />
<p>[9]	Kernighan, B. W., and Ritchie, D. M. The C Programming Language. Prentice Hall, February 1978.</p>
<p />
<p>[10]	Knuth, D. E. Von Neumann’s first computer program. Comp. Surveys 2, 4 (1970), 247–260.</p>
<p />
<p>[11]	McCarthy, J. Recursive functions of symbolic expressions and their computation by machine, part i.</p>
<p />
<p>[12]	McCarthy, J., Abrahams, P. W., Edwards, D. J., Hart, T. P., and Levin, M. I. Lisp 1.5 programmers manual. 116.</p>
<p />
<p>[13]	Tennent, R. D. The denotational semantics of programming languages. Communications of the ACM 18, 8 (Aug 1976), 437–453.</p>
<p />
<p>[14]	Wikibooks. http://en.wikibooks.org/wiki/Computer_programming/Hello_world#B, 2008.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/' rel='bookmark' title='Permanent Link: Programs reduced to statistics'>Programs reduced to statistics</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='Permanent Link: I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
