<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Exciting Techniques</title>
	<atom:link href="http://www.win-vector.com/blog/category/exciting-techniques/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Wed, 18 Aug 2010 15:11:02 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Gradients via Reverse Accumulation</title>
		<link>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=gradients-via-reverse-accumulation</link>
		<comments>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 00:00:04 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Automatic Differentiation]]></category>
		<category><![CDATA[Conjugate Gradient]]></category>
		<category><![CDATA[Gradient]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Reverse Accumulation]]></category>
		<category><![CDATA[Scala]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1493</guid>
		<description><![CDATA[We extend the ideas from Automatic Differentiation with Scala to include reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients. As the tables, diagrams and equations do not translate well into HTML, our full article is available here in PDF: [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/' rel='bookmark' title='Permanent Link: &#8220;Easy&#8221; Portfolio Allocation'>&#8220;Easy&#8221; Portfolio Allocation</a></li>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='Permanent Link: A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We extend the ideas from <a target="ext" href="http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/">Automatic Differentiation with Scala</a> to include <em>reverse accumulation</em>.  Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients.<span id="more-1493"></span><br />
As the tables, diagrams and equations do not translate well into HTML, our full article is available here in PDF: <a href="http://www.win-vector.com/dfiles/ReverseAccumulation.pdf">http://www.win-vector.com/dfiles/ReverseAccumulation.pdf</a>.</p>
<p>The purpose of our article is to explain reverse accumulation automatic differentiation clearly (and to release some sample code and timing results).  A side effect of the article is to make sense of the following two diagrams:</p>
<p>If the following is a picture of standard or forward differentiation:</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/cutFwd.png" alt="cutFwd.png" border="0" width="408" height="677" /></p>
<p>then the following is a picture of reverse accumulation:</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/cutRev.png" alt="cutRev.png" border="0" width="487" height="739" /></p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/' rel='bookmark' title='Permanent Link: &#8220;Easy&#8221; Portfolio Allocation'>&#8220;Easy&#8221; Portfolio Allocation</a></li>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='Permanent Link: A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Automatic Differentiation with Scala</title>
		<link>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=automatic-differentiation-with-scala</link>
		<comments>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/#comments</comments>
		<pubDate>Tue, 15 Jun 2010 04:19:20 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Automatic Differentiation]]></category>
		<category><![CDATA[Conjugate Gradient]]></category>
		<category><![CDATA[Dual Numbers]]></category>
		<category><![CDATA[Geometric Median]]></category>
		<category><![CDATA[Numeric Methods]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Scala]]></category>
		<category><![CDATA[Steiner Tree]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1481</guid>
		<description><![CDATA[This article is a worked-out exercise in applying the Scala type system to solve a small scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion. Usually we work using a combination of databases, Java, optimization libraries and analysis suites (like R). The reason is [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Permanent Link: Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>This article is a worked-out exercise in applying the <a href="http://www.scala-lang.org/" target="ext">Scala</a> type system to solve a small scale optimization problem.    For this article we supply <a href="http://www.win-vector.com/dfiles/ScalaDiff.jar">complete Scala source code</a> (under a GPLv3 license) and some design discussion.<span id="more-1481"></span><br />
Usually we work using a combination of databases, Java, optimization libraries and analysis suites (like R).  The reason is that, for our typical problems, Java hits a sweet spot of trading off runtime performance against ease of development and maintenance.  In the tens of gigabytes range (data sets larger than Wikipedia but smaller than the Web) Java outperforms the scripting languages (Ruby, Python &#8230;) and is much easier to develop in and document than C++.  This sweet spot is both subjective and situational: if the tasks were smaller and in a services framework, Python would be a better choice; if performance is paramount, then C or C++ (with the STL) and Hadoop are a better choice; if pre-built statistical libraries are needed, then R becomes a better choice.  For the type of problem we present here Scala is a very good choice.</p>
<style type="text/css">
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
pre { line-height: 125%; }
body .hll { background-color: #ffffcc }
body  { background: #f8f8f8; }
body .c { color: #408080; font-style: italic } /* Comment */
body .err { border: 1px solid #FF0000 } /* Error */
body .k { color: #008000; font-weight: bold } /* Keyword */
body .o { color: #666666 } /* Operator */
body .cm { color: #408080; font-style: italic } /* Comment.Multiline */
body .cp { color: #BC7A00 } /* Comment.Preproc */
body .c1 { color: #408080; font-style: italic } /* Comment.Single */
body .cs { color: #408080; font-style: italic } /* Comment.Special */
body .gd { color: #A00000 } /* Generic.Deleted */
body .ge { font-style: italic } /* Generic.Emph */
body .gr { color: #FF0000 } /* Generic.Error */
body .gh { color: #000080; font-weight: bold } /* Generic.Heading */
body .gi { color: #00A000 } /* Generic.Inserted */
body .go { color: #808080 } /* Generic.Output */
body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
body .gs { font-weight: bold } /* Generic.Strong */
body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
body .gt { color: #0040D0 } /* Generic.Traceback */
body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
body .kp { color: #008000 } /* Keyword.Pseudo */
body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
body .kt { color: #B00040 } /* Keyword.Type */
body .m { color: #666666 } /* Literal.Number */
body .s { color: #BA2121 } /* Literal.String */
body .na { color: #7D9029 } /* Name.Attribute */
body .nb { color: #008000 } /* Name.Builtin */
body .nc { color: #0000FF; font-weight: bold } /* Name.Class */
body .no { color: #880000 } /* Name.Constant */
body .nd { color: #AA22FF } /* Name.Decorator */
body .ni { color: #999999; font-weight: bold } /* Name.Entity */
body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
body .nf { color: #0000FF } /* Name.Function */
body .nl { color: #A0A000 } /* Name.Label */
body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
body .nt { color: #008000; font-weight: bold } /* Name.Tag */
body .nv { color: #19177C } /* Name.Variable */
body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
body .w { color: #bbbbbb } /* Text.Whitespace */
body .mf { color: #666666 } /* Literal.Number.Float */
body .mh { color: #666666 } /* Literal.Number.Hex */
body .mi { color: #666666 } /* Literal.Number.Integer */
body .mo { color: #666666 } /* Literal.Number.Oct */
body .sb { color: #BA2121 } /* Literal.String.Backtick */
body .sc { color: #BA2121 } /* Literal.String.Char */
body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
body .s2 { color: #BA2121 } /* Literal.String.Double */
body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
body .sh { color: #BA2121 } /* Literal.String.Heredoc */
body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
body .sx { color: #008000 } /* Literal.String.Other */
body .sr { color: #BB6688 } /* Literal.String.Regex */
body .s1 { color: #BA2121 } /* Literal.String.Single */
body .ss { color: #19177C } /* Literal.String.Symbol */
body .bp { color: #008000 } /* Name.Builtin.Pseudo */
body .vc { color: #19177C } /* Name.Variable.Class */
body .vg { color: #19177C } /* Name.Variable.Global */
body .vi { color: #19177C } /* Name.Variable.Instance */
body .il { color: #666666 } /* Literal.Number.Integer.Long */
 </style>
<h2>Our Example Problem</h2>
<p>Our small scale problem is this:  we have a number of target points on a map and we want to pick a central point to <em>directly</em> connect to all of these points with wire.  Our goal is to minimize the total amount of wire used.  This problem is called the <a href="http://en.wikipedia.org/wiki/Geometric_median" ref="ext">&#8220;Geometric Median&#8221;</a>.  So we are trying to find a point that minimizes the sum of distances from our chosen center to every target point. If we were trying to minimize the sum of squared distances from our chosen center to every target point the answer would be obvious: the average or mean (which by Hooke&#8217;s law is also the point where a set of identical springs would relax to).  The mean is in fact a fairly good guess, but you can do better (which could be important if the &#8220;wire&#8221; is expensive, such as cutting irrigation or drainage ditches).  For example, given the three target points (20,0), (-1,-1) and (-1,1) the optimal point is (-0.42,0), not the mean (6,0), and the choice of optimal point represents an over 19% savings in total wiring distance (see figure).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/points.png" alt="points.png" border="0" width="525" height="525" /><br />
</center></p>
<p>This is a substantial saving in cost.  </p>
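<p>The savings figure can be checked directly.  The following is a minimal, self-contained sketch (the names <code>WiringCost</code>, <code>targets</code> and <code>wiringCost</code> are ours for illustration and are not part of the article&#8217;s source code) that evaluates the total wire length at the mean (6,0) and at the rounded optimal point (-0.42,0):</p>
<div class="highlight">
<pre>object WiringCost {
  // The three target points from the example above.
  val targets: Array[Array[Double]] = Array(
    Array(20.0, 0.0), Array(-1.0, -1.0), Array(-1.0, 1.0))

  // Total wire needed to connect `center` directly to every target point.
  def wiringCost(center: Array[Double]): Double =
    targets.map(t =&gt; math.hypot(center(0) - t(0), center(1) - t(1))).sum

  def main(args: Array[String]): Unit = {
    val atMean = wiringCost(Array(6.0, 0.0))       // the mean of the targets
    val atMedian = wiringCost(Array(-0.42, 0.0))   // the (rounded) optimal point
    println(f"cost at mean:    $atMean%.2f")
    println(f"cost at optimum: $atMedian%.2f")
    println(f"savings:         ${100.0 * (atMean - atMedian) / atMean}%.1f%%")
  }
}
</pre>
</div>
<p>The cost at the mean comes out near 28.14 and the cost at the optimal point near 22.73, a savings of a little over 19%.</p>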
<p>The problem changes as we consider variations.  If indirect connections (such as routing one point through another, which may or may not be possible for reasons of capacity or safety) and multiple new centers are allowed, we then have an instance of the <a href="http://en.wikipedia.org/wiki/Steiner_tree_problem" ref="ext">Steiner Tree Problem</a>, which is harder to solve (it is known to be NP-hard).  If no new centers are allowed (all routing must be between pre-existing target points) then we have a Minimum Spanning Tree Problem, which admits very quick solutions.</p>
<p>We bring up the geometric median as a mere example.  We don&#8217;t intend for our code to solve only the geometric median problem and we don&#8217;t intend to touch on the literature of specialized methods for solving the geometric median problem.  Instead we are trying to demonstrate the speed with which you can develop prototype solutions if you have a few good tools (like various optimizers) available in your toolkit.  Numeric optimizers may sound exotic, but they often are the kind of thing you want to experiment with and link directly into your code.</p>
<h2>Optimization as General Tool</h2>
<p>Now that we have the example problem we can describe a solution strategy.  In this case the solution uses code &#8220;we wished we had lying around&#8221; before we started on the problem.  We will pretend we have the tools we want ready to solve our problem and then we will pay our debt and build the required tools.  The issue is that there is not an obvious closed form for the solution of the geometric median problem.  So we are forced to work a bit harder.  In this case harder means we need to solve an optimization problem.  Consider the contour plot of the total wiring cost as a function of where we choose to place our center.  Our optimal point (-0.42,0) has a wiring cost of 22.73 and the contour plot given here shows concentric regions of solution positions with higher cost.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/contour.png" alt="contour.png" border="0" width="525" height="525" /><br />
</center></p>
<p>In general it is unwise to throw an optimizer at an arbitrary problem and hope to find the globally best solution.  But in this case (and in many similar situations) we can prove that a simple local optimizer will in fact find the unique best solution.  This is a property of the problem, not of the optimizer.  The concentric regions shown in the contour plot have a very nice shape: they are <a href="http://en.wikipedia.org/wiki/Convex_set" ref="ext">convex</a>.   That is, they have no intrusions: for any two points drawn from one of these shapes the straight line segment between these points stays inside the given shape.  We don&#8217;t have to depend on observation; we can actually prove this is always the case for this problem.  The wiring cost from a proposed center to any single target point is a <a href="http://en.wikipedia.org/wiki/Convex_function" ref="ext">convex function</a> of where we choose to place our center (a convex function is a function whose graph never reaches above the secant line drawn between any two points on its graph).  The total wiring cost is just the sum of the wiring costs to each target point.  And to finish: the sum of a collection of convex functions is itself a convex function.  Since each region enclosed by a contour of a convex function (the set of points with cost at or below a given level) is convex, we have proven the statement.</p>
<p>But how does this help us?  There is a standard technique to find &#8220;local minima&#8221; of a function by inspecting a function for places where the gradient is zero (points where there is no obvious downhill direction on the contour plot).  This technique usually can only be guaranteed to find local minima (places where no small change improves your situation).  In general there is no guarantee that the local minimum you find is in fact the global minimum (the best possible solution), except when you are dealing with a convex function.  When a function is convex, all of the local minima are always grouped together into a single convex connected shape (if not, a line drawn between two remote minima would violate the convexity definition).  And if the function is never flat then this set is a single unique point: the unique best solution.  Our inspection technique will be a gradient-driven optimizer: that is, an optimizer that improves its objective by running downhill while the gradient is non-zero and halts when the gradient is zero.</p>
<p>The stated function to minimize is to sum the distance from our proposed center to each target point.  We can write this as the sum of the distances:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/dist1.png" alt="dist1.png" border="0" width="309" height="81" /><br />
</center></p>
<p>( <img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/euclid1.png" alt="euclid1.png" border="0" width="119" height="37" /> which is the traditional Euclidean or L2 distance).  This function actually has one subtle flaw that we will deal with in the appendix (see: Fixing Smoothness).</p>
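<p>To make the &#8220;run downhill until the gradient vanishes&#8221; idea concrete, here is a minimal sketch of such an optimizer applied to the sum of distances.  It is plain gradient descent with a backtracking step size and a hand-written gradient, not the conjugate gradient code supplied with the article; the object name, the tolerance and the iteration limit are all illustrative assumptions.</p>
<div class="highlight">
<pre>object GradientDescentSketch {
  // Target points from the running example.
  val dat: Array[Array[Double]] = Array(
    Array(20.0, 0.0), Array(-1.0, 1.0), Array(-1.0, -1.0))

  // Total wiring cost: sum of Euclidean distances from p to each target.
  def cost(p: Array[Double]): Double =
    dat.map(t =&gt; math.hypot(p(0) - t(0), p(1) - t(1))).sum

  // Hand-written gradient: the sum of unit vectors pointing from each target to p.
  // Targets at distance zero are skipped (the gradient is undefined there).
  def grad(p: Array[Double]): Array[Double] = {
    val g = Array(0.0, 0.0)
    for (t &lt;- dat) {
      val d = math.hypot(p(0) - t(0), p(1) - t(1))
      if (d &gt; 0.0) {
        g(0) += (p(0) - t(0)) / d
        g(1) += (p(1) - t(1)) / d
      }
    }
    g
  }

  // Step downhill until the gradient is (nearly) zero or maxIter is reached.
  def descend(p0: Array[Double], maxIter: Int = 1000, tol: Double = 1.0e-6): Array[Double] = {
    var p = p0.clone()
    var iter = 0
    var converged = false
    while (!converged &amp;&amp; iter &lt; maxIter) {
      val g = grad(p)
      val gnorm2 = g(0) * g(0) + g(1) * g(1)
      if (math.sqrt(gnorm2) &lt; tol) {
        converged = true
      } else {
        val base = cost(p)
        var step = 1.0
        var next = Array(p(0) - step * g(0), p(1) - step * g(1))
        // Backtracking: halve the step until the cost drops by enough.
        while (cost(next) &gt; base - 0.5 * step * gnorm2 &amp;&amp; step &gt; 1.0e-12) {
          step *= 0.5
          next = Array(p(0) - step * g(0), p(1) - step * g(1))
        }
        p = next
      }
      iter += 1
    }
    p
  }

  def main(args: Array[String]): Unit = {
    val pF = descend(Array(6.0, 0.0)) // start from the mean
    println(f"center = (${pF(0)}%.4f, ${pF(1)}%.4f), cost = ${cost(pF)}%.4f")
  }
}
</pre>
</div>
<p>Started from the mean (6,0), this should settle near (-0.42,0).  Because the objective is convex, the point where the gradient vanishes is the global minimum and not merely a local one.</p>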
<h2>Using Scala to Apply the Optimization Solution</h2>
<p>To find our optimal center placement using Scala we first write our cost or objective as a Scala function:</p>
<div class="highlight">
<pre>    <span class="k">val</span> <span class="n">dat</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]]</span> <span class="o">=</span> <span class="nc">Array</span><span class="o">(</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="mi">20</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">,</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">)</span>
    <span class="o">)</span>

    <span class="k">def</span> <span class="n">fx</span><span class="o">(</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="kt">Double</span> <span class="o">=</span> <span class="o">{</span>
      <span class="k">val</span> <span class="n">dim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
      <span class="k">val</span> <span class="n">npoint</span> <span class="k">=</span> <span class="n">dat</span><span class="o">.</span><span class="n">length</span>
      <span class="k">var</span> <span class="n">total</span> <span class="k">=</span> <span class="mf">0.0</span>
      <span class="k">for</span><span class="o">(</span><span class="n">k</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">npoint</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
        <span class="k">var</span> <span class="n">term</span> <span class="k">=</span> <span class="mf">0.0</span>
        <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">dim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
          <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">-</span> <span class="n">dat</span><span class="o">(</span><span class="n">k</span><span class="o">)(</span><span class="n">i</span><span class="o">)</span>
          <span class="n">term</span> <span class="k">=</span> <span class="n">term</span> <span class="o">+</span> <span class="n">diff</span><span class="o">*</span><span class="n">diff</span>
        <span class="o">}</span>
        <span class="n">total</span> <span class="k">=</span> <span class="n">total</span> <span class="o">+</span> <span class="n">scala</span><span class="o">.</span><span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="o">(</span><span class="n">term</span><span class="o">)</span>
      <span class="o">}</span>
      <span class="n">total</span>
    <span class="o">}</span>
</pre>
</div>
<p>Scala is succinct and it is a great convenience to have a function definition capture data from its environment.   What we would like to do is generate an initial guess at the solution (we use the mean as our initial guess) and then call an optimizer (in this case a conjugate gradient optimizer) to do all the work:</p>
<div class="highlight">
<pre> <span class="k">val</span> <span class="n">p0</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="o">=</span> <span class="n">mean</span><span class="o">(</span><span class="n">dat</span><span class="o">)</span>
 <span class="k">val</span> <span class="o">(</span><span class="n">pF</span><span class="o">,</span><span class="n">fpF</span><span class="o">)</span> <span class="k">=</span> <span class="nc">CG</span><span class="o">.</span><span class="n">minimize</span><span class="o">(</span><span class="n">fx</span><span class="o">,</span><span class="n">p0</span><span class="o">)</span>
</pre>
</div>
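<p>The <code>mean()</code> helper called above is not shown in this excerpt.  A minimal sketch of what such a helper could look like follows (the wrapper object <code>MeanSketch</code> is hypothetical, and the article&#8217;s own implementation may differ); it simply averages the points coordinate by coordinate:</p>
<div class="highlight">
<pre>object MeanSketch {
  // Coordinate-wise average of a non-empty array of equal-length points.
  def mean(points: Array[Array[Double]]): Array[Double] = {
    val dim = points(0).length
    val sums = new Array[Double](dim)
    for (pt &lt;- points; i &lt;- 0 until dim) {
      sums(i) += pt(i)
    }
    sums.map(_ / points.length)
  }
}
</pre>
</div>
<p>For the three example targets (20,0), (-1,1) and (-1,-1) this gives (6,0), the starting point for the optimizer.</p>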
<p>At this point we would be done, except the conjugate gradient method (which is superior to gradient descent and many of the non-gradient methods) requires a gradient.<br />
We could provide a numeric estimate of the gradient by the following divided difference method:</p>
<div class="highlight">
<pre>  <span class="k">def</span> <span class="n">gradientD</span><span class="o">(</span><span class="n">f</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span><span class="k">=&gt;</span><span class="kt">Double</span><span class="o">,</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="o">=</span> <span class="o">{</span>
    <span class="k">val</span> <span class="n">xdim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
    <span class="k">val</span> <span class="n">p2</span> <span class="k">=</span> <span class="n">copy</span><span class="o">(</span><span class="n">p</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">base</span> <span class="k">=</span> <span class="n">f</span><span class="o">(</span><span class="n">p2</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">ret</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">](</span><span class="n">xdim</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">delta</span> <span class="k">=</span> <span class="mf">1.0e-6</span>
    <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">xdim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
      <span class="n">p2</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">+</span> <span class="n">delta</span>
      <span class="k">val</span> <span class="n">fplus</span> <span class="k">=</span> <span class="n">f</span><span class="o">(</span><span class="n">p2</span><span class="o">)</span>
      <span class="n">p2</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span>
      <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="o">(</span><span class="n">fplus</span><span class="o">-</span><span class="n">base</span><span class="o">)/</span><span class="n">delta</span>
      <span class="n">ret</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">diff</span>
    <span class="o">}</span>
    <span class="n">ret</span>
  <span class="o">}</span>
</pre>
</div>
<p>This numeric divided difference method often outperforms non-derivative optimization methods (like Powell&#8217;s Method and the Nelder-Mead Amoeba method).  But the technique can run into numeric difficulties: the estimate is sensitive to the choice of delta (too large a delta gives a poor approximation of the derivative, too small a delta loses precision to round-off).   We can remedy this if we are willing to write our function in a slightly more general way.   If we re-encode our function in a generic manner we can use <a href="http://en.wikipedia.org/wiki/Automatic_differentiation" target="ext">automatic differentiation</a>  (not to be confused with numeric differentiation or with symbolic differentiation) to produce a reliable gradient for optimization.  What we need to do is re-write our function to work over an abstract field of numbers instead of only the machine supplied doubles.  In fact what we need to do is specify a generic function that will work over any field, with the field to be determined later.  The code to do this in Scala is very similar to the non-generic code:</p>
<div class="highlight">
<pre>   <span class="k">val</span> <span class="n">genericFx</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">VectorFN</span> <span class="o">{</span>
      <span class="k">def</span> <span class="n">apply</span><span class="o">[</span><span class="kt">Y</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">Y</span><span class="o">]](</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Y</span><span class="o">])</span><span class="k">:</span><span class="kt">Y</span> <span class="o">=</span> <span class="o">{</span>
        <span class="k">val</span> <span class="n">field</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="mi">0</span><span class="o">).</span><span class="n">field</span>
        <span class="k">val</span> <span class="n">dim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
        <span class="k">val</span> <span class="n">npoint</span> <span class="k">=</span> <span class="n">dat</span><span class="o">.</span><span class="n">length</span>
        <span class="k">var</span> <span class="n">total</span> <span class="k">=</span> <span class="n">field</span><span class="o">.</span><span class="n">zero</span>
        <span class="k">for</span><span class="o">(</span><span class="n">k</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">npoint</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
          <span class="k">var</span> <span class="n">term</span> <span class="k">=</span> <span class="n">field</span><span class="o">.</span><span class="n">zero</span>
          <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">dim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
            <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">-</span> <span class="n">field</span><span class="o">.</span><span class="n">inject</span><span class="o">(</span><span class="n">dat</span><span class="o">(</span><span class="n">k</span><span class="o">)(</span><span class="n">i</span><span class="o">))</span>
            <span class="n">term</span> <span class="k">=</span> <span class="n">term</span> <span class="o">+</span> <span class="n">diff</span><span class="o">*</span><span class="n">diff</span>
          <span class="o">}</span>
          <span class="n">total</span> <span class="k">=</span> <span class="n">total</span> <span class="o">+</span> <span class="n">smoothSQRT</span><span class="o">(</span><span class="n">term</span><span class="o">)</span>
        <span class="o">}</span>
        <span class="n">total</span>
      <span class="o">}</span>
    <span class="o">}</span>
</pre>
</div>
<p>Notice that the code is very similar to the &#8220;def fx()&#8221; code.  The key differences are that we had to define genericFx as extending a trait (a type of Scala interface) called VectorFN and inside this trait extension we defined a parameterized function named apply().  apply() is a generic function that is willing to work over any type Y where Y is at least of type NumberBase[Y] (we will get more into what that means in a moment).  The reason for the difference in notation is that, while the Scala function <em>syntax</em> cannot specify a generic function with free type parameters (the incompletely specified Y), the Scala <em>semantics</em> are strong enough to implement this.  In fact standard function values (such as the one created when &#8220;def fx()&#8221; is passed as an argument) are just syntactic sugar for extending the Scala built-in <a href="http://www.scala-lang.org/docu/files/api/scala/Function1.html" target="ext">Function1 trait</a>.  With a generic objective function in hand all we need is conjugate gradient code that is expecting a VectorFN (and willing to call apply() instead of just using naked function parentheses) and some type NumberBase[Y] that can compute gradients for us.  The Scala compiler can specialize our genericFx() into one version for quick calculation and another for gradients.  How this is done is what we will discuss next.  From our point of view our problem is solved with the following one line of code:</p>
<div class="highlight">
<pre><span class="k">val</span> <span class="o">(</span><span class="n">pF</span><span class="o">,</span><span class="n">fpF</span><span class="o">)</span> <span class="k">=</span> <span class="nc">CG</span><span class="o">.</span><span class="n">minimize</span><span class="o">(</span><span class="n">genericFx</span><span class="o">,</span><span class="n">p0</span><span class="o">)</span>
</pre>
</div>
<p>This should always be your goal: build sufficient preparation so that your last step is an &#8220;obvious one liner.&#8221;</p>
<h2>What Tools we Wish we Had Lying Around</h2>
<p>We supply in our example some workable conjugate gradient code, but that is standard so we will not discuss it.  What is of interest (and facilitated by Scala&#8217;s parameterized type system) is the implementation of <a href="http://en.wikipedia.org/wiki/Dual_number" target="ext">dual numbers</a> as a framework to supply automatic differentiation.  An implementation of dual numbers as a NumberBase[DualNumber] type is the core of our demonstration.</p>
<p>Dual numbers are an algebraic structure written as pairs of real numbers &#8220;(a,b)&#8221;.  The arithmetic table for dual numbers is given below:</p>
<table>
<tr>
<td>(a,b) + (c,d)</td>
<td>=</td>
<td>((a+c) , (b+d))</td>
</tr>
<tr>
<td>(a,b) &#8211; (c,d)</td>
<td>=</td>
<td>((a-c) , (b-d))</td>
</tr>
<tr>
<td>(a,b) * (c,d)</td>
<td>=</td>
<td>((a*c) , (a*d+b*c))</td>
</tr>
<tr>
<td>(a,b) / (c,d)</td>
<td>=</td>
<td>((a/c) , ((b*c-a*d)/(c*c)))</td>
</tr>
</table>
<p>In a dual number (a,b) &#8220;a&#8221; is the &#8220;large&#8221; or &#8220;standard&#8221; part of the number.  You can check from the arithmetic table that the pair of dual numbers (a,0) and (c,0) behave just as we would expect the real numbers a and c to behave.  In the dual number (a,b) &#8220;b&#8221; is the &#8220;small&#8221; or &#8220;ideal&#8221; portion of the number.  From the multiplication rule above we can observe two rules: (0,b) * (c,0) = (0,b*c) (something small times anything else is small) and (0,b)*(0,d) = (0,0) (two small things become zero when multiplied).  Essentially the dual numbers are carrying around the first two terms of a Taylor series: we get as a result both the function value and the function derivative.  For a function f() over the real numbers we extend f() to work over the dual numbers by defining: f((a,b)) = (f(a),b f&#8217;(a)) (which is consistent with the previously defined arithmetic). We can check that the dual numbers obey the usual laws of arithmetic (associative, commutative, distributive, identities and inverses, with division defined only when the standard part of the divisor is non-zero).  The punchline is that over the dual numbers the divided difference estimate of f&#8217;(x) (the derivative of f() evaluated at x) is in fact exact in the sense that f((x,1)) = (f(x),f&#8217;(x)) (or f((x,0)+(0,1)) &#8211; f((x,0)) = (0, f&#8217;(x))).  Implementing the DualNumber class is little more than transcribing the above arithmetic table into Scala.</p>
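<p>As a rough illustration of the arithmetic described above, here is a minimal sketch of a dual number in Scala.  The names Dual, DualDemo, f and valueAndDerivative are hypothetical and chosen only for this example; this is not the DualNumber class shipped in the ScalaDiff jar, which also has to implement NumberBase.</p>
<div class="highlight">
<pre>// Hypothetical minimal dual number: a is the standard part, b the ideal part.
final case class Dual(a: Double, b: Double) {
  def +(that: Dual): Dual = Dual(a + that.a, b + that.b)
  def -(that: Dual): Dual = Dual(a - that.a, b - that.b)
  def *(that: Dual): Dual = Dual(a * that.a, a * that.b + b * that.a)
  // quotient rule: divide by the square of the divisor's standard part (that.a != 0)
  def /(that: Dual): Dual =
    Dual(a / that.a, (b * that.a - a * that.b) / (that.a * that.a))
}

object DualDemo {
  // example function f(x) = x*x*x + 2*x, so f'(x) = 3*x*x + 2
  def f(x: Dual): Dual = x * x * x + Dual(2.0, 0.0) * x

  // evaluate f at (x,1): the standard part is f(x), the ideal part is f'(x)
  def valueAndDerivative(x: Double): (Double, Double) = {
    val r = f(Dual(x, 1.0))
    (r.a, r.b)
  }
}
</pre>
</div>
<p>Calling DualDemo.valueAndDerivative(3.0) evaluates f and its derivative in one pass, with no symbolic differentiation and no choice of step size.</p>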
<p>We have already seen how to write code that uses NumberBase[Y] types (genericFx() itself is an example).  A more complicated example is the CG.minimize() code which not only accepts a generic function (in the form of VectorFN) but then specializes it to NumberBase[DualNumber] to compute gradients and also specializes to NumberBase[MDouble] for quick calculation during line searches (MDouble is just an adapter for machine Doubles, used for speed).  The ability to re-specialize a function is one of the advantages of a parameterized type system.  The DualNumbers are an example of forward automatic differentiation.  We could also use the same object framework to capture a representation of the computation path and apply more sophisticated methods such as reverse automatic differentiation. </p>
<p>We give a link to a jar containing <a href="http://www.win-vector.com/dfiles/ScalaDiff.jar">complete Scala source code</a> including this example, the DualNumber implementation, a conjugate gradient implementation and some JUnit tests (all under a GPLv3 license) and will go on to describe some of the design decisions.  The code is the bulky part of this work, so we will move on to discuss something more compact: types.</p>
<h2>Types</h2>
<p>If code is ever beautiful it is only when it is succinct.  Among the most succinct forms of code are individual type signatures and interfaces (though the indiscriminate repetition of type signatures is rightly considered ugly bloat, which Scala works to avoid).   Since we are distributing complete source we will describe only types and method signatures.  The entry points to the code are the JUnit tests (organized in the ScalaDiff/test source directory and depending on JUnit, which is not included) and the demo program in ScalaDiff/src/demo/Demo.scala.</p>
<p>To be a usable arithmetic type (like DualNumber or MDouble) you must extend the following parameterized abstract class:</p>
<div class="highlight">
<pre><span class="k">abstract</span> <span class="k">class</span> <span class="nc">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]]</span> <span class="o">{</span>
  <span class="c">// basic arithmetic</span>
  <span class="k">def</span> <span class="o">+</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">-</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">unary_-</span><span class="o">()</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">*</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">/</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>  <span class="kt">//</span> <span class="kt">that</span> <span class="kt">not</span> <span class="kt">equal</span> <span class="kt">to</span> <span class="kt">zero</span>
  <span class="c">// more complicated</span>
  <span class="k">def</span> <span class="n">pow</span><span class="o">(</span><span class="n">that</span><span class="k">:</span><span class="kt">Double</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">exp</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">log</span><span class="k">:</span><span class="kt">NUMBERTYPE</span> <span class="kt">//</span> <span class="kt">this</span> <span class="kt">is</span> <span class="kt">positive</span>
  <span class="c">// comparison functions</span>
  <span class="k">def</span> <span class="o">&gt;</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&gt;=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">==</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">!=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&lt;</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&lt;=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="c">// utility</span>
  <span class="k">def</span> <span class="n">field</span><span class="k">:</span><span class="kt">Field</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]</span>
<span class="o">}</span>
</pre>
</div>
<p>In particular DualNumber extends NumberBase[DualNumber].  This deliberate circular reference has a big purpose: it allows publicly visible covariant return types (returning nearly the exact type we really are instead of a base type).  This allows us to have strict type arguments so that trying to add an MDouble to a DualNumber is a type error (even though they both extend the same base class).  The automatic differentiation technique encapsulated in the DualNumber class only works if all of the calculation is in the DualNumber types and this strict type enforcement allows the compiler to help prevent results sneaking in and out through other types.  All of the methods on NumberBase are obviously related to arithmetic except the field() method.  This method gives us access to a Field object which is responsible for carrying around the runtime type information (this is a common problem in Java and Scala: some type information known at compile time, such as the choice of type parameters, is not easily accessed at runtime).  The Field class is as follows:</p>
<div class="highlight">
<pre><span class="k">abstract</span> <span class="k">class</span> <span class="nc">Field</span> <span class="o">[</span><span class="kt">NUMBERTYPE</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]]</span> <span class="o">{</span>
  <span class="k">def</span> <span class="n">zero</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>            <span class="kt">//</span> <span class="kt">return</span> <span class="kt">canonical</span> <span class="kt">zero</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">one</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>             <span class="kt">//</span> <span class="kt">return</span> <span class="kt">canonical</span> <span class="kt">one</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">inject</span><span class="o">(</span><span class="n">v</span><span class="k">:</span><span class="kt">Double</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>  <span class="kt">//</span> <span class="kt">return</span> <span class="kt">canonical</span> <span class="kt">representation</span> <span class="kt">of</span> <span class="kt">number</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">project</span><span class="o">(</span><span class="n">v</span><span class="k">:</span><span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Double</span> <span class="kt">//</span> <span class="kt">return</span> <span class="kt">standard-number</span> <span class="kt">represented</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">array</span><span class="o">(</span><span class="n">n</span><span class="k">:</span><span class="kt">Int</span><span class="o">)</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]</span> <span class="kt">//</span> <span class="kt">return</span> <span class="kt">an</span> <span class="kt">array</span> <span class="kt">of</span> <span class="kt">this</span> <span class="k">type</span>
<span class="o">}</span>
</pre>
</div>
<p>The Field class is where we have factories for numbers (zero, one, arrays, injection from standard Doubles) and casting (projection back to standard Doubles).</p>
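<p>To make the division of labor between the number type and its field concrete, here is a small sketch of generic code that never names a concrete number type.  Num, NumField, PlainDouble, PlainDoubleField and FieldDemo are hypothetical, cut-down stand-ins written for this illustration (only + and * are included); they are not the NumberBase, Field or MDouble classes from the jar.</p>
<div class="highlight">
<pre>// Hypothetical cut-down stand-in for NumberBase (only + and * shown).
abstract class Num[Y &lt;: Num[Y]] {
  def +(that: Y): Y
  def *(that: Y): Y
  def field: NumField[Y]
}

// Hypothetical cut-down stand-in for Field.
abstract class NumField[Y &lt;: Num[Y]] {
  def zero: Y
  def inject(v: Double): Y
  def project(v: Y): Double
}

// Adapter over machine Doubles, in the spirit of MDouble.
final class PlainDouble(val v: Double) extends Num[PlainDouble] {
  def +(that: PlainDouble): PlainDouble = new PlainDouble(v + that.v)
  def *(that: PlainDouble): PlainDouble = new PlainDouble(v * that.v)
  def field: NumField[PlainDouble] = PlainDoubleField
}

object PlainDoubleField extends NumField[PlainDouble] {
  def zero: PlainDouble = new PlainDouble(0.0)
  def inject(v: Double): PlainDouble = new PlainDouble(v)
  def project(v: PlainDouble): Double = v.v
}

object FieldDemo {
  // Generic code asks the field for constants and conversions
  // instead of constructing a concrete number type itself.
  def sumOfSquares[Y &lt;: Num[Y]](field: NumField[Y], xs: Array[Double]): Double = {
    var total = field.zero
    for (x &lt;- xs) {
      val y = field.inject(x)
      total = total + y * y
    }
    field.project(total)
  }
}
</pre>
</div>
<p>Because sumOfSquares only sees Y and its field, the same body could be handed a dual number field without change, which is the property the optimizer relies on.</p>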
<p>With these types defined we can actually read intent off some of the method signatures.  </p>
<p>For example our conjugate gradient optimizer is accessed through the following method signature:</p>
<div class="highlight">
<pre> <span class="k">def</span> <span class="n">minimize</span><span class="o">(</span><span class="n">fn</span><span class="k">:</span><span class="kt">VectorFN</span><span class="o">,</span><span class="n">x0</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="o">(</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],</span><span class="kt">Double</span><span class="o">)</span> <span class="c">// return x,f(x)</span>
</pre>
</div>
<p>The above can be read as: CG.minimize() requires a VectorFN (our trait representing single argument functions with a free type parameter) and an initial point (in standard Doubles).  The code will then return a pair of the optimum point and the function evaluated at the optimum point.  From the type signature we can see that CG.minimize() expects to re-specialize the function &#8220;fn&#8221; to types of its own choosing (else it could have accepted a parameterized argument instead of our custom trait) and will handle all up-conversion and down-conversion between machine Doubles and NumberBase[Y] values itself.  This sort of type information is hard to express (let alone enforce) in a dynamically typed language.</p>
<p>A slightly more complicated example is the lineMinD() method:</p>
<div class="highlight">
<pre><span class="k">def</span> <span class="n">lineMinD</span><span class="o">[</span><span class="kt">Y&lt;:NumberBase</span><span class="o">[</span><span class="kt">Y</span><span class="o">]](</span><span class="n">field</span><span class="k">:</span><span class="kt">Field</span><span class="o">[</span><span class="kt">Y</span><span class="o">],
 </span><span class="n">f</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Y</span><span class="o">]</span><span class="k">=&gt;</span><span class="kt">Y</span><span class="o">,
 </span><span class="n">xm</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],
 </span><span class="n">di</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="o">(</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],</span><span class="kt">Double</span><span class="o">)</span>
</pre>
</div>
<p>Notice it is willing to work with any type-parameterized function (which means it is willing to let the caller pick the actual type of NumberBase[Y] and work with that).  Most callers will call with Y=MDouble (the wrapper for machine Doubles) and lineMinD() will then work with that (without ever really knowing the actual underlying type).</p>
<p>A lot of fans of dynamic languages consider type systems to be mere hairshirt penance.   But that is not so.  Broken type systems (like Java&#8217;s collections before generic type parameters, implemented by erasure, were introduced in Java 1.5) are indeed more trouble than they are worth.  Working type systems (like C++ Templates/STL, Java 1.5+ and Scala) allow you to solve problems (and enforce decisions) during the design phase (which is much much cheaper than during the deployment phase).  You can&#8217;t set your types in stone (you are likely going to have them subtly wrong for the first few iterations).  You must be willing to think like a &#8220;language lawyer&#8221; to find out what parts of your work can be specified and enforced in the language type system.  To use an analogy: static types are your blueprint or your underpainting.</p>
<h2>Tests</h2>
<p>One argument against static types is that you can get much of their benefit from unit tests.  My opinion is you never have enough unit tests, so putting more pressure on your test suite is not wise.   Static types plus tests are strictly more powerful than static types alone or tests alone. </p>
<p>Even for this toy-scale example project we have included a JUnit test set to pursue a number of goals:</p>
<ul>
<li>Confirm our number implementations (DualNumber and MDouble) correctly model machine Doubles (perform parallel calculations and compare).</li>
<li>Confirm DualNumber obeys expected laws of algebra composition and cancellation <em>including the portions that can not be modeled in machine Doubles</em>.</li>
<li>Confirm DualNumbers compute gradients.</li>
<li>Confirm operations of optimizers and optimizer components.</li>
</ul>
<p>Many of these tests are related, but they don&#8217;t all imply each other and they give different perspectives on the errors they catch.  For example no amount of parallel computation between DualNumbers and machine Doubles is going to confirm the infinitesimal portion of the DualNumber is propagating correctly (since this is not a property of machine Doubles).  So we add extra tests that expect DualNumber to obey algebraic relations like: a*(b+c) = a*b + a*c.  It is then another step to confirm that whatever the DualNumbers calculate is not only self-consistent, but also models a truncated Taylor Series or differentiation.</p>
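<p>The two kinds of check described above can be sketched as follows.  This is only an illustration under stated assumptions: TestDual and AlgebraChecks are hypothetical names, the number type carries just + and *, and the checks are plain functions rather than the JUnit tests distributed with the code.</p>
<div class="highlight">
<pre>// Hypothetical two-operation dual number used only for these checks.
final case class TestDual(a: Double, b: Double) {
  def +(that: TestDual): TestDual = TestDual(a + that.a, b + that.b)
  def *(that: TestDual): TestDual = TestDual(a * that.a, a * that.b + b * that.a)
}

object AlgebraChecks {
  // Self-consistency: a*(b+c) == a*b + a*c, compared on both components.
  // Exact equality is only safe for inputs whose arithmetic does not round.
  def distributes(a: TestDual, b: TestDual, c: TestDual): Boolean =
    a * (b + c) == a * b + a * c

  // Differentiation: ideal part of p((x,1)) for p(t) = t*t*t versus a
  // central divided difference of p computed on machine Doubles.
  def derivativeGap(x: Double, h: Double): Double = {
    def p(t: Double): Double = t * t * t
    val d = TestDual(x, 1.0)
    val exact = (d * d * d).b
    val approx = (p(x + h) - p(x - h)) / (2.0 * h)
    math.abs(exact - approx)
  }
}
</pre>
</div>
<p>The first check exercises the ideal part, which no comparison against machine Doubles can see; the second ties the ideal part back to an actual derivative.</p>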
<h2>Conclusion</h2>
<p>We hope we have demonstrated how the complexity of a mathematical programming problem can be managed by breaking the problem into an objective function that is separate from the optimizer (allowing the optimizer to be both good and hidden) and a static type system (such as Scala&#8217;s) to help enforce required properties of a calculation (such as all numbers being routed through a required representation).  With these sorts of tools available many formerly hard problems (that are often, unfortunately, solved by over-specifying direct inefficient iterative improvement techniques) become &#8220;if I can write a reasonable objective function this may already be solved by an optimizer in my library.&#8221;  The more of these tools you have (either in your code or in your reference library) the more of these problems become easy (this is the topic of my earlier paper: <a href="http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/">The Local to Global Principle</a>).</p>
<h2>Appendix: Fixing Smoothness</h2>
<p>Our chosen example objective function is very nice (i.e. convex) but it has a small (but correctable) problem.   The derivative or gradient has some jump discontinuities that could cause an optimizer to exit prematurely (not at the global optimum).  Consider the simple form of this for wiring a center to a single point at the origin (even in 1 dimension).  The wiring cost function sqrt(x*x) has the cost graph shown here.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/abs.png" alt="abs.png" border="0" width="525" height="525" /><br />
</center></p>
<p>This is convex, but the derivative is not continuous, as we see in the included graph of the derivative of sqrt(x*x).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/dabs.png" alt="dabs.png" border="0" width="525" height="525" /><br />
</center></p>
<p>So: in this case if the optimizer stops at one of the target points we can&#8217;t be sure that it stopped at the global optimum (it may have stopped due to the discontinuity in the gradient).  For some simple problems the optimum is necessarily at a target point.  For example on the number line take the target points 0,1 and x.  As long as x&ge;0 and x&le;1 the optimum placement will be x itself.</p>
<p>One way to defend against this is to use some sort of smoothed version of sqrt() that essentially decreases a little faster near the origin.  Our cost function becomes:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/cost2.png" alt="cost2.png" border="0" width="237" height="55" /><br />
</center><br />
where s() is our suitable approximation of the sqrt() function.  Two candidates are s(x) = (x+tau)^(1/2) and s(x) = x^(1/2 + tau); where tau is a small constant.  As long as tau is greater than zero we have no derivative discontinuity in s(x^2) and convexity is preserved (even made a bit stricter).  Other ways to deal with this include adding additional coordinates to the problem and small perturbations on these coordinates.  Finally, a point found by optimizing with respect to s(x) can be &#8220;polished&#8221; by re-starting the optimization at the first found solution and using sqrt(x) as the new objective (if the original point is not near any of the target points).</p>
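<p>For the one dimensional case of a single target at the origin, the first candidate can be sketched as below.  SmoothSqrtDemo and the particular value of tau are assumptions made for this illustration; this is not the smoothSQRT() used in the distributed code.</p>
<div class="highlight">
<pre>object SmoothSqrtDemo {
  // small positive smoothing constant; the value here is an arbitrary choice
  val tau: Double = 1.0e-6

  // first candidate from the text: s(x) = (x + tau)^(1/2)
  def s(x: Double): Double = math.sqrt(x + tau)

  // smoothed cost of wiring to a single target at the origin: s(x*x)
  def cost(x: Double): Double = s(x * x)

  // derivative of cost: x / sqrt(x*x + tau), continuous and equal to 0 at x = 0
  def costDerivative(x: Double): Double = x / math.sqrt(x * x + tau)
}
</pre>
</div>
<p>Away from the origin costDerivative(x) is close to the sign of x, as for sqrt(x*x), but it now passes through zero smoothly instead of jumping from -1 to 1.</p>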


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Permanent Link: Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The Local to Global Principle</title>
		<link>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=the-local-to-global-principle</link>
		<comments>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 16:37:53 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Dynamic Programming]]></category>
		<category><![CDATA[Local to Global]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[PageRank]]></category>
		<category><![CDATA[Problem Solving]]></category>
		<category><![CDATA[Speech Recognition]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1123</guid>
		<description><![CDATA[We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Permanent Link: Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods.  We have produced both a stand-alone <a href="http://www.win-vector.com/dfiles/LocalToGlobal.pdf">PDF</a> (more legible) and an HTML/blog form (more skimmable).<br />
<span id="more-1123"></span></p>
<h1 align="center">The Local to Global Principle</h1>
<p align="center"><strong>John Mount<a name="tex2html3" href="#foot21" id="tex2html3"><sup>1</sup></a></strong></p>
<p></p>
<p align="center"><b>Date:</b> November 11, 2009</p>
<hr />
<h3>Abstract:</h3>
<div>We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods.</div>
<p></p>
<h2><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">Contents</a></h2>
<p><!--Table of Contents--></p>
<ul>
<li><a name="tex2html32" href="#SECTION00020000000000000000" id="tex2html32">Introduction</a></li>
<li><a name="tex2html33" href="#SECTION00030000000000000000" id="tex2html33">The Examples</a>
<ul>
<li><a name="tex2html34" href="#SECTION00031000000000000000" id="tex2html34">Web Page Link Analysis</a></li>
<li><a name="tex2html35" href="#SECTION00032000000000000000" id="tex2html35">Natural Language Processing</a></li>
<li><a name="tex2html36" href="#SECTION00033000000000000000" id="tex2html36">Machine Learning</a></li>
</ul>
<p></li>
<li><a name="tex2html37" href="#SECTION00040000000000000000" id="tex2html37">Some Methods</a>
<ul>
<li><a name="tex2html38" href="#SECTION00041000000000000000" id="tex2html38">Local Methods</a></li>
<li><a name="tex2html39" href="#SECTION00042000000000000000" id="tex2html39">Globalization Methods</a></li>
</ul>
<p></li>
<li><a name="tex2html40" href="#SECTION00050000000000000000" id="tex2html40">Conclusion</a></li>
<li><a name="tex2html41" href="#SECTION00060000000000000000" id="tex2html41">Bibliography</a></li>
<li><a name="tex2html42" href="#SECTION00070000000000000000" id="tex2html42">Acknowledgement</a></li>
</ul>
<p><!--End of Table of Contents--></p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">Introduction</a></h1>
<p><font>A common vain hope of computer scientists and algorithm designers is that a domain expert has already &#8220;boiled down&#8221; a problem to a precise, but unsolved, algorithmic core. On this point the mathematician Gian-Carlo Rota wrote:</font></p>
<blockquote><p><font>One of the rarest mathematical talents is the talent for applied mathematics, for picking out of a maze of experimental data the two or three parameters that are relevant, and to discard all other data. This talent is rare. It is taught only at the shop level.[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;A Mathematician&#8217;s Gossip&#8221;]</font></p></blockquote>
<p><font>We describe a useful tool for designing algorithmic applications and solutions which we call &#8220;the local to global principle.&#8221; The local to global principle is the method of deriving applications and solutions by specifying &#8220;local&#8221; (and deliberately myopic) heuristics, critiques and methods followed by using a powerful general method to &#8220;globalize&#8221; this specification into a complete solution.</font></p>
<p><font>There are many important problem solving prescriptions and methods of thought already systematically described and taught:</font></p>
<ul>
<li>Bacon&#8217;s &#8220;New Organon&#8221; and Mill&#8217;s principles of inductive logic.[<a href="#Mill">Mil02</a>]</li>
<li>Feynman&#8217;s genius method.[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;Ten Lessons I Wish I Had Been Taught&#8221;]</li>
<li>Reductionism (top down and bottom up).</li>
<li>Divide and conquer.[<a href="#IntroductionToAlgorithms">CLRS09</a>]</li>
<li>Forward deduction, backwards induction.</li>
<li>Root Cause Analysis.</li>
<li>Polya&#8217;s heuristic and conjecture and prove patterns [<a href="#citeulike:679515">Pol71</a>,<a href="#Polya1">Pol54a</a>,<a href="#Polya2">Pol54b</a>]</li>
<li>Doron Zeilberger&#8217;s &#8220;Method of Undetermined Generalization and Specialization.&#8221; [<a href="#Zeilberger:1995p277">Zei95</a>]</li>
<li>Zbigniew Michalewicz and David B. Fogel&#8217;s presentation of evolutionary algorithms.[<a href="#HTSMH">MF00</a>]</li>
</ul>
<p><font>The local to global principle is more of an organizational pattern than &#8220;computer aided technique&#8221; as no one specific species of software or family of notation is required.</font></p>
<p><font>The local to global principle can be identified in a number of previous important applications, but it is not currently an identified principle.<a name="tex2html4" href="#foot244" id="tex2html4"><sup>2</sup></a> The principle is very general, so any succinct description of it is going to be painfully vague. Instead, we explain the principle by discussing some example applications and methods.  For each of our example applications we deliberately use a different globalization technique. The effective algorithmist or practitioner must in fact come to each problem already familiar with a reasonably large set of already known local and global techniques, so we conclude with some appropriate fields of study and preparation.</font></p>
<p><font>The local to global principle is divided into two parts: local encoding of the problem followed by a globalization step that uses the encoding. The guiding feature of local encodings is that they are usually easy to compute from the data at hand. Any extension that looks like enumeration, search or optimization is best left to the global step. The local step is essentially the translation of your problem into an abstract language that is ready for the globalization step. In contrast globalization methods are often &#8220;off the shelf&#8221; in that once you abstract and encode the particulars of your problem you can look for pre-existing useful methods or software to finish your solution. The idea of globalization is to find a best overall or global compromise between competing local criteria. The local step does not so much have to avoid conflicts but instead &#8220;price them.&#8221; There is also an important trade-off that sophisticated local techniques allow the use of simpler globalization methods and more powerful globalization methods allow the use of simpler local techniques.</font></p>
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">The Examples</a></h1>
<p><font>To demonstrate the breadth of the local to global principle we choose a diverse collection of example applications: web page link analysis, natural language processing and machine learning. For each example application we will set up the problem, introduce a reasonable set of local criteria and pick an appropriate globalization technique. We will favor finishing each example without describing the globalization technique in detail, as this would distract from our point and is best left to the given references. These examples are previously solved problems; our contribution is demonstrating the shared underlying principle.</font></p>
<h2><a name="SECTION00031000000000000000" id="SECTION00031000000000000000">Web Page Link Analysis</a></h2>
<p><font>For our first example application we demonstrate web page link analysis in the form of the famous PageRank score.[<a href="#Page:1998p2689">PBMW98</a>]</font></p>
<p><font>One of the many good ideas leading up to the early Google search engine was the design of a non-text based measure of importance or interestingness of web pages. A search engine that could fold &#8220;interestingness&#8221; or popularity into its notion of relevance could better sort important pages into the search user&#8217;s view. When the web got so large that there were many pages that were exact matches to any common user query, popularity became a critical consideration. A link based notion of popularity exploits what is important about the web (the link structure, for example see [<a href="#Kleinberg:1997p32">Kle97</a>]) and avoids having to depend on a lot of natural language understanding technology. This technique also uses authority outside of the given page, so it has some hope of being resistant (though not immune) to web-spam.</font></p>
<p><font>Taken all at once, designing a score of page importance is a daunting task. However, by working in stages (as the local to global principle prescribes) we can quickly derive interesting scores including the famous PageRank score. We start with the idea that popularity (or the amount of web traffic a page receives) is (loosely) correlated with importance. So for our first approximation step we decide to try to estimate popularity (or web traffic) and use this estimate as our importance score. Accurately estimating web traffic is itself a hard problem and a big industry (just a few of the major companies involved in this are: Google/Urchin, Quantcast, Nielsen, comScore, Alexa, Hitwise and LookSmart). For our second approximation step we are going to try to estimate popularity from the link structure<a name="tex2html6" href="#foot43" id="tex2html6"><sup>4</sup></a> of the web (using no other measurements or historic data) and use this as our score. This link based estimate is unlikely to completely reproduce real web surfing patterns, but it is very interesting in its own right and has been proven in the market to be a useful score.</font></p>
<p><font>Now the problem is to try to estimate the popularity of a web page from the link structure of the web. We claim: we can generate a useful (but not necessarily accurate) estimate of web traffic from the web&#8217;s link structure alone. Consider Figure&nbsp;<a href="#fig:Links1">1</a> where we have a universe of three web pages A, B and C that link to each other in the pattern illustrated by what is called a graph.<a name="tex2html7" href="#foot45" id="tex2html7"><sup>5</sup></a></font></p>
<div align="center"><a name="fig:Links1" id="fig:Links1"></a><a name="50"></a></p>
<table>
<caption align="bottom"><strong>Figure 1:</strong> A set of Mutually Linked Web Pages</caption>
<tr>
<td>
<div align="center"><img width="300" height="436" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/Links1.png" alt="Image Links1"></div>
</td>
</tr>
</table>
</div>
<p><font>In Figure&nbsp;<a href="#fig:Links1">1</a> we can consider each link to another page as evidence the other page is interesting or popular. One idea is to simulate a very simple web surfer who clicks on the links on a page uniformly at random. This is called &#8220;the random surfer model&#8221; and even a model this simple allows us to read some useful information from the link structure of the web. For instance, we could ask what fraction of their time the random surfer spends on each web page, with an eye to the idea that the pages the random surfer visits more often are the more important ones. Let <img width="35" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg2.png" alt="$ p(A)$"> denote the proportion of time the random web surfer spends on page A (and define <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg3.png" alt="$ p(B)$"> and <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg4.png" alt="$ p(C)$"> similarly). While we do not know any of <!-- MATH<br />
 $p(A), p(B)$<br />
 --><br />
<img width="76" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg5.png" alt="$ p(A), p(B)$"> or <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg4.png" alt="$ p(C)$"> we can derive some relationships between them by inspecting the link graph:</font></p>
<p></p>
<div align="center"><!-- MATH<br />
 \begin{eqnarray*}<br />
p(A) &#038; = &#038; \frac{1}{2} P(B) + P(C) \\<br />
p(B) &#038; = &#038; \frac{1}{2} P(A) \\<br />
p(C) &#038; = &#038; \frac{1}{2} P(A) + \frac{1}{2} P(B) .<br />
\end{eqnarray*}<br />
 --></p>
<table cellpadding="0" align="center" width="100%">
<tr valign="middle">
<td nowrap align="right"><img width="35" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg6.png" alt="$\displaystyle p(A)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="109" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg8.png" alt="$\displaystyle \frac{1}{2} P(B) + P(C)$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg9.png" alt="$\displaystyle p(B)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="52" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg10.png" alt="$\displaystyle \frac{1}{2} P(A)$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg11.png" alt="$\displaystyle p(C)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="125" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg12.png" alt="$\displaystyle \frac{1}{2} P(A) + \frac{1}{2} P(B) .$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><font>The first equation is just reading from the graph that: all visits on page-A must come from pages B and C, half of the visitors on page-B continue on to A and all of the visitors on page-C continue on to A. The second and third equations are the appropriate summaries of how traffic is routed to pages B and C. We can insist that <!-- MATH<br />
 $P(A) + P(B)<br />
+ P(C) = 1$<br />
 --><br />
<img width="183" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg13.png" alt="$ P(A) + P(B) + P(C) = 1$"> as we want these numbers to represent the fraction of time the random web surfer spends on each page. A more sophisticated model would add more features<a name="tex2html9" href="#foot245" id="tex2html9"><sup>6</sup></a> to get a more useful result.</font></p>
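<p><font>The random surfer model can also be simulated directly. The following is a minimal sketch, not the method used by any search engine: the link lists are inferred from the traffic equations above (an assumption, since the figure itself is an image), and the names links and simulate_random_surfer are ours.</font></p>
<pre>
```python
import random
from collections import Counter

# Outgoing links inferred from the traffic equations in the text
# (an assumption, since the figure itself is an image).
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}


def simulate_random_surfer(links, steps, seed=0):
    """Estimate the fraction of time a random surfer spends on each page."""
    rng = random.Random(seed)
    page = "A"
    visits = Counter()
    for _ in range(steps):
        visits[page] += 1
        # Click one of the current page's links uniformly at random.
        page = rng.choice(links[page])
    return {name: visits[name] / steps for name in sorted(links)}


visit_fractions = simulate_random_surfer(links, steps=200_000)
print(visit_fractions)
```
</pre>
<p><font>With enough steps the estimated fractions should settle near whatever values satisfy the traffic equations, which gives an empirical check on the local rules.</font></p>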
<p><font>It turns out we have already encoded enough local rules to completely determine <!-- MATH<br />
 $P(A), P(B)$<br />
 --><br />
<img width="85" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg14.png" alt="$ P(A), P(B)$"> and <img width="40" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg15.png" alt="$ P(C)$"> . In this example application an algorithmist already familiar with linear algebra&nbsp;[<a href="#Strang">Str76</a>] would recognize these local conditions as &#8220;a system of linear equations.&#8221; Solving even web-scale systems of linear equations is considered easy with modern techniques and modern computers. For our small example the solution is: <!-- MATH<br />
 $p(A) = \frac{4}{9}$<br />
 --><br />
<img width="67" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg16.png" alt="$ p(A) = \frac{4}{9}$"> , <!-- MATH<br />
 $p(B) = \frac{2}{9}$<br />
 --><br />
<img width="68" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg17.png" alt="$ p(B) = \frac{2}{9}$"> , and <!-- MATH<br />
 $p(C) = \frac{3}{9}$<br />
 --><br />
<img width="67" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg18.png" alt="$ p(C) = \frac{3}{9}$"> . The role of the local steps was to reduce a new problem (estimating the importance or popularity of a web page from the link structure) to something with <em>already known</em> techniques (like solving a linear system as illustrated in Figure&nbsp;<a href="#fig:LinAlg">2</a>).</font></p>
<div align="center"><a name="fig:LinAlg" id="fig:LinAlg"></a><a name="79"></a></p>
<table>
<caption align="bottom"><strong>Figure 2:</strong> Linear Algebra Solution: As Taught in School</caption>
<tr>
<td>
<div align="center"><img width="400" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LinAlg.jpg" alt="Image LinAlg"></div>
</td>
</tr>
</table>
</div>
<p><font>So page-A is the most important page by the PageRank measure.</font></p>
<p><font>In this example application the local step was setting up the system of linear equations (which are easy to derive from the web link graph) and the global step was solving the entire system for the final scores (which were not obvious). You spend most of your time encoding the problem and then use a known technique (in this case solving a linear system) to finish the solution.</font></p>
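<p><font>As a small illustration of the global step, the three-page system can be solved mechanically. This is a sketch under stated assumptions rather than a web-scale method: the helper solve_linear_system is our own name, it uses textbook Gauss-Jordan elimination with exact fractions, and the third traffic equation is replaced by the normalization condition because the three traffic equations are linearly dependent.</font></p>
<pre>
```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(b)
    # Build the augmented matrix [a | b] using exact rational arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row so the pivot becomes 1.
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


half = Fraction(1, 2)
# Unknowns are ordered (p(A), p(B), p(C)).
# Row 1: p(A) - 1/2 p(B) - p(C) = 0
# Row 2: -1/2 p(A) + p(B) = 0
# Row 3: p(A) + p(B) + p(C) = 1  (replaces the redundant p(C) equation)
coefficients = [
    [1, -half, -1],
    [-half, 1, 0],
    [1, 1, 1],
]
right_hand_side = [0, 0, 1]
p_a, p_b, p_c = solve_linear_system(coefficients, right_hand_side)
print(p_a, p_b, p_c)
```
</pre>
<p><font>The sketch prints 4/9 2/9 1/3, which agrees with the solution above once 3/9 is written in lowest terms.</font></p>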
<h2><a name="SECTION00032000000000000000" id="SECTION00032000000000000000">Natural Language Processing</a></h2>
<p><font>Our next example application is natural language processing&nbsp;[<a href="#CharniakBook">Cha96</a>,<a href="#Charniak:1997p1484">Cha97</a>]. Speech recognition (the alignment or transcription of recognized intelligible segments of sound to written text) is an important problem in natural language processing. An example problem is the need to find the most likely text matching a sequence of sounds such as is shown in Figure&nbsp;<a href="#fig:SoundSeq1">3</a>.</font></p>
<div align="center"><a name="fig:SoundSeq1" id="fig:SoundSeq1"></a><a name="89"></a></p>
<table>
<caption align="bottom"><strong>Figure 3:</strong> A Sequence of Sounds</caption>
<tr>
<td>
<div align="center"><img width="500" height="69" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq1.png" alt="Image SoundSeq1"></div>
</td>
</tr>
</table>
</div>
<p><font>Consider Figure&nbsp;<a href="#fig:SoundSeq3">4</a> (which shows a bad transcription) and Figure&nbsp;<a href="#fig:SoundSeq2">5</a> (which shows a good transcription).</font></p>
<div align="center"><a name="fig:SoundSeq3" id="fig:SoundSeq3"></a><a name="98"></a></p>
<table>
<caption align="bottom"><strong>Figure 4:</strong> A Bad Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq3.png" alt="Image SoundSeq3"></div>
</td>
</tr>
</table>
</div>
<div align="center"><a name="fig:SoundSeq2" id="fig:SoundSeq2"></a><a name="105"></a></p>
<table>
<caption align="bottom"><strong>Figure 5:</strong> A Good Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq2.png" alt="Image SoundSeq2"></div>
</td>
</tr>
</table>
</div>
<p><font>Our claim: we can (given access to training data, and this is the age of data&nbsp;[<a href="#Halevy:2009p2327">HNP09</a>]) solve this problem with a local step that is a set of simple criticisms of proposed transcriptions. A good starting point is a database of previous sounds to text transcriptions. This database allows the construction of a series of tables that give the historic frequency (or probability) of all of the following:</font></p>
<ul>
<li>Prior probability of each sound</li>
<li>Probability of each sound given the immediately previous sound</li>
<li>Prior probability of each word</li>
<li>Probability of each word given the immediately previous word</li>
<li>Which combinations of word fragments are legitimate words</li>
<li>Probability of each sound being assigned to each word fragment (syllables, phonemes and so on).</li>
</ul>
<p><font>These tables encode a &#8220;speech model&#8221; (the rules involving sounds only), a language model (the rules involving text or words only) and the linkage between the two models. These models are deliberately simple in that they capture only local interactions (like probability of a word given the word before it) but no long range interactions (like subject predicate agreement).</font></p>
<p><font>Each box, nested box and arrow on our diagram represents one possible local critique. For each item in our diagram (again, the boxes and arrows) we can use our tables to assign a goodness or plausibility score. For instance bad word to word transitions (like &#8220;won&#8221; <!-- MATH<br />
 $\rightarrow$<br />
 --><br />
<img width="19" height="13" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg19.png" alt="$ \rightarrow$"> &#8220;won&#8221;) will be rare in our historic tables, so just looking up probabilities from the tables (or, better, using the logarithms of probabilities) gives us a &#8220;plausibility score&#8221; that prefers known patterns of language. Then a score for the overall transcription can be derived by multiplying all of the local scores together (or, when logarithms are used, by adding them). These local scores (though simple) already have encoded enough evidence to prefer the good transcription to the bad transcription <em>without</em> requiring any deep knowledge of speech, text or the meaning of the text. This is because the bad transcription has a series of obvious flaws such as: unlikely sound to word fragment assignments and unlikely word to word transitions.</font></p>
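<p><font>To make the scoring concrete, here is a minimal sketch that combines local scores both ways. The probabilities are invented for illustration and do not come from any real table; the function names are ours.</font></p>
<pre>
```python
import math

# Hypothetical local probabilities for two candidate transcriptions;
# the numbers are invented for illustration and not taken from real tables.
good_local_scores = [0.20, 0.30, 0.25, 0.40]
bad_local_scores = [0.20, 0.01, 0.02, 0.40]


def product_score(scores):
    """Overall plausibility as the product of the local scores."""
    return math.prod(scores)


def log_score(scores):
    """The same ranking computed as a sum of logarithms."""
    return sum(math.log(s) for s in scores)


good_product = product_score(good_local_scores)
bad_product = product_score(bad_local_scores)
good_log = log_score(good_local_scores)
bad_log = log_score(bad_local_scores)
print(good_product, bad_product)
print(good_log, bad_log)
```
</pre>
<p><font>Both forms rank the candidates the same way. The logarithmic form is usually preferred in practice because products of many small probabilities can underflow.</font></p>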
<div align="center"><a name="fig:SoundSeqPartial" id="fig:SoundSeqPartial"></a><a name="116"></a></p>
<table>
<caption align="bottom"><strong>Figure 6:</strong> Naively Extending a Partial Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeqPartial.png" alt="Image SoundSeqPartial"></div>
</td>
</tr>
</table>
</div>
<p><font>For example consider Figure&nbsp;<a href="#fig:SoundSeqPartial">6</a> where a naive solver is considering the word &#8220;one&#8221; as the third word to fill in. The <em>only</em> local critiques the solver needs to consider are:</font></p>
<ul>
<li>how likely the word &#8220;one&#8221; is in general (call this <img width="49" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg20.png" alt="$ P[one]$"> )</li>
<li>how likely the word &#8220;one&#8221; is to follow the word &#8220;nine&#8221; (call this <!-- MATH<br />
 $P[one | nine]$<br />
 --><br />
<img width="86" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg21.png" alt="$ P[one \vert nine]$"> )</li>
<li>how likely the letter sequence &#8220;o&#8221; is given the sound &#8220;w&#8221; (call this <!-- MATH<br />
 $P[o | \text{w\textschwa}]$<br />
 --><br />
<img width="55" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg24.png" alt="$P[o \vert \text{w\textschwa}]$"> )</li>
<li>how likely the letter sequence &#8220;ne&#8221; is given the sound &#8220;n&#8221; (call this <!-- MATH<br />
 $P[ne | \text{n}]$<br />
 --><br />
<img width="41" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg25.png" alt="$ P[ne \vert$">&nbsp; &nbsp;n<img width="7" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg23.png" alt="$ ]$"> ).</li>
</ul>
<p><font>So the local plausibility of the fill-in word &#8220;one&#8221; is: <!-- MATH<br />
 $P[one]<br />
\times P[one | nine] \times P[o | \text{w\textschwa}] \times P[ne |<br />
\text{n}]$<br />
 --><br />
<img width="292" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg28.png" alt="$P[one] \times P[one \vert nine] \times P[o \vert \text{w\textschwa}] \times P[ne \vert \text{n}]$"> . We will call this the critique of &#8220;one&#8221; in position 3 and write it as <!-- MATH<br />
 $C_3(w_2,one)$<br />
 --><br />
<img width="84" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg29.png" alt="$ C_3(w_2,one)$"> where <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg30.png" alt="$ w_2$"> is the word known to be in position 2. Similarly we can generate all of the possible critiques <img width="53" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg31.png" alt="$ C_1(w_1)$"> , <!-- MATH<br />
 $C_2(w_1,w_2)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg32.png" alt="$ C_2(w_1,w_2)$"> , <!-- MATH<br />
 $C_3(w_2,w_3)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg33.png" alt="$ C_3(w_2,w_3)$"> , <!-- MATH<br />
 $C_4(w_3,w_4)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg34.png" alt="$ C_4(w_3,w_4)$"> and the overall critique of a sequence <!-- MATH<br />
 $w_1 \; w_2 \; w_3 \; w_4$<br />
 --><br />
<img width="77" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg35.png" alt="$ w_1 \; w_2 \; w_3 \; w_4$"> : <!-- MATH<br />
 $C_1(w_1)<br />
\times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$<br />
 --><br />
<img width="336" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg36.png" alt="$ C_1(w_1) \times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$"> from our pre-computed tables of probabilities. Notice for all of these critiques only the immediately previous word and the nearby sounds were used to determine the plausibility of the word we are attempting to fit in. Instead of using these critiques to directly fill in a possible solution (or using search) we will package up these critiques (in the form of the <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> ) and pass them on to a powerful separate globalization step called Dynamic Programming&nbsp;[<a href="#DynamicProgramming">Bel57</a>].</font></p>
<p><font>The globalization or finding of a best overall transcription is not trivial even though our score is simple. This is because the overall <em>best</em> sequence could depend on clever non-local fill-ins (like deliberately picking a less likely first word to allow a later favored transition to a fantastically good third word). Dynamic Programming does not fill in the transcription from left to right, but instead uses a table of scores derived from the left to right arrows and the <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> . In our example Dynamic Programming consists of building a table of information as shown in Figure&nbsp;<a href="#fig:DynBackFill">7</a>. Let <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> represent the word position we are looking at (so <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> ranges from 1 to 4) and let <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> be a variable that ranges over every word in the dictionary. 
Our table is indexed by <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> and <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> and when filled in <img width="51" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg40.png" alt="$ T(i,w)$"> stores the highest &#8220;plausibility score&#8221; of any partial sequence of words where words 1 through <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> have been filled in and the <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> -th word is <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> .</font></p>
<div align="center"><a name="fig:DynBackFill" id="fig:DynBackFill"></a><a name="134"></a></p>
<table>
<caption align="bottom"><strong>Figure 7:</strong> Dynamic Programming: Back Chaining in <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> for a Solution</caption>
<tr>
<td>
<div align="center"><img width="300" height="298" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/DynTableBackFill.png" alt="Image DynTableBackFill"></div>
</td>
</tr>
</table>
</div>
<p><font>If we already had this magic table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> we could find a best possible sequence by &#8220;back chaining.&#8221; We start by finding a fourth word (<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg41.png" alt="$ w_4$"> ) such that <img width="61" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg42.png" alt="$ T(4,w_4)$"> is maximal (in this case &#8220;one&#8221;). We then find a best third word (<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg43.png" alt="$ w_3$"> ) by enumerating all words and picking <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg43.png" alt="$ w_3$"> such that <!-- MATH<br />
 $T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$<br />
 --><br />
<img width="234" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg44.png" alt="$ T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$"> . We continue back until we have found words <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg30.png" alt="$ w_2$"> and <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg45.png" alt="$ w_1$"> to get a complete best sequence. Notice that we work from right to left (backwards) and except for the starting step we pick each word to match the calculation we are trying to un-roll, not to be the maximal entry in the column. For instance we pick <!-- MATH<br />
 $w_1 = dial$<br />
 --><br />
<img width="70" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg46.png" alt="$ w_1 = dial$"> even though it does not have the highest score in its column, because <!-- MATH<br />
 $T(1,dial) C_2(dial,nine)<br />
C_3(nine,one) C_4(one,one) = T(4,one)$<br />
 --><br />
<img width="433" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg47.png" alt="$ T(1,dial) C_2(dial,nine) C_3(nine,one) C_4(one,one) = T(4,one)$"> is the maximal complete chain.</font></p>
<p><font>Of course, we don&#8217;t start with the table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> already filled in, so we need a procedure to build it. This procedure is the heart of the Dynamic Programming method (for more examples see: &#8220;Introduction to Algorithms&#8221;&nbsp;[<a href="#IntroductionToAlgorithms">CLRS09</a>]). Notice that <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg48.png" alt="$ T(1,w)$"> can be filled in for all <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> just by plugging in words and computing the critiques <img width="46" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg49.png" alt="$ C_1(w)$"> (i.e. <!-- MATH<br />
 $T(1,w) = C_1(w)$<br />
 --><br />
<img width="118" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg50.png" alt="$ T(1,w) = C_1(w)$"> ). Once all the <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg48.png" alt="$ T(1,w)$"> are filled in we can fill in the <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg51.png" alt="$ T(2,w)$"> with the general (and slightly trickier) formula:</font></p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w)<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="249" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg52.png" alt="$\displaystyle T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w) $"></div>
<p><font>as we illustrate for <img width="74" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg53.png" alt="$ T(2,nine)$"> in Figure&nbsp;<a href="#fig:DynTable">8</a>.</font></p>
<div align="center"><a name="fig:DynTable" id="fig:DynTable"></a><a name="145"></a></p>
<table>
<caption align="bottom"><strong>Figure 8:</strong> Dynamic Programming: Building the Table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"></caption>
<tr>
<td>
<div align="center"><img width="400" height="261" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/DynTableCalculate.png" alt="Image DynTableCalculate"></div>
</td>
</tr>
</table>
</div>
<p><font>The magic of the Dynamic Programming technique is: by being careful to not store too much in the table <img width="51" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg40.png" alt="$ T(i,w)$"> we avoid an explosion in record keeping that would render the method inefficient. Dynamic Programming exploits the small dependence structure encoded in <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> (each box in our diagram depending on only a few arrows) and as we have shown can find &#8220;clever&#8221; solutions (such as taking a sub-optimal first word to get better transitions into preferred later words). For those who want more detail on solving this problem we recommend [<a href="#CharniakBook">Cha96</a>] (as our goal is not to fully explain Dynamic Programming, but to demonstrate how it could be applied to the transcription problem as a pre-packaged globalizer).</font></p>
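<p><font>The table building and back chaining steps can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation from the references: the five-word vocabulary, every number in sound_fit and transition, and the DEFAULT score are invented, and positions are 0-based in the code while the text counts from 1.</font></p>
<pre>
```python
# Hypothetical score tables, invented for illustration only.
vocabulary = ["dial", "tile", "nine", "one", "won"]
DEFAULT = 0.001  # score assumed for any pairing missing from the tables

# sound_fit[i][w]: how well word w matches the sounds at position i + 1.
sound_fit = [
    {"dial": 0.4, "tile": 0.5},
    {"nine": 0.9},
    {"one": 0.5, "won": 0.5},
    {"one": 0.5, "won": 0.5},
]
# transition[(v, w)]: how plausible it is for word w to follow word v.
transition = {
    ("dial", "nine"): 0.5, ("tile", "nine"): 0.1,
    ("nine", "one"): 0.6, ("nine", "won"): 0.1,
    ("one", "one"): 0.5, ("one", "won"): 0.1,
    ("won", "one"): 0.2, ("won", "won"): 0.01,
}


def critique(i, v, w):
    """Local critique of word w at 0-based position i after word v."""
    fit = sound_fit[i].get(w, DEFAULT)
    if i == 0:
        return fit  # the first word has no predecessor
    return fit * transition.get((v, w), DEFAULT)


def build_table(length):
    """table[i][w] = best score of words 0..i filled in, ending in w."""
    table = [{w: critique(0, None, w) for w in vocabulary}]
    for i in range(1, length):
        previous = table[i - 1]
        table.append({
            w: max(previous[v] * critique(i, v, w) for v in vocabulary)
            for w in vocabulary
        })
    return table


def back_chain(table):
    """Recover a best sequence by unrolling the table from right to left."""
    last = len(table) - 1
    words = [max(vocabulary, key=lambda w: table[last][w])]
    for i in range(last, 0, -1):
        w = words[0]
        # Pick the predecessor whose score produced table[i][w].
        best_v = max(vocabulary, key=lambda v: table[i - 1][v] * critique(i, v, w))
        words.insert(0, best_v)
    return words


table = build_table(4)
best_sequence = back_chain(table)
print(best_sequence)
```
</pre>
<p><font>With these invented numbers the sketch recovers the sequence dial, nine, one, one, even though &#8220;tile&#8221; has the higher score in the first column. The work grows with the number of positions times the square of the vocabulary size, rather than with the number of possible sequences.</font></p>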
<p><font>In this example the local step was the graph link based critiques and the globalization step was Dynamic Programming. The separation of concerns from the local scoring to the globalizing step is a strength of the local to global principle.</font></p>
<h2><a name="SECTION00033000000000000000" id="SECTION00033000000000000000">Machine Learning</a></h2>
<p><font>Our final example application is machine learning. Machine learning is loosely defined as computer programs that adapt or learn from data. Thomas Mitchell helps distinguish this activity as a specialty of artificial intelligence that concentrates on &#8220;well-posed learning problems.&#8221;&nbsp;[<a href="#MitchellML">Mit97</a>] Trevor Hastie, Robert Tibshirani and Jerome Friedman emphasize the relation to statistics (versus more traditional symbolic AI)&nbsp;[<a href="#TibHat">TH09</a>]. A simple demonstration can be found in [<a href="#MLArt">Mou09b</a>].</font></p>
<p><font>Machine learning is perhaps the strongest example of the local to global principle, and our treatment of it is inspired by the work of Kristin P. Bennett and Emilio Parrado-Hernandez&nbsp;[<a href="#Bennett:2006p400">BPH06</a>]. In hindsight many machine learning algorithms (each of which has had a turn at being &#8220;the most exciting breakthrough ever&#8221; for a while) can be seen as the pairing of a performance criterion (which we call a local criterion as it applies to one specific set of parameter values at a time) and an optimization method (what we have been calling the globalization step). The work of Bennett and Parrado-Hernandez calls this distinction out and shows that it is more productive to break machine learning systems into an objective function and an optimizer than to present them as unique named monolithic units. This allows both the choice of better optimizers (such as replacing the inferior gradient descent method wherever it occurs) and explicit control of important concepts such as hypothesis regularization and control of over-fitting (which some algorithms claim to achieve by deliberately exiting early from an inferior optimizer).</font></p>
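<p><font>The split into an objective function and an optimizer can be sketched on a toy problem. This is our own illustration, not code from Bennett and Parrado-Hernandez: the data points are invented, and the two optimizers (a closed-form least squares solve and plain gradient descent with an assumed step size) are simply swapped in against the same square error criterion.</font></p>
<pre>
```python
# Hypothetical one-dimensional data, invented for illustration.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def square_error(slope, intercept):
    """Local criterion: total squared error of one candidate line."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))


def solve_by_linear_algebra():
    """Optimizer 1: closed-form least squares via the normal equations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def solve_by_gradient_descent(steps=5000, rate=0.01):
    """Optimizer 2: plain gradient descent on the same criterion."""
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the total squared error.
        grad_slope = 2 * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2 * sum(residuals)
        slope -= rate * grad_slope
        intercept -= rate * grad_intercept
    return slope, intercept


exact = solve_by_linear_algebra()
approximate = solve_by_gradient_descent()
print(exact, square_error(*exact))
print(approximate, square_error(*approximate))
```
</pre>
<p><font>Because the criterion is stated separately, either optimizer can be judged by the same number: the square error it reaches. The closed-form solve gets there in one step while gradient descent needs many iterations and a hand-picked step size.</font></p>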
<p><font>At a &#8220;30,000 feet level&#8221; we can build a table of common machine learning techniques and name what is commonly used to implement their local and global steps. When a machine learning algorithm is defined by what conditions are meant to be true at the optimum we are no longer bound by details of the original implementation and can examine, fix and improve the components.<a name="tex2html17" href="#foot154" id="tex2html17"><sup>7</sup></a> Table&nbsp;<a href="#fig:MachineLearning">1</a> is a crude summary of a wide selection of machine learning algorithms that may be more likely to offend everybody than just offend somebody. But this is also the point: it is the algorithmist&#8217;s job to think fluidly (beyond given names and provenances) and to invent scaffolding to convert partial analogies into practical correspondences.</font></p>
<p></p>
<div align="center"><a name="190"></a></p>
<table>
<caption><strong>Table 1:</strong> Various Machine Learning Techniques</caption>
<tr>
<td>
<div align="center">
<table cellpadding="3" border="1" align="center">
<tr>
<td align="left" valign="top" width="180"><font size="-1">Machine Learning Method</font></td>
<td align="left" valign="top" width="144"><font size="-1">Local Criterion</font></td>
<td align="left" valign="top" width="144"><font size="-1">Globalization Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Linear Regression [<a href="#Breiman:1997p1133">BF97</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Linear Discriminant Analysis [<a href="#Fisher:1936p2576">Fis36</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Logistic Regression [<a href="#Komarek:2008p1742">Kom08</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">logit penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Newton&#8217;s Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Perceptron [<a href="#Beigel:1991p1027">BRS91</a>] [<a href="#Blum:2002p1867">BD02</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">error rate</font></td>
<td align="left" valign="top" width="144"><font size="-1">error based update</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Naive Bayes [<a href="#Maron:2000p2553">MK00</a>] [<a href="#Maron:1961p2566">Mar61</a>] [<a href="#Lewis:1998p105">Lew98</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">frequency tables</font></td>
<td align="left" valign="top" width="144"><font size="-1">arithmetic</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Nearest Neighbor [<a href="#Ailon:2006p872">AC06</a>] [<a href="#Indyk:1999p166">IM99</a>] [<a href="#Andoni:2006p52">AI06</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">Kernel Methods</font></td>
<td align="left" valign="top" width="144"><font size="-1">enumeration,<br />
projection</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Decision Trees [<a href="#bfso:1984">BFSO84</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">information theory</font></td>
<td align="left" valign="top" width="144"><font size="-1">partitioning</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Clustering [<a href="#Cilibrasi:2005p8">CV05</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">partitioning</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">MaxEnt [<a href="#Grunwald:2000p108">Gru00</a>] [<a href="#Grunwald:2004p739">GD04</a>] [<a href="#Skilling:1988p780">Ski88</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">entropy penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Newton&#8217;s Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Neural Net with Back Propagation [<a href="#NNCPE">Hus99</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">sigmoid penalty function</font></td>
<td align="left" valign="top" width="144"><font size="-1">Automatic Differentiation,<br />
steepest descent</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Winnow [<a href="#Kivinen:1995p1836">KWA95</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">error rate</font></td>
<td align="left" valign="top" width="144"><font size="-1">multiplicative error based update</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Boosting [<a href="#Freund:1999p1015">FS99</a>] [<a href="#Breiman:2000p1134">Bre00</a>] [<a href="#Collins:2002p1008">CSS02</a>] [<a href="#Trevisan:2008p2166">TTV08</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">weighted errors,<br />
data re-weighting</font></td>
<td align="left" valign="top" width="144"><font size="-1">Conjugate Gradient</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">HMM [<a href="#Kristjansson:2004p545">KCVM04</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">probability penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Gibbs Sampler</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Latent Dirichlet Allocation [<a href="#Blei:2003p1063">BNJ03</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">KL divergence</font></td>
<td align="left" valign="top" width="144"><font size="-1">Variational Methods</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Support Vector Machine [<a href="#Joachims:1998p406">Joa98</a>] [<a href="#SVMBook">STC00</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">L1 Margin,<br />
Kernel Methods</font></td>
<td align="left" valign="top" width="144"><font size="-1">Quadratic Optimization</font></td>
</tr>
</table>
</div>
<p><a name="fig:MachineLearning" id="fig:MachineLearning"></a></td>
</tr>
</table>
</div>
<p></p>
<p><font>This table is a necessarily crude summary. For example: notice that several known techniques cannot even be distinguished from each other by the local and global columns of the table.</font></p>
<p><font>There are a few points we would like to make. Back propagation was considered unique to Neural Nets for quite a while: it was so entwined with the technique that it was not recognized as the simple application of Automatic Differentiation&nbsp;[<a href="#Rall:1996p2473">RC96</a>] that it is. Support Vector Machines (SVM) are remarkable for their uniformly very good choice of component methods (maximum L1 margin objective regularization, Kernel Methods&nbsp;[<a href="#KernBook">STC04</a>] and sophisticated optimization methods&nbsp;[<a href="#Joachims:2006p403">Joa06</a>]). Many of the machine learning methods that SVM outperforms become competitive again when they adopt some of SVM&#8217;s technologies (especially using kernel methods to produce synthetic features).</font></p>
<p><font>Beyond these points we invoke a &#8220;globalizers are pre-packaged&#8221; principle and leave the discussion of machine learning and optimization to our reference: [<a href="#Bennett:2006p400">BPH06</a>]. In this example the local step is a per-example score or penalty and the globalization step is optimization.</font></p>
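<p><font>To make the objective/optimizer split concrete, here is a minimal Python sketch (not taken from [<a href="#Bennett:2006p400">BPH06</a>]). It assumes a tiny made-up data set and a two-parameter logistic regression; the names <tt>DATA</tt>, <tt>loss</tt>, <tt>gradient_descent</tt> and <tt>newton</tt> are our own illustrative choices. The local criterion (the logit penalty) is written once, and two different globalizers are then applied to it unchanged.</font></p>

```python
import math

# Hypothetical toy data: one feature x and a label y in {0, 1}.
# The classes overlap, so the optimum is finite.
DATA = [(-2.0, 0), (-1.0, 0), (0.0, 1), (1.0, 0), (2.0, 1), (3.0, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w):
    """Local criterion: mean logit penalty at parameters w = (intercept, slope)."""
    total = 0.0
    for x, y in DATA:
        p = sigmoid(w[0] + w[1] * x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(DATA)


def gradient(w):
    """Gradient of the mean logit penalty."""
    g0 = g1 = 0.0
    for x, y in DATA:
        r = sigmoid(w[0] + w[1] * x) - y
        g0 += r
        g1 += r * x
    return (g0 / len(DATA), g1 / len(DATA))


def hessian(w):
    """Entries (h00, h01, h11) of the symmetric 2x2 Hessian."""
    h00 = h01 = h11 = 0.0
    for x, _ in DATA:
        p = sigmoid(w[0] + w[1] * x)
        s = p * (1.0 - p)
        h00 += s
        h01 += s * x
        h11 += s * x * x
    n = len(DATA)
    return (h00 / n, h01 / n, h11 / n)


def gradient_descent(w, steps=500, rate=0.5):
    """Globalizer 1: fixed-step gradient descent."""
    for _ in range(steps):
        g = gradient(w)
        w = (w[0] - rate * g[0], w[1] - rate * g[1])
    return w


def newton(w, steps=8):
    """Globalizer 2: Newton's method, solving H d = g by Cramer's rule."""
    for _ in range(steps):
        g = gradient(w)
        h00, h01, h11 = hessian(w)
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g[0] - h01 * g[1]) / det
        d1 = (h00 * g[1] - h01 * g[0]) / det
        w = (w[0] - d0, w[1] - d1)
    return w


w_gd = gradient_descent((0.0, 0.0))
w_newton = newton((0.0, 0.0))
print("gradient descent:", w_gd, "loss", loss(w_gd))
print("Newton's method: ", w_newton, "loss", loss(w_newton))
```

<p><font>Both optimizers should arrive at essentially the same parameters, because the optimum is a property of the objective and not of the method used to find it. In this sketch Newton&#8217;s method is given 8 steps and gradient descent 500; the step counts and learning rate are assumptions chosen for this toy data, not general recommendations.</font></p>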
<h1><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Some Methods</a></h1>
<p><font>The application of the local to global principle is similar to the Feynman &#8220;genius method.&#8221; Feynman&#8217;s method is to always have in mind a list of problems and a list of solution methods. The genius step is: anytime you see a new problem or a new solution method, immediately try it against every item from the complementary list&nbsp;[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;Ten Lessons I Wish I Had Been Taught&#8221;]. This deliberate retention and activity greatly increases your problem solving ability. The power of the local to global principle is itself proportional to the number of local methods times the number of globalization strategies. Of course, to even start, the practitioner must already have available a number of candidate local and globalization methods. We list some methods and some guidance on variation and invention.</font></p>
<h2><a name="SECTION00041000000000000000" id="SECTION00041000000000000000">Local Methods</a></h2>
<p><font><img width="100" height="100" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/nails.jpg" alt="Image nails"> Good sources of ideas and analogies for local methods include:</font></p>
<ul>
<li>Introduce a Graph Structure
<p>A graph structure is a network of nodes connected by edges. Use of graphs was demonstrated both in the natural language processing and web page link analysis examples. We can dress up how we solved these problems and say we used a &#8220;Hidden Markov Model&#8221;, but the real power was that we encoded our problem in a simple graph. Some problems (especially those from logic or those involving time) are essentially solved once they are translated out of their original form and into graph notation (for an example see: [<a href="#Mount:2000p360">Mou00</a>]).</p>
</li>
<li>Appeal to Physical Conservation Laws
<p>A good example of a physical law is Kirchhoff&#8217;s law, or conservation of flow. All of the web page link analysis&#8217;s equations were derived by saying that the attention at a node is essentially the sum of attentions from other nodes (more sophisticated versions of the analysis actually do create and destroy flow, but they do it in a principled way).</p>
</li>
<li>Encode the Problem into an Objective Function
<p>This method is essentially your declaration that you intend to use an optimizer for the globalization step. In operations research this specific technique has long been the practice (with no disrespect: a very productive part of operations research has been translating different problems into linear programs so the simplex method can be applied, for an example see [<a href="#TradeArt">Mou09a</a>]).</p>
</li>
<li>Gradient Like Computations
<p>Includes Gradients, Secants, Lagrangians and other ideas from calculus. Gradients can drive optimizer based globalizers, and techniques like Lagrangians are often powerful enough to use mere inspection as the globalization step.</p>
</li>
<li>Violation Driven Updates
<p>This method is particularly effective when your problem is not amenable to continuous optimization. A good example is the Lin-Kernighan heuristic for solving the traveling salesman problem&nbsp;[<a href="#Lin:1973p2739">LK73</a>]. This heuristic looks at subsets of the problem and suggests improving &#8220;surgeries&#8221; (until no more such improvements are possible).</p>
</li>
<li>Introduction of Symbols
<p>Often, as with the web page link analysis example, you can not specify specific values for the unknowns, but you can specify relationships. You often can then solve for the symbols or introduce additional conditions and use an optimizer to complete the solution (see for example the maximum entropy method as described in [<a href="#Skilling:1988p780">Ski88</a>]).</p>
</li>
<li>Over Specification
<p>If we anticipate using a global step like search, enumeration, summation or integration then over specification is a good local idea.</p>
<p>For example: consider computing the probability that a fair coin flipped 10 times comes up heads exactly 3 times. The easiest way to perform this calculation is to specify exactly which 3 flips come up heads (the local over-specification) and then sum over all choices of 3 out of 10 flips (the global step). In mathematical notation this is:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
P[\text{exactly 3 heads out of 10 flips}] = \binom{10}{3} 2^{-10} \approx 0.117<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="20" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg54.png" alt="$\displaystyle P[$">exactly 3 heads out of 10 flips<img width="157" height="54" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg55.png" alt="$\displaystyle ] = \binom{10}{3} 2^{-10} \approx 0.117 $"></div>
<p>or just under 12%.</p>
</li>
<li>Under Specification
<p>One of the core principles of Dynamic Programming is to forget as much as possible about partial solutions, keeping only partial solution cost and just enough information to extend partial solutions. If you anticipate using something like Dynamic Programming as your globalization step then your goal should be to under specify.</p>
</li>
<li>Tables
<p>A key step of the natural language processing example was the use of tables of past experience to determine which sounds likely corresponded to which words, which words likely followed each other (and so on). Encoding domain knowledge or expertise as probability tables is a very effective problem solving strategy (especially if the globalization strategy is going to be search or Dynamic Programming). In natural language processing examples, tables and statistics are <em>much</em> easier to manage than comprehensive rules or grammars.</p>
</li>
<li>Set up as Ranking or Machine Learning Problem
<p>This tactic is especially appropriate if your solution success metric is counts, frequencies or probabilities (instead of having to always be correct or always be optimal).</p>
</li>
</ul>
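<p><font>The coin-flip calculation from the Over Specification item can be checked by brute force. The following Python sketch is our own illustration (the names <tt>FLIPS</tt>, <tt>HEADS</tt> and <tt>p_sequence</tt> are assumptions, not from the original text): the local step assigns a probability to one fully specified sequence of flips, and the global step is nothing more than a sum over all qualifying sequences.</font></p>

```python
from fractions import Fraction
from itertools import product
from math import comb

FLIPS = 10
HEADS = 3

# Local step (over-specification): a fully specified sequence of 10 fair
# flips has probability exactly 2**-10, whichever sequence it is.
p_sequence = Fraction(1, 2 ** FLIPS)

# Global step: sum that probability over every sequence with exactly 3 heads.
p_enumerated = sum(
    p_sequence
    for flips in product("HT", repeat=FLIPS)
    if flips.count("H") == HEADS
)

# The same sum in closed form: count the ways to choose which 3 flips are heads.
p_closed_form = comb(FLIPS, HEADS) * p_sequence

print(p_enumerated, float(p_enumerated))  # prints 15/128 0.1171875
print(p_enumerated == p_closed_form)      # prints True
```

<p><font>Enumerating all 1024 sequences is wasteful here, but it shows why over-specifying is attractive: the local calculation becomes trivial, and all of the remaining work is pushed into a generic summation.</font></p>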
<h2><a name="SECTION00042000000000000000" id="SECTION00042000000000000000">Globalization Methods</a></h2>
<p><font><img width="100" height="100" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/hammer.jpg" alt="Image hammer"> The universe of possible globalization methods is very diverse (in particular globalization is not always optimization).</font></p>
<ul>
<li>Search / Enumeration
<p>Search can be slow, but it is always an option to consider. If your problem translates naturally into a graph structure or your solutions are naturally seen as being composed of small pieces, search should be considered. One of the big advantages of using the local phase to formally encode your problem&#8217;s structure and putting search off to the global phase is that you can use advanced search techniques. Once you are freed from your specific problem details it becomes much easier to consider search techniques like branch and bound, A*, game theoretic search and general speed-up techniques like hashing and caching.</p>
</li>
<li>Dynamic Programming
<p>If your problem has a bit more structure (in that partial solutions summarize and compose easily) then you can likely replace search with Dynamic Programming. The advantage is that Dynamic Programming typically offers an incredible speed up when compared to search.</p>
</li>
<li>Optimization
<p>If your problem is continuous (involves numbers instead of discrete or categorical decisions), can be encoded as a reasonable objective function (linear, positive definite quadratic) and has reasonable constraints (linear or convex) then you can immediately apply an optimizer as your globalization step. Typical optimization methods include: conjugate gradient, Newton methods, quasi-Newton methods, linear programming and quadratic programming.</p>
</li>
<li>Combinatorial Optimization
<p>If your problem includes &#8220;discrete variables&#8221; (that is, variables that take on one of a fixed set of values instead of values from a numeric range) then you may not be able to apply standard optimization techniques. At this point you may want to use more expensive combinatorial optimization techniques like integer linear programming or constraint satisfaction.</p>
</li>
<li>Fixed Point Methods / Iteration
<p>Fixed point methods are based on the idea: &#8220;incrementally improve until there is no incremental improvement possible.&#8221; If the problem is continuous this is similar to steepest descent. If the problem is discrete then this is similar to the Lin-Kernighan heuristic.</p>
</li>
<li>Linear Algebra
<p>The web page link analysis and optimization examples were essentially solved once we reduced them to linear algebra. If you can write your problem as a linear relationship between unknowns or as the fixed-point of a linear operator (i.e. an <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg56.png" alt="$ x$"> such that <img width="54" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg57.png" alt="$ A x = x$"> ) then you can immediately use linear algebra to solve the problem at very large scale (e.g. web scale).</p>
</li>
<li>Sampling / Problem Kernels
<p>A very successful line of attack on large problems is to reduce to a smaller problem containing most of the essential difficulty. David Karger has produced a number of effective algorithms for graph cuts and flows using a theory of sampling&nbsp;[<a href="#Karger:1998p556">Kar98</a>]. Rod Downey and M. Fellows have demonstrated an effective theory of &#8220;problem kernels&#8221; that finds solutions by focusing on smaller sub-problems (on which we can afford to use more expensive procedures)&nbsp;[<a href="#DF98">DF98</a>].</p>
</li>
<li>Amortized Analysis / Economic Mechanism Methods
<p>Daniel Sleator and Robert Tarjan&#8217;s ideas of amortized analysis&nbsp;[<a href="#Sleator:1985p168">ST85</a>] allow approximation schemes similar to problem kernels. The method is to approximately optimize by pairing a bunch of unavoidable large penalties (conditions we can&#8217;t meet) with some accounting credits (say bonuses from other conditions we are meeting very well). We then isolate these paired items and optimize the rest of the problem exactly. The technique often works by showing the approximation can not be too bad because, due to the pairing of large penalties to good credits, there can not be too many large penalties. An informal example is: if it is impossible to pick someplace where all of an office will eat for lunch, perhaps you can solve the problem by paying one person to accept a restaurant they do not like (if the removal of their objection opens up a venue that is acceptable to everybody else).</p>
</li>
<li>Relaxation / Homotopic methods
<p>These methods involve changing hard constraints to soft penalties (so allowing the constraints to be violated, but at a slowly increasing cost). After such a relaxation the homotopic (or continuous deformation) method is to increase the cost of violation and re-solve to try and get a trajectory of better and better nearly acceptable solutions that point to a possible overall solution.</p>
</li>
</ul>
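<p><font>As an illustration of the Fixed Point and Linear Algebra items, the following Python sketch finds an <tt>x</tt> with <tt>A x = x</tt> by repeated application of the operator. The three-page link matrix <tt>A</tt> is a made-up example and the function names <tt>step</tt> and <tt>fixed_point</tt> are our own; a production link analysis would use sparse matrices and a convergence test rather than a fixed number of iterations.</font></p>

```python
# Hypothetical 3-page web. Column j holds the share of page j's attention
# passed to each page, so every column sums to 1 (conservation of flow):
# page 0 links to pages 1 and 2, page 1 links to page 2, page 2 links to page 0.
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]


def step(A, x):
    """Local rule: each page's attention is the sum of the attention flowing in."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def fixed_point(A, steps=100):
    """Globalization by iteration: apply the local rule until it settles."""
    x = [1.0 / len(A)] * len(A)  # start with attention spread evenly
    for _ in range(steps):
        x = step(A, x)
    return x


x = fixed_point(A)
print([round(v, 6) for v in x])  # approximately [0.4, 0.2, 0.4]
```

<p><font>Solving the three linear equations by hand gives the same answer: pages 0 and 2 each hold 0.4 of the attention and page 1 holds 0.2. The iteration never needs to know this; it only applies the local rule, which is what lets the same approach scale to very large link graphs.</font></p>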
<h1><a name="SECTION00050000000000000000" id="SECTION00050000000000000000">Conclusion</a></h1>
<p><font>The purpose of this article has been to make more visible an idea we call the local to global principle. This principle is an organizing tool useful both in designing and analyzing a wide variety of applications. Essentially the whole point of this writeup is to set up enough framework to quickly write down a table of advice such as Table&nbsp;<a href="#fig:ProblemTable">2</a> (and for such a table to mean something).</font></p>
<p></p>
<div align="center"><a name="227"></a></p>
<table>
<caption><strong>Table 2:</strong> Various Applications, Local Steps and Global Steps</caption>
<tr>
<td>
<div align="center">
<table cellpadding="3" border="1" align="center">
<tr>
<td align="left"><font size="-1">Example</font></td>
<td align="left"><font size="-1">Local Step</font></td>
<td align="left"><font size="-1">Global Step</font></td>
</tr>
<tr>
<td align="left"><font size="-1">speech transcription</font></td>
<td align="left"><font size="-1">tables</font></td>
<td align="left"><font size="-1">Dynamic Programming</font></td>
</tr>
<tr>
<td align="left"><font size="-1">PageRank</font></td>
<td align="left"><font size="-1">graph structure, linear equations</font></td>
<td align="left"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left"><font size="-1">machine learning</font></td>
<td align="left"><font size="-1">objective function</font></td>
<td align="left"><font size="-1">optimization</font></td>
</tr>
</table>
</div>
<p><a name="fig:ProblemTable" id="fig:ProblemTable"></a></td>
</tr>
</table>
</div>
<p></p>
<p><font>The principle is not universal; not everything can be fit into such a table. For example the local to global decoupling is <em>not</em> a feature of the famous EM algorithm&nbsp;[<a href="#Dempster:1977p761">DLR77</a>], which depends on mixing predictions and corrections.</font></p>
<p><font>To conclude: the recipe is as follows. If you come to a problem with a large shopping bag of possible ways to build local criteria and powerful globalization procedures then you stand a very good chance of solving the problem quickly. Also, if you keep the local to global principle in mind you are more likely to identify and retain potential local tricks and globalizers when you see them, and thus have a larger, more nimble set of tools available to solve problems when the time comes.</font></p>
<h2><a name="SECTION00060000000000000000" id="SECTION00060000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Ailon:2006p872" id="Ailon:2006p872">AC06</a></dt>
<dd>Nir Ailon and Bernard Chazelle, <i>Approximate nearest neighbors and the fast johnson-lindenstrauss transform</i>, STOC (2006).</dd>
<dt><a name="Andoni:2006p52" id="Andoni:2006p52">AI06</a></dt>
<dd>Alexandr Andoni and Piotr Indyk, <i>Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions</i>.</dd>
<dt><a name="Blum:2002p1867" id="Blum:2002p1867">BD02</a></dt>
<dd>Avrim Blum and John Dunagan, <i>Smoothed analysis of the perceptron algorithm for linear programming</i>, SODA (2002), 11.</dd>
<dt><a name="DynamicProgramming" id="DynamicProgramming">Bel57</a></dt>
<dd>Richard Bellman, <i>Dynamic programming</i>, Princeton University Press, 1957.</dd>
<dt><a name="Breiman:1997p1133" id="Breiman:1997p1133">BF97</a></dt>
<dd>Leo Breiman and Jerome&nbsp;H Friedman, <i>Predicting multivariate responses in multiple linear regression</i>, Journal of the Royal Statistical Society, Series B (Methodological) <b>59</b> (1997), no.&nbsp;1, 3-54.</dd>
<dt><a name="bfso:1984" id="bfso:1984">BFSO84</a></dt>
<dd>Leo Breiman, Jerome Friedman, Charles&nbsp;J. Stone, and R.&nbsp;A. Olshen, <i>Classification and regression trees</i>, Chapman &amp; Hall/CRC, January 1984.</dd>
<dt><a name="Blei:2003p1063" id="Blei:2003p1063">BNJ03</a></dt>
<dd>David&nbsp;M Blei, Andrew&nbsp;Y Ng, and Michael&nbsp;I Jordan, <i>Latent dirichlet allocation</i>, Journal of Machine Learning Research <b>3</b> (2003), 993-1022.</dd>
<dt><a name="Bennett:2006p400" id="Bennett:2006p400">BPH06</a></dt>
<dd>Kristin&nbsp;P. Bennett and Emilio Parrado-Hernandez, <i>The interplay of optimization and machine learning research</i>, Journal of Machine Learning Research <b>7</b> (2006), 1265-1281.</dd>
<dt><a name="Breiman:2000p1134" id="Breiman:2000p1134">Bre00</a></dt>
<dd>Leo Breiman, <i>Special invited paper. additive logistic regression: A statistical view of boosting: Discussion</i>, Ann. Statist. <b>28</b> (2000), no.&nbsp;2, 374-377.</dd>
<dt><a name="Beigel:1991p1027" id="Beigel:1991p1027">BRS91</a></dt>
<dd>R&nbsp;Beigel, N&nbsp;Reingold, and D&nbsp;Spielman, <i>The perceptron strikes back</i>, Structure in Complexity Theory Conference <b>6</b> (1991), 286-291.</dd>
<dt><a name="CharniakBook" id="CharniakBook">Cha96</a></dt>
<dd>Eugene Charniak, <i>Statistical language learning</i>, MIT Press, 1996.</dd>
<dt><a name="Charniak:1997p1484" id="Charniak:1997p1484">Cha97</a></dt>
<dd>Eugene Charniak, <i>Statistical techniques for natural language parsing</i>, AI Magazine <b>18</b> (1997), no.&nbsp;4, 33-44.</dd>
<dt><a name="IntroductionToAlgorithms" id="IntroductionToAlgorithms">CLRS09</a></dt>
<dd>Thomas&nbsp;H. Cormen, Charles&nbsp;E. Leiserson, Ronald&nbsp;L. Rivest, and Clifford Stein, <i>Introduction to algorithms</i>, MIT Press, 2009.</dd>
<dt><a name="Collins:2002p1008" id="Collins:2002p1008">CSS02</a></dt>
<dd>Michael Collins, Robert&nbsp;E Schapire, and Yoram Singer, <i>Logistic regression, adaboost and bregman distances</i>, Machine Learning <b>48</b> (2002), no.&nbsp;1/2/3, 30.</dd>
<dt><a name="Cilibrasi:2005p8" id="Cilibrasi:2005p8">CV05</a></dt>
<dd>Rudi Cilibrasi and Paul&nbsp;M.B Vitanyi, <i>Clustering by compression</i>, IEEE Transactions on Information Theory <b>51</b> (2005), no.&nbsp;4, 1523-1545.</dd>
<dt><a name="DF98" id="DF98">DF98</a></dt>
<dd>Rod&nbsp;G. Downey and M.&nbsp;R. Fellows, <i>Parameterized complexity</i>, Monographs in Computer Science, Springer, November 1998.</dd>
<dt><a name="Dempster:1977p761" id="Dempster:1977p761">DLR77</a></dt>
<dd>A&nbsp;P Dempster, N&nbsp;M Laird, and D&nbsp;B Rubin, <i>Maximum likelihood from incomplete data via the em algorithm</i>, Journal of the Royal Statistical Society, Series B (Methodological) <b>39</b> (1977), no.&nbsp;1, 1-38.</dd>
<dt><a name="Fisher:1936p2576" id="Fisher:1936p2576">Fis36</a></dt>
<dd>Ronald&nbsp;A Fisher, <i>The use of multiple measurements in taxonomic problems</i>, Annals of Eugenics <b>7</b> (1936), 179-188.</dd>
<dt><a name="Freund:1999p1015" id="Freund:1999p1015">FS99</a></dt>
<dd>Yoav Freund and Robert&nbsp;E Schapire, <i>A short introduction to boosting</i>, Journal of Japanese Society for Artificial Intelligence <b>14</b> (1999), no.&nbsp;5, 771-780.</dd>
<dt><a name="Grunwald:2004p739" id="Grunwald:2004p739">GD04</a></dt>
<dd>Peter&nbsp;D Grunwald and A&nbsp;Philip Dawid, <i>Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory</i>, Ann. Statist. <b>32</b> (2004), no.&nbsp;4, 1367-1433.</dd>
<dt><a name="Grunwald:2000p108" id="Grunwald:2000p108">Gru00</a></dt>
<dd>PD&nbsp;Grunwald, <i>Maximum entropy and the glasses you are looking through</i>, Conference on Uncertainty in Artificial Intelligence (2000), 238-246.</dd>
<dt><a name="Halevy:2009p2327" id="Halevy:2009p2327">HNP09</a></dt>
<dd>Alon Halevy, Peter Norvig, and Fernando Pereira, <i>The unreasonable effectiveness of data</i>, IEEE Intelligent Systems (2009).</dd>
<dt><a name="NNCPE" id="NNCPE">Hus99</a></dt>
<dd>Dirk Husmeier, <i>Neural networks for conditional probability estimation</i>, Springer, 1999.</dd>
<dt><a name="Indyk:1999p166" id="Indyk:1999p166">IM99</a></dt>
<dd>Piotr Indyk and Rajeev Motwani, <i>Approximate nearest neighbors: Towards removing the curse of dimensionality</i>.</dd>
<dt><a name="Joachims:1998p406" id="Joachims:1998p406">Joa98</a></dt>
<dd>Thorsten Joachims, <i>Making large-scale svm learning practical</i>, Advances in Kernel Methods &#8211; Support Vector Learning (1998).</dd>
<dt><a name="Joachims:2006p403" id="Joachims:2006p403">Joa06</a></dt>
<dd>Thorsten Joachims, <i>Training linear svms in linear time</i>, KDD (2006).</dd>
<dt><a name="Karger:1998p556" id="Karger:1998p556">Kar98</a></dt>
<dd>David&nbsp;R Karger, <i>Randomization in graph optimization problems: A survey</i>, Optima: Mathematical Programming Society Newsletter <b>58</b> (1998).</dd>
<dt><a name="Kristjansson:2004p545" id="Kristjansson:2004p545">KCVM04</a></dt>
<dd>Trausti Kristjansson, Aron Culotta, Paul Viola, and Andrew&nbsp;Kachites McCallum, <i>Interactive information extraction with constrained conditional random fields</i>, AAAI (2004).</dd>
<dt><a name="Kleinberg:1997p32" id="Kleinberg:1997p32">Kle97</a></dt>
<dd>Jon&nbsp;M Kleinberg, <i>Authoritative sources in a hyperlinked environment</i>, ACM SIAM Symposium on Discrete Algorithms (1997).</dd>
<dt><a name="Komarek:2008p1742" id="Komarek:2008p1742">Kom08</a></dt>
<dd>Paul Komarek, <i>Logistic regression for data mining and high-dimensional classification</i>, CMU CS Thesis (2008), 138.</dd>
<dt><a name="Kivinen:1995p1836" id="Kivinen:1995p1836">KWA95</a></dt>
<dd>J&nbsp;Kivinen, Manfred&nbsp;K Warmuth, and P&nbsp;Auer, <i>The perceptron algorithm vs. winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant</i>, COLT (1995), 289-296.</dd>
<dt><a name="Lewis:1998p105" id="Lewis:1998p105">Lew98</a></dt>
<dd>David&nbsp;D Lewis, <i>Naive (bayes) at forty: The independence assumption in information retrieval</i>, European Conference on Machine Learning (1998).</dd>
<dt><a name="Lin:1973p2739" id="Lin:1973p2739">LK73</a></dt>
<dd>S&nbsp;Lin and BW&nbsp;Kernighan, <i>An effective heuristic algorithm for the traveling-salesman problem</i>, Operations Research (1973), 498-516.</dd>
<dt><a name="Maron:1961p2566" id="Maron:1961p2566">Mar61</a></dt>
<dd>M&nbsp;E Maron, <i>Automatic indexing: An experimental inquiry</i>, RAND Technical Report (1961), 404-417.</dd>
<dt><a name="HTSMH" id="HTSMH">MF00</a></dt>
<dd>Zbigniew Michalewicz and David&nbsp;B. Fogel, <i>How to solve it: Modern heuristics</i>, Springer, 2000.</dd>
<dt><a name="Mill" id="Mill">Mil02</a></dt>
<dd>John&nbsp;Stuart Mill, <i>A system of logic</i>, University Press of the Pacific, 2002.</dd>
<dt><a name="MitchellML" id="MitchellML">Mit97</a></dt>
<dd>Thomas Mitchell, <i>Machine learning</i>, McGraw-Hill, 1997.</dd>
<dt><a name="Maron:2000p2553" id="Maron:2000p2553">MK00</a></dt>
<dd>M&nbsp;E Maron and J&nbsp;L Kuhns, <i>On relevance, probabilistic indexing and information retrieval</i>, 1960 (2000), 1-29.</dd>
<dt><a name="Mount:2000p360" id="Mount:2000p360">Mou00</a></dt>
<dd>John&nbsp;A Mount, <i>Automatic detection of potential deadlock</i>, Dr. Dobbs Journal (2000).</dd>
<dt><a name="TradeArt" id="TradeArt">Mou09a</a></dt>
<dd>John Mount, <i>Automatic generation and testing of un-rolls for profitable technical trades</i>, <a href="http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/">http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/</a>, 2009.</dd>
<dt><a name="MLArt" id="MLArt">Mou09b</a></dt>
<dd>John Mount, <i>A demonstration of data mining</i>, <a href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/</a>, 2009.</dd>
<dt><a name="Page:1998p2689" id="Page:1998p2689">PBMW98</a></dt>
<dd>Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, <i>The pagerank citation ranking: Bringing order to the web</i>, <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768</a> (1998).</dd>
<dt><a name="Polya1" id="Polya1">Pol54a</a></dt>
<dd>G.&nbsp;Polya, <i>Induction and analogy in mathematics</i>, Princeton University Press, 1954.</dd>
<dt><a name="Polya2" id="Polya2">Pol54b</a></dt>
<dd>G.&nbsp;Polya, <i>Patterns of plausible inference</i>, Princeton University Press, 1954.</dd>
<dt><a name="citeulike:679515" id="citeulike:679515">Pol71</a></dt>
<dd>G.&nbsp;Polya, <i>How to solve it</i>, Princeton University Press, November 1971.</dd>
<dt><a name="Rall:1996p2473" id="Rall:1996p2473">RC96</a></dt>
<dd>Louis&nbsp;B Rall and George&nbsp;F Corliss, <i>An introduction to automatic differentiation</i>, SIAM: Computational Differentiation: Techniques, Applications and Tools (1996), 1-18.</dd>
<dt><a name="IndiscreteThoughts" id="IndiscreteThoughts">Rot97</a></dt>
<dd>Gian-Carlo Rota, <i>Indiscrete thoughts</i>, Birkhauser, 1997.</dd>
<dt><a name="Skilling:1988p780" id="Skilling:1988p780">Ski88</a></dt>
<dd>John Skilling, <i>The axioms of maximum entropy</i>, Maximum Entropy and Bayesian Methods in Science and Engineering <b>1</b> (1988), 173-187.</dd>
<dt><a name="Sleator:1985p168" id="Sleator:1985p168">ST85</a></dt>
<dd>Daniel&nbsp;Dominic Sleator and Robert&nbsp;Endre Tarjan, <i>Amortized efficiency of list update and paging rules</i>, Communications of the ACM <b>28</b> (1985), no.&nbsp;2.</dd>
<dt><a name="SVMBook" id="SVMBook">STC00</a></dt>
<dd>John Shawe-Taylor and Nello Cristianini, <i>Support vector machines</i>, Cambridge University Press, 2000.</dd>
<dt><a name="KernBook" id="KernBook">STC04</a></dt>
<dd>John Shawe-Taylor and Nello Cristianini, <i>Kernel methods for pattern analysis</i>, Cambridge University Press, 2004.</dd>
<dt><a name="Strang" id="Strang">Str76</a></dt>
<dd>Gilbert Strang, <i>Linear algebra and its applications</i>, Academic Press, Inc., 1976.</dd>
<dt><a name="TibHat" id="TibHat">TH09</a></dt>
<dd>Trevor&nbsp;Hastie, Robert&nbsp;Tibshirani, and Jerome&nbsp;Friedman, <i>The elements of statistical learning: Data mining, inference and prediction</i>, Springer, 2009.</dd>
<dt><a name="Trevisan:2008p2166" id="Trevisan:2008p2166">TTV08</a></dt>
<dd>Luca Trevisan, Madhur Tulsiani, and Salil Vadhan, <i>Regularity, boosting, and efficiently simulating every high-entropy distribution</i>, Electronic Colloquium on Computational Complexity (2008), 18.</dd>
<dt><a name="Zeilberger:1995p277" id="Zeilberger:1995p277">Zei95</a></dt>
<dd>Doron Zeilberger, <i>The method of undetermined generalization and specialization illustrated with Fred Galvin&#8217;s amazing proof of the Dinitz conjecture</i>, <a href="http://arxiv.org/abs/math/9506215">http://arxiv.org/abs/math/9506215</a>, 1995.</dd>
</dl>
<h1><a name="SECTION00070000000000000000" id="SECTION00070000000000000000">Acknowledgement</a></h1>
<p><font><font>A thank you to readers who supplied help and comments on earlier drafts.</font></font></p>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot21" id="foot21">&#8230; Mount</a><a href="#tex2html3"><sup>1</sup></a></dt>
<dd>email: <tt><a name="tex2html1" href="mailto:jmount@win-vector.com" id="tex2html1">mailto:jmount@win-vector.com</a></tt> web: <tt><a name="tex2html2" href="http://www.win-vector.com/" id="tex2html2">http://www.win-vector.com/</a></tt></dd>
<dt><a name="foot244" id="foot244">&#8230; principle.</a><a href="#tex2html4"><sup>2</sup></a></dt>
<dd>The pre-existing practice that comes closest to the local to global principle is found in operations research, where encoding a problem to be solved by an optimizer is a central technique. We claim the natural statement of the local to global principle is more general than <font><em>always</em> encoding constraints for a particular optimizer (in particular globalization is not always optimization).</font></dd>
<dt><font><a name="foot43" id="foot43">&#8230; structure</a><a href="#tex2html6"><sup>4</sup></a></font></dt>
<dd><font>By &#8220;link structure&#8221; we mean which web pages link to which other web pages.</font></dd>
<dt><font><a name="foot45" id="foot45">&#8230; graph</a><a href="#tex2html7"><sup>5</sup></a></font></dt>
<dd><font>Remember, a graph is a diagram consisting of nodes and edges (here depicted as arrows).</font></dd>
<dt><font><a name="foot245" id="foot245">&#8230; features</a><a href="#tex2html9"><sup>6</sup></a></font></dt>
<dd><font>For example the model could account for:</font></p>
<ul>
<li>surfers entering and leaving the model</li>
<li>link odds that vary where they are on a page</li>
<li>surfers staying on a page proportional to how much text is on the page</li>
<li>matching known traffic and click behavior where we have such data.</li>
</ul>
<p><font>For simplicity we will just stick with the given example.</font></dd>
<dt><font><a name="foot154" id="foot154">&#8230; components.</a><a href="#tex2html17"><sup>7</sup></a></font></dt>
<dd><font>When a system is named and defined as an exact set of procedures the system can, by definition, not be improved. This is because with any change in procedure we have a new system that no longer matches the original definition and therefore requires a new name.</font></dd>
</dl>
<p><font><br /></font></p>
<hr />
<address><font>John Mount 2009-11-11</font></address>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Permanent Link: Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Good Graphs: Graphical Perception and Data Visualization</title>
		<link>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=good-graphs-graphical-perception-and-data-visualization</link>
		<comments>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 15:40:41 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Cleveland]]></category>
		<category><![CDATA[data exploration]]></category>
		<category><![CDATA[graphical perception]]></category>
		<category><![CDATA[Lattice]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=296</guid>
		<description><![CDATA[What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Permanent Link: Living in A Lognormal World'>Living in A Lognormal World</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='Permanent Link: The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective visualization that neither obscures important details nor drowns us in confusing clutter? In 1985, William Cleveland published a text called <a href="http://www.stat.purdue.edu/~wsc/elements.html"><em>The Elements of Graphing Data,</em></a> inspired by Strunk and White&#8217;s classic writing handbook <a href="http://www.amazon.com/Elements-Style-50th-Anniversary/dp/0205632645"><em>The Elements of Style</em></a>. <em>The Elements of Graphing Data</em> puts forward Cleveland&#8217;s philosophy about how to produce good, clear graphs, not only for presenting one&#8217;s experimental results to peers, but also for the purposes of data analysis and exploration. Cleveland&#8217;s approach is based on a theory of graphical perception: how well the human perceptual system accomplishes certain tasks involved in reading a graph. For a given data analysis task, the goal is to align the information being presented with the perceptual tasks the viewer accomplishes the best. <span id="more-296"></span></p>
<blockquote><p>When a graph is made, quantitative and categorical information is encoded by a display method. Then the information is visually decoded. This visual perception is a vital link. No matter how clever the choice of the information, and no matter how technologically impressive the encoding, a visualization fails if the decoding fails. Some display methods lead to efficient, accurate decoding, and others lead to inefficient, inaccurate decoding. It is only through scientific study of visual perception that informed judgments can be made about display methods. The display methods of <em>Elements</em> rest on a foundation of scientific enquiry.</p></blockquote>
<p>— from the preface of <em>The Elements of Graphing Data</em></p>
<p>A revised edition of <em>The Elements of Graphing Data</em> was published in 1994, along with a companion volume, <a href="http://www.stat.purdue.edu/~wsc/visualizing.html"><em>Visualizing Data,</em></a> which is oriented towards the implementation and technical details of different graphing techniques. I highly recommend <em>The Elements of Graphing Data</em> as a guidebook for creating graphs, as well as for its excellent survey of several useful techniques. Cleveland, along with other colleagues at Bell Labs, developed the <a href="http://stat.bell-labs.com/project/trellis/s.html">Trellis display system,</a> a framework for the visualization of multivariable databases, using the ideas developed in his texts. Trellis, in turn, influenced Deepayan Sarkar&#8217;s Lattice graphics system for R. Lattice implements many of Cleveland&#8217;s ideas, and I also recommend Sarkar&#8217;s <a href="http://lmdvr.r-forge.r-project.org/figures/figures.html">Lattice manual</a> if you do data visualization in R.</p>
<p>It&#8217;s important to note here that Cleveland writes for researchers and decision-makers who use graphs to analyze data, or to convey scientific results to colleagues in an (ideally) objective manner. This distinguishes him from Darrell Huff, whose 1954 <a href="http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728"><em>How to Lie with Statistics</em></a> considered the use of graphs (and statistics in general) as rhetorical devices for convincing others of one&#8217;s point of view. Hence, some of Cleveland&#8217;s recommendations and guidelines actually contradict Huff&#8217;s. <a id="refHuff" href="#Huff"><sup>1</sup></a></p>
<p>Edward Tufte also explored the idea that the choice of graphical display should be influenced by the viewer&#8217;s cognitive processes, in his 1990 book <a href="http://www.edwardtufte.com/tufte/books_ei"><em>Envisioning Information</em></a>. Tufte tends to be more broadly concerned with the gestalt of a graph, beyond its use as an analysis tool; he is also more concerned than Cleveland is with aesthetic considerations.</p>
<p>Cleveland&#8217;s philosophy might be summarized as: <em>minimize the mental gymnastics that the viewer must go through to understand the graph</em>. This leads to some obvious advice: avoid clutter and occlusion, make graphing symbols or color-coding unambiguous, use scale-lines on all four sides of the graph, and so on. It also leads to advice that perhaps should be as obvious, but isn&#8217;t: <em>make the aspect of the data that you want to analyze as clear as possible</em>. But what does this mean in practice?</p>
<p><strong>Make important differences large enough to perceive</strong></p>
<p>Weber&#8217;s Law is a well known observation from the psychophysics literature, which states that the &#8220;just noticeable&#8221; change in a stimulus is a constant ratio of the original stimulus. Put another way, people are only capable of detecting a change in a stimulus that is greater than a certain percentage <em>k</em> of the original stimulus. Here, &#8220;stimulus&#8221; can refer to any perceivable physical quantity: weight, intensity, length, orientation. The percentage <em>k</em> will vary with stimulus, and with observer.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/weberslaw.jpg" border="0" alt="weberslaw.jpg" width="488" height="233" /></div>
</td>
</tr>
</tbody>
<caption>Figure 1: From Cleveland, <em>The Elements of Graphing Data</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>Figure 1 shows the application of Weber&#8217;s law to lengths. The bars A and B are of different lengths, but the difference is such a small fraction of the &#8220;base&#8221; length (say, A&#8217;s length, to be specific) that it is difficult to tell whether or not they are different, or which is longer. On the right, the bars have been embedded in frames of identical length, and now it is easy to see that B is longer. Why? Because the difference in lengths of the <em>white</em> intervals is a much larger percentage of the white &#8220;base&#8221; length (say the white A interval). It is easy to see that the white B interval is shorter than the white A interval, and therefore, the black B interval is longer than the black A interval.</p>
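<p>To put rough numbers on this, here is a minimal sketch of the framed-bar comparison. The lengths are hypothetical (they are not measured from Figure 1), and <tt>relative_difference</tt> is just an illustrative helper name.</p>

```python
def relative_difference(a, b):
    """Difference between two lengths as a fraction of the first ("base") length."""
    return abs(b - a) / a

# Hypothetical lengths in arbitrary units; not measured from Figure 1.
frame = 100.0                  # both frames have the same length
black_a, black_b = 90.0, 92.0  # the two bars being compared

# The white interval is whatever is left of the frame.
white_a = frame - black_a
white_b = frame - black_b

bars_only = relative_difference(black_a, black_b)    # 2 / 90
with_frames = relative_difference(white_a, white_b)  # 2 / 10

print(f"black bars:      {bars_only:.1%}")    # prints 2.2%
print(f"white intervals: {with_frames:.1%}")  # prints 20.0%
```

<p>The absolute difference is the same two units in both cases; only the base length it is judged against has changed.</p>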
<p>The moral is that you always want the viewer to be estimating changes or differences with respect to a short base length. You can do this with reference grids, as demonstrated below.</p>
<table border="0" align="center">
<caption>From Cleveland, <em>The Elements of Graphing Data</em></caption>
<tbody>
<tr>
<td><!-- original 319 by 601 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/noreferencegrids.jpg" border="0" alt="noreferencegrids.jpg" width="200" height="400" align="left" /></td>
<td><!-- original 319 by 601 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/referencegrids1.jpg" border="0" alt="referencegrids.jpg" width="200" height="400" align="right" /></td>
</tr>
<tr>
<td align="center">Figure 2</td>
<td align="center">Figure 3</td>
</tr>
</tbody>
</table>
<p>Figure 2 shows eight curves. Which one dips to the lowest minimum? Are the high curves approaching the same value, and which one is rising the fastest? Are the low curves dipping to the same minimum? Are they going to the same steady state? Figure 3 shows the same curves, graphed with identical reference grids. The grids shorten the base lengths that are being compared, and it is now much easier to compare highs, lows, and steady state behavior.</p>
<p>But wouldn&#8217;t it be better to compare the graphs by superposing them? For two or three curves, perhaps. But in this case, eight curves can clutter the graph, and use up the symbol or color space, making it difficult to distinguish the different datasets &#8212; increasing the mental gymnastics.</p>
<p>Reference grids are useful even for a single curve, especially one with slowly varying segments, such as these graphs have. The reference grid makes it easier to answer questions like: does the process return to the initial state, or to a different steady state? Has the process reached steady state, or is it still growing?</p>
<p><strong>Make important shape changes large enough to perceive: Banking to 45 degrees.</strong></p>
<p>The aspect ratio of a graph is important when trying to understand shape. Rate of change information is encoded in the slope of the curve, which the viewer estimates by changes in the orientation of the local tangents at each point of the graph. Weber&#8217;s Law tells us that very small changes in this orientation will be difficult to detect. For a given (physical) curve, the local orientation changes will be dependent on the aspect ratio of its graphical presentation, as shown (to an exaggerated degree) in Figure 4. Here, the same curve (two line segments) is plotted at three different aspect ratios, one that centers the graph at 45 degrees, one that forces the curve to be nearly vertical, and another that forces it to be nearly horizontal. In the last two cases, the change in orientation of the two line segments is so small as to be nearly undetectable.</p>
<table border="0" align="center">
<caption>Figure 4: From Cleveland</caption>
<tbody>
<tr>
<td><!-- original 670 by 630 --></p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/angles.jpg" border="0" alt="angles.jpg" width="446" height="420" align="left" /></div>
</td>
</tr>
</tbody>
</table>
<p>For two line segments with positive, unequal slopes, a simple geometric argument shows that their absolute difference in orientation is maximized by the aspect ratio that sets their average orientation to 45 degrees (the first graph in Figure 4). Empirical studies by Cleveland and others have indeed verified that a viewer&#8217;s ability to judge the relative slopes of line segments on a graph is maximized when the absolute values of the orientations of the segments are centered on 45 degrees.</p>
<p>This result leads to a technique called <em>Banking to 45</em>, whereby the aspect ratio of the graph is chosen so that the average orientation of the line segments that make up the curve is 45 degrees. The details are discussed in Cleveland, and many of the plots in R&#8217;s Lattice package also have an option to bank the graph to 45 degrees.</p>
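<p>The geometric argument for two segments can be checked numerically. The sketch below is my own illustration, not Cleveland&#8217;s procedure: the slopes are hypothetical, and <tt>aspect</tt> stands for a vertical stretch applied to the plot. Setting the derivative of the orientation difference to zero gives the condition that the two drawn slopes multiply to one, which is the same as saying their orientations average 45 degrees.</p>

```python
import math

def orientation_gap(s1, s2, aspect):
    """Angle in degrees between two segments with data slopes s1 and s2
    once the vertical axis has been stretched by the factor `aspect`."""
    return abs(math.degrees(math.atan(aspect * s1) - math.atan(aspect * s2)))

def best_aspect(s1, s2):
    """Stretch that maximizes the gap for positive slopes:
    it makes the drawn slopes multiply to 1."""
    return 1.0 / math.sqrt(s1 * s2)

s1, s2 = 4.0, 9.0        # hypothetical positive, unequal data slopes
a = best_aspect(s1, s2)  # 1/6 for these slopes

theta1 = math.degrees(math.atan(a * s1))
theta2 = math.degrees(math.atan(a * s2))
mean_orientation = (theta1 + theta2) / 2  # 45 degrees, up to rounding

# Compare the banked plot with a much flatter and a much steeper one.
for aspect in (a / 10, a, a * 10):
    print(f"aspect {aspect:.4f}: gap {orientation_gap(s1, s2, aspect):.2f} degrees")
```

<p>The nearly flat and nearly vertical versions give the same, much smaller, gap, which is the situation drawn in Figure 4.</p>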
<p>This deliberate exaggeration of slope is something that Darrell Huff deplores. In <em>How to Lie with Statistics</em>, Huff refers to these graphs as &#8220;gee-whiz&#8221; graphs — and in the context of his discussion of statistics as rhetoric, they are:</p>
<table border="0" align="center">
<caption>Figure 5: From Huff, <em>How to Lie With Statistics</em></caption>
<tbody>
<tr>
<td><!-- original 461 by 351 --></p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/geewhiz.jpg" border="0" alt="geewhiz.jpg" width="461" height="351" /></div>
</td>
</tr>
</tbody>
</table>
<p>To insist that a graph should always include a zero line and that units be in proportion may be good advice from a rhetorical perspective; but it is poor advice if the purpose of the graph is data analysis. As Figure 6 below demonstrates, we can lose resolution if we always insist on including the zero. Does the trend line in the left graph increase linearly, superlinearly, or sublinearly? The convexity of the curve is more apparent when it is banked to 45, as on the right. Assuming that the scientist reads the axis and is cognizant of the actual magnitude changes involved, the graph on the right conveys more information.</p>
<table border="0" align="center">
<caption>Figure 6: From Cleveland</caption>
<tbody>
<tr>
<td><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/bank451.jpg" border="0" alt="bank45.jpg" width="500"  /></td>
</tr>
</tbody>
</table>
<p><strong>Make sure all the data is equally well resolved.</strong></p>
<p>It is quite common for positive data (word frequencies, populations, price distributions, just to name a few examples) to be skewed: most of the data is bunched towards low values, and the rest of it is spread out on a very long tail. This long tail squashes the majority of the data into a tiny interval of a very narrow dynamic range, as in Figure 7, making it difficult to evaluate the data.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original size: 990 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/skewed1.gif" border="0" alt="skewed.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 7: Long-tailed distribution of purchase sizes</caption>
</table>
</td>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original = 499 by 675 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/logskewed1.gif" border="0" alt="logskewed.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 8: Distribution of log(purchase size)</caption>
</table>
</td>
</tr>
</tbody>
</table>
<p>Imagine that Figure 7 represents the distribution of average purchase size across an online merchant&#8217;s customers: average purchase size is plotted on the x-axis, and the y-axis represents the fraction of the total customer population whose average purchase size is a given value (the area under the graph integrates to one). According to this graph, most customers make fairly small purchases on average, but there is a long tail of big spenders trailing out into the range of several thousand dollars. Obviously, one would like a little more resolution on the big spike of customers near zero. One could simply &#8220;zoom in&#8221; on this range, by chopping off some long chunk of the tail, but you may potentially lose sight of some global patterns in the data by doing so.</p>
<p>Graphing the distribution of log(purchase size) enables you to increase the resolution near zero, while preserving the global view. Figure 8 shows the distribution of log(purchase size), revealing two spending populations: a population of high spenders who tend to make purchases in the $3000 range (in log space), and another population whose purchases are centered (in log space) around $60. The existence of these two distinct populations is not apparent in the original graph.</p>
<p>Notice that Figure 8 has two x-axis scales: the top axis is marked in log units, while the bottom axis is marked in absolute dollars, spaced on a log scale. This accords with the principle of minimizing mental gymnastics, since the viewer of the graph will typically be concerned about prices in dollars, not log dollars. In fact, it would have been better yet to have plotted the distribution of log<sub>2</sub> or log<sub>10</sub> of the data; the former would allow us to see at a glance the doubling of price ranges, the latter to see price changes in factors of ten.</p>
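<p>As a small sketch of the two-scale idea, the helper below (a hypothetical function, not part of any plotting library) lists where tick marks would fall if the data were plotted in log<sub>10</sub> or log<sub>2</sub> units: each tick has a position in log units and a label in plain dollars. It assumes dollar amounts of at least one and an integer base.</p>

```python
def log_ticks(lo, hi, base=10):
    """Return (position, label) pairs for the powers of `base` in [lo, hi].

    position is the exponent (where the tick sits on the log-unit axis);
    label is the dollar amount printed on the second axis.
    Assumes 1 <= lo <= hi and an integer base >= 2.
    """
    ticks = []
    exponent, value = 0, 1
    # Skip the powers that fall below the plotted range.
    while value < lo:
        value *= base
        exponent += 1
    # Collect the powers that fall inside the plotted range.
    while value <= hi:
        ticks.append((exponent, value))
        value *= base
        exponent += 1
    return ticks

print(log_ticks(5, 5000, base=10))  # [(1, 10), (2, 100), (3, 1000)]
print(log_ticks(50, 500, base=2))   # [(6, 64), (7, 128), (8, 256)]
```

<p>With base 10 each tick is a factor of ten in price; with base 2 each tick is a doubling.</p>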
<table border="0" align="center">
<caption>Figure 9: The 14 most abundant elements in meteorites. From Cleveland</caption>
<tbody>
<tr>
<td><!-- original = 543 by 522 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/metals.jpg" border="0" alt="metals.jpg" width="250" /></td>
<td><!-- original = 550 by 600 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/logmetals.jpg" border="0" alt="logmetals.jpg" width="250" /></td>
</tr>
</tbody>
</table>
<p>Figure 9 shows another example: the fourteen most abundant elements in meteorites, specifically the average percent of each of the elements. If we graph the percentages directly, as on the left, we cannot easily distinguish the differences in the elements from aluminum on down. Graphing log<sub>2</sub> of the percentages, as on the right, improves the resolution. Again, we have two x-axes on the graph of the log data.</p>
<p><strong>If you want to analyze the difference between two processes, then graph the difference, not the processes (or graph both).</strong></p>
<p>Suppose that we are comparing the two processes f1 and f2 that are shown in Figure 10. As x increases, the two processes appear to be approaching each other  — that is, the difference between the two seems to be decreasing. In reality, the difference between the two is constant: f2 = f1+1.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original size: 990 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/difference1.gif" border="0" alt="difference.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 10: The illusion of convergence</caption>
</table>
</td>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original = 499 by 675 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/imports.jpg" border="0" alt="imports.jpg" width="250" /></td>
</tr>
</tbody>
<caption>Figure 11: British Imports and Exports. From Cleveland</caption>
</table>
</td>
</tr>
</tbody>
</table>
<p>It turns out that people are good at perceiving the perpendicular difference between two curves, but not the differences in height, which is what we are actually interested in here. When we try to infer the differences from the process graph, we may not only miss key information, we may actually draw incorrect conclusions.</p>
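<p>A minimal sketch of why the eye is fooled, under assumptions of my own: <tt>f1</tt> below is a hypothetical steepening curve (not the one in Figure 10), and the perpendicular gap is approximated by treating the two curves as locally parallel lines, for which a vertical offset d at slope m gives a perpendicular distance of d / sqrt(1 + m<sup>2</sup>).</p>

```python
import math

def f1(x):
    return x * x        # hypothetical steepening process, not the curve in Figure 10

def f2(x):
    return f1(x) + 1.0  # constant vertical offset, as in the text

def slope_f1(x):
    return 2.0 * x      # derivative of the hypothetical f1

def vertical_gap(x):
    return f2(x) - f1(x)

def approx_perpendicular_gap(x):
    """Local approximation: treat both curves as parallel lines of slope m,
    whose perpendicular distance is (vertical offset) / sqrt(1 + m^2)."""
    return vertical_gap(x) / math.sqrt(1.0 + slope_f1(x) ** 2)

for x in (0.0, 1.0, 2.0, 4.0):
    print(f"x={x:.0f}  vertical={vertical_gap(x):.3f}  "
          f"perpendicular~{approx_perpendicular_gap(x):.3f}")
```

<p>The vertical gap stays at one throughout, while the approximate perpendicular gap shrinks as the curves steepen, so the curves look as if they are converging.</p>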
<p>A more realistic example is given in Figure 11. Here the imports to and exports from England are graphed over the first 80 years of the 18th century. In the difference graph on the bottom, we can see a local peak in (imports-exports) just after 1760; this is not obvious from simply comparing the two processes (top graph).</p>
<p><strong>If you are interested in rate of change, then graph rate of change.</strong></p>
<p>In Figure 12, we see the population figures for a given community from 1990 to 2009. Obviously, the population is steadily increasing, but how quickly? Is the rate of population growth increasing over time, or is it decreasing? If we are interested in these questions, then simply graphing the population over time is not enough. We need to look at the rate of change directly.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<caption>Figure 12</caption>
<tbody>
<tr>
<td><!-- original 998 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/rateofchange1.gif" border="0" alt="rateofchange.gif" width="250" /></td>
</tr>
</tbody>
</table>
</td>
<td>
<table border="0">
<caption>Figure 13</caption>
<tbody>
<tr>
<td><!-- original 720 by 720 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lograteofchange2.gif" border="0" alt="lograteofchange.gif" width="250" /></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The classic way to do this is by graphing the logarithm of the data: on a log scale a constant growth rate appears as a straight line, so a change in slope is a change in growth rate. In Figure 13, we have graphed log<sub>2</sub> of the population over time, with the log scale printed on the right-hand y-axis, and the actual population numbers printed on a log scale on the left-hand axis. Now we can see that the population increased at a constant rate from 1990 to 2000, quadrupling approximately every four years, and then slowed down (to a lower constant rate) after 2000.</p>
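<p>Reading a growth rate off a log<sub>2</sub> plot amounts to the arithmetic sketched below. The population figures are hypothetical (they are not read from Figure 12), and the function names are my own.</p>

```python
import math

def doublings_per_year(pop_start, pop_end, years):
    """Slope of log2(population) against time, in doublings per year."""
    return math.log2(pop_end / pop_start) / years

def years_to_multiply(factor, rate):
    """Years needed to grow by `factor` at `rate` doublings per year."""
    return math.log2(factor) / rate

# Hypothetical figures: a population growing from 1000 to 4000 in 4 years.
rate = doublings_per_year(1000, 4000, 4)

print(rate)                        # 0.5 doublings per year
print(years_to_multiply(2, rate))  # 2.0 years to double
print(years_to_multiply(4, rate))  # 4.0 years to quadruple
```

<p>A straight stretch on the log<sub>2</sub> graph means this slope is constant; a bend means the growth rate has changed.</p>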
<p><strong>Graphs as a research tool</strong></p>
<p>Throughout this discussion, we have considered graphs as a tool for data exploration and initial understanding. It is an iterative process: as questions arise, the data will be reprocessed and re-plotted to highlight the new issues to be examined. A good research graph must display this information directly, with a minimum of mental gymnastics, but (as with any research tool) there can be a learning curve. For example, density plots (such as those shown in Figures 7 and 8) are in my opinion more useful than histograms for understanding how numerical data is distributed, and I am constantly surprised at the amount of explanation that they require when I show them to people who are unfamiliar with them. A number of very useful graphs that are discussed in Cleveland&#8217;s texts meet with the same reaction from people who encounter that style of graph for the first time. This is a disadvantage, relative to using a more fashionable graph, when attempting to communicate results. But the insight into the data that these graphs provide often makes it worth spending the time to educate clients or peers on how to read the graph.</p>
<p>Even so, a good graph still may not be a quick read. As Cleveland writes:</p>
<blockquote><p>While there is a place for rapidly-understood graphs, it is too limiting to make speed a requirement in science and technology, where the use of graphs ranges from detailed in-depth data analysis to quick presentation.<br />
&#8230;</p>
<p>The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.</p></blockquote>
<p>- <em>The Elements of Graphing Data</em>, Chapter 2</p>
<hr /><a id="Huff" href="#refHuff">[Back]</a><sup>1</sup><em>How to Lie with Statistics</em> is an entertaining (if a little dated) discussion of how to read statistical and quantitative claims critically, and is definitely worth a read.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Permanent Link: Living in A Lognormal World'>Living in A Lognormal World</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='Permanent Link: The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Exciting Technique #1: The &#8220;R&#8221; language.</title>
		<link>http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=exciting-technique-1-the-r-language</link>
		<comments>http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/#comments</comments>
		<pubDate>Thu, 22 Jan 2009 19:59:01 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=26</guid>
		<description><![CDATA[Our first &#8220;exciting technique&#8221; article is about a statistical language called &#8220;R.&#8221; R is a language for statistical analysis available from http://cran.r-project.org/ . The things you can immediately do with it are incredible. You can import a spreadsheet and immediately spot relationships, trends and anomalies. R gives you instant access to top-notch visualization methods [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Our first &#8220;exciting technique&#8221; article is about a statistical language called &#8220;R.&#8221;</p>
<p>R is a language for statistical analysis available from <a href="http://cran.r-project.org/">http://cran.r-project.org/</a> .  The things you can immediately do with it are incredible.  You can import a spreadsheet and immediately spot relationships, trends and anomalies.  R gives you instant access to top-notch visualization methods and sophisticated statistical methods.</p>
<p><span id="more-26"></span></p>
<p>R is so hot (a strange thing to say about a statistics package) that it was the subject of a recent New York Times article: <a href="http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html">http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html</a> .  If you read between the lines, some of the interviewees come off as being slightly threatened by R (there is a slight hint of &#8220;R is very good for others&#8221;).  In fact R is simply very good.  A good statistician with R can do things that a great statistician without R cannot.  Like all tools, R is dangerous: ask for the wrong analysis and you will draw wrong and misleading conclusions.  Ask for the right analysis and R will correctly perform it while tracking critical implementation details that would take you hundreds of hours to master on your own.</p>
<p>Want to produce graphs using the theories of perception and analysis of W. S. Cleveland?  Simple: use Deepayan Sarkar&#8217;s &#8220;Lattice&#8221; package, which even has a wonderful book.</p>
<p>Want to find subtle relationships in your data using logistic regression (one of the more complicated cousins of linear regression)?  That is built into the base R system.</p>
<p>Need to re-run all of your analyses because the data has changed?  R is script based and stores your command history.  A single paste can re-run a 20 step analysis and re-build a 10 slide presentation.</p>
<p>Impressed by a particular type of analysis? Take, for example, Roger Koenker&#8217;s &#8220;Quantile Regression&#8221; (which is a brilliant idea backed by a masterpiece of a book).  Guess what, the original author has supplied a free R-module that implements the ideas.</p>
<p>Want to give a client working software?  Easy, R is open source and comes with very good automated installers for OSX, Linux and Windows.</p>
<p>Want to train somebody to use R?  Easy, R has an extensive library of excellent books and there is even an exciting set of books with a series title &#8220;Use R!&#8221;</p>
<p>Want to learn the internals of R from John M. Chambers (one of the inventors of the &#8220;S&#8221; language that R is an implementation of)?  You are in luck: the latest book by Chambers is &#8220;Software for Data Analysis: Programming with R.&#8221;  R is so popular that it has managed to pull one of the creators of the S language and the proprietary S+ implementation into its world.</p>
<p>It is almost getting to the point where you need to justify not using R.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
