<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Mathematics</title>
	<atom:link href="http://www.win-vector.com/blog/category/mathematics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Thu, 29 Jul 2010 17:09:49 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Gradients via Reverse Accumulation</title>
		<link>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=gradients-via-reverse-accumulation</link>
		<comments>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 00:00:04 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Automatic Differentiation]]></category>
		<category><![CDATA[Conjugate Gradient]]></category>
		<category><![CDATA[Gradient]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Reverse Accumulation]]></category>
		<category><![CDATA[Scala]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1493</guid>
		<description><![CDATA[We extend the ideas from Automatic Differentiation with Scala to include reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients. As the tables, diagrams and equations do not translate well into HTML, our full article is available here in PDF: [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/' rel='bookmark' title='Permanent Link: &#8220;Easy&#8221; Portfolio Allocation'>&#8220;Easy&#8221; Portfolio Allocation</a></li>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='Permanent Link: A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We extend the ideas from <a target="ext" href="http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/">Automatic Differentiation with Scala</a> to include <em>reverse accumulation</em>.  Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients.<span id="more-1493"></span><br />
As the tables, diagrams and equations do not translate well into HTML, our full article is available here in PDF: <a href="http://www.win-vector.com/dfiles/ReverseAccumulation.pdf">http://www.win-vector.com/dfiles/ReverseAccumulation.pdf</a>.</p>
<p>The purpose of our article is to explain reverse accumulation automatic differentiation clearly (and to release some sample code and timing results).  A side effect of the article is to make sense of the following two diagrams:</p>
<p>If the following is a picture of standard or forward differentiation:</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/cutFwd.png" alt="cutFwd.png" border="0" width="408" height="677" /></p>
<p>then the following is a picture of reverse accumulation:</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/cutRev.png" alt="cutRev.png" border="0" width="487" height="739" /></p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/' rel='bookmark' title='Permanent Link: &#8220;Easy&#8221; Portfolio Allocation'>&#8220;Easy&#8221; Portfolio Allocation</a></li>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='Permanent Link: A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Automatic Differentiation with Scala</title>
		<link>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=automatic-differentiation-with-scala</link>
		<comments>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/#comments</comments>
		<pubDate>Tue, 15 Jun 2010 04:19:20 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Automatic Differentiation]]></category>
		<category><![CDATA[Conjugate Gradient]]></category>
		<category><![CDATA[Dual Numbers]]></category>
		<category><![CDATA[Geometric Median]]></category>
		<category><![CDATA[Numeric Methods]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Scala]]></category>
		<category><![CDATA[Steiner Tree]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1481</guid>
		<description><![CDATA[This article is a worked-out exercise in applying the Scala type system to solve a small scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion. Usually we work using a combination of databases, Java, optimization libraries and analysis suites (like R). The reason is [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Permanent Link: Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>This article is a worked-out exercise in applying the <a href="http://www.scala-lang.org/" target="ext">Scala</a> type system to solve a small scale optimization problem.    For this article we supply <a href="http://www.win-vector.com/dfiles/ScalaDiff.jar">complete Scala source code</a> (under a GPLv3 license) and some design discussion.<span id="more-1481"></span><br />
Usually we work using a combination of databases, Java, optimization libraries and analysis suites (like R).  The reason is that, for our typical problems, Java hits a sweet spot of trading off runtime performance against ease of development and maintenance.  In the tens of gigabytes range (data sets larger than Wikipedia but smaller than the Web) Java outperforms the scripting languages (Ruby, Python &#8230;) and is much easier to develop in and document than C++.  This sweet spot is both subjective and situational: if the tasks were smaller and in a services framework Python would be a better choice, if performance is paramount then C or C++ (with the STL) and Hadoop are a better choice, if pre-built statistical libraries are needed then R becomes a better choice.  For the type of problem we present here Scala is a very good choice.</p>
<style type="text/css">
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
pre { line-height: 125%; }
body .hll { background-color: #ffffcc }
body  { background: #f8f8f8; }
body .c { color: #408080; font-style: italic } /* Comment */
body .err { border: 1px solid #FF0000 } /* Error */
body .k { color: #008000; font-weight: bold } /* Keyword */
body .o { color: #666666 } /* Operator */
body .cm { color: #408080; font-style: italic } /* Comment.Multiline */
body .cp { color: #BC7A00 } /* Comment.Preproc */
body .c1 { color: #408080; font-style: italic } /* Comment.Single */
body .cs { color: #408080; font-style: italic } /* Comment.Special */
body .gd { color: #A00000 } /* Generic.Deleted */
body .ge { font-style: italic } /* Generic.Emph */
body .gr { color: #FF0000 } /* Generic.Error */
body .gh { color: #000080; font-weight: bold } /* Generic.Heading */
body .gi { color: #00A000 } /* Generic.Inserted */
body .go { color: #808080 } /* Generic.Output */
body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
body .gs { font-weight: bold } /* Generic.Strong */
body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
body .gt { color: #0040D0 } /* Generic.Traceback */
body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
body .kp { color: #008000 } /* Keyword.Pseudo */
body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
body .kt { color: #B00040 } /* Keyword.Type */
body .m { color: #666666 } /* Literal.Number */
body .s { color: #BA2121 } /* Literal.String */
body .na { color: #7D9029 } /* Name.Attribute */
body .nb { color: #008000 } /* Name.Builtin */
body .nc { color: #0000FF; font-weight: bold } /* Name.Class */
body .no { color: #880000 } /* Name.Constant */
body .nd { color: #AA22FF } /* Name.Decorator */
body .ni { color: #999999; font-weight: bold } /* Name.Entity */
body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
body .nf { color: #0000FF } /* Name.Function */
body .nl { color: #A0A000 } /* Name.Label */
body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
body .nt { color: #008000; font-weight: bold } /* Name.Tag */
body .nv { color: #19177C } /* Name.Variable */
body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
body .w { color: #bbbbbb } /* Text.Whitespace */
body .mf { color: #666666 } /* Literal.Number.Float */
body .mh { color: #666666 } /* Literal.Number.Hex */
body .mi { color: #666666 } /* Literal.Number.Integer */
body .mo { color: #666666 } /* Literal.Number.Oct */
body .sb { color: #BA2121 } /* Literal.String.Backtick */
body .sc { color: #BA2121 } /* Literal.String.Char */
body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
body .s2 { color: #BA2121 } /* Literal.String.Double */
body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
body .sh { color: #BA2121 } /* Literal.String.Heredoc */
body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
body .sx { color: #008000 } /* Literal.String.Other */
body .sr { color: #BB6688 } /* Literal.String.Regex */
body .s1 { color: #BA2121 } /* Literal.String.Single */
body .ss { color: #19177C } /* Literal.String.Symbol */
body .bp { color: #008000 } /* Name.Builtin.Pseudo */
body .vc { color: #19177C } /* Name.Variable.Class */
body .vg { color: #19177C } /* Name.Variable.Global */
body .vi { color: #19177C } /* Name.Variable.Instance */
body .il { color: #666666 } /* Literal.Number.Integer.Long */
 </style>
<h2>Our Example Problem</h2>
<p>Our small scale problem is this:  we have a number of target points on a map and we want to pick a central point to <em>directly</em> connect to all of these points with wire.  Our goal is to minimize the total amount of wire used.  This problem is called the <a href="http://en.wikipedia.org/wiki/Geometric_median" ref="ext">&#8220;Geometric Median&#8221;</a>.  So we are trying to find a point that minimizes the sum of distances from our chosen center to every target point. If we were trying to minimize the sum of squared distances from our chosen center to every target point the answer would be obvious: the average or mean (which by Hooke&#8217;s law is also the point where a set of identical springs would relax to).  The mean is in fact a fairly good guess, but you can do better (which could be important if the &#8220;wire&#8221; is expensive, such as cutting irrigation or drainage ditches).  For example, given the three target points (20,0), (-1,-1) and (-1,1) the optimal point is (-0.42,0), not the mean (6,0), and the choice of optimal point represents an over 19% savings in total wiring distance (see figure).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/points.png" alt="points.png" border="0" width="525" height="525" /><br />
</center></p>
<p>This is a substantial saving in cost.  </p>
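<p>As a quick check of the figures quoted above, the following sketch computes the total wire length from the mean and from the quoted optimal point.  It is our own illustration: the names <code>SavingsCheck</code>, <code>targets</code> and <code>wireLength</code> are hypothetical and are not part of the article&#8217;s downloadable source.</p>
<div class="highlight">
<pre>object SavingsCheck {
  // The three target points from the example.
  val targets: Array[(Double, Double)] =
    Array((20.0, 0.0), (-1.0, -1.0), (-1.0, 1.0))

  // Total straight-line wire length from a proposed center to every target.
  def wireLength(cx: Double, cy: Double): Double =
    targets.map { case (x, y) =&gt; math.hypot(x - cx, y - cy) }.sum

  def main(args: Array[String]): Unit = {
    val atMean = wireLength(6.0, 0.0)       // center placed at the mean
    val atOptimum = wireLength(-0.42, 0.0)  // center placed at the quoted optimum
    val saving = (atMean - atOptimum) / atMean
    println(f"mean: $atMean%.2f  optimum: $atOptimum%.2f  saving: ${saving * 100}%.1f%%")
  }
}
</pre>
</div>
<p>The two lengths come out at roughly 28.14 and 22.73, which is consistent with the savings of a little over 19% stated above.</p>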
<p>The problem changes as we consider variations.  If indirect connections (such as routing one point through another, which may or may not be possible for reasons of capacity or safety) and multiple new centers are allowed we then have an instance of the <a href="http://en.wikipedia.org/wiki/Steiner_tree_problem" ref="ext">Steiner Tree Problem</a>, which is harder to solve (it is known to be NP-hard).  If no new centers are allowed (all routing must be between pre-existing target points) then we have a Minimum Spanning Tree Problem, which admits very quick solutions.</p>
<p>We bring up the geometric median as a mere example.  We don&#8217;t intend for our code to solve only the geometric median problem and we don&#8217;t intend to touch on the literature of specialized methods for solving the geometric median problem.  Instead we are trying to demonstrate the speed with which you can develop prototype solutions if you have a few good tools (like various optimizers) available in your toolkit.  Numeric optimizers may sound exotic, but they are often the kind of thing you want to experiment with and link directly into your code.</p>
<h2>Optimization as General Tool</h2>
<p>Now that we have the example problem we can describe a solution strategy.  In this case the solution uses code &#8220;we wished we had lying around&#8221; before we started on the problem.  We will pretend we have the tools we want ready to solve our problem and then we will pay our debt and build the required tools.  The issue is that there is not an obvious closed form for the solution of the geometric median problem.  So we are forced to work a bit harder.  In this case harder means we need to solve an optimization problem.  Consider the contour plot of the total wiring cost as a function of where we choose to place our center.  Our optimal point (-0.42,0) has a wiring cost of 22.73 and the contour plot given here shows concentric regions of solution positions with higher cost.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/contour.png" alt="contour.png" border="0" width="525" height="525" /><br />
</center></p>
<p>In general it is unwise to throw an optimizer at an arbitrary problem and hope to find the globally best solution.  But in this case (and in many similar situations) we can prove that a simple local optimizer will in fact find the unique best solution.  This is a property of the problem, not of the optimizer.  The concentric regions shown in the contour plot have a very nice shape: they are <a href="http://en.wikipedia.org/wiki/Convex_set" ref="ext">convex</a>.   That is: they have no intrusions.  For any two points drawn from one of these shapes the straight line segment between these points stays inside the given shape.  We don&#8217;t have to depend on observation: we can actually prove this is always the case for this problem.  The wiring cost from a proposed center to any single target point is a <a href="http://en.wikipedia.org/wiki/Convex_function" ref="ext">convex function</a> of where we choose to place our center (a convex function is a function whose graph never reaches above the secant line drawn between any two points on its graph).  The total wiring cost is just the sum of the wiring costs to each target point.  And to finish: the sum of a collection of convex functions is itself a convex function.  Since the regions enclosed by the contours of a convex function are always convex shapes, we have proven the statement.</p>
<p>But how does this help us?  There is a standard technique to find &#8220;local minima&#8221; of a function by inspecting a function for places where the gradient is zero (points where there is no obvious downhill direction on the contour plot).  This technique usually can only be guaranteed to find local minima (places where no small change improves your situation).  There is no guarantee that the local minimum you find is in fact the global minimum (the best possible solution), except when you are dealing with a convex function.  When a function is convex then all of the local minima are always grouped together into a single convex connected shape (if not, a line drawn between two remote minima would violate the convexity definition).  And if the function is never flat then this set is a single unique point: the unique best solution.  Our inspection technique will be a gradient driven optimizer: that is, an optimizer that improves its objective by running downhill when the gradient is non-zero and halts when the gradient is zero.</p>
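<p>To make the idea of a gradient driven optimizer concrete, here is a minimal sketch of plain gradient descent with a backtracking line search applied to the three-point example.  This is not the conjugate gradient method used in the article, and every name in it (<code>DescentSketch</code>, <code>cost</code>, <code>grad</code>, <code>descend</code>) is hypothetical; the gradient is written out by hand and is only valid when the proposed center does not sit exactly on a target point.</p>
<div class="highlight">
<pre>object DescentSketch {
  val targets: Array[Array[Double]] =
    Array(Array(20.0, 0.0), Array(-1.0, 1.0), Array(-1.0, -1.0))

  // Objective: sum of Euclidean distances from p to each target.
  def cost(p: Array[Double]): Double =
    targets.map(t =&gt; math.hypot(p(0) - t(0), p(1) - t(1))).sum

  // Hand-written gradient: sum of unit vectors pointing from each target to p.
  def grad(p: Array[Double]): Array[Double] = {
    val g = Array(0.0, 0.0)
    for (t &lt;- targets) {
      val d = math.hypot(p(0) - t(0), p(1) - t(1))
      g(0) += (p(0) - t(0)) / d
      g(1) += (p(1) - t(1)) / d
    }
    g
  }

  // Run downhill while the gradient is non-zero; halt when it is (numerically) zero.
  def descend(start: Array[Double], maxIter: Int = 200, tol: Double = 1.0e-9): Array[Double] = {
    var p = start.clone()
    var iter = 0
    var done = false
    while (iter &lt; maxIter &amp;&amp; !done) {
      val g = grad(p)
      val g2 = g(0) * g(0) + g(1) * g(1)
      if (g2 &lt; tol * tol) {
        done = true
      } else {
        var step = 1.0
        var next = Array(p(0) - step * g(0), p(1) - step * g(1))
        // Halve the step until the cost drops by a sufficient amount.
        while (cost(next) &gt; cost(p) - 1.0e-4 * step * g2 &amp;&amp; step &gt; 1.0e-12) {
          step /= 2.0
          next = Array(p(0) - step * g(0), p(1) - step * g(1))
        }
        p = next
      }
      iter += 1
    }
    p
  }
}
</pre>
</div>
<p>Calling <code>DescentSketch.descend(Array(6.0, 0.0))</code> starts from the mean and should settle near the optimal point (-0.42,0) discussed earlier.  Because the objective is convex, the starting point affects only how long the search takes, not where it ends.</p>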
<p>The stated function to minimize is to sum the distance from our proposed center to each target point.  We can write this as the sum of the distances:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/dist1.png" alt="dist1.png" border="0" width="309" height="81" /><br />
</center></p>
<p>( <img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/euclid1.png" alt="euclid1.png" border="0" width="119" height="37" /> which is the traditional Euclidean or L2 distance).  This function actually has one subtle flaw that we will deal with in the appendix (see: Fixing Smoothness).</p>
<h2>Using Scala to Apply the Optimization Solution</h2>
<p>To find our optimal center placement using Scala we first write our cost or objective as a Scala function:</p>
<div class="highlight">
<pre>    <span class="k">val</span> <span class="n">dat</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]]</span> <span class="o">=</span> <span class="nc">Array</span><span class="o">(</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="mi">20</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
      <span class="nc">Array</span><span class="o">(</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">,</span> <span class="o">-</span><span class="mf">1.0</span><span class="o">)</span>
    <span class="o">)</span>

    <span class="k">def</span> <span class="n">fx</span><span class="o">(</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="kt">Double</span> <span class="o">=</span> <span class="o">{</span>
      <span class="k">val</span> <span class="n">dim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
      <span class="k">val</span> <span class="n">npoint</span> <span class="k">=</span> <span class="n">dat</span><span class="o">.</span><span class="n">length</span>
      <span class="k">var</span> <span class="n">total</span> <span class="k">=</span> <span class="mf">0.0</span>
      <span class="k">for</span><span class="o">(</span><span class="n">k</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">npoint</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
        <span class="k">var</span> <span class="n">term</span> <span class="k">=</span> <span class="mf">0.0</span>
        <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">dim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
          <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">-</span> <span class="n">dat</span><span class="o">(</span><span class="n">k</span><span class="o">)(</span><span class="n">i</span><span class="o">)</span>
          <span class="n">term</span> <span class="k">=</span> <span class="n">term</span> <span class="o">+</span> <span class="n">diff</span><span class="o">*</span><span class="n">diff</span>
        <span class="o">}</span>
        <span class="n">total</span> <span class="k">=</span> <span class="n">total</span> <span class="o">+</span> <span class="n">scala</span><span class="o">.</span><span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="o">(</span><span class="n">term</span><span class="o">)</span>
      <span class="o">}</span>
      <span class="n">total</span>
    <span class="o">}</span>
</pre>
</div>
<p>Scala is succinct and it is a great convenience to have a function definition capture data from its environment.   What we would like to do is generate an initial guess at the solution (we use the mean as our initial guess) and then call an optimizer (in this case a conjugate gradient optimizer) to do all the work:</p>
<div class="highlight">
<pre> <span class="k">val</span> <span class="n">p0</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="o">=</span> <span class="n">mean</span><span class="o">(</span><span class="n">dat</span><span class="o">)</span>
 <span class="k">val</span> <span class="o">(</span><span class="n">pF</span><span class="o">,</span><span class="n">fpF</span><span class="o">)</span> <span class="k">=</span> <span class="nc">CG</span><span class="o">.</span><span class="n">minimize</span><span class="o">(</span><span class="n">fx</span><span class="o">,</span><span class="n">p0</span><span class="o">)</span>
</pre>
</div>
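<p>The <code>mean()</code> helper used above is not shown in this article.  A minimal sketch of what such a helper might look like (an assumption on our part, wrapped in a hypothetical <code>MeanSketch</code> object, and not necessarily the version in the downloadable source) is:</p>
<div class="highlight">
<pre>object MeanSketch {
  // Coordinate-wise mean of a non-empty array of points of equal dimension.
  def mean(pts: Array[Array[Double]]): Array[Double] = {
    val dim = pts(0).length
    val sum = new Array[Double](dim)
    for (pt &lt;- pts; i &lt;- 0 until dim) {
      sum(i) += pt(i)
    }
    sum.map(_ / pts.length)
  }
}
</pre>
</div>
<p>For the three example points this returns (6,0), the starting guess mentioned earlier.</p>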
<p>At this point we would be done, except the conjugate gradient method (which is superior to gradient descent and many of the non-gradient methods) requires a gradient.<br />
We could provide a numeric estimate of the gradient by the following divided difference method:</p>
<div class="highlight">
<pre>  <span class="k">def</span> <span class="n">gradientD</span><span class="o">(</span><span class="n">f</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span><span class="k">=&gt;</span><span class="kt">Double</span><span class="o">,</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="o">=</span> <span class="o">{</span>
    <span class="k">val</span> <span class="n">xdim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
    <span class="k">val</span> <span class="n">p2</span> <span class="k">=</span> <span class="n">copy</span><span class="o">(</span><span class="n">p</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">base</span> <span class="k">=</span> <span class="n">f</span><span class="o">(</span><span class="n">p2</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">ret</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">](</span><span class="n">xdim</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">delta</span> <span class="k">=</span> <span class="mf">1.0e-6</span>
    <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">xdim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
      <span class="n">p2</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">+</span> <span class="n">delta</span>
      <span class="k">val</span> <span class="n">fplus</span> <span class="k">=</span> <span class="n">f</span><span class="o">(</span><span class="n">p2</span><span class="o">)</span>
      <span class="n">p2</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span>
      <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="o">(</span><span class="n">fplus</span><span class="o">-</span><span class="n">base</span><span class="o">)/</span><span class="n">delta</span>
      <span class="n">ret</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="k">=</span> <span class="n">diff</span>
    <span class="o">}</span>
    <span class="n">ret</span>
  <span class="o">}</span>
</pre>
</div>
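<p>The choice of <code>delta</code> matters in a divided difference.  The following sketch (our own illustration with hypothetical names, using the square root function because its derivative is known exactly) prints the error of a forward divided difference for several step sizes:</p>
<div class="highlight">
<pre>object StepSizeSketch {
  // Forward divided difference estimate of f'(x) with step h.
  def forwardDiff(f: Double =&gt; Double, x: Double, h: Double): Double =
    (f(x + h) - f(x)) / h

  def main(args: Array[String]): Unit = {
    val f: Double =&gt; Double = x =&gt; math.sqrt(x)
    val x = 2.0
    val exact = 0.5 / math.sqrt(x)  // known derivative of sqrt at x
    for (h &lt;- Seq(1.0e-2, 1.0e-6, 1.0e-10, 1.0e-14)) {
      val err = math.abs(forwardDiff(f, x, h) - exact)
      println(f"h = $h%.0e  abs error = $err%.3e")
    }
  }
}
</pre>
</div>
<p>A large step suffers from truncation error (the function is not a straight line over the step) while a very small step suffers from floating point round-off in the subtraction, so the error first shrinks and then grows again as the step is reduced.</p>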
<p>This numeric divided difference method often outperforms non-derivative optimization methods (like Powell&#8217;s Method and the Nelder-Mead Amoeba method).  But the technique can run into numeric difficulties: the step size has to balance truncation error against floating point round-off.   We can remedy this if we are willing to write our function in a slightly more general way.   If we re-encode our function in a generic manner we can use <a href="http://en.wikipedia.org/wiki/Automatic_differentiation" target="ext">automatic differentiation</a>  (not to be confused with numeric differentiation or with symbolic differentiation) to produce a reliable gradient for optimization.  What we need to do is re-write our function to work over an abstract field of numbers instead of only the machine supplied doubles.  In fact what we need to do is specify a generic function that will work over any field, with the field to be determined later.  The code to do this in Scala is very similar to the non-generic code:</p>
<div class="highlight">
<pre>   <span class="k">val</span> <span class="n">genericFx</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">VectorFN</span> <span class="o">{</span>
      <span class="k">def</span> <span class="n">apply</span><span class="o">[</span><span class="kt">Y</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">Y</span><span class="o">]](</span><span class="n">p</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Y</span><span class="o">])</span><span class="k">:</span><span class="kt">Y</span> <span class="o">=</span> <span class="o">{</span>
        <span class="k">val</span> <span class="n">field</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="mi">0</span><span class="o">).</span><span class="n">field</span>
        <span class="k">val</span> <span class="n">dim</span> <span class="k">=</span> <span class="n">p</span><span class="o">.</span><span class="n">length</span>
        <span class="k">val</span> <span class="n">npoint</span> <span class="k">=</span> <span class="n">dat</span><span class="o">.</span><span class="n">length</span>
        <span class="k">var</span> <span class="n">total</span> <span class="k">=</span> <span class="n">field</span><span class="o">.</span><span class="n">zero</span>
        <span class="k">for</span><span class="o">(</span><span class="n">k</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">npoint</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
          <span class="k">var</span> <span class="n">term</span> <span class="k">=</span> <span class="n">field</span><span class="o">.</span><span class="n">zero</span>
          <span class="k">for</span><span class="o">(</span><span class="n">i</span> <span class="k">&lt;-</span> <span class="mi">0</span> <span class="n">to</span> <span class="o">(</span><span class="n">dim</span><span class="o">-</span><span class="mi">1</span><span class="o">))</span> <span class="o">{</span>
            <span class="k">val</span> <span class="n">diff</span> <span class="k">=</span> <span class="n">p</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">-</span> <span class="n">field</span><span class="o">.</span><span class="n">inject</span><span class="o">(</span><span class="n">dat</span><span class="o">(</span><span class="n">k</span><span class="o">)(</span><span class="n">i</span><span class="o">))</span>
            <span class="n">term</span> <span class="k">=</span> <span class="n">term</span> <span class="o">+</span> <span class="n">diff</span><span class="o">*</span><span class="n">diff</span>
          <span class="o">}</span>
          <span class="n">total</span> <span class="k">=</span> <span class="n">total</span> <span class="o">+</span> <span class="n">smoothSQRT</span><span class="o">(</span><span class="n">term</span><span class="o">)</span>
        <span class="o">}</span>
        <span class="n">total</span>
      <span class="o">}</span>
    <span class="o">}</span>
</pre>
</div>
<p>Notice that the code is very similar to the &#8220;def fx()&#8221; code.  The key differences are that we had to define genericFx as extending a trait (a type of Scala interface) called VectorFN and inside this trait extension we defined a parameterized function named apply().  apply() is a generic function that is willing to work over any type Y where Y is at least of type NumberBase[Y] (we will get more into what that means in a moment).  The difference in notation is that while the Scala function <em>syntax</em> cannot specify a generic function with free type parameters (the incompletely specified Y) the Scala <em>semantics</em> are strong enough to implement this.  In fact standard function values (such as &#8220;def fx()&#8221; when it is passed as an argument) are just syntactic sugar for instances of the Scala built-in <a href="http://www.scala-lang.org/docu/files/api/scala/Function1.html" target="ext">Function1 trait</a>.  With a generic objective function in hand all we need is conjugate gradient code that is expecting a VectorFN (and willing to call apply() instead of just using naked function parentheses) and some type NumberBase[Y] that can compute gradients for us.  The Scala compiler can specialize our genericFx() into one version for quick calculation and another for gradients.  How this is done is what we will discuss next.  From our point of view our problem is solved with the following one line of code:</p>
<div class="highlight">
<pre><span class="k">val</span> <span class="o">(</span><span class="n">pF</span><span class="o">,</span><span class="n">fpF</span><span class="o">)</span> <span class="k">=</span> <span class="nc">CG</span><span class="o">.</span><span class="n">minimize</span><span class="o">(</span><span class="n">genericFx</span><span class="o">,</span><span class="n">p0</span><span class="o">)</span>
</pre>
</div>
<p>This should always be your goal: build sufficient preparation so that your last step is an &#8220;obvious one liner.&#8221;</p>
<h2>What Tools we Wish we Had Lying Around</h2>
<p>We supply in our example some workable conjugate gradient code, but that is standard so we will not discuss it.  What is of interest (and facilitated by Scala&#8217;s parameterized type system) is the implementation of <a href="http://en.wikipedia.org/wiki/Dual_number" target="ext">dual numbers</a> as a framework to supply automatic differentiation.  An implementation of dual numbers as a NumberBase[DualNumber] type is the core of our demonstration.</p>
<p>Dual numbers are an algebraic structure written as pairs of real numbers &#8220;(a,b)&#8221;.  The arithmetic table for dual numbers is given below:</p>
<table>
<tr>
<td>(a,b) + (c,d)</td>
<td>=</td>
<td>((a+c) , (b+d))</td>
</tr>
<tr>
<td>(a,b) &#8211; (c,d)</td>
<td>=</td>
<td>((a-c) , (b-d))</td>
</tr>
<tr>
<td>(a,b) * (c,d)</td>
<td>=</td>
<td>((a*c) , (a*d+b*c))</td>
</tr>
<tr>
<td>(a,b) / (c,d)</td>
<td>=</td>
<td>((a/c) , ((b*c-a*d)/(c*c)))</td>
</tr>
</table>
<p>In a dual number (a,b) &#8220;a&#8221; is the &#8220;large&#8221; or &#8220;standard&#8221; part of the number.  You can check from the arithmetic table that the pair of dual numbers (a,0) and (c,0) behave just as we would expect the real numbers a and c to behave.  In the dual number (a,b) &#8220;b&#8221; is the &#8220;small&#8221; or &#8220;ideal&#8221; portion of the number.  From the multiplication rule above we can observe two rules: (0,b) * (c,0) = (0,b*c) (something small times anything else is small) and (0,b)*(0,d) = (0,0) (two small things become zero when multiplied).  Essentially the dual numbers are carrying around the first two terms of a Taylor series: we get as a result both the function value and the function derivative.  For a function f() over the real numbers we extend f() to work over the dual numbers by defining: f((a,b)) = (f(a),b f&#8217;(a)) (which is consistent with the previously defined arithmetic). We can check that the dual numbers obey the usual laws of arithmetic (associative, commutative, distributive, identities, and inverses wherever the standard part is non-zero).  The punchline is that over the dual numbers the divided difference estimate of f&#8217;(x) (the derivative of f() evaluated at x) is in fact exact in the sense that f((x,1)) = (f(x),f&#8217;(x)) (or f((x,0)+(0,1)) &#8211; f((x,0)) = (0, f&#8217;(x))).  Implementing the DualNumber class is little more than transcribing the above arithmetic table into Scala.</p>
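<p>As a rough illustration of the arithmetic table, here is a minimal sketch of a dual number in Scala.  The names Dual, DualDemo and f are invented for this sketch (this is not the DualNumber class from the distributed source), and pow, exp, log and the comparisons are left out.  Seeding the small part with 1 should make the result carry both f(x) and f&#8217;(x):</p>
<div class="highlight">
<pre>// Illustrative only: a stripped-down dual number, not the article's DualNumber.
final case class Dual(a: Double, b: Double) {
  def +(that: Dual): Dual = Dual(a + that.a, b + that.b)
  def -(that: Dual): Dual = Dual(a - that.a, b - that.b)
  def *(that: Dual): Dual = Dual(a * that.a, a * that.b + b * that.a)
  // requires that.a != 0
  def /(that: Dual): Dual =
    Dual(a / that.a, (b * that.a - a * that.b) / (that.a * that.a))
}

object DualDemo {
  // f(x) = x*x*x + x/(x+1), written only in terms of dual arithmetic
  def f(x: Dual): Dual = x * x * x + x / (x + Dual(1.0, 0.0))

  def main(args: Array[String]): Unit = {
    val r = f(Dual(2.0, 1.0)) // seed the small part with 1
    println(r.a) // standard part: f(2)
    println(r.b) // small part: f'(2)
  }
}
</pre>
</div>
<p>Differentiating by hand gives f(2) = 8 + 2/3 and f&#8217;(2) = 12 + 1/9, which the two printed values should match up to floating point rounding.</p>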
<p>We have already seen how to write code that uses NumberBase[Y] types (genericFx() itself is an example).  A more complicated example is the CG.minimize() code which not only accepts a generic function (in the form of VectorFN) but then specializes it to NumberBase[DualNumber] to compute gradients and also specializes to NumberBase[MDouble] for quick calculation during line searches (MDouble is just an adapter for machine Doubles, used for speed).  The ability to re-specialize a function is one of the advantages of a parameterized type system.  The DualNumbers are an example of forward automatic differentiation.  We could also use the same object framework to capture a representation of the computation path and apply more sophisticated methods such as reverse automatic differentiation. </p>
<p>We give a link to a jar containing <a href="http://www.win-vector.com/dfiles/ScalaDiff.jar">complete Scala source code</a> including this example, the DualNumber implementation, a conjugate gradient implementation and some JUnit tests (all under a GPLv3 license) and will go on to describe some of the design decisions.  The code is the bulky part of this work, so we will move on to discuss something more compact: types.</p>
<h2>Types</h2>
<p>If code is ever beautiful it is only when it is succinct.  Among the most succinct forms of code are individual type signatures and interfaces (though the indiscriminate repetition of type signatures is rightly considered ugly bloat, which Scala works to avoid).   Since we are distributing complete source we will describe only types and method signatures.  The entry points to the code are the JUnit tests (organized in the ScalaDiff/test source directory and depending on JUnit, which is not included) and the demo program in ScalaDiff/src/demo/Demo.scala.</p>
<p>To be a usable arithmetic type (like DualNumber or MDouble) you must extend the following parameterized abstract class:</p>
<div class="highlight">
<pre><span class="k">abstract</span> <span class="k">class</span> <span class="nc">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]]</span> <span class="o">{</span>
  <span class="c">// basic arithmetic</span>
  <span class="k">def</span> <span class="o">+</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">-</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">unary_-</span><span class="o">()</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">*</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="o">/</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>  <span class="kt">//</span> <span class="kt">that</span> <span class="kt">not</span> <span class="kt">equal</span> <span class="kt">to</span> <span class="kt">zero</span>
  <span class="c">// more complicated</span>
  <span class="k">def</span> <span class="n">pow</span><span class="o">(</span><span class="n">that</span><span class="k">:</span><span class="kt">Double</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">exp</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>
  <span class="k">def</span> <span class="n">log</span><span class="k">:</span><span class="kt">NUMBERTYPE</span> <span class="kt">//</span> <span class="kt">this</span> <span class="kt">is</span> <span class="kt">positive</span>
  <span class="c">// comparison functions</span>
  <span class="k">def</span> <span class="o">&gt;</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&gt;=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">==</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">!=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&lt;</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="k">def</span> <span class="o">&lt;=</span> <span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Boolean</span>
  <span class="c">// utility</span>
  <span class="k">def</span> <span class="n">field</span><span class="k">:</span><span class="kt">Field</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]</span>
<span class="o">}</span>
</pre>
</div>
<p>In particular DualNumber extends NumberBase[DualNumber].  This deliberate circular reference has a big purpose: it allows publicly visible covariant return types (returning nearly the exact type we really are instead of a base type).  This allows us to have strict type arguments so that trying to add an MDouble to a DualNumber is a type error (even though they both extend the same base class).  The automatic differentiation technique encapsulated in the DualNumber class only works if all of the calculation is in the DualNumber types, and this strict type enforcement allows the compiler to help prevent results sneaking in and out through other types.  All of the methods on NumberBase are obviously related to arithmetic except the field() method.  This method gives us access to a Field object which is responsible for carrying around the runtime type information (this is a common problem in Java and Scala: some type information known at compile time, such as the choice of template types, is not easily accessed at runtime).  The Field class is as follows:</p>
<div class="highlight">
<pre><span class="k">abstract</span> <span class="k">class</span> <span class="nc">Field</span> <span class="o">[</span><span class="kt">NUMBERTYPE</span> <span class="k">&lt;:</span> <span class="kt">NumberBase</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]]</span> <span class="o">{</span>
  <span class="k">def</span> <span class="n">zero</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>            <span class="kt">//</span> <span class="kt">return</span> <span class="kt">canonical</span> <span class="kt">zero</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">one</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>             <span class="kt">//</span> <span class="kt">return</span> <span class="kt">canonical</span> <span class="kt">one</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">inject</span><span class="o">(</span><span class="n">v</span><span class="k">:</span><span class="kt">Double</span><span class="o">)</span><span class="k">:</span><span class="kt">NUMBERTYPE</span>  <span class="kt">//</span> <span class="kt">return</span> <span class="kt">canonical</span> <span class="kt">representation</span> <span class="kt">of</span> <span class="kt">number</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">project</span><span class="o">(</span><span class="n">v</span><span class="k">:</span><span class="kt">NUMBERTYPE</span><span class="o">)</span><span class="k">:</span><span class="kt">Double</span> <span class="kt">//</span> <span class="kt">return</span> <span class="kt">standard-number</span> <span class="kt">represented</span> <span class="kt">in</span> <span class="kt">field</span>
  <span class="k">def</span> <span class="n">array</span><span class="o">(</span><span class="n">n</span><span class="k">:</span><span class="kt">Int</span><span class="o">)</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">NUMBERTYPE</span><span class="o">]</span> <span class="kt">//</span> <span class="kt">return</span> <span class="kt">an</span> <span class="kt">array</span> <span class="kt">of</span> <span class="kt">this</span> <span class="k">type</span>
<span class="o">}</span>
</pre>
</div>
<p>The Field class is where we have factories for numbers (zero, one, arrays, injection from standard Doubles) and casting (projection back to standard Doubles).</p>
<p>With these types defined we can actually read intent off some of the method signatures.  </p>
<p>For example our conjugate gradient optimizer is accessed through the following method signature:</p>
<div class="highlight">
<pre> <span class="k">def</span> <span class="n">minimize</span><span class="o">(</span><span class="n">fn</span><span class="k">:</span><span class="kt">VectorFN</span><span class="o">,</span><span class="n">x0</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="o">(</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],</span><span class="kt">Double</span><span class="o">)</span> <span class="c">// return x,f(x)</span>
</pre>
</div>
<p>The above can be read as: CG.minimize() requires a VectorFN (our trait representing single argument functions with a free type parameter) and an initial point (in standard Doubles).  The code will then return a pair of the optimum point and the function evaluated at the optimum point.  From the type signature we can see that CG.minimize() expects to re-specialize the function &#8220;fn&#8221; to types of its own choosing (else it could have accepted a parameterized argument instead of our custom trait) and will handle all up-conversion and down-conversion between machine Doubles and NumberBase[Y] values itself.  This sort of type information is hard to express (let alone enforce) in a dynamically typed language.</p>
<p>A slightly more complicated example is the lineMinD() method:</p>
<div class="highlight">
<pre><span class="k">def</span> <span class="n">lineMinD</span><span class="o">[</span><span class="kt">Y&lt;:NumberBase</span><span class="o">[</span><span class="kt">Y</span><span class="o">]](</span><span class="n">field</span><span class="k">:</span><span class="kt">Field</span><span class="o">[</span><span class="kt">Y</span><span class="o">],
 </span><span class="n">f</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Y</span><span class="o">]</span><span class="k">=&gt;</span><span class="kt">Y</span><span class="o">,
 </span><span class="n">xm</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],
 </span><span class="n">di</span><span class="k">:</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">])</span><span class="k">:</span><span class="o">(</span><span class="kt">Array</span><span class="o">[</span><span class="kt">Double</span><span class="o">],</span><span class="kt">Double</span><span class="o">)</span>
</pre>
</div>
<p>Notice it is willing to work with any type parameterized function (which means it is willing to let the caller pick the actual type of NumberBase[Y] and work with that).  Most callers will call with Y=MDouble (the wrapper for machine Doubles) and lineMinD() will then work with that (without ever really knowing the actual underlying type).</p>
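<p>The difference between the two signatures (a callee that picks the number type itself versus a caller that picks it) can be sketched with simplified stand-ins.  Everything named here (Num, Plain, Dual2, GenericFn, SumSquares, Respecialize) is hypothetical and much smaller than the real NumberBase, MDouble, DualNumber and VectorFN; the gradient is computed by the naive method of one dual-number pass per coordinate, which is not necessarily how CG.minimize() does it:</p>
<div class="highlight">
<pre>// Simplified stand-ins for the article's types; names and shapes are assumptions.
trait Num[Y &lt;: Num[Y]] {
  def +(that: Y): Y
  def *(that: Y): Y
}

final case class Plain(v: Double) extends Num[Plain] {
  def +(that: Plain): Plain = Plain(v + that.v)
  def *(that: Plain): Plain = Plain(v * that.v)
}

final case class Dual2(a: Double, b: Double) extends Num[Dual2] {
  def +(that: Dual2): Dual2 = Dual2(a + that.a, b + that.b)
  def *(that: Dual2): Dual2 = Dual2(a * that.a, a * that.b + b * that.a)
}

// A function with a free type parameter: whoever calls apply() picks Y.
trait GenericFn {
  def apply[Y &lt;: Num[Y]](x: Array[Y]): Y
}

object SumSquares extends GenericFn {
  // Uses an iterator so no array of Y has to be built; x must be non-empty.
  def apply[Y &lt;: Num[Y]](x: Array[Y]): Y =
    x.iterator.map(xi =&gt; xi * xi).reduce(_ + _)
}

object Respecialize {
  // Accepts the generic function and specializes it twice, to types it chooses.
  def valueAndGradient(fn: GenericFn, x0: Array[Double]): (Double, Array[Double]) = {
    val value = fn(x0.map(Plain(_))).v // fast evaluation
    val grad = x0.indices.map { i =&gt;
      // seed coordinate i with small part 1, all others with 0
      val seeded = x0.indices.map(j =&gt; Dual2(x0(j), if (i == j) 1.0 else 0.0)).toArray
      fn(seeded).b
    }.toArray
    (value, grad)
  }
}
</pre>
</div>
<p>Because valueAndGradient() takes a GenericFn rather than an Array[Y]=&gt;Y, it is free to evaluate the same function over Plain and over Dual2.  A function already fixed to one Y could not be reused this way.</p>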
<p>A lot of fans of dynamic languages consider type systems to be mere hairshirt penance.   But that is not so.  Broken type systems (like Java&#8217;s collections before generics, implemented by type erasure, were introduced in Java 1.5) are indeed more trouble than they are worth.  Working type systems (like C++ Templates/STL, Java 1.5+ and Scala) allow you to solve problems (and enforce decisions) during the design phase (which is much, much cheaper than during the deployment phase).  You can&#8217;t set your types in stone (you are likely going to have them subtly wrong for the first few iterations).  You must be willing to think like a &#8220;language lawyer&#8221; to find out what parts of your work can be specified and enforced in the language type system.  To use an analogy: static types are your blueprint or your underpainting.</p>
<h2>Tests</h2>
<p>One argument against static types is that you can get much of their benefit from unit tests.  My opinion is you never have enough unit tests, so putting more pressure on your test suite is not wise.   Static types plus tests are strictly more powerful than static types alone or tests alone. </p>
<p>Even for this toy-scale example project we have included a JUnit test set to pursue a number of goals:</p>
<ul>
<li>Confirm our number implementations (DualNumber and MDouble) correctly model machine Doubles (perform parallel calculations and compare).</li>
<li>Confirm DualNumber obeys expected laws of algebra composition and cancellation <em>including the portions that can not be modeled in machine Doubles</em>.</li>
<li>Confirm DualNumbers compute gradients.</li>
<li>Confirm operations of optimizers and optimizer components.</li>
</ul>
<p>Many of these tests are related, but they don&#8217;t all imply each other and they give different perspectives on the errors they catch.  For example no amount of parallel computation between DualNumbers and machine Doubles is going to confirm the infinitesimal portion of the DualNumber is propagating correctly (since this is not a property of machine Doubles).  So we add extra tests that expect DualNumber to obey algebraic relations like: a*(b+c) = a*b + a*c.  It is then another step to confirm that whatever the DualNumbers calculate is not only self-consistent, but also models a truncated Taylor Series or differentiation.</p>
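<p>To make the distinction between these kinds of test concrete, here is a small sketch using plain Boolean checks instead of JUnit.  The class D and the object DualLawChecks are made up for this illustration and are not taken from the distributed test suite:</p>
<div class="highlight">
<pre>// Illustrative stand-in for DualNumber (hypothetical name, + and * only).
final case class D(a: Double, b: Double) {
  def +(that: D): D = D(a + that.a, b + that.b)
  def *(that: D): D = D(a * that.a, a * that.b + b * that.a)
}

object DualLawChecks {
  // Small integer-valued inputs keep the Double arithmetic exact here,
  // so == is safe; real tests would normally use a tolerance.
  val x = D(2.0, 3.0)
  val y = D(5.0, 7.0)
  val z = D(11.0, 13.0)

  // Parallel check: the standard part tracks plain Double arithmetic.
  // It says nothing about the small part.
  def standardPartMatches: Boolean =
    (x * (y + z)).a == 2.0 * (5.0 + 11.0)

  // Algebraic check: distributivity, compared on both parts.
  def distributes: Boolean =
    x * (y + z) == x * y + x * z

  // Two purely small numbers multiply to zero.
  def smallTimesSmallIsZero: Boolean =
    D(0.0, 3.0) * D(0.0, 7.0) == D(0.0, 0.0)
}
</pre>
</div>
<p>Passing checks like these shows self-consistency only.  Confirming that the small part really is a derivative still needs a comparison against a known derivative.</p>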
<h2>Conclusion</h2>
<p>We hope we have demonstrated how the complexity of a mathematical programming problem can be managed by breaking the problem into an objective function that is separate from the optimizer (allowing the optimizer to be both good and hidden) and using a static type system (such as Scala&#8217;s) to help enforce required properties of a calculation (such as all numbers being routed through a required representation).  With these sorts of tools available many formerly hard problems (that are often, unfortunately, solved by over-specifying direct inefficient iterative improvement techniques) become &#8220;if I can write a reasonable objective function this may already be solved by an optimizer in my library.&#8221;  The more of these tools you have (either in your code or in your reference library) the more of these problems become easy (this is the topic of my earlier paper: <a href="http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/">The Local to Global Principle</a>).</p>
<h2>Appendix: Fixing Smoothness</h2>
<p>Our chosen example objective function is very nice (i.e. convex) but it has a small (but correctable) problem.   The derivative (or gradient) has some jump discontinuities that could cause an optimizer to exit prematurely (not at the global optimum).  Consider the simple form of this for wiring a center to a single point at the origin (even in 1 dimension).  The wiring cost function sqrt(x*x) has the cost graph shown here.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/abs.png" alt="abs.png" border="0" width="525" height="525" /><br />
</center></p>
<p>This is convex, but the derivative is not smooth (it jumps at the origin), as we see in the included graph of the derivative of sqrt(x*x).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/dabs.png" alt="dabs.png" border="0" width="525" height="525" /><br />
</center></p>
<p>So: in this case if the optimizer stops at one of the target points we can&#8217;t be sure that it stopped at the global optimum (it may have stopped due to the discontinuity in the gradient).  For some simple problems the optimum is necessarily at a target point.  For example on the number line take the target points 0,1 and x.  As long as x&ge;0 and x&le;1 the optimum placement will be x itself.</p>
<p>One way to defend against this is to use some sort of smoothed version of sqrt() that essentially decreases a little faster near the origin.  Our cost function becomes:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/06/cost2.png" alt="cost2.png" border="0" width="237" height="55" /><br />
</center><br />
where s() is our suitable approximation of the sqrt() function.  Two candidates are s(x) = (x+tau)^(1/2) and s(x) = x^(1/2 + tau); where tau is a small constant.  As long as tau is greater than zero we have no derivative discontinuity in s(x^2) and convexity is preserved (even made a bit stricter).  Other ways to deal with this include adding additional coordinates to the problem and small perturbations on these coordinates.  Finally, a point found by optimizing with respect to s(x) can be &#8220;polished&#8221; by re-starting the optimization at the first found solution and using sqrt(x) as the new objective (if the original point is not near any of the target points).</p>
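<p>For the first candidate, s(x) = (x+tau)^(1/2), the one-dimensional case can be worked by hand: the derivative of s(x*x) is x/sqrt(x*x + tau), which passes continuously through 0 at the origin instead of jumping from -1 to 1.  A minimal sketch of this follows; the object name SmoothSqrt and the value of tau are assumptions chosen for illustration, not the smoothSQRT() used in the example code:</p>
<div class="highlight">
<pre>object SmoothSqrt {
  val tau = 1.0e-6 // assumed small positive constant

  // smoothed square root: s(x) = (x + tau)^(1/2)
  def s(x: Double): Double = math.sqrt(x + tau)

  // derivative of s(x*x) with respect to x, worked by hand
  def dSmoothCost(x: Double): Double = x / math.sqrt(x * x + tau)

  // derivative of sqrt(x*x) away from the origin: jumps from -1 to 1
  def dAbsCost(x: Double): Double = math.signum(x)

  def main(args: Array[String]): Unit = {
    for (x &lt;- Seq(-1.0, -1.0e-4, 0.0, 1.0e-4, 1.0))
      println(s"x=$x smoothed=${dSmoothCost(x)} original=${dAbsCost(x)}")
  }
}
</pre>
</div>
<p>Far from the origin the two derivatives nearly agree; the smoothed one differs noticeably only within a few multiples of sqrt(tau) of the target point.</p>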


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Permanent Link: Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Algorithmic Movie (with texture)</title>
		<link>http://www.win-vector.com/blog/2010/04/algorithmic-movie-with-texture/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=algorithmic-movie-with-texture</link>
		<comments>http://www.win-vector.com/blog/2010/04/algorithmic-movie-with-texture/#comments</comments>
		<pubDate>Tue, 27 Apr 2010 16:44:52 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Algorithmic Art]]></category>
		<category><![CDATA[genetic art]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1457</guid>
		<description><![CDATA[We would like to share a new algorithmic movie we have created. Since the mid 90&#8242;s we have been dabbling off and on with a combination of algorithmic and genetic art (see: What is “Genetic Art?” or try running the Java code directly in your browser). Every once in a while we return to the [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/06/what-is-genetic-art/' rel='bookmark' title='Permanent Link: What is &#8220;Genetic Art?&#8221;'>What is &#8220;Genetic Art?&#8221;</a></li>
<li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Permanent Link: Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We would like to share a new algorithmic movie we have created.</p>
<p>Since the mid 90&#8242;s we have been dabbling off and on with a combination of algorithmic and genetic art (see: <a href="http://www.win-vector.com/blog/2009/06/what-is-genetic-art/" target="other">What is “Genetic Art?”</a> or try <a href="http://www.mzlabs.com/MZLabsJM/page4/page22/page22.html" target="other">running the Java code directly in your browser</a>).  Every once in a while we return to the project and generate something we would like to share.</p>
<p><span id="more-1457"></span><br />
For this project we have used formulas over the variables &#8220;x&#8221; and &#8220;y&#8221; to describe how color varies as a function of position on our canvas.</p>
<p>This has allowed formulas like:</p>
<blockquote><p>
( + ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j x + k y ) k ) ) ) ) ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j x + k y ) k ) ) ) ) )
</p></blockquote>
<p>To generate pictures like this:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/gartPicture2010_04_27_09.20.21.7941.jpg" alt="gartPicture2010_04_27_09.20.21.794.jpg" border="0" width="500" height="333" /><br />
</center></p>
<p>We then add a source-texture from C. Estrade&#8217;s &#8220;Full-Color Japanese Textile Designs CD-ROM and Book&#8221; (<a href="http://store.doverpublications.com/0486996956.html" target="ext">Dover</a>, unrestricted use):<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/023.jpg" alt="023.jpg" border="0" width="500" height="325" /><br />
</center></p>
<p>Which (with a slightly modified formula) yields a picture like this:</p>
<blockquote><p>
( + ( subst ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j x + k y ) k ) ) ) ) Img23 ) ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j x + k y ) k ) ) ) ) )
</p></blockquote>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/gartPicture2010_04_18_09.12.24.2121.jpg" alt="gartPicture2010_04_18_09.12.24.212.jpg" border="0" width="500" height="333" /><br />
</center></p>
<p>We can further modify the formula to depend on time (represented by the new variable &#8220;z&#8221;):</p>
<blockquote><p>
( + ( subst ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j (x +z) + k (y + z) ) k ) ) ) ) Img23 ) ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j (x +z) + k (y + z) ) k ) ) ) ) )
</p></blockquote>
<p>And get a <a href="http://www.youtube.com/watch?v=hs_glOeEV7c" target="ext">movie</a> like this:</p>
<p><center><br />
<object width="500" height="405"><param name="movie" value="http://www.youtube.com/v/hs_glOeEV7c&#038;hl=en_US&#038;fs=1&#038;rel=0&#038;border=1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/hs_glOeEV7c&#038;hl=en_US&#038;fs=1&#038;rel=0&#038;border=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="500" height="405"></embed></object><br />
</center></p>
<p>What we have previously called &#8220;genetic art&#8221; was the system of automatically combining and re-combining fragments of formulas using user votes and preferences (so nobody would have to see or understand these ugly formulas to produce art).  What we now present is a larger &#8220;algebra&#8221; of &#8220;simple picture plus pattern = complicated pictures&#8221; and &#8220;picture plus time transformations = movie.&#8221;</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/06/what-is-genetic-art/' rel='bookmark' title='Permanent Link: What is &#8220;Genetic Art?&#8221;'>What is &#8220;Genetic Art?&#8221;</a></li>
<li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Permanent Link: Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/04/algorithmic-movie-with-texture/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGACT Review of: Combinatorics the Rota Way</title>
		<link>http://www.win-vector.com/blog/2010/04/sigact-review-of-combinatorics-the-rota-way/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=sigact-review-of-combinatorics-the-rota-way</link>
		<comments>http://www.win-vector.com/blog/2010/04/sigact-review-of-combinatorics-the-rota-way/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 03:51:56 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Book Reviews]]></category>
		<category><![CDATA[Combinatorics]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1450</guid>
		<description><![CDATA[SIGACT News review of: Combinatorics the Rota Way. Also found on Professor Gasarch&#8217;s page and ACM SIGACT News Volume 41, Issue 2 (paywall) Review of Combinatorics The Rota Way by Joseph P.S. Kung, Gian-Carlo Rota and Catherine H. Yan Cambridge, 2009 396 pages, Trade Paperback Review by John Mount, jmount@win-vector.com April 20, 2010 Introduction Combinatorics, [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/08/what-is-mathematics-really/' rel='bookmark' title='Permanent Link: What is Mathematics, Really?'>What is Mathematics, Really?</a></li>
<li><a href='http://www.win-vector.com/blog/2009/05/the-joy-of-calculation/' rel='bookmark' title='Permanent Link: The Joy of Calculation'>The Joy of Calculation</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>SIGACT News review of: Combinatorics the Rota Way.  Also found on <a href="http://www.cs.umd.edu/~gasarch/bookrev/41-2.pdf" target="ext">Professor Gasarch&#8217;s page</a> and  <a href="http://portal.acm.org/browse_dl.cfm?idx=J697" target="ext">ACM SIGACT News Volume 41, Issue 2 (paywall)</a></p>
<p><span id="more-1450"></span></p>
<div align="center"><b>Review of<br />
Combinatorics The Rota Way<br />
by Joseph P.S. Kung, Gian-Carlo Rota and Catherine H. Yan<br />
Cambridge, 2009<br />
396 pages, Trade Paperback</b></div>
<div align="center"><b>Review by<br />
John Mount, jmount@win-vector.com<br />
April 20, 2010</b></div>
<h1><a name="SECTION00010000000000000000">Introduction</a></h1>
<p>Combinatorics, as it matures, becomes harder to succinctly describe. The field has progressed from the basic study of finite sets and counting techniques to being the discipline where questions involving counting, graphs, connectivity, mappings and partial orders all naturally reside. But the objects that combinatorics studies turn out not to be the correct foundation to support modern combinatorial methods. Many combinatorial methods were dismissed as mere technique until combinatorics expanded to include the natural domains of these methods: lattices, formal power series, valuation rings, matroids and many diverse algebras. One person who pushed hard for this coherence and unity was Gian-Carlo Rota.</p>
<p>An example of a high-school level combinatorial trick is proving the equation</p>
<div align="center"><img width="101" height="60" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg1.png" alt="$\displaystyle \sum_{i=0}^{n} \binom{n}{i} = 2^n $"></div>
<p>by applying the binomial theorem to <img width="61" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg2.png" alt="$ (1+1)^n$"> . This trick is transformed into a method when you recognize that you really should be working in the ring of formal power series and invent the Umbral Calculus. With the Umbral Calculus you can use the equivalence of the following two equations:</p>
<div align="center">
<table cellpadding="0" align="center" width="100%">
<tr valign="middle">
<td nowrap align="right"><img width="20" height="30" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg3.png" alt="$\displaystyle b^n$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg4.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="155" height="60" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg5.png" alt="$\displaystyle (a+1)^n = \sum_{i=0}^{n} \binom{n}{i} a^i$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="21" height="30" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg6.png" alt="$\displaystyle a^n$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg4.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="205" height="60" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg7.png" alt="$\displaystyle (b-1)^n = \sum_{i=0}^{n} (-1)^{n-i} \binom{n}{i} b^i$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"><br />
(i.e. <img width="68" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg8.png" alt="$ b = a+1$"> is equivalent to <img width="68" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg9.png" alt="$ a=b-1$"> ) to prove that for any two arbitrary infinite sequences <img width="37" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg10.png" alt="$ a_i,b_i$"> the following two statements are also equivalent:</p>
<p></p>
<div align="center"><a name="eq1"></a><a name="eq2"></a><br />
<table cellpadding="0" align="center" width="100%">
<tr valign="middle">
<td nowrap align="right"><img width="20" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg11.png" alt="$\displaystyle b_n$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg4.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="81" height="60" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg12.png" alt="$\displaystyle \sum_{i=0}^{n} \binom{n}{i} a_i \;$">for all<img width="18" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg13.png" alt="$\displaystyle \; n$"></td>
<td width="10" align="right">(1)</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="21" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg14.png" alt="$\displaystyle a_n$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg4.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="133" height="60" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg15.png" alt="$\displaystyle \sum_{i=0}^{n} (-1)^{n-i} \binom{n}{i} b_i \;$">for all<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg16.png" alt="$\displaystyle \; n.$"></td>
<td width="10" align="right">(2)</td>
</tr>
</table>
</div>
<p><br clear="all"><br />
For example: we could pick <img width="45" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg17.png" alt="$ a_i = i$"> and substitute it into Equation&nbsp;<a href="#eq1">1</a>. With some work we see this implies <img width="73" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg18.png" alt="$ b_i= 2^{i-1} i$"> .<a name="tex2html1" href="#foot43"><sup>1</sup></a> Then by the Umbral result we know Equation&nbsp;<a href="#eq2">2</a> must also be true, so we get a new identity: <img width="186" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg22.png" alt="$ n = \sum_{i=0}^{n} (-1)^{n-i} \binom{n}{i} 2^{i-1} i$"> . This algebraic production of a new identity is very different from the classical method of &#8220;counting two ways&#8221; (or being lucky enough to come up with a clever bijection to prove the identity).</p>
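<p>The inversion pair can be checked numerically. The following is a minimal Python sketch, not anything taken from the book: the function names <code>binomial_transform</code> and <code>inverse_binomial_transform</code> are made up for illustration, and only the first ten terms of the sequence <code>a_i = i</code> are tested.</p>

```python
from math import comb

def binomial_transform(a):
    """Equation (1): b_n = sum_i C(n, i) * a_i."""
    return [sum(comb(n, i) * a[i] for i in range(n + 1)) for n in range(len(a))]

def inverse_binomial_transform(b):
    """Equation (2): a_n = sum_i (-1)^(n-i) * C(n, i) * b_i."""
    return [
        sum((-1) ** (n - i) * comb(n, i) * b[i] for i in range(n + 1))
        for n in range(len(b))
    ]

a = list(range(10))                      # the example sequence a_i = i
b = binomial_transform(a)                # should match i * 2^(i-1)
closed_form = [0] + [i * 2 ** (i - 1) for i in range(1, 10)]
recovered = inverse_binomial_transform(b)

print(b == closed_form, recovered == a)
```

<p>Both comparisons hold for this prefix, which is what the equivalence of Equations 1 and 2 predicts. The check covers only finitely many terms, so it illustrates the identity rather than proving it.</p>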
<h1><a name="SECTION00020000000000000000">Summary</a></h1>
<p>The book &#8220;Combinatorics the Rota Way&#8221; is itself hard to succinctly describe. The first and third authors tell of writing this book using notes from the Massachusetts Institute of Technology&#8217;s course 18.315 collected over a span of more than 30 years. Gian-Carlo Rota himself was added as a posthumous author. The book itself contains more than a single course-year&#8217;s worth of material and is packed very densely.</p>
<p>The book&#8217;s emphasis is abstract and algebraic. The exercises are not to teach, but are instead to identify applications of combinatorics in other mathematical disciplines. The book is the product of a strong push to demonstrate many combinatorial methods in their most powerful, but not most obvious, forms. This work is clearly a labor of love and contains some remarkable material. However, due to the large breadth of the work not much time is spent on motivation or on concrete examples.</p>
<h2><a name="SECTION00021000000000000000">Chapter 1: Sets, Functions and Relations</a></h2>
<p>The first chapter covers the definitional foundations of combinatorics: sets, lattices, partial orders, functions and relations. These are the discrete objects that the book will reason about by later building more complicated algebraic objects. This section is very dense and reads like a compressed Bourbaki treatment of discrete mathematics.</p>
<p>One portion of this chapter that is problematic is the section on entropy that seems to serve no purpose other than to prepare the reader for exercise 1.4.10 which demonstrates an abstraction of entropy. Also, exercises 1.2.5(j,k) are needlessly cruel in asking the reader to recreate the Robertson-Seymour graph minor theorem. There have been books where the reader is successfully guided through a major result by exercises, such as the Weak Perfect Graph Theorem in Lov&aacute;sz&#8217;s &#8220;Combinatorial Problems and Exercises&#8221;, but this book is not structured in that manner.</p>
<h2><a name="SECTION00022000000000000000">Chapter 2: Matching Theory</a></h2>
<p>The second chapter is a welcome change in tone. It opens with a quote from Harper and Rota describing matching theory, and a clever 1979 Putnam exam problem is worked into the exercises and solutions. Central to the chapter is the &#8220;marriage theorem&#8221;, which determines when matchings are possible. Also discussed is Birkhoff&#8217;s Theorem, which relates matchings to matrices by stating that every doubly stochastic matrix is a convex combination of permutation matrices. The text is lively and includes a number of well-researched asides, such as the origin of the name &#8220;The Hungarian Method.&#8221; However, there are some problems with forward reference: for example the reader is asked to work a couple of exercises (2.4.5 and 2.4.6) using the Binet-Cauchy formula, which isn&#8217;t discussed at length until chapter 6.</p>
<h2><a name="SECTION00023000000000000000">Chapter 3: Partially Ordered Sets and Lattices</a></h2>
<p>This chapter begins with a very exciting presentation of the M&ouml;bius Function (the convolutional inverse of what is essentially the indicator function of a partial order). It is a real pleasure to see this material well presented in a general lattice setting, instead of the more common and specialized number theoretic setting. The chapter moves on to chains (ordered sequences in lattices) and anti-chains (sets of incomparable elements) in partial orders. The authors present Dilworth&#8217;s theorem, which states that every partial order can be covered by a number of chains no larger than the size of the largest anti-chain.<a name="tex2html2" href="#foot57"><sup>2</sup></a> The chapter continues with Sperner Theory, which relates counting anti-chains to binomial coefficients. Chapter 3 concludes with valuation rings and M&ouml;bius Algebras: a transition to the more algebraic style found in Chapter 4.</p>
<h2><a name="SECTION00024000000000000000">Chapter 4: Generating Functions and the Umbral Calculus</a></h2>
<p>This is a key chapter. The book introduces the Umbral Calculus, a transform space automating the manipulation of generating functions. The algebra of delta operators is introduced, which provides an abstraction of differentiation. Finally co-algebras are explored, which abstract the processes of factoring.</p>
<p>A rare (and unfortunate) typo on page 190 mis-defines a basic sequence <img width="42" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg23.png" alt="$ p_n(x)$"> for the delta operator <img width="17" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg24.png" alt="$ Q$"> as obeying <img width="131" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg25.png" alt="$ Q p_n(x) = p_{n-1}(x)$"> instead of the correct equation: <img width="140" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg26.png" alt="$ Q p_n(x) = n p_{n-1}(x)$"> . A careful reader can spot the mistake as it is inconsistent with the subsequent demonstrations and uses.</p>
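<p>One concrete instance of the corrected equation, offered as an illustration rather than as the book&#8217;s own example: the forward difference operator is a standard delta operator, and its basic sequence is the falling factorials. The Python sketch below (helper names <code>falling</code> and <code>forward_difference</code> are hypothetical) checks the relation with the factor of n on a small grid of integers.</p>

```python
def falling(x, n):
    """Falling factorial x (x-1) ... (x-n+1); equals 1 when n is 0."""
    result = 1
    for k in range(n):
        result *= x - k
    return result

def forward_difference(f):
    """The delta operator Q f(x) = f(x + 1) - f(x)."""
    return lambda x: f(x + 1) - f(x)

# Check Q p_n(x) = n * p_{n-1}(x) with p_n(x) = falling(x, n).
checks = [
    forward_difference(lambda x, n=n: falling(x, n))(x) == n * falling(x, n - 1)
    for n in range(1, 8)
    for x in range(-5, 6)
]
all_hold = all(checks)
print(all_hold)
```

<p>Dropping the factor of n from the right-hand side makes the check fail for every n greater than 1, which is the inconsistency a careful reader would notice.</p>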
<h2><a name="SECTION00025000000000000000">Chapter 5: Symmetric Functions and Baxter Algebras</a></h2>
<p>This chapter treats a number of important algebraic topics. Symmetric functions are studied and identified as being the obvious class of functions that contains all of the well known generating functions already studied. P&oacute;lya&#8217;s Enumeration Theory, which is the method of counting the number of equivalence classes of distinct arrangements, is given a very interesting exposition. But the book skips the classic examples and exercises, such as counting the number of ways to construct distinct necklaces from colored beads, that would be needed for the topic to be fully approachable. Baxter Algebras, which abstract both summation and integration by parts, are introduced via a study of the sequence shift operator. By this point the book has abstract versions of both differentiation and integration, providing a combinatorial groundwork to prove theorems on &#8220;the calculus&#8221; that are more general than is possible in any one theory of differentiation or integration.</p>
<h2><a name="SECTION00026000000000000000">Chapter 6: Determinants, Matrices and Polynomials</a></h2>
<p>This chapter is most similar to classical polynomial invariant theory, the study of symmetric functions of the roots of polynomials such as the discriminant. A major theme of this chapter is the study of the relations between properties of polynomial coefficients and the locations of roots of the polynomials. The study of matrices brings us to the remarkable Binet-Cauchy Formula for the determinant of a product of matrices. The results are deep, but it is a shame that more time isn&#8217;t spent on simple concrete applications such as using the Binet-Cauchy formula to count the number of spanning trees in a graph. This chapter reveals the parts of combinatorics that come from analysis and the study of locations of roots of polynomials (via group theory), in contrast to the parts that come from enumerating finite sets, linear algebra and abstract algebra. This is also the chapter where the exterior algebra, a favorite tool of Rota&#8217;s, is most discussed.</p>
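<p>To make the spanning tree application concrete, here is a small Python sketch (not from the book; the choice of the complete graph on four vertices and all variable names are assumptions made for illustration). It builds the reduced oriented incidence matrix <code>B</code> of the graph and compares the determinant of <code>B</code> times its transpose with the Binet-Cauchy expansion over all 3-edge subsets.</p>

```python
from fractions import Fraction
from itertools import combinations

def det(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in matrix]
    n = len(m)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result            # a row swap flips the sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result

# Complete graph on 4 vertices: 6 edges, each oriented from u to w.
edges = list(combinations(range(4), 2))
# Reduced incidence matrix: one row per vertex 0..2 (vertex 3 is dropped).
B = [[1 if v == u else -1 if v == w else 0 for (u, w) in edges] for v in range(3)]

# Left side: determinant of B times its transpose (the reduced Laplacian).
gram = [[sum(B[r][k] * B[c][k] for k in range(len(edges))) for c in range(3)]
        for r in range(3)]
lhs = det(gram)

# Right side (Binet-Cauchy): sum of squared 3x3 minors over all edge triples.
rhs = sum(
    det([[B[r][k] for k in subset] for r in range(3)]) ** 2
    for subset in combinations(range(len(edges)), 3)
)
print(lhs, rhs)
```

<p>Each squared minor is 1 when the chosen edges form a spanning tree and 0 otherwise, so the right-hand side counts spanning trees directly. Both sides agree with the 16 trees given by Cayley&#8217;s formula for four labeled vertices.</p>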
<p>A typo on page 275 (a potentially confusing comma in the definition of the <img width="46" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg27.png" alt="$ eval()$"> operation) can be recovered from because the authors have the nice habit of explicitly calling out the domain and range of functions.</p>
<h1><a name="SECTION00030000000000000000">Opinion</a></h1>
<p>Some important questions about this book are: is Gian-Carlo Rota a coauthor, what is the purpose of the book and who is the best audience?</p>
<p>Gian-Carlo Rota seems appropriately labeled as a co-author, as clearly a lot of his work went into the book. The book is not suitable to be used as an introductory text book or as a reference. It is a book meant to be read. The ideal audience is capable of graduate level mathematics, is comfortable with a high degree of abstraction and algebra and is already familiar with many of the structures and techniques of combinatorics: sets, graphs, matrices, alternating sequences and generating functions. A mathematician or computer scientist wanting to learn more about the science of combinatorics will find a good read here.</p>
<p>The book works best as a second read of the topics covered. If you already know of a combinatorial method, like P&oacute;lya&#8217;s Enumeration Theory, this book is a good place to find the starting point for an alternate and powerful treatment of the topic. The book admits to not being self contained, and has a few forward-reference problems. However, this is forgivable when you realize the goal of this book is not to teach some easy discrete mathematics before you move on to analysis, but to extract the important combinatorial methods and themes from all of mathematics.</p>
<p>The content is well written, very accurate and well edited. The index is good, but not quite up to the job. The bibliography is very good and divided into three useful sections: papers by Gian-Carlo Rota and coworkers, books for further reading and a section of references.</p>
<p>We close with an extract from the book at hand. Many mathematicians have used the phrase &#8220;merely combinatorial proof&#8221; as a phrase of dismissal. However, when properly founded, combinatorial proofs are in fact more general than proofs that depend on additional specific details from the original problem domain. The authors take some justifiable pleasure in including points like: &#8220;Hilbert&#8217;s basis theorem is equivalent to the &#8216;trivial combinatorial fact&#8217; given in Gordan&#8217;s lemma.&#8221; This is certainly a taste of combinatorics the Rota way.</p>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot43">&#8230;.</a><a href="#tex2html1"><sup>1</sup></a></dt>
<dd>For this use the binomial theorem to expand <img width="62" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg19.png" alt="$ (1+x)^n$"> , differentiate with respect to <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg20.png" alt="$ x$"> and then substitute in <img width="42" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/04/CTRimg21.png" alt="$ x=1$"> .</dd>
<dt><a name="foot57">&#8230; anti-chain.</a><a href="#tex2html2"><sup>2</sup></a></dt>
<dd>From this they derive just about the only Ramsey-theoretic style result in the book: any large partial order must have a large chain or large anti-chain.</dd>
</dl>
<p></p>
<hr />


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/08/what-is-mathematics-really/' rel='bookmark' title='Permanent Link: What is Mathematics, Really?'>What is Mathematics, Really?</a></li>
<li><a href='http://www.win-vector.com/blog/2009/05/the-joy-of-calculation/' rel='bookmark' title='Permanent Link: The Joy of Calculation'>The Joy of Calculation</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/04/sigact-review-of-combinatorics-the-rota-way/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Easy&#8221; Portfolio Allocation</title>
		<link>http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=easy-portfolio-allocation</link>
		<comments>http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/#comments</comments>
		<pubDate>Thu, 14 Jan 2010 20:09:13 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Finance]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Lagrange Multipliers]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Portfolio Theory]]></category>
		<category><![CDATA[Sharpe Ratio]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1342</guid>
		<description><![CDATA[This is an elementary mathematical finance article. This means if you know some math (linear algebra, differential calculus) you can find a quick solution to a simple finance question. The topic was inspired by a recent article in The American Mathematical Monthly (Volume 117, Number 1 January 2010, pp. 3-26): &#8220;Find Good Bets in the [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='Permanent Link: A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/a-discrete-model-gauging-market-efficiency/' rel='bookmark' title='Permanent Link: A Discrete Model Gauging Market Efficiency'>A Discrete Model Gauging Market Efficiency</a></li>
<li><a href='http://www.win-vector.com/blog/2009/10/what-is-the-gamblers-equivalent-of-amdahls-law/' rel='bookmark' title='Permanent Link: What is the gambler&#8217;s equivalent of Amdahl&#8217;s Law?'>What is the gambler&#8217;s equivalent of Amdahl&#8217;s Law?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>This is an elementary mathematical finance article. This means if you know some math (linear algebra, differential calculus) you can find a quick solution to a simple finance question. The topic was inspired by a recent article in The American Mathematical Monthly (Volume 117, Number 1 January 2010, pp. 3-26): &#8220;Find Good Bets in the Lottery, and Why You Shouldn&#8217;t Take Them&#8221; by Aaron Abrams and Skip Garibaldi which said optimal asset allocation is now an undergraduate exercise. That may well be, but there are a lot of people with very deep mathematical backgrounds who have yet to see this. We will fill in the details here. The style is terse, but the content should be about what you would expect from one day of lecture in a mathematical finance course.</p>
<p><span id="more-1342"></span></p>
<p>Portfolio allocation is not the &#8220;magic predict the future&#8221; part of finance, it is the scheme for correctly applying magic predictions of the future. The idea is that if you had a prediction of future returns of a number of assets, the naive thing to do would be to invest everything into the asset with the highest predicted return. Portfolio theory, while still taking the predictions at face value, picks an investment pattern that will (in risk-adjusted dollars) outperform the naive strategy even if the predictions are correct and is a bit safer when the predictions are wrong.</p>
<p>Suppose you had <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg1.png" alt="$ n$"> different assets you could invest in. For the <img width="10" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg2.png" alt="$ i$"> -th asset there is an expected excess relative return of <img width="19" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg3.png" alt="$ \mu_i$"> and an estimated variance of <img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg4.png" alt="$ s_i$"> (for a definition of relative return see <a href="http://www.win-vector.com/blog/2010/01/relative-returns-a-banker-versus-trader-paradox/">Relative returns: a banker versus trader paradox</a> and for a definition of variance see <a href="http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/">A Quick Appreciation of the Sharpe Ratio</a>). Let the vector <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> be such that <img width="23" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg6.png" alt="$ X_i$"> represents the number of dollars we invest in the <img width="10" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg2.png" alt="$ i$"> -th asset. If <img width="23" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg6.png" alt="$ X_i$"> is positive then our plan is &#8220;to go long&#8221; or buy some of the <img width="10" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg2.png" alt="$ i$"> -th asset. 
If <img width="23" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg6.png" alt="$ X_i$"> is negative our plan is &#8220;to short&#8221; or sell some of the <img width="10" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg2.png" alt="$ i$"> -th asset to somebody else (It is called going short as we actually sell something we do not have. This is often allowed in finance; as long as we make the same pay-outs to the buyer that the buyer would receive if we really had the item to sell).</p>
<p>When we appeal to the idea of optimizing the portfolio Sharpe Ratio (again, see <a href="http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/">A Quick Appreciation of the Sharpe Ratio</a>) then we say a good portfolio is one that doesn&#8217;t just maximize expected relative returns (which is <img width="39" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg11.png" alt="$ X^{\top} \mu$"> ) but maximizes the ratio of expected relative return to standard deviation:</p>
</p>
<div align="center"><img width="73" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg8.png" alt="$\displaystyle \frac{X^{\top} \mu}{\sqrt{X^{\top} C X}} $"></div>
<p>where (for now) <img width="17" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg9.png" alt="$ C$"> is the matrix <img width="30" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg10.png" alt="$ s s^{\top}$"> . This ratio is called a &#8220;risk adjusted return&#8221; (versus the un-adjusted form <img width="39" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg11.png" alt="$ X^{\top} \mu$"> ). Also notice that the ratio is homogeneous in <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> (doubling <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> does not change the ratio as it simultaneously doubles the numerator and the denominator) so an optimal solution <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> describes not how much to invest, but what pattern to invest in. This allows us to introduce an important practical constraint: we are only going to allow ourselves to risk a total of <img width="16" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg13.png" alt="$ T$"> dollars (both long and short). That is: we insist <img width="105" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg14.png" alt="$ \sum_{i=1}^{n} \vert X_i\vert = T$"> . We will ignore this total investment constraint until the end, when we can satisfy the constraint by simply re-scaling a partial solution.</p>
<p>To solve for <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> we introduce an old friend: <a href="http://en.wikipedia.org/wiki/Lagrange_multipliers">Lagrange Multipliers</a> (or equivalently the Karush-Kuhn-Tucker conditions of optimality). Since the fraction we are trying to optimize is homogeneous in <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> we can convert the denominator into a constraint and arbitrarily insist that <img width="99" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg15.png" alt="$ \sqrt{X^{\top} C X} = 1$"> without changing the nature of the problem. We are now trying to maximize <img width="39" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg11.png" alt="$ X^{\top} \mu$"> subject to <img width="99" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg15.png" alt="$ \sqrt{X^{\top} C X} = 1$"> . The Lagrangian conditions of optimality state that at the optimum the gradient of the objective must be proportional to the gradient of the constraint, or:</p>
</p>
<div align="center"><img width="225" height="40" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg16.png" alt="$\displaystyle \nabla_X X^{\top} \mu = \lambda \nabla_X ( \sqrt{X^{\top} C X} - 1 ) $"></div>
<p>for some (to be determined) constant <img width="13" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg17.png" alt="$ \lambda$"> . Pushing the gradient operator through we get:</p>
<div align="center"><img width="213" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg18.png" alt="$\displaystyle \mu = \lambda (1/2) ( X^{\top} C X )^{-1/2} 2 C X . $"></div>
<p>A similar equation could be obtained by appealing to a Rayleigh Quotient argument.</p>
<p>We do not yet know <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> (that is what we are trying to solve for), so we do not know what <img width="56" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg19.png" alt="$ X^{\top} C X$"> is. However, this is just a scalar and since we are just trying to solve up to a multiple we can throw it out and introduce a new multiple and see that it is enough to solve:</p>
</p>
<div align="center"><img width="76" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg20.png" alt="$\displaystyle \mu = \lambda' C X $"></div>
<p>where <img width="18" height="16" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg21.png" alt="$ \lambda'$"> is a new (still unknown) scalar. This means we have:</p>
<div align="center"><img width="121" height="35" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg22.png" alt="$\displaystyle X = (1/\lambda') C^{-1} \mu $"></div>
<p>so our desired solution is some re-scaling of <img width="43" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg23.png" alt="$ C^{-1} \mu$"> .</p>
<p>As we stated earlier we have a total investment constraint of <img width="105" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg14.png" alt="$ \sum_{i=1}^{n} \vert X_i\vert = T$"> . We can achieve this with the following adjusted solution:</p>
</p>
<div align="center"><img width="189" height="51" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg24.png" alt="$\displaystyle X = \frac{T}{\sum_{i=1}^{n} \vert(C^{-1} \mu)_i\vert} C^{-1} \mu $"></div>
<p>as our desired optimal portfolio allocation. In the end we can solve for the optimal portfolio by merely solving a linear system (we don&#8217;t need anything as expensive as a general purpose optimizer in this case).</p>
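<p>The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not production code: the three-asset returns <code>mu</code>, the covariance estimate <code>C</code> and the budget <code>T</code> are made-up numbers, the helper names <code>solve</code>, <code>allocate</code> and <code>risk_adjusted_return</code> are hypothetical, and <code>C</code> is assumed to be symmetric and invertible.</p>

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A y = b exactly by Gauss-Jordan elimination (A must be invertible)."""
    n = len(b)
    M = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Raises StopIteration if no pivot exists, i.e. A is singular.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def allocate(mu, C, T):
    """Re-scale the solution of C y = mu so the absolute positions sum to T."""
    direction = solve(C, mu)
    scale = T / sum(abs(v) for v in direction)
    return [scale * v for v in direction]

def risk_adjusted_return(X, mu, C):
    """Expected return of X divided by its standard deviation."""
    n = len(X)
    expected = sum(x * m for x, m in zip(X, mu))
    variance = sum(X[i] * C[i][j] * X[j] for i in range(n) for j in range(n))
    return float(expected) / float(variance) ** 0.5

# Hypothetical inputs: expected excess returns and a covariance estimate.
mu = [F(8, 100), F(5, 100), F(3, 100)]
C = [[F(4, 100), F(1, 100), 0],
     [F(1, 100), F(2, 100), 0],
     [0, 0, F(1, 100)]]
T = 1000

X = allocate(mu, C, T)
naive = [T, 0, 0]   # everything in the asset with the highest predicted return
print([float(x) for x in X])
print(risk_adjusted_return(X, mu, C), risk_adjusted_return(naive, mu, C))
```

<p>For these inputs the diversified allocation has a higher risk adjusted return than the naive all-in-one-asset strategy, even though its raw expected return is lower. Exact rational arithmetic is used only to keep the example free of rounding noise; a numerical linear algebra library would be the usual choice for real data.</p>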
<p>These are very old results (going back as long as there have been Sharpe Ratios and portfolio theory). A good example reference is: &#8220;The Valuation of Risk Assets and the Selection of Risky Investments in Stock Portfolios and Capital Budgets,&#8221; John Lintner, The Review of Economics and Statistics (1965) vol. 47 (1) pp. 13-37. These results are the basis for advice like: &#8220;diversify.&#8221; Without modeling risk you would tend to put all of your money in the predicted highest paying asset. When modeling risk you tend to put some of your money in each high paying asset and as long as they do not all fail at the same time you have some safety. Another (very different) route to diversification is the Kelly Criterion (discussed in <a href="http://www.win-vector.com/blog/2009/10/what-is-the-gamblers-equivalent-of-amdahls-law/">What is the gambler&#8217;s equivalent of Amdahl&#8217;s Law?</a>).</p>
<p>A very important risk we have not yet modeled is that our assets may have a tendency to fail at the same time (meaning we may not have really diversified usefully). The notion that assets may fail at the same time brings us to the ideas of correlation and covariance. When we took <img width="64" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg25.png" alt="$ C = s s^{\top}$"> we were implicitly assuming (or modeling), without justification, that each possible asset was independent of all the others (that there was no correlation between asset returns). This is, of course, not going to be anywhere near true in practice. Instead we should take <img width="17" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg9.png" alt="$ C$"> to be the <a href="http://en.wikipedia.org/wiki/Covariance_matrix">Covariance Matrix</a> that represents our estimate of the asset-to-asset correlations. In this case the solution methods above all work exactly as before. Companies such as MSCI Barra have made complete businesses out of producing and selling estimates of <img width="17" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg9.png" alt="$ C$"> .</p>
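<p>For readers who have never built such a matrix, the simplest possible estimate is the sample covariance of historical returns. The Python sketch below assumes a made-up four-period history for two assets and a hypothetical helper named <code>sample_covariance</code>; commercial estimates are far more elaborate than this.</p>

```python
def sample_covariance(returns):
    """Sample covariance matrix; returns[t][i] is the return of asset i in period t."""
    periods = len(returns)
    n = len(returns[0])
    means = [sum(row[i] for row in returns) / periods for i in range(n)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in returns)
            / (periods - 1)
            for j in range(n)
        ]
        for i in range(n)
    ]

# Hypothetical history: two assets that tend to rise and fall together.
history = [
    [0.02, 0.01],
    [-0.01, -0.02],
    [0.03, 0.02],
    [0.00, -0.01],
]
C_estimate = sample_covariance(history)
print(C_estimate)
```

<p>The diagonal entries are the per-asset variances and the off-diagonal entry is positive here, which records that the two assets tend to fail together. With only a handful of periods such an estimate is very noisy, which is part of why careful covariance estimation is a business in its own right.</p>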
<p>Another issue is when we do not allow ourselves to &#8220;short&#8221; (or take a negative allocation of) assets. In this case we have the additional constraints <img width="48" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg26.png" alt="$ X \ge 0$"> which complicate our solution. For the special case where the asset returns are assumed to be independent (i.e. <img width="64" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg25.png" alt="$ C = s s^{\top}$"> ) it is enough to solve as above and merely replace any negative allocations with zero when inspecting and scaling the final step of the solution. When the covariances are non-trivial (<img width="17" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg9.png" alt="$ C$"> has non-zero off-diagonal entries) this solution may not be optimal. In this case the Karush-Kuhn-Tucker conditions are more complicated and at the point of optimal solution we have the following conditions:</p>
<p></p>
<div align="center">
<table cellpadding="0" align="center">
<tr valign="middle">
<td nowrap align="right"><img width="145" height="60" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg27.png" alt="$\displaystyle \mu + \lambda C X - \sum_{i=1}^{n} \tau_i E^i$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg28.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap>0</td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="19" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg29.png" alt="$\displaystyle X$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg30.png" alt="$\displaystyle \ge$"></td>
<td align="left" nowrap>0</td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="48" height="60" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg31.png" alt="$\displaystyle \sum_{i=1}^{n} X_i$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg28.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="16" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg32.png" alt="$\displaystyle T$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="13" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg33.png" alt="$\displaystyle \tau$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg30.png" alt="$\displaystyle \ge$"></td>
<td align="left" nowrap>0</td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="38" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg34.png" alt="$\displaystyle \tau^{\top} X$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg28.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap>0</td>
<td width="10" align="right">&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"><br />
where <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> is the allocation vector we wish to solve for, <img width="13" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg17.png" alt="$ \lambda$"> is an unknown scalar, <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg35.png" alt="$ \tau$"> is a new unknown vector and <img width="22" height="16" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg36.png" alt="$ E^i$"> is the vector with <img width="69" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg37.png" alt="$ (E^i)_i = 1$"> and zeroes elsewhere. Using the Karush-Kuhn-Tucker conditions has allowed us to again almost linearize the problem, but we now have sign constraints on <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> and <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg35.png" alt="$ \tau$"> and what is called a complementarity constraint: <img width="67" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg38.png" alt="$ \tau^{\top} X = 0$"> . This sort of problem is essentially what is called a &#8220;Linear Complementarity Problem&#8221; and is about as hard as solving a linear program (the typical solution method is a variation of the simplex method called &#8220;Lemke&#8217;s algorithm&#8221;).
(Technically the <img width="13" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg17.png" alt="$ \lambda$"> prevents the problem from being in the right form, but <img width="13" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg17.png" alt="$ \lambda$"> can be inspected out of the problem.) The problem can still be solved; you just need a bit more software. If we cannot short assets (or at least simulate shorting assets) we not only eliminate many possible portfolios from consideration (so we likely end up with a less profitable portfolio than we would like), we also make the mathematics and computation a bit harder.</p>
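<p>To make the conditions listed above concrete, here is a minimal Python sketch that only <em>checks</em> whether a candidate triple satisfies them; it does not solve the complementarity problem (that would need Lemke&#8217;s algorithm or similar software). The function name <code>satisfies_kkt</code>, the tolerance and the two-asset numbers are all assumptions made up for illustration, with the numbers chosen only so that the conditions hold.</p>
<pre><code>import math


def satisfies_kkt(mu, C, X, lam, tau, T, tol=1e-9):
    """Check the five conditions listed above for a candidate (X, lam, tau)."""
    n = len(X)
    # Stationarity: mu + lam * C X - tau = 0
    # (the sum over i of tau_i E^i is just the vector tau).
    CX = [sum(C[i][j] * X[j] for j in range(n)) for i in range(n)]
    residual = [mu[i] + lam * CX[i] - tau[i] for i in range(n)]

    def is_zero(v):
        return math.isclose(v, 0.0, abs_tol=tol)

    def is_nonnegative(v):
        # min(v, 0) is zero (up to tol) exactly when v is not negative.
        return is_zero(min(v, 0.0))

    return (all(is_zero(r) for r in residual)              # stationarity
            and all(is_nonnegative(x) for x in X)          # no shorting
            and is_zero(sum(X) - T)                        # allocations sum to T
            and all(is_nonnegative(t) for t in tau)        # multiplier signs
            and is_zero(sum(t * x for t, x in zip(tau, X))))  # complementarity


# Arbitrary two-asset numbers, chosen only so that the conditions hold.
mu = [-2.0, 1.0]
C = [[1.0, 0.0], [0.0, 1.0]]
print(satisfies_kkt(mu, C, X=[1.0, 0.0], lam=2.0, tau=[0.0, 1.0], T=1.0))
print(satisfies_kkt(mu, C, X=[0.5, 0.5], lam=2.0, tau=[0.0, 1.0], T=1.0))
</code></pre>
<p>The first call prints <code>True</code>: the second asset gets a zero allocation, so its multiplier is free to be positive without violating complementarity. The second call prints <code>False</code> because the stationarity equation no longer balances.</p>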
<p>The goal of this writeup has been to show how to systematically convert investment advice like &#8220;this stock is going to really take off&#8221; into an allocation of assets (which in turn implies a pattern of trades). We take as unexamined premises where to get such advice and whether to use the Sharpe ratio or some other notion of risk and/or utility. The point is that even though it may be complicated, from this point it is just calculation and calculation is easy to automate.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='Permanent Link: A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/a-discrete-model-gauging-market-efficiency/' rel='bookmark' title='Permanent Link: A Discrete Model Gauging Market Efficiency'>A Discrete Model Gauging Market Efficiency</a></li>
<li><a href='http://www.win-vector.com/blog/2009/10/what-is-the-gamblers-equivalent-of-amdahls-law/' rel='bookmark' title='Permanent Link: What is the gambler&#8217;s equivalent of Amdahl&#8217;s Law?'>What is the gambler&#8217;s equivalent of Amdahl&#8217;s Law?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Local to Global Principle</title>
		<link>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=the-local-to-global-principle</link>
		<comments>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 16:37:53 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Dynamic Programming]]></category>
		<category><![CDATA[Local to Global]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[PageRank]]></category>
		<category><![CDATA[Problem Solving]]></category>
		<category><![CDATA[Speech Recognition]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1123</guid>
		<description><![CDATA[We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Permanent Link: Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods.  We have produced both a stand-alone <a href="http://www.win-vector.com/dfiles/LocalToGlobal.pdf">PDF</a> (more legible) and an HTML/blog form (more skimmable).<br />
<span id="more-1123"></span></p>
<h1 align="center">The Local to Global Principle</h1>
<p align="center"><strong>John Mount<a name="tex2html3" href="#foot21" id="tex2html3"><sup>1</sup></a></strong></p>
<p></p>
<p align="center"><b>Date:</b> November 11, 2009</p>
<hr />
<h3>Abstract:</h3>
<div>We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods.</div>
<p></p>
<h2><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">Contents</a></h2>
<p><!--Table of Contents--></p>
<ul>
<li><a name="tex2html32" href="#SECTION00020000000000000000" id="tex2html32">Introduction</a></li>
<li><a name="tex2html33" href="#SECTION00030000000000000000" id="tex2html33">The Examples</a>
<ul>
<li><a name="tex2html34" href="#SECTION00031000000000000000" id="tex2html34">Web Page Link Analysis</a></li>
<li><a name="tex2html35" href="#SECTION00032000000000000000" id="tex2html35">Natural Language Processing</a></li>
<li><a name="tex2html36" href="#SECTION00033000000000000000" id="tex2html36">Machine Learning</a></li>
</ul>
<p></li>
<li><a name="tex2html37" href="#SECTION00040000000000000000" id="tex2html37">Some Methods</a>
<ul>
<li><a name="tex2html38" href="#SECTION00041000000000000000" id="tex2html38">Local Methods</a></li>
<li><a name="tex2html39" href="#SECTION00042000000000000000" id="tex2html39">Globalization Methods</a></li>
</ul>
<p></li>
<li><a name="tex2html40" href="#SECTION00050000000000000000" id="tex2html40">Conclusion</a></li>
<li><a name="tex2html41" href="#SECTION00060000000000000000" id="tex2html41">Bibliography</a></li>
<li><a name="tex2html42" href="#SECTION00070000000000000000" id="tex2html42">Acknowledgement</a></li>
</ul>
<p><!--End of Table of Contents--></p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">Introduction</a></h1>
<p><font>A common vain hope of computer scientists and algorithm designers is that a domain expert has already &#8220;boiled down&#8221; a problem to a precise, but unsolved, algorithmic core. On this point the mathematician Gian-Carlo Rota wrote:</font></p>
<blockquote><p><font>One of the rarest mathematical talents is the talent for applied mathematics, for picking out of a maze of experimental data the two or three parameters that are relevant, and to discard all other data. This talent is rare. It is taught only at the shop level.[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;A Mathematician&#8217;s Gossip&#8221;]</font></p></blockquote>
<p><font>We describe a useful tool for designing algorithmic applications and solutions which we call &#8220;the local to global principle.&#8221; The local to global principle is the method of deriving applications and solutions by specifying &#8220;local&#8221; (and deliberately myopic) heuristics, critiques and methods followed by using a powerful general method to &#8220;globalize&#8221; this specification into a complete solution.</font></p>
<p><font>There are many important problem solving prescriptions and methods of thought already systematically described and taught:</font></p>
<ul>
<li>Bacon&#8217;s &#8220;New Organon&#8221; and Mill&#8217;s principles of inductive logic.[<a href="#Mill">Mil02</a>]</li>
<li>Feynman&#8217;s genius method.[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;Ten Lessons I Wish I Had Been Taught&#8221;]</li>
<li>Reductionism (top down and bottom up).</li>
<li>Divide and conquer.[<a href="#IntroductionToAlgorithms">CLRS09</a>]</li>
<li>Forward deduction, backwards induction.</li>
<li>Root Cause Analysis.</li>
<li>Polya&#8217;s heuristic and conjecture and prove patterns [<a href="#citeulike:679515">Pol71</a>,<a href="#Polya1">Pol54a</a>,<a href="#Polya2">Pol54b</a>]</li>
<li>Doron Zeilberger&#8217;s &#8220;Method of Undetermined Generalization and Specialization.&#8221; [<a href="#Zeilberger:1995p277">Zei95</a>]</li>
<li>Zbigniew Michalewicz and David B. Fogel&#8217;s presentation of evolutionary algorithms.[<a href="#HTSMH">MF00</a>]</li>
</ul>
<p><font>The local to global principle is more of an organizational pattern than a &#8220;computer aided technique,&#8221; as no one specific species of software or family of notation is required.</font></p>
<p><font>The local to global principle can be identified in a number of previous important applications, but it is not currently an identified principle.<a name="tex2html4" href="#foot244" id="tex2html4"><sup>2</sup></a> The principle is very general, so any succinct description of it is going to be painfully vague. Instead, we explain the principle by discussing some example applications and methods.  For each of our example applications we deliberately use a different globalization technique. The effective algorithmist or practitioner must in fact come to each problem already familiar with a reasonably large set of already known local and global techniques, so we conclude with some appropriate fields of study and preparation.</font></p>
<p><font>The local to global principle is divided into two parts: local encoding of the problem followed by a globalization step that uses the encoding. The guiding feature of local encodings is that they are usually easy to compute from the data at hand. Any extension that looks like enumeration, search or optimization is best left to the global step. The local step is essentially the translation of your problem into an abstract language that is ready for the globalization step. In contrast globalization methods are often &#8220;off the shelf&#8221; in that once you abstract and encode the particulars of your problem you can look for pre-existing useful methods or software to finish your solution. The idea of globalization is to find a best overall or global compromise between competing local criteria. The local step does not so much have to avoid conflicts but instead &#8220;price them.&#8221; There is also an important trade-off that sophisticated local techniques allow the use of simpler globalization methods and more powerful globalization methods allow the use of simpler local techniques.</font></p>
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">The Examples</a></h1>
<p><font>To demonstrate the breadth of the local to global principle we choose a diverse collection of example applications: web page link analysis, natural language processing and machine learning. For each example application we will set up the problem, introduce a reasonable set of local criteria and pick an appropriate globalization technique. We will favor finishing each example without describing the globalization technique in detail, as this would distract from our point and is best left to the given references. These examples are previously solved problems; our contribution is demonstrating the shared underlying principle.</font></p>
<h2><a name="SECTION00031000000000000000" id="SECTION00031000000000000000">Web Page Link Analysis</a></h2>
<p><font>For our first example application we demonstrate web page link analysis in the form of the famous PageRank score.[<a href="#Page:1998p2689">PBMW98</a>]</font></p>
<p><font>One of the many good ideas leading up to the early Google search engine was the design of a non-text based measure of importance or interestingness of web pages. A search engine that could fold &#8220;interestingness&#8221; or popularity into its notion of relevance could better sort important pages into the search user&#8217;s view. When the web got so large that there were many pages that were exact matches to any common user query, popularity became a critical consideration. A link based notion of popularity exploits what is important about the web (the link structure, for example see [<a href="#Kleinberg:1997p32">Kle97</a>]) and avoids having to depend on a lot of natural language understanding technology. This technique also uses authority outside of the given page, so has some hope of being resistant (though not immune) to web-spam.</font></p>
<p><font>Taken all at once, designing a score of page importance is a daunting task. However, by working in stages (as the local to global principle prescribes) we can quickly derive interesting scores including the famous PageRank score. We start with the idea that popularity (or the amount of web traffic a page receives) is (loosely) correlated with importance. So for our first approximation step we decide to try to estimate popularity (or web traffic) and use this estimate as our importance score. Accurately estimating web traffic is itself a hard problem and a big industry (just a few of the major companies involved in this are: Google/Urchin, Quantcast, Nielsen, comScore, Alexa, Hitwise and LookSmart). For our second approximation step we are going to try to estimate popularity from the link structure<a name="tex2html6" href="#foot43" id="tex2html6"><sup>4</sup></a> of the web (using no other measurements or historic data) and use this as our score. This link based estimate is unlikely to completely reproduce real web surfing patterns, but it is very interesting in its own right and has been proven in the market to be a useful score.</font></p>
<p><font>Now the problem is to try to estimate the popularity of a web page from the link structure of the web. We claim: we can generate a useful (but not necessarily accurate) estimate of web traffic from the web&#8217;s link structure alone. Consider Figure&nbsp;<a href="#fig:Links1">1</a> where we have a universe of three web pages A, B and C that link to each other in the pattern illustrated by what is called a graph.<a name="tex2html7" href="#foot45" id="tex2html7"><sup>5</sup></a></font></p>
<div align="center"><a name="fig:Links1" id="fig:Links1"></a><a name="50"></a></p>
<table>
<caption align="bottom"><strong>Figure 1:</strong> A set of Mutually Linked Web Pages</caption>
<tr>
<td>
<div align="center"><img width="300" height="436" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/Links1.png" alt="Image Links1"></div>
</td>
</tr>
</table>
</div>
<p><font>In Figure&nbsp;<a href="#fig:Links1">1</a> we can consider each link to another page as evidence the other page is interesting or popular. One idea is to simulate a very simple web surfer who clicks on the links on a page uniformly at random. This is called &#8220;the random surfer model&#8221; and even a model this simple allows us to read some useful information from the link structure of the web. For instance, we could ask what fraction of their time the random surfer spends on each web page, with an eye to the idea that the pages the random surfer visits more often are the more important ones. Let <img width="35" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg2.png" alt="$ p(A)$"> denote the proportion of time the random web surfer spends on page A (and define <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg3.png" alt="$ p(B)$"> and <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg4.png" alt="$ p(C)$"> similarly). While we do not know any of <img width="76" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg5.png" alt="$ p(A), p(B)$"> or <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg4.png" alt="$ p(C)$"> we can derive some relationships between them by inspecting the link graph:</font></p>
<p></p>
<div align="center"><!-- MATH<br />
 \begin{eqnarray*}<br />
p(A) &#038; = &#038; \frac{1}{2} P(B) + P(C) \\<br />
p(B) &#038; = &#038; \frac{1}{2} P(A) \\<br />
p(C) &#038; = &#038; \frac{1}{2} P(A) + \frac{1}{2} P(B) .<br />
\end{eqnarray*}<br />
 --></p>
<table cellpadding="0" align="center" width="100%">
<tr valign="middle">
<td nowrap align="right"><img width="35" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg6.png" alt="$\displaystyle p(A)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="109" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg8.png" alt="$\displaystyle \frac{1}{2} P(B) + P(C)$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg9.png" alt="$\displaystyle p(B)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="52" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg10.png" alt="$\displaystyle \frac{1}{2} P(A)$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg11.png" alt="$\displaystyle p(C)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="125" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg12.png" alt="$\displaystyle \frac{1}{2} P(A) + \frac{1}{2} P(B) .$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><font>The first equation is just reading from the graph that: all visits to page-A must come from pages B and C, half of the visitors on page-B continue on to A and all of the visitors on page-C continue on to A. The second and third equations are the appropriate summaries of how traffic is routed to pages B and C. We can insist that <img width="183" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg13.png" alt="$ P(A) + P(B) + P(C) = 1$"> as we want these numbers to represent the fraction of time the random web surfer spends on each page. A more sophisticated model would add more features<a name="tex2html9" href="#foot245" id="tex2html9"><sup>6</sup></a> to get a more useful result.</font></p>
<p><font>It turns out we have already encoded enough local rules to completely determine <img width="85" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg14.png" alt="$ P(A), P(B)$"> and <img width="40" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg15.png" alt="$ P(C)$"> . In this example application an algorithmist already familiar with linear algebra&nbsp;[<a href="#Strang">Str76</a>] would recognize these local conditions as &#8220;a system of linear equations.&#8221; Solving even web-scale systems of linear equations is considered easy with modern techniques and modern computers. For our small example the solution is: <img width="67" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg16.png" alt="$ p(A) = \frac{4}{9}$"> , <img width="68" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg17.png" alt="$ p(B) = \frac{2}{9}$"> , and <img width="67" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg18.png" alt="$ p(C) = \frac{3}{9}$"> . The role of the local steps was to reduce a new problem (estimating the importance or popularity of a web page from the link structure) to something with <em>already known</em> techniques (like solving a linear system as illustrated in Figure&nbsp;<a href="#fig:LinAlg">2</a>).</font></p>
<div align="center"><a name="fig:LinAlg" id="fig:LinAlg"></a><a name="79"></a></p>
<table>
<caption align="bottom"><strong>Figure 2:</strong> Linear Algebra Solution: As Taught in School</caption>
<tr>
<td>
<div align="center"><img width="400" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LinAlg.jpg" alt="Image LinAlg"></div>
</td>
</tr>
</table>
</div>
<p><font>So page-A is the most important page by the PageRank measure.</font></p>
<p><font>In this example application the local step was setting up the system of linear equations (which are easy to derive from the web link graph) and the global step was solving the entire system for the final scores (which were not obvious). You spend most of your time encoding the problem and then use a known technique (in this case solving a linear system) to finish the solution.</font></p>
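<p><font>The two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the <code>links</code> dictionary is our reading of the link graph implied by the three equations above (A links to B and C, B links to A and C, C links to A), and the solver is plain textbook Gauss-Jordan elimination on exact fractions, not anything a web-scale system would use.</font></p>
<pre><code>from fractions import Fraction

# Assumed link graph, inferred from the equations in the text:
# each page maps to the pages it links to.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
pages = sorted(links)
n = len(pages)

# Local step: one row per page, encoding
#   p(page) - (sum of shares arriving from pages linking to it) = 0.
rows = []
for target in pages:
    row = [Fraction(0)] * (n + 1)      # last column is the right-hand side
    row[pages.index(target)] += 1
    for source in pages:
        if target in links[source]:
            row[pages.index(source)] -= Fraction(1, len(links[source]))
    rows.append(row)

# The balance equations are redundant (they add up to zero), so replace
# the last one with the normalization p(A) + p(B) + p(C) = 1.
rows[-1] = [Fraction(1)] * (n + 1)

# Global step: Gauss-Jordan elimination on the augmented matrix.
for col in range(n):
    pivot = next(r for r in range(col, n) if rows[r][col] != 0)
    rows[col], rows[pivot] = rows[pivot], rows[col]
    rows[col] = [v / rows[col][col] for v in rows[col]]
    for r in range(n):
        if r != col:
            factor = rows[r][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]

scores = {page: rows[i][n] for i, page in enumerate(pages)}
print(scores)
</code></pre>
<p><font>Under the assumed graph this reproduces the scores quoted above (4/9, 2/9 and 3/9, the last printed in lowest terms as 1/3). Note how little of the code is specific to web pages: only the loop that builds <code>rows</code> knows about links.</font></p>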
<h2><a name="SECTION00032000000000000000" id="SECTION00032000000000000000">Natural Language Processing</a></h2>
<p><font>Our next example application is natural language processing&nbsp;[<a href="#CharniakBook">Cha96</a>,<a href="#Charniak:1997p1484">Cha97</a>]. Speech recognition (the alignment or transcription of recognized intelligible segments of sound to written text) is an important problem in natural language processing. An example problem is the need to find the most likely text matching a sequence of sounds such as is shown in Figure&nbsp;<a href="#fig:SoundSeq1">3</a>.</font></p>
<div align="center"><a name="fig:SoundSeq1" id="fig:SoundSeq1"></a><a name="89"></a></p>
<table>
<caption align="bottom"><strong>Figure 3:</strong> A Sequence of Sounds</caption>
<tr>
<td>
<div align="center"><img width="500" height="69" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq1.png" alt="Image SoundSeq1"></div>
</td>
</tr>
</table>
</div>
<p><font>Consider Figure&nbsp;<a href="#fig:SoundSeq3">4</a> (which shows a bad transcription) and Figure&nbsp;<a href="#fig:SoundSeq2">5</a> (which shows a good transcription).</font></p>
<div align="center"><a name="fig:SoundSeq3" id="fig:SoundSeq3"></a><a name="98"></a></p>
<table>
<caption align="bottom"><strong>Figure 4:</strong> A Bad Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq3.png" alt="Image SoundSeq3"></div>
</td>
</tr>
</table>
</div>
<div align="center"><a name="fig:SoundSeq2" id="fig:SoundSeq2"></a><a name="105"></a></p>
<table>
<caption align="bottom"><strong>Figure 5:</strong> A Good Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq2.png" alt="Image SoundSeq2"></div>
</td>
</tr>
</table>
</div>
<p><font>Our claim: we can (given access to training data, and this is the age of data&nbsp;[<a href="#Halevy:2009p2327">HNP09</a>]) solve this problem with a local step that is a set of simple criticisms of proposed transcriptions. A good starting point is a database of previous sounds to text transcriptions. This database allows the construction of a series of tables that give the historic frequency (or probability) of all of the following:</font></p>
<ul>
<li>Prior probability of each sound</li>
<li>Probability of each sound given the immediately previous sound</li>
<li>Prior probability of each word</li>
<li>Probability of each word given the immediately previous word</li>
<li>Which combinations of word fragments are legitimate words</li>
<li>Probability of each sound being assigned to each word fragment (syllables, phonemes and so on).</li>
</ul>
<p><font>These tables encode a &#8220;speech model&#8221; (the rules involving sounds only), a language model (the rules involving text or words only) and the linkage between the two models. These models are deliberately simple in that they capture only local interactions (like probability of a word given the word before it) but no long range interactions (like subject predicate agreement).</font></p>
<p><font>Each box, nested box and arrow on our diagram represents one possible local critique. For each item in our diagram (again, the boxes and arrows) we can use our tables to assign a goodness or plausibility score. For instance bad word to word transitions (like &#8220;won&#8221; <img width="19" height="13" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg19.png" alt="$ \rightarrow$"> &#8220;won&#8221;) will be rare in our historic tables, so just looking up probabilities from the tables (or, better, using the logarithms of probabilities) gives us a &#8220;plausibility score&#8221; that prefers known patterns of language. Then a score for the overall transcription can be derived by multiplying all of the local scores together (or, if logarithms are used, adding them). These local scores (though simple) already have encoded enough evidence to prefer the good transcription to the bad transcription <em>without</em> requiring any deep knowledge of speech, text or the meaning of the text. This is because the bad transcription has a series of obvious flaws such as: unlikely sound to word fragment assignments and unlikely word to word transitions.</font></p>
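<p><font>As a rough sketch of this kind of scoring, the Python below adds up log-probabilities for each word and each word to word transition. Everything in it is hypothetical: the table entries are invented numbers (not measurements from any corpus), the two word sequences are made up rather than taken from the figures, and the sound to word fragment tables are left out to keep the example short.</font></p>
<pre><code>import math

# Hypothetical tables; the numbers are invented for illustration only.
word_prob = {"nine": 0.02, "one": 0.05, "won": 0.01}
next_word_prob = {
    ("nine", "one"): 0.10,
    ("one", "one"): 0.05,
    ("nine", "won"): 0.001,
    ("won", "won"): 0.0001,
}
FLOOR = 1e-6  # assumed stand-in probability for anything missing from a table


def plausibility(words):
    """Sum of log-probabilities of each word and each adjacent word pair."""
    score = sum(math.log(word_prob.get(w, FLOOR)) for w in words)
    score += sum(math.log(next_word_prob.get(pair, FLOOR))
                 for pair in zip(words, words[1:]))
    return score


good = ["nine", "one", "one"]
bad = ["nine", "won", "won"]
print(plausibility(good), plausibility(bad))
</code></pre>
<p><font>With these made-up tables the first sequence gets the higher (less negative) score, purely because its pieces are individually more common. Adding logarithms is equivalent to multiplying the probabilities, but avoids very small numbers on long transcriptions.</font></p>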
<div align="center"><a name="fig:SoundSeqPartial" id="fig:SoundSeqPartial"></a><a name="116"></a></p>
<table>
<caption align="bottom"><strong>Figure 6:</strong> Naively Extending a Partial Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeqPartial.png" alt="Image SoundSeqPartial"></div>
</td>
</tr>
</table>
</div>
<p><font>For example consider Figure&nbsp;<a href="#fig:SoundSeqPartial">6</a> where a naive solver is considering the word &#8220;one&#8221; as the third word to fill in. The <em>only</em> local critiques the solver needs to consider are:</font></p>
<ul>
<li>how likely the word &#8220;one&#8221; is in general (call this <img width="49" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg20.png" alt="$ P[one]$"> )</li>
<li>how likely the word &#8220;one&#8221; is to follow the word &#8220;nine&#8221; (call this <!-- MATH<br />
 $P[one | nine]$<br />
 --><br />
<img width="86" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg21.png" alt="$ P[one \vert nine]$"> )</li>
<li>how likely the letter sequence &#8220;o&#8221; is given the sound &#8220;w&#601;&#8221; (call this <!-- MATH<br />
 $P[o | \text{w\textschwa}]$<br />
 --><br />
<img width="55" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg24.png" alt="$P[o \vert \text{w\textschwa}]$"> )</li>
<li>how likely the letter sequence &#8220;ne&#8221; is given the sound &#8220;n&#8221; (call this <!-- MATH<br />
 $P[ne | \text{n}]$<br />
 --><br />
<img width="41" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg25.png" alt="$ P[ne \vert$">&nbsp; &nbsp;n<img width="7" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg23.png" alt="$ ]$"> ).</li>
</ul>
<p><font>So the local plausibility of the fill-in word &#8220;one&#8221; is: <!-- MATH<br />
 $P[one]<br />
\times P[one | nine] \times P[o | \text{w\textschwa}] \times P[ne |<br />
\text{n}]$<br />
 --><br />
<img width="292" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg28.png" alt="$P[one] \times P[one \vert nine] \times P[o \vert \text{w\textschwa}] \times P[ne \vert \text{n}]$"> . We will call this the critique of &#8220;one&#8221; in position 3 and write it as <!-- MATH<br />
 $C_3(w_2,one)$<br />
 --><br />
<img width="84" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg29.png" alt="$ C_3(w_2,one)$"> where <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg30.png" alt="$ w_2$"> is the word known to be in position 2. Similarly we can generate all of the possible critiques <img width="53" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg31.png" alt="$ C_1(w_1)$"> , <!-- MATH<br />
 $C_2(w_1,w_2)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg32.png" alt="$ C_2(w_1,w_2)$"> , <!-- MATH<br />
 $C_3(w_2,w_3)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg33.png" alt="$ C_3(w_2,w_3)$"> , <!-- MATH<br />
 $C_4(w_3,w_4)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg34.png" alt="$ C_4(w_3,w_4)$"> and the overall critique of a sequence <!-- MATH<br />
 $w_1 \; w_2 \; w_3 \; w_4$<br />
 --><br />
<img width="77" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg35.png" alt="$ w_1 \; w_2 \; w_3 \; w_4$"> : <!-- MATH<br />
 $C_1(w_1)<br />
\times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$<br />
 --><br />
<img width="336" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg36.png" alt="$ C_1(w_1) \times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$"> from our pre-computed tables of probabilities. Notice for all of these critiques only the immediately previous word and the nearby sounds were used to determine the plausibility of the word we are attempting to fit in. Instead of using these critiques to directly fill in a possible solution (or using search) we will package up these critiques (in the form of the <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> ) and pass them on to a powerful separate globalization step called Dynamic Programming&nbsp;[<a href="#DynamicProgramming">Bel57</a>].</font></p>
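<p><font>As a rough sketch of how such a critique could be computed, the Python fragment below multiplies table entries together. The tables, their names (<code>P_WORD</code>, <code>P_NEXT</code>, <code>P_SPELL</code>), the sound labels and every probability are invented for illustration and are not taken from our actual tables; the logarithmic variant is included because sums of logarithms are less prone to underflow than long products.</font></p>

```python
import math

# Hypothetical lookup tables: every number below is made up for illustration.
P_WORD = {"one": 0.02, "won": 0.005}                       # P[word]
P_NEXT = {("nine", "one"): 0.10, ("nine", "won"): 0.001}   # P[word | previous word]
P_SPELL = {                                                # P[letters | sound]
    ("o", "w-schwa"): 0.3,
    ("wo", "w-schwa"): 0.2,
    ("ne", "n"): 0.4,
    ("n", "n"): 0.5,
}


def critique(prev_word, word, spelling):
    """Local critique C_i(prev_word, word) as a product of table entries.

    spelling is a list of (letters, sound) pairs aligning the word to the sounds.
    """
    score = P_WORD[word] * P_NEXT[(prev_word, word)]
    for letters, sound in spelling:
        score *= P_SPELL[(letters, sound)]
    return score


def log_critique(prev_word, word, spelling):
    """The same critique expressed as a sum of logarithms."""
    total = math.log(P_WORD[word]) + math.log(P_NEXT[(prev_word, word)])
    for letters, sound in spelling:
        total += math.log(P_SPELL[(letters, sound)])
    return total


# Compare two candidate fill-ins for position 3 when position 2 holds "nine".
c_one = critique("nine", "one", [("o", "w-schwa"), ("ne", "n")])
c_won = critique("nine", "won", [("wo", "w-schwa"), ("n", "n")])
```

<p><font>With these invented numbers the critique of &#8220;one&#8221; is larger than the critique of &#8220;won&#8221;, and only the previous word and the nearby sounds were consulted.</font></p>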
<p><font>The globalization or finding of a best overall transcription is not trivial even though our score is simple. This is because the overall <em>best</em> sequence could depend on clever non-local fill-ins (like deliberately picking a less likely first word to allow a later favored transition to a fantastically good third word). Dynamic Programming does not fill in the transcription from left to right, but instead uses a table of scores derived from the left to right arrows and the <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> . In our example Dynamic Programming consists of building a table of information as shown in Figure&nbsp;<a href="#fig:DynBackFill">7</a>. Let <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> represent the word position we are looking at (so <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> ranges from 1 to 4) and let <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> be a variable that ranges over every word in the dictionary. 
Our table is indexed by <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> and <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> and, when filled in, <img width="51" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg40.png" alt="$ T(i,w)$"> stores the highest &#8220;plausibility score&#8221; of any partial sequence of words where words 1 through <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> have been filled in and the <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> -th word is <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> .</font></p>
<div align="center"><a name="fig:DynBackFill" id="fig:DynBackFill"></a><a name="134"></a></p>
<table>
<caption align="bottom"><strong>Figure 7:</strong> Dynamic Programming: Back Chaining in <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> for a Solution</caption>
<tr>
<td>
<div align="center"><img width="300" height="298" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/DynTableBackFill.png" alt="Image DynTableBackFill"></div>
</td>
</tr>
</table>
</div>
<p><font>If we already had this magic table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> we could find a best possible sequence by &#8220;back chaining.&#8221; We start by finding a fourth word (<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg41.png" alt="$ w_4$"> ) such that <img width="61" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg42.png" alt="$ T(4,w_4)$"> is maximal (in this case &#8220;one&#8221;). We then find a best third word (<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg43.png" alt="$ w_3$"> ) by enumerating all words and picking <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg43.png" alt="$ w_3$"> such that <!-- MATH<br />
 $T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$<br />
 --><br />
<img width="234" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg44.png" alt="$ T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$"> . We continue back until we have found words <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg30.png" alt="$ w_2$"> and <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg45.png" alt="$ w_1$"> to get a complete best sequence. Notice that we work from right to left (backwards) and, except for the starting step, we pick each word to match the calculation we are trying to un-roll, not to be the maximal entry in the column. For instance we pick <!-- MATH<br />
 $w_1 = dial$<br />
 --><br />
<img width="70" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg46.png" alt="$ w_1 = dial$"> even though it does not have the highest score in its column, because <!-- MATH<br />
 $T(1,dial) C_2(dial,nine)<br />
C_3(nine,one) C_4(one,one) = T(4,one)$<br />
 --><br />
<img width="433" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg47.png" alt="$ T(1,dial) C_2(dial,nine) C_3(nine,one) C_4(one,one) = T(4,one)$"> is the maximal complete chain.</font></p>
<p><font>Of course, we don&#8217;t start with the table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> already filled in, so we need a procedure to build it. This procedure is the heart of the Dynamic Programming method (for more examples see: &#8220;Introduction to Algorithms&#8221;&nbsp;[<a href="#IntroductionToAlgorithms">CLRS09</a>]). Notice that <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg48.png" alt="$ T(1,w)$"> can be filled in for all <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> just by plugging in words and computing the critiques <img width="46" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg49.png" alt="$ C_1(w)$"> (i.e. <!-- MATH<br />
 $T(1,w) = C_1(w)$<br />
 --><br />
<img width="118" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg50.png" alt="$ T(1,w) = C_1(w)$"> ). Once all the <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg48.png" alt="$ T(1,w)$"> are filled in we can fill in the <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg51.png" alt="$ T(2,w)$"> with the general (and slightly trickier) formula:</font></p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w)<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="249" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg52.png" alt="$\displaystyle T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w) $"></div>
<p><font>as we illustrate for <img width="74" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg53.png" alt="$ T(2,nine)$"> in Figure&nbsp;<a href="#fig:DynTable">8</a>.</font></p>
<div align="center"><a name="fig:DynTable" id="fig:DynTable"></a><a name="145"></a></p>
<table>
<caption align="bottom"><strong>Figure 8:</strong> Dynamic Programming: Building the Table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"></caption>
<tr>
<td>
<div align="center"><img width="400" height="261" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/DynTableCalculate.png" alt="Image DynTableCalculate"></div>
</td>
</tr>
</table>
</div>
<p><font>The magic of the Dynamic Programming technique is: by being careful to not store too much in the table <img width="51" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg40.png" alt="$ T(i,w)$"> we avoid an explosion in record keeping that would render the method inefficient. Dynamic Programming exploits the small dependence structure encoded in <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> (each box in our diagram depending on only a few arrows) and, as we have shown, can find &#8220;clever&#8221; solutions (such as taking a sub-optimal first word to get better transitions into preferred later words). For those who want more detail on solving this problem we recommend [<a href="#CharniakBook">Cha96</a>] (as our goal is not to fully explain Dynamic Programming, but to demonstrate how it could be applied to the transcription problem as a pre-packaged globalizer).</font></p>
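<p><font>The table building and back chaining steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production solver: the function name <code>best_sequence</code>, the callbacks <code>c_first</code> and <code>c_next</code> (standing in for the critiques), and all of the scores in <code>FIRST</code> and <code>LINK</code> are hypothetical. The back chaining step picks the predecessor that maximizes the product rather than testing for exact equality, which finds the same word while avoiding floating point comparison problems.</font></p>

```python
def best_sequence(words, n_positions, c_first, c_next):
    """Find a highest scoring word sequence by Dynamic Programming.

    c_first(w) plays the role of C_1(w); c_next(i, v, w) plays the role of
    C_i(v, w) for positions i = 2, 3, ..., n_positions.
    """
    # T[i - 1][w] holds T(i, w): the best score of any partial sequence
    # that fills positions 1..i and ends with word w.
    T = [{w: c_first(w) for w in words}]
    for i in range(2, n_positions + 1):
        prev = T[-1]
        T.append({w: max(prev[v] * c_next(i, v, w) for v in words) for w in words})

    # Back chaining: start from the best final word, then repeatedly pick the
    # predecessor whose table entry reproduces the score being un-rolled.
    last = max(words, key=lambda w: T[-1][w])
    sequence = [last]
    for i in range(n_positions, 1, -1):
        w = sequence[0]
        v = max(words, key=lambda v: T[i - 2][v] * c_next(i, v, w))
        sequence.insert(0, v)
    return sequence, T[-1][last]


# Invented scores: "nine" is the best first word when taken alone,
# but "dial" starts the best complete chain.
FIRST = {"dial": 0.3, "nine": 0.4, "one": 0.2, "won": 0.1}
LINK = {("dial", "nine"): 0.5, ("nine", "one"): 0.5, ("one", "one"): 0.2}


def c_first(w):
    return FIRST[w]


def c_next(i, v, w):
    # Unlisted transitions get a small default score.
    return LINK.get((v, w), 0.01)


sequence, score = best_sequence(list(FIRST), 4, c_first, c_next)
```

<p><font>The work done is proportional to the number of positions times the square of the dictionary size, rather than the number of complete sequences, because each table entry keeps only a score and nothing about how it was reached.</font></p>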
<p><font>In this example the local step was the graph link based critiques and the globalization step was Dynamic Programming. The separation of concerns from the local scoring to the globalizing step is a strength of the local to global principle.</font></p>
<h2><a name="SECTION00033000000000000000" id="SECTION00033000000000000000">Machine Learning</a></h2>
<p><font>Our final example application is machine learning. Machine learning is loosely defined as computer programs that adapt or learn from data. Thomas Mitchell helps distinguish this activity as a specialty of artificial intelligence that concentrates on &#8220;well-posed learning problems.&#8221;&nbsp;[<a href="#MitchellML">Mit97</a>] Trevor Hastie, Robert Tibshirani and Jerome Friedman emphasize the relation to statistics (versus more traditional symbolic AI)&nbsp;[<a href="#TibHat">TH09</a>]. A simple demonstration can be found in [<a href="#MLArt">Mou09b</a>].</font></p>
<p><font>Machine learning is perhaps the strongest example of the local to global principle, and our treatment is inspired by the work of Kristin P. Bennett and Emilio Parrado-Hernandez&nbsp;[<a href="#Bennett:2006p400">BPH06</a>]. In hindsight many machine learning algorithms (each of which has had a turn at being &#8220;the most exciting breakthrough ever&#8221; for a while) can be seen as the pairing of a performance criterion (which we call a local criterion as it applies to one specific set of parameter values at a time) and an optimization method (what we have been calling the globalization step). The work of Bennett and Parrado-Hernandez calls this distinction out and shows how it is not productive to present machine learning systems as unique named monolithic units, but instead to consider how to break them into an objective function and an optimizer. This allows both the choice of better optimizers (such as replacing the inferior gradient descent method wherever it occurs) and explicit control of important concepts such as hypothesis regularization and control of over-fitting (which some algorithms claim to achieve by deliberately using early exit from an inferior optimizer).</font></p>
<p><font>At a &#8220;30,000 foot level&#8221; we can build a table of common machine learning techniques and name what is commonly used to implement their local and global steps. When a machine learning algorithm is defined by what conditions are meant to be true at the optimum we are no longer bound by details of the original implementation and can examine, fix and improve the components.<a name="tex2html17" href="#foot154" id="tex2html17"><sup>7</sup></a> Table&nbsp;<a href="#fig:MachineLearning">1</a> is a crude summary of a wide selection of machine learning algorithms that may be more likely to offend everybody than just offend somebody. But this is also the point: it is the algorithmist&#8217;s job to think fluidly (beyond given names and provenances) and to invent scaffolding to convert partial analogies into practical correspondences.</font></p>
<p></p>
<div align="center"><a name="190"></a></p>
<table>
<caption><strong>Table 1:</strong> Various Machine Learning Techniques</caption>
<tr>
<td>
<div align="center">
<table cellpadding="3" border="1" align="center">
<tr>
<td align="left" valign="top" width="180"><font size="-1">Machine Learning Method</font></td>
<td align="left" valign="top" width="144"><font size="-1">Local Criterion</font></td>
<td align="left" valign="top" width="144"><font size="-1">Globalization Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Linear Regression [<a href="#Breiman:1997p1133">BF97</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Linear Discriminant Analysis [<a href="#Fisher:1936p2576">Fis36</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Logistic Regression [<a href="#Komarek:2008p1742">Kom08</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">logit penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Newton&#8217;s Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Perceptron [<a href="#Beigel:1991p1027">BRS91</a>] [<a href="#Blum:2002p1867">BD02</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">error rate</font></td>
<td align="left" valign="top" width="144"><font size="-1">error based update</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Naive Bayes [<a href="#Maron:2000p2553">MK00</a>] [<a href="#Maron:1961p2566">Mar61</a>] [<a href="#Lewis:1998p105">Lew98</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">frequency tables</font></td>
<td align="left" valign="top" width="144"><font size="-1">arithmetic</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Nearest Neighbor [<a href="#Ailon:2006p872">AC06</a>] [<a href="#Indyk:1999p166">IM99</a>] [<a href="#Andoni:2006p52">AI06</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">Kernel Methods</font></td>
<td align="left" valign="top" width="144"><font size="-1">enumeration,<br />
projection</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Decision Trees [<a href="#bfso:1984">BFSO84</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">information theory</font></td>
<td align="left" valign="top" width="144"><font size="-1">partitioning</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Clustering [<a href="#Cilibrasi:2005p8">CV05</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">partitioning</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">MaxEnt [<a href="#Grunwald:2000p108">Gru00</a>] [<a href="#Grunwald:2004p739">GD04</a>] [<a href="#Skilling:1988p780">Ski88</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">entropy penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Newton&#8217;s Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Neural Net with Back Propagation [<a href="#NNCPE">Hus99</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">sigmoid penalty function</font></td>
<td align="left" valign="top" width="144"><font size="-1">Automatic Differentiation,<br />
steepest descent</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Winnow [<a href="#Kivinen:1995p1836">KWA95</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">error rate</font></td>
<td align="left" valign="top" width="144"><font size="-1">multiplicative error based update</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Boosting [<a href="#Freund:1999p1015">FS99</a>] [<a href="#Breiman:2000p1134">Bre00</a>] [<a href="#Collins:2002p1008">CSS02</a>] [<a href="#Trevisan:2008p2166">TTV08</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">weighted errors,<br />
data re-weighting</font></td>
<td align="left" valign="top" width="144"><font size="-1">Conjugate Gradient</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">HMM [<a href="#Kristjansson:2004p545">KCVM04</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">probability penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Gibbs Sampler</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Latent Dirichlet Allocation [<a href="#Blei:2003p1063">BNJ03</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">KL divergence</font></td>
<td align="left" valign="top" width="144"><font size="-1">Variational Methods</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Support Vector Machine [<a href="#Joachims:1998p406">Joa98</a>] [<a href="#SVMBook">STC00</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">L1 Margin,<br />
Kernel Methods</font></td>
<td align="left" valign="top" width="144"><font size="-1">Quadratic Optimization</font></td>
</tr>
</table>
</div>
<p><a name="fig:MachineLearning" id="fig:MachineLearning"></a></td>
</tr>
</table>
</div>
<p></p>
<p><font>This table is a necessarily crude summary. For example: notice that several known techniques can not even be distinguished from each other by the local and global columns of the table.</font></p>
<p><font>There are a few points we would like to make. Back propagation was considered unique to Neural Nets for quite a while because it was so entwined with the technique that it was not recognized as the simple application of Automatic Differentiation&nbsp;[<a href="#Rall:1996p2473">RC96</a>] that it is. Support Vector Machines (SVM) are remarkable for their uniformly very good choice of component methods (maximum L1 margin objective regularization, Kernel Methods&nbsp;[<a href="#KernBook">STC04</a>] and sophisticated optimization methods&nbsp;[<a href="#Joachims:2006p403">Joa06</a>]). Many of the machine learning methods that SVM outperforms become competitive again when they adopt some of SVM&#8217;s technologies (especially using kernel methods to produce synthetic features).</font></p>
<p><font>Beyond these points we invoke a &#8220;globalizers are pre-packaged&#8221; principle and leave the discussion of machine learning and optimization to our reference: [<a href="#Bennett:2006p400">BPH06</a>]. In this example the local step is a per-example score or penalty and the globalization step is optimization.</font></p>
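<p><font>To make the separation of local criterion and globalizer concrete, here is a small Python sketch of our own (it is a toy illustration, not code from [<a href="#Bennett:2006p400">BPH06</a>]). One local criterion, the square error of a line fit, is handed to two interchangeable globalizers: a closed form linear algebra solve of the normal equations and a plain gradient descent. The data, function names, step size and iteration count are all assumptions chosen for the example.</font></p>

```python
def square_error(params, xs, ys):
    """Local criterion: total squared error of the line y = a + b*x."""
    a, b = params
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))


def solve_by_linear_algebra(xs, ys):
    """Globalizer 1: solve the 2x2 normal equations in closed form."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    b = (n * sxy - sx * sy) / det
    a = (sy - b * sx) / n
    return a, b


def solve_by_gradient_descent(xs, ys, rate=0.01, steps=5000):
    """Globalizer 2: fixed-step gradient descent on the same criterion."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = sum(2 * (a + b * x - y) for x, y in zip(xs, ys))
        grad_b = sum(2 * (a + b * x - y) * x for x, y in zip(xs, ys))
        a, b = a - rate * grad_a, b - rate * grad_b
    return a, b


# Invented data lying close to the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
exact = solve_by_linear_algebra(xs, ys)
approx = solve_by_gradient_descent(xs, ys)
```

<p><font>Both globalizers land on essentially the same parameters because the optimum is defined by the criterion, not by the procedure used to reach it. The closed form solve needs one pass over the data, while the gradient descent needs thousands of passes and a hand-tuned step size.</font></p>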
<h1><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Some Methods</a></h1>
<p><font>The application of the local to global principle is similar to the Feynman &#8220;genius method.&#8221; Feynman&#8217;s method is to always have in mind a list of problems and a list of solution methods. The genius step is: any time you see a new problem or a new solution method, immediately try it against every item from the complementary list&nbsp;[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;Ten Lessons I Wish I Had Been Taught&#8221;]. This deliberate retention and activity greatly increases your problem solving ability. The power of the local to global principle is itself proportional to the number of local methods times the number of globalization strategies. Of course, to even start, the practitioner must already have available a number of candidate local and globalization methods. We list some methods and some guidance on variation and invention.</font></p>
<h2><a name="SECTION00041000000000000000" id="SECTION00041000000000000000">Local Methods</a></h2>
<p><font><img width="100" height="100" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/nails.jpg" alt="Image nails"> Good sources of ideas and analogies for local methods include:</font></p>
<ul>
<li>Introduce a Graph Structure
<p>A graph structure is a network of nodes connected by edges. Use of graphs was demonstrated both in the natural language processing and web page link analysis examples. We can dress up how we solved these problems and say we used a &#8220;Hidden Markov Model&#8221;, but the real power was we encoded our problem in a simple graph. Some problems (especially those from logic or those involving time) are essentially solved once they are translated out of their original form and into graph notation (for an example see: [<a href="#Mount:2000p360">Mou00</a>]).</p>
</li>
<li>Appeal to Physical Conservation Laws
<p>A good example of a physical law is Kirchhoff&#8217;s law or conservation of flow. All of the web page link analysis&#8217;s equations were derived by saying that the attention at a node is essentially the sum of attentions from other nodes (more sophisticated versions of the analysis actually do create and destroy flow, but they do it in a principled way).</p>
</li>
<li>Encode the Problem into an Objective Function
<p>This method is essentially your declaration that you intend to use an optimizer for the globalization step. In operations research this specific technique has long been the practice (with no disrespect: a very productive part of operations research has been translating different problems into linear programs so the simplex method can be applied, for an example see [<a href="#TradeArt">Mou09a</a>]).</p>
</li>
<li>Gradient Like Computations
<p>This includes Gradients, Secants, Lagrangians and other ideas from calculus. Gradients can drive optimizer based globalizers, and techniques like Lagrangians are often powerful enough to use mere inspection as the globalization step.</p>
</li>
<li>Violation Driven Updates
<p>This method is particularly effective when your problem is not amenable to continuous optimization. A good example is the Lin-Kernighan heuristic for solving the traveling salesman problem.[<a href="#Lin:1973p2739">LK73</a>] This heuristic looks at subsets of the problem and suggests improving &#8220;surgeries&#8221; (until no more such improvements are possible).</p>
</li>
<li>Introduction of Symbols
<p>Often, as with the web page link analysis example, you can not specify specific values for the unknowns, but you can specify relationships. You often can then solve for the symbols or introduce additional conditions and use an optimizer to complete the solution (see for example the maximum entropy method as described in [<a href="#Skilling:1988p780">Ski88</a>]).</p>
</li>
<li>Over Specification
<p>If we anticipate using a global step like search, enumeration, summation or integration then over specification is a good local idea.</p>
<p>For example: consider computing the probability that a fair coin flipped 10 times comes up heads exactly 3 times. The easiest way to perform this calculation is to specify exactly which 3 flips come up heads (the local over-specification) and then sum over all choices of 3 out of 10 flips (the global step). In mathematical notation this is:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
P[\text{exactly 3 heads out of 10 flips}] = \binom{10}{3} 2^{-10} \approx 0.117<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="20" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg54.png" alt="$\displaystyle P[$">exactly 3 heads out of 10 flips<img width="157" height="54" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg55.png" alt="$\displaystyle ] = \binom{10}{3} 2^{-10} \approx 0.117 $"></div>
<p>or just under 12%.</p>
</li>
<li>Under Specification
<p>One of the core principles of Dynamic Programming is to forget as much as possible about partial solutions, keeping only partial solution cost and just enough information to extend partial solutions. If you anticipate using something like Dynamic Programming as your globalization step then your goal should be to under specify.</p>
</li>
<li>Tables
<p>A key step of the natural language processing example was the use of tables of past experience to determine which sounds likely corresponded to which words, which words likely followed each other (and so on). Encoding domain knowledge or expertise as probability tables is a very effective problem solving strategy (especially if the globalization strategy is going to be search or Dynamic Programming). In natural language processing examples tables and statistics are <em>much</em> easier to manage than comprehensive rules or grammars.</p>
</li>
<li>Set up as Ranking or Machine Learning Problem
<p>This tactic is especially appropriate if your solution success metric is counts, frequencies or probabilities (instead of having to always be correct or always be optimal).</p>
</li>
</ul>
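<p><font>The coin flipping calculation from the over specification item above can be spelled out directly. In this Python sketch (the function name <code>prob_exactly_k_heads</code> is ours) the local step computes the probability of one fully specified outcome and the global step is a plain sum over every way of choosing which flips are heads.</font></p>

```python
from itertools import combinations


def prob_exactly_k_heads(n_flips, k_heads):
    """Probability that a fair coin shows exactly k_heads heads in n_flips flips."""
    # Local over-specification: the probability that one *specific* set of
    # flips comes up heads and every other flip comes up tails.
    prob_one_choice = 0.5 ** n_flips
    # Global step: sum over every choice of which flips are the heads.
    return sum(prob_one_choice for _ in combinations(range(n_flips), k_heads))


p = prob_exactly_k_heads(10, 3)
```

<p><font>Enumerating the choices is wasteful compared to using the binomial coefficient directly, but it shows the division of labor: the local step never has to reason about more than one fully specified outcome.</font></p>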
<h2><a name="SECTION00042000000000000000" id="SECTION00042000000000000000">Globalization Methods</a></h2>
<p><font><img width="100" height="100" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/hammer.jpg" alt="Image hammer"> The universe of possible globalization methods is very diverse (in particular globalization is not always optimization).</font></p>
<ul>
<li>Search / Enumeration
<p>Search can be slow, but it is always an option to consider. If your problem translates naturally into a graph structure or your solutions are naturally seen as being composed of small pieces, search should be considered. One of the big advantages of using the local phase to formally encode your problem&#8217;s structure and putting search off to the global phase is that you can use advanced search techniques. Once you are freed from your specific problem details it becomes much easier to consider search techniques like branch and bound, A*, game theoretic search and general speed-up techniques like hashing and caching.</p>
</li>
<li>Dynamic Programming
<p>If your problem has a bit more structure (in that partial solutions summarize and compose easily) then you can likely replace search with Dynamic Programming. The advantage is that Dynamic Programming typically offers an incredible speed up when compared to search.</p>
</li>
<li>Optimization
<p>If your problem is continuous (involves numbers instead of discrete or categorical decisions), can be encoded as a reasonable objective function (linear, positive definite quadratic) and has reasonable constraints (linear or convex) then you can immediately apply an optimizer as your globalization step. Typical optimization methods include: conjugate gradient, Newton methods, quasi Newton methods, linear programming and quadratic programming.</p>
</li>
<li>Combinatorial Optimization
<p>If your problem includes &#8220;discrete variables&#8221; (that is, variables that take on one of a fixed set of values instead of values from a numeric range) then you may not be able to apply standard optimization techniques. At this point you may want to use more expensive combinatorial optimization techniques like integer linear programming or constraint satisfaction.</p>
</li>
<li>Fixed Point Methods / Iteration
<p>Fixed point methods are based on the idea: &#8220;incrementally improve until there is no incremental improvement possible.&#8221; If the problem is continuous this is similar to steepest descent. If the problem is discrete then this is similar to the Lin-Kernighan heuristic.</p>
</li>
<li>Linear Algebra
<p>The web page link analysis and optimization examples were essentially solved once we reduced them to linear algebra. If you can write your problem as a linear relationship between unknowns or as the fixed-point of a linear operator (i.e. an <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg56.png" alt="$ x$"> such that <img width="54" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg57.png" alt="$ A x = x$"> ) then you can immediately use linear algebra to solve the problem at very large scale (e.g. web scale).</p>
</li>
<li>Sampling / Problem Kernels
<p>A very successful line of attack on large problems is to reduce to a smaller problem containing most of the essential difficulty. David Karger has produced a number of effective algorithms for graph cuts and flows using a theory of sampling&nbsp;[<a href="#Karger:1998p556">Kar98</a>]. Rod Downey and M. Fellows have demonstrated an effective theory of &#8220;problem kernels&#8221; that finds solutions by focusing on smaller sub-problems (on which we can afford to use more expensive procedures)&nbsp;[<a href="#DF98">DF98</a>].</p>
</li>
<li>Amortized Analysis / Economic Mechanism Methods
<p>Daniel Sleator and Robert Tarjan&#8217;s ideas of amortized analysis&nbsp;[<a href="#Sleator:1985p168">ST85</a>] allow approximation schemes similar to problem kernels. The method is to approximately optimize by pairing a bunch of unavoidable large penalties (conditions we can&#8217;t meet) with some accounting credits (say bonuses from other conditions we are meeting very well). We then isolate these paired items and optimize the rest of the problem exactly. The technique often works by showing the approximation can not be too bad because, due to the pairing of large penalties to good credits, there can not be too many large penalties. An informal example is: if it is impossible to pick someplace where all of an office will eat for lunch, perhaps you can solve the problem by paying one person to accept a restaurant they do not like (if the removal of their objection opens up a venue that is acceptable to everybody else).</p>
</li>
<li>Relaxation / Homotopic methods
<p>These methods involve changing hard constraints to soft penalties (so allowing the constraints to be violated, but at a slowly increasing cost). After such a relaxation the homotopic (or continuous deformation) method is to increase the cost of violation and re-solve to try and get a trajectory of better and better nearly acceptable solutions that point to a possible overall solution.</p>
</li>
</ul>
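<p>As a minimal sketch of the fixed-point idea in the Linear Algebra item above, the Python below repeatedly applies a matrix to a vector until it settles near an x with A x = x (power iteration). The three-page link graph, the matrix <tt>A</tt> and the helper names <tt>apply_matrix</tt> and <tt>fixed_point</tt> are illustrative assumptions, not taken from the article. The iteration is only assumed to converge here because this particular matrix is column-stochastic for a connected, aperiodic graph.</p>

```python
# Hypothetical 3-page link graph (an assumption for illustration):
#   page 0 links to pages 1 and 2, page 1 links to page 2, page 2 links to page 0.
# A[i][j] is the probability of moving from page j to page i,
# so every column sums to 1 (a column-stochastic matrix).
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]


def apply_matrix(matrix, vector):
    """Return the matrix-vector product as a plain list."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def fixed_point(matrix, iterations=200):
    """Approximate x with A x = x by repeated multiplication."""
    n = len(matrix)
    x = [1.0 / n] * n  # start from the uniform vector
    for _ in range(iterations):
        x = apply_matrix(matrix, x)
        total = sum(x)
        x = [v / total for v in x]  # keep the entries summing to 1
    return x


x = fixed_point(A)
print([round(v, 6) for v in x])
```

<p>For this toy graph the exact fixed point is (0.4, 0.2, 0.4). At web scale one would store the matrix sparsely rather than as nested lists, but the loop has the same shape.</p>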
<h1><a name="SECTION00050000000000000000" id="SECTION00050000000000000000">Conclusion</a></h1>
<p><font>The purpose of this article has been to make more visible an idea we call the local to global principle. This principle is an organizing tool useful both in designing and analyzing a wide variety of applications. Essentially the whole point of this writeup is to set up enough framework to quickly write down a table of advice such as Table&nbsp;<a href="#fig:ProblemTable">2</a> (and for such a table to mean something).</font></p>
<p></p>
<div align="center"><a name="227"></a></p>
<table>
<caption><strong>Table 2:</strong> Various Applications, Local Steps and Global Steps</caption>
<tr>
<td>
<div align="center">
<table cellpadding="3" border="1" align="center">
<tr>
<td align="left"><font size="-1">Example</font></td>
<td align="left"><font size="-1">Local Step</font></td>
<td align="left"><font size="-1">Global Step</font></td>
</tr>
<tr>
<td align="left"><font size="-1">speech transcription</font></td>
<td align="left"><font size="-1">tables</font></td>
<td align="left"><font size="-1">Dynamic Programming</font></td>
</tr>
<tr>
<td align="left"><font size="-1">PageRank</font></td>
<td align="left"><font size="-1">graph structure, linear equations</font></td>
<td align="left"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left"><font size="-1">machine learning</font></td>
<td align="left"><font size="-1">objective function</font></td>
<td align="left"><font size="-1">optimization</font></td>
</tr>
</table>
</div>
<p><a name="fig:ProblemTable" id="fig:ProblemTable"></a></td>
</tr>
</table>
</div>
<p></p>
<p><font>The principle is not universal; not everything can be fit into such a table. For example the local to global decoupling is <em>not</em> a feature of the famous EM algorithm&nbsp;[<a href="#Dempster:1977p761">DLR77</a>], which depends on mixing predictions and corrections.</font></p>
<p><font>To conclude: the recipe is as follows. If you come to a problem with a large shopping bag of possible ways to build local criteria and powerful globalization procedures then you stand a very good chance of solving the problem quickly. Also, if you keep the local to global principle in mind you are more likely to identify and retain potential local tricks and globalizers when you see them, and thus have a larger, more nimble set of tools available to solve problems when the time comes.</font></p>
<h2><a name="SECTION00060000000000000000" id="SECTION00060000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Ailon:2006p872" id="Ailon:2006p872">AC06</a></dt>
<dd>Nir Ailon and Bernard Chazelle, <i>Approximate nearest neighbors and the fast johnson-lindenstrauss transform</i>, STOC (2006).</dd>
<dt><a name="Andoni:2006p52" id="Andoni:2006p52">AI06</a></dt>
<dd>Alexandr Andoni and Piotr Indyk, <i>Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions</i>.</dd>
<dt><a name="Blum:2002p1867" id="Blum:2002p1867">BD02</a></dt>
<dd>Avrim Blum and John Dunagan, <i>Smoothed analysis of the perceptron algorithm for linear programming</i>, SODA (2002), 11.</dd>
<dt><a name="DynamicProgramming" id="DynamicProgramming">Bel57</a></dt>
<dd>Richard Bellman, <i>Dynamic programming</i>, Princeton University Press, 1957.</dd>
<dt><a name="Breiman:1997p1133" id="Breiman:1997p1133">BF97</a></dt>
<dd>Leo Breiman and Jerome&nbsp;H Friedman, <i>Predicting multivariate responses in multiple linear regression</i>, Journal of the Royal Statistical Society, Series B (Methodological) <b>59</b> (1997), no.&nbsp;1, 3-54.</dd>
<dt><a name="bfso:1984" id="bfso:1984">BFSO84</a></dt>
<dd>Leo Breiman, Jerome Friedman, Charles&nbsp;J. Stone, and R.&nbsp;A. Olshen, <i>Classification and regression trees</i>, Chapman &amp; Hall/CRC, January 1984.</dd>
<dt><a name="Blei:2003p1063" id="Blei:2003p1063">BNJ03</a></dt>
<dd>David&nbsp;M Blei, Andrew&nbsp;Y Ng, and Michael&nbsp;I Jordan, <i>Latent dirichlet allocation</i>, Journal of Machine Learning Research <b>3</b> (2003), 993-1022.</dd>
<dt><a name="Bennett:2006p400" id="Bennett:2006p400">BPH06</a></dt>
<dd>Kristin&nbsp;P. Bennett and Emilio Parrado-Hernandez, <i>The interplay of optimization and machine learning research</i>, Journal of Machine Learning Research <b>7</b> (2006), 1265-1281.</dd>
<dt><a name="Breiman:2000p1134" id="Breiman:2000p1134">Bre00</a></dt>
<dd>Leo Breiman, <i>Special invited paper. additive logistic regression: A statistical view of boosting: Discussion</i>, Ann. Statist. <b>28</b> (2000), no.&nbsp;2, 374-377.</dd>
<dt><a name="Beigel:1991p1027" id="Beigel:1991p1027">BRS91</a></dt>
<dd>R&nbsp;Beigel, N&nbsp;Reingold, and D&nbsp;Spielman, <i>The perceptron strikes back</i>, Structure in Complexity Theory Conference <b>6</b> (1991), 286-291.</dd>
<dt><a name="CharniakBook" id="CharniakBook">Cha96</a></dt>
<dd>Eugene Charniak, <i>Statistical language learning</i>, MIT Press, 1996.</dd>
<dt><a name="Charniak:1997p1484" id="Charniak:1997p1484">Cha97</a></dt>
<dd>Eugene Charniak, <i>Statistical techniques for natural language parsing</i>, AI Magazine <b>18</b> (1997), no.&nbsp;4, 33-44.</dd>
<dt><a name="IntroductionToAlgorithms" id="IntroductionToAlgorithms">CLRS09</a></dt>
<dd>Thomas&nbsp;H. Cormen, Charles&nbsp;E. Leiserson, Ronald&nbsp;L. Rivest, and Clifford Stein, <i>Introduction to algorithms</i>, MIT Press, 2009.</dd>
<dt><a name="Collins:2002p1008" id="Collins:2002p1008">CSS02</a></dt>
<dd>Michael Collins, Robert&nbsp;E Schapire, and Yoram Singer, <i>Logistic regression, adaboost and bregman distances</i>, Machine Learning <b>48</b> (2002), no.&nbsp;1/2/3, 30.</dd>
<dt><a name="Cilibrasi:2005p8" id="Cilibrasi:2005p8">CV05</a></dt>
<dd>Rudi Cilibrasi and Paul&nbsp;M.B Vitanyi, <i>Clustering by compression</i>, IEEE Transactions on Information Theory <b>51</b> (2005), no.&nbsp;4, 1523-1545.</dd>
<dt><a name="DF98" id="DF98">DF98</a></dt>
<dd>Rod&nbsp;G. Downey and M.&nbsp;R. Fellows, <i>Parameterized complexity</i>, Monographs in Computer Science, Springer, November 1998.</dd>
<dt><a name="Dempster:1977p761" id="Dempster:1977p761">DLR77</a></dt>
<dd>A&nbsp;P Dempster, N&nbsp;M Laird, and D&nbsp;B Rubin, <i>Maximum likelihood from incomplete data via the em algorithm</i>, Journal of the Royal Statistical Society, Series B (Methodological) <b>39</b> (1977), no.&nbsp;1, 1-38.</dd>
<dt><a name="Fisher:1936p2576" id="Fisher:1936p2576">Fis36</a></dt>
<dd>Ronald&nbsp;A Fisher, <i>The use of multiple measurements in taxonomic problems</i>, Annals of Eugenics <b>7</b> (1936), 179-188.</dd>
<dt><a name="Freund:1999p1015" id="Freund:1999p1015">FS99</a></dt>
<dd>Yoav Freund and Robert&nbsp;E Schapire, <i>A short introduction to boosting</i>, Journal of Japanese Society for Artificial Intelligence <b>14</b> (1999), no.&nbsp;5, 771-780.</dd>
<dt><a name="Grunwald:2004p739" id="Grunwald:2004p739">GD04</a></dt>
<dd>Peter&nbsp;D Grunwald and A&nbsp;Philip Dawid, <i>Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory</i>, Ann. Statist. <b>32</b> (2004), no.&nbsp;4, 1367-1433.</dd>
<dt><a name="Grunwald:2000p108" id="Grunwald:2000p108">Gru00</a></dt>
<dd>PD&nbsp;Grunwald, <i>Maximum entropy and the glasses you are looking through</i>, Conference on Uncertainty in Artificial Intelligence (2000), 238-246.</dd>
<dt><a name="Halevy:2009p2327" id="Halevy:2009p2327">HNP09</a></dt>
<dd>Alon Halevy, Peter Norvig, and Fernando Pereira, <i>The unreasonable effectiveness of data</i>, IEEE Intelligent Systems (2009).</dd>
<dt><a name="NNCPE" id="NNCPE">Hus99</a></dt>
<dd>Dirk Husmeier, <i>Neural networks for conditional probability estimation</i>, Springer, 1999.</dd>
<dt><a name="Indyk:1999p166" id="Indyk:1999p166">IM99</a></dt>
<dd>Piotr Indyk and Rajeev Motwani, <i>Approximate nearest neighbors: Towards removing the curse of dimensionality</i>.</dd>
<dt><a name="Joachims:1998p406" id="Joachims:1998p406">Joa98</a></dt>
<dd>Thorsten Joachims, <i>Making large-scale svm learning practical</i>, Advances in Kernel Methods &#8211; Support Vector Learning (1998).</dd>
<dt><a name="Joachims:2006p403" id="Joachims:2006p403">Joa06</a></dt>
<dd>Thorsten Joachims, <i>Training linear svms in linear time</i>, KDD (2006).</dd>
<dt><a name="Karger:1998p556" id="Karger:1998p556">Kar98</a></dt>
<dd>David&nbsp;R Karger, <i>Randomization in graph optimization problems: A survey</i>, Optima: Mathematical Programming Society Newsletter <b>58</b> (1998).</dd>
<dt><a name="Kristjansson:2004p545" id="Kristjansson:2004p545">KCVM04</a></dt>
<dd>Trausti Kristjansson, Aron Culotta, Paul Viola, and Andrew&nbsp;Kachites McCallum, <i>Interactive information extraction with constrained conditional random fields</i>, AAAI (2004).</dd>
<dt><a name="Kleinberg:1997p32" id="Kleinberg:1997p32">Kle97</a></dt>
<dd>Jon&nbsp;M Kleinberg, <i>Authoritative sources in a hyperlinked environment</i>, ACM SIAM Symposium on Discrete Algorithms (1997).</dd>
<dt><a name="Komarek:2008p1742" id="Komarek:2008p1742">Kom08</a></dt>
<dd>Paul Komarek, <i>Logistic regression for data mining and high-dimensional classification</i>, CMU CS Thesis (2008), 138.</dd>
<dt><a name="Kivinen:1995p1836" id="Kivinen:1995p1836">KWA95</a></dt>
<dd>J&nbsp;Kivinen, Manfred&nbsp;K Warmuth, and P&nbsp;Auer, <i>The perceptron algorithm vs. winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant</i>, COLT (1995), 289-296.</dd>
<dt><a name="Lewis:1998p105" id="Lewis:1998p105">Lew98</a></dt>
<dd>David&nbsp;D Lewis, <i>Naive (bayes) at forty: The independence assumption in information retrieval</i>, find journal (1998).</dd>
<dt><a name="Lin:1973p2739" id="Lin:1973p2739">LK73</a></dt>
<dd>S&nbsp;Lin and BW&nbsp;Kernighan, <i>An effective heuristic algorithm for the traveling-salesman problem</i>, Operations Research (1973), 498-516.</dd>
<dt><a name="Maron:1961p2566" id="Maron:1961p2566">Mar61</a></dt>
<dd>M&nbsp;E Maron, <i>Automatic indexing: An experimental inquiry</i>, RAND Technical Report (1961), 404-417.</dd>
<dt><a name="HTSMH" id="HTSMH">MF00</a></dt>
<dd>Zbigniew Michalewicz and David&nbsp;B. Fogel, <i>How to solve it: Modern heuristics</i>, Springer, 2000.</dd>
<dt><a name="Mill" id="Mill">Mil02</a></dt>
<dd>John&nbsp;Stuart Mill, <i>A system of logic</i>, University Press of the Pacific, 2002.</dd>
<dt><a name="MitchellML" id="MitchellML">Mit97</a></dt>
<dd>Thomas Mitchell, <i>Machine learning</i>, McGraw-Hill, 1997.</dd>
<dt><a name="Maron:2000p2553" id="Maron:2000p2553">MK00</a></dt>
<dd>M&nbsp;E Maron and J&nbsp;L Kuhns, <i>On relevance, probabilistic indexing and information retrieval</i>, 1960 (2000), 1-29.</dd>
<dt><a name="Mount:2000p360" id="Mount:2000p360">Mou00</a></dt>
<dd>John&nbsp;A Mount, <i>Automatic detection of potential deadlock</i>, Dr. Dobbs Journal (2000).</dd>
<dt><a name="TradeArt" id="TradeArt">Mou09a</a></dt>
<dd>John Mount, <i>Automatic generation and testing of un-rolls for profitable technical trades</i>, <a href="http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/">http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/</a>, 2009.</dd>
<dt><a name="MLArt" id="MLArt">Mou09b</a></dt>
<dd>John Mount, <i>A demonstration of data mining</i>, <a href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/</a>, 2009.</dd>
<dt><a name="Page:1998p2689" id="Page:1998p2689">PBMW98</a></dt>
<dd>Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, <i>The pagerank citation ranking: Bringing order to the web</i>, <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768</a> (1998).</dd>
<dt><a name="Polya1" id="Polya1">Pol54a</a></dt>
<dd>G.&nbsp;Polya, <i>Induction and analogy in mathematics</i>, Princeton University Press, 1954.</dd>
<dt><a name="Polya2" id="Polya2">Pol54b</a></dt>
<dd>G.&nbsp;Polya, <i>Patterns of plausible inference</i>, Princeton University Press, 1954.</dd>
<dt><a name="citeulike:679515" id="citeulike:679515">Pol71</a></dt>
<dd>G.&nbsp;Polya, <i>How to solve it</i>, Princeton University Press, November 1971.</dd>
<dt><a name="Rall:1996p2473" id="Rall:1996p2473">RC96</a></dt>
<dd>Louis&nbsp;B Rall and George&nbsp;F Corliss, <i>An introduction to automatic differentiation</i>, SIAM: Computational Differentiation: Techniques, Applications and Tools (1996), 1-18.</dd>
<dt><a name="IndiscreteThoughts" id="IndiscreteThoughts">Rot97</a></dt>
<dd>Gian-Carlo Rota, <i>Indiscrete thoughts</i>, Birkhauser, 1997.</dd>
<dt><a name="Skilling:1988p780" id="Skilling:1988p780">Ski88</a></dt>
<dd>John Skilling, <i>The axioms of maximum entropy</i>, Maximum Entropy and Bayesian Methods in Science and Engineering <b>1</b> (1988), no.&nbsp;173-187.</dd>
<dt><a name="Sleator:1985p168" id="Sleator:1985p168">ST85</a></dt>
<dd>Daniel&nbsp;Dominic Sleator and Robert&nbsp;Endre Tarjan, <i>Amortized efficiency of list update and paging rules</i>, Communications of the ACM <b>28</b> (1985), no.&nbsp;2.</dd>
<dt><a name="SVMBook" id="SVMBook">STC00</a></dt>
<dd>John Shawe-Taylor and Nello Cristianini, <i>Support vector machines</i>, Cambridge University Press, 2000.</dd>
<dt><a name="KernBook" id="KernBook">STC04</a></dt>
<dd>John Shawe-Taylor and Nello Cristianini, <i>Kernel methods for pattern analysis</i>, Cambridge University Press, 2004.</dd>
<dt><a name="Strang" id="Strang">Str76</a></dt>
<dd>Gilbert Strang, <i>Linear algebra and its applications</i>, Academic Press, Inc., 1976.</dd>
<dt><a name="TibHat" id="TibHat">TH09</a></dt>
<dd>Trevor&nbsp;Hastie, Robert&nbsp;Tibshirani, and Jerome&nbsp;Friedman, <i>The elements of statistical learning: Data mining, inference and prediction</i>, Springer, 2009.</dd>
<dt><a name="Trevisan:2008p2166" id="Trevisan:2008p2166">TTV08</a></dt>
<dd>Luca Trevisan, Madhur Tulsiani, and Salil Vadhan, <i>Regularity, boosting, and efficiently simulating every high-entropy distribution</i>, Electronic Colloquium on Computational Complexity (2008), 18.</dd>
<dt><a name="Zeilberger:1995p277" id="Zeilberger:1995p277">Zei95</a></dt>
<dd>Doron Zeilberger, <i>The method of undetermined generalization and specialization illustrated with fred galvin&#8217;s amazing proof of the dinitz conjecture</i>, <a href="http://arxiv.org/abs/math/9506215">http://arxiv.org/abs/math/9506215</a>, 1995.</dd>
</dl>
<h1><a name="SECTION00070000000000000000" id="SECTION00070000000000000000">Acknowledgement</a></h1>
<p><font><font>A thank you to readers who supplied help and comments on earlier drafts.</font></font></p>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot21" id="foot21">&#8230; Mount</a><a href="#tex2html3"><sup>1</sup></a></dt>
<dd>email: <tt><a name="tex2html1" href="mailto:jmount@win-vector.com" id="tex2html1">mailto:jmount@win-vector.com</a></tt> web: <tt><a name="tex2html2" href="http://www.win-vector.com/" id="tex2html2">http://www.win-vector.com/</a></tt></dd>
<dt><a name="foot244" id="foot244">&#8230; principle.</a><a href="#tex2html4"><sup>2</sup></a></dt>
<dd>The pre-existing practice that comes closest to the local to global principle is found in operations research, where encoding a problem to be solved by an optimizer is a central technique. We claim the natural statement of the local to global principle is more general than <font><em>always</em> encoding constraints for a particular optimizer (in particular globalization is not always optimization).</font></dd>
<dt><font><a name="foot43" id="foot43">&#8230; structure</a><a href="#tex2html6"><sup>4</sup></a></font></dt>
<dd><font>By &#8220;link structure&#8221; we mean which web pages link to which other web pages.</font></dd>
<dt><font><a name="foot45" id="foot45">&#8230; graph</a><a href="#tex2html7"><sup>5</sup></a></font></dt>
<dd><font>Remember, a graph is a diagram consisting of nodes and edges (here depicted as arrows).</font></dd>
<dt><font><a name="foot245" id="foot245">&#8230; features</a><a href="#tex2html9"><sup>6</sup></a></font></dt>
<dd><font>For example the model could account for:</font></p>
<ul>
<li>surfers entering and leaving the model</li>
<li>link odds that vary where they are on a page</li>
<li>surfers staying on a page proportional to how much text is on the page</li>
<li>matching known traffic and click behavior where we have such data.</li>
</ul>
<p><font>For simplicity we will just stick with the given example.</font></dd>
<dt><font><a name="foot154" id="foot154">&#8230; components.</a><a href="#tex2html17"><sup>7</sup></a></font></dt>
<dd><font>When a system is named and defined as an exact set of procedures the system can, by definition, not be improved. This is because with any change in procedure we have a new system that no longer matches the original definition and therefore requires a new name.</font></dd>
</dl>
<p><font><br /></font></p>
<hr />
<address><font>John Mount 2009-11-11</font></address>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Permanent Link: Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Permanent Link: Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google AdSense Channels IDs and the Cramer Rao Inequality</title>
		<link>http://www.win-vector.com/blog/2009/10/google-adsense-channels-ids-and-the-cramer-rao-inequality/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=google-adsense-channels-ids-and-the-cramer-rao-inequality</link>
		<comments>http://www.win-vector.com/blog/2009/10/google-adsense-channels-ids-and-the-cramer-rao-inequality/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 22:07:07 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Administrativia]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[AdSense]]></category>
		<category><![CDATA[AdSense Channel]]></category>
		<category><![CDATA[Channel ID]]></category>
		<category><![CDATA[Cramer-Rao]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=898</guid>
		<description><![CDATA[&#8220;Comparing Apples and Oranges: Two Examples of the Limits of Statistical Inference, With an Application to Google Advertising Markets&#8221; is our analysis of Google AdSense Channel IDs and our use of the Cramer Rao bound to show that these IDs fundamentally limit what participants in the Google online advertising market can measure (and therefore in [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/06/yaygda-yet-another-yahoo-google-deal-article/' rel='bookmark' title='Permanent Link: YAYGDA (Yet Another Yahoo Google Deal Article)'>YAYGDA (Yet Another Yahoo Google Deal Article)</a></li>
<li><a href='http://www.win-vector.com/blog/2007/06/new-paper/' rel='bookmark' title='Permanent Link: New Paper'>New Paper</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Permanent Link: Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.win-vector.com/SelectedPapers/files/ComparingApplesAndOrangesProblemsWithAdsense.pdf">&#8220;Comparing Apples and Oranges: Two Examples of the Limits of Statistical Inference, With an Application to Google Advertising Markets&#8221;</a> is our analysis of Google AdSense Channel IDs and our use of the Cramer Rao bound to show that these IDs fundamentally limit what participants in the Google online advertising market can measure (and therefore in turn limit what these players can do).<br />
<span id="more-898"></span><br />
We also include an entry-level exposition and examples of what the Cramer Rao Inequality is and how it works.</p>
<p>This is a repost of an older paper, but a few people have pointed out they were put off by the incredibly uninformative title of the original post &#8220;<a href="http://www.win-vector.com/blog/2007/06/new-paper/">New Paper</a>.&#8221;</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/06/yaygda-yet-another-yahoo-google-deal-article/' rel='bookmark' title='Permanent Link: YAYGDA (Yet Another Yahoo Google Deal Article)'>YAYGDA (Yet Another Yahoo Google Deal Article)</a></li>
<li><a href='http://www.win-vector.com/blog/2007/06/new-paper/' rel='bookmark' title='Permanent Link: New Paper'>New Paper</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Permanent Link: Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/10/google-adsense-channels-ids-and-the-cramer-rao-inequality/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A Discrete Model Gauging Market Efficiency</title>
		<link>http://www.win-vector.com/blog/2009/09/a-discrete-model-gauging-market-efficiency/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=a-discrete-model-gauging-market-efficiency</link>
		<comments>http://www.win-vector.com/blog/2009/09/a-discrete-model-gauging-market-efficiency/#comments</comments>
		<pubDate>Wed, 09 Sep 2009 05:34:23 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Finance]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Quantitative Finance]]></category>
		<category><![CDATA[Combinatorial Markets]]></category>
		<category><![CDATA[Discrete Markets]]></category>
		<category><![CDATA[Efficient Markets]]></category>
		<category><![CDATA[Information Taker]]></category>
		<category><![CDATA[Preditory Traders]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=809</guid>
		<description><![CDATA[New paper: A Discrete Model Gauging Market Efficiency PDF We highly recommend reading the PDF version, but please find below a HTML translation of the paper. We follow up on some interesting work from the literature and explore some conditions that allow large predatory traders to dominate markets. A Discrete Model Gauging Market Efficiency John [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/03/what-does-the-market-think/' rel='bookmark' title='Permanent Link: What does the Market Think?'>What does the Market Think?</a></li>
<li><a href='http://www.win-vector.com/blog/2009/03/it-is-not-all-the-quants-fault/' rel='bookmark' title='Permanent Link: It is not all the quants&#8217; fault.'>It is not all the quants&#8217; fault.</a></li>
<li><a href='http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/' rel='bookmark' title='Permanent Link: Paper on stock trading'>Paper on stock trading</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>New paper: A Discrete Model Gauging Market Efficiency <a href="http://www.win-vector.com/dfiles/DiscreteModel.pdf">PDF</a> </p>
<p>We <em>highly</em> recommend reading the PDF version, but please find below an HTML translation of the paper.</p>
<p>We follow up on some interesting work from the literature and explore some conditions that allow large predatory traders to dominate markets.</p>
<p><span id="more-809"></span></p>
<h1 align="center">A Discrete Model Gauging Market Efficiency</h1>
<p align="center"><strong>John Mount<a name="tex2html3" href="#foot12" id="tex2html3"><sup>1</sup></a></strong></p>
<p></p>
<p align="center"><b>Date:</b> September 8, 2009</p>
<hr />
<h3>Abstract:</h3>
<div>We describe a discrete market model appropriate for quantifying certain desirable and un-desirable features of financial markets. This model allows direct exploration of the impact of different market structures on efficiency and fairness. We conclude by demonstrating that a single trader with a large budget can generate profit while making the market not profitable for smaller traders.</div>
<h1><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">Introduction</a></h1>
<p>Stochastic calculus techniques[<a href="#citeulike:2080469">KS01</a>] (such as Brownian Motion, Levy Processes[<a href="#Applebaum:2004p1042">App04</a>], Wiener Processes or the Ito Calculus[<a href="#citeulike:2635904">Ste03b</a>,<a href="#Steele:2003p2288">Ste03a</a>]) are not the only abstraction useful in thinking about financial markets. Real markets do not meet the typical assumptions of the above systems (infinitely divisible time, no trade costs, no long-term memory and no large actors) and routinely fail goodness of fit tests against such models[<a href="#Lo:2001p1619">LM01</a>,<a href="#Lo:2005p2193">Lo05</a>]. In fact there is a simple arbitrage argument that markets would have summary statistics identical to Ito processes even if they are not such processes.[<a href="#Shafer:2004p1497">Sha04</a>] When studying which features make a market fair or efficient we can not rely on mathematical tools that assume and depend on fair and efficient markets.</p>
<p>To build the tools for our study we follow up on some of the ideas of Hasanhodzic, Lo and Viola [<a href="#Hasanhodzic:2009p2605">HLV09</a>] and propose a specific discrete market model (as distinguished from more traditional continuous mathematics as in [<a href="#MertonCTF">Mer99</a>]) that allows us to effectively apply ideas from game theory[<a href="#AlgGT">NNV07</a>] and theoretical computer science. We show how to solve for optimal trading strategies in this market model and conclude with an illustration of how a single trader can dominate a market by merely exercising a larger budget.</p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">Outline</a></h1>
<p>We will proceed as follows:</p>
<ul>
<li>Define our market model</li>
<li>Solve for optimal trading strategies in our market model</li>
<li>Perform the experiment of adding a single large trader to our model</li>
<li>Draw conclusions</li>
<li>Suggest further research.</li>
</ul>
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">The Market Model</a></h1>
<p>Our goal is to investigate if even perfect traders are vulnerable to an additional trader that has a larger budget. To do this we must have a market model where at least:</p>
<ul>
<li>We can solve for the optimal trading strategy</li>
<li>There is a reason to trade (profits are available).</li>
</ul>
<p>We propose such a market model below.</p>
<h2><a name="SECTION00031000000000000000" id="SECTION00031000000000000000">The Market</a></h2>
<p>To simplify the description of traders (and to minimize the amount of state we have to carry) we propose a market model that abstracts out price and many other features.</p>
<p>Our market model is represented as an ordered sequence of the symbols &#8220;<img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg3.png" alt="$ +$"/> &#8221;, &#8220;0 &#8221; and &#8220;<img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg4.png" alt="$ -$"/> &#8221;. A &#8220;<img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg3.png" alt="$ +$"/> &#8221; represents a recent price increase, a &#8220;0 &#8221; represents no change and a &#8220;<img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg4.png" alt="$ -$"/> &#8221; represents a recent price decrease. We are deliberately avoiding direct representation of real market quantities such as absolute price, volume, inventory, bid/ask books, margin and elasticity. Time is represented by regular &#8220;ticks&#8221; or the simple advance to the next symbol in the market sequence. We will describe how the next symbol in the market sequence is determined after we have described trades.</p>
<h3><a name="SECTION00031100000000000000" id="SECTION00031100000000000000">Type 1 Trades</a></h3>
<p>The first type of trade we allow in this market is a &#8220;round trip.&#8221; A round trip is one of the two following trades:</p>
<ul>
<li>&#8220;a long round trip&#8221;
<p>An immediate buy in the current time tick followed by an automatic (forced) sell on the next time tick. This trade is considered profitable if the next market symbol is a <img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg3.png" alt="$ +$"/> as the sell then happens at a higher price than the initial buy, yielding a profit.</p>
</li>
<li>&#8220;a short round trip&#8221;
<p>An immediate sell in the current time tick followed by an automatic (forced) buy on the next time tick. This trade is considered profitable if the next market symbol is a <img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg4.png" alt="$ -$"/> as the buy then happens at a lower price than the initial sell, yielding a profit.</p>
</li>
</ul>
<p>The forced nature of these round trip trades allows us to avoid modeling inventory and margin. Round trip trades are meant to abstract some of the aspects of high-frequency trading strategies.</p>
<h3><a name="SECTION00031200000000000000" id="SECTION00031200000000000000">Type 2 Trades</a></h3>
<p>The second type of trade we allow is a &#8220;simple buy&#8221; or &#8220;simple sell&#8221; on the next time tick. This type of trade is meant to abstract some of the properties of a trader who is not so close to the market and has market-external interests (like inventory, customers, margin, fundamental knowledge <img width="28" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg5.png" alt="$ \cdots$"/> ).</p>
<h3><a name="SECTION00031300000000000000" id="SECTION00031300000000000000">Market Evolution</a></h3>
<p>The market model evolves forward as follows. The second half of each type 1 trade (the sell in the long round trip and the buy in the short round trip) is entered as a net impact on the upcoming time tick. So: a long round trip actually generates a sell or downward price impact on the next market tick (and a short round trip generates a buy or upward price impact on the next market tick). This &#8220;reverse impact&#8221; is in our model because we are not allowing these traders to hold inventory, and in a &#8220;buy followed by a sell&#8221; pattern the initial buy impact is further in the past than the sell (so it should have a lesser future impact). This is also similar to how in real markets a large net short position represents an upward influence on price, as the market participants know the short position must eventually be covered.</p>
<p>Each type 2 (or simple) trade is also entered directly as market impact. So, as expected, simple buy trades generate upward price impact and simple sell trades generate downward price impact.</p>
<p>To determine the next market symbol we sum the net impact entered against the next tick: if the net impact is positive the symbol is a <img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg3.png" alt="$ +$"/> , if it is zero the symbol is 0 and if it is negative the symbol is a <img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg4.png" alt="$ -$"/> . This differs both from the market model in [<a href="#Hasanhodzic:2009p2605">HLV09</a>] (where price is additive) and from real markets (where elasticity of price with respect to trades is very complicated).</p>
<p>For example: if three traders choose &#8220;long round trip&#8221; (betting the market will go up in the short term) and one trader chooses &#8220;simple buy&#8221; (betting the market will go up long term) then the net impact on the next tick is <!-- MATH<br />
 $(-1) + (-1) + (-1) + (+1) = -2$<br />
 --><br />
<img width="257" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg6.png" alt="$ (-1) + (-1) + (-1) + (+1) = -2$"/> and the next symbol is <img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg4.png" alt="$ -$"/> . The long round trip traders lose money and the simple buy trader has an unrealized loss.<a name="tex2html4" href="#foot37" id="tex2html4"><sup>2</sup></a> Just as we settled on a standard unit for trade size we will use a standard unit for profit and arbitrarily say all traders with realized loss lost one unit per share.</p>
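<p>The impact rule can be sketched in a few lines of Python. The trade names and the <code>next_symbol</code> helper are illustrative labels chosen here, not part of the original model description; the numbers reproduce the three-long-round-trips-plus-one-simple-buy example.</p>

```python
# Impact each trade enters against the NEXT tick (names are illustrative).
IMPACT = {
    "long_round_trip": -1,   # the forced sell lands on the next tick
    "short_round_trip": +1,  # the forced buy lands on the next tick
    "simple_buy": +1,
    "simple_sell": -1,
    "no_trade": 0,
}

def next_symbol(trades):
    """Return '+', '0' or '-' from the sign of the summed impact."""
    net = sum(IMPACT[t] for t in trades)
    if net > 0:
        return "+"
    if net < 0:
        return "-"
    return "0"

# Three long round trips and one simple buy, as in the example.
example = ["long_round_trip"] * 3 + ["simple_buy"]
net_impact = sum(IMPACT[t] for t in example)
symbol = next_symbol(example)
print(net_impact, symbol)
```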
<p>This market model is deliberately simple, but just as symbolic dynamics offers insights to continuous dynamical systems [<a href="#symbdyn">TBS91</a>] this market model serves as a platform for analyzing aspects of real markets.</p>
<h2><a name="SECTION00032000000000000000" id="SECTION00032000000000000000">Type 1 Traders</a></h2>
<p>We have described a very simple and very limited market. We will now describe some of the traders. Our first set of traders we call &#8220;Type 1 Traders&#8221; and they are meant to represent high-frequency quantitative or technical traders. Type 1 traders perform only type 1 trades (long round trip or short round trip) or abstain from trading. For now we are restricting each type 1 trader to trade a single unit either in a long round trip, a short round trip, or to not trade.</p>
<p>We will model these traders as having no internal state and a limited window of memory of the market. We allow the traders to use probabilistic strategies (so they do not get caught always performing the exact same trade in a repeating situation). Under these limits we can write each trader as a simple table representing a map from <!-- MATH<br />
 $\{+,0,-\}^{k}$<br />
 --><br />
<img width="82" height="39" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg7.png" alt="$ \{+,0,-\}^{k}$"/> (the sequences of symbols length <img width="14" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg8.png" alt="$ k$"/> , i.e. what the trader is modeled as remembering) to pairs <!-- MATH<br />
 $(p_{\text{long}},p_{\text{short}})$<br />
 --><br />
<img width="100" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg9.png" alt="$ (p_{\text{long}},p_{\text{short}})$"/> where <!-- MATH<br />
 $p_{\text{long}}$<br />
 --><br />
<img width="39" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg10.png" alt="$ p_{\text{long}}$"/> is the trader&#8217;s chosen probability of making a long round trip in this situation and <!-- MATH<br />
 $p_{\text{short}}$<br />
 --><br />
<img width="44" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg11.png" alt="$ p_{\text{short}}$"/> is the trader&#8217;s chosen probability of making a short round trip in this situation.<a name="tex2html5" href="#foot162" id="tex2html5"><sup>3</sup></a></p>
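<p>One possible way to hold such a strategy table in code is a dictionary keyed by the remembered history. This is only a sketch: the helper names, the convention that the first entry of a history is the most recent symbol, and the &#8220;follow the last symbol&#8221; rule used to fill the table are all assumptions made for illustration.</p>

```python
import itertools
import random

SYMBOLS = "+0-"

def make_table(k, rule):
    """Map every length-k history to a pair (p_long, p_short)."""
    return {hist: rule(hist) for hist in itertools.product(SYMBOLS, repeat=k)}

def follow_last(rate):
    """Hypothetical rule: trade in the direction of the most recent symbol."""
    def rule(hist):
        if hist[0] == "+":
            return (rate, 0.0)   # long round trip with probability rate
        if hist[0] == "-":
            return (0.0, rate)   # short round trip with probability rate
        return (0.0, 0.0)        # abstain after a 0
    return rule

def draw_trade(table, hist, rng):
    """Sample the impact on the next tick: long = -1, short = +1, none = 0."""
    p_long, p_short = table[tuple(hist)]
    u = rng.random()
    if u < p_long:
        return -1
    if u < p_long + p_short:
        return +1
    return 0

table = make_table(2, follow_last(0.7))
print(len(table), table[("+", "0")])
```

<p>The table has one row per remembered history, so its size grows as 3 to the power k.</p>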
<p>We place no limit on how much effort the Type 1 Traders make in pre-computing their strategy tables. One important point is: since the traders are allowed to use probabilistic tables we can assume (in the limit) that the optimal trading strategy is the same for all type 1 traders. This is because if a type 1 trader is losing money to other type 1 traders who are themselves making a profit then the original type 1 trader can &#8220;cannibalize their own business&#8221; by copying a bit of the strategy they are vulnerable to into their own strategy. For example if a trader is losing money to profitable short round trippers they can fix this by trading short round trips a bit more often.<a name="tex2html6" href="#foot153" id="tex2html6"><sup>4</sup></a> When we can use the assumption that all the type 1 traders have identical strategy tables we can then solve for this table and immediately demonstrate the characteristic of the market formed by these optimal traders.</p>
<p>The market model evolves as follows: if there are <img width="20" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg15.png" alt="$ m$"/> type 1 traders with a common memory window size of <img width="14" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg8.png" alt="$ k$"/> then the market symbol at time-<img width="11" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg16.png" alt="$ t$"/> is:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
\text{market}(t) =<br />
\text{sign}\left(<br />
\sum_{i=1}^{m} \chi_i(\text{market}(t-1),\cdots,\text{market}(t-k))<br />
\right)<br />
\end{displaymath}<br />
 --></p>
<div align="center">&nbsp; &nbsp;market<img width="43" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg17.png" alt="$\displaystyle (t) =$"/>&nbsp; &nbsp;sign<img width="339" height="71" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg18.png" alt="$\displaystyle \left( \sum_{i=1}^{m} \chi_i(\text{market}(t-1),\cdots,\text{market}(t-k)) \right) $"/></div>
<p>where <img width="35" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg19.png" alt="$ \chi_i()$"/> is the random variable associated with the <img width="11" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg20.png" alt="$ i$"/> -th type 1 trader.</p>
<p>Already we can show: if the market is only populated by type 1 traders then the optimal trading strategy is to set <!-- MATH<br />
 $p_{\text{long}} =<br />
p_{\text{short}} = 0$<br />
 --><br />
<img width="134" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg21.png" alt="$ p_{\text{long}} = p_{\text{short}} = 0$"/> (to not trade) and there is in fact no market (no trades happen). This follows because at every time tick no more than half of the active type 1 traders can be on the profitable side, so at best the type 1 traders break even as a group and not trading is a dominant strategy.</p>
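<p>The counting argument can be checked by brute force. The sketch below assumes a payoff of plus one unit for a profitable round trip, minus one unit for a losing one, and zero when the next symbol is 0; only the minus-one loss is stated in the text, the other two payoffs are assumptions made here.</p>

```python
def group_profit(n_long, n_short):
    """Total profit of the type 1 traders when only they trade."""
    net = n_short - n_long        # longs sell next tick, shorts buy next tick
    if net > 0:                   # next symbol '+': longs win, shorts lose
        return n_long - n_short
    if net < 0:                   # next symbol '-': shorts win, longs lose
        return n_short - n_long
    return 0                      # next symbol '0': assumed to pay nothing

# Best group outcome over every split of up to five longs and five shorts.
best = max(group_profit(a, b) for a in range(6) for b in range(6))
print(best)
```

<p>Under these payoffs the group profit is never positive, because the side that wins is always the side with fewer traders on it.</p>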
<p>To model another important aspect of markets (and to give the type 1 traders a reason to trade) we introduce type 2 traders.</p>
<h2><a name="SECTION00033000000000000000" id="SECTION00033000000000000000">Type 2 Traders</a></h2>
<p>Type 2 traders are completely oblivious to the market. Type 2 traders trade only type 2 trades (simple buy and simple sell). Type 2 traders trade, but do not look at or remember the market sequence. Oddly enough the type 2 traders abstract both the idea of completely informed traders (traders that know something about the future, so do not need to use the market past) and completely uninformed traders (traders trading due to some pressure external to the market, like a need to recover liquid assets). For now we are restricting each type 2 trader to trade a single unit either in a simple buy or a simple sell.</p>
<p>We assume one family of type 2 traders that operate as follows: assume a simple sequence of &#8220;<img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg3.png" alt="$ +$"/> &#8221; and &#8220;<img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg4.png" alt="$ -$"/> &#8221; generated by the Markov Chain in Figure&nbsp;<a href="#fig:SimpleMarkovChain">1</a>. This Markov Chain emits a sequence of symbols where the same symbol follows the last with probability <img width="14" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg22.png" alt="$ p$"/> (and the symbol changes with probability <img width="44" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg23.png" alt="$ 1-p$"/> ).</p>
<div align="center"><a name="fig:SimpleMarkovChain" id="fig:SimpleMarkovChain"></a><a name="63"></a></p>
<table>
<caption align="bottom"><strong>Figure 1:</strong> Simple Markov Chain</caption>
<tr>
<td>
<div align="center"><img width="250" height="78" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/./SimpleChain.png" alt="Image SimpleChain"/></div>
</td>
</tr>
</table>
</div>
<p>We will call this sequence &#8220;the hidden symbol&#8221; as only our type 2 traders can see it (the type 1 traders can not). Each of our type 2 traders looks at the current hidden symbol and independently does the following: with probability <img width="13" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg24.png" alt="$ q$"/> they enter a simple buy or simple sell for the next time tick betting in the direction of the hidden symbol and with probability <img width="43" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg25.png" alt="$ 1-q$"/> they enter a simple buy or simple sell for the next time tick betting in the direction opposite to the hidden symbol. For now we will assume all type 2 traders share the same <img width="13" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg24.png" alt="$ q$"/> . The type 2 traders do not perform round trip trades, but instead hold inventory. Thus a type 2 trader&#8217;s long bet is modeled as adding a net upward impact to the next time period.</p>
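<p>A minimal simulation sketch of the hidden sequence and of a single type 2 trader follows. The function names and the seed are choices made here for illustration; the values 0.8 and 0.9 are the p and q used in the worked example later in the article.</p>

```python
import random

def step_hidden(symbol, p, rng):
    """Keep the hidden symbol with probability p, flip it otherwise."""
    if rng.random() < p:
        return symbol
    return "-" if symbol == "+" else "+"

def type2_impact(hidden, q, rng):
    """+1 for a simple buy, -1 for a simple sell; follows hidden with odds q."""
    direction = 1 if hidden == "+" else -1
    return direction if rng.random() < q else -direction

rng = random.Random(0)
hidden = "+"
path = []
for _ in range(10000):
    hidden = step_hidden(hidden, 0.8, rng)
    path.append(hidden)

# Fraction of ticks on which the hidden symbol repeated; should be near p.
repeats = sum(a == b for a, b in zip(path, path[1:])) / (len(path) - 1)
print(round(repeats, 2))
```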
<p>The market model now evolves as follows. If there are <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> type 2 traders then the market symbol at time-<img width="11" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg16.png" alt="$ t$"/> is:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
\text{market}(t) =<br />
\text{sign}\left(<br />
\sum_{i=1}^{m} \chi_i(\text{market}(t-1),\cdots,\text{market}(t-k))<br />
+ \sum_{i=1}^{n} \Upsilon_i(\text{hidden}(t-1))<br />
\right)<br />
\end{displaymath}<br />
 --></p>
<div align="center">&nbsp; &nbsp;market<img width="43" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg17.png" alt="$\displaystyle (t) =$"/>&nbsp; &nbsp;sign<img width="522" height="71" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg27.png" alt="$\displaystyle \left( \sum_{i=1}^{m} \chi_i(\text{market}(t-1),\cdots,\text{market}(t-k)) + \sum_{i=1}^{n} \Upsilon_i(\text{hidden}(t-1)) \right) $"/></div>
<p>where <!-- MATH<br />
 $\Upsilon_i()$<br />
 --><br />
<img width="37" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg28.png" alt="$ \Upsilon_i()$"/> is the random variable associated with the <img width="11" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg20.png" alt="$ i$"/> -th type 2 trader.</p>
<p>As is often the case in mathematics what the abstract model means can change if we add different interpretations. If the hidden sequence that all of the type 2 traders simultaneously observe is thought to represent some important hidden value like the true value of the company underlying the equity being traded, then we consider the type 2 traders to be informed and consider their knowledge to be an advantage. If we consider the shared sequence to be irrelevant noise then we see these traders as some loose coalition whose value comes only from the fact their trades correlate with each other. If <img width="59" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg29.png" alt="$ q=0.5$"/> then we have truly uninformed (and uncorrelated) traders who are indeed doing nothing. Many real market properties that are attributed as being consequences of non-arbitrage are in fact consequences of conventions no more meaningful than the one given here (for example: closed end funds).</p>
<p>The interesting point is if <img width="14" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg22.png" alt="$ p$"/> is not too near <img width="27" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg30.png" alt="$ 0.5$"/> and <img width="13" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg24.png" alt="$ q$"/> is not too near <img width="27" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg30.png" alt="$ 0.5$"/> then the type 2 traders have a serial correlation (a correlation over time) that the type 1 traders can learn and exploit for profit. Or, from another point of view, the type 1 traders can profit by supplying liquidity to the type 2 traders.</p>
<h1><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Solving for the Optimal Strategy</a></h1>
<p>Our market was designed to allow a very succinct description. With only type 1 traders and one uniform family of type 2 traders our market is completely specified if we know:</p>
<ul>
<li><img width="20" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg15.png" alt="$ m$"/> : The number of type 1 traders in the market</li>
<li><img width="14" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg8.png" alt="$ k$"/> : The memory length of type 1 traders</li>
<li><img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> : The number of type 2 traders in the market</li>
<li><img width="14" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg22.png" alt="$ p$"/> : The symbol stability odds on the hidden sequence watched by type 2 traders</li>
<li><img width="13" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg24.png" alt="$ q$"/> : The faithfulness of type 2 traders in trading the hidden symbol</li>
</ul>
<p>Given these parameters there is a unique shared optimal strategy for the type 1 traders, and we can efficiently solve for this strategy (without resorting to approximate or simulation results).</p>
<p>The entire state of the market at a given time can be written as a tuple <img width="77" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg31.png" alt="$ s = (x,y)$"/> where <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg32.png" alt="$ x$"/> is the sequence of the <img width="14" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg8.png" alt="$ k$"/> most recent result symbols from the market sequence (<img width="56" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg33.png" alt="$ +,0,-$"/> ) and <img width="14" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg34.png" alt="$ y$"/> is the most recent symbol from the hidden sequence (<img width="40" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg35.png" alt="$ +,-$"/> ). So there are only <img width="47" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg36.png" alt="$ 2 * 3^k$"/> possible states for the market. Any posited type 1 strategy (along with the above parameters) completely determines the transition odds between each of these detailed market states. Figure&nbsp;<a href="#fig:DetailedMarketMarkovChain">2</a> illustrates the states that make up a <img width="46" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg1.png" alt="$ k=1$"/> market.</p>
<div align="center"><a name="fig:DetailedMarketMarkovChain" id="fig:DetailedMarketMarkovChain"></a><a name="83"></a></p>
<table>
<caption align="bottom"><strong>Figure 2:</strong> Detailed Market Markov Chain for <img width="46" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg1.png" alt="$ k=1$"/></caption>
<tr>
<td>
<div align="center"><img width="500" height="274" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/./Market1.png" alt="Image Market1"/></div>
</td>
</tr>
</table>
</div>
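<p>The detailed state space is small enough to enumerate directly. In this sketch a state is a pair of the k most recent market symbols and the hidden symbol; <code>market_states</code> is a name introduced here for illustration.</p>

```python
import itertools

def market_states(k):
    """All (market history, hidden symbol) pairs for memory length k."""
    return [(x, y)
            for x in itertools.product("+0-", repeat=k)
            for y in "+-"]

states = market_states(1)
for state_id, (x, y) in enumerate(states, start=1):
    print(state_id, "".join(x), y)
```

<p>The count is 2 * 3^k, so the k=1 market has six states and a k=3 market has 54.</p>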
<p>Once the transition odds are known between all states it is a simple matter of linear algebra to solve exactly for the stationary distribution and expected value of the market (for type 1 traders).[<a href="#finiteMC">KS76</a>] Global optimization techniques can be used to identify the optimal strategies and we can then characterize how these market models behave when populated with optimal traders.<a name="tex2html9" href="#foot156" id="tex2html9"><sup>5</sup></a></p>
<p>For concreteness we show a piece of the computation for the <!-- MATH<br />
 $m=1, k=1,<br />
n=2, p=0.8, q=0.9$<br />
 --><br />
<img width="275" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg37.png" alt="$ m=1, k=1, n=2, p=0.8, q=0.9$"/> market model. If the market&#8217;s last symbol was <img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg3.png" alt="$ +$"/> and the last hidden state was <img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg3.png" alt="$ +$"/> then the odds of moving from this state to this same detailed state (both a new hidden symbol of <img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg3.png" alt="$ +$"/> and a new market symbol of <img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg3.png" alt="$ +$"/> ) for the next time is given by:</p>
<div align="center"><img width="25" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg38.png" alt="$\displaystyle P($"/>hidden<img width="412" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg39.png" alt="$\displaystyle _{\text{next}} = + \vert \text{hidden} = +) P( \chi_1(+) + \Upsilon_1(+) + \Upsilon_2(+) &gt; 0 ) $"/></div>
<p>(where <img width="37" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg40.png" alt="$ \chi_1()$"/> is the random variable representing the trade of the type 1 trader and <!-- MATH<br />
 $\Upsilon_1()$<br />
 --><br />
<img width="39" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg41.png" alt="$ \Upsilon_1()$"/> , <!-- MATH<br />
 $\Upsilon_2()$<br />
 --><br />
<img width="39" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg42.png" alt="$ \Upsilon_2()$"/> are the random variables representing the trades of the type 2 traders).</p>
<p>Using nothing more complicated than knowledge of the binomial distribution we can compute the complete transition matrix for the detailed Market Markov Chain. For example: assume our type 1 trader trades the most recent market symbol (except 0) with <img width="27" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg43.png" alt="$ 0.7$"/> probability (and makes no trade otherwise). Now label our states as:</p>
<div align="center">
<table cellpadding="3" border="1">
<tr>
<td align="center">Last Market Symbol</td>
<td align="center">Hidden Symbol</td>
<td align="center">State ID Number</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">0</td>
<td align="center">+</td>
<td align="center">3</td>
</tr>
<tr>
<td align="center">0</td>
<td align="center">-</td>
<td align="center">4</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">5</td>
</tr>
<tr>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">6</td>
</tr>
</table>
</div>
<p>then it is merely a matter of detailed arithmetic to derive the state to state transition probability matrix<a name="tex2html10" href="#foot97" id="tex2html10"><sup>6</sup></a>:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
P =<br />
\left(<br />
\begin{array}{llllll}<br />
0.648 &#038; 0.162 &#038; 0.648 &#038; 0.162 &#038; 0.7488 &#038; 0.1872 \\<br />
0.002 &#038; 0.008 &#038; 0.002 &#038; 0.008 &#038; 0.0272 &#038; 0.1088 \\<br />
0.0432 &#038; 0.0108 &#038; 0.144 &#038; 0.036 &#038; 0.0432 &#038; 0.0108 \\<br />
0.0108 &#038; 0.0432 &#038; 0.036 &#038; 0.144 &#038; 0.0108 &#038; 0.0432 \\<br />
0.1088 &#038; 0.0272 &#038; 0.008 &#038; 0.002 &#038; 0.008 &#038; 0.002 \\<br />
0.1872 &#038; 0.7488 &#038; 0.162 &#038; 0.648 &#038; 0.162 &#038; 0.648<br />
\end{array}<br />
\right)<br />
.<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="441" height="147" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg46.png" alt="\begin{displaymath} P = \left( \begin{array}{llllll} 0.648 &amp; 0.162 &amp; 0.648 &amp; 0.... ... &amp; 0.7488 &amp; 0.162 &amp; 0.648 &amp; 0.162 &amp; 0.648 \end{array}\right) . \end{displaymath}"/></div>
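<p>The &#8220;detailed arithmetic&#8221; can be sketched as a short enumeration. Two readings are assumed here and should be treated as inferences rather than the author&#8217;s stated method: column j of the matrix is the source state and row i the destination (the printed columns sum to one), and the type 2 traders act on the hidden symbol of the destination state (the printed entries are consistent with this reading). All function names are illustrative.</p>

```python
from itertools import product

P_STAY, Q, RATE = 0.8, 0.9, 0.7   # p, q and the type 1 trade rate from the text
STATES = [(m, h) for m in "+0-" for h in "+-"]   # state IDs 1..6 in table order

def type1_dist(market):
    """Impact distribution of the single type 1 trader (m=1)."""
    if market == "+":
        return {-1: RATE, 0: 1 - RATE}   # long round trip sells on the next tick
    if market == "-":
        return {+1: RATE, 0: 1 - RATE}   # short round trip buys on the next tick
    return {0: 1.0}                      # no trade after a 0

def type2_dist(hidden):
    """Impact distribution of one type 2 trader."""
    d = 1 if hidden == "+" else -1
    return {d: Q, -d: 1 - Q}

def sign_symbol(total):
    return "+" if total > 0 else "-" if total < 0 else "0"

def transition(src, dst):
    """Probability of moving from detailed state src to detailed state dst."""
    (m0, h0), (m1, h1) = src, dst
    p_hidden = P_STAY if h1 == h0 else 1 - P_STAY
    t2 = type2_dist(h1)   # assumption: type 2 traders see the destination hidden symbol
    p_market = 0.0
    for (c, pc), (u1, p1), (u2, p2) in product(
            type1_dist(m0).items(), t2.items(), t2.items()):
        if sign_symbol(c + u1 + u2) == m1:
            p_market += pc * p1 * p2
    return p_hidden * p_market

# Row i is the destination state, column j the source state.
P = [[transition(src, dst) for src in STATES] for dst in STATES]
for row in P:
    print(" ".join(f"{v:.4f}" for v in row))
```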
<p>Solving for the stationary distribution is, as promised, quite easy. We want to find a vector <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg32.png" alt="$ x$"/> such that <!-- MATH<br />
 $(P-I) x = 0$<br />
 --><br />
<img width="104" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg47.png" alt="$ (P-I) x = 0$"/> and <!-- MATH<br />
 $1\cdot x = 1$<br />
 --><br />
<img width="68" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg48.png" alt="$ 1\cdot x = 1$"/> (<img width="14" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg49.png" alt="$ I$"/> denoting the identity matrix). Under very general conditions this will be a set of <img width="43" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg50.png" alt="$ s+1$"/> equations over <img width="13" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg51.png" alt="$ s$"/> variables with rank <img width="13" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg51.png" alt="$ s$"/> (so will have a unique solution and we don&#8217;t need to add any sign constraints).</p>
<p>This solution gives us the stationary odds of the market (how likely we are to see the market in any state at a random observation time):</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
x =<br />
\left(<br />
\begin{array}{l}<br />
0.420497 \\<br />
0.048611 \\<br />
0.030892 \\<br />
0.030892 \\<br />
0.048611 \\<br />
0.420497<br />
\end{array}<br />
\right)<br />
.<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="151" height="147" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg52.png" alt="\begin{displaymath} x = \left( \begin{array}{l} 0.420497 \ 0.048611 \ 0.030892 \ 0.030892 \ 0.048611 \ 0.420497 \end{array}\right) . \end{displaymath}"/></div>
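<p>As a cross-check, the stationary vector can be recovered numerically from the printed matrix. The sketch below uses repeated multiplication (power iteration) instead of the exact linear solve described in the text, only to stay within the Python standard library; <code>stationary</code> is a name introduced here.</p>

```python
# Transition matrix as printed above: column = source state, row = destination.
P = [
    [0.648, 0.162, 0.648, 0.162, 0.7488, 0.1872],
    [0.002, 0.008, 0.002, 0.008, 0.0272, 0.1088],
    [0.0432, 0.0108, 0.144, 0.036, 0.0432, 0.0108],
    [0.0108, 0.0432, 0.036, 0.144, 0.0108, 0.0432],
    [0.1088, 0.0272, 0.008, 0.002, 0.008, 0.002],
    [0.1872, 0.7488, 0.162, 0.648, 0.162, 0.648],
]

def stationary(matrix, steps=500):
    """Approximate x with matrix * x = x and sum(x) = 1 by power iteration."""
    n = len(matrix)
    x = [1.0 / n] * n
    for _ in range(steps):
        x = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
    total = sum(x)
    return [v / total for v in x]

x = stationary(P)
print([round(v, 6) for v in x])
```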
<p>Once we know this it is a matter of arithmetic to determine the expected value of the market for the type 1 trader.<a name="tex2html11" href="#foot104" id="tex2html11"><sup>7</sup></a> The trading strategy we imposed was not optimal but does have the positive value of <img width="36" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg53.png" alt="$ 0.13$"/> units expected profit per time tick. We can completely characterize these markets for moderate values of <img width="14" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg8.png" alt="$ k$"/> and arbitrary values of <img width="20" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg15.png" alt="$ m$"/> and <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> .</p>
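<p>The expected-value arithmetic can be sketched as follows. The payoffs are assumptions made here: plus one unit for a profitable round trip and minus one unit for a losing one (the text only fixes the one-unit loss). The sketch also assumes, as the printed matrix entries suggest, that the type 2 traders act on the newly drawn hidden symbol. Under those assumptions the result rounds to the 0.13 quoted above.</p>

```python
P_STAY, Q, RATE = 0.8, 0.9, 0.7   # p, q and the type 1 trade rate from the text

# Stationary weights from the text, keyed by (last market symbol, hidden symbol).
X = {("+", "+"): 0.420497, ("+", "-"): 0.048611, ("0", "+"): 0.030892,
     ("0", "-"): 0.030892, ("-", "+"): 0.048611, ("-", "-"): 0.420497}

def type2_sum_dist(hidden):
    """Distribution of the summed impact of the n=2 type 2 traders."""
    d = 1 if hidden == "+" else -1
    return {2 * d: Q * Q, 0: 2 * Q * (1 - Q), -2 * d: (1 - Q) ** 2}

def trade_value(market, hidden):
    """Expected profit of one type 1 round trip placed in this state."""
    if market == "0":
        return 0.0                          # the strategy abstains after a 0
    impact = -1 if market == "+" else +1    # long sells next tick, short buys
    flipped = "-" if hidden == "+" else "+"
    value = 0.0
    for nxt, p_h in ((hidden, P_STAY), (flipped, 1 - P_STAY)):
        for total2, p2 in type2_sum_dist(nxt).items():
            net = impact + total2           # odd here, so never zero
            win = (net > 0) == (market == "+")   # long wins on '+', short on '-'
            value += p_h * p2 * (1 if win else -1)
    return value

expected = sum(w * RATE * trade_value(m, h) for (m, h), w in X.items())
print(round(expected, 2))
```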
<p>Already we can confirm some features we would expect to see in this model. For example the type 1 traders have a &#8220;tragedy of the commons&#8221; situation in that they are using up the correlations that the type 2 traders introduce. If there are too many technical traders trying to follow the type 2 traders then the market becomes anti-correlated and oscillates in a way that is not profitable for these traders (until they adjust their strategies). For example raising <img width="20" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg15.png" alt="$ m$"/> to <img width="14" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg54.png" alt="$ 2$"/> in our example makes &#8220;follow the market 70% of the time&#8221; an unprofitable strategy that loses money at a rate of <img width="36" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg55.png" alt="$ 0.12$"/> units per time tick. However, with <img width="102" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg56.png" alt="$ m=2, n=3$"/> this same strategy is profitable at a rate of <img width="36" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg57.png" alt="$ 0.08$"/> units per time tick. The <img width="102" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg58.png" alt="$ m=2, n=2$"/> market can be made profitable if both of the technical traders act &#8220;superrationally&#8221;<a name="tex2html12" href="#foot158" id="tex2html12"><sup>8</sup></a> and lower their trade rate from following the market 70% of the time to something lower like 20% of the time. Figure&nbsp;<a href="#fig:stratValueK1N2M2">3</a> shows the expected value of the market <!-- MATH<br />
 $m=2, k=1, n=2, p=0.8, q=0.9$<br />
 --><br />
<img width="275" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg2.png" alt="$ m=2, k=1, n=2, p=0.8, q=0.9$"/> for the type 1 traders as the type 1 traders&#8217; odds of &#8220;following the last symbol&#8221; are moved from 0 to <img width="14" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg59.png" alt="$ 1$"/> (and, as earlier, they refrain from trading in all other cases).</p>
<div align="center"><a name="fig:stratValueK1N2M2" id="fig:stratValueK1N2M2"></a><a name="110"></a></p>
<table>
<caption align="bottom"><strong>Figure 3:</strong> Strategy values for <!-- MATH<br />
 $m=2, k=1, n=2, p=0.8, q=0.9$<br />
 --><br />
<img width="275" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg2.png" alt="$ m=2, k=1, n=2, p=0.8, q=0.9$"/></caption>
<tr>
<td>
<div align="center"><img width="400" height="400" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/./stratValueK1N2M2.png" alt="Image stratValueK1N2M2"/></div>
</td>
</tr>
</table>
</div>
<h1><a name="SECTION00050000000000000000" id="SECTION00050000000000000000">The Experiment</a></h1>
<p>Now that we have set up a market and described how to evaluate and solve for the optimal trading strategies, we are ready to run an experiment. The experiment is the introduction of a large trader that trades at a much larger size than the other type 1 traders. This large trader acts like a type 1 trader but is allowed larger trade sizes and a small informational advantage over the other type 1 traders. This informational advantage is the ability to remember which of three possible states their own last trade was in (so it is not really extending the window size, and this extension would not help the smaller type 1 traders against this strategy).</p>
<p>To illustrate we assume a market where <img width="52" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg60.png" alt="$ m=0$"/> , <img width="46" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg1.png" alt="$ k=1$"/> , <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> is large, <img width="63" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg61.png" alt="$ q&gt;1/2$"/> , <img width="63" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg62.png" alt="$ p &gt; 3/4$"/> and <!-- MATH<br />
 $n*(q-1/2)*(p-3/4)$<br />
 --><br />
<img width="187" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg63.png" alt="$ n*(q-1/2)*(p-3/4)$"/> is large (and all known to the large trader).</p>
<p>The large trader trades as follows: define three states to remember the large trader&#8217;s last state: &#8220;odd time tick&#8221;, &#8220;even time tick following bluff&#8221; and &#8220;even time tick following non bluff.&#8221; We illustrate the large trader&#8217;s strategy in Figure&nbsp;<a href="#fig:Strat1">4</a>. On odd time ticks the large trader either bluffs (trades to flip the market symbol and takes a forced loss) or trades normally (allows the market to evolve under the influence of the type 2 traders and takes an expected profit). On even time ticks the large trader&#8217;s behavior depends on whether the last odd tick was a bluff (and the type 2 traders&#8217; influence on the market is masked) or was not a bluff (and the type 2 traders&#8217; influence on the market is visible). These two different states are marked in Figure&nbsp;<a href="#fig:Strat1">4</a>: the large trader abstains from trading after a bluff and trades for expected profit after a non-bluff.</p>
<div align="center"><a name="fig:Strat1" id="fig:Strat1"></a><a name="120"></a></p>
<table>
<caption align="bottom"><strong>Figure 4:</strong> Large Type 1 Trader States</caption>
<tr>
<td>
<div align="center"><img width="300" height="192" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/./Strat1.png" alt="Image Strat1"/></div>
</td>
</tr>
</table>
</div>
<p>The large trader&#8217;s strategy yields an augmented Markov chain that reflects the large trader&#8217;s state, the last symbol seen in the market and the last symbol of the hidden sequence. This Markov chain is shown in Figure&nbsp;<a href="#fig:BigStrat1">5</a> (with links from even time states to odd time states and links to and from unlikely states suppressed for clarity). We will describe the large trader&#8217;s strategy in detail below, but there are some simplifying points to keep in mind. Since <img width="85" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg64.png" alt="$ n (q-1/2)$"/> is large we are assuming that on odd time ticks and for even time ticks following non-bluffs the states where the market symbol and the hidden symbol disagree are very rare (and we will omit them from the analysis).</p>
<p>Stepping through the large trader strategy (see Figure&nbsp;<a href="#fig:BigStrat1">5</a>): on the odd time periods the large trader assumes that the market symbol equals the hidden symbol (i.e. the <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> type 2 traders successfully copied the hidden symbol to the market without interference). The large trader then flips a fair coin and with 50% chance &#8220;bluffs&#8221; (forcing the market to the symbol opposite the hidden symbol by trading a little more than <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> units in the appropriate direction) or on the other 50% of the time trades slightly less than <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> units to try and profit off the obvious tick to tick correlation in the market. On the even time ticks the large trader trades to profit if the previous trade was not a bluff or otherwise abstains from trading. The expected value of the sum of contributions of the type 2 traders is <!-- MATH<br />
 $\text{hidden\_symbol}*(q*n - (1-q)*n)$<br />
 --><br />
<img width="282" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg66.png" alt="$\text{hidden\_symbol}*(q*n - (1-q)*n)$"/> which has an absolute value of <!-- MATH<br />
 $(2 q - 1) n$<br />
 --><br />
<img width="76" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg67.png" alt="$ (2 q - 1) n$"/> . Let <!-- MATH<br />
 $Q = (2 q - 1) n$<br />
 --><br />
<img width="113" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg68.png" alt="$ Q = (2 q - 1) n$"/> . A bluff costs the large trader <img width="72" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg69.png" alt="$ Q + o(n)$"/> <a name="tex2html15" href="#foot159" id="tex2html15"><sup>9</sup></a> units as they enter a trade large enough to overwhelm the type 2 traders with high probability. A trade for profit (either on a non-bluff odd time tick or an even time tick following a non-bluff) has a maximum expected value of <!-- MATH<br />
 $(Q -<br />
o(n)) * (p*(1) + (1-p)*(-1)) = Q ( 2 p - 1) - o(n)$<br />
 --><br />
<img width="441" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg72.png" alt="$ (Q - o(n)) * (p*(1) + (1-p)*(-1)) = Q ( 2 p - 1) - o(n)$"/> as the large trader must not overwhelm the expected effect of the type 2 traders. Every two time ticks the large trader either bluffs then abstains (with probability 1/2) or makes two profitable trade attempts in a row (with probability 1/2). So every 2 time ticks the expected return is:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
0.5 * (- Q - o(n) ) + 0.5 * 2 * (Q (2 p -1) - o(n))<br />
= 2 Q (p - 3/4) - o(n)<br />
.<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="559" height="35" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg73.png" alt="$\displaystyle 0.5 * (- Q - o(n) ) + 0.5 * 2 * (Q (2 p -1) - o(n)) = 2 Q (p - 3/4) - o(n) . $"/></div>
<p>Or, substituting Q = (2q &#8211; 1)n and halving, 2(q &#8211; 1/2)(p-3/4)n &#8211; o(n) expected units return per time tick.</p>
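<p>The two-tick calculation above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: it drops the o(n) correction terms (an assumption), and the function name and the sample value of Q are hypothetical. It evaluates the left-hand side of the expression directly and confirms that the strategy breaks even at p = 3/4.</p>

```python
from fractions import Fraction


def two_tick_return(Q, p):
    """Leading-order expected return over one odd/even pair of ticks.

    Q is the expected net type 2 volume, (2q - 1) * n, and p is the
    probability that a trade for profit pays off. The o(n) correction
    terms from the text are dropped, which is an assumption of this sketch.
    """
    bluff_then_abstain = -Q                    # forced loss, then no trade
    two_profit_trades = 2 * Q * (2 * p - 1)    # two trades worth Q(2p - 1) each
    half = Fraction(1, 2)                      # fair coin between the two branches
    return half * bluff_then_abstain + half * two_profit_trades


Q = Fraction(100)  # hypothetical value of (2q - 1) * n
break_even = two_tick_return(Q, Fraction(3, 4))
winning = two_tick_return(Q, Fraction(4, 5))
losing = two_tick_return(Q, Fraction(7, 10))
print(break_even, winning, losing)
```

<p>Exact fractions are used so that the break-even case comes out as exactly zero rather than as a small floating point residue.</p>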
<div align="center"><a name="fig:BigStrat1" id="fig:BigStrat1"></a><a name="131"></a></p>
<table>
<caption align="bottom"><strong>Figure 5:</strong> Example Strategy for Large Type 1 Trader</caption>
<tr>
<td>
<div align="center"><img width="500" height="540" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/./BigStrat1.png" alt="Image BigStrat1"/> <font size="-1">
<p />(for clarity transitions between unlikely states and from even time ticks to odd time ticks are not shown)</font></div>
</td>
</tr>
</table>
</div>
<p>This large trader strategy is for illustration, and is in no sense optimal<a name="tex2html17" href="#foot136" id="tex2html17"><sup>10</sup></a>. The important result is that when looking at the sequence of market symbols with a window of length 2 (the length of window that would be useful in defining a trading strategy for a <img width="46" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg1.png" alt="$ k=1$"/> type 1 opposing trader) all the zero free market symbol sequences of length 2 come up with the same probability: <img width="31" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg75.png" alt="$ 1/4$"/> . To a limited memory type 1 opponent (or one who has to encode their strategy with limited memory) the market looks like a fair coin with no serial correlation. Thus, if we start with <img width="52" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg60.png" alt="$ m=0$"/> (i.e. no other type 1 traders) a single large trader can take over the market and when we later increase <img width="20" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg15.png" alt="$ m$"/> the new type 1 traders will compute an optimal strategy of not trading (i.e. they will see no method to profitably enter the market).</p>
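<p>The claim that every length 2 window is equally likely can be checked by exact enumeration. This is a sketch under simplifying assumptions that are ours, not necessarily the full model: the hidden symbol is taken to be +1 or -1 with equal stationary probability and to repeat from one tick to the next with probability p; on odd ticks the market symbol is the hidden symbol, flipped when the large trader bluffs (a fair coin); on even ticks the market symbol equals the hidden symbol. The rare disagreement states and the zero symbol are ignored, and the function name is hypothetical.</p>

```python
from fractions import Fraction
from itertools import product


def window_distribution(p, start_on_odd=True):
    """Exact distribution of two consecutive market symbols.

    p is the assumed probability that the hidden symbol repeats.
    start_on_odd selects an (odd, even) window; otherwise (even, odd).
    """
    half = Fraction(1, 2)
    dist = {}
    for first_hidden, stays, bluff in product((+1, -1), (True, False), (True, False)):
        prob = half * (p if stays else 1 - p) * half
        second_hidden = first_hidden if stays else -first_hidden
        if start_on_odd:
            # The bluff, if any, happens on the first (odd) tick of the window.
            first_market = -first_hidden if bluff else first_hidden
            second_market = second_hidden
        else:
            # The bluff, if any, happens on the second (odd) tick of the window.
            first_market = first_hidden
            second_market = -second_hidden if bluff else second_hidden
        key = (first_market, second_market)
        dist[key] = dist.get(key, 0) + prob
    return dist


for start_on_odd in (True, False):
    print(start_on_odd, window_distribution(Fraction(4, 5), start_on_odd))
```

<p>Under these assumptions the fair bluffing coin makes every odd tick symbol independent of the hidden sequence, which is why each of the four windows has probability 1/4 regardless of p.</p>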
<p>The large trader has rendered the market untradable for other type 1 traders in the strongest possible sense. Because this market model is symmetric and has no trading costs and no margin requirements, no strategy can exist that forces other adapting strategies to lose money: a strategy that is forced to lose money can be adapted into a profitable strategy by reversing the long and short actions. The large trader is using slightly more memory, but this is just an accounting gimmick so they know on which ticks the market carries information from the type 2 traders and on which ticks it carries only noise from their own &#8220;bluff&#8221; or &#8220;Pyrrhic&#8221; trades. The other type 1 traders have no advantage when given the equivalent gimmick.<a name="tex2html18" href="#foot137" id="tex2html18"><sup>11</sup></a> Also, the large trader strategy is self-financing: the large trader can hold the market (make the market look purely random to outsiders) while extracting a profit.</p>
<h1><a name="SECTION00060000000000000000" id="SECTION00060000000000000000">Conclusion</a></h1>
<p>We have described a combinatorial market model that is designed for simplicity. As is well known from mathematics and theoretical computer science even very simple systems can exhibit arbitrarily complex behavior when feedback, recursion or iteration are involved.</p>
<p>We have shown how to explicitly derive optimal trading behavior for small traders in this market model. We then demonstrated how a large trader (allowed to move more volume than the small traders) can &#8220;hold the market&#8221; in the sense that they can make the market appear uncorrelated to outsiders while extracting a profit of their own. The ability to completely characterize our market model allows us to show that a self-financing large trader is a stable solution in this market model even in the presence of optimal opponents with similar computational power.</p>
<p>It is beyond the scope of current techniques to show under which conditions a self-financing large trader could exist in a &#8220;fully realistic&#8221; market model. But by demonstration we have shown that we cannot assume there are no self-financing large traders.</p>
<h1><a name="SECTION00070000000000000000" id="SECTION00070000000000000000">Further Research</a></h1>
<p>Interesting follow up studies, which are well within the scope of the methods demonstrated here, include:</p>
<ul>
<li>Larger <img width="14" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg8.png" alt="$ k$"/> and heterogeneous <img width="14" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg8.png" alt="$ k$"/></li>
<li>A cross-market arbitrage interpretation for the type 2 traders</li>
<li>More detailed price and hidden symbol trajectories</li>
<li>Non-finite strategies (strategies indexed by integers instead of a small set of symbols)</li>
<li>Inventory and margin</li>
<li>Trade volume controlling price change (i.e. a model of price&#8217;s elasticity with respect to trade volume).</li>
</ul>
<h2><a name="SECTION00080000000000000000" id="SECTION00080000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Applebaum:2004p1042" id="Applebaum:2004p1042">App04</a></dt>
<dd>David Applebaum, <i>Levy processes- from probability to finance and quantum groups</i>, Notices of the AMS <b>51</b> (2004), no.&nbsp;1336-1347, 12.</dd>
<dt><a name="probmeth" id="probmeth">AS92</a></dt>
<dd>Noga Alon and Joel&nbsp;H. Spencer, <i>The probabilistic method</i>, Wiley, 1992.</dd>
<dt><a name="Hasanhodzic:2009p2605" id="Hasanhodzic:2009p2605">HLV09</a></dt>
<dd>Jasmina Hasanhodzic, Andrew&nbsp;W Lo, and Emanuele Viola, <i>A computational view of market efficiency</i>, 1-14.</dd>
<dt><a name="metamag" id="metamag">Hof85</a></dt>
<dd>Douglas&nbsp;R. Hofstadter, <i>Metamagical themas: Questing for the essence of mind and pattern</i>, Basic Books Inc., 1985.</dd>
<dt><a name="finiteMC" id="finiteMC">KS76</a></dt>
<dd>John&nbsp;G. Kemeny and J.&nbsp;Laurie Snell, <i>Finite markov chains</i>, Springer, 1976.</dd>
<dt><a name="citeulike:2080469" id="citeulike:2080469">KS01</a></dt>
<dd>Ioannis Karatzas and Steven&nbsp;E. Shreve, <i>Methods of mathematical finance</i>, Springer, September 2001.</dd>
<dt><a name="Lo:2001p1619" id="Lo:2001p1619">LM01</a></dt>
<dd>Andrew&nbsp;W Lo and A&nbsp;Craig MacKinlay, <i>A non-random walk down wall street</i>, Princeton University Press, 2001.</dd>
<dt><a name="Lo:2005p2193" id="Lo:2005p2193">Lo05</a></dt>
<dd>Andrew&nbsp;W Lo, <i>Reconciling efficient markets with behavioral finance: The adaptive markets hypothesis</i>, 44.</dd>
<dt><a name="MertonCTF" id="MertonCTF">Mer99</a></dt>
<dd>Robert&nbsp;C. Merton, <i>Continuous-time finance</i>, Blackwell, 1999.</dd>
<dt><a name="AlgGT" id="AlgGT">NNV07</a></dt>
<dd>Noam&nbsp;Nisan, Tim&nbsp;Roughgarden, Eva&nbsp;Tardos and Vijay&nbsp;V. Vazirani, <i>Algorithmic game theory</i>, Cambridge, 2007.</dd>
<dt><a name="Rall:1996p2473" id="Rall:1996p2473">RC96</a></dt>
<dd>Louis&nbsp;B Rall and George&nbsp;F Corliss, <i>An introduction to automatic differentiation</i>, SIAM: Computational Differentiation: Techniques, Applications and Tools (1996), 1-18.</dd>
<dt><a name="Shafer:2004p1497" id="Shafer:2004p1497">Sha04</a></dt>
<dd>Glenn Shafer, <i>Why do price series look like ito processes?</i>, Rutgers (2004), 43.</dd>
<dt><a name="Steele:2003p2288" id="Steele:2003p2288">Ste03a</a></dt>
<dd>J&nbsp;Michael Steele, <i>Ito calculus</i>, Encyclopedia of Actuarial Sciences (2003), 1-12.</dd>
<dt><a name="citeulike:2635904" id="citeulike:2635904">Ste03b</a></dt>
<dd>J.&nbsp;Michael Steele, <i>Stochastic calculus and financial applications</i>, Springer, June 2003.</dd>
<dt><a name="symbdyn" id="symbdyn">TBS91</a></dt>
<dd>Tim&nbsp;Bedford, Michael&nbsp;Keane and Caroline Series, <i>Ergodic theory, symbolic dynamics and hyperbolic spaces</i>, Oxford University Press, 1991.</dd>
</dl>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot12" id="foot12">&#8230; Mount</a><a href="#tex2html3"><sup>1</sup></a></dt>
<dd>email: <tt><a name="tex2html1" href="mailto:jmount@win-vector.com" id="tex2html1">mailto:jmount@win-vector.com</a></tt> company: <tt><a name="tex2html2" href="http://www.win-vector.com/" id="tex2html2">http://www.win-vector.com/</a></tt></dd>
<dt><a name="foot37" id="foot37">&#8230; loss.</a><a href="#tex2html4"><sup>2</sup></a></dt>
<dd>We do not enforce any sort of &#8220;conservation of money&#8221; (that the amount of profit earned by the short trader should equal the amount of money lost by the long traders). In the real market there is an aspect of conservation of money in trades, but there is not a conservation of money in a single time period if the traders have net holdings.</dd>
<dt><a name="foot162" id="foot162">&#8230; situation.</a><a href="#tex2html5"><sup>3</sup></a></dt>
<dd>So <!-- MATH<br />
 $p_{\text{long}}<br />
\ge 0$<br />
 --><br />
<img width="71" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg12.png" alt="$ p_{\text{long}} \ge 0$"/> , <!-- MATH<br />
 $p_{\text{short}} \ge 0$<br />
 --><br />
<img width="76" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg13.png" alt="$ p_{\text{short}} \ge 0$"/> and <!-- MATH<br />
 $p_{\text{long}} +<br />
p_{\text{short}} \le 1$<br />
 --><br />
<img width="132" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg14.png" alt="$ p_{\text{long}} + p_{\text{short}} \le 1$"/> .</dd>
<dt><a name="foot153" id="foot153">&#8230; often.</a><a href="#tex2html6"><sup>4</sup></a></dt>
<dd>This &#8220;traders can imitate each other&#8221; is a &#8220;linearity of expectation argument&#8221;[<a href="#probmeth">AS92</a>] and is a common argument technique in game theory.</dd>
<dt><a name="foot156" id="foot156">&#8230; traders.</a><a href="#tex2html9"><sup>5</sup></a></dt>
<dd>The optimization problem has some easy aspects. At the optimum we can assume all the type 1 traders are identical (so we solve for one trader of magnitude <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> instead of solving for a population of <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> traders) and we can use automatic differentiation techniques[<a href="#Rall:1996p2473">RC96</a>] to get gradients as we work.</dd>
<dt><a name="foot97" id="foot97">&#8230; matrix</a><a href="#tex2html10"><sup>6</sup></a></dt>
<dd>We are being a little non-standard here in that we are writing <img width="18" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg44.png" alt="$ P$"/> as an operator on the left, so if <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg32.png" alt="$ x$"/> is the state-vector of probabilities at a given time tick then <img width="28" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg45.png" alt="$ P x$"/> is the state-vector of probabilities at the next time tick. This is not the convention in the Markov Chain literature, but more compatible with other topics in linear algebra.</dd>
<dt><a name="foot104" id="foot104">&#8230; trader.</a><a href="#tex2html11"><sup>7</sup></a></dt>
<dd>Some care has to be taken in computing the value of a strategy, as we need access to several additional transition matrices (each conditioned on knowing the proposed trade of the type 1 trader we are studying).</dd>
<dt><a name="foot158" id="foot158">&#8230; &#8220;superrationally&#8221;</a><a href="#tex2html12"><sup>8</sup></a></dt>
<dd>That is, each type 1 trader must dial down their trading activity to account for the number of other type 1 traders present. Douglas Hofstadter called such behavior &#8220;superrational&#8221;[<a href="#metamag">Hof85</a>]. Traders with small budgets who cannot collaborate are actually likely to do this, because while they are trading at too high a rate they lose money. However, a trader that can work at higher volume or tolerate larger losses can outwait the others and have the market for themselves.</dd>
<dt><a name="foot159" id="foot159">&#8230;</a><a href="#tex2html15"><sup>9</sup></a></dt>
<dd>The <img width="37" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg70.png" alt="$ o(n)$"/> is an &#8220;order-of&#8221; notation meant to denote a quantity that increases more slowly than <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> as <img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg26.png" alt="$ n$"/> gets large. An example <img width="37" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg70.png" alt="$ o(n)$"/> quantity would be <img width="30" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/09/dmmimg71.png" alt="$ \sqrt{n}$"/> . This notation (when used properly) greatly speeds up calculation by suppressing irrelevant details.</dd>
<dt><a name="foot136" id="foot136">&#8230; optimal</a><a href="#tex2html17"><sup>10</sup></a></dt>
<dd>At the very least we could tune the bluff frequency and also trade (albeit with less certainty) in the after-bluff periods.</dd>
<dt><a name="foot137" id="foot137">&#8230; gimmick.</a><a href="#tex2html18"><sup>11</sup></a></dt>
<dd>Unless they use the gimmick to collude to overcome the organized size of the large trader, but then the other type 1 traders are essentially also one large trader.</dd>
</dl>


]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/09/a-discrete-model-gauging-market-efficiency/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Good Graphs: Graphical Perception and Data Visualization</title>
		<link>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=good-graphs-graphical-perception-and-data-visualization</link>
		<comments>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 15:40:41 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Cleveland]]></category>
		<category><![CDATA[data exploration]]></category>
		<category><![CDATA[graphical perception]]></category>
		<category><![CDATA[Lattice]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=296</guid>
		<description><![CDATA[What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective [...]


]]></description>
			<content:encoded><![CDATA[<p>What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective visualization that neither obscures important details nor drowns us in confusing clutter? In 1985, William Cleveland published a text called <a href="http://www.stat.purdue.edu/~wsc/elements.html"><em>The Elements of Graphing Data,</em></a> inspired by Strunk and White&#8217;s classic writing handbook <a href="http://www.amazon.com/Elements-Style-50th-Anniversary/dp/0205632645"><em>The Elements of Style</em></a>. <em>The Elements of Graphing Data</em> puts forward Cleveland&#8217;s philosophy about how to produce good, clear graphs, not only for presenting one&#8217;s experimental results to peers, but also for the purposes of data analysis and exploration. Cleveland&#8217;s approach is based on a theory of graphical perception: how well the human perceptual system accomplishes certain tasks involved in reading a graph. For a given data analysis task, the goal is to align the information being presented with the perceptual tasks the viewer accomplishes best. <span id="more-296"></span></p>
<blockquote><p>When a graph is made, quantitative and categorical information is encoded by a display method. Then the information is visually decoded. This visual perception is a vital link. No matter how clever the choice of the information, and no matter how technologically impressive the encoding, a visualization fails if the decoding fails. Some display methods lead to efficient, accurate decoding, and others lead to inefficient, inaccurate decoding. It is only through scientific study of visual perception that informed judgments can be made about display methods. The display methods of <em>Elements</em> rest on a foundation of scientific enquiry.</p></blockquote>
<p>— from the preface of <em>The Elements of Graphing Data</em></p>
<p>A revised edition of <em>The Elements of Graphing Data</em> was published in 1994, along with a companion volume, <a href="http://www.stat.purdue.edu/~wsc/visualizing.html"><em>Visualizing Data,</em></a> which is oriented towards the implementation and technical details of different graphing techniques. I highly recommend <em>The Elements of Graphing Data</em> as a guidebook for creating graphs, as well as for its excellent survey of several useful techniques. Cleveland, along with other colleagues at Bell Labs, developed the <a href="http://stat.bell-labs.com/project/trellis/s.html">Trellis display system,</a> a framework for the visualization of multivariable databases, using the ideas developed in his texts. Trellis, in turn, influenced Deepayan Sarkar&#8217;s Lattice graphics system for R. Lattice implements many of Cleveland&#8217;s ideas, and I also recommend Sarkar&#8217;s <a href="http://lmdvr.r-forge.r-project.org/figures/figures.html">Lattice manual</a> if you do data visualization in R.</p>
<p>It&#8217;s important to note here that Cleveland writes for researchers and decision-makers who use graphs to analyze data, or to convey scientific results to colleagues in an (ideally) objective manner. This distinguishes him from Darrell Huff, whose 1954 <a href="http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728"><em>How to Lie with Statistics</em></a> considered the use of graphs (and statistics in general) as rhetorical devices for convincing others of one&#8217;s point of view. Hence, some of Cleveland&#8217;s recommendations and guidelines actually contradict Huff&#8217;s. <a id="refHuff" href="#Huff"><sup>1</sup></a></p>
<p>Edward Tufte also explored the idea that the choice of graphical display should be influenced by the viewer&#8217;s cognitive processes, in his 1990 book <a href="http://www.edwardtufte.com/tufte/books_ei"><em>Envisioning Information</em></a>. Tufte tends to be more broadly concerned with the gestalt of a graph, beyond its use as an analysis tool; he is also more concerned than Cleveland is with aesthetic considerations.</p>
<p>Cleveland&#8217;s philosophy might be summarized as: <em>minimize the mental gymnastics that the viewer must go through to understand the graph</em>. This leads to some obvious advice: avoid clutter and occlusion, make graphing symbols or color-coding unambiguous, use scale-lines on all four sides of the graph, and so on. It also leads to advice that perhaps should be as obvious, but isn&#8217;t: <em>make the aspect of the data that you want to analyze as clear as possible</em>. But what does this mean in practice?</p>
<p><strong>Make important differences large enough to perceive</strong></p>
<p>Weber&#8217;s Law is a well known observation from the psychophysics literature, which states that the &#8220;just noticeable&#8221; change in a stimulus is a constant ratio of the original stimulus. Put another way, people are only capable of detecting a change in a stimulus that is greater than a certain percentage <em>k</em> of the original stimulus. Here, &#8220;stimulus&#8221; can refer to any perceivable physical quantity: weight, intensity, length, orientation. The percentage <em>k</em> will vary with stimulus, and with observer.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/weberslaw.jpg" border="0" alt="weberslaw.jpg" width="488" height="233" /></div>
</td>
</tr>
</tbody>
<caption>Figure 1: From Cleveland, <em>The Elements of Graphing Data</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>Figure 1 shows the application of Weber&#8217;s law to lengths. The bars A and B are of different lengths, but the difference is such a small fraction of the &#8220;base&#8221; length (say, A&#8217;s length, to be specific) that it is difficult to tell whether or not they are different, or which is longer. On the right, the bars have been embedded in frames of identical length, and now it is easy to see that B is longer. Why? Because the difference in lengths of the <em>white</em> intervals is a much larger percentage of the white &#8220;base&#8221; length (say the white A interval). It is easy to see that the white B interval is shorter than the white A interval, and therefore that the black B interval is longer than the black A interval.</p>
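<p>A small numeric illustration of the framed-bar argument may help. The lengths below are hypothetical (they are not measured from Cleveland&#8217;s figure); the sketch only shows how the same absolute difference becomes a much larger fraction of the base length once the viewer compares the white intervals instead of the black bars.</p>

```python
def relative_difference(base, other):
    """Change in a length as a fraction of the base length (a Weber-style ratio)."""
    return abs(other - base) / base


frame = 12.0               # hypothetical common frame length
bar_a, bar_b = 10.0, 10.5  # hypothetical black bar lengths, B slightly longer

# Comparing the black bars directly: small change relative to a long base.
black_ratio = relative_difference(bar_a, bar_b)
# Comparing the white remainders inside the frames: same change, short base.
white_ratio = relative_difference(frame - bar_a, frame - bar_b)

print(f"black bars: {black_ratio:.0%}, white intervals: {white_ratio:.0%}")
```

<p>With these made-up numbers the difference is 5% of the black base length but 25% of the white base length, which is the sense in which the frames make the comparison easier.</p>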
<p>The moral is that you always want the viewer to be estimating changes or differences with respect to a short base length. You can do this with reference grids, as demonstrated below.</p>
<table border="0" align="center">
<caption>From Cleveland, <em>The Elements of Graphing Data</em></caption>
<tbody>
<tr>
<td><!-- original 319 by 601 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/noreferencegrids.jpg" border="0" alt="noreferencegrids.jpg" width="200" height="400" align="left" /></td>
<td><!-- original 319 by 601 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/referencegrids1.jpg" border="0" alt="referencegrids.jpg" width="200" height="400" align="right" /></td>
</tr>
<tr>
<td align="center">Figure 2</td>
<td align="center">Figure 3</td>
</tr>
</tbody>
</table>
<p>Figure 2 shows eight curves. Which one dips to the lowest minimum? Are the high curves approaching the same value, and which one is rising the fastest? Are the low curves dipping to the same minimum? Are they going to the same steady state? Figure 3 shows the same curves, graphed with identical reference grids. The grids shorten the base lengths that are being compared, and it is now much easier to compare highs, lows, and steady state behavior.</p>
<p>But wouldn&#8217;t it be better to compare the graphs by superposing them? For two or three curves, perhaps. But in this case, eight curves can clutter the graph, and use up the symbol or color space, making it difficult to distinguish the different datasets &#8212; increasing the mental gymnastics.</p>
<p>Reference grids are useful even for a single curve, especially one with slowly varying segments, such as these graphs have. The reference grid makes it easier to answer questions like: does the process return to the initial state, or to a different steady state? Has the process reached steady state, or is it still growing?</p>
<p><strong>Make important shape changes large enough to perceive: Banking to 45 degrees.</strong></p>
<p>The aspect ratio of a graph is important when trying to understand shape. Rate of change information is encoded in the slope of the curve, which the viewer estimates by changes in the orientation of the local tangents at each point of the graph. Weber&#8217;s Law tells us that very small changes in this orientation will be difficult to detect. For a given (physical) curve, the local orientation changes will be dependent on the aspect ratio of its graphical presentation, as shown (to an exaggerated degree) in Figure 4. Here, the same curve (two line segments) is plotted at three different aspect ratios, one that centers the graph at 45 degrees, one that forces the curve to be nearly vertical, and another that forces it to be nearly horizontal. In the last two cases, the change in orientation of the two line segments is so small as to be nearly undetectable.</p>
<table border="0" align="center">
<caption>Figure 4: From Cleveland</caption>
<tbody>
<tr>
<td><!-- original 670 by 630 --></p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/angles.jpg" border="0" alt="angles.jpg" width="446" height="420" align="left" /></div>
</td>
</tr>
</tbody>
</table>
<p>For two line segments with positive, unequal slopes, a simple geometric argument shows that their absolute difference in orientation is maximized by the aspect ratio that sets their average orientation to 45 degrees (the first graph in Figure 4). Empirical studies by Cleveland and others have indeed verified that a viewer&#8217;s ability to judge the relative slopes of line segments on a graph is maximized when the absolute values of the orientations of the segments are centered on 45 degrees.</p>
<p>This result leads to a technique called <em>Banking to 45</em>, whereby the aspect ratio of the graph is chosen so that the average slope of the entire graph is 45 degrees. The details are discussed in Cleveland, and many of the plots in R&#8217;s Lattice package also have an option to bank the graph to 45 degrees.</p>
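<p>As a rough sketch of the idea, the function below picks a height-to-width ratio so that the <em>median</em> absolute slope of the plotted segments is drawn at 45 degrees. Using the median is an assumption made here for simplicity; it is one common banking heuristic, not necessarily the exact rule in Cleveland or in Lattice. The name <code>bank_to_45</code> is hypothetical, and the data is assumed not to be flat.</p>

```python
from statistics import median


def bank_to_45(xs, ys):
    """Return a height/width ratio that gives a median absolute slope of 1.

    Assumption: "median absolute slope" banking, one simple heuristic;
    the function name and interface are hypothetical.
    """
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    # Absolute slope of each line segment in data units (skip vertical segments).
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
        if x1 != x0
    ]
    # On a plot of width w and height h, a data slope s is drawn with
    # physical slope s * (x_range / y_range) * (h / w).  Setting the
    # median of that to 1 and solving for h / w gives:
    return y_range / (x_range * median(slopes))


# Mostly slope-1 segments followed by one steep jump.
print(bank_to_45([0, 1, 2, 3, 4], [0, 1, 2, 3, 10]))
```

<p>For this example the result is 2.5: the plot should be two and a half times as tall as it is wide, so that the typical (slope 1) segments sit at 45 degrees rather than being flattened by the final jump.</p>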
<p>This deliberate exaggeration of slope is something that Darrell Huff deplores. In <em>How to Lie with Statistics</em>, Huff refers to these graphs as &#8220;gee-whiz&#8221; graphs — and in the context of his discussion of statistics as rhetoric, they are:</p>
<table border="0" align="center">
<caption>Figure 5: From Huff, <em>How to Lie With Statistics</em></caption>
<tbody>
<tr>
<td><!-- original 461 by 351 --></p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/geewhiz.jpg" border="0" alt="geewhiz.jpg" width="461" height="351" /></div>
</td>
</tr>
</tbody>
</table>
<p>To insist that a graph should always include a zero line and that units be in proportion may be good advice from a rhetorical perspective; but it is poor advice if the purpose of the graph is data analysis. As Figure 6 below demonstrates, we can lose resolution if we always insist on including the zero. Does the trend line in the left graph increase linearly, superlinearly, or sublinearly? The convexity of the curve is more apparent when it is banked to 45, as on the right. Assuming that the scientist reads the axis and is cognizant of the actual magnitude changes involved, the graph on the right conveys more information.</p>
<table border="0" align="center">
<caption>Figure 6: From Cleveland</caption>
<tbody>
<tr>
<td><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/bank451.jpg" border="0" alt="bank45.jpg" width="500"  /></td>
</tr>
</tbody>
</table>
<p><strong>Make sure all the data is equally well resolved.</strong></p>
<p>It is quite common for positive data —  word frequencies, populations, price distributions, just to name a few examples — to be skewed: most of the data is bunched towards low values, the rest of it is spread out on a very long tail. This long tail squashes the majority of the data into a tiny interval of a very narrow dynamic range, as in Figure 7, making it difficult to evaluate the data.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original size: 990 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/skewed1.gif" border="0" alt="skewed.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 7: Long-tailed distribution of purchase sizes</caption>
</table>
</td>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original = 499 by 675 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/logskewed1.gif" border="0" alt="logskewed.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 8: Distribution of log(purchase size)</caption>
</table>
</td>
</tr>
</tbody>
</table>
<p>Imagine that Figure 7 represents the distribution of average purchase size across an online merchant&#8217;s customers: average purchase size is plotted on the x-axis, and the y-axis represents the fraction of the total customer population whose average purchase size is a given value (the area under the graph is one). According to this graph, most customers make fairly small purchases on average, but there is a long tail of big spenders trailing out into the range of several thousand dollars. Obviously, one would like a little more resolution on the big spike of customers near zero. One could simply &#8220;zoom in&#8221; on this range by chopping off a long chunk of the tail, but doing so risks losing sight of global patterns in the data.</p>
<p>Graphing the distribution of log(purchase size) enables you to increase the resolution near zero, while preserving the global view. Figure 8 shows the distribution of log(purchase size), revealing two spending populations: a population of high spenders who tend to make purchases in the $3000 range (in log space), and another population whose purchases are centered (in log space) around $60. The existence of these two distinct populations is not apparent in the original graph.</p>
<p>Notice that Figure 8 has two x-axis scales: the top axis is marked in log units, while the bottom axis is marked in absolute dollars, spaced on a log scale. This accords with the principle of minimizing mental gymnastics, since the viewer of the graph will typically be concerned about prices in dollars, not log dollars. In fact, it would have been better yet to have plotted the distribution of log<sub>2</sub> or log<sub>10</sub> of the data; the former would allow us to see at a glance the doubling of price ranges, the latter to see price changes in factors of ten.</p>
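<p>To make the gain in resolution concrete, the sketch below measures how much of the axis is occupied by the small purchases on a linear scale versus a log<sub>10</sub> scale. The purchase sizes are hypothetical (they are not the data behind Figures 7 and 8), as is the helper <code>axis_fraction</code>.</p>

```python
import math


def axis_fraction(values, cutoff, transform=lambda v: v):
    """Fraction of the axis length occupied by values at or below cutoff."""
    lo, hi = transform(min(values)), transform(max(values))
    return (transform(cutoff) - lo) / (hi - lo)


# Hypothetical average purchase sizes in dollars (not the data behind Figure 7).
purchases = [20, 45, 60, 75, 180, 1000, 3000, 9000]

linear = axis_fraction(purchases, 180)
logged = axis_fraction(purchases, 180, transform=math.log10)

print(f"linear axis: {linear:.1%} of the width holds purchases up to $180")
print(f"log10 axis:  {logged:.1%} of the width holds purchases up to $180")
```

<p>On the linear axis, more than half of these customers are squeezed into under 2 percent of the plot width; on the log axis the same customers get about 36 percent of it, while the big spenders remain in view.</p>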
<table border="0" align="center">
<caption>Figure 9: The 14 most abundant elements in meteorites. From Cleveland</caption>
<tbody>
<tr>
<td><!-- original = 543 by 522 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/metals.jpg" border="0" alt="metals.jpg" width="250" /></td>
<td><!-- original = 550 by 600 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/logmetals.jpg" border="0" alt="logmetals.jpg" width="250" /></td>
</tr>
</tbody>
</table>
<p>Figure 9 shows another example: the fourteen most abundant elements in meteorites, specifically the average percent of each of the elements. If we graph the percentages directly, as on the left, we cannot easily distinguish the differences in the elements from aluminum on down. Graphing log<sub>2</sub> of the percentages, as on the right, improves the resolution. Again, we have two x-axes on the graph of the log data.</p>
<p><strong>If you want to analyze the difference between two processes, then graph the difference, not the processes (or graph both).</strong></p>
<p>Suppose that we are comparing the two processes f1 and f2 that are shown in Figure 10. As x increases, the two processes appear to be approaching each other  — that is, the difference between the two seems to be decreasing. In reality, the difference between the two is constant: f2 = f1+1.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original size: 990 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/difference1.gif" border="0" alt="difference.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 10: The illusion of convergence</caption>
</table>
</td>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original = 499 by 675 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/imports.jpg" border="0" alt="imports.jpg" width="250" /></td>
</tr>
</tbody>
<caption>Figure 11: British Imports and Exports. From Cleveland</caption>
</table>
</td>
</tr>
</tbody>
</table>
<p>It turns out that people are good at perceiving the perpendicular difference between two curves, but not the differences in height, which is what we are actually interested in here. When we try to infer the differences from the process graph, we may not only miss key information, we may actually draw incorrect conclusions.</p>
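<p>The illusion can be reproduced numerically. In the sketch below, <code>f1</code> is a hypothetical steeply rising process (an exponential, chosen only for illustration and not taken from Figure 10) and <code>f2 = f1 + 1</code>. The vertical gap is constant, but the nearest-point distance between the curves, which is closer to what the eye judges, shrinks as the curves get steeper.</p>

```python
import math


def f1(x):
    return math.exp(x)      # hypothetical steeply rising process


def f2(x):
    return f1(x) + 1.0      # constant vertical offset of 1


# Dense sample of the lower curve, used for a brute-force nearest-point search.
GRID = [i / 1000.0 for i in range(0, 5001)]      # x from 0 to 5
CURVE = [(x, f1(x)) for x in GRID]


def vertical_gap(x):
    """Difference in height at x: the quantity we actually care about."""
    return f2(x) - f1(x)


def nearest_gap(x):
    """Shortest distance from the point (x, f2(x)) to the sampled f1 curve."""
    px, py = x, f2(x)
    return min(math.hypot(px - cx, py - cy) for cx, cy in CURVE)


for x in (0.0, 1.5, 3.0):
    print(f"x={x}: vertical={vertical_gap(x):.2f}, nearest={nearest_gap(x):.2f}")
```

<p>The vertical gap stays at 1 throughout, while the nearest-point distance drops from roughly 0.6 at x = 0 to roughly 0.05 at x = 3. Plotting <code>f2 - f1</code> directly removes the ambiguity.</p>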
<p>A less toy example is given in Figure 11. Here the imports to and exports from England are graphed over the first 80 years of the 18th century. In the difference graph on the bottom, we can see a local peak in (imports-exports) just after 1760; this is not obvious from simply comparing the two processes (top graph).</p>
<p><strong>If you are interested in rate of change, then graph rate of change.</strong></p>
<p>In Figure 12, we see the population figures for a given community from 1990 to 2009. Obviously, the population is steadily increasing, but how quickly? Is the rate of population growth increasing over time, or is it decreasing? If we are interested in these questions, then simply graphing the population over time is not enough. We need to look at the rate of change directly.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<caption>Figure 12</caption>
<tbody>
<tr>
<td><!-- original 998 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/rateofchange1.gif" border="0" alt="rateofchange.gif" width="250" /></td>
</tr>
</tbody>
</table>
</td>
<td>
<table border="0">
<caption>Figure 13</caption>
<tbody>
<tr>
<td><!-- original 720 by 720 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lograteofchange2.gif" border="0" alt="lograteofchange.gif" width="250" /></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The classic way to do this is by graphing the logarithm of the data. In Figure 13, we have graphed log<sub>2</sub> of the population over time, with the log scale printed on the right hand y-axis, and the actual population numbers printed at a log scale on the left hand axis. Now we can see that the population increased at a constant rate from 1990 to 2000, quadrupling approximately every four years, and then slowed down (to a lower constant rate) after 2000.</p>
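<p>The reason the log plot works is that the slope of log<sub>2</sub>(population) is the number of doublings per unit time, so a constant growth rate becomes a straight line. A minimal sketch with a hypothetical population series (illustrative numbers, not the data in Figures 12 and 13):</p>

```python
import math

# Hypothetical population that doubles every 4 years until 2000,
# then every 10 years (illustrative numbers, not the data in Figure 12).
years = list(range(1990, 2010))
population = []
for year in years:
    if year > 2000:
        doublings = 10 / 4 + (year - 2000) / 10
    else:
        doublings = (year - 1990) / 4
    population.append(1000 * 2 ** doublings)

# Year-over-year slope of log2(population) = doublings per year.
log2_pop = [math.log2(p) for p in population]
slopes = [b - a for a, b in zip(log2_pop, log2_pop[1:])]

for year, slope in zip(years[1:], slopes):
    print(year, f"doubling time is about {1 / slope:.1f} years")
```

<p>The raw population values curve upward the whole time, but the log<sub>2</sub> slopes take only two values: 0.25 doublings per year up to 2000 and 0.1 afterwards. The change in growth rate shows up as a visible kink in the log plot.</p>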
<p><strong>Graphs as a research tool</strong></p>
<p>Throughout this discussion, we have considered graphs as a tool for data exploration and initial understanding. Exploration is an iterative process: as questions arise, the data will be reprocessed and re-plotted to highlight the new issues to be examined. A good research graph must display this information directly, with a minimum of mental gymnastics, but (as with any research tool) there can be a learning curve. For example, density plots (such as those shown in Figures 7 and 8) are in my opinion more useful than histograms for understanding how numerical data is distributed, and I am constantly surprised at the amount of explanation that they require when I show them to people who are unfamiliar with them. A number of very useful graphs that are discussed in Cleveland&#8217;s texts meet with the same reaction from people who encounter that style of graph for the first time. This is a disadvantage, relative to using a more fashionable graph, when attempting to communicate results. But the insight into the data that these graphs provide often makes it worth spending the time to educate clients or peers on how to read the graph.</p>
<p>Even so, a good graph still may not be a quick read. As Cleveland writes:</p>
<blockquote><p>While there is a place for rapidly-understood graphs, it is too limiting to make speed a requirement in science and technology, where the use of graphs ranges from detailed in-depth data analysis to quick presentation.<br />
&#8230;</p>
<p>The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.</p></blockquote>
<p>- <em>The Elements of Graphing Data</em>, Chapter 2</p>
<hr /><a id="Huff" href="#refHuff">[Back]</a><sup>1</sup><em>How to Lie with Statistics</em> is an entertaining (if a little dated) discussion of how to read statistical and quantitative claims critically, and is definitely worth a read.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Permanent Link: Living in A Lognormal World'>Living in A Lognormal World</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='Permanent Link: The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A Demonstration of Data Mining</title>
		<link>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=a-demonstration-of-data-mining</link>
		<comments>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/#comments</comments>
		<pubDate>Thu, 20 Aug 2009 01:16:27 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[Regression]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=252</guid>
		<description><![CDATA[REPOST (now in HTML in addition to the original PDF). This paper demonstrates and explains some of the basic techniques used in data mining. It also serves as an example of some of the kinds of analyses and projects Win Vector LLC engages in. August 19, 2009 John Mount1 A Demonstration of Data Mining 1&#160;&#160;Introduction [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='Permanent Link: The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Permanent Link: Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Permanent Link: Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>REPOST (now in HTML in addition to the original  <a href="http://www.win-vector.com/dfiles/ADemonstrationOfDataMining.pdf"> PDF</a>).</p>
<p>This paper  demonstrates and explains some of the basic techniques used in data mining.  It also serves as an example of some of the kinds of analyses and projects Win Vector LLC engages in.<span id="more-252"></span>
<div class="p"><!----></div>
<h3 align="center">August 19, 2009 </h3>
<h3 align="center">John Mount<a href="#tthFtNtAAB" name="tthFrefAAB"><sup>1</sup></a> </h3>
<h1 align="center">A Demonstration of Data Mining </h1>
<div class="p"><!----></div>
<h2><a name="tth_sEc1"><br />
1</a>&nbsp;&nbsp;Introduction</h2>
<div class="p"><!----></div>
<p> A major industry in our time is the collection of large data sets in preparation for the magic of data mining [<a href="#NYTStat" name="CITENYTStat">Loh09</a>,<a href="#Halevy:2009p2327" name="CITEHalevy:2009p2327">HNP09</a>].  There is extreme excitement about both the possible applications (identifying important customers, identifying medical risks, targeting advertising, designing auctions and so on) and the various methods for data mining and machine learning.  To some extent these methods are classic statistics presented in a new bottle.  Unfortunately, the concerns, background and language of the modern data-mining practitioner differ from those of the classic statistician, so some demonstration and translation is required.  In this writeup we will show how much of the magic of current data mining and machine learning can be explained in terms of statistical regression techniques and show how the statistician&#8217;s view is useful in choosing techniques.</p>
<div class="p"><!----></div>
<p> Too often data mining is used as a black box. It is quite possible to use statistics to understand clearly the meaning and mechanisms of data mining.</p>
<div class="p"><!----></div>
<h2><a name="tth_sEc2"><br />
2</a>&nbsp;&nbsp;The Example Problem</h2>
<div class="p"><!----></div>
<p> Throughout this writeup we will work on a single idealized example problem.  For our problem we will assume we are working with a company that sells items and that this company has recorded its past sales visits.  We assume they recorded how well the prospect matched the product offering (we will call this &#8220;match factor&#8221;), how much of a discount was offered to the prospect (we will call this &#8220;discount factor&#8221;) and whether the prospect became a customer (this is our determination of positive or negative outcome).  The goal is to use this past record as &#8220;training data&#8221; and build a model to predict the odds of making a new sale as a function of the match factor and the discount factor.  In a perfect world the historic data would look a lot like Figure&nbsp;<a href="#fig:IdealFitting">1</a>.  In Figure&nbsp;<a href="#fig:IdealFitting">1</a> each icon represents a past sales-visit, the red diamonds are non-sales and the green disks are successful sales.  Each icon is positioned horizontally to correspond to the discount factor used and vertically to correspond to the degree of product match estimated during the prospective customer visit.  This data is literally too good to be true in at least three respects: the past data covers a large range of possibilities, every possible combination has already been tried in an orderly fashion, and the good and bad events &#8220;are linearly separable.&#8221;  The job of the modeler would then be to draw the separating line (shown in Figure&nbsp;<a href="#fig:IdealFitting">1</a>) and label every situation above and to the right of the separating line as good (or positive) and every situation below and to the left as bad (or negative).</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg1"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/IdealFitting.png" alt="IdealFitting.png" /></p>
<p></center><center>Figure 1: Ideal Fitting Situation</center><br />
<a name="fig:IdealFitting"><br />
</a></p>
<div class="p"><!----></div>
<p> In reality past data is subject to what prospects were available (so you are unlikely to have good range and an orderly layout of past sales calls) and also heavily affected by past policy.  An example policy might be that potential customers with a good product match factor may never have been offered a significant discount in the past; so we would have no data from that situation.  Finally, each outcome is a unique event that depends on a lot more than the two quantities we are recording, so it is too much to hope that the good prospects are simply separable from the bad ones.</p>
<div class="p"><!----></div>
<p> Figure&nbsp;<a href="#fig:IdealFitting">1</a> is a mere cartoon or caricature of the modeling process, but it represents the initial intuition behind data mining.  Again: the flaws in Figure&nbsp;<a href="#fig:IdealFitting">1</a> represent the implicit hopes of the data miner.  The data miner wishes that the past experiments are laid out in an orderly manner, data covers most of the combinations of possibilities and there is a perfect and simple concept ready to be learned.</p>
<div class="p"><!----></div>
<p> Frankly, an experienced data miner would feel incredibly fortunate if the past data looked anything like what is shown in Figure&nbsp;<a href="#fig:EmpiricalData">2</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg2"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/empirical1.png" alt="empirical1.png" /></p>
<p></center><center>Figure 2: Empirical Data</center><br />
<a name="fig:EmpiricalData"><br />
</a></p>
<div class="p"><!----></div>
<p> The green disks (representing good past prospects) and the red diamonds (representing bad past prospects) are intermingled (which is bad).  There is some evidence that past policy was to lower the discount offered as the match factor increased (as seen in the diagonal spread of the green disks).  Finally we see the red diamonds are also distributed differently than the green disks. This is both good and bad.  The good is that the center of mass of the red diamonds differs from the center of mass of the green disks.  The bad is that the density of red diamonds does not fall any faster as it passes into the green disks than it falls in any other direction.  This indicates there is something important and different (and not measured in our two variables) about at least some of the bad prospects.  It is the data miner&#8217;s job to be aware of this and to press on.</p>
<div class="p"><!----></div>
<h3><a name="tth_sEc2.1"><br />
2.1</a>&nbsp;&nbsp;The Trendy Now</h3>
<div class="p"><!----></div>
<p> In truth data miners often rush where classical statisticians fear to tread.  Right now the temptation is to immediately select from any number of &#8220;red hot&#8221; techniques, methods or software packages.  My short list of super-star method buzzwords includes:</p>
<div class="p"><!----></div>
<ul>
<li> Boosting[<a href="#Schapire:2001p1019" name="CITESchapire:2001p1019">Sch01</a>,<a href="#Breiman:2000p1134" name="CITEBreiman:2000p1134">Bre00</a>,<a href="#Freund:2003p1009" name="CITEFreund:2003p1009">FISS03</a>]
<div class="p"><!----></div>
</li>
<li> Latent Dirichlet Allocation[<a href="#Blei:2003p1063" name="CITEBlei:2003p1063">BNJ03</a>]
<div class="p"><!----></div>
</li>
<li> Linear Regression[<a href="#statistics" name="CITEstatistics">FPP07</a>,<a href="#Agresti" name="CITEAgresti">Agr02</a>]
<div class="p"><!----></div>
</li>
<li> Linear Discriminant Analysis[<a href="#Fisher:1936p2576" name="CITEFisher:1936p2576">Fis36</a>]
<div class="p"><!----></div>
</li>
<li> Logistic Regression[<a href="#Agresti" name="CITEAgresti">Agr02</a>,<a href="#Klein:2003p261" name="CITEKlein:2003p261">KM03</a>]
<div class="p"><!----></div>
</li>
<li> Kernel Methods[<a href="#kernel1" name="CITEkernel1">CST00</a>,<a href="#kernel2" name="CITEkernel2">STC04</a>]
<div class="p"><!----></div>
</li>
<li> Maximum Entropy[<a href="#Klein:2003p261" name="CITEKlein:2003p261">KM03</a>,<a href="#Grunwald:2005p108" name="CITEGrunwald:2005p108">Gru05</a>,<a href="#Stern:1989p1480" name="CITEStern:1989p1480">SC89</a>,<a href="#Dudik:2006p954" name="CITEDudik:2006p954">DS06</a>]
<div class="p"><!----></div>
</li>
<li> Naive Bayes[<a href="#Lewis:1998p105" name="CITELewis:1998p105">Lew98</a>]
<div class="p"><!----></div>
</li>
<li> Perceptrons[<a href="#Beigel:2008p1027" name="CITEBeigel:2008p1027">BRS08</a>,<a href="#Dasgupta:2005p2013" name="CITEDasgupta:2005p2013">DKM05</a>]
<div class="p"><!----></div>
</li>
<li> Quantile Regression[<a href="#quantile" name="CITEquantile">Koe05</a>]
<div class="p"><!----></div>
</li>
<li> Ridge Regression[<a href="#Breiman:1997p1133" name="CITEBreiman:1997p1133">BF97</a>]
<div class="p"><!----></div>
</li>
<li> Support Vector Machines[<a href="#kernel1" name="CITEkernel1">CST00</a>]
<div class="p"><!----></div>
</li>
</ul>
<div class="p"><!----></div>
<p> Based on some of the above referenced writing and analysis I would first pick &#8220;logistic regression&#8221; as I am confident that, when used properly, it is just about as powerful as any of the modern data mining techniques (despite its somewhat less than trendy status).  Using logistic regression I immediately get just about as close to a separating line as this data set will support: Figure&nbsp;<a href="#fig:LinearSepartor">3</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg3"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lin1.png" alt="lin1.png" /></p>
<p></center><center>Figure 3: Linear Separator</center><br />
<a name="fig:LinearSepartor"><br />
</a></p>
<div class="p"><!----></div>
<p> The separating line actually encodes a simple rule of the form: &#8220;if 2.2*DiscountFactor + 3.1*MatchFactor &#8805; 1 then we have a good chance of a sale.&#8221;  This is classic black-box data mining magic.  The purpose of this writeup is to look deeper into how to actually derive and understand something like this.</p>
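<p>Written out as code, the rule is a one-line linear score compared against a threshold. The coefficients and threshold below are the ones quoted in the text; the function name, argument names and the example inputs are hypothetical (the writeup does not state the units or range of the two factors).</p>

```python
def likely_sale(discount_factor, match_factor):
    """Apply the separating-line rule quoted above.

    The coefficients 2.2 and 3.1 and the threshold 1 come from the text;
    the function name and argument names are hypothetical.
    """
    score = 2.2 * discount_factor + 3.1 * match_factor
    return score >= 1.0


# Made-up inputs, for illustration only.
print(likely_sale(0.1, 0.1))   # score is about 0.53, below the line
print(likely_sale(0.3, 0.2))   # score is about 1.28, above the line
```

<p>The rule itself is trivial to apply; the interesting question, taken up next, is where the two coefficients come from and how much to trust them.</p>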
<div class="p"><!----></div>
<h2><a name="tth_sEc3"><br />
3</a>&nbsp;&nbsp;Explanation</h2>
<div class="p"><!----></div>
<p> What is really going on?  Why is our magic formula at all sensible advice, why did this work at all and what motivates the analysis?  It turns out regression (be it linear regression or logistic regression) works in this case because it somewhat imitates the methodology of linear discriminant analysis (described in: [<a href="#Fisher:1936p2576" name="CITEFisher:1936p2576">Fis36</a>]).  In fact in many cases it would be a better idea to perform a linear discriminant analysis or perform an analysis of variance than to immediately appeal to a complicated method.  I will first step through the process of linear discriminant analysis and then relate it to our logistic regression.  Stepping through understandable stages lets us see where we were lucky in modeling and what limits and opportunities for improvement we have.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg4"><br />
</a><br />
<center></p>
<table>
<tr>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/posDat.png" alt="posDat.png" /></td>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/negDat.png" alt="negDat.png" />
</td>
</tr>
</table>
<p></center><center>Figure 4: Separate Plots</center><br />
<a name="fig:SeparatePlots"><br />
</a></p>
<div class="p"><!----></div>
<p> Our data initially looks very messy (the good and bad groups are fairly mixed together).  But if we examine our data in separate groups we can see we are actually incredibly lucky in that the data is easy to describe.  As we can see in Figure&nbsp;<a href="#fig:SeparatePlots">4</a>: the data, when separated by outcome (plotting only all of the good green disks or only all of the bad red diamonds), is grouped in simple blobs without bends, intrusions or other odd (and more work to model) configurations.</p>
<div class="p"><!----></div>
<p> We can plot the idealizations of these data distributions (or densities) as &#8220;contour maps&#8221; (as if we are looking down on the elevations of a mountain on a map) which gives us Figure&nbsp;<a href="#fig:SeparateDistributions">5</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg5"><br />
</a><br />
<center></p>
<table>
<tr>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/posDist.png" alt="posDist.png" /></td>
<td> <img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/negDist.png" alt="negDist.png" />
</td>
</tr>
</table>
<p></center><center>Figure 5: Separate Distributions</center><br />
<a name="fig:SeparateDistributions"><br />
</a></p>
<div class="p"><!----></div>
<h3><a name="tth_sEc3.1"><br />
3.1</a>&nbsp;&nbsp;Full Bayes Model</h3>
<div class="p"><!----></div>
<p> From Figure&nbsp;<a href="#fig:SeparateDistributions">5</a> we can see that while our data is not separable there are significant differences between the groups.  The difference in the groups is more obvious if we plot the difference of the densities on the same graph as in Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a>.  Here we are visualizing the distribution of positive examples as a connected pair of peaks (colored green) and the distribution of negative examples as a deep valley (colored red) located just below and to the left of the peaks.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg6"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/diff1.png" alt="diff1.png" /></p>
<p></center><center>Figure 6: Difference in Density</center><br />
<a name="fig:DifferenceInDensity"><br />
</a></p>
<div class="p"><!----></div>
<p> This difference graph demonstrates how the two densities or distributions (positive and negative) reach into different regions of the plane.  The white areas are where the difference in densities is very small, which includes the areas in the corners (where there is little of either distribution) and the area between the blobs (where there is a lot of mass from both distributions competing).  This view is a bit closer to what a statistician wants to see: how the distributions of successes and failures differ (this is a step to take before even guessing at or looking for causes and explanations).</p>
<div class="p"><!----></div>
<p> Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a> is already an actionable model: we can predict the odds a new prospect will buy or not at a given discount by looking where they fall on Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a> and checking whether they fall in a region of strong red or strong green color.  We can also recommend a discount for a given potential customer by drawing a line at the height determined by their degree of match and tracing from left to right until we first hit a strong green region.  We could hand out a simplified Figure&nbsp;<a href="#fig:FullBayesModel">7</a> as a sales rulebook.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg7"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/bayesModel1.png" alt="bayesModel1.png" /></p>
<p></center><center>Figure 7: Full Bayes Model</center><br />
<a name="fig:FullBayesModel"><br />
</a></p>
<div class="p"><!----></div>
<p> This model is a full Bayes model (but not a Naive Bayes model, which is oddly more famous and which we will cover later).  The steps we took were: first we summarized or idealized our known data into two Gaussian blobs (as depicted in Figure&nbsp;<a href="#fig:SeparateDistributions">5</a>).  Once we had estimated the centers, widths and orientations of these blobs we could then, for any new point, say how likely the point is under the modeled distribution of sales and how likely the point is under the modeled distribution of non-sales.  Mathematically we claim we can estimate P(x,y &#124; sale)<a href="#tthFtNtAAC" name="tthFrefAAC"><sup>2</sup></a> and P(x,y &#124; non-sale) (where x is our discount factor and y is our matching factor).<a href="#tthFtNtAAD" name="tthFrefAAD"><sup>3</sup></a> Neither of these is what we are actually interested in (we want: P(sale &#124; x,y)<a href="#tthFtNtAAE" name="tthFrefAAE"><sup>4</sup></a>).  We can, however, use these values to calculate what we want to know.  Bayes&#8217; law is a law of probability that says if we know P(x,y &#124; sale), P(x,y &#124; non-sale), P(sale) and P(non-sale)<a href="#tthFtNtAAF" name="tthFrefAAF"><sup>5</sup></a> then:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn1.png"/><br />
</center></p>
<p>Figure&nbsp;<a href="#fig:FullBayesModel">7</a> depicts a central hourglass shaped region (colored green) that represents the x, y values where P(sale &#124; x,y) is estimated to be at least 0.5; the remaining (darker red) region covers the situations predicted to be less favorable.  Here we are using priors of P(sale) = P(non-sale) = 0.5; for different priors and thresholds we would get different graphs.</p>
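<p> As a minimal sketch of this calculation (in Python, with made-up blob parameters rather than the values fitted for our figures), the following evaluates two Gaussian densities and combines them with the priors using Bayes&#8217; law.  The names <tt>gaussian_density</tt> and <tt>posterior_sale</tt> and the constants <tt>SALE</tt> and <tt>NON_SALE</tt> are illustrative assumptions, not part of the original analysis.</p>

```python
import math

def gaussian_density(x, y, mean, cov):
    """Density at (x, y) of a 2-D Gaussian with mean (mx, my) and
    covariance ((sxx, sxy), (sxy, syy))."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x - mean[0], y - mean[1]
    # Quadratic form (dx, dy) * inverse(cov) * (dx, dy)^T for a 2x2 matrix.
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def posterior_sale(x, y, sale, non_sale, prior_sale=0.5):
    """Bayes' law: P(sale | x,y) from P(x,y | sale), P(x,y | non-sale)
    and the priors P(sale), P(non-sale) = 1 - P(sale)."""
    p_sale = gaussian_density(x, y, *sale) * prior_sale
    p_non = gaussian_density(x, y, *non_sale) * (1.0 - prior_sale)
    return p_sale / (p_sale + p_non)

# Hypothetical (mean, covariance) pairs; NOT the blobs from the figures.
SALE = ((0.6, 0.6), ((0.04, -0.01), (-0.01, 0.04)))
NON_SALE = ((0.4, 0.4), ((0.04, 0.0), (0.0, 0.04)))

print(posterior_sale(0.6, 0.6, SALE, NON_SALE))  # near the sale blob
print(posterior_sale(0.4, 0.4, SALE, NON_SALE))  # near the non-sale blob
```

<p> Thresholding the returned value at 0.5 gives the kind of accept/reject region drawn in Figure&nbsp;<a href="#fig:FullBayesModel">7</a>; changing <tt>prior_sale</tt> moves the boundary.</p>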
<div class="p"><!----></div>
<p> Even at this early stage in the analysis we have already accidentally introduced what we call &#8220;an inductive bias.&#8221;  By modeling both distributions as Gaussians we have guaranteed that our acceptance region will be an hourglass figure (as we saw in Figure&nbsp;<a href="#fig:FullBayesModel">7</a>).  One undesirable consequence of the modeling technique is the prediction that sales become unlikely when both match factor and discount factor are very large.  This is partly a consequence of our modeling technique (though the fact that the negative data does not fall off quickly as it passes into the green region also contributes).  This unrealistic (or &#8220;not physically plausible&#8221;) prediction is called an artifact (of the technique and of the data) and it is the statistician&#8217;s job to see this, confirm they don&#8217;t want it and eliminate it (by deliberately introducing a &#8220;useful modeling bias&#8221;).</p>
<div class="p"><!----></div>
<h3><a name="tth_sEc3.2"><br />
3.2</a>&nbsp;&nbsp;Linear Discriminant</h3>
<div class="p"><!----></div>
<p> To get around the bad predictions of our model in the upper-right quadrant we &#8220;apply domain knowledge&#8221; and introduce a useful modeling bias as follows.  Let us insist that our model be monotone: that if moving in some direction is good then moving further in the same direction is better.  In fact let&#8217;s insist that our model be a half-plane (instead of two parabolas).  We want a nice straight separating cut, which brings us to linear discriminant analysis.  We have enough information to apply the Fisher linear discriminant technique and find a separator that maximizes the variance of data across categories while minimizing the variance of data within one category and within the other category.  This is called the linear discriminant and it is shown in Figure&nbsp;<a href="#fig:LinearDiscriminant">8</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg8"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lda1.png" alt="lda1.png" /></p>
<p></center><center>Figure 8: Linear Discriminant</center><br />
<a name="fig:LinearDiscriminant"><br />
</a></p>
<div class="p"><!----></div>
<p> The blue line is the linear discriminant (similar to the logistic regression line depicted earlier on the data-slide).  Everything above or to the right of the blue line is considered good and everything below or to the left of the blue line is considered bad.  Notice that this advice, while not quite as accurate as the Bayes model near the boundary between the two distributions, is much more sensible about the upper right corner of the graph.</p>
<div class="p"><!----></div>
<p> To evaluate a separator we collapse all variation parallel to the separating cut (as shown in Figure&nbsp;<a href="#fig:collapse">9</a>).  We then see that each distribution becomes a small interval or streak.  A separator is good if these resulting streaks are both short (the collapse packs the blobs) and the two centers of the streaks are far apart (and on opposite sides of the separator).  In Figure&nbsp;<a href="#fig:collapse">9</a> the streaks are fairly short and despite some overlap we do have some usable separation between the two centers.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg9"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/collapse2.png" alt="collapse2.png" /></p>
<p></center><center>Figure 9: Evaluating Quality of Separating Cut</center><br />
<a name="fig:collapse"><br />
</a></p>
<div class="p"><!----></div>
<p> To make the above precise we switch to mathematical notation.  For the i-th positive training example form the vector v<sub>+,i</sub> and the matrix S<sub>+,i</sub> where</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn2.png"/><br />
</center></p>
<div class="p"><!----></div>
<p> where x<sub>i</sub> and y<sub>i</sub> are the known x and y coordinates for this particular past experience.  Define v<sub>&#8722;,i</sub>, S<sub>&#8722;,i</sub> similarly for all negative examples.  In this notation we have for a direction &#947;: the distance along the &#947; direction between the center of positive examples and center of negative examples is: &#947;<sup>T</sup> ( &#8721;<sub>i</sub> v<sub>+,i</sub> / n<sub>+</sub> &#8722; &#8721;<sub>i</sub> v<sub>&#8722;,i</sub> / n<sub>&#8722;</sub>) (where n<sub>+</sub> is the number of positive examples and n<sub>&#8722;</sub> is the number of negative examples).  We would like this quantity to be large.  The degree of spread or variance of the positive examples along the &#947; direction is &#947;<sup>T</sup> (&#8721;<sub>i</sub> S<sub>+,i</sub> / n<sub>+</sub>) &#947;.  The degree of spread or variance of the negative examples along the &#947; direction is &#947;<sup>T</sup> (&#8721;<sub>i</sub> S<sub>&#8722;,i</sub> / n<sub>&#8722;</sub>) &#947;.  We would like the last two quantities to be small.  The linear discriminant is picked to maximize:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn3.png"/><br />
</center></p>
<p>It is a fairly standard observation (involving the Rayleigh quotient) that this form is maximized when:<br />
<center><br />
<a name="eq:lda"><br />
</a><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn4.png"/><br />
</center></p>
<div class="p"><!----></div>
<p> As we have said, the linear discriminant is very similar to what is returned by a regression or logistic regression.  In fact in our diagrams the regression lines are almost identical to the linear discriminant.  A large part of why regression can be usefully applied in classification comes from its close relationship to the linear discriminant.</p>
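<p> The quantities above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: we take v to be the plain (x, y) coordinates, take S to be the outer product of the centered coordinates, and use the textbook Fisher solution (the inverse of the summed spreads applied to the difference of the centers), which we assume agrees with Equation&nbsp;<a href="#eq:lda">1</a> up to scaling.  The data and the function names are hypothetical.</p>

```python
def mean2(points):
    n = float(len(points))
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points):
    """Average centered outer product of the points, as (sxx, sxy, syy)."""
    mx, my = mean2(points)
    n = float(len(points))
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    return sxx, sxy, syy

def fisher_direction(positives, negatives):
    """gamma = inverse(S_plus + S_minus) * (center_plus - center_minus)."""
    p, q = scatter2(positives), scatter2(negatives)
    a, b, d = p[0] + q[0], p[1] + q[1], p[2] + q[2]
    det = a * d - b * b
    mp, mn = mean2(positives), mean2(negatives)
    gx, gy = mp[0] - mn[0], mp[1] - mn[1]
    return ((d * gx - b * gy) / det, (a * gy - b * gx) / det)

def fisher_score(gamma, positives, negatives):
    """Squared distance between the class centers along gamma, divided
    by the summed spread of the two classes along gamma."""
    p, q = scatter2(positives), scatter2(negatives)
    mp, mn = mean2(positives), mean2(negatives)
    gap = gamma[0] * (mp[0] - mn[0]) + gamma[1] * (mp[1] - mn[1])
    spread = ((p[0] + q[0]) * gamma[0] ** 2
              + 2.0 * (p[1] + q[1]) * gamma[0] * gamma[1]
              + (p[2] + q[2]) * gamma[1] ** 2)
    return gap * gap / spread

# Hypothetical (discount, match) pairs; NOT the data behind the figures.
POSITIVES = [(0.5, 0.7), (0.7, 0.6), (0.6, 0.9), (0.8, 0.8), (0.9, 0.5)]
NEGATIVES = [(0.2, 0.4), (0.4, 0.3), (0.3, 0.6), (0.5, 0.2), (0.1, 0.5)]

gamma = fisher_direction(POSITIVES, NEGATIVES)
print(gamma, fisher_score(gamma, POSITIVES, NEGATIVES))
```

<p> Any other direction (for example straight along the discount axis) scores no higher than <tt>gamma</tt>, which is the Rayleigh quotient observation in miniature.</p>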
<div class="p"><!----></div>
<h3><a name="tth_sEc3.3"><br />
3.3</a>&nbsp;&nbsp;Linear Regression</h3>
<div class="p"><!----></div>
<p> Linear regression is designed to model continuous functions subject to independent normal errors in observation.  Linear regression is incredibly powerful at characterizing and eliminating correlations between the input variables of a model.  While function fitting is different than classification (our example problem), linear regression is so useful whenever there is any suspected correlation (which is almost always the case) that it is an appropriate tool.  In our example, in the positive examples (those that led to sales) there is clearly a historical dependence between the degree of estimated match and the amount of discount offered.  Likely this dependence is from past prospects being subject to a (rational) policy of &#8220;the worse the match the higher the offered discount&#8221; (instead of being arranged in a perfect grid-like experiment as in our first diagram: Figure&nbsp;<a href="#fig:IdealFitting">1</a>).  If this dependence is not dealt with we would under-estimate the value of discount because we would think that discounted customers are not signing up at a higher rate (when these prospects are in fact clearly motivated by discount, once you control for the fact that many of the deeply discounted prospects had a much worse degree of match than average).</p>
<div class="p"><!----></div>
<p> For analysis of categorical data linear regression is closely linked to ANOVA (analysis of variance).[<a href="#Agresti" name="CITEAgresti">Agr02</a>] Recall that variance was a major consideration with the linear discriminant analysis, so we should by now be on familiar ground.</p>
<div class="p"><!----></div>
<p>In our notation the standard least-squares regression solution is:<br />
<center><br />
<a name="eq:leastsquares"><br />
</a><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn5.png"/><br />
</center></p>
<p>where y<sub>+,i</sub> = 1 for all i and y<sub>&#8722;,i</sub> = &#8722;1 for all i.</p>
<div class="p"><!----></div>
<p> If we have the same number of positive and negative examples (i.e.  n<sub>+</sub> = n<sub>&#8722;</sub>) then Equation&nbsp;<a href="#eq:lda">1</a> and Equation&nbsp;<a href="#eq:leastsquares">2</a> are identical and we have &#946; = &#947;.  So in this special case the linear discriminant equals the least squares linear regression solution.  We can even ask how the solutions change if the relative proportions of positive and negative training data change.  The linear discriminant is carefully designed not to move, but the regression solution will tilt to an angle that is more compatible with the larger of the example classes and shift to cut less into that class.  The linear regression solution can be fixed (by re-weighting the data) to also be insensitive to the relative proportions of positive and negative examples but does not behave that way &#8220;fresh out of the box.&#8221;</p>
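<p> A small numerical check of the balanced case can be sketched as follows.  The assumptions here are ours: the data is made up, the regression is fit with an intercept that is then dropped, and agreement is measured as the cosine between the two directions (so &#8220;equal&#8221; is checked only up to a positive scale factor).</p>

```python
import math

def mean2(points):
    n = float(len(points))
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points):
    """Average centered outer product of the points, as (sxx, sxy, syy)."""
    mx, my = mean2(points)
    n = float(len(points))
    return (sum((p[0] - mx) ** 2 for p in points) / n,
            sum((p[0] - mx) * (p[1] - my) for p in points) / n,
            sum((p[1] - my) ** 2 for p in points) / n)

def solve2(sxx, sxy, syy, rx, ry):
    """Solve the symmetric system [[sxx, sxy], [sxy, syy]] * g = (rx, ry)."""
    det = sxx * syy - sxy * sxy
    return ((syy * rx - sxy * ry) / det, (sxx * ry - sxy * rx) / det)

def fisher_direction(positives, negatives):
    p, q = scatter2(positives), scatter2(negatives)
    mp, mn = mean2(positives), mean2(negatives)
    return solve2(p[0] + q[0], p[1] + q[1], p[2] + q[2],
                  mp[0] - mn[0], mp[1] - mn[1])

def regression_slope(positives, negatives):
    """Least-squares slope for targets +1 (positive) and -1 (negative)."""
    points = positives + negatives
    targets = [1.0] * len(positives) + [-1.0] * len(negatives)
    n = float(len(points))
    (mx, my), mt = mean2(points), sum(targets) / n
    cxt = sum((p[0] - mx) * (t - mt) for p, t in zip(points, targets)) / n
    cyt = sum((p[1] - my) * (t - mt) for p, t in zip(points, targets)) / n
    sxx, sxy, syy = scatter2(points)
    return solve2(sxx, sxy, syy, cxt, cyt)

def cosine(u, v):
    return (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))

# Hypothetical (discount, match) pairs; NOT the data behind the figures.
POSITIVES = [(0.5, 0.7), (0.7, 0.6), (0.6, 0.9), (0.8, 0.8), (0.9, 0.5)]
NEGATIVES = [(0.2, 0.4), (0.4, 0.3), (0.3, 0.6), (0.5, 0.2), (0.1, 0.5)]

# Balanced classes: the two directions should coincide.
print(cosine(fisher_direction(POSITIVES, NEGATIVES),
             regression_slope(POSITIVES, NEGATIVES)))
# Unbalanced classes (two negatives dropped): the directions may differ.
print(cosine(fisher_direction(POSITIVES, NEGATIVES[:3]),
             regression_slope(POSITIVES, NEGATIVES[:3])))
```

<p> Re-weighting each example by the inverse of its class size before the regression is one way to restore the balanced behavior; that variation is left out of the sketch.</p>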
<div class="p"><!----></div>
<h3><a name="tth_sEc3.4"><br />
3.4</a>&nbsp;&nbsp;Logistic Regression</h3>
<div class="p"><!----></div>
<p> While linear regression is designed to pick a function that minimizes the sum of squared errors, logistic regression is designed to pick a separator that maximizes something called <em>the plausibility of the data</em> (more commonly called the likelihood).  In our case, since the data is so well behaved, the logistic regression line is essentially the same as the linear regression line.  It is in fact an important property of logistic regression that there is always a re-weighting (or choice of re-emphasis) of the data that causes some linear regression to pick the same separator as the logistic regression.  Because linear and logistic regression are only identical in specific circumstances it is the job of the statistician to know which of the two is more appropriate for a given data set and given intended use of the resulting model.</p>
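<p> One standard way to see the re-weighting connection is iteratively re-weighted least squares (IRLS), where every step of the logistic fit is literally a weighted linear regression.  The sketch below is a bare-bones illustration, not the procedure used for our figures: the data is made up, outcomes are coded 1 and 0 (our choice), and there are no safeguards for separable data.</p>

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    rows = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]

def weighted_least_squares(features, targets, weights):
    """Minimize sum_i w_i * (t_i - beta . f_i)^2 via the normal equations."""
    k = len(features[0])
    gram = [[sum(w * f[a] * f[b] for f, w in zip(features, weights))
             for b in range(k)] for a in range(k)]
    moment = [sum(w * f[a] * t for f, t, w in zip(features, targets, weights))
              for a in range(k)]
    return solve(gram, moment)

def logistic_irls(features, labels, steps=25):
    """Fit logistic regression; each step is a weighted linear regression."""
    beta = [0.0] * len(features[0])
    for _ in range(steps):
        scores = [sum(b * v for b, v in zip(beta, f)) for f in features]
        probs = [1.0 / (1.0 + math.exp(-s)) for s in scores]
        weights = [p * (1.0 - p) for p in probs]
        # Working response: current score plus a rescaled residual.
        working = [s + (y - p) / w
                   for s, y, p, w in zip(scores, labels, probs, weights)]
        beta = weighted_least_squares(features, working, weights)
    return beta

# Hypothetical, deliberately overlapping (discount, match) pairs.
POSITIVES = [(0.5, 0.7), (0.7, 0.6), (0.6, 0.9), (0.8, 0.8), (0.9, 0.5), (0.3, 0.4)]
NEGATIVES = [(0.2, 0.4), (0.4, 0.3), (0.3, 0.6), (0.5, 0.2), (0.1, 0.5), (0.7, 0.7)]
FEATURES = [[1.0, x, y] for x, y in POSITIVES + NEGATIVES]
LABELS = [1.0] * len(POSITIVES) + [0.0] * len(NEGATIVES)

print(logistic_irls(FEATURES, LABELS))  # intercept, discount and match coefficients
```

<p> At convergence the final weights and working response define one particular weighted linear regression whose solution is the logistic regression separator.</p>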
<div class="p"><!----></div>
<h2><a name="tth_sEc4"><br />
4</a>&nbsp;&nbsp;Other Methods and Techniques</h2>
<div class="p"><!----></div>
<h3><a name="tth_sEc4.1"><br />
4.1</a>&nbsp;&nbsp;Kernelized Regression</h3>
<div class="p"><!----></div>
<p> One way to greatly expand the power of modeling methods is a trick called kernel methods.  Roughly, kernel methods are those methods that increase the power of machine learning by moving from a simple problem space (like ours in variables x and y) to a richer problem space that may be easier to work in.  A lot of ink is spilled about how efficient the kernel methods are (they work in time proportional to the size of the simple space, not the complex one) but this is not their essential feature.  The essential feature is the expanded explanatory power, and this is so important that even the trivial kernel methods (such as directly adjoining additional combinations of variables) pick up most of the power of the method.  Kernel methods are also overly associated with Support Vector Machines, but are just as useful when added to Naive Bayes, linear regression or logistic regression.</p>
<div class="p"><!----></div>
<p> For instance: Figure&nbsp;<a href="#fig:KernelizedRegression">10</a> shows a bow-tie like acceptance region found by using linear regression over the variables x, y, x<sup>2</sup>, y<sup>2</sup> and x y (instead of just x and y).  Note how this result is similar to the full Bayes model (but comes from a different feature set and fitting technique).</p>
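<p> The &#8220;directly adjoin extra variables&#8221; version of the trick can be sketched as below.  The data is a made-up bullseye (positives in the middle, negatives around them), chosen by us because no straight cut can separate it; it is not the data behind Figure&nbsp;<a href="#fig:KernelizedRegression">10</a>, and the helper names are hypothetical.</p>

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    rows = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]

def least_squares(features, targets):
    """Ordinary least squares via the normal equations."""
    k = len(features[0])
    gram = [[sum(f[a] * f[b] for f in features) for b in range(k)]
            for a in range(k)]
    moment = [sum(f[a] * t for f, t in zip(features, targets)) for a in range(k)]
    return solve(gram, moment)

def linear_features(x, y):
    return [1.0, x, y]

def quadratic_features(x, y):
    # The original variables plus x^2, y^2 and x*y, as in the text.
    return [1.0, x, y, x * x, y * y, x * y]

def accuracy(feature_map, points, targets):
    """Fit targets (+1 / -1) by regression and score sign agreement."""
    feats = [feature_map(x, y) for x, y in points]
    beta = least_squares(feats, targets)
    hits = sum(1 for f, t in zip(feats, targets)
               if t * sum(b * v for b, v in zip(beta, f)) > 0)
    return hits / float(len(points))

# Hypothetical bullseye data: positives near the origin, negatives around.
INNER = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2), (-0.2, -0.1)]
OUTER = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0),
         (0.7, 0.7), (-0.7, 0.7), (0.7, -0.7), (-0.7, -0.7)]
POINTS = INNER + OUTER
TARGETS = [1.0] * len(INNER) + [-1.0] * len(OUTER)

print(accuracy(linear_features, POINTS, TARGETS))
print(accuracy(quadratic_features, POINTS, TARGETS))
```

<p> The fitting technique is unchanged between the two calls; only the feature set differs, which is the point of the section.</p>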
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg10"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/kRegression.png" alt="kRegression.png" /></p>
<p></center><center>Figure 10: Kernelized Regression</center><br />
<a name="fig:KernelizedRegression"><br />
</a></p>
<div class="p"><!----></div>
<h3><a name="tth_sEc4.2"><br />
4.2</a>&nbsp;&nbsp;Naive Bayes Model</h3>
<div class="p"><!----></div>
<p> We briefly return to the Bayes model to discuss a more common alternative called &#8220;Naive Bayes.&#8221;  A Naive Bayes model is like a full Bayes model except that an additional modeling simplification is introduced by assuming that P(x,y&#124;sale) = P(x&#124;sale)P(y&#124;sale) and P(x,y&#124;non-sale) = P(x&#124;non-sale)P(y&#124;non-sale).  That is, we are assuming that the distributions of the x and y measurements are essentially independent (once we know which outcome happened).  This assumption is the opposite of what we do with regression in that we ignore dependencies in the data (instead of modeling and eliminating the dependencies).  However, Naive Bayes methods are quite powerful and very appropriate in sparse-data situations (such as text classification).  The &#8220;naive&#8221; assumption that the input variables are independent greatly reduces the amount of data that needs to be tracked (it is much less work to track values of variables instead of simultaneous values of pairs of variables).  The curved separator from this Naive Bayes model is illustrated in Figure&nbsp;<a href="#fig:NaiveBayesModel">11</a>.</p>
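<p> A minimal sketch of the factored model, assuming (as in our full Bayes model) that each per-variable distribution is Gaussian: every class is summarized by one mean and one variance per variable, and no cross term between x and y is ever tracked.  The data and names below are made up for illustration.</p>

```python
import math

def fit_1d(values):
    """Mean and variance of a single variable."""
    n = float(len(values))
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def density_1d(value, mean, var):
    return math.exp(-0.5 * (value - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_naive_bayes(points):
    """One 1-D Gaussian per variable; the x-y dependence is ignored."""
    return fit_1d([p[0] for p in points]), fit_1d([p[1] for p in points])

def class_density(x, y, model):
    """The naive assumption: P(x,y | class) = P(x | class) * P(y | class)."""
    (mx, vx), (my, vy) = model
    return density_1d(x, mx, vx) * density_1d(y, my, vy)

def posterior_sale(x, y, sale_model, non_sale_model, prior_sale=0.5):
    p_sale = class_density(x, y, sale_model) * prior_sale
    p_non = class_density(x, y, non_sale_model) * (1.0 - prior_sale)
    return p_sale / (p_sale + p_non)

# Hypothetical (discount, match) pairs; NOT the data behind the figures.
POSITIVES = [(0.5, 0.7), (0.7, 0.6), (0.6, 0.9), (0.8, 0.8), (0.9, 0.5)]
NEGATIVES = [(0.2, 0.4), (0.4, 0.3), (0.3, 0.6), (0.5, 0.2), (0.1, 0.5)]

SALE_MODEL = fit_naive_bayes(POSITIVES)
NON_SALE_MODEL = fit_naive_bayes(NEGATIVES)
print(posterior_sale(0.7, 0.7, SALE_MODEL, NON_SALE_MODEL))
print(posterior_sale(0.3, 0.4, SALE_MODEL, NON_SALE_MODEL))
```

<p> Each class here needs four numbers (two means, two variances) where the full Bayes model needs five (it also tracks the x-y covariance); the saving grows quickly as variables are added.</p>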
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg11"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/naiveBayesModel1.png" alt="naiveBayesModel1.png" /></p>
<p></center><center>Figure 11: Naive Bayes Model</center><br />
<a name="fig:NaiveBayesModel"><br />
</a></p>
<div class="p"><!----></div>
<p> The Naive Bayes version of the advice or policy chart is always going to be an axis-aligned parabola as in Figure&nbsp;<a href="#fig:NaiveBayesDecision">12</a>.  Notice how both the linear discriminant and the Naive Bayes model make mistakes (place some colors on the wrong side of the curve), but they are simple, reliable models that have the desirable property of having connected prediction regions.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg12"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/naiveBayesModel2.png" alt="naiveBayesModel2.png" /></p>
<p></center><center>Figure 12: Naive Bayes Decision</center><br />
<a name="fig:NaiveBayesDecision"><br />
</a></p>
<div class="p"><!----></div>
<h3><a name="tth_sEc4.3"><br />
4.3</a>&nbsp;&nbsp;More Exotic Methods</h3>
<div class="p"><!----></div>
<p> Many of the hot buzzword machine learning and data mining methods we listed earlier are essentially different techniques of fitting a linear separator over data.  These methods seem very different but they all form a family once you realize many of the details of the methods are determined by:</p>
<div class="p"><!----></div>
<ul>
<li> Choice of Loss Function
<div class="p"><!----></div>
<p> This is what notion of &#8220;goodness of fit&#8221; is being used.  It can be normalized mean-variance (linear discriminants), un-normalized variance (linear regression), plausibility (logistic regression), L1 distance (support vector machines, quantile regression), entropy (maximum entropy), probability mass and so on.</p>
<div class="p"><!----></div>
</li>
<li> Choice of Optimization Technique
<div class="p"><!----></div>
<p> For a given loss function we can optimize in many ways (though most authors make the mistake of binding their current favorite optimization method deep into their specification of technique): EM, steepest descent, conjugate gradient, quasi-Newton, linear programming and quadratic programming to name a few.</p>
<div class="p"><!----></div>
</li>
<li> Choice of Regularization Method
<div class="p"><!----></div>
<p> Regularization is the idea of forcing the model to not pick extreme values of parameters to over-fit irrelevant artifacts in training data.  Methods include MDL, controlling energy/entropy, Laplace smoothing, shrinkage, bagging and early termination of optimization.  Non-explicit treatment of regularization is one reason many methods completely specify their optimization procedure (to get some accidental regularization).</p>
<div class="p"><!----></div>
</li>
<li> Choice of Features/Kernelization
<div class="p"><!----></div>
<p> The richness of the feature set the method is applied to is the single largest determinant of model quality.</p>
<div class="p"><!----></div>
</li>
<li> Pre-transformation Tricks
<div class="p"><!----></div>
<p> Some statistical methods are improved by pre-transforming the outcome data to look more normal or be more homoscedastic.<a href="#tthFtNtAAG" name="tthFrefAAG"><sup>6</sup></a></p>
<div class="p"><!----></div>
</li>
</ul>
<div class="p"><!----></div>
<p> If you think along a few axes like these (instead of evaluating methods by their name and lineage) you tend to see different data mining methods more as embodying different trade-offs than as being unique incompatible disciplines.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<h2><a name="tth_sEc5"><br />
5</a>&nbsp;&nbsp;Conclusion</h2>
<div class="p"><!----></div>
<p> Our goal for this writeup was to fully demonstrate a data mining method and then survey some important data mining and machine learning techniques.  Many of the important considerations are &#8220;too obvious&#8221; to be discussed by statisticians and &#8220;too statistical&#8221; to be comfortably expressed in terms popular with data miners.  The theory and considerations from statistics when combined with the experience and optimism of data-mining/machine-learning truly make possible achieving the important goal of &#8220;learning from data.&#8221;</p>
<div class="p"><!----></div>
<p>This expository writeup is also meant to serve as an example of the<br />
types of research, analysis, software and training supplied by<br />
Win-Vector LLC <a href="http://www.win-vector.com"><tt>http://www.win-vector.com</tt></a>.  Win-Vector LLC<br />
prides itself on depth of research and specializes in identifying,<br />
documenting and implementing the &#8220;simplest technique that can<br />
possibly work&#8221; (which is often the most understandable, maintainable,<br />
robust and reliable).  Win-Vector LLC specializes in research but<br />
has significant experience in delivering full solutions (including<br />
software solutions and integration with existing databases).</p>
<div class="p"><!----></div>
<p><font size="-1"></p>
<h2>References</h2>
<dl compact="compact">
<dt><a href="#CITEAgresti" name="Agresti">[Agr02]</a></dt>
<dd>
Alan Agresti, <em>Categorical data analysis (wiley series in probability and<br />
  statistics)</em>, Wiley-Interscience, July 2002.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBreiman:1997p1133" name="Breiman:1997p1133">[BF97]</a></dt>
<dd>
Leo Breiman and Jerome&nbsp;H Friedman, <em>Predicting multivariate responses in<br />
  multiple linear regression</em>, Journal of the Royal Statistical Society, Series<br />
  B (Methodological) <b>59</b> (1997), no.&nbsp;1, 3-54.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBlei:2003p1063" name="Blei:2003p1063">[BNJ03]</a></dt>
<dd>
David&nbsp;M Blei, Andrew&nbsp;Y Ng, and Michael&nbsp;I Jordan, <em>Latent dirichlet<br />
  allocation</em>, Journal of Machine Learning Research <b>3</b> (2003),<br />
  993-1022.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBreiman:2000p1134" name="Breiman:2000p1134">[Bre00]</a></dt>
<dd>
Leo Breiman, <em>Special invited paper. additive logistic regression: A<br />
  statistical view of boosting: Discussion</em>, Ann. Statist. <b>28</b> (2000),<br />
  no.&nbsp;2, 374-377.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBeigel:2008p1027" name="Beigel:2008p1027">[BRS08]</a></dt>
<dd>
Richard Beigel, Nick Reingold, and Daniel&nbsp;A Spielman, <em>The perceptron<br />
  strikes back</em>, 6.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEkernel1" name="kernel1">[CST00]</a></dt>
<dd>
Nello Cristianini and John Shawe-Taylor, <em>An introduction to support<br />
  vector machines and other kernel-based learning methods</em>, 1 ed., Cambridge<br />
  University Press, March 2000.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEDasgupta:2005p2013" name="Dasgupta:2005p2013">[DKM05]</a></dt>
<dd>
Sanjoy Dasgupta, Adam&nbsp;Tauman Kalai, and Claire Monteleoni, <em>Analysis of<br />
  perceptron-based active learning</em>, CSAIL Tech. Report (2005), 16.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEDudik:2006p954" name="Dudik:2006p954">[DS06]</a></dt>
<dd>
Miroslav Dudik and Robert&nbsp;E Schapire, <em>Maximum entropy distribution<br />
  estimation with generalized regularization</em>, COLT (2006), 15.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEFisher:1936p2576" name="Fisher:1936p2576">[Fis36]</a></dt>
<dd>
Ronald&nbsp;A Fisher, <em>The use of multiple measurements in taxonomic problems</em>,<br />
  Annals of Eugenics <b>7</b> (1936), 179-188.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEFreund:2003p1009" name="Freund:2003p1009">[FISS03]</a></dt>
<dd>
Yoav Freund, Raj Iyer, Robert&nbsp;E Schapire, and Yoram Singer, <em>An efficient<br />
  boosting algorithm for combining preferences</em>, Journal of Machine Learning<br />
  Research <b>4</b> (2003), 933-969.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEstatistics" name="statistics">[FPP07]</a></dt>
<dd>
David Freedman, Robert Pisani, and Roger Purves, <em>Statistics 4th edition</em>,<br />
  W. W. Norton and Company, 2007.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEGrunwald:2005p108" name="Grunwald:2005p108">[Gru05]</a></dt>
<dd>
Peter&nbsp;D Grunwald, <em>Maximum entropy and the glasses you are looking<br />
  through</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEHalevy:2009p2327" name="Halevy:2009p2327">[HNP09]</a></dt>
<dd>
Alon Halevy, Peter Norvig, and Fernando Pereira, <em>The unreasonable<br />
  effectiveness of data</em>, IEEE Intelligent Systems (2009).</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEKlein:2003p261" name="Klein:2003p261">[KM03]</a></dt>
<dd>
Dan Klein and Christopher&nbsp;D Manning, <em>Maxent models, conditional<br />
  estimation, and optimization</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEquantile" name="quantile">[Koe05]</a></dt>
<dd>
Roger Koenker, <em>Quantile regression</em>, Cambridge University Press, May<br />
  2005.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITELewis:1998p105" name="Lewis:1998p105">[Lew98]</a></dt>
<dd>
David&nbsp;D Lewis, <em>Naive (bayes) at forty: The independence assumption in<br />
  information retrieval</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITENYTStat" name="NYTStat">[Loh09]</a></dt>
<dd>
Steve Lohr, <em>For today’s graduate, just one word: Statistics</em>,<br />
  <a href="http://www.nytimes.com/2009/08/06/technology/06stats.html"><tt>http://www.nytimes.com/2009/08/06/technology/06stats.html</tt></a>, August 2009.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITER:Sarkar:2008" name="R:Sarkar:2008">[Sar08]</a></dt>
<dd>
Deepayan Sarkar, <em>Lattice: Multivariate data visualization with R</em>,<br />
  Springer, New York, 2008, ISBN 978-0-387-75968-5.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEStern:1989p1480" name="Stern:1989p1480">[SC89]</a></dt>
<dd>
Hal Stern and Thomas&nbsp;M Cover, <em>Maximum entropy and the lottery</em>, Journal<br />
  of the American Statistical Association <b>84</b> (1989), no.&nbsp;408,<br />
  980-985.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITESchapire:2001p1019" name="Schapire:2001p1019">[Sch01]</a></dt>
<dd>
Robert&nbsp;E Schapire, <em>The boosting approach to machine learning an<br />
  overview</em>, 23.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEkernel2" name="kernel2">[STC04]</a></dt>
<dd>
John Shawe-Taylor and Nello Cristianini, <em>Kernel methods for pattern<br />
  analysis</em>, Cambridge University Press, June 2004.</dd>
</dl>
<p></font></p>
<div class="p"><!----></div>
<p><center><b>APPENDIX</b><br />
</center></p>
<div class="p"><!----></div>
<h2><a name="tth_sEcA"><br />
A</a>&nbsp;&nbsp;Graphs</h2>
<div class="p"><!----></div>
<p>The majority of the graphs in this writeup were produced using &#8220;R&#8221;<br />
<a href="http://www.r-project.org/"><tt>http://www.r-project.org/</tt></a> and Deepayan Sarkar&#8217;s Lattice<br />
package[<a href="#R:Sarkar:2008" name="CITER:Sarkar:2008">Sar08</a>].</p>
<div class="p"><!----></div>
<hr />
<h3>Footnotes:</h3>
<div class="p"><!----></div>
<p><a name="tthFtNtAAB"></a><a href="#tthFrefAAB"><sup>1</sup></a><br />
<a href="mailto:jmount@win-vector.com"><tt>mailto:jmount@win-vector.com</tt></a><br />
<a href="http://www.win-vector.com/"><tt>http://www.win-vector.com/</tt></a><br />
<a href="http://www.win-vector.com/blog/"><tt>http://www.win-vector.com/blog/</tt></a></p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAC"></a><a href="#tthFrefAAC"><sup>2</sup></a>Read P(A &#124; B) as: &#8220;the probability that A will<br />
  happen given that we know B is true.&#8221;</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAD"></a><a href="#tthFrefAAD"><sup>3</sup></a>Technically we are working with densities, not<br />
  probabilities, but we will use probability notation for its<br />
  intuition.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAE"></a><a href="#tthFrefAAE"><sup>4</sup></a>P(sale &#124; x,y) is the probability of<br />
making a sale as a function of what we know about the prospective<br />
customer and our offer.  Whereas P(x,y&#124;sale) was just how likely it is<br />
to see a prospect with the given x and y values, conditioned on knowing we made<br />
a sale to this prospect.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAF"></a><a href="#tthFrefAAF"><sup>5</sup></a> P(sale) and<br />
  P(non-sale) are just the &#8220;prior odds&#8221; of sales, that is,<br />
  our estimate of our chances of success before we look at any<br />
  facts about a particular customer.  We can use our historical<br />
  overall success and failure rates as estimates of these quantities.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAG"></a><a href="#tthFrefAAG"><sup>6</sup></a>A situation is homoscedastic if the errors are independent of where we are in the parameter space (our x,y or match factor and discount factor).  This property is very important for meaningful fitting/modeling and interpreting significance of fits.</p>
<hr /><small>File translated from<br />
T<sub><font size="-1">E</font></sub>X<br />
by <a href="http://hutchinson.belmont.ma.us/tth/"><br />
T<sub><font size="-1">T</font></sub>H</a>,<br />
version 3.85.<br />On 29 Aug 2009, 11:43.</small></p>


]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
