<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Machine Learning</title>
	<atom:link href="http://www.win-vector.com/blog/tag/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Kernel Methods and Support Vector Machines de-Mystified</title>
		<link>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=kernel-methods-and-support-vector-machines-de-mystified</link>
		<comments>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/#comments</comments>
		<pubDate>Sat, 08 Oct 2011 00:17:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Kernel Methods]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Naive Bayes]]></category>
		<category><![CDATA[Support Vector Machines]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1804</guid>
		<description><![CDATA[We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Goals [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/' rel='bookmark' title='Do your tools support production or complexity?'>Do your tools support production or complexity?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We give a simple explanation of the interrelated machine learning techniques called <a target="_blank" href="http://en.wikipedia.org/wiki/Kernel_methods">kernel methods</a> and <a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>.  We hope to characterize and de-mystify some of the properties of these methods.  To do this we work some examples and draw a few analogies.  The familiar, no matter how wonderful, is not perceived as mystical.<span id="more-1804"></span><br />
<h2>Goals of this writeup</h2>
<ol>
<li>De-mystify  kernel methods and support vector machines<br />
<blockquote><p>
Kernel methods and support vector machines have taken mythological proportions in the machine learning imagination. Partly this is because a number of good ideas are overly associated with them: support/non-support training datums, weighting training data, discounting data, regularization, margin and the bounding of generalization error.  My issue is that these are all important enough ideas to stand on their own and are often seen in simpler settings.  The observations that inform my view are as follows:</p>
<ul>
<li>Kernel methods and support vector machines are in fact two good ideas.  Each is important even without the other: kernels are useful all over and support vector machines would be useful even if we restricted ourselves to the trivial identity kernel.</li>
<li>Small scale &#8220;kernel tricks&#8221; are not that different from the classic technique of adding &#8220;interaction variables.&#8221;  Kernels let you escape from the limits of &#8220;linear hypotheses&#8221; (really by moving to a bigger space where things are again linear but look curved from the point of view of your smaller original space).  We demonstrate the linear methods and &#8220;primal kernel tricks&#8221; in <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a>.</li>
<li>Support/non-support is the central issue of nearest neighbor classifiers.</li>
<li>Weighting training data is most famously shared by logistic regression and boosting.</li>
<li>Re-weighting of data by <i>smoothing kernels</i> (a different but related use of the word &#8220;kernel&#8221;) is central to non-parametric statistics (kernel smoothers and splines).</li>
<li>Regularization is of supreme importance in modeling in general.</li>
<li>Most practitioners get tired of &#8220;kernel shopping&#8221; and fall back to the identity, cosine or radial/Gaussian kernels.</li>
</ul>
</blockquote>
</li>
<li>Show concrete examples of what are and what are not kernels.<br />
Few sources give enough theorems to think about kernels abstractly and fewer still work concrete examples.
</li>
<li>
Use as little math as possible (which, unfortunately, turns out to be quite a bit).  We will discuss encodings and stopping conditions (important to understand what is going on) but avoid explaining the optimizers (the most beautiful part of support vector machines, but also the part that is available pre-packaged in libraries).
</li>
<li>
Call out common magical thinking and unreasonable expectations associated with kernel methods and support vector machines.  This should help the reader be in a better position to &#8220;defend their doubts&#8221; regarding machine learning promises.
</li>
<li>
Try to place all techniques in a wider context (if it is usable in only one place it is a trick, if it is usable in multiple places it is a technique).
</li>
<li>Discuss margin and its impact on generalization error.<br />
<blockquote><p>
Generalization error is an effect of &#8220;over fitting&#8221; where a model has learned things that are true about the training examples that do not hold for the overall truth or concept we are trying to learn (i.e. don&#8217;t generalize).   Generalization error is the excess error rate we observe when scoring new examples versus the error rate we saw in learning the training data.  Margin is in fact a <em>posterior</em> observation.  That is: margin is observed after the training data is seen, not known before data is seen (unlike, for example, <a target="_blank" href="http://en.wikipedia.org/wiki/VC_dimension">VC dimension</a>, which is known in advance).   Margin is useful as it bounds generalization error but it is not the <em>prior</em> bound it is often portrayed as.  So we assert margin estimates are not much more special than simple cross-validation estimates, which can also be performed once we have data available.
</p></blockquote>
</li>
</ol>
<p>The goals are not to try to indict or try to cut down kernel methods or support vector machines, but just to dump some of the associated baggage so they can be used fluidly and without anxiety.  </p>
<h2>Example Problem</h2>
<p>Consider the following simple (caricature) machine learning problem:  we are given a number of points labeled as circles and triangles and we want to, given a new point, predict the label.  In our example the only input will be two numbers: x and y.  This example is simple enough that we can depict the entire situation in Figure 1 below:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/TruthAndData.png" alt="TruthAndData.png" border="0" width="480" height="480" /></p>
<p>Figure 1: Truth and Training Data</p>
<p></center></p>
<p>We are assuming (for simplicity of exposition) that we have the incredible luck that the label is indeed a deterministic function of x and y.  We portray this &#8220;ground truth&#8221; as the blue parabola: every point found in this region will be considered to be labeled as &#8220;circle&#8221; and every other point (i.e. the red region) will be labeled as &#8220;triangle.&#8221;  We have overlaid 20 example points (with appropriate shape labels).  These points will be our training data.  We will try to learn from the observed training data a good approximation of the unseen blue and red regions.  This process is called learning or training and the ability to correctly predict the label of new points is called generalization.  To make sure our concept is learnable we are going to further guarantee a moat (or margin): the distribution we are picking training and test examples from will never pick a point from the shaded region between the blue and black.  That is, we will never be asked about an example where x*x is very near y.</p>
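<p>As a rough sketch of this setup, the following generates labeled points with a moat around the parabola.  The sampling box, the moat width of 0.1, the seed and the function names are all assumptions made for illustration; they are not the values used to draw Figure 1.</p>

```python
import random

def true_label(x, y):
    """Ground truth from the text: +1 ("circle") where y > x*x, else -1 ("triangle")."""
    return 1 if y > x * x else -1

def sample_training_data(m=20, moat=0.1, seed=0):
    """Draw m labeled points, never keeping one where x*x is within `moat` of y.

    The box [-1, 1] x [-1, 1], the moat width and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    points = []
    while len(points) != m:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if abs(y - x * x) >= moat:  # keep only points outside the moat
            points.append(((x, y), true_label(x, y)))
    return points

training_data = sample_training_data()
```

<p>Each entry of <code>training_data</code> is a pair of a point (x, y) and its label, which is the shape the later sketches assume.</p>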
<p>To promote visual thinking we are going to avoid formulas until the section titled &#8220;precise functions used&#8221; (where we will list the formulas used for each graph).</p>
<h2>Nearest Neighbor Solution</h2>
<p>An interesting model in these days of big data and fast machines is the nearest neighbor model.  Such a model colors each point in space blue or red as it <em>thinks</em> the point should be classified.  In this case each point is colored the color given by the closest known training point.  This induces the type of model seen in Figure 2 whose boundary is piecewise line segments mid-way between training points.  The data points that determine segments of the classification boundary in this way are thought to support the boundary and are called &#8220;support examples&#8221; or &#8220;support vectors.&#8221;  In fact any training point that is not &#8220;supporting&#8221; part of the boundary is &#8220;irrelevant&#8221; (could be removed and we would get the exact same model).  This split between important and non-important training points is considered one of the important observations from  the theory of support vector machines, but we see it even here in the nearest neighbor models.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1NN.png" alt="1NN.png" border="0" width="480" height="480" /></p>
<p>Figure 2: 1 nearest neighbor model</p>
<p></center></p>
<p>Notice that the nearest neighbor model in Figure 2 is &#8220;not half bad.&#8221;  The blue region is much too wide, but near the training data the blue mass is roughly in the correct position.  If we had infinite data and infinite computational resources this sort of model would be hard to beat (as any point we are likely to be tested on would have a lot of very near examples to work from).  We can try to clean up the shape of the model a bit by using a vote of the 2 nearest points to color each point in the plane (see Figure 3) or even the majority of the 3 nearest points (see Figure 4).  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/2NN.png" alt="2NN.png" border="0" width="480" height="480" /></p>
<p>Figure 3: 2 nearest neighbor model</p>
<p></center></p>
<p>The purple region in Figure 3 is a &#8220;region of uncertainty&#8221; where the net vote over the 2 nearest points is zero (they gave inconsistent advice).  We can use this as a hint that predictions taken from this region are less reliable.  As you may notice, we are not getting real improvements.  We should always spend more time getting more features (useful coordinates in addition to x and y) and more data; and spend less time tweaking models.  But if your data and features are fixed all you can work on is your modeling technique (alternately: your modeling technique is all you can prepare before receiving features and data).  So we will continue to discuss features and technique.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/3NN.png" alt="3NN.png" border="0" width="480" height="480" /></p>
<p>Figure 4: 3 nearest neighbor model</p>
<p></center></p>
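<p>The voting rule behind Figures 2 through 4 can be sketched in a few lines.  This is a minimal illustration, not the code used to draw the figures; the function name and the three toy training points are hypothetical.</p>

```python
import math

def knn_vote(train, z, k=1):
    """Net vote of the k training points nearest to z.

    train is a list of ((x, y), label) pairs with label +1 or -1.
    Returns +1, -1, or 0, where 0 marks a tied vote (the "region of uncertainty").
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], z))[:k]
    net = sum(label for _, label in nearest)
    if net > 0:
        return 1
    if net == 0:
        return 0
    return -1

# Hypothetical toy training set: one circle (+1) and two triangles (-1).
toy_train = [((0.0, 0.5), 1), ((0.0, -0.5), -1), ((1.0, 0.0), -1)]
query = (0.0, 0.4)
votes = [knn_vote(toy_train, query, k) for k in (1, 2, 3)]
```

<p>For this toy data the three votes disagree: the single nearest point says +1, the two nearest points tie, and the three nearest points say -1.  Changing k changes the model, which is why tweaking k alone is not a reliable route to improvement.</p>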
<p>Nearest neighbor classifiers are nearly optimal in the sense that with an infinite amount of data the 1-nearest neighbor classifier has an error rate no worse than twice the Bayes error rate (the Bayes error rate being the ideal error observed on identical repetitions, or the theoretical best error rate) and for large k (growing more slowly than the amount of data) the k-nearest neighbor method approaches the Bayes error rate itself (see for example <a target="_blank" href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">k-nearest neighbor algorithm</a> ).</p>
<p>However the effectiveness of the nearest neighbor classifiers is coming from the very low dimension (2) of our input variables x and y.  If we were attempting to predict from n variables we might need an amount of data exponential in n to get a useful nearest neighbor classifier.  This is an <a target="_blank" href="http://en.wikipedia.org/wiki/Efficiency_(statistics)">inefficient</a> use of the data; what we want is to use all of the data we have effectively.  To do this we further assume there is some relational reason that the position of the point in the plane determines the shape-label (i.e. that we are learning, generalizing, interpolating and extrapolating, not just memorizing).  We set our ambitions above memorizing and move towards functional modeling.  We start thinking in terms of change: how the label density of examples changes as we move is a good clue as to how it will continue to change.  We can do this either in a parametric or primal formulation (where we attempt to directly infer parameters of an assumed functional form as in regression or logistic regression) or in a non-parametric or dual form (where we attempt to learn relations between points and the training data instead of parameters).  We have discussed this before ( <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a> ) and expanded on primal methods ( <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">Learn Logistic Regression (and beyond)</a>, <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/">The Simpler Derivation of Logistic Regression</a> and <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/">The equivalence of logistic regression and maximum entropy models</a>).</p>
<h2>Functional Solution</h2>
<p>Our next idea is: can we use our training data to tease out a functional form for the relation between x,y and shape label?  The first function we will try is chosen to be similar to the nearest neighbor model we just demonstrated.  For this form we say each training point is the center of a <a target="_blank" href="http://en.wikipedia.org/wiki/Gaussian_function">Gaussian</a> hump (say up for blue/circle and down for red/triangle) which we will call a &#8220;discount function&#8221; (or informally a hump).  A &#8220;Gaussian&#8221; is just a function that falls off exponentially in the square of the Euclidean distance from a central point (we can see the simple form of such functions in the &#8220;precise functions used&#8221; section).  The contour lines (or levels as seen when looking down from above) of one such hump or discount function are shown in figure 5 (all graphs from here on in will be looking down at a height represented by color and contour lines, much like topographic maps).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3ExampleConcept.png" alt="GaussianKernel3ExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 5: Narrow Gaussian example concept</p>
<p></center></p>
<p>This particular discount function has its maximum (or peak) centered on a data point and then rapidly falls as we move away from the training point.  The only property we will use in this section is that we can evaluate the discount easily (so we don&#8217;t at this time need or make use of any restrictive properties like integrability, positive semi-definiteness or even anything like bounded level sets).<br />
We could make our functional model just the sum of all of these up and down humps over all of the training data.  Each point gets a hump of the same height and same radius, pointing up for blue/circle and down for red/triangle.  The net coloring of this sum function is shown in figure 6.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3KernelSum1.png" alt="GaussianKernel3KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 6: Narrow Gaussian sum model</p>
<p></center></p>
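<p>The sum-of-humps model is short enough to sketch directly.  The steepness value of 3.0, the function names and the toy training points below are assumptions for illustration; the exact constants behind the figures live in the formula images later in the article.</p>

```python
import math

def gaussian_hump(center, z, steepness):
    """exp(-steepness * squared Euclidean distance); equals 1 when z is the center."""
    d2 = sum((c - t) ** 2 for c, t in zip(center, z))
    return math.exp(-steepness * d2)

def hump_sum(train, z, steepness=3.0):
    """Sum of equal-height humps: pointing up for label +1, down for label -1."""
    return sum(label * gaussian_hump(u, z, steepness) for u, label in train)

# Hypothetical toy training set: one circle (+1) and two triangles (-1).
toy_train = [((0.0, 0.5), 1), ((0.0, -0.5), -1), ((1.0, 0.0), -1)]
near_circle = hump_sum(toy_train, (0.0, 0.4))    # positive: dominated by the +1 hump
near_triangle = hump_sum(toy_train, (0.9, 0.0))  # negative: dominated by a -1 hump
```

<p>The sign of <code>hump_sum</code> is the predicted class.  Raising <code>steepness</code> narrows every hump, so the closest training point dominates the sum even more strongly.</p>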
<p>The idea is that this would imitate the nearest neighbor model we have already seen.  This is because even though we are summing over all data the closest training examples should usually dominate (since the Gaussian humps we picked fall to zero as we get further from their centers).   The correspondence to nearest neighbor increases if we tighten the radius of our functions (searching for the right radius is a very important statistical problem called &#8220;bandwidth estimation&#8221;).  Figure 7, for example, shows the sum over much narrower Gaussians.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel20KernelSum1.png" alt="GaussianKernel20KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 7: Very narrow Gaussian sum model</p>
<p></center></p>
<p>Figure 7 looks even more like the nearest neighbor solution.  In fact you could think of the discount-weighted sum over all of the data as the natural model (as it is easier to reason about) and the nearest neighbors as a more efficient approximation (summing only over near points).  We also have seen that we can alter the smoothness of our model (which has consequences on model generalization) by changing the steepness of our discount functions.  We can also build intermediate models (such as building a nearest neighbor but using discount weighted sums instead of uniform voting in the neighborhoods).   We haven&#8217;t yet greatly improved on nearest neighbor, but we have identified a couple of obvious avenues for further improvement: mess with the bandwidths (the obvious idea to try and deal with near/far scale) or re-weight the data (as one of the lessons of the nearest neighbor model is that points that are not near a boundary are less important).   </p>
<h2>Bandwidth Solution</h2>
<p>As a digression let&#8217;s play with bandwidth a bit first.  Suppose simultaneously for each training datum we picked an ideal bandwidth or steepness of the Gaussian.  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/BandwidthModel.png" alt="BandwidthModel.png" border="0" width="480" height="480" /></p>
<p>Figure 8: Simultaneous bandwidth model</p>
<p></center> </p>
<p>This simultaneous bandwidth model depicted in figure 8 was determined by picking a simultaneous assignment of Gaussian widths (each training datum gets its own width) that maximizes the log-sum of the category signed model function on the training data (much like the logistic regression maximum likelihood decoding).  This differs from typical bandwidth selection problems where all points are given a single common &#8220;best bandwidth.&#8221;  We instead picked a different bandwidth for each point (bandwidths picked to maximize how much of the sum supported the correct category on each training example).  The blue region is reasonable, but does not match the shape of the true concept.   Partly this is because there has not been enough training data to have learned a lot about the true shape (for example there is no empirical evidence for circle/blue in the top-left quadrant).  The other reason is a bias inherent to this type of model.  If one of the bandwidth functions is wider than all of the others (which is almost certain to occur) then it falls more slowly than all of the others and for points very far away from the center of the training data this one function is most of what remains.  Or equivalently: due to the nature of this modeling technique one of the concepts learned is going to be the union of bounded islands and the other concept will grab all points sufficiently far away from the center of the training data.  This is a strong (and undesirable) bias, but it is a bias also shared by support vector machines with Gaussian kernels (though support vector machines get it from their so-called &#8220;dc-term&#8221; b not being zero; this will be discussed later). Our bandwidth correction was successful, but frankly the optimization problem we solved to estimate the optimal bandwidths was nasty and would not be something we would advise for a large amount of data.</p>
<h2>Data weighting solution (support vector machine)</h2>
<p>So instead of messing with the bandwidths let us consider re-weighting the Gaussians in our sum.  We will leave the shape of each Gaussian the same but allow each individual training datum to have a unique weight or importance.  Perhaps by picking the best weights we can get a better model.  By &#8220;best&#8221; we will mean &#8220;best margin&#8221; (to be defined later) because with &#8220;best margin&#8221; as our objective the optimization problem of solving for the best data weights has a particularly beautiful form that can be reliably solved at great scale.  This is called a &#8220;support vector model&#8221; (or support vector machine) and we will describe these ideas in detail later.  But first let&#8217;s see the result.  Figure 9 shows the original narrow Gaussians re-summed according to the weights picked by the support vector method (instead of all being 1 as in figure 6).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3SVM1.png" alt="GaussianKernel3SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 9: Narrow Gaussian support vector model</p>
<p></center></p>
<p>Figure 9 is again an improvement.  The figure is formed by taking a signed sum of our discount functions centered at our training points and weighted by the &#8220;support vector machine weights.&#8221;  The signs are picked with one class encoded as +1 and the other as -1.  The colors red and blue are picked according to whether the sum is above or below a constant &#8220;b&#8221; called &#8220;the dc-term&#8221; (part of the support vector solution).  The solution again looks okay: all the training points are in the correct color region (and you can&#8217;t hope to interpolate let alone extrapolate if you can&#8217;t even reproduce your training concept).  Also, as with the bandwidth model, contours are set sensibly (steep/dense where categories are near each other, shallow/sparse where things are safe).   We see the same inductive bias: one of the concepts is the union of bounded islands (this is always going to be the case unless the model picks b=0).  This bias is not obvious from the kernel choice, as it is hidden in the dc-term of the support vector fitting equations (it is not part of the kernel, but can be defeated by choosing unbounded kernels).  An interesting empirical fact about support vector models is they tend to work well with a larger bandwidth.    For example if we use a wider Gaussian we get an even more convincing model (see Figure 10).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel025SVM.png" alt="GaussianKernel025SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 10: Wide Gaussian support vector model</p>
<p></center></p>
<p>In figure 10 we have also called out one of the important features of support vector machines.   In the literature all datums are called &#8220;vectors&#8221; (as they are represented in coordinates) and a subset of these are called the support vectors.  In figure 10 we have drawn the 4 training examples that turn out to be support vectors as large shapes and all other training examples as small shapes.  The support vectors are the datums with non negligible weights.  The learned model is a function of the support vectors only.  So after fitting we can discard the other 16 training points. This is similar to the fact that nearest neighbor also only needs the training examples that are supporting its model boundary.</p>
<p>Notice that at this point the support vector models are not &#8220;magically&#8221; better; they in fact look less like the truth we are trying to learn than either the nearest neighbor models or the un-weighted sum models.  To fix this we need to do some &#8220;kernel shopping&#8221; or find functions that better respect what we are trying to model.  With the right functions support vector machines can do a very good job at learning the concept (but knowing &#8220;the right functions&#8221; is a huge hint as to what the concept is).  There is nothing wrong with using a hint but we do have to have a method to produce the hint or have a reasonable number of hints to try. </p>
<p>The &#8220;sum over everything with the same weight&#8221; solutions from the earlier section are very similar to Naive Bayes which also sums over all matching features with no weight adjustment.  The support vector &#8220;use the same model but pick better weights&#8221; stands over these sum models in very much the same way Logistic Regression stands over Naive Bayes.  In fact the optimization problems are very related.  If we consider weights over features or coordinates as &#8220;primal&#8221; and weights over data items as &#8220;dual&#8221; we can informally say something like: support vector machines optimize by working over dual variables and inspecting for primal weights, and logistic regression works over primal variables and inspects for data weights.   But this is a bit vague.</p>
<p>At this point we need to get a bit more explicit and precise.</p>
<h2>Precise functions used</h2>
<p>In general we will write our training data as a sequence of n-vectors.  Our points will be named u(1) &#8230; u(m).   For each i = 1..m we also know which category the training point is labeled with.  We will encode this as y(1) &#8230; y(m) where each y() is either +1 or -1 depending on the training label.  So the example we have been working has n=2 and m=20.  Let z represent a n-vector we wish to classify.</p>
<p>Figure 6 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" alt="E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 7 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" alt="9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 8 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" alt="C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" border="0"  height="50" /><br />
</center><br />
where the w(i) are non-negative numbers picked so that<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/54765538-DADD-4353-A59A-EF3B6C42743E.jpg" alt="54765538-DADD-4353-A59A-EF3B6C42743E.jpg" border="0"  height="50" /><br />
</center><br />
is large.</p>
<p>Figure 9 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" alt="CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so &#8220;the margin&#8221; (discussed later) is large.</p>
<p>Figure 10 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" alt="1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so the margin is large.</p>
<p>All of this repetition is to emphasize the commonality of the models.  They are all of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" alt="1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) are non-negative numbers called the &#8220;support weights&#8221; and k(,) is a function mapping pairs of vectors to numbers and is called &#8220;a kernel.&#8221;  Unfortunately there are many incompatible uses of the word kernel in mathematics and statistics.  Here &#8220;kernel&#8221; is being used in the sense of <a target="_blank" href="http://en.wikipedia.org/wiki/Positive-definite_kernel">positive semi-definiteness</a>, which in particular implies that k(u,u) &ge; 0 for all u.   Note that the support vector machine, instead of using the sign of f(z) as its decision, uses which side of a constant b the value f(z) falls on as its category decision.  A consequence is: for support vector machines if b is non-zero (as it almost surely will be) and the kernels all go to zero as we approach infinity fast enough (as they are designed to do) then exactly one of the learned classes is infinite and the other is a union of islands (regardless of whether this was true for the training data).</p>
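<p>The common form of these models can be sketched as one function with pluggable weights, kernel and dc-term.  This assumes the form f(z) = sum over i of a(i) y(i) k(u(i), z), compared against b; the toy points and the weights below are hypothetical and were not produced by any support vector optimizer.</p>

```python
import math

def gaussian_kernel(u, v, steepness=3.0):
    """Gaussian discount: exp(-steepness * squared Euclidean distance)."""
    return math.exp(-steepness * sum((p - q) ** 2 for p, q in zip(u, v)))

def f(z, points, labels, weights, kernel=gaussian_kernel):
    """Weighted signed kernel sum: sum over i of a(i) * y(i) * k(u(i), z)."""
    return sum(a * y * kernel(u, z) for u, y, a in zip(points, labels, weights))

def classify(z, points, labels, weights, b=0.0, kernel=gaussian_kernel):
    """+1 or -1 by which side of b the sum falls on (0 on an exact tie)."""
    value = f(z, points, labels, weights, kernel)
    if value > b:
        return 1
    if value == b:
        return 0
    return -1

# Hypothetical toy data and weights (illustrative only, not fitted).
points = [(0.0, 0.5), (0.0, -0.5), (1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1, -1, -1]
weights = [1.0, 0.7, 0.0, 0.0]  # zero weight means "not a support vector"

far_away = (100.0, 100.0)
far_class = classify(far_away, points, labels, weights, b=0.1)  # every hump is ~0 here
```

<p>Setting all weights to 1 and b to 0 gives the un-weighted sum models; zero-weight points can be dropped without changing f.  With b = 0.1 the far-away point lands in the -1 class simply because every Gaussian has fallen to zero there, which is the island bias described above.</p>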
<h2>About Kernels</h2>
<p>A lot of awe and mysticism is associated with kernels, but we think they are not that big a deal.  Kernels are a combination of two good ideas; they have one important property and are subject to one major limitation.  It is also unfortunate that support vector machines and kernels are tied so tightly together.   Kernels are the idea of summing functions that imitate similarity (induce a positive-definite encoding of nearness) and support vector machines are the idea of solving a clever dual problem to maximize a quantity called margin.  Each is a useful tool even without the other.</p>
<p>The two good ideas are related and unfortunately treated as if they are the same.  The two good ideas we promised are:</p>
<ol>
<li>The &#8220;kernel trick.&#8221;  Adding new features/variables that are functions of your other input variables can change linearly inseparable problems into linearly separable problems.  For example if our points were encoded not as u(i) = (x(i),y(i)) but as u(i) = (x(i),y(i),x(i)*x(i),y(i)*y(i),x(i)*y(i))</li>
<p>  we could easily find the exact concept ( y(i) > x(i)*x(i) ), which is now a linear concept encoded as the vector (0,1,-1,0,0).</p>
<li>Often you don&#8217;t need the coordinates of u(i).  You are only interested in functions of distances ||u(i)-u(j)||^2 and in many cases you can get at these by inner products and relations like ||u(i)-u(j)||^2 = &lt;u(i),u(i)&gt; + &lt;u(j),u(j)&gt; - 2&lt;u(i),u(j)&gt; .</li>
</ol>
<p>We will expand on these issues later.</p>
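<p>Both ideas are easy to check numerically.  The sketch below uses the explicit five-coordinate encoding and the weight vector (0,1,-1,0,0) from the list above; the function names are hypothetical.</p>

```python
def expand(x, y):
    """Explicit feature expansion from the text: (x, y, x*x, y*y, x*y)."""
    return (x, y, x * x, y * y, x * y)

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

# Weight vector from the text: its inner product with expand(x, y) is y - x*x.
CONCEPT = (0, 1, -1, 0, 0)

def is_circle(x, y):
    """Linear decision in the expanded space; same as y > x*x in the plane."""
    return dot(CONCEPT, expand(x, y)) > 0

def squared_distance(u, v, inner=dot):
    """Squared distance using only inner products: (u,u) + (v,v) - 2*(u,v)."""
    return inner(u, u) + inner(v, v) - 2 * inner(u, v)
```

<p>Because <code>squared_distance</code> touches the data only through <code>inner</code>, swapping in a different inner-product-like function changes the geometry without ever computing new coordinates.</p>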
<p>The important property is that kernels look like inner products in a transformed space.  The definition of a kernel is: there exists a magic function phi() such that for all u,v:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" alt="F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" border="0"  height="20" /> .<br />
</center><br />
This means that k(.,.) is behaving like an inner product in some (possibly unknown) space.  The important consequence is the positive semi-definiteness, which implies k(u,u)&ge;0 for all u (and this just follows from the fact about inner products over the real numbers that &lt;z,z&gt;&ge;0 for all z).  This is why optimization problems that use the kernel as their encoding are well formed (such as the optimization problem of maximizing margin which is how support vector machines work).  You can <a target="_blank" href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">&#8220;regularize&#8221;</a> optimization problems with a kernel penalty because it behaves a lot like a norm.  Without the positive semidefinite property all of these optimization problems would be able to &#8220;run to negative infinity&#8221; or use negative terms (which are not possible from a kernel) to hide high error rates.  The limits of the kernel functions (not being able to turn distance penalties into bonuses) help ensure that the result of optimization is actually useful (and not just a flaw in our problem encoding).</p>
<p>And this brings us to the major limitation of kernels.  The phi() transform can be arbitrarily magic, except that when transforming one vector it doesn&#8217;t know what the other vector is, and phi() doesn&#8217;t even know which side of the inner product it is encoding.   That is, kernels are not as powerful as any of the following forms:</p>
<ul>
<li>snooping (knowing the other):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" alt="1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" border="0"  height="20" />
</li>
<li>positional (knowing which side of the inner product it is mapping to):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" alt="D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" border="0" height="20" />
</li>
<li>fully general:<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" alt="845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" border="0" height="20" />
</li>
</ul>
<p>Not everything is a kernel.</p>
<p>Some non-kernels are:</p>
<ol>
<li> k(u,v) = -c for any c>0</li>
<blockquote><p>
A consequence of the fact that no negative number can be written as a sum of squares or as the limit of a sum of squares in the reals.
</p></blockquote>
<li> k(u,v) = ||u-v||<br />
<blockquote><p>
This can be shown as follows.  Suppose k(u,v) is a kernel with k(u,u) = 0 for all u (which is necessary to try and match ||u-v|| as ||u-u|| = 0 for all u).  By the definition<br />
of kernels k(u,u) = &lt; phi(u), phi(u) &gt; for some real vector valued function phi(.).  But, by the properties of the real inner product &lt;.,.&gt;, this means phi(u) is<br />
the zero vector for all u.  So k(u,v) = 0 for all u,v and does not match ||u-v|| for any u,v such that u &ne; v.
</p></blockquote>
</li>
</ol>
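<p>Both non-kernels can also be checked numerically.  If k(u,v) = &lt; phi(u), phi(v) &gt; then for any points u(1),&#8230;,u(n) and any real coefficients a(1),&#8230;,a(n) the double sum of a(i) a(j) k(u(i),u(j)) equals the squared length of the sum of a(i) phi(u(i)), so it can never be negative.  The sketch below (helper names are ours) finds a negative value for each non-kernel, which rules it out:</p>

```python
def quadratic_form(kernel, points, coeffs):
    """Return sum_i sum_j a(i) a(j) k(u(i), u(j)); never negative for a true kernel."""
    return sum(
        ai * aj * kernel(ui, uj)
        for ai, ui in zip(coeffs, points)
        for aj, uj in zip(coeffs, points)
    )


def negative_constant(u, v):
    return -1.0


def distance(u, v):
    return abs(u - v)


def inner(u, v):
    return u * v


# One point is enough to rule out a negative constant.
print(quadratic_form(negative_constant, [0.0], [1.0]))

# Two points with opposite coefficients rule out the distance function,
# while the ordinary inner product passes the same check.
print(quadratic_form(distance, [0.0, 1.0], [1.0, -1.0]))
print(quadratic_form(inner, [0.0, 1.0], [1.0, -1.0]))
```

<p>A single negative value disproves kernel-hood; non-negative values on a few test points do not prove it.</p>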
<p>There are some obvious kernels:</p>
<ol>
<li> k(u,v) = c for any c &ge; 0 (non-negative constant kernels) </li>
<li> k(u,v) = &lt; u , v &gt;  (the identity kernel) </li>
<li> k(u,v) = f(u) f(v) for any real valued function f(.) </li>
<li> k(u,v) = &lt; f(u) , f(v) &gt; for any real vector valued function f(.) (again, the definition of a kernel)</li>
<li> k(u,v) = transpose(u) B v where B is any symmetric positive semi-definite matrix</li>
</ol>
<p>And there are several subtle ways to build new kernels from old.  If q(.,.) and r(.,.) are kernels then so are:</p>
<ol>
<li> k(u,v) = q(u,v) + r(u,v)</li>
<li> k(u,v) = c q(u,v) for any c &ge; 0 </li>
<li> k(u,v) = q(u,v) r(u,v) </li>
<li> k(u,v) = q(f(u),f(v)) for any real vector valued function f(.)</li>
<li> k(u,v) = lim_{n-&gt; infinity} q_n(u,v) where q_n(u,v) is a sequence of kernels and the limit exists.</li>
<li> k(u,v) = p(q(u,v)) where p(.) is any polynomial with all non-negative coefficients.</li>
<li> k(u,v) = f(q(u,v)) where f(.) is any function with an absolutely convergent Taylor series with all non-negative coefficients.</li>
</ol>
<p>Most of these facts are taken from the excellent book: John Shawe-Taylor and Nello Cristianini&#8217;s <a target="_blank" href="http://www.kernel-methods.net/">&#8220;Kernel Methods for Pattern Analysis&#8221;</a>, Cambridge 2004.  We are allowing<br />
kernels of the form k(u,v) = &lt; phi(u), phi(v) &gt; where phi(.) is mapping into an infinite dimensional vector space (like a series).  Most of these facts can be checked by<br />
imagining how to alter the phi(.) function.  For example to add two kernels just build a larger vector with enough slots for all of the coordinates for the vectors encoding<br />
the phi(.)&#8217;s of the two kernels you are trying to add.  To scale a kernel by c multiply all coordinates of phi() by sqrt(c).  </p>
<p>Multiplying two kernels is the trickiest.  Without loss of generality assume q(u,v) = &lt; f(u) ,f(v) &gt; and r(u,v) = &lt; g(u), g(v) &gt; and f(.) and g(.) are both mapping into the same finite dimensional vector space R^m (any other situation can be simulated or approximated by padding with zeros and/or taking limits).  Imagine a new vector function p(.) that maps into R^{m*m} such that p(z)_{m*(i-1) + j} = f(z)_i g(z)_j .  It is easy to check that k(u,v) = &lt; p(u) , p(v) &gt; is a kernel and k(u,v) = q(u,v) r(u,v). (The reference proof uses tensor notation and the Schur product, but these are big hammers mathematicians use when they don&#8217;t want to mess around with re-encoding indices).</p>
<p>Note that one of the attractions of kernel methods is that you never have to actually implement any of the above constructions.  What you do is think in terms of sub-routines (easy for computer scientists, unpleasant for mathematicians).  For instance: if you had access to two functions q(.,.) and r(.,.) that claim to be kernels and you wanted the product kernel you would, when asked to evaluate the kernel, just get the results for the two sub-kernels and multiply them (so you never need to see the space phi(.) is implicitly working in).</p>
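<p>The sub-routine view can be sketched in a few lines of Python.  The maps f(.) and g(.) below are arbitrary choices of ours into R^2; the check confirms that multiplying the two sub-kernels gives the same number as the inner product under the explicit product map p(.) described above:</p>

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kernel_from_map(phi):
    """Build k(u,v) = <phi(u), phi(v)> from an explicit feature map."""
    return lambda u, v: dot(phi(u), phi(v))


def sum_kernel(q, r):
    return lambda u, v: q(u, v) + r(u, v)


def product_kernel(q, r):
    # Evaluate the two sub-kernels and multiply; no feature space is built.
    return lambda u, v: q(u, v) * r(u, v)


def f(z):  # illustrative feature map into R^2
    return (z[0], z[1])


def g(z):  # a second illustrative feature map into R^2
    return (1.0, z[0] * z[1])


def p(z):
    # Explicit product map: one coordinate f(z)_i * g(z)_j per pair (i, j).
    return tuple(fi * gj for fi in f(z) for gj in g(z))


implicit = product_kernel(kernel_from_map(f), kernel_from_map(g))
explicit = kernel_from_map(p)

u, v = (1.0, 2.0), (3.0, -1.0)
print(implicit(u, v), explicit(u, v))
```

<p>The implicit version costs two kernel evaluations and one multiply, while the explicit version needs m*m coordinates.</p>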
<p>This implicitness can be important (though far too much is made of it).  For example the Gaussians we have been using throughout are kernels, but kernels of infinite dimension, so we can not directly represent the space they live in.  To see the Gaussian is a kernel notice the following:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" alt="926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" border="0"  height="20" /> .<br />
</center> </p>
<p>And for all c &ge; 0 each of the three terms on the right is a kernel (the first because the Taylor series of exp() is absolutely convergent with non-negative coefficients and the second two are the f(u) f(v) form we listed as obvious kernels).  The trick is magic, but the idea is to use the fact that Euclidean squared distance breaks nicely into dot-products ( ||u-v||^2 = &lt; u, u &gt; + &lt; v, v &gt; &#8211; 2 &lt; u , v &gt;) and exp() converts addition to multiplication.  It is rather remarkable that kernels (which are a generalization of inner products that induce half open spaces) can encode bounded concepts (more on this later when we discuss the VC dimension of the Gaussian).</p>
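<p>A quick numeric check of this factorization, assuming the three terms are exp(2c &lt; u, v &gt;), exp(-c &lt; u, u &gt;) and exp(-c &lt; v, v &gt;) (which is what the dot-product expansion of ||u-v||^2 gives); the function names and test values are ours:</p>

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gaussian(u, v, c):
    """exp(-c ||u-v||^2) computed directly from the difference vector."""
    diff = [a - b for a, b in zip(u, v)]
    return math.exp(-c * dot(diff, diff))


def factored(u, v, c):
    """The same value as a product of three factors (assumed form)."""
    cross = math.exp(2.0 * c * dot(u, v))  # exp() applied to a scaled inner product
    left = math.exp(-c * dot(u, u))        # depends on u only
    right = math.exp(-c * dot(v, v))       # depends on v only
    return cross * left * right


u, v, c = (0.3, -1.2), (1.0, 0.5), 0.7
print(gaussian(u, v, c), factored(u, v, c))
```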
<h2>Back to Support Vector Machines</h2>
<p>We can now restate what a support vector machine is: it is a method for picking data weights so that the modeling function for a given kernel has a maximum margin.  The margin is the minimal distance between a training example and the model&#8217;s boundary between blue and red.  Notice the 4 support vectors in figure 10 are all not &#8220;tight up against the boundary,&#8221; but are all the same (minimal distance) from the boundary.  This distance is called the margin and the area near the boundary is obviously a place the model has uncertainty (so it makes sense to keep the boundary as far as possible from the training data).  The great benefit of the support vector machine is that with access only to the data labels and the kernel function (in fact only the kernel function evaluated at pairs of training datums) the support vector machine can quickly solve for the optimal margin and data weights achieving this margin.</p>
<p>For fun let&#8217;s plug in a new kernel called the cosine kernel:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" alt="328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" border="0"  height="43" /><br />
</center><br />
(c &ge; 0).</p>
<p>This kernel can be thought of as having a phi(.) function that takes a vector z and adds an extra coordinate of sqrt(c) and then projects the resulting vector onto the unit sphere.  This kernel induces concepts that look like parabolas, as seen in figure 11.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelExampleConcept.png" alt="CosineKernelExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 11: Cosine example concept</p>
<p></center></p>
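<p>The phi(.) description of the cosine kernel translates directly into code.  In this sketch the closed form is our reading of that description (append sqrt(c), then normalize), and we take c &gt; 0 so that the zero vector can still be normalized:</p>

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def phi(z, c):
    """Append sqrt(c) as an extra coordinate, then scale to unit length."""
    extended = list(z) + [math.sqrt(c)]
    norm = math.sqrt(dot(extended, extended))
    return [x / norm for x in extended]


def cosine_kernel(u, v, c):
    return dot(phi(u, c), phi(v, c))


def cosine_kernel_closed_form(u, v, c):
    # Assumed closed form: (<u,v> + c) / sqrt((<u,u> + c) * (<v,v> + c)).
    return (dot(u, v) + c) / math.sqrt((dot(u, u) + c) * (dot(v, v) + c))


u, v, c = (1.0, 2.0), (3.0, -1.0), 1.0
print(cosine_kernel(u, v, c), cosine_kernel_closed_form(u, v, c))
```

<p>Because every phi(z) lies on the unit sphere, k(u,u) = 1 for all u, just as for the Gaussian.</p>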
<p>Figure 12 shows the sum-concept (add all discount functions with the same weight) model. In this case it is far too broad: averaging the cosine concepts (which are not bounded concepts like the Gaussians) creates an overly wide model.  The model not only generalizes poorly (fails to match the shape of the truth diagram), it also gets some of its own training data wrong.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelKernelSum1.png" alt="CosineKernelKernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 12: Cosine kernel sum model</p>
<p></center></p>
<p>And figure 13 shows the excellent fit returned by the support vector machine.  Actually an excellent fit determined by 4 support vectors (indicated by larger labels).  Also notice the support vector machine using unbounded concepts generated a very good unbounded model (unbounded in the sense that both the blue and red regions are infinite).  By changing the kernel or discount functions/concepts we changed the inductive bias.  So any knowledge of what sort of model we want (one class bounded or not) should greatly influence our choice of kernel functions (since the support vector machine can only pick weights for the kernel functions, not tune their shapes or bandwidths).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelSVM.png" alt="CosineKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 13: Cosine Support Vector model</p>
<p></center></p>
<p>Another family of kernels (which I consider afflictions) are the &#8220;something for nothing&#8221; kernels.  These are kernels of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" alt="C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" border="0" height="20" /><br />
</center><br />
or higher powers or other finishing functions.  The idea is that if you were to look at the expansion of these kernels you would see lots of higher order terms (powers of u_i and v_j)<br />
and if these terms were available to the support vector machine it could use them.  It is further claimed that in addition to getting new features for free you don&#8217;t use up degrees<br />
of freedom exploiting them (so you cross-validate as well as for simpler models).   Both these claims are fallacious: you can&#8217;t fully use the higher order terms because they are entangled with other terms that are not orthogonal to the outcome, and the complexity of a kernel is not quite as simple as degrees of freedom (proofs about support vector machines are stated in terms of margin, not in terms of degrees of freedom or even in terms of VC dimension).  The optimizer in the SVM does try hard to make a good hypothesis using the higher order terms, but due to the forced relations among the terms you get best fits like figure 14.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/SquareKernelSVM.png" alt="SquareKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 14: Squared magic kernel support vector model</p>
<p></center></p>
<p>If you want higher order terms I feel you are much better off performing a primal transform so the terms are available in their most general form.   That is re-encode the vector u = (x,y) as the larger vector (x,y,x*y,x*x,y*y).  You have to be honest: if you are trying to fit more complicated functions you are searching a larger hypothesis space so you need more data to falsify the bad hypotheses.   You can be efficient (don&#8217;t add terms you don&#8217;t think you can use as they make generalization harder) but you can&#8217;t get something for nothing (even with kernel methods).</p>
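<p>To make the &#8220;forced relations&#8221; concrete, take k(u,v) = (&lt; u, v &gt; + 1)^2 on the plane as one kernel of this type (our choice for illustration; other constants and powers behave similarly).  Its implied feature map is fixed by the kernel, including the relative scale of every term, whereas the primal encoding hands the same monomials to the fitter with no built-in scaling:</p>

```python
import math


def squared_kernel(u, v):
    """Illustrative 'something for nothing' kernel: (<u,v> + 1)^2."""
    return (u[0] * v[0] + u[1] * v[1] + 1.0) ** 2


def implied_features(z):
    """Feature map with <implied(u), implied(v)> equal to squared_kernel(u, v)."""
    x, y = z
    s = math.sqrt(2.0)
    # The sqrt(2) factors and the constant 1 are dictated by the kernel.
    return (x * x, y * y, s * x * y, s * x, s * y, 1.0)


def primal_features(z):
    """Primal transform from the text: each term gets its own free coefficient."""
    x, y = z
    return (x, y, x * y, x * x, y * y)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


u, v = (1.0, 2.0), (3.0, -1.0)
print(squared_kernel(u, v), dot(implied_features(u), implied_features(v)))
print(primal_features(u))
```

<p>A dual method using this kernel can only reweight training points, so it inherits the fixed scaling in implied_features; a primal method fit on primal_features is free to choose each coefficient separately.</p>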
<h2>Back to margin</h2>
<p>Large margin (having a large distance between the decision boundary and all training examples) is a good idea.  At the simplest level it means anything close to a training data point gets classified correctly (so the model is immune to an amount of fuzz proportional to the margin width).  </p>
<p>We get the following remarkable generalization theorem (this one is the hard SVM version, we weaken the result a bit to make it easier to state).  Suppose we train a model using kernel k(.,.) on m training examples u(1),&#8230;,u(m) with training labels y(1),&#8230;,y(m).  Further assume the hidden true concept is separable with respect to our kernel and data distribution with a margin of at least w (that is for any sample we draw we can find an SVM model that classifies all of the training data correctly and sees a margin of at least w).  Or more simply: assume w is behaving like a constant with respect to m (i.e. it is not going down as we increase sample size). Then with probability at least 1-d the generalization error (that is error seen on new test points not seen during training, but drawn from the same distribution as the training examples) is no more than:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" alt="1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" border="0" height="65" /><br />
.<br />
</center><br />
(we take this from Theorem 7.22 of &#8220;Kernel Methods for Pattern Analysis&#8221;, page 215).</p>
<p>If we further assume the expected value k(u,u) under the training distribution exists (i.e. we don&#8217;t have too many examples of very high kernel norm according to the distribution we are training with respect to) and is stable (not taking a lot of value from rare instances) then we can read off the meaning of this upper bound.  Assuming we can always build a model from the training data of margin at least w then the generalization error (excess error we expect to see on classifying similar points) is falling at a rate inversely proportional to the square root of the amount of training data we have.  This is in one sense amazing- we have a bound on how well we are going to do on data we have not seen.  And this bound is also in some sense &#8220;best possible&#8221; because we need around this much data to even confirm such an accuracy (see <a target="_blank" href="http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/">What is a large enough random sample?</a> ).  Also notice how the dimension of the implicit feature space (which can in fact be infinite) does not enter into the bound.</p>
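<p>To see the square-root rate numerically, here is a small sketch that assumes the bound has the form 4/(m w) sqrt(k(u(1),u(1)) + &#8230; + k(u(m),u(m))) + 3 sqrt(ln(2/d)/(2m)).  This is our reading of the cited theorem; the book is the authoritative statement.  The margin and d values below are arbitrary:</p>

```python
import math


def margin_bound(kernel_diagonal, margin, d):
    """Assumed form of the bound; kernel_diagonal holds k(u(i),u(i)) for each example."""
    m = len(kernel_diagonal)
    capacity = 4.0 / (m * margin) * math.sqrt(sum(kernel_diagonal))
    confidence = 3.0 * math.sqrt(math.log(2.0 / d) / (2.0 * m))
    return capacity + confidence


for m in (100, 10000, 1000000):
    diagonal = [1.0] * m  # a Gaussian kernel has k(u,u) = 1 for every u
    print(m, margin_bound(diagonal, margin=0.5, d=0.05))
```

<p>With a constant margin and constant k(u,u), multiplying m by 100 divides the bound by 10.  For small m the value exceeds 1 and so says nothing.</p>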
<p>However, the bound is also in some sense to be expected.  It depends on the following strong assumptions (usually not remembered or stated):</p>
<ol>
<li>Test and training data must be generated in the same manner (i.e. come from the same distribution, this is usually stated but not obeyed).</li>
<li>We must not have collapsing margin as the number of data points goes up (so there must actually be a moat between the positive and negative examples such as portrayed in figure 1 where the blurry boundary was a region we never generated training or test examples from).</li>
<li>The expected value of k(u,u) must be bounded for examples drawn from our training distribution.  For example this is not true even for the identity kernel if the examples are just numbers drawn from a <a target="_blank" href="http://en.wikipedia.org/wiki/Cauchy_distribution">Cauchy distribution</a>.</li>
</ol>
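<p>The Cauchy example is easy to see by simulation.  For the identity kernel on plain numbers k(x,x) = x*x, so the quantity in question is the mean of the squares.  The sketch below (seed and sample sizes are arbitrary choices of ours) compares standard normal draws with standard Cauchy draws:</p>

```python
import math
import random


def mean_square(samples):
    """Sample estimate of the expected value of k(x,x) = x*x."""
    return sum(x * x for x in samples) / len(samples)


def cauchy(rng):
    # Inverse-CDF sampling of a standard Cauchy variate.
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(0)
for n in (1000, 100000):
    normal_draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
    cauchy_draws = [cauchy(rng) for _ in range(n)]
    print(n, mean_square(normal_draws), mean_square(cauchy_draws))
```

<p>The normal column settles near 1.  The Cauchy column is dominated by a handful of enormous draws and tends to grow with n rather than converge.</p>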
<p>Also it is a <em>posterior</em> estimate of expected generalization error.  That is, we only get a bound after we plug in some facts found from fitting a model to the training data (the margin and the expected value of k(u,u)).  It is not like the famous <em>prior</em> bounds based on VC dimension that depend only on the structure of the kernel k(,) and not the distribution of training data or labels (beyond the concept being expressible by the kernel, which is itself a <em>posterior</em> fact).  The thing to remember for all posterior estimates is you can get such estimates at a small expense in statistical efficiency (i.e. you waste some data) and a moderate computational blow-up by cross validation (and with essentially no math and many fewer assumptions).  In fact the cross validation and hold-out estimates may be much tighter than calculated bounds (and are certainly easier to explain).</p>
<p>When we say we weakened the result we mean we have added some stated conditions that are not strictly needed for the bound to be true.  One of these assumptions is that the margin is not collapsing (the bound remains true even for a collapsing margin; it is just not useful).  We added this assumption because most readings of the bound (or presumed applications) implicitly assume the bound remains useful (which means something near our additional assumptions must also hold).  The reader is really lured to think of the margin w as a constant depending only on the kernel when in fact it depends on the kernel, data distribution and number of training examples.  For example: consider a very simple one dimensional classification problem:  learning the concept x &ge; 0 from labeled training data drawn uniformly from the interval [-0.5,0.5] using a Gaussian kernel.  The support vector models are going to be essentially picking two neighboring points in the observed training sample that have opposite labels and saying the concept boundary is between them.  In this simple case the margin is collapsing: as m (the number of training examples) increases the expected margin width is O(1/m).  This renders the bound calculated above useless (drives it above 1 as the number of training examples increases).  The model is good but the bound is not useful.  This is because the training distribution is discovering points closer and closer to the concept boundary (a necessary discontinuity) as the number of training examples grows.  This is why we had to add the additional (weakening) assumption that not only do we need to assume the concept to be learned is separable with respect to the kernel we are using, but we must further assume the data-distribution (determining where test and training data come from) stays some minimal distance away from the decision surface (i.e. together the data distribution and underlying concept have a moat, as in our figure 1 example).
We will phrase this as &#8220;the concept is wide margin separable with respect to the training distribution&#8221; (implicitly we need to know the kernel, but it is the training distribution that is critical). This is a very strong assumption (not true for all classification problems) but is often true when the classification task is to &#8220;un-mix a mixture&#8221; (the observed data is coming from two different sources and the training labels are which source they come from, in this case you often do have a margin).  Obviously a soft-margin SVM deals better with this (some of the issue is due to our use of hard-margin)- but it is an interesting exercise to think why you would expect your learned  model to have a large margin if your underlying concept and training data do not.</p>
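<p>The collapsing margin in the one dimensional example can be simulated directly.  This sketch uses the gap in the input space between the two closest oppositely labeled points as a stand-in for the margin (an assumption of ours: the true margin lives in the kernel&#8217;s feature space, but it cannot stay wide when this gap closes).  Trial counts and the seed are arbitrary:</p>

```python
import random


def boundary_gap(m, rng):
    """Gap between the largest negative and the smallest non-negative sample point."""
    xs = [rng.uniform(-0.5, 0.5) for _ in range(m)]
    negatives = [x for x in xs if x < 0]
    positives = [x for x in xs if x >= 0]
    if not negatives or not positives:
        return None  # only one class was drawn, so there is no gap to measure
    return min(positives) - max(negatives)


def mean_gap(m, trials, seed=0):
    rng = random.Random(seed)
    gaps = [boundary_gap(m, rng) for _ in range(trials)]
    gaps = [g for g in gaps if g is not None]
    return sum(gaps) / len(gaps)


for m in (10, 100, 1000):
    print(m, mean_gap(m, trials=200))
```

<p>Each tenfold increase in m shrinks the average gap by roughly a factor of ten, which is the O(1/m) behavior described above.</p>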
<p>The result remains important: support vector machines are in some sense an optimal learning procedure (up to some constants they achieve as tight a generalization error as can even be confirmed for a given training set size).  Another strong point of the support vector result is the result applies even for kernels with unbounded VC dimension (such as the Gaussian kernel).</p>
<h2>The Gaussian (or radial) kernel again</h2>
<p>We said the Gaussian kernel was of infinite VC dimension.  But we have not shown why it is true.   The fact that the only encoding we could think of off the top of our head was a power-series or a limit does not guarantee there isn&#8217;t some more clever encoding that easily demonstrates that the Gaussian kernel is of bounded VC dimension (like the identity kernel and cosine kernels are). </p>
<p>However we can see the Gaussian kernel can not have finite VC dimension because it can place an arbitrary number of bounded patches in the plane.  Thus we can build arbitrarily large sets (by placing all points far apart) that can be shattered (thus falsifying any bounded VC dimension).</p>
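<p>The shattering argument can be checked by brute force for a small set.  The sketch below places five points far apart on a line in the plane (spacing and bandwidth are our choices), gives every point the same weight, and confirms that for every one of the 32 labelings the sign of the signed Gaussian sum reproduces the labels.  It uses a threshold of zero, which is a simplification of the general model with a constant b:</p>

```python
import itertools
import math


def gaussian(u, v, c=1.0):
    return math.exp(-c * sum((a - b) ** 2 for a, b in zip(u, v)))


def signed_sum(z, centers, labels):
    """f(z) = sum of y(i) k(u(i), z) with equal weight on every center."""
    return sum(y * gaussian(u, z) for u, y in zip(centers, labels))


def shattered(centers):
    """True if every +1/-1 labeling of the centers is reproduced by sign(f)."""
    for labels in itertools.product((-1, 1), repeat=len(centers)):
        for u, y in zip(centers, labels):
            if signed_sum(u, centers, labels) * y <= 0:
                return False
    return True


far_apart = [(10.0 * i, 0.0) for i in range(5)]
print(shattered(far_apart))
```

<p>Nothing limits this to five points: spacing the centers far enough apart makes each Gaussian patch negligible at every other center, so the same construction works for any finite number of points.</p>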
<p>One of the wonders of the support vector method is that the theory works even in this situation.</p>
<h2>Back to SVM</h2>
<p>Support vector machines seem nearly magical in their power to pick &#8220;best weights.&#8221;  However, as we have seen, these best weights may not always be much better than typical good weights (for instance: using all uniform weights).  Also it is very important to remember the support vector machine is picking the best weighting of a sum of already picked kernels.  It is not picking the best kernel shape or adjusting things like bandwidth (you must pick the best kernel ahead of time or try several kernels to find a good one).  You must remember support vector machines can only directly adjust dual weights (or relative training data weighting).  This can effect some changes in hypothesis shape, but the support vector machine can not directly implement arbitrary changes in hypothesis shape like a nearest neighbor or a parameterized primal method can (like logistic regression with the kernel trick of adding interaction variables). </p>
<p>Support vector machines are to be much preferred to other combiners like <a target="_blank" href="http://en.wikipedia.org/wiki/Boosting">Boosting</a>.  This is because support vector machines have explicit stopping criteria (conditions known to be true at the optimal solution that characterize the optimal solution) and explicit regularization (control of over fitting).  Boosting relies on a path argument (&#8220;do the computation in these stages&#8221;) so it is much harder to reason about and frankly the folklore that &#8220;early stop prevents over fitting in boosting&#8221; is just false (you are much better off explicitly regularizing).</p>
<p>It is also lore that random kernel transformations (like power series over the cosine kernel) are magic.  They can make data separable but they are unlikely to meet the (unstated but critical) conditions of the support vector theorem (margin being wide, margin not filling in and expectation of k(.,.) being small).</p>
<h3>A comment on optimization</h3>
<p>In this writeup we have skipped the most beautiful part of support vector machines- the form of the optimization problem used to solve them.  We strongly suggest consulting  John Shawe-Taylor and Nello Cristianini&#8217;s &#8220;Kernel Methods for Pattern Analysis&#8221;, Cambridge 2004 for this.  The writing is dense but accurate.  You can literally paste their formulas into a general solver and they work.</p>
<h2>Conclusion</h2>
<p>We have shown how kernel methods and support vector machines fit in a sequence of machine learning methods:</p>
<ol>
<li>nearest neighbor models</li>
<li>uniform sum models</li>
<li>support vector machines.</li>
</ol>
<p>We also showed a number of examples of kernels and non-kernels.  It is good to be able to quickly remember the constant &#8220;-1&#8221; is not a kernel and the constant &#8220;1&#8221; is a kernel (though not a very useful one).</p>
<p>It is our argument that support vector machines are &#8220;dual methods&#8221; (as they work in weights over the data instead of weights over coordinates in the feature space) and the move from the primal algorithm Naive Bayes to Logistic regression is similar to the move from uniform kernel sums to support vector machines.</p>
<p>Also, if you want to directly manipulate shape (like picking bandwidth) you must in some sense use a primal method (this isn&#8217;t what support vector machines are for).</p>
<p>We end with: If you have some feeling of both what a method can and can not do then you have some understanding of the method.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Large Data Logistic Regression (with example Hadoop code)</title>
		<link>http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=large-data-logistic-regression-with-example-hadoop-code</link>
		<comments>http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/#comments</comments>
		<pubDate>Mon, 27 Dec 2010 00:00:59 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Amazon Elastic MapReduce]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[EC2]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Map Reduce]]></category>
		<category><![CDATA[S3]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1607</guid>
		<description><![CDATA[Living in the age of big data we ask what to do when we have the good fortune to be presented with a huge amount of supervised training data? Most often at large scale we are presented with the un-supervised problems of characterization and information extraction; but some problem domains offer an almost limitless supply [...]
]]></description>
			<content:encoded><![CDATA[<p>Living in the <a target="_blank" href="http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/">age of big data</a> we ask what to do when we have the good fortune to be presented with a huge amount of supervised training data?  Most often at large scale we are presented with the un-supervised problems of <a target="_blank" href="http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/">characterization and information extraction</a>; but some problem domains offer an almost limitless supply of supervised training data (such as using older data to build models that predict the near future).  Having too much training data is a good problem to have and there are ways to use traditional methods (like logistic regression) at this scale.  We present an &#8220;out of core&#8221; logistic regression implementation and a quick example in <a target="_blank" href="http://hadoop.apache.org/">Apache Hadoop</a> running on <a target="_blank" href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic MapReduce</a>.  This presentation assumes familiarity with Unix style command lines, Java and Hadoop.<span id="more-1607"></span>Apache Hadoop already has a machine learning infrastructure named <a target="_blank" href="http://mahout.apache.org/">Mahout</a>.   While Mahout seems to concentrate more on unsupervised methods (like clustering, nearest neighbor and recommender systems) it does already include a <a target="_blank" href="https://cwiki.apache.org/MAHOUT/logistic-regression.html">logistic regression package</a>.   This package uses a learning method called &#8220;Stochastic Gradient Descent&#8221;, which is in a sense the perceptron update algorithm updated for the new millennium.  
This method is fast in most cases but differs from the traditional methods of solving a logistic regression, which are based on Fisher Scoring or the Newton/Raphson Method (see &#8220;Categorical Data Analysis,&#8221; Alan Agresti, 1990 and  <a target="_blank" href="http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&#038;language=en">Paul Komarek&#8217;s thesis &#8220;Logistic Regression for Data Mining and High-Dimensional Classification&#8221;</a>).  Fisher Scoring remains interesting in that it parallelizes in exactly the manner described in &#8220;Map-Reduce for Machine Learning on Multicore,&#8221; Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, Yuan Yuan Yu, Gary Bradski, Andrew Y Ng, Kunle Olukotun, NIPS 2006.</p>
<p>Stochastic gradient descent is in fact an appropriate method for big data.  For example: if our model complexity is held constant and our data set size is allowed to grow; then stochastic gradient descent will achieve its convergence condition before it even completes a single random order traversal of the data.  However, stochastic gradient descent has a control called the learning rate and one can easily imagine a series of problems that require the learning rate to be set arbitrarily slow.  For example a data set formed as the union of very many &#8220;typical&#8221; examples where a given variable is independent of the outcome and a small minority of &#8220;special&#8221; examples where the same variable helps influence the outcome presents a problem.  Training on the &#8220;typical&#8221; examples causes the stochastic gradient descent method to perform a random walk on the given variable coefficient.  So the learning rate must be slow enough that the expected drift does not swamp out the rare contributions from the &#8220;special&#8221; examples (meaning the learning rate must slow roughly proportionally to the square root of the ratio of the typical to special examples).</p>
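<p>For readers who have not seen it, a bare-bones stochastic gradient descent for logistic regression looks like the sketch below.  This is our own minimal illustration with a constant learning rate and no regularization, not Mahout&#8217;s implementation; the synthetic data, seeds and rate are arbitrary:</p>

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sgd_logistic(rows, labels, learning_rate, passes=1, seed=0):
    """Plain stochastic gradient ascent on the logistic log-likelihood."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    for _ in range(passes):
        rng.shuffle(order)  # visit the examples in random order
        for i in order:
            x, y = rows[i], labels[i]
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            # The log-likelihood gradient for one example is (y - p) * x.
            weights = [w + learning_rate * (y - p) * xi for w, xi in zip(weights, x)]
    return weights


# Synthetic data: a constant term plus one variable z, with P(y=1) = sigmoid(2 z).
data_rng = random.Random(1)
rows, labels = [], []
for _ in range(5000):
    z = data_rng.uniform(-1.0, 1.0)
    rows.append((1.0, z))
    labels.append(1 if data_rng.random() < sigmoid(2.0 * z) else 0)

weights = sgd_logistic(rows, labels, learning_rate=0.1)
print(weights)
```

<p>Every update moves the coefficients by an amount proportional to the learning rate, including updates from examples that carry no information about a variable.  That is the source of the random walk described above.</p>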
<p>Not too much must be made of artificial problems designed to slow stochastic gradient descent.  The traditional Fisher scoring (or the Newton/Raphson method) can simply be killed by specifying a problem with a great number of levels for categorical variables.  In this case traditional methods have to solve a linear system that can in fact be much larger than the entire data set (causing representation, work and numeric stability problems).  So it takes little imagination to design problems that kill the traditional methods.  Other intermediate complexity methods (like conjugate gradient) avoid the storage size problem, but can require many more passes through the training data.</p>
<p>There is a common situation where Fisher scoring makes good sense: you are trying to fit a relatively simple model to an enormous amount of data (often to predict a rare event).  One could sub-sample the training data to shrink the scale of the problem, but this is a case of the analyst being forced to accede to poor tools.  What one would naturally want is a training method that can fit reasonably sized models (that is models with a reasonable number of variables and levels) onto enormous data sets.  The software package <a target="_blank" href="http://cran.r-project.org/">R</a> can work with fairly large data sets (in the gigabytes range) and has some parallel flavors, but R is mostly an in-memory system.  It is appropriate to want a direct method that &#8220;works out of core&#8221; (i.e. in the terabytes and petabytes ranges), parallelizes to hundreds of machines (using current typical infrastructure, like a Hadoop cluster) and is exact (without additional parameters like learning rate).  </p>
<p>We demonstrate here an example implementation in Java for both single machine &#8220;out of core&#8221; training (allowing filesystem sized datasets) and MapReduce style parallelism (allowing even larger scale).  The method also includes the problem regularization steps discussed in our recent <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">logistic regression article</a>.  The code (packaged in: <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/WinVectorLogistic.Hadoop0.20.2.jar" title="WinVectorLogistic.Hadoop0.20.2.jar">WinVectorLogistic.Hadoop0.20.2.jar</a> ) is being distributed under the GNU Affero General Public License version 3.  This is an open source license that (roughly) requires (among other things) redistribution of source code of systems linked against the licensed project to anyone receiving a compiled version or using the system as a network service.  The license also promises no warranty or implied fitness.  The distribution is a standalone runnable Jar (source code and license inside the jar) and is the minimal object required to run on Hadoop (which is itself a Java project).    More advanced versions of the library (with better linear algebra libraries, better problem slice control, unit tests, JDBC bindings and with different license arrangements) can be arranged from the code owners: <a target="_blank" href="http://www.win-vector.com/">Win-Vector LLC</a>.  This jar was built for Apache Hadoop version 0.20.2 (the latest version Amazon Elastic MapReduce runs at this time) and we use as many of the newer interfaces as possible.  The code will therefore run against the current Hadoop 0.21.0 if re-built against Hadoop 0.21.0; the jar cannot switch versions without being re-built, due to how Hadoop calls methods.</p>
<p>For our example we will work on a small data set.  The code is designed to pass through data directly from disk, storing only the Fisher structures, which require storage proportional to the square of the number of variables and levels but independent of the number of data rows.   The data format is what we call &#8220;naive TSV&#8221; or &#8220;naive tab separated values.&#8221;  This is a file where each line has exactly the same number of values (separated by tabs) and the first line of the file is the header line naming each column.  This is compatible with Microsoft Excel and R with the proviso that this file format does not allow any sort of escapes, quoting or multiple line fields.  Our data set is taken from the <a target="_blank" href="http://archive.ics.uci.edu/ml/">UCI machine learning database</a> ( <a  target="_blank" href="http://archive.ics.uci.edu/ml/machine-learning-databases/car/">data</a>, <a target="_blank" href="http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names">description</a> )  and converted into the naive TSV format (split into training and testing subsets: <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/uciCarTrain.tsv" title="uciCarTrain.tsv">uciCarTrain.tsv</a>, <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/uciCarTest.tsv" title="uciCarTest.tsv">uciCarTest.tsv</a>).</p>
<p>The first few lines of the training file are given here:</p>
<pre>
buying	maintenance	doors	persons	lug_boot	safety	rating
vhigh	vhigh	2	2	small	med	FALSE
vhigh	vhigh	2	2	med	low	FALSE
vhigh	vhigh	2	2	med	med	FALSE
</pre>
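<p>The naive TSV rules described above (a header line, a fixed number of tab separated fields per line, no quoting or escapes) are simple enough to sketch in a few lines of Java.  This is an illustrative reader only and is not the one shipped in the jar; the class name NaiveTsv and its read method are hypothetical.  For brevity it collects rows in a list, where an out of core trainer would instead process each row as it is read.</p>

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class NaiveTsv {
    /**
     * Read naive TSV: the first line names the columns and every later line
     * must have the same number of tab separated fields.
     */
    public static List<Map<String, String>> read(final BufferedReader in) {
        final List<Map<String, String>> rows = new ArrayList<>();
        try {
            final String headerLine = in.readLine();
            if (headerLine == null) {
                return rows; // empty input: no header and no rows
            }
            final String[] header = headerLine.split("\t", -1);
            String line;
            while ((line = in.readLine()) != null) {
                // limit -1 keeps trailing empty fields
                final String[] fields = line.split("\t", -1);
                if (fields.length != header.length) {
                    throw new IllegalArgumentException("expected "
                            + header.length + " fields, saw " + fields.length);
                }
                final Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < header.length; ++i) {
                    row.put(header[i], fields[i]);
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(final String[] args) {
        final String sample =
                "buying\tmaintenance\tdoors\tpersons\tlug_boot\tsafety\trating\n"
                + "vhigh\tvhigh\t2\t2\tsmall\tmed\tFALSE\n";
        final List<Map<String, String>> rows =
                read(new BufferedReader(new StringReader(sample)));
        System.out.println(rows.get(0).get("rating"));
    }
}
```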
<p>The first experiment is to use the Java program standalone (without Hadoop) to train a model.  The method used is Fisher scoring by multiple passes over the data file.  Only the Fisher structures are stored in memory, so in principle the data set could be arbitrarily large.  To run the logistic training program, download the files WinVectorLogistic.Hadoop0.20.2.jar and uciCarTrain.tsv .  You will also need some libraries ( commons-logging-*.jar and commons-logging-api-*.jar , and sometimes  hadoop-*-core.jar and log4j-*.jar ) from the appropriate <a target="_blank" href="http://hadoop.apache.org/">Hadoop distribution</a>.  Before running the code you can examine the source (and re-build the project using an IDE like <a target="_blank" href="http://www.eclipse.org/">Eclipse</a>) by extracting the code in an empty directory using the Java jar command:</p>
<pre>
jar xvf WinVectorLogistic.Hadoop0.20.2.jar
</pre>
<p>To run the code, type the following at the command line (all on a single line; we have inserted line breaks for clarity and assume you are using a Unix style shell on Linux, OSX or Cygwin on Windows):</p>
<pre>
java -cp WinVectorLogistic.Hadoop0.20.2.jar:commons-logging-1.0.4.jar:commons-logging-api-1.0.4.jar
   com.winvector.logistic.demo.LogisticTrain
   file:uciCarTrain.tsv "rating ~ buying + maintenance + doors + persons + lug_boot + safety" model.ser
</pre>
<p>The portion of interest is the last three arguments:</p>
<ul>
<li>file:uciCarTrain.tsv :  The URI pointing to the file containing the training data.</li>
<li> &#8220;rating ~ buying + maintenance + doors + persons + lug_boot + safety&#8221; : The formula specifying that rating will be predicted as a function of  buying, maintenance, doors, persons, lug_boot  and safety.</li>
<li>model.ser :  Where to write the Java Serialized model result.</li>
</ul>
<p>After that we can run the scoring procedure on the held-out test data:</p>
<pre>
java -cp WinVectorLogistic.Hadoop0.20.2.jar:commons-logging-1.0.4.jar:commons-logging-api-1.0.4.jar
   com.winvector.logistic.demo.LogisticScore
   model.ser file:uciCarTest.tsv scored.tsv
</pre>
<p>In this case the last three arguments are:</p>
<ul>
<li>model.ser :  Where to read the Java Serialized model from.</li>
<li>file:uciCarTest.tsv : The URI pointing to the file to make predictions for.</li>
<li>scored.tsv : Where to write the predictions to.</li>
</ul>
<p>The first few lines of the result file are:</p>
<pre>
predict.rating.FALSE	predict.rating.TRUE	buying	maintenance	doors	persons	lug_boot	safety	rating
0.9999999999999392	6.091299561082107E-14	vhigh	vhigh	2	2	small	low	FALSE
0.9999999824028766	1.759712345446162E-8	vhigh	vhigh	2	2	small	high	FALSE
</pre>
<p>These lines are just lines from the file uciCarTest.tsv (same format as uciCarTrain.tsv) copied over with the addition of the first two columns, which show the modeled probabilities of rating being FALSE or TRUE.  The accuracy of the prediction is computed and written into the run log if the data had the rating outcomes in it (otherwise we just get a file of predictions, which is the usual application of machine learning).</p>
<p>The details of running the Hadoop versions of the same process depend on the configuration of your Hadoop environment.  Just unpacking the 0.20.2 version of Hadoop will let you try the single-machine version of the MapReduce Logistic Regression process (which will be much slower than the standalone Java version).  To run the training step the Hadoop command line is as follows (notice this time we do not have to specify the logging jars as they are part of the Hadoop environment):</p>
<pre>
hadoop-0.20.2/bin/hadoop jar WinVectorLogistic.Hadoop0.20.2.jar
   logistictrain
   uciCarTrain.tsv "rating ~ buying + maintenance + doors + persons + lug_boot + safety" model.ser
</pre>
<p>And the scoring procedure is below:</p>
<pre>
hadoop-0.20.2/bin/hadoop jar WinVectorLogistic.Hadoop0.20.2.jar
   logisticscore
   model.ser uciCarTest.tsv scoredDir
</pre>
<p>The only operational differences are that the results are written into the file scoredDir/part-r-00000 (as is Hadoop convention) instead of scored.tsv (and an extra &#8220;offset&#8221; column is also included) and data is handled in Files (to allow Hadoop Paths to be formed) instead of URIs.   The Hadoop training and test steps are able to run in this manner because we have constructed WinVectorLogistic.Hadoop0.20.2.jar as an executable jar file with the class com.winvector.logistic.demo.DemoDriver as the class to execute.  This class uses the standard org.apache.hadoop.util.ProgramDriver pattern to run our jobs under the org.apache.hadoop.util.Tool interface.  This means that the standard Hadoop generic flags for specifying cluster configuration will be respected.</p>
<p>The big benefit of all of this packaging is: if this command is run on a large Hadoop cluster (instead of on a single machine) then the input file could be split up and processed in parallel on many machines.   The easiest way to do this is to use Amazon.com&#8217;s <a target="_blank" href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic MapReduce</a>.  This service (used in conjunction with S3 storage and EC2 virtual machines) allows the immediate remote provisioning and execution on a version 0.20.* Hadoop cluster.  To demonstrate this service we created a new S3 Bucket named wvlogistic.  Into wvlogistic we copied the jar of our code compiled against Hadoop 0.20.2 APIs ( WinVectorLogistic.Hadoop0.20.2.jar ) and a moderate sized synthetic training data set ( bigProb.tsv,  created by running: java -cp WinVectorLogistic.Hadoop0.20.2.jar com.winvector.logistic.demo.BigExample bigProb.tsv ).  Once this has been set up (and you have signed up for the Amazon Elastic MapReduce credentials) you can run the training procedure from the <a target="_blank" href="https://console.aws.amazon.com/elasticmapreduce/home">Amazon web UI</a>.  In five steps (following the directions found in <a href="http://aws.amazon.com/articles/3938">Tutorial: How to Create and Debug an Amazon Elastic MapReduce Job Flow</a> ) the job can be configured and launched.</p>
<p>First: press &#8220;Create New Job Flow&#8221; and choose a job name, check &#8220;Run your own application&#8221; and select &#8220;Custom Jar&#8221;.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep1.png" alt="MRExStep1.png" border="0" width="700" /></p>
<p>Step 1/5<br />
</center></p>
<p>Second: specify the location of the jar in your Bucket and give the command line arguments (prepending S3 paths with &#8220;s3n://&#8221;).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep2.png" alt="MRExStep2.png" border="0" width="700"  /></p>
<p>Step 2/5<br />
</center></p>
<p>Third: select the type and number of machine instances you want, run without an EC2 key pair, enable logging and send the log back to your S3 bucket.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep3.png" alt="MRExStep3.png" border="0" width="700"  /></p>
<p>Step 3/5<br />
</center></p>
<p>Fourth: add the default bootstrap action of configuring the Hadoop cluster.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep4.png" alt="MRExStep4.png" border="0" width="700"  /></p>
<p>Step 4/5<br />
</center></p>
<p>Fifth: confirm and launch the job.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/MRExStep5.png" alt="MRExStep5.png" border="0" width="700"  /></p>
<p>Step 5/5<br />
</center></p>
<p>When the job completes, transfer the result ( bigModel.ser )  back to your local system and you have your new MapReduce-produced logistic model.    We can confirm and use the model locally with a Java command similar to our earlier examples:</p>
<pre>
java -cp WinVectorLogistic.Hadoop0.20.2.jar:commons-logging-1.0.4.jar:commons-logging-api-1.0.4.jar:hadoop-0.20.2-core.jar:log4j-1.2.15.jar
   com.winvector.logistic.demo.LogisticScore
   bigModel.ser bigProb.tsv bigScored.tsv
</pre>
<p>Be aware that at this tens of megabytes scale  there is no advantage in running on a Hadoop cluster (versus using the stand-alone program).  At moderate scale parallelism may not even be attempted (due to block size) and the costs of data motion can overcome the benefit of parallel scans.   The biggest gain is being able to train many models from many gigabytes of data on a single machine without sub-sampling.  While we have the ability to build a logistic model at &#8220;web scale&#8221; (terabytes or petabytes of data) you would not want to use the MapReduce calling pattern until you had a web-scale amount of training data.</p>
<p>The point of this exercise was to take a solid implementation of regularized logistic regression (see our <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">logistic regression article</a>) and use the decomposition into the &#8220;Statistical Query Model&#8221;  (as suggested in the NIPS paper &#8220;Map-Reduce for Machine Learning on Multicore&#8221;) to quickly get an intermediate sophistication machine learning method (more sophisticated than Naive Bayes, less sophisticated than Kernelized Support Vector Machines) working at large (beyond RAM) scale.  Briefly: most of the technique is in an interface that considers the mis-fit, gradient of mis-fit and Hessian of mis-fit as a linear (summable) function over the data.  Or in the &#8220;book&#8217;s worth of preparation so we can write the result in one line&#8221; paradigm: all of the machinery we have been discussing is support so that the following summable interface (part of the source code we are distributing) can be used to do all of the work:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/12/LinearContribution.png" alt="LinearContribution.png" border="0" width="956" height="286" /></p>
<p>Summable Interface<br />
</center></p>
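<p>For readers who cannot see the image, the idea of a summable contribution can be sketched as follows.  This is our own minimal illustration of the pattern and is not the interface shown in the figure or the code in the jar: the class names SummableSketch and Contribution, the method names addRow and merge, and the toy data are all hypothetical, and regularization is left out.  Each data row adds its share of the mis-fit (negative log likelihood), gradient and Hessian, and sums computed on separate slices of the data are simply added together.</p>

```java
public final class SummableSketch {
    /** Running sums of mis-fit, its gradient and its Hessian over the rows seen so far. */
    public static final class Contribution {
        public double misfit = 0.0;
        public final double[] gradient;
        public final double[][] hessian;

        public Contribution(final int dim) {
            gradient = new double[dim];
            hessian = new double[dim][dim];
        }

        /** Add one row (x, y), y in {0,1}, evaluated at coefficients beta. */
        public void addRow(final double[] beta, final double[] x, final double y) {
            double z = 0.0;
            for (int j = 0; j < x.length; ++j) {
                z += beta[j] * x[j];
            }
            final double p = 1.0 / (1.0 + Math.exp(-z));
            // negative log likelihood of this row
            misfit -= y * Math.log(p) + (1.0 - y) * Math.log(1.0 - p);
            for (int j = 0; j < x.length; ++j) {
                gradient[j] += (p - y) * x[j];
                for (int k = 0; k < x.length; ++k) {
                    hessian[j][k] += p * (1.0 - p) * x[j] * x[k];
                }
            }
        }

        /** Fold in the sums from another slice of the data (the reduce step). */
        public void merge(final Contribution o) {
            misfit += o.misfit;
            for (int j = 0; j < gradient.length; ++j) {
                gradient[j] += o.gradient[j];
                for (int k = 0; k < gradient.length; ++k) {
                    hessian[j][k] += o.hessian[j][k];
                }
            }
        }
    }

    public static void main(final String[] args) {
        final double[] beta = {0.1, -0.2};
        final double[][] xs = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        final double[] ys = {0, 0, 1, 1};
        final Contribution all = new Contribution(2);
        final Contribution left = new Contribution(2);
        final Contribution right = new Contribution(2);
        for (int i = 0; i < xs.length; ++i) {
            all.addRow(beta, xs[i], ys[i]);
            (i < 2 ? left : right).addRow(beta, xs[i], ys[i]);
        }
        left.merge(right);
        // the merged slices agree with the single pass, up to floating point rounding
        System.out.println(Math.abs(all.misfit - left.misfit) < 1e-12);
        System.out.println(Math.abs(all.gradient[1] - left.gradient[1]) < 1e-12);
    }
}
```

<p>Because the storage is only a number, a vector and a square matrix in the number of variables and levels, the same sums can be accumulated by a single pass over a file or by many mappers whose results are merged in a reducer; a Newton or Fisher scoring step then uses the merged gradient and Hessian.</p>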
<p>Of course once you have the framework up that makes one non-trivial task easy you have likely made many other non-trivial tasks easy.</p>
<p>We hope this demonstration and examining the source code in our WinVectorLogistic.Hadoop0.20.2.jar will help you find ways to tackle your large data machine learning problems.</p>
<hr/>
<p>Code License:</p>
<blockquote><p>
Packages com.winvector.*, extra.winvector.*<br />
	     Code for performing logistic regression on Hadoop.<br />
	     Copyright (C) Win Vector LLC 2010 (contact: John Mount jmount@win-vector.com).<br />
	     Distributed under GNU Affero General Public License version 3 (2007, see http://www.gnu.org/licenses/agpl.html ).<br />
	       This program is free software: you can redistribute it and/or modify<br />
	       it under the terms of the GNU Affero General Public License as<br />
	       published by the Free Software Foundation, only version 3 of the<br />
	       License.<br />
	       This program is distributed in the hope that it will be useful,<br />
	       but WITHOUT ANY WARRANTY; without even the implied warranty of<br />
	       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the<br />
	       GNU Affero General Public License for more details.<br />
	       You should have received a copy of the GNU Affero General Public License<br />
	       along with this program.  If not, see <http://www.gnu.org/licenses/>.<br />
	    (Source code in jar, see also http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/ )
</p></blockquote>
<hr/>
<p>Note Dec-15-2011:  We have moved the code distribution to <a target="_blank" href="https://github.com/WinVector/SQL-Screwdriver">github.com/WinVector/SQL-Screwdriver</a> .  We have fixed some major bugs in the supplied optimizers and moved com.winvector.logistic.LogisticScore and com.winvector.logistic.LogisticTrain from freeform arguments to Apache CLI.  The new command lines need flags as shown below:</p>
<pre>
usage: com.winvector.logistic.LogisticTrain
 -formula &lt;arg&gt;      formula to fit
 -inmemory           if set data is held in memory during training
 -resultSer &lt;arg&gt;    (optional) file to write seriazlized results to
 -resultTSV &lt;arg&gt;    (optional) file to write TSV results to
 -trainClass &lt;arg&gt;   (optional) alternate class to use for training
 -trainHDL &lt;arg&gt;     XML file to get JDBC connection to training data
                     table
 -trainTBL &lt;arg&gt;     table to use from database for training data
 -trainURI &lt;arg&gt;     URI to get training TSV data from
</pre>
<pre>
usage: com.winvector.logistic.LogisticScore
 -dataHDL &lt;arg&gt;      XML file to get JDBC connection to scoring data table
 -dataTBL &lt;arg&gt;      table to use from database for scoring data
 -dataURI &lt;arg&gt;      URI to get scoring data from
 -modelFile &lt;arg&gt;    file to read serialized model from
 -resultFile &lt;arg&gt;   file to write results to
</pre>
<hr/>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Personal Perspective on Machine Learning</title>
		<link>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=a-personal-perspective-on-machine-learning</link>
		<comments>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/#comments</comments>
		<pubDate>Sun, 31 Oct 2010 21:45:48 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[History]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1551</guid>
		<description><![CDATA[Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence.  I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature.<span id="more-1551"></span><br />
In the early days <a target="_blank" href="http://en.wikipedia.org/wiki/Machine_learning">machine learning</a> and artificial intelligence were famous for promising far too much and delivering far too little.  This has changed.  Artificial decision and reasoning systems are now everywhere.  One of the things masking the breadth and authority of artificial intelligence is the current prejudice: &#8220;if a system is well understood or works then it is no longer called artificial intelligence.&#8221;  A working system becomes a database, expert system, rules engine, machine learning platform, analytics dashboard, pattern recognition system or statistics warehouse.  We clearly have not reached anywhere near building a conversational intelligence (like Hal from 2001 or <a target="_blank" href="http://mzlabs.com/MZLabsJM/page6/Gerty/Gerty.html">Gerty</a> from Moon).  Yet every day machines decide if your credit card is accepted, advise on medical care, route goods, curate information and control vast industrial plants.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Hal-9000.jpg" alt="Hal-9000.jpg" border="0" width="150" height="150" /><br />
<br/>Hal 9000<br />
</center></p>
<p>There have been vast improvements in artificial intelligence.  Much of the improvement has been driven by the engineering effects of Moore&#8217;s Law (resulting in my mobile phone&#8217;s processor having 12 times the clock speed and over 32 times the memory of an $8 million <a target="_blank" href="http://en.wikipedia.org/wiki/Cray-1">Cray 1 super computer</a>)  and significant machine learning research results.  These machine size changes happened during the productive careers of many researchers, so ideas are often evaluated at a series of radically different machine capabilities and data scales.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Cray-1-deutsches-museum.jpg" alt="Cray-1-deutsches-museum.jpg" border="0" width="487" height="536" /><br />
<br/>Cray 1<br />
</center></p>
<p>von Neumann himself commented that scale was a major limiting factor in early computers.  He asked how you could be expected to achieve anything significant even from a roomful of geniuses if (as with his early computers) all notes, communication and memory were limited to less than a single typed page.  von Neumann&#8217;s comment stands in contrast to science fiction scientists and early boosters of artificial intelligence who always seem to be in awe of their own creations.  Computers are certainly much larger, but we need to be humble and put off deciding if we are yet in the era of large computers (compared to human or animal brains).  Everything we are doing now may still just be artificial intelligence&#8217;s pre-history and prologue.  Feynman in his lectures on computation mentions that RNA transcription can be estimated to take around 100 kT of energy to transcribe a bit while a transistor may easily use 100,000,000 kT energy units to switch states.  This means for the amount of heat the human head dissipates (energy supply and heat dissipation are rapidly becoming the most relevant measures of computational power) you could do a million times more work using RNA techniques (if you knew how) than with transistors.  So computers may not yet be what we should call large (though they are likely getting there).  What we currently call <a target="_blank" href="http://www.win-vector.com/blog/2009/08/on-the-hysteria-over-the-cloud/">&#8220;datacenters&#8221;</a> are in fact block sized computers (consuming an enormous amount of energy and dissipating a huge amount of heat).</p>
<p><center><br />
<img  target="_blank" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/google_data_center_lenoir.jpg" alt="google_data_center_lenoir.jpg" border="0" width="512" height="355" /><br />
<br/>A datacenter (or a block sized computer)<br />
</center></p>
<p>Not all improvements in machine intelligence have come from (or are to come from) improvements in hardware.  Many of the improvements came from machine learning research results and these are what I will outline below.</p>
<p>Early machine learning algorithms were driven by analogy.  This led us to perceptrons (1957, fairly early in the history of computer science) and neural nets.  These methods have their successes but were largely over used and developed before researchers developed a good list of desirable properties of a machine learning method.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/220px-Neural_network_example.svg_.png" alt="220px-Neural_network_example.svg.png" border="0" width="220" height="293" /><br />
<br/>Neural Net diagram<br />
</center></p>
<p>These methods live on but are,  in my opinion, not currently competitive.  Some of their important ideas and contributions have been revived from time to time, such as the online update rules becoming what we now call stochastic gradients.</p>
<p>A list of (often incompatible) desirable properties of a machine learning algorithm is the following:</p>
<ul>
<li>Able to represent complicated functions</li>
<li>Good generalization performance (quality predictions on data not seen during training)</li>
<li>Unique optimal model for a given set of data and feature definitions</li>
<li>Efficient and well characterized solution method</li>
<li>Consistent summary statistics</li>
<li>Preference for simple models</li>
</ul>
<p>We divert from this list for a bit of background and context.</p>
<p>The neural net was largely celebrated for its ability to represent complex functions and the perceived efficiency of its newer back-propagation based training method (related to the <a target="_blank" href="http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/">efficient calculation of gradients</a>).  The downsides were you never knew if your neural net was the right one (even assuming you had the right features, layout and training data) and could not be sure you were biasing towards simple models that might perform well on novel queries.  Great effort was expended in extending neural nets based on the supposition they should work as they were an analogy to how we imagined biological neurons might function.  An almost mystic hope was derived from the non-linear nature and special properties of the sigmoid curve (which was in fact a curve already known to statisticians).</p>
<p>Other methods than neural nets also had early success.  The field of information retrieval (which was not &#8220;sexy&#8221; prior to the Web) had huge success since the 1960s with <a target="_blank" href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Rocchio_Classification">Rocchio Classification</a>, and <a target="_blank" href="http://en.wikipedia.org/wiki/Tf–idf">TF/IDF</a> methods.  The early success of these methods may have in fact delayed research on current hot research areas such as segmentation and author topic models.</p>
<p>Theoretical computer science initially sought to characterize machine learning methods in non-statistical language.  In the 1980s a great amount of ink was spilled on &#8220;learning boolean functions.&#8221;  Papers proving nothing was learnable (by picking a function related to cryptography) alternated with papers proving everything was learnable (for example via amplification techniques like boosting).  Generalization of models to new data remained a theoretical problem that was dealt with by appeals to model complexity and <a target="_blank" href="http://en.wikipedia.org/wiki/Minimum_description_length">MDL</a> (minimum description length).  A major breakthrough in characterizing generalization performance was the <a target="_blank" href="http://en.wikipedia.org/wiki/Probably_approximately_correct_learning">PAC model</a> (probably approximately correct) framework which finally allowed direct treatment of generalization performance.</p>
<p>We now have enough context  to discuss some of the current best of breed machine learning techniques (that address many of the desired properties mentioned above):</p>
<ul>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine">Support Vector Machines</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Kernel_methods">Kernel Methods</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Logistic_regression">Logistic Regression</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Principle_of_maximum_entropy">Maximum Entropy Methods</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">Regularization</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Graphical_model">Graphical Models</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Conditional_random_field">Conditional Random Fields</a></li>
</ul>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/556px-Svm_max_sep_hyperplane_with_margin.png" alt="556px-Svm_max_sep_hyperplane_with_margin.png" border="0" width="278" /><br />
<br/><br />
Typical SVM maximum margin diagram<br />
</center></p>
<p>Not all of these methods are new (Logistic Regression for example dates from 1925 and is itself based on regression which goes back to Gauss).  But the concerns these methods address are all much more statistical than artificial intelligence in nature.  For example we don&#8217;t  suppose that there is some cryptographically obscured combination of features that we need to find to make the best prediction.  We instead worry about detecting which features are useful and note that it is a significant (though solvable) problem to correctly use combinations of useful features (phrased as statistical concerns: feature to feature dependencies and higher order interactions).  Machine learning has always run where statisticians fear to tread.   But more and  more often we are seeing that the methods and concerns of statisticians are what are needed to achieve many of the listed desired properties of machine learning models.</p>
<p>The methods I have singled out for praise are very effective and achieve a number of our listed desired properties.  For example:  both logistic regression and maximum entropy have a unique solution that is easy to find.  They are also both consistent with all summaries known during training.  That is: if 30% of the positive training data has a feature present then 30% of the data also has the feature present when weighted by the model&#8217;s score (so the model score shares a lot of properties with training truth).  Support Vector Machines also have well understood solutions and a theory (called maximum margin) that directly addresses generalization (good predictions on new data).  Kernel Methods (both as used in SVMs and elsewhere) allow controlled introduction of very complex functions.  Graphical Models and Conditional Random Fields also allow the controlled introduction of modeled dependencies in the data.</p>
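<p>The &#8220;consistent with all summaries&#8221; property can be checked numerically. The following is a minimal sketch, not a production fit: the toy data, the two binary features and the plain gradient-ascent fitter are all assumptions made up for illustration. It fits an unregularized logistic regression with an intercept and compares the share of positives that have the first feature present with the same share weighted by the model&#8217;s score.</p>

```python
import math

# Hypothetical toy data: rows are (x1, x2, label) with binary features.
# There are 10 positives, 3 of which (30%) have feature x1 present.
rows = (
    [(1, 0, 1)] * 1 + [(1, 1, 1)] * 2 + [(0, 1, 1)] * 3 + [(0, 0, 1)] * 4 +
    [(1, 0, 0)] * 1 + [(1, 1, 0)] * 1 + [(0, 1, 0)] * 2 + [(0, 0, 0)] * 4
)

def predict(w, x1, x2):
    """Model score: logistic function of intercept plus weighted features."""
    return 1.0 / (1.0 + math.exp(-(w[0] + w[1] * x1 + w[2] * x2)))

def fit(rows, steps=5000, rate=1.0):
    """Unregularized logistic regression by plain gradient ascent."""
    w = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x1, x2, y in rows:
            err = y - predict(w, x1, x2)
            # The constant 1 plays the role of the intercept feature.
            for j, xj in enumerate((1, x1, x2)):
                grad[j] += err * xj / n
        w = [wj + rate * gj for wj, gj in zip(w, grad)]
    return w

w = fit(rows)
positives = sum(y for _, _, y in rows)
# Share of positive training rows with x1 present (training truth).
frac_truth = sum(x1 * y for x1, _, y in rows) / positives
# Share of rows with x1 present when each row is weighted by its model score.
scores = [predict(w, x1, x2) for x1, x2, _ in rows]
frac_model = sum(x1 * s for (x1, _, _), s in zip(rows, scores)) / sum(scores)
print(frac_truth, frac_model)
```

<p>The two printed fractions should agree up to numerical tolerance, because at the fitted optimum the gradient for each feature (including the intercept) is zero.</p>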
<p>It is now common to call what was previously thought of as artificial intelligence or machine learning: &#8220;statistical machine learning.&#8221;  This reflects that the kind of prediction and characterization we expect from machine learning algorithms are in fact statistical concerns that we can deal with if we have enough data and enough computational resources. </p>
<p>The current important issues for statistical machine learning include:</p>
<ul>
<li>Dealing with very large datasets (driving the return of simpler methods like Naive Bayes)</li>
<li>Dealing with lack of training data (driving interest in clustering and manifold regularization methods)</li>
<li>Dealing with unstructured data and text mining (driving interest in information extraction and segmentation via generative models)</li>
</ul>
<p>Just as Wigner famously wrote about &#8220;The Unreasonable Effectiveness of Mathematics&#8221; in the 1960s, Halevy, Norvig and Pereira write about the &#8220;Unreasonable Effectiveness of Data.&#8221;  They argue that we are in the age of big data (or the age of analysts).  Or, as Varian observed: &#8220;it is a good time to supply a good complementary to data&#8221; (i.e. it is a good time to be an analyst).  I would temper this by noting that we are likely in the age of unmarked and unstructured data.  Less often are we asked to automate a known prediction and more often we are asked to cluster, characterize and segment wild data.  In my opinion the hard problem in machine learning has moved from prediction to characterization.  With enough marked training data (that is, data for which we know both the observables and the desired outcome) it is now quite possible to use standard techniques and libraries to build a very good predictive model.  However, it is still hard to characterize, segment or extract useful information from the wealth of unstructured and unmarked data that is upon us.  And this is where a lot of the current research in statistical machine learning is directed.  </p>
<p>Of course characterization and clustering have their own infamous history.  Rota wrote: &#8220;&#8230; Or a subject is important, but nobody understands what is going on; such is the case with quantum field theory, the distribution of primes, pattern recognition and cluster analysis.&#8221;  Artificial intelligence may be moving from areas where computer scientists have over-promised to areas where statisticians have over-promised.  But this is not a disaster: the most valuable research tends to be done in hectic times in messy fields, not in calm times in neat fields.  And the already large scale adoption of statistical machine learning techniques means there is immediate great client value in even seemingly small improvements in understanding, explanation, documentation, training, tools, libraries and techniques.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Xbarst1.jpg" alt="Xbarst1.jpg" border="0" width="384" height="398" /><br />
<br/><br />
Classic attempt to add structure to text<br />
</center></p>
<p>(images from Wikipedia)</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>The Local to Global Principle</title>
		<link>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-local-to-global-principle</link>
		<comments>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 16:37:53 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Dynamic Programming]]></category>
		<category><![CDATA[Local to Global]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[PageRank]]></category>
		<category><![CDATA[Problem Solving]]></category>
		<category><![CDATA[Speech Recognition]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1123</guid>
		<description><![CDATA[We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods.  We have produced both a stand-alone <a href="http://www.win-vector.com/dfiles/LocalToGlobal.pdf">PDF</a> (more legible) and an HTML/blog form (more skimmable).<br />
<span id="more-1123"></span></p>
<h1 align="center">The Local to Global Principle</h1>
<p align="center"><strong>John Mount<a name="tex2html3" href="#foot21" id="tex2html3"><sup>1</sup></a></strong></p>
<p></p>
<p align="center"><b>Date:</b> November 11, 2009</p>
<hr />
<h3>Abstract:</h3>
<div>We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods.</div>
<p></p>
<h2><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">Contents</a></h2>
<p><!--Table of Contents--></p>
<ul>
<li><a name="tex2html32" href="#SECTION00020000000000000000" id="tex2html32">Introduction</a></li>
<li><a name="tex2html33" href="#SECTION00030000000000000000" id="tex2html33">The Examples</a>
<ul>
<li><a name="tex2html34" href="#SECTION00031000000000000000" id="tex2html34">Web Page Link Analysis</a></li>
<li><a name="tex2html35" href="#SECTION00032000000000000000" id="tex2html35">Natural Language Processing</a></li>
<li><a name="tex2html36" href="#SECTION00033000000000000000" id="tex2html36">Machine Learning</a></li>
</ul>
<p></li>
<li><a name="tex2html37" href="#SECTION00040000000000000000" id="tex2html37">Some Methods</a>
<ul>
<li><a name="tex2html38" href="#SECTION00041000000000000000" id="tex2html38">Local Methods</a></li>
<li><a name="tex2html39" href="#SECTION00042000000000000000" id="tex2html39">Globalization Methods</a></li>
</ul>
<p></li>
<li><a name="tex2html40" href="#SECTION00050000000000000000" id="tex2html40">Conclusion</a></li>
<li><a name="tex2html41" href="#SECTION00060000000000000000" id="tex2html41">Bibliography</a></li>
<li><a name="tex2html42" href="#SECTION00070000000000000000" id="tex2html42">Acknowledgement</a></li>
</ul>
<p><!--End of Table of Contents--></p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">Introduction</a></h1>
<p><font>A common vain hope of computer scientists and algorithm designers is that a domain expert has already &#8220;boiled down&#8221; a problem to a precise, but unsolved, algorithmic core. On this point the mathematician Gian-Carlo Rota wrote:</font></p>
<blockquote><p><font>One of the rarest mathematical talents is the talent for applied mathematics, for picking out of a maze of experimental data the two or three parameters that are relevant, and to discard all other data. This talent is rare. It is taught only at the shop level.[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;A Mathematician&#8217;s Gossip&#8221;]</font></p></blockquote>
<p><font>We describe a useful tool for designing algorithmic applications and solutions which we call &#8220;the local to global principle.&#8221; The local to global principle is the method of deriving applications and solutions by specifying &#8220;local&#8221; (and deliberately myopic) heuristics, critiques and methods followed by using a powerful general method to &#8220;globalize&#8221; this specification into a complete solution.</font></p>
<p><font>There are many important problem solving prescriptions and methods of thought already systematically described and taught:</font></p>
<ul>
<li>Bacon&#8217;s &#8220;New Organon&#8221; and Mill&#8217;s principles of inductive logic.[<a href="#Mill">Mil02</a>]</li>
<li>Feynman&#8217;s genius method.[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;Ten Lessons I Wish I Had Been Taught&#8221;]</li>
<li>Reductionism (top down and bottom up).</li>
<li>Divide and conquer.[<a href="#IntroductionToAlgorithms">CLRS09</a>]</li>
<li>Forward deduction, backwards induction.</li>
<li>Root Cause Analysis.</li>
<li>Polya&#8217;s heuristic and conjecture and prove patterns [<a href="#citeulike:679515">Pol71</a>,<a href="#Polya1">Pol54a</a>,<a href="#Polya2">Pol54b</a>]</li>
<li>Doron Zeilberger&#8217;s &#8220;Method of Undetermined Generalization and Specialization.&#8221; [<a href="#Zeilberger:1995p277">Zei95</a>]</li>
<li>Zbigniew Michalewicz and David B. Fogel&#8217;s presentation of evolutionary algorithms.[<a href="#HTSMH">MF00</a>]</li>
</ul>
<p><font>The local to global principle is more of an organizational pattern than a &#8220;computer aided technique,&#8221; as no one specific species of software or family of notation is required.</font></p>
<p><font>The local to global principle can be identified in a number of previous important applications, but it is not currently an identified principle.<a name="tex2html4" href="#foot244" id="tex2html4"><sup>2</sup></a> The principle is very general, so any succinct description of it is going to be painfully vague. Instead, we explain the principle by discussing some example applications and methods.  For each of our example applications we deliberately use a different globalization technique. The effective algorithmist or practitioner must in fact come to each problem already familiar with a reasonably large set of already known local and global techniques, so we conclude with some appropriate fields of study and preparation.</font></p>
<p><font>The local to global principle is divided into two parts: local encoding of the problem followed by a globalization step that uses the encoding. The guiding feature of local encodings is that they are usually easy to compute from the data at hand. Any extension that looks like enumeration, search or optimization is best left to the global step. The local step is essentially the translation of your problem into an abstract language that is ready for the globalization step. In contrast globalization methods are often &#8220;off the shelf&#8221; in that once you abstract and encode the particulars of your problem you can look for pre-existing useful methods or software to finish your solution. The idea of globalization is to find a best overall or global compromise between competing local criteria. The local step does not so much have to avoid conflicts but instead &#8220;price them.&#8221; There is also an important trade-off that sophisticated local techniques allow the use of simpler globalization methods and more powerful globalization methods allow the use of simpler local techniques.</font></p>
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">The Examples</a></h1>
<p><font>To demonstrate the breadth of the local to global principle we choose a diverse collection of example applications: web page link analysis, natural language processing and machine learning. For each example application we will set up the problem, introduce a reasonable set of local criteria and pick an appropriate globalization technique. We will favor finishing each example without describing the globalization technique in detail, as this would distract from our point and is best left to the given references. These examples are previously solved problems; our contribution is demonstrating the shared underlying principle.</font></p>
<h2><a name="SECTION00031000000000000000" id="SECTION00031000000000000000">Web Page Link Analysis</a></h2>
<p><font>For our first example application we demonstrate web page link analysis in the form of the famous PageRank score.[<a href="#Page:1998p2689">PBMW98</a>]</font></p>
<p><font>One of the many good ideas leading up to the early Google search engine was the design of a non-text based measure of importance or interestingness of web pages. A search engine that could fold &#8220;interestingness&#8221; or popularity into its notion of relevance could better sort important pages into the search user&#8217;s view. When the web got so large that there were many pages that were exact matches to any common user query, popularity became a critical consideration. A link based notion of popularity exploits what is important about the web (the link structure; for example, see [<a href="#Kleinberg:1997p32">Kle97</a>]) and avoids having to depend on a lot of natural language understanding technology. This technique also uses authority outside of the given page, so it has some hope of being resistant (though not immune) to web-spam.</font></p>
<p><font>Taken all at once, the task of designing a score of page importance is daunting. However, by working in stages (as the local to global principle prescribes) we can quickly derive interesting scores including the famous PageRank score. We start with the idea that popularity (or the amount of web traffic a page receives) is (loosely) correlated with importance. So for our first approximation step we decide to try to estimate popularity (or web traffic) and use this estimate as our importance score. Accurately estimating web traffic is itself a hard problem and a big industry (just a few of the major companies involved in this are: Google/Urchin, Quantcast, Nielsen, comScore, Alexa, Hitwise and LookSmart). For our second approximation step we are going to try to estimate popularity from the link structure<a name="tex2html6" href="#foot43" id="tex2html6"><sup>4</sup></a> of the web (using no other measurements or historic data) and use this as our score. This link based estimate is unlikely to completely reproduce real web surfing patterns, but it is very interesting in its own right and has been proven in the market to be a useful score.</font></p>
<p><font>Now the problem is to try to estimate the popularity of a web page from the link structure of the web. We claim: we can generate a useful (but not necessarily accurate) estimate of web traffic from the web&#8217;s link structure alone. Consider Figure&nbsp;<a href="#fig:Links1">1</a> where we have a universe of three web pages A, B and C that link to each other in the pattern illustrated by what is called a graph.<a name="tex2html7" href="#foot45" id="tex2html7"><sup>5</sup></a></font></p>
<div align="center"><a name="fig:Links1" id="fig:Links1"></a><a name="50"></a></p>
<table>
<caption align="bottom"><strong>Figure 1:</strong> A set of Mutually Linked Web Pages</caption>
<tr>
<td>
<div align="center"><img width="300" height="436" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/Links1.png" alt="Image Links1"></div>
</td>
</tr>
</table>
</div>
<p><font>In Figure&nbsp;<a href="#fig:Links1">1</a> we can consider each link to another page as evidence the other page is interesting or popular. One idea is to simulate a very simple web surfer who clicks on the links on a page uniformly at random. This is called &#8220;the random surfer model&#8221; and even a model this simple allows us to read some useful information from the link structure of the web. For instance, we could ask what fraction of their time the random surfer spends on each web page, with an eye to the idea that the pages the random surfer visits more often are the more important ones. Let <img width="35" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg2.png" alt="$ p(A)$"> denote the proportion of time the random web surfer spends on page A (and define <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg3.png" alt="$ p(B)$"> and <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg4.png" alt="$ p(C)$"> similarly). While we do not know any of <!-- MATH<br />
 $p(A), p(B)$<br />
 --><br />
<img width="76" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg5.png" alt="$ p(A), p(B)$"> or <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg4.png" alt="$ p(C)$"> we can derive some relationships between them by inspecting the link graph:</font></p>
<p></p>
<div align="center"><!-- MATH<br />
 \begin{eqnarray*}<br />
p(A) &#038; = &#038; \frac{1}{2} P(B) + P(C) \\<br />
p(B) &#038; = &#038; \frac{1}{2} P(A) \\<br />
p(C) &#038; = &#038; \frac{1}{2} P(A) + \frac{1}{2} P(B) .<br />
\end{eqnarray*}<br />
 --></p>
<table cellpadding="0" align="center" width="100%">
<tr valign="middle">
<td nowrap align="right"><img width="35" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg6.png" alt="$\displaystyle p(A)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="109" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg8.png" alt="$\displaystyle \frac{1}{2} P(B) + P(C)$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg9.png" alt="$\displaystyle p(B)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="52" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg10.png" alt="$\displaystyle \frac{1}{2} P(A)$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg11.png" alt="$\displaystyle p(C)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="125" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg12.png" alt="$\displaystyle \frac{1}{2} P(A) + \frac{1}{2} P(B) .$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><font>The first equation is just reading from the graph that: all visits on page-A must come from pages B and C, half of the visitors on page-B continue on to A and all of the visitors on page-C continue on to A. The second and third equations are the appropriate summaries of how traffic is routed to pages B and C. We can insist that <!-- MATH<br />
 $P(A) + P(B)<br />
+ P(C) = 1$<br />
 --><br />
<img width="183" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg13.png" alt="$ P(A) + P(B) + P(C) = 1$"> as we want these numbers to represent the fraction of time the random web surfer spends on each page. A more sophisticated model would add more features<a name="tex2html9" href="#foot245" id="tex2html9"><sup>6</sup></a> to get a more useful result.</font></p>
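<p><font>The random surfer can also be followed directly. The sketch below is only an illustration: the <code>links</code> table is read off the three equations above, and the function name and step count are arbitrary choices. It repeatedly pushes the surfer&#8217;s probability mass along the links, one click per step, until the shares stop changing.</font></p>

```python
# Link structure read off the three equations in the text:
# A links to B and C, B links to A and C, C links only to A.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}

def surf(links, steps=200):
    """Push the surfer's probability mass along links, one click per step."""
    # Start with the surfer equally likely to be on any page.
    share = {page: 1.0 / len(links) for page in links}
    for _ in range(steps):
        nxt = {page: 0.0 for page in links}
        for page, outs in links.items():
            # Each outgoing link is clicked uniformly at random.
            for target in outs:
                nxt[target] += share[page] / len(outs)
        share = nxt
    return share

share = surf(links)
print(share)
```

<p><font>The shares the iteration settles on satisfy the three local equations and sum to one, so this is one (slow but simple) way to globalize the local rules.</font></p>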
<p><font>It turns out we have already encoded enough local rules to completely determine <!-- MATH<br />
 $P(A), P(B)$<br />
 --><br />
<img width="85" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg14.png" alt="$ P(A), P(B)$"> and <img width="40" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg15.png" alt="$ P(C)$"> . In this example application an algorithmist already familiar with linear algebra&nbsp;[<a href="#Strang">Str76</a>] would recognize these local conditions as &#8220;a system of linear equations.&#8221; Solving even web-scale systems of linear equations is considered easy with modern techniques and modern computers. For our small example the solution is: <!-- MATH<br />
 $p(A) = \frac{4}{9}$<br />
 --><br />
<img width="67" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg16.png" alt="$ p(A) = \frac{4}{9}$"> , <!-- MATH<br />
 $p(B) = \frac{2}{9}$<br />
 --><br />
<img width="68" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg17.png" alt="$ p(B) = \frac{2}{9}$"> , and <!-- MATH<br />
 $p(C) = \frac{3}{9}$<br />
 --><br />
<img width="67" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg18.png" alt="$ p(C) = \frac{3}{9}$"> . The role of the local steps was to reduce a new problem (estimating the importance or popularity of a web page from the link structure) to something with <em>already known</em> techniques (like solving a linear system as illustrated in Figure&nbsp;<a href="#fig:LinAlg">2</a>).</font></p>
<div align="center"><a name="fig:LinAlg" id="fig:LinAlg"></a><a name="79"></a></p>
<table>
<caption align="bottom"><strong>Figure 2:</strong> Linear Algebra Solution: As Taught in School</caption>
<tr>
<td>
<div align="center"><img width="400" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LinAlg.jpg" alt="Image LinAlg"></div>
</td>
</tr>
</table>
</div>
<p><font>So page-A is the most important page by the PageRank measure.</font></p>
<p><font>In this example application the local step was setting up the system of linear equations (which are easy to derive from the web link graph) and the global step was solving the entire system for the final scores (which were not obvious). You spend most of your time encoding the problem and then use a known technique (in this case solving a linear system) to finish the solution.</font></p>
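<p><font>For the three-page example the global step fits in a few lines. The following is a sketch using exact fractions and textbook Gauss-Jordan elimination; the helper <code>solve</code> is a hypothetical name written for this illustration, and a web-scale system would of course use specialized sparse methods instead. Because the three traffic equations are redundant once the shares must sum to one, the third equation is replaced by the normalization condition.</font></p>

```python
from fractions import Fraction

def solve(matrix, rhs):
    """Gauss-Jordan elimination over exact fractions (small dense systems only)."""
    n = len(rhs)
    # Build the augmented matrix [matrix | rhs] with exact arithmetic.
    rows = [[Fraction(v) for v in row] + [Fraction(b)]
            for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        # Clear this column from every other row.
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[n] for row in rows]

half = Fraction(1, 2)
# Unknowns are ordered p(A), p(B), p(C).
system = [
    [1, -half, -1],  # p(A) - 1/2 p(B) - p(C) = 0
    [-half, 1, 0],   # p(B) - 1/2 p(A) = 0
    [1, 1, 1],       # p(A) + p(B) + p(C) = 1
]
p_a, p_b, p_c = solve(system, [0, 0, 1])
print(p_a, p_b, p_c)  # 4/9 2/9 1/3
```

<p><font>The dropped equation for p(C) still holds for the result, which is a quick check that the local rules were encoded consistently.</font></p>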
<h2><a name="SECTION00032000000000000000" id="SECTION00032000000000000000">Natural Language Processing</a></h2>
<p><font>Our next example application is natural language processing&nbsp;[<a href="#CharniakBook">Cha96</a>,<a href="#Charniak:1997p1484">Cha97</a>]. Speech recognition (the alignment or transcription of recognized intelligible segments of sound to written text) is an important problem in natural language processing. An example problem is the need to find the most likely text matching a sequence of sounds such as is shown in Figure&nbsp;<a href="#fig:SoundSeq1">3</a>.</font></p>
<div align="center"><a name="fig:SoundSeq1" id="fig:SoundSeq1"></a><a name="89"></a></p>
<table>
<caption align="bottom"><strong>Figure 3:</strong> A Sequence of Sounds</caption>
<tr>
<td>
<div align="center"><img width="500" height="69" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq1.png" alt="Image SoundSeq1"></div>
</td>
</tr>
</table>
</div>
<p><font>Consider Figure&nbsp;<a href="#fig:SoundSeq3">4</a> (which shows a bad transcription) and Figure&nbsp;<a href="#fig:SoundSeq2">5</a> (which shows a good transcription).</font></p>
<div align="center"><a name="fig:SoundSeq3" id="fig:SoundSeq3"></a><a name="98"></a></p>
<table>
<caption align="bottom"><strong>Figure 4:</strong> A Bad Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq3.png" alt="Image SoundSeq3"></div>
</td>
</tr>
</table>
</div>
<div align="center"><a name="fig:SoundSeq2" id="fig:SoundSeq2"></a><a name="105"></a></p>
<table>
<caption align="bottom"><strong>Figure 5:</strong> A Good Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq2.png" alt="Image SoundSeq2"></div>
</td>
</tr>
</table>
</div>
<p><font>Our claim: we can (given access to training data, and this is the age of data&nbsp;[<a href="#Halevy:2009p2327">HNP09</a>]) solve this problem with a local step that is a set of simple criticisms of proposed transcriptions. A good starting point is a database of previous sounds to text transcriptions. This database allows the construction of a series of tables that give the historic frequency (or probability) of all of the following:</font></p>
<ul>
<li>Prior probability of each sound</li>
<li>Probability of each sound given the immediately previous sound</li>
<li>Prior probability of each word</li>
<li>Probability of each word given the immediately previous word</li>
<li>Which combinations of word fragments are legitimate words</li>
<li>Probability of each sound being assigned to each word fragment (syllables, phonemes and so on).</li>
</ul>
<p><font>These tables encode a &#8220;speech model&#8221; (the rules involving sounds only), a language model (the rules involving text or words only) and the linkage between the two models. These models are deliberately simple in that they capture only local interactions (like probability of a word given the word before it) but no long range interactions (like subject predicate agreement).</font></p>
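<p><font>As a rough sketch of how the word-level tables might be built, the code below counts single words and adjacent word pairs in a tiny made-up list of transcripts. The <code>transcripts</code> data and the function names are hypothetical; a real system would also need the sound-level tables and some smoothing for unseen events.</font></p>

```python
from collections import Counter

# Hypothetical stand-in for a database of past transcriptions.
transcripts = [
    "nine one one",
    "one nine nine",
    "nine one nine",
]

word_counts = Counter()
pair_counts = Counter()
for line in transcripts:
    words = line.split()
    word_counts.update(words)
    # Adjacent (previous word, word) pairs within one transcript.
    pair_counts.update(zip(words, words[1:]))

total_words = sum(word_counts.values())

def prior(word):
    """Historic frequency of a word, ignoring context."""
    return word_counts[word] / total_words

def given_previous(word, previous):
    """Historic frequency of a word among words that followed `previous`."""
    followers = sum(n for (first, _), n in pair_counts.items()
                    if first == previous)
    return pair_counts[(previous, word)] / followers if followers else 0.0

print(prior("one"), given_previous("one", "nine"))
```

<p><font>Each table entry depends only on a word and its immediate neighbor, which is exactly the deliberate myopia the local step calls for.</font></p>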
<p><font>Each box, nested box and arrow on our diagram represents one possible local critique. For each item in our diagram (again, the boxes and arrows) we can use our tables to assign a goodness or plausibility score. For instance bad word to word transitions (like &#8220;won&#8221; <!-- MATH<br />
 $\rightarrow$<br />
 --><br />
<img width="19" height="13" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg19.png" alt="$ \rightarrow$"> &#8220;won&#8221;) will be rare in our historic tables, so just looking up probabilities from the tables (or, better, using the logarithms of probabilities) gives us a &#8220;plausibility score&#8221; that prefers known patterns of language. Then a score for the overall transcription can be derived by multiplying all of the local scores together (or, equivalently, adding their logarithms). These local scores (though simple) already have encoded enough evidence to prefer the good transcription to the bad transcription <em>without</em> requiring any deep knowledge of speech, text or the meaning of the text. This is because the bad transcription has a series of obvious flaws such as: unlikely sound to word fragment assignments and unlikely word to word transitions.</font></p>
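<p><font>To make the scoring concrete, here is a small sketch that scores two candidate word sequences by summing the logarithms of their local scores. All probabilities in the tables are invented for illustration, the <code>FLOOR</code> value for unseen entries is an assumption, and the sound to word fragment terms are left out to keep the example short.</font></p>

```python
import math

# Hypothetical table values; a real system would read these from historic data.
word_prior = {"nine": 0.4, "one": 0.4, "won": 0.2}
word_given_previous = {
    ("nine", "one"): 0.5,
    ("one", "one"): 0.3,
    ("nine", "won"): 0.05,
    ("won", "won"): 0.01,
}
FLOOR = 1e-6  # assumed probability for entries missing from the tables

def log_score(words):
    """Sum of log local scores, equal to the log of their product."""
    total = math.log(word_prior.get(words[0], FLOOR))
    for previous, word in zip(words, words[1:]):
        # Each later word is critiqued by its prior and by its predecessor.
        total += math.log(word_prior.get(word, FLOOR))
        total += math.log(word_given_previous.get((previous, word), FLOOR))
    return total

good = ["nine", "one", "one"]
bad = ["nine", "won", "won"]
print(log_score(good), log_score(bad))
```

<p><font>With these made-up tables the sequence containing the rare &#8220;won&#8221; to &#8220;won&#8221; transition receives the lower score, even though nothing in the code knows what the words mean.</font></p>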
<div align="center"><a name="fig:SoundSeqPartial" id="fig:SoundSeqPartial"></a><a name="116"></a></p>
<table>
<caption align="bottom"><strong>Figure 6:</strong> Naively Extending a Partial Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeqPartial.png" alt="Image SoundSeqPartial"></div>
</td>
</tr>
</table>
</div>
<p><font>For example consider Figure&nbsp;<a href="#fig:SoundSeqPartial">6</a> where a naive solver is considering the word &#8220;one&#8221; as the third word to fill in. The <em>only</em> local critiques it needs to consider are:</font></p>
<ul>
<li>how likely the word &#8220;one&#8221; is in general (call this <img width="49" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg20.png" alt="$ P[one]$"> )</li>
<li>how likely the word &#8220;one&#8221; is to follow the word &#8220;nine&#8221; (call this <!-- MATH<br />
 $P[one | nine]$<br />
 --><br />
<img width="86" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg21.png" alt="$ P[one \vert nine]$"> )</li>
<li>how likely the letter sequence &#8220;o&#8221; is given the sound &#8220;w&#601;&#8221; (call this <!-- MATH<br />
 $P[o | \text{w\textschwa}]$<br />
 --><br />
<img width="55" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg24.png" alt="$P[o \vert \text{w\textschwa}]$"> )</li>
<li>how likely the letter sequence &#8220;ne&#8221; is given the sound &#8220;n&#8221; (call this <!-- MATH<br />
 $P[ne | \text{n}]$<br />
 --><br />
<img width="41" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg25.png" alt="$ P[ne \vert$">&nbsp; &nbsp;n<img width="7" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg23.png" alt="$ ]$"> ).</li>
</ul>
<p><font>So the local plausibility of the fill-in word &#8220;one&#8221; is: <!-- MATH<br />
 $P[one]<br />
\times P[one | nine] \times P[o | \text{w\textschwa}] \times P[ne |<br />
\text{n}]$<br />
 --><br />
<img width="292" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg28.png" alt="$P[one] \times P[one \vert nine] \times P[o \vert \text{w\textschwa}] \times P[ne \vert \text{n}]$"> . We will call this the critique of &#8220;one&#8221; in position 3 and write it as <!-- MATH<br />
 $C_3(w_2,one)$<br />
 --><br />
<img width="84" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg29.png" alt="$ C_3(w_2,one)$"> where <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg30.png" alt="$ w_2$"> is the word known to be in position 2. Similarly we can generate all of the possible critiques <img width="53" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg31.png" alt="$ C_1(w_1)$"> , <!-- MATH<br />
 $C_2(w_1,w_2)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg32.png" alt="$ C_2(w_1,w_2)$"> , <!-- MATH<br />
 $C_3(w_2,w_3)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg33.png" alt="$ C_3(w_2,w_3)$"> , <!-- MATH<br />
 $C_4(w_3,w_4)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg34.png" alt="$ C_4(w_3,w_4)$"> and the overall critique of a sequence <!-- MATH<br />
 $w_1 \; w_2 \; w_3 \; w_4$<br />
 --><br />
<img width="77" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg35.png" alt="$ w_1 \; w_2 \; w_3 \; w_4$"> : <!-- MATH<br />
 $C_1(w_1)<br />
\times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$<br />
 --><br />
<img width="336" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg36.png" alt="$ C_1(w_1) \times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$"> from our pre-computed tables of probabilities. Notice for all of these critiques only the immediately previous word and the nearby sounds were used to determine the plausibility of the word we are attempting to fit in. Instead of using these critiques to directly fill in a possible solution (or using search) we will package up these critiques (in the form of the <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> ) and pass them on to a powerful separate globalization step called Dynamic Programming&nbsp;[<a href="#DynamicProgramming">Bel57</a>].</font></p>
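<p><font>A minimal sketch of how one such critique could be assembled from pre-computed tables follows. Every probability, the table names (P_WORD, P_FOLLOWS, P_SPELLING), the helper critique() and the ASCII sound code &#8220;w@&#8221; (standing in for the sound written with a schwa) are invented for illustration; they are not the tables used to produce the figures.</font></p>

```python
# Hypothetical probability tables; every number here is made up for illustration.
P_WORD = {"one": 0.02, "nine": 0.01, "won": 0.005}             # P[word]
P_FOLLOWS = {("nine", "one"): 0.10, ("nine", "won"): 0.01}     # P[word | previous word]
P_SPELLING = {("o", "w@"): 0.30, ("ne", "n"): 0.20,            # P[letters | sound]
              ("wo", "w@"): 0.40, ("n", "n"): 0.70}

def critique(prev_word, word, spelling_pairs):
    """Local plausibility of `word` when it follows `prev_word`.

    spelling_pairs lists the (letters, sound) alignments that spell the word.
    Only the previous word and the nearby sounds are consulted.
    """
    score = P_WORD[word] * P_FOLLOWS[(prev_word, word)]
    for letters, sound in spelling_pairs:
        score *= P_SPELLING[(letters, sound)]
    return score

# Two candidate critiques for position 3, given that word 2 is "nine".
c3_one = critique("nine", "one", [("o", "w@"), ("ne", "n")])
c3_won = critique("nine", "won", [("wo", "w@"), ("n", "n")])
```

<p><font>With these made-up tables &#8220;one&#8221; scores higher than &#8220;won&#8221; in position 3, but the point of the sketch is only that each critique is a product of a handful of table look-ups.</font></p>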
<p><font>The globalization or finding of a best overall transcription is not trivial even though our score is simple. This is because the overall <em>best</em> sequence could depend on clever non-local fill-ins (like deliberately picking a less likely first word to allow a later favored transition to a fantastically good third word). Dynamic Programming does not fill in the transcription from left to right, but instead uses a table of scores derived from the left to right arrows and the <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> . In our example Dynamic Programming consists of building a table of information as shown in Figure&nbsp;<a href="#fig:DynBackFill">7</a>. Let <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> represent the word position we are looking at (so <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> ranges from 1 to 4) and let <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> be a variable that ranges over every word in the dictionary.
Our table is indexed by <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> and <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> and, when filled in, <img width="51" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg40.png" alt="$ T(i,w)$"> stores the highest &#8220;plausibility score&#8221; of any partial sequence of words where words 1 through <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> have been filled in and the <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> -th word is <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> .</font></p>
<div align="center"><a name="fig:DynBackFill" id="fig:DynBackFill"></a><a name="134"></a></p>
<table>
<caption align="bottom"><strong>Figure 7:</strong> Dynamic Programming: Back Chaining in <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> for a Solution</caption>
<tr>
<td>
<div align="center"><img width="300" height="298" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/DynTableBackFill.png" alt="Image DynTableBackFill"></div>
</td>
</tr>
</table>
</div>
<p><font>If we already had this magic table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> we could find a best possible sequence by &#8220;back chaining.&#8221; We start by finding a fourth word (<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg41.png" alt="$ w_4$"> ) such that <img width="61" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg42.png" alt="$ T(4,w_4)$"> is maximal (in this case &#8220;one&#8221;). We then find a best third word (<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg43.png" alt="$ w_3$"> ) by enumerating all words and picking <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg43.png" alt="$ w_3$"> such that <!-- MATH<br />
 $T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$<br />
 --><br />
<img width="234" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg44.png" alt="$ T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$"> . We continue back until we have found words <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg30.png" alt="$ w_2$"> and <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg45.png" alt="$ w_1$"> to get a complete best sequence. Notice that we work from right to left (backwards) and except for the starting step we pick each word to match the calculation we are trying to un-roll, not to be the maximal entry in the column. For instance we pick <!-- MATH<br />
 $w_1 = dial$<br />
 --><br />
<img width="70" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg46.png" alt="$ w_1 = dial$"> even though it does not have the highest score, because <!-- MATH<br />
 $T(1,dial) C_2(dial,nine)<br />
C_3(nine,one) C_4(one,one) = T(4,one)$<br />
 --><br />
<img width="433" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg47.png" alt="$ T(1,dial) C_2(dial,nine) C_3(nine,one) C_4(one,one) = T(4,one)$"> is the maximal complete chain.</font></p>
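<p><font>The back chaining step can be sketched in a few lines, assuming a table has already been filled in. The three-word dictionary, the three positions, the critique values and the helper names (critique, back_chain) are all invented for illustration, and exact fractions are used so the equality test from the text can be applied literally.</font></p>

```python
from fractions import Fraction as F

WORDS = ["dial", "nine", "one"]

# Hypothetical transition critiques C_i(v, w); any pair not listed scores 1/10.
C = {(2, "dial", "nine"): F(8, 10), (2, "nine", "one"): F(2, 10),
     (3, "nine", "one"): F(9, 10), (3, "one", "one"): F(5, 10)}

def critique(i, v, w):
    return C.get((i, v, w), F(1, 10))

# A filled-in table T(i, w), assumed to have been built from the critiques above.
T = {(1, "dial"): F(3, 10),    (1, "nine"): F(5, 10),    (1, "one"): F(2, 10),
     (2, "dial"): F(5, 100),   (2, "nine"): F(24, 100),  (2, "one"): F(10, 100),
     (3, "dial"): F(24, 1000), (3, "nine"): F(24, 1000), (3, "one"): F(216, 1000)}

def back_chain(T, n_positions):
    # Start with the best final word, then un-roll the calculation leftwards.
    last = max(WORDS, key=lambda w: T[(n_positions, w)])
    sequence = [last]
    for i in range(n_positions, 1, -1):
        w = sequence[0]
        # Pick the predecessor v with T(i-1, v) * C_i(v, w) == T(i, w).
        v = next(v for v in WORDS if T[(i - 1, v)] * critique(i, v, w) == T[(i, w)])
        sequence.insert(0, v)
    return sequence

best = back_chain(T, 3)
```

<p><font>In this toy table &#8220;nine&#8221; has the largest entry in the first column, yet the chain that un-rolls from the best final entry starts with &#8220;dial&#8221;, which is the behavior described above.</font></p>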
<p><font>Of course, we don&#8217;t start with the table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> already filled in, so we need a procedure to build it. This procedure is the heart of the Dynamic Programming method (for more examples see: &#8220;Introduction to Algorithms&#8221;&nbsp;[<a href="#IntroductionToAlgorithms">CLRS09</a>]). Notice that <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg48.png" alt="$ T(1,w)$"> can be filled in for all <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> just by plugging in words and computing the critiques <img width="46" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg49.png" alt="$ C_1(w)$"> (i.e. <!-- MATH<br />
 $T(1,w) = C_1(w)$<br />
 --><br />
<img width="118" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg50.png" alt="$ T(1,w) = C_1(w)$"> ). Once all the <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg48.png" alt="$ T(1,w)$"> are filled in we can fill in the <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg51.png" alt="$ T(2,w)$"> with the general (and slightly trickier) formula:</font></p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w)<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="249" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg52.png" alt="$\displaystyle T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w) $"></div>
<p><font>as we illustrate for <img width="74" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg53.png" alt="$ T(2,nine)$"> in Figure&nbsp;<a href="#fig:DynTable">8</a>.</font></p>
<div align="center"><a name="fig:DynTable" id="fig:DynTable"></a><a name="145"></a></p>
<table>
<caption align="bottom"><strong>Figure 8:</strong> Dynamic Programming: Building the Table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"></caption>
<tr>
<td>
<div align="center"><img width="400" height="261" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/DynTableCalculate.png" alt="Image DynTableCalculate"></div>
</td>
</tr>
</table>
</div>
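<p><font>The table-building formula translates almost directly into code. The sketch below is one possible rendering under stated assumptions: the three-word dictionary, the three positions, the critique values and the names C1, critique and build_table are hypothetical, chosen only so the arithmetic is easy to follow by hand.</font></p>

```python
from fractions import Fraction as F

WORDS = ["dial", "nine", "one"]

# Hypothetical critiques: C_1(w) for the first word and C_i(v, w) for later
# positions (pairs not listed default to 1/10). All values are invented.
C1 = {"dial": F(3, 10), "nine": F(5, 10), "one": F(2, 10)}
C = {(2, "dial", "nine"): F(8, 10), (2, "nine", "one"): F(2, 10),
     (3, "nine", "one"): F(9, 10), (3, "one", "one"): F(5, 10)}

def critique(i, v, w):
    return C.get((i, v, w), F(1, 10))

def build_table(n_positions):
    # First column: T(1, w) = C_1(w).
    T = {(1, w): C1[w] for w in WORDS}
    # Later columns: T(i+1, w) = max over v of T(i, v) * C_{i+1}(v, w).
    for i in range(1, n_positions):
        for w in WORDS:
            T[(i + 1, w)] = max(T[(i, v)] * critique(i + 1, v, w) for v in WORDS)
    return T

T = build_table(3)
```

<p><font>Each new column costs one pass over all pairs of words, so the work grows with the square of the dictionary size per position rather than with the number of complete word sequences.</font></p>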
<p><font>The magic of the Dynamic Programming technique is: by being careful to not store too much in the table <img width="51" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg40.png" alt="$ T(i,w)$"> we avoid an explosion in record keeping that would render the method inefficient. Dynamic Programming exploits the small dependence structure encoded in <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> (each box in our diagram depending on only a few arrows) and as we have shown can find &#8220;clever&#8221; solutions (such as taking a sub-optimal first word to get better transitions into preferred later words). For those who want more detail on solving this problem we recommend [<a href="#CharniakBook">Cha96</a>] (as our goal is not to fully explain Dynamic Programming, but to demonstrate how it could be applied to the transcription problem as a pre-packaged globalizer).</font></p>
<p><font>In this example the local step was the graph link based critiques and the globalization step was Dynamic Programming. The separation of concerns from the local scoring to the globalizing step is a strength of the local to global principle.</font></p>
<h2><a name="SECTION00033000000000000000" id="SECTION00033000000000000000">Machine Learning</a></h2>
<p><font>Our final example application is machine learning. Machine learning is loosely defined as computer programs that adapt or learn from data. Thomas Mitchell helps distinguish this activity as a specialty of artificial intelligence that concentrates on &#8220;well-posed learning problems.&#8221;&nbsp;[<a href="#MitchellML">Mit97</a>] Trevor Hastie, Robert Tibshirani and Jerome Friedman emphasize the relation to statistics (versus more traditional symbolic AI)&nbsp;[<a href="#TibHat">TH09</a>]. A simple demonstration can be found in [<a href="#MLArt">Mou09b</a>].</font></p>
<p><font>Machine learning is perhaps the strongest example of the local to global principle, and our treatment of it is inspired by the work of Kristin P. Bennett and Emilio Parrado-Hernandez&nbsp;[<a href="#Bennett:2006p400">BPH06</a>]. In hindsight many machine learning algorithms (each of which has had a turn at being &#8220;the most exciting breakthrough ever&#8221; for a while) can be seen as the pairing of a performance criterion (which we call a local criterion as it applies to one specific set of parameter values at a time) and an optimization method (what we have been calling the globalization step). The work of Bennett and Parrado-Hernandez calls this distinction out and shows how it is not productive to present machine learning systems as unique named monolithic units, but instead to consider how to break them into an objective function and an optimizer. This allows both the choice of better optimizers (such as replacing the inferior gradient descent method wherever it occurs) and explicit control of important concepts such as hypothesis regularization and over-fitting (which some algorithms claim to achieve by deliberately using early exit from an inferior optimizer).</font></p>
<p><font>At a &#8220;30,000 feet level&#8221; we can build a table of common machine learning techniques and name what is commonly used to implement their local and global steps. When a machine learning algorithm is defined by what conditions are meant to be true at the optimum we are no longer bound by details of the original implementation and can examine, fix and improve the components.<a name="tex2html17" href="#foot154" id="tex2html17"><sup>7</sup></a> Table&nbsp;<a href="#fig:MachineLearning">1</a> is a crude summary of a wide selection of machine learning algorithms that may be more likely to offend everybody than just offend somebody. But this is also the point: it is the algorithmist&#8217;s job to think fluidly (beyond given names and provenances) and to invent scaffolding to convert partial analogies into practical correspondences.</font></p>
<p></p>
<div align="center"><a name="190"></a></p>
<table>
<caption><strong>Table 1:</strong> Various Machine Learning Techniques</caption>
<tr>
<td>
<div align="center">
<table cellpadding="3" border="1" align="center">
<tr>
<td align="left" valign="top" width="180"><font size="-1">Machine Learning Method</font></td>
<td align="left" valign="top" width="144"><font size="-1">Local Criterion</font></td>
<td align="left" valign="top" width="144"><font size="-1">Globalization Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Linear Regression [<a href="#Breiman:1997p1133">BF97</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Linear Discriminant Analysis [<a href="#Fisher:1936p2576">Fis36</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Logistic Regression [<a href="#Komarek:2008p1742">Kom08</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">logit penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Newton&#8217;s Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Perceptron [<a href="#Beigel:1991p1027">BRS91</a>] [<a href="#Blum:2002p1867">BD02</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">error rate</font></td>
<td align="left" valign="top" width="144"><font size="-1">error based update</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Naive Bayes [<a href="#Maron:2000p2553">MK00</a>] [<a href="#Maron:1961p2566">Mar61</a>] [<a href="#Lewis:1998p105">Lew98</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">frequency tables</font></td>
<td align="left" valign="top" width="144"><font size="-1">arithmetic</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Nearest Neighbor [<a href="#Ailon:2006p872">AC06</a>] [<a href="#Indyk:1999p166">IM99</a>] [<a href="#Andoni:2006p52">AI06</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">Kernel Methods</font></td>
<td align="left" valign="top" width="144"><font size="-1">enumeration,<br />
projection</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Decision Trees [<a href="#bfso:1984">BFSO84</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">information theory</font></td>
<td align="left" valign="top" width="144"><font size="-1">partitioning</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">clustering [<a href="#Cilibrasi:2005p8">CV05</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">partitioning</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">MaxEnt [<a href="#Grunwald:2000p108">Gru00</a>] [<a href="#Grunwald:2004p739">GD04</a>] [<a href="#Skilling:1988p780">Ski88</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">entropy penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Newton&#8217;s Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Neural Net with Back Propagation [<a href="#NNCPE">Hus99</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">sigmoid penalty function</font></td>
<td align="left" valign="top" width="144"><font size="-1">Automatic Differentiation,<br />
steepest descent</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Winnow [<a href="#Kivinen:1995p1836">KWA95</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">error rate</font></td>
<td align="left" valign="top" width="144"><font size="-1">multiplicative error based update</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Boosting [<a href="#Freund:1999p1015">FS99</a>] [<a href="#Breiman:2000p1134">Bre00</a>] [<a href="#Collins:2002p1008">CSS02</a>] [<a href="#Trevisan:2008p2166">TTV08</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">weighted errors,<br />
data re-weighting</font></td>
<td align="left" valign="top" width="144"><font size="-1">Conjugate Gradient</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">HMM [<a href="#Kristjansson:2004p545">KCVM04</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">probability penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Gibbs Sampler</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Latent Dirichlet Allocation [<a href="#Blei:2003p1063">BNJ03</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">KL divergence</font></td>
<td align="left" valign="top" width="144"><font size="-1">Variational Methods</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Support Vector Machine [<a href="#Joachims:1998p406">Joa98</a>] [<a href="#SVMBook">STC00</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">L1 Margin,<br />
Kernel Methods</font></td>
<td align="left" valign="top" width="144"><font size="-1">Quadratic Optimization</font></td>
</tr>
</table>
</div>
<p><a name="fig:MachineLearning" id="fig:MachineLearning"></a></td>
</tr>
</table>
</div>
<p></p>
<p><font>This table is a necessarily crude summary. For example: notice that several known techniques can not even be distinguished from each other by the local and global columns of the table.</font></p>
<p><font>There are a few points we would like to make. Back propagation was considered unique to Neural Nets for quite a while because it was so entwined with the technique that it was not recognized as the simple application of Automatic Differentiation&nbsp;[<a href="#Rall:1996p2473">RC96</a>] that it is. Support Vector Machines (SVM) are remarkable for their uniformly very good choice of component methods (maximum L1 margin objective regularization, Kernel Methods&nbsp;[<a href="#KernBook">STC04</a>] and sophisticated optimization methods&nbsp;[<a href="#Joachims:2006p403">Joa06</a>]). Many of the machine learning methods that SVM outperforms become competitive again when they adopt some of SVM&#8217;s technologies (especially using kernel methods to produce synthetic features).</font></p>
<p><font>Beyond these points we invoke a &#8220;globalizers are pre-packaged&#8221; principle and leave the discussion of machine learning and optimization to our reference: [<a href="#Bennett:2006p400">BPH06</a>]. In this example the local step is a per-example score or penalty and the globalization step is optimization.</font></p>
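<p><font>To make the separation of objective and optimizer concrete, here is a small sketch in which one local criterion (square error for a line fit) is handed to two different globalizers. The data points, the function names and the step size are assumptions made for this illustration only; the sketch is not taken from the cited references.</font></p>

```python
# Hypothetical data: points lying near the line y = 2x + 1.
XS = [0.0, 1.0, 2.0, 3.0, 4.0]
YS = [1.1, 2.9, 5.2, 6.8, 9.0]

def square_error(a, b):
    """Local criterion: scores one candidate (slope, intercept) at a time."""
    return sum((a * x + b - y) ** 2 for x, y in zip(XS, YS))

def solve_by_linear_algebra():
    """Globalizer 1: solve the normal equations in closed form."""
    n = len(XS)
    mx, my = sum(XS) / n, sum(YS) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(XS, YS))
    sxx = sum((x - mx) ** 2 for x in XS)
    a = sxy / sxx
    return a, my - a * mx

def solve_by_gradient_descent(steps=5000, rate=0.01):
    """Globalizer 2: plain gradient descent on the same square error."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(XS, YS))
        grad_b = sum(2 * (a * x + b - y) for x, y in zip(XS, YS))
        a, b = a - rate * grad_a, b - rate * grad_b
    return a, b

exact = solve_by_linear_algebra()
approx = solve_by_gradient_descent()
```

<p><font>Both globalizers arrive at essentially the same slope and intercept because they share the objective; what differs is cost and robustness. The step size of 0.01 was hand-picked to be stable for this particular data and would need retuning elsewhere, which is one reason to prefer a better optimizer when one is available.</font></p>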
<h1><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Some Methods</a></h1>
<p><font>The application of the local to global principle is similar to the Feynman &#8220;genius method.&#8221; Feynman&#8217;s method is to always have in mind a list of problems and a list of solution methods. The genius step is: anytime you see a new problem or a new solution method to immediately try it against every item from the complementary list.&nbsp;[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;Ten Lessons I Wish I Had Been Taught&#8221;] This deliberate retention and activity greatly increases your problem solving ability. The power of the local to global principle is itself proportional to the number of local methods times the number of globalization strategies. Of course, to even start: the practitioner must already have available a number of candidate local and globalization methods. We list some methods and some guidance on variation and invention.</font></p>
<h2><a name="SECTION00041000000000000000" id="SECTION00041000000000000000">Local Methods</a></h2>
<p><font><img width="100" height="100" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/nails.jpg" alt="Image nails"> Good sources of ideas and analogies for local methods include:</font></p>
<ul>
<li>Introduce a Graph Structure
<p>A graph structure is a network of nodes connected by edges. Use of graphs was demonstrated both in the natural language processing and web page link analysis examples. We can dress up how we solved these problems and say we used a &#8220;Hidden Markov Model&#8221;, but the real power was we encoded our problem in a simple graph. Some problems (especially those from logic or those involving time) are essentially solved once they are translated out of their original form and into graph notation (for an example see: [<a href="#Mount:2000p360">Mou00</a>]).</p>
</li>
<li>Appeal to Physical Conservation Laws
<p>A good example of a physical law is Kirchhoff&#8217;s law or conservation of flow. All of the web page link analysis&#8217;s equations were derived by saying that the attention at a node is essentially the sum of attentions from other nodes (more sophisticated versions of the analysis actually do create and destroy flow, but they do it in a principled way).</p>
</li>
<li>Encode the Problem into an Objective Function
<p>This method is essentially your declaration that you intend to use an optimizer for the globalization step. In operations research this specific technique has long been the practice (with no disrespect: a very productive part of operations research has been translating different problems into linear programs so the simplex method can be applied, for an example see [<a href="#TradeArt">Mou09a</a>]).</p>
</li>
<li>Gradient Like Computations
<p>Includes Gradients, Secants, Lagrangians and other ideas from calculus. Gradients can drive optimizer based globalizers and techniques like Lagrangians are often powerful enough to use mere inspection as the globalization step.</p>
</li>
<li>Violation Driven Updates
<p>This method is particularly effective when your problem is not amenable to continuous optimization. A good example is the Lin-Kernighan heuristic for solving the traveling salesman problem.[<a href="#Lin:1973p2739">LK73</a>] This heuristic looks at subsets of the problem and suggests improving &#8220;surgeries&#8221; (until no more such improvements are possible).</p>
</li>
<li>Introduction of Symbols
<p>Often, as with the web page link analysis example, you can not specify specific values for the unknowns, but you can specify relationships. You often can then solve for the symbols or introduce additional conditions and use an optimizer to complete the solution (see for example the maximum entropy method as described in [<a href="#Skilling:1988p780">Ski88</a>]).</p>
</li>
<li>Over Specification
<p>If we anticipate using a global step like search, enumeration, summation or integration then over specification is a good local idea.</p>
<p>For example: consider computing the probability that a fair coin flipped 10 times comes up heads exactly 3 times. The easiest way to perform this calculation is to specify exactly which 3 coins come up heads (the local over-specification) and then sum over all choices of 3 out of 10 coins (the global step). In mathematical notation this is:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
P[\text{exactly 3 heads out of 10 flips}] = \binom{10}{3} 2^{-10} \approx 0.117<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="20" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg54.png" alt="$\displaystyle P[$">exactly 3 heads out of 10 flips<img width="157" height="54" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg55.png" alt="$\displaystyle ] = \binom{10}{3} 2^{-10} \approx 0.117 $"></div>
<p>or just under 12%.</li>
<li>Under Specification
<p>One of the core principles of Dynamic Programming is to forget as much as possible about partial solutions, keeping only partial solution cost and just enough information to extend partial solutions. If you anticipate using something like Dynamic Programming as your globalization step then your goal should be to under specify.</p>
</li>
<li>Tables
<p>A key step of the natural language processing example was the use of tables of past experience to determine which sounds likely corresponded to which words, which words likely followed each other (and so on). Encoding domain knowledge or expertise as probability tables is a very effective problem solving strategy (especially if the globalization strategy is going to be search or Dynamic Programming). In natural language processing examples tables and statistics are <em>much</em> easier to manage than comprehensive rules or grammars.</p>
</li>
<li>Set up as Ranking or Machine Learning Problem
<p>This tactic is especially appropriate if your solution success metric is counts, frequencies or probabilities (instead of having to always be correct or always be optimal).</p>
</li>
</ul>
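<p><font>The coin flipping calculation from the Over Specification item can be checked directly. The sketch below is only an illustration (the variable names are ours): it performs the local over-specification, then the global sum by brute-force enumeration, and compares the result with the closed form.</font></p>

```python
from itertools import combinations
from math import comb

FLIPS = 10
HEADS = 3

# Local over-specification: the probability of one fully specified outcome,
# e.g. "coins 0, 4 and 7 are heads and every other coin is tails".
p_specific = 0.5 ** FLIPS

# Global step: sum over every choice of which 3 of the 10 coins are heads.
p_by_enumeration = sum(p_specific for _ in combinations(range(FLIPS), HEADS))

# The same sum in closed form: binomial(10, 3) * 2^-10.
p_closed_form = comb(FLIPS, HEADS) * 2.0 ** -FLIPS
```

<p><font>Both routes give 120/1024, or about 0.117.</font></p>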
<h2><a name="SECTION00042000000000000000" id="SECTION00042000000000000000">Globalization Methods</a></h2>
<p><font><img width="100" height="100" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/hammer.jpg" alt="Image hammer"> The universe of possible globalization methods is very diverse (in particular globalization is not always optimization).</font></p>
<ul>
<li>Search / Enumeration
<p>Search can be slow, but it is always an option to consider. If your problem translates naturally into a graph structure or your solutions are naturally seen as being composed of small pieces search should be considered. One of the big advantages of using the local phase to formally encode your problem&#8217;s structure and putting search off to the global phase is: you can use advanced search techniques. Once you are freed from your specific problem details it becomes much easier to consider search techniques like branch and bound, A*, game theoretic search and general speed-up techniques like hashing and caching.</p>
</li>
<li>Dynamic Programming
<p>If your problem has a bit more structure (in that partial solutions summarize and compose easily) then you can likely replace search with Dynamic Programming. The advantage is that Dynamic Programming typically offers an incredible speed up when compared to search.</p>
</li>
<li>Optimization
<p>If your problem is continuous (involves numbers instead of discrete or categorical decisions), can be encoded as a reasonable objective function (linear, positive definite quadratic) and has reasonable constraints (linear or convex) then you can immediately apply an optimizer as your globalization step. Typical optimization methods include: conjugate gradient, Newton methods, quasi Newton methods, linear programming and quadratic programming.</p>
</li>
<li>Combinatorial Optimization
<p>If your problem includes &#8220;discrete variables&#8221; (that is, variables that take on one of a fixed set of values instead of values from a numeric range) then you may not be able to apply standard optimization techniques. At this point you may want to use more expensive combinatorial optimization techniques like integer linear programming or constraint satisfaction.</p>
</li>
<li>Fixed Point Methods / Iteration
<p>Fixed point methods are based on the idea: &#8220;incrementally improve until there is no incremental improvement possible.&#8221; If the problem is continuous this is similar to steepest descent. If the problem is discrete then this is similar to the Lin-Kernighan heuristic.</p>
</li>
<li>Linear Algebra
<p>The web page link analysis and optimization examples were essentially solved once we reduced them to linear algebra. If you can write your problem as a linear relationship between unknowns or as the fixed-point of a linear operator (i.e. an <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg56.png" alt="$ x$"> such that <img width="54" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg57.png" alt="$ A x = x$"> ) then you can immediately use linear algebra to solve the problem at very large scale (e.g. web scale).</p>
</li>
<li>Sampling / Problem Kernels
<p>A very successful line of attack on large problems is to reduce to a smaller problem containing most of the essential difficulty. David Karger has produced a number of effective algorithms for graph cuts and flows using a theory of sampling&nbsp;[<a href="#Karger:1998p556">Kar98</a>]. Rod Downey and M. Fellows have demonstrated an effective theory of &#8220;problem kernels&#8221; that finds solutions by focusing on smaller sub-problems (on which we can afford to use more expensive procedures).[<a href="#DF98">DF98</a>]</p>
</li>
<li>Amortized Analysis / Economic Mechanism Methods
<p>Daniel Sleator and Robert Tarjan&#8217;s ideas of amortized analysis&nbsp;[<a href="#Sleator:1985p168">ST85</a>] allow approximation schemes similar to problem kernels. The method is to approximately optimize by pairing a bunch of unavoidable large penalties (conditions we can&#8217;t meet) with some accounting credits (say bonuses from other conditions we are meeting very well). We then isolate these paired items and optimize the rest of the problem exactly. The technique often works by showing the approximation can not be too bad because, due to the pairing of large penalties to good credits, there can not be too many large penalties. An informal example is: if it is impossible to pick someplace where all of an office will eat for lunch, perhaps you can solve the problem by paying one person to accept a restaurant they do not like (if the removal of their objection opens up a venue that is acceptable to everybody else).</p>
</li>
<li>Relaxation / Homotopic methods
<p>These methods involve changing hard constraints to soft penalties (so allowing the constraints to be violated, but at a slowly increasing cost). After such a relaxation the homotopic (or continuous deformation) method is to increase the cost of violation and re-solve to try and get a trajectory of better and better nearly acceptable solutions that point to a possible overall solution.</p>
</li>
</ul>
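<p>As a minimal sketch of the fixed point and linear algebra entries above (an illustration, not code from the original analysis), the following Python looks for an x with A x = x by repeatedly applying A until the update stops changing x. The matrix <tt>A</tt>, the function name <tt>fixed_point</tt> and the tolerance are all assumptions made for this example.</p>

```python
def fixed_point(A, x, tol=1e-12, max_iter=10000):
    """Repeat x = A x until x stops changing (plain power iteration)."""
    n = len(x)
    for _ in range(max_iter):
        # Local step: each entry of A x depends only on one row of A.
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(y)
        y = [v / total for v in y]  # rescale so the entries sum to 1
        change = max(abs(y[i] - x[i]) for i in range(n))
        x = y
        if tol > change:
            break  # no incremental improvement left
    return x

# ASSUMPTION: a made-up column-stochastic matrix; column j holds the
# odds of moving from state j to each of the three states.
A = [[0.5, 0.25, 0.0],
     [0.5, 0.5, 0.5],
     [0.0, 0.25, 0.5]]

x = fixed_point(A, [1.0, 0.0, 0.0])
print(x)
```

<p>For this particular matrix the iteration settles near (0.25, 0.5, 0.25), which can be checked by hand against A x = x. At web scale the same loop would be run with sparse matrix routines rather than nested Python lists.</p>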
<h1><a name="SECTION00050000000000000000" id="SECTION00050000000000000000">Conclusion</a></h1>
<p><font>The purpose of this article has been to make more visible an idea we call the local to global principle. This principle is an organizing tool useful both in designing and analyzing a wide variety of applications. Essentially the whole point of this writeup is to set up enough framework to quickly write down a table of advice such as Table&nbsp;<a href="#fig:ProblemTable">2</a> (and for such a table to mean something).</font></p>
<p></p>
<div align="center"><a name="227"></a></p>
<table>
<caption><strong>Table 2:</strong> Various Applications, Local Steps and Global Steps</caption>
<tr>
<td>
<div align="center">
<table cellpadding="3" border="1" align="center">
<tr>
<td align="left"><font size="-1">Example</font></td>
<td align="left"><font size="-1">Local Step</font></td>
<td align="left"><font size="-1">Global Step</font></td>
</tr>
<tr>
<td align="left"><font size="-1">speech transcription</font></td>
<td align="left"><font size="-1">tables</font></td>
<td align="left"><font size="-1">Dynamic Programming</font></td>
</tr>
<tr>
<td align="left"><font size="-1">PageRank</font></td>
<td align="left"><font size="-1">graph structure, linear equations</font></td>
<td align="left"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left"><font size="-1">machine learning</font></td>
<td align="left"><font size="-1">objective function</font></td>
<td align="left"><font size="-1">optimization</font></td>
</tr>
</table>
</div>
<p><a name="fig:ProblemTable" id="fig:ProblemTable"></a></td>
</tr>
</table>
</div>
<p></p>
<p><font>The principle is not universal; not everything can be fit into such a table. For example the local to global decoupling is <em>not</em> a feature of the famous EM algorithm&nbsp;[<a href="#Dempster:1977p761">DLR77</a>], which depends on mixing predictions and corrections.</font></p>
<p><font>To conclude: the recipe is as follows. If you come to a problem with a large shopping bag of possible ways to build local criteria and powerful globalization procedures then you stand a very good chance of solving the problem quickly. Also, if you keep the local to global principle in mind you are more likely to identify and retain potential local tricks and globalizers when you see them, and thus have a larger, more nimble set of tools available to solve problems when the time comes.</font></p>
<h2><a name="SECTION00060000000000000000" id="SECTION00060000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Ailon:2006p872" id="Ailon:2006p872">AC06</a></dt>
<dd>Nir Ailon and Bernard Chazelle, <i>Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform</i>, STOC (2006).</dd>
<dt><a name="Andoni:2006p52" id="Andoni:2006p52">AI06</a></dt>
<dd>Alexandr Andoni and Piotr Indyk, <i>Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions</i>.</dd>
<dt><a name="Blum:2002p1867" id="Blum:2002p1867">BD02</a></dt>
<dd>Avrim Blum and John Dunagan, <i>Smoothed analysis of the perceptron algorithm for linear programming</i>, SODA (2002), 11.</dd>
<dt><a name="DynamicProgramming" id="DynamicProgramming">Bel57</a></dt>
<dd>Richard Bellman, <i>Dynamic programming</i>, Princeton University Press, 1957.</dd>
<dt><a name="Breiman:1997p1133" id="Breiman:1997p1133">BF97</a></dt>
<dd>Leo Breiman and Jerome&nbsp;H Friedman, <i>Predicting multivariate responses in multiple linear regression</i>, Journal of the Royal Statistical Society, Series B (Methodological) <b>59</b> (1997), no.&nbsp;1, 3-54.</dd>
<dt><a name="bfso:1984" id="bfso:1984">BFSO84</a></dt>
<dd>Leo Breiman, Jerome Friedman, Charles&nbsp;J. Stone, and R.&nbsp;A. Olshen, <i>Classification and regression trees</i>, Chapman &amp; Hall/CRC, January 1984.</dd>
<dt><a name="Blei:2003p1063" id="Blei:2003p1063">BNJ03</a></dt>
<dd>David&nbsp;M Blei, Andrew&nbsp;Y Ng, and Michael&nbsp;I Jordan, <i>Latent Dirichlet allocation</i>, Journal of Machine Learning Research <b>3</b> (2003), 993-1022.</dd>
<dt><a name="Bennett:2006p400" id="Bennett:2006p400">BPH06</a></dt>
<dd>Kristin&nbsp;P. Bennett and Emilio Parrado-Hernandez, <i>The interplay of optimization and machine learning research</i>, Journal of Machine Learning Research <b>7</b> (2006), 1265-1281.</dd>
<dt><a name="Breiman:2000p1134" id="Breiman:2000p1134">Bre00</a></dt>
<dd>Leo Breiman, <i>Special invited paper. additive logistic regression: A statistical view of boosting: Discussion</i>, Ann. Statist. <b>28</b> (2000), no.&nbsp;2, 374-377.</dd>
<dt><a name="Beigel:1991p1027" id="Beigel:1991p1027">BRS91</a></dt>
<dd>R&nbsp;Beigel, N&nbsp;Reingold, and D&nbsp;Spielman, <i>The perceptron strikes back</i>, Structure in Complexity Theory Conference <b>6</b> (1991), 286-291.</dd>
<dt><a name="CharniakBook" id="CharniakBook">Cha96</a></dt>
<dd>Eugene Charniak, <i>Statistical language learning</i>, MIT Press, 1996.</dd>
<dt><a name="Charniak:1997p1484" id="Charniak:1997p1484">Cha97</a></dt>
<dd>Eugene Charniak, <i>Statistical techniques for natural language parsing</i>, AI Magazine <b>18</b> (1997), no.&nbsp;4, 33-44.</dd>
<dt><a name="IntroductionToAlgorithms" id="IntroductionToAlgorithms">CLRS09</a></dt>
<dd>Thomas&nbsp;H. Cormen, Charles&nbsp;E. Leiserson, Ronald&nbsp;L. Rivest, and Clifford Stein, <i>Introduction to algorithms</i>, MIT Press, 2009.</dd>
<dt><a name="Collins:2002p1008" id="Collins:2002p1008">CSS02</a></dt>
<dd>Michael Collins, Robert&nbsp;E Schapire, and Yoram Singer, <i>Logistic regression, AdaBoost and Bregman distances</i>, Machine Learning <b>48</b> (2002), no.&nbsp;1/2/3, 30.</dd>
<dt><a name="Cilibrasi:2005p8" id="Cilibrasi:2005p8">CV05</a></dt>
<dd>Rudi Cilibrasi and Paul&nbsp;M.B Vitanyi, <i>Clustering by compression</i>, IEEE Transactions on Information Theory <b>51</b> (2005), no.&nbsp;4, 1523-1545.</dd>
<dt><a name="DF98" id="DF98">DF98</a></dt>
<dd>Rod&nbsp;G. Downey and M.&nbsp;R. Fellows, <i>Parameterized complexity</i>, Monographs in Computer Science, Springer, November 1998.</dd>
<dt><a name="Dempster:1977p761" id="Dempster:1977p761">DLR77</a></dt>
<dd>A&nbsp;P Dempster, N&nbsp;M Laird, and D&nbsp;B Rubin, <i>Maximum likelihood from incomplete data via the EM algorithm</i>, Journal of the Royal Statistical Society, Series B (Methodological) <b>39</b> (1977), no.&nbsp;1, 1-38.</dd>
<dt><a name="Fisher:1936p2576" id="Fisher:1936p2576">Fis36</a></dt>
<dd>Ronald&nbsp;A Fisher, <i>The use of multiple measurements in taxonomic problems</i>, Annals of Eugenics <b>7</b> (1936), 179-188.</dd>
<dt><a name="Freund:1999p1015" id="Freund:1999p1015">FS99</a></dt>
<dd>Yoav Freund and Robert&nbsp;E Schapire, <i>A short introduction to boosting</i>, Journal of Japanese Society for Artificial Intelligence <b>14</b> (1999), no.&nbsp;5, 771-780.</dd>
<dt><a name="Grunwald:2004p739" id="Grunwald:2004p739">GD04</a></dt>
<dd>Peter&nbsp;D Grunwald and A&nbsp;Philip Dawid, <i>Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory</i>, Ann. Statist. <b>32</b> (2004), no.&nbsp;4, 1367-1433.</dd>
<dt><a name="Grunwald:2000p108" id="Grunwald:2000p108">Gru00</a></dt>
<dd>PD&nbsp;Grunwald, <i>Maximum entropy and the glasses you are looking through</i>, Conference on Uncertainty in Artificial Intelligence (2000), 238-246.</dd>
<dt><a name="Halevy:2009p2327" id="Halevy:2009p2327">HNP09</a></dt>
<dd>Alon Halevy, Peter Norvig, and Fernando Pereira, <i>The unreasonable effectiveness of data</i>, IEEE Intelligent Systems (2009).</dd>
<dt><a name="NNCPE" id="NNCPE">Hus99</a></dt>
<dd>Dirk Husmeier, <i>Neural networks for conditional probability estimation</i>, Springer, 1999.</dd>
<dt><a name="Indyk:1999p166" id="Indyk:1999p166">IM99</a></dt>
<dd>Piotr Indyk and Rajeev Motwani, <i>Approximate nearest neighbors: Towards removing the curse of dimensionality</i>.</dd>
<dt><a name="Joachims:1998p406" id="Joachims:1998p406">Joa98</a></dt>
<dd>Thorsten Joachims, <i>Making large-scale SVM learning practical</i>, Advances in Kernel Methods &#8211; Support Vector Learning (1998).</dd>
<dt><a name="Joachims:2006p403" id="Joachims:2006p403">Joa06</a></dt>
<dd>Thorsten Joachims, <i>Training linear SVMs in linear time</i>, KDD (2006).</dd>
<dt><a name="Karger:1998p556" id="Karger:1998p556">Kar98</a></dt>
<dd>David&nbsp;R Karger, <i>Randomization in graph optimization problems: A survey</i>, Optima: Mathematical Programming Society Newsletter <b>58</b> (1998).</dd>
<dt><a name="Kristjansson:2004p545" id="Kristjansson:2004p545">KCVM04</a></dt>
<dd>Trausti Kristjansson, Aron Culotta, Paul Viola, and Andrew&nbsp;Kachites McCallum, <i>Interactive information extraction with constrained conditional random fields</i>, AAAI (2004).</dd>
<dt><a name="Kleinberg:1997p32" id="Kleinberg:1997p32">Kle97</a></dt>
<dd>Jon&nbsp;M Kleinberg, <i>Authoritative sources in a hyperlinked environment</i>, ACM SIAM Symposium on Discrete Algorithms (1997).</dd>
<dt><a name="Komarek:2008p1742" id="Komarek:2008p1742">Kom08</a></dt>
<dd>Paul Komarek, <i>Logistic regression for data mining and high-dimensional classification</i>, CMU CS Thesis (2008), 138.</dd>
<dt><a name="Kivinen:1995p1836" id="Kivinen:1995p1836">KWA95</a></dt>
<dd>J&nbsp;Kivinen, Manfred&nbsp;K Warmuth, and P&nbsp;Auer, <i>The perceptron algorithm vs. Winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant</i>, COLT (1995), 289-296.</dd>
<dt><a name="Lewis:1998p105" id="Lewis:1998p105">Lew98</a></dt>
<dd>David&nbsp;D Lewis, <i>Naive (bayes) at forty: The independence assumption in information retrieval</i>, find journal (1998).</dd>
<dt><a name="Lin:1973p2739" id="Lin:1973p2739">LK73</a></dt>
<dd>S&nbsp;Lin and BW&nbsp;Kernighan, <i>An effective heuristic algorithm for the traveling-salesman problem</i>, Operations Research (1973), 498-516.</dd>
<dt><a name="Maron:1961p2566" id="Maron:1961p2566">Mar61</a></dt>
<dd>M&nbsp;E Maron, <i>Automatic indexing: An experimental inquiry</i>, RAND Technical Report (1961), 404-417.</dd>
<dt><a name="HTSMH" id="HTSMH">MF00</a></dt>
<dd>Zbigniew Michalewicz and David&nbsp;B. Fogel, <i>How to solve it: Modern heuristics</i>, Springer, 2000.</dd>
<dt><a name="Mill" id="Mill">Mil02</a></dt>
<dd>John&nbsp;Stuart Mill, <i>A system of logic</i>, University Press of the Pacific, 2002.</dd>
<dt><a name="MitchellML" id="MitchellML">Mit97</a></dt>
<dd>Thomas Mitchell, <i>Machine learning</i>, McGraw-Hill, 1997.</dd>
<dt><a name="Maron:2000p2553" id="Maron:2000p2553">MK00</a></dt>
<dd>M&nbsp;E Maron and J&nbsp;L Kuhns, <i>On relevance, probabilistic indexing and information retrieval</i>, 1960 (2000), 1-29.</dd>
<dt><a name="Mount:2000p360" id="Mount:2000p360">Mou00</a></dt>
<dd>John&nbsp;A Mount, <i>Automatic detection of potential deadlock</i>, Dr. Dobbs Journal (2000).</dd>
<dt><a name="TradeArt" id="TradeArt">Mou09a</a></dt>
<dd>John Mount, <i>Automatic generation and testing of un-rolls for profitable technical trades</i>, <a href="http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/">http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/</a>, 2009.</dd>
<dt><a name="MLArt" id="MLArt">Mou09b</a></dt>
<dd>John Mount, <i>A demonstration of data mining</i>, <a href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/</a>, 2009.</dd>
<dt><a name="Page:1998p2689" id="Page:1998p2689">PBMW98</a></dt>
<dd>Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, <i>The PageRank citation ranking: Bringing order to the web</i>, <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768</a> (1998).</dd>
<dt><a name="Polya1" id="Polya1">Pol54a</a></dt>
<dd>G.&nbsp;Polya, <i>Induction and analogy in mathematics</i>, Princeton University Press, 1954.</dd>
<dt><a name="Polya2" id="Polya2">Pol54b</a></dt>
<dd>G.&nbsp;Polya, <i>Patterns of plausible inference</i>, Princeton University Press, 1954.</dd>
<dt><a name="citeulike:679515" id="citeulike:679515">Pol71</a></dt>
<dd>G.&nbsp;Polya, <i>How to solve it</i>, Princeton University Press, November 1971.</dd>
<dt><a name="Rall:1996p2473" id="Rall:1996p2473">RC96</a></dt>
<dd>Louis&nbsp;B Rall and George&nbsp;F Corliss, <i>An introduction to automatic differentiation</i>, SIAM: Computational Differentiation: Techniques, Applications and Tools (1996), 1-18.</dd>
<dt><a name="IndiscreteThoughts" id="IndiscreteThoughts">Rot97</a></dt>
<dd>Gian-Carlo Rota, <i>Indiscrete thoughts</i>, Birkhauser, 1997.</dd>
<dt><a name="Skilling:1988p780" id="Skilling:1988p780">Ski88</a></dt>
<dd>John Skilling, <i>The axioms of maximum entropy</i>, Maximum Entropy and Bayesian Methods in Science and Engineering <b>1</b> (1988), no.&nbsp;173-187.</dd>
<dt><a name="Sleator:1985p168" id="Sleator:1985p168">ST85</a></dt>
<dd>Daniel&nbsp;Dominic Sleator and Robert&nbsp;Endre Tarjan, <i>Amortized efficiency of list update and paging rules</i>, Communications of the ACM <b>28</b> (1985), no.&nbsp;2.</dd>
<dt><a name="SVMBook" id="SVMBook">STC00</a></dt>
<dd>John Shawe-Taylor and Nello Cristianini, <i>Support vector machines</i>, Cambridge University Press, 2000.</dd>
<dt><a name="KernBook" id="KernBook">STC04</a></dt>
<dd>John Shawe-Taylor and Nello Cristianini, <i>Kernel methods for pattern analysis</i>, Cambridge University Press, 2004.</dd>
<dt><a name="Strang" id="Strang">Str76</a></dt>
<dd>Gilbert Strang, <i>Linear algebra and its applications</i>, Academic Press, Inc., 1976.</dd>
<dt><a name="TibHat" id="TibHat">TH09</a></dt>
<dd>Trevor&nbsp;Hastie, Robert&nbsp;Tibshirani, and Jerome&nbsp;Friedman, <i>The elements of statistical learning: Data mining, inference and prediction</i>, Springer, 2009.</dd>
<dt><a name="Trevisan:2008p2166" id="Trevisan:2008p2166">TTV08</a></dt>
<dd>Luca Trevisan, Madhur Tulsiani, and Salil Vadhan, <i>Regularity, boosting, and efficiently simulating every high-entropy distribution</i>, Electronic Colloquium on Computational Complexity (2008), 18.</dd>
<dt><a name="Zeilberger:1995p277" id="Zeilberger:1995p277">Zei95</a></dt>
<dd>Doron Zeilberger, <i>The method of undetermined generalization and specialization illustrated with Fred Galvin&#8217;s amazing proof of the Dinitz conjecture</i>, <a href="http://arxiv.org/abs/math/9506215">http://arxiv.org/abs/math/9506215</a>, 1995.</dd>
</dl>
<h1><a name="SECTION00070000000000000000" id="SECTION00070000000000000000">Acknowledgement</a></h1>
<p><font><font>A thank you to readers who supplied help and comments on earlier drafts.</font></font></p>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot21" id="foot21">&#8230; Mount</a><a href="#tex2html3"><sup>1</sup></a></dt>
<dd>email: <tt><a name="tex2html1" href="mailto:jmount@win-vector.com" id="tex2html1">mailto:jmount@win-vector.com</a></tt> web: <tt><a name="tex2html2" href="http://www.win-vector.com/" id="tex2html2">http://www.win-vector.com/</a></tt></dd>
<dt><a name="foot244" id="foot244">&#8230; principle.</a><a href="#tex2html4"><sup>2</sup></a></dt>
<dd>The pre-existing practice that comes closest to the local to global principle is found in operations research, where encoding a problem to be solved by an optimizer is a central technique. We claim the natural statement of the local to global principle is more general than <font><em>always</em> encoding constraints for a particular optimizer (in particular globalization is not always optimization).</font></dd>
<dt><font><a name="foot43" id="foot43">&#8230; structure</a><a href="#tex2html6"><sup>4</sup></a></font></dt>
<dd><font>By &#8220;link structure&#8221; we mean which web pages link to which other web pages.</font></dd>
<dt><font><a name="foot45" id="foot45">&#8230; graph</a><a href="#tex2html7"><sup>5</sup></a></font></dt>
<dd><font>Remember, a graph is a diagram consisting of nodes and edges (here depicted as arrows).</font></dd>
<dt><font><a name="foot245" id="foot245">&#8230; features</a><a href="#tex2html9"><sup>6</sup></a></font></dt>
<dd><font>For example the model could account for:</font></p>
<ul>
<li>surfers entering and leaving the model</li>
<li>link odds that vary where they are on a page</li>
<li>surfers staying on a page proportional to how much text is on the page</li>
<li>matching known traffic and click behavior where we have such data.</li>
</ul>
<p><font>For simplicity we will just stick with the given example.</font></dd>
<dt><font><a name="foot154" id="foot154">&#8230; components.</a><a href="#tex2html17"><sup>7</sup></a></font></dt>
<dd><font>When a system is named and defined as an exact set of procedures the system can, by definition, not be improved. This is because with any change in procedure we have a new system that no longer matches the original definition and therefore requires a new name.</font></dd>
</dl>
<p><font><br /></font></p>
<hr />
<address><font>John Mount 2009-11-11</font></address>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Demonstration of Data Mining</title>
		<link>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=a-demonstration-of-data-mining</link>
		<comments>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/#comments</comments>
		<pubDate>Thu, 20 Aug 2009 01:16:27 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[Regression]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=252</guid>
		<description><![CDATA[REPOST (now in HTML in addition to the original PDF). This paper demonstrates and explains some of the basic techniques used in data mining. It also serves as an example of some of the kinds of analyses and projects Win Vector LLC engages in. August 19, 2009 John Mount1 A Demonstration of Data Mining 1&#160;&#160;Introduction [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/' rel='bookmark' title='Your Data is Never the Right Shape'>Your Data is Never the Right Shape</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>REPOST (now in HTML in addition to the original  <a href="http://www.win-vector.com/dfiles/ADemonstrationOfDataMining.pdf"> PDF</a>).</p>
<p>This paper  demonstrates and explains some of the basic techniques used in data mining.  It also serves as an example of some of the kinds of analyses and projects Win Vector LLC engages in.<span id="more-252"></span>
<div class="p"><!----></div>
<h3 align="center">August 19, 2009 </h3>
<h3 align="center">John Mount<a href="#tthFtNtAAB" name="tthFrefAAB"><sup>1</sup></a> </h3>
<h1 align="center">A Demonstration of Data Mining </h1>
<div class="p"><!----></div>
<h2><a name="tth_sEc1"><br />
1</a>&nbsp;&nbsp;Introduction</h2>
<div class="p"><!----></div>
<p> A major industry in our time is the collection of large data sets in preparation for the magic of data mining [<a href="#NYTStat" name="CITENYTStat">Loh09</a>,<a href="#Halevy:2009p2327" name="CITEHalevy:2009p2327">HNP09</a>].  There is extreme excitement about both the possible applications (identifying important customers, identifying medical risks, targeting advertising, designing auctions and so on) and the various methods for data mining and machine learning.  To some extent these methods are classic statistics presented in a new bottle.  Unfortunately, the concerns, background and language of the modern data-mining practitioner are different from those of the classic statistician, so some demonstration and translation is required.  In this writeup we will show how much of the magic of current data mining and machine learning can be explained in terms of statistical regression techniques and show how the statistician&#8217;s view is useful in choosing techniques.</p>
<div class="p"><!----></div>
<p> Too often data mining is used as a black box. It is quite possible to use statistics to clearly understand the meaning and mechanisms of data mining.</p>
<div class="p"><!----></div>
<h2><a name="tth_sEc2"><br />
2</a>&nbsp;&nbsp;The Example Problem</h2>
<div class="p"><!----></div>
<p> Throughout this writeup we will work on a single idealized example problem.  For our problem we will assume we are working with a company that sells items and that this company has recorded its past sales visits.  We assume they recorded how well the prospect matched the product offering (we will call this &#8220;match factor&#8221;), how much of a discount was offered to the prospect (we will call this &#8220;discount factor&#8221;) and if the prospect became a customer or not (this is our determination of positive or negative outcome).  The goal is to use this past record as &#8220;training data&#8221; and build a model to predict the odds of making a new sale as a function of the match factor and the discount factor.  In a perfect world the historic data would look a lot like Figure&nbsp;<a href="#fig:IdealFitting">1</a>.  In Figure&nbsp;<a href="#fig:IdealFitting">1</a> each icon represents a past sales-visit, the red diamonds are non-sales and the green disks are successful sales.  Each icon is positioned horizontally to correspond to the discount factor used and vertically to correspond to the degree of product match estimated during the prospective customer visit.  This data is literally too good to be true in at least three qualities: the past data covers a large range of possibilities, every possible combination has already been tried in an orderly fashion and the good and bad events &#8220;are linearly separable.&#8221;  The job of the modeler would then be to draw the separating line (shown in Figure&nbsp;<a href="#fig:IdealFitting">1</a>) and label every situation above and to the right of the separating line as good (or positive) and every situation below and to the left as bad (or negative).</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg1"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/IdealFitting.png" alt="IdealFitting.png" /></p>
<p></center><center>Figure 1: Ideal Fitting Situation</center><br />
<a name="fig:IdealFitting"><br />
</a></p>
<div class="p"><!----></div>
<p> In reality past data is subject to what prospects were available (so you are unlikely to have good range and an orderly layout of past sales calls) and also heavily affected by past policy.  An example policy might be that potential customers with a good product match factor may never have been offered a significant discount in the past, so we would have no data from that situation.  Finally each outcome is a unique event that depends on a lot more than the two quantities we are recording, so it is too much to hope that the good prospects are simply separable from the bad ones.</p>
<div class="p"><!----></div>
<p> Figure&nbsp;<a href="#fig:IdealFitting">1</a> is a mere cartoon or caricature of the modeling process, but it represents the initial intuition behind data mining.  Again: the flaws in Figure&nbsp;<a href="#fig:IdealFitting">1</a> represent the implicit hopes of the data miner.  The data miner wishes that the past experiments are laid out in an orderly manner, data covers most of the combinations of possibilities and there is a perfect and simple concept ready to be learned.</p>
<div class="p"><!----></div>
<p> Frankly, an experienced data miner would feel incredibly fortunate if the past data looked anything like what is shown in Figure&nbsp;<a href="#fig:EmpiricalData">2</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg2"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/empirical1.png" alt="empirical1.png" /></p>
<p></center><center>Figure 2: Empirical Data</center><br />
<a name="fig:EmpiricalData"><br />
</a></p>
<div class="p"><!----></div>
<p> The green disks (representing good past prospects) and the red diamonds (representing bad past prospects) are intermingled (which is bad).  There is some evidence that past policy was to lower the discount offered as the match factor increased (as seen in the diagonal spread of the green disks).  Finally we see the red diamonds are also distributed differently than the green disks. This is both good and bad.  The good is that the center of mass of the red diamonds differs from the center of mass of the green disks.  The bad is that the density of red diamonds does not fall any faster as it passes into the green disks than it falls in any other direction.  This indicates there is something important and different (and not measured in our two variables) about at least some of the bad prospects.  It is the data miner&#8217;s job to be aware of this and to press on.</p>
<div class="p"><!----></div>
<h3><a name="tth_sEc2.1"><br />
2.1</a>&nbsp;&nbsp;The Trendy Now</h3>
<div class="p"><!----></div>
<p> In truth data miners often rush where classical statisticians fear to tread.  Right now the temptation is to immediately select from any number of &#8220;red hot&#8221; techniques, methods or software packages.  My short list of super-star method buzzwords includes:</p>
<div class="p"><!----></div>
<ul>
<li> Boosting[<a href="#Schapire:2001p1019" name="CITESchapire:2001p1019">Sch01</a>,<a href="#Breiman:2000p1134" name="CITEBreiman:2000p1134">Bre00</a>,<a href="#Freund:2003p1009" name="CITEFreund:2003p1009">FISS03</a>]
<div class="p"><!----></div>
</li>
<li> Latent Dirichlet Allocation[<a href="#Blei:2003p1063" name="CITEBlei:2003p1063">BNJ03</a>]
<div class="p"><!----></div>
</li>
<li> Linear Regression[<a href="#statistics" name="CITEstatistics">FPP07</a>,<a href="#Agresti" name="CITEAgresti">Agr02</a>]
<div class="p"><!----></div>
</li>
<li> Linear Discriminant Analysis[<a href="#Fisher:1936p2576" name="CITEFisher:1936p2576">Fis36</a>]
<div class="p"><!----></div>
</li>
<li> Logistic Regression[<a href="#Agresti" name="CITEAgresti">Agr02</a>,<a href="#Klein:2003p261" name="CITEKlein:2003p261">KM03</a>]
<div class="p"><!----></div>
</li>
<li> Kernel Methods[<a href="#kernel1" name="CITEkernel1">CST00</a>,<a href="#kernel2" name="CITEkernel2">STC04</a>]
<div class="p"><!----></div>
</li>
<li> Maximum Entropy[<a href="#Klein:2003p261" name="CITEKlein:2003p261">KM03</a>,<a href="#Grunwald:2005p108" name="CITEGrunwald:2005p108">Gru05</a>,<a href="#Stern:1989p1480" name="CITEStern:1989p1480">SC89</a>,<a href="#Dudik:2006p954" name="CITEDudik:2006p954">DS06</a>]
<div class="p"><!----></div>
</li>
<li> Naive Bayes[<a href="#Lewis:1998p105" name="CITELewis:1998p105">Lew98</a>]
<div class="p"><!----></div>
</li>
<li> Perceptrons[<a href="#Beigel:2008p1027" name="CITEBeigel:2008p1027">BRS08</a>,<a href="#Dasgupta:2005p2013" name="CITEDasgupta:2005p2013">DKM05</a>]
<div class="p"><!----></div>
</li>
<li> Quantile Regression[<a href="#quantile" name="CITEquantile">Koe05</a>]
<div class="p"><!----></div>
</li>
<li> Ridge Regression[<a href="#Breiman:1997p1133" name="CITEBreiman:1997p1133">BF97</a>]
<div class="p"><!----></div>
</li>
<li> Support Vector Machines[<a href="#kernel1" name="CITEkernel1">CST00</a>]
<div class="p"><!----></div>
</li>
</ul>
<div class="p"><!----></div>
<p> Based on some of the above referenced writing and analysis I would first pick &#8220;logistic regression&#8221; as I am confident that, when used properly, it is just about as powerful as any of the modern data mining techniques (despite its somewhat less than trendy status).  Using logistic regression I immediately get just about as close to a separating line as this data set will support: Figure&nbsp;<a href="#fig:LinearSepartor">3</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg3"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lin1.png" alt="lin1.png" /></p>
<p></center><center>Figure 3: Linear Separator</center><br />
<a name="fig:LinearSepartor"><br />
</a></p>
<div class="p"><!----></div>
<p> The separating line actually encodes a simple rule of the form: &#8220;if 2.2*DiscountFactor + 3.1*MatchFactor &#8805; 1 then we have a good chance of a sale.&#8221;  This is classic black-box data mining magic.  The purpose of this writeup is to look deeper into how to actually derive and understand something like this.</p>
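<p> To make the quoted rule concrete, here is a small Python sketch. The weights 2.2 and 3.1 and the threshold 1 come from the rule itself. Everything else is an assumption for illustration: the function names, and in particular the <tt>scale</tt> parameter, which stands in for the overall size of the fitted logistic coefficients (the rule fixes only their ratios, so the probabilities printed here are not the fitted model&#8217;s).</p>

```python
import math

# Weights read off the rule quoted in the text.
DISCOUNT_WEIGHT = 2.2
MATCH_WEIGHT = 3.1

def rule_score(discount_factor, match_factor):
    """Left-hand side of the separating-line rule."""
    return DISCOUNT_WEIGHT * discount_factor + MATCH_WEIGHT * match_factor

def good_chance_of_sale(discount_factor, match_factor):
    """True when the point lies on or above the separating line."""
    return rule_score(discount_factor, match_factor) >= 1.0

def sale_probability(discount_factor, match_factor, scale=1.0):
    """Logistic curve whose 50% contour is the separating line.

    ASSUMPTION: scale is hypothetical. Any positive value gives the
    same line; it only changes how sharply the odds move away from it.
    """
    z = scale * (rule_score(discount_factor, match_factor) - 1.0)
    return 1.0 / (1.0 + math.exp(-z))

# Two made-up prospects given as (discount factor, match factor).
for discount, match in [(0.2, 0.3), (0.1, 0.1)]:
    print(discount, match,
          good_chance_of_sale(discount, match),
          sale_probability(discount, match))
```

<p> The point of the sketch is that a logistic model and a separating line are two views of the same object: thresholding the modeled probability at one half is the same as checking which side of the line a prospect falls on.</p>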
<div class="p"><!----></div>
<h2><a name="tth_sEc3"><br />
3</a>&nbsp;&nbsp;Explanation</h2>
<div class="p"><!----></div>
<p> What is really going on?  Why is our magic formula at all sensible advice, why did this work at all and what motivates the analysis?  It turns out regression (be it linear regression or logistic regression) works in this case because it somewhat imitates the methodology of linear discriminant analysis (described in: [<a href="#Fisher:1936p2576" name="CITEFisher:1936p2576">Fis36</a>]).  In fact in many cases it would be a better idea to perform a linear discriminant analysis or perform an analysis of variance than to immediately appeal to a complicated method.  I will first step through the process of linear discriminant analysis and then relate it to our logistic regression.  Stepping through understandable stages lets us see where we were lucky in modeling and what limits and opportunities for improvement we have.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg4"><br />
</a><br />
<center></p>
<table>
<tr>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/posDat.png" alt="posDat.png" /></td>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/negDat.png" alt="negDat.png" />
</td>
</tr>
</table>
<p></center><center>Figure 4: Separate Plots</center><br />
<a name="fig:SeparatePlots"><br />
</a></p>
<div class="p"><!----></div>
<p> Our data initially looks very messy (the good and bad groups are fairly mixed together).  But if we examine our data in separate groups we can see we are actually incredibly lucky in that the data is easy to describe.  As we can see in Figure&nbsp;<a href="#fig:SeparatePlots">4</a>: the data, when separated by outcome (plotting only all of the good green disks or only all of the bad red diamonds), is grouped in simple blobs without bends, intrusions or other odd (and more work to model) configurations.</p>
<div class="p"><!----></div>
<p> We can plot the idealizations of these data distributions (or densities) as &#8220;contour maps&#8221; (as if we are looking down on the elevations of a mountain on a map) which gives us Figure&nbsp;<a href="#fig:SeparateDistributions">5</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg5"><br />
</a><br />
<center></p>
<table>
<tr>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/posDist.png" alt="posDist.png" /></td>
<td> <img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/negDist.png" alt="negDist.png" />
</td>
</tr>
</table>
<p></center><center>Figure 5: Separate Distributions</center><br />
<a name="fig:SeparateDistributions"><br />
</a></p>
<div class="p"><!----></div>
<h3><a name="tth_sEc3.1"><br />
3.1</a>&nbsp;&nbsp;Full Bayes Model</h3>
<div class="p"><!----></div>
<p> From Figure&nbsp;<a href="#fig:SeparateDistributions">5</a> we can see that while our data is not separable there are significant differences between the groups.  The difference in the groups is more obvious if we plot the difference of the densities on the same graph as in Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a>.  Here we are visualizing the distribution of positive examples as a connected pair of peaks (colored green) and the distribution of negative examples as a deep valley (colored red) located just below and to the left of the peaks.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg6"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/diff1.png" alt="diff1.png" /></p>
<p></center><center>Figure 6: Difference in Density</center><br />
<a name="fig:DifferenceInDensity"><br />
</a></p>
<div class="p"><!----></div>
<p> This difference graph demonstrates how the two densities or distributions (positive and negative) reach into different regions of the plane.  The white areas are where the difference in densities is very small, which includes the areas in the corners (where there is little of either distribution) and the area between the blobs (where there is a lot of mass from both distributions competing).  This view is a bit closer to what a statistician wants to see: how the distributions of successes and failures differ (this is a step to take before even guessing at or looking for causes and explanations).</p>
<div class="p"><!----></div>
<p> Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a> is already an actionable model: we can predict the odds a new prospect will buy or not at a given discount by looking where they fall on Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a> and checking if they fall in a region of strong red or strong green color.  We can also recommend a discount for a given potential customer by drawing a line at the height determined by their degree of match and tracing from left to right until we first hit a strong green region.  We could hand out a simplified Figure&nbsp;<a href="#fig:FullBayesModel">7</a> as a sales rulebook.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg7"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/bayesModel1.png" alt="bayesModel1.png" /></p>
<p></center><center>Figure 7: Full Bayes Model</center><br />
<a name="fig:FullBayesModel"><br />
</a></p>
<div class="p"><!----></div>
<p> This model is a full Bayes model (but not a Naive Bayes model, which is oddly more famous and which we will cover later).  The steps we took were: first we summarized or idealized our known data into two Gaussian blobs (as depicted in Figure&nbsp;<a href="#fig:SeparateDistributions">5</a>).  Once we had estimated the centers, widths and orientations of these blobs we could then, for any new point, say how likely the point is under the modeled distribution of sales and how likely the point is under the modeled distribution of non-sales.  Mathematically we claim we can estimate P(x,y &#124; sale)<a href="#tthFtNtAAC" name="tthFrefAAC"><sup>2</sup></a> and P(x,y &#124; non-sale) (where x is our discount factor and y is our matching factor).<a href="#tthFtNtAAD" name="tthFrefAAD"><sup>3</sup></a> Neither of these is what we are actually interested in (we want: P(sale &#124; x,y)<a href="#tthFtNtAAE" name="tthFrefAAE"><sup>4</sup></a>).  We can, however, use these values to calculate what we want to know.  Bayes&#8217; law is a law of probability that says if we know P(x,y &#124; sale), P(x,y &#124; non-sale), P(sale) and P(non-sale)<a href="#tthFtNtAAF" name="tthFrefAAF"><sup>5</sup></a> then:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn1.png"/><br />
</center></p>
<p>Figure&nbsp;<a href="#fig:FullBayesModel">7</a> depicts a central hourglass shaped region (colored green) that represents the region of x, y values where P(sale &#124; x,y) is estimated to be at least 0.5; the remaining (darker red) region covers the situations predicted to be less favorable.  Here we are using priors of P(sale) = P(non-sale) = 0.5.  For different priors and thresholds we would get different graphs.</p>
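<p> The steps just described (fit one Gaussian per outcome, then apply Bayes&#8217; law) can be sketched in a few lines of Python with NumPy.  This is an illustration under stated assumptions, not the code behind the figures: the data is synthetic, and the function names and the blob centers and covariances are hypothetical.</p>

```python
import numpy as np


def fit_gaussian(points):
    """Estimate the center (mean) and shape (covariance) of one blob."""
    points = np.asarray(points, dtype=float)
    return points.mean(axis=0), np.cov(points, rowvar=False)


def gaussian_density(point, mean, cov):
    """Density of a multivariate normal distribution at a single point."""
    diff = np.asarray(point, dtype=float) - mean
    dim = mean.shape[0]
    norm = np.sqrt(((2.0 * np.pi) ** dim) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)


def posterior_sale(point, sale_model, non_sale_model, prior_sale=0.5):
    """Bayes' law: P(sale | x,y) from the two class-conditional densities."""
    p_sale = gaussian_density(point, *sale_model) * prior_sale
    p_non = gaussian_density(point, *non_sale_model) * (1.0 - prior_sale)
    return p_sale / (p_sale + p_non)


# Synthetic stand-in data (hypothetical, not the data behind the figures).
rng = np.random.default_rng(0)
sales = rng.multivariate_normal([0.6, 0.6], [[0.02, -0.01], [-0.01, 0.02]], size=200)
non_sales = rng.multivariate_normal([0.3, 0.3], [[0.02, 0.0], [0.0, 0.02]], size=200)

sale_model = fit_gaussian(sales)
non_sale_model = fit_gaussian(non_sales)

# Posterior probability of a sale near each blob center.
near_sales = posterior_sale([0.6, 0.6], sale_model, non_sale_model)
near_non_sales = posterior_sale([0.3, 0.3], sale_model, non_sale_model)
```

<p> Evaluating <tt>posterior_sale</tt> on a grid of x, y values and coloring the points where it is at least 0.5 would give a chart of the same kind as the policy figure, with a different shape since the data differs.</p>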
<div class="p"><!----></div>
<p> Even at this early stage in the analysis we have already accidentally introduced what we call &#8220;an inductive bias.&#8221;  By modeling both distributions as Gaussians we have guaranteed that our acceptance region will be an hourglass figure (as we saw in Figure&nbsp;<a href="#fig:FullBayesModel">7</a>).  One undesirable consequence of the modeling technique is the prediction that sales become unlikely when both match factor and discount factor are very large.  This is partly a consequence of our modeling technique (though the fact that the negative data does not fall off quickly as it passes into the green region also contributes).  This unrealistic (or &#8220;not physically plausible&#8221;) prediction is called an artifact (of the technique and of the data) and it is the statistician&#8217;s job to see this, confirm they don&#8217;t want it and eliminate it (by deliberately introducing a &#8220;useful modeling bias&#8221;).</p>
<div class="p"><!----></div>
<h3><a name="tth_sEc3.2"><br />
3.2</a>&nbsp;&nbsp;Linear Discriminant</h3>
<div class="p"><!----></div>
<p> To get around the bad predictions of our model in the upper-right quadrant we &#8220;apply domain knowledge&#8221; and introduce a useful modeling bias as follows.  Let us insist that our model be monotone: that if moving in some direction is good then moving further in the same direction is better.  In fact let&#8217;s insist that our model be a half-plane (instead of two parabolas).  We want a nice straight separating cut, which brings us to linear discriminant analysis.  We have enough information to apply Fisher&#8217;s linear discriminant technique and find a separator that maximizes the variance of data across categories while minimizing the variance of data within one category and within the other category.  This is called the linear discriminant and it is shown in Figure&nbsp;<a href="#fig:LinearDiscriminant">8</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg8"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lda1.png" alt="lda1.png" /></p>
<p></center><center>Figure 8: Linear Discriminant</center><br />
<a name="fig:LinearDiscriminant"><br />
</a></p>
<div class="p"><!----></div>
<p> The blue line is the linear discriminant (similar to the logistic regression line depicted earlier on the data-slide).  Everything above or to the right of the blue line is considered good and everything below or to the left of the blue line is considered bad.  Notice that this advice, while not quite as accurate as the Bayes model near the boundary between the two distributions, is much more sensible about the upper right corner of the graph.</p>
<div class="p"><!----></div>
<p> To evaluate a separator we collapse all variation parallel to the separating cut (as shown in Figure&nbsp;<a href="#fig:collapse">9</a>).  We then see that each distribution becomes a small interval or streak.  A separator is good if these resulting streaks are both short (the collapse packs the blobs) and the two centers of the streaks are far apart (and on opposite sides of the separator).  In Figure&nbsp;<a href="#fig:collapse">9</a> the streaks are fairly short and despite some overlap we do have some usable separation between the two centers.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg9"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/collapse2.png" alt="collapse2.png" /></p>
<p></center><center>Figure 9: Evaluating Quality of Separating Cut</center><br />
<a name="fig:collapse"><br />
</a></p>
<div class="p"><!----></div>
<p> To make the above precise we switch to mathematical notation.  For the i-th positive training example form the vector v<sub>+,i</sub> and the matrix S<sub>+,i</sub> where</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn2.png"/><br />
</center></p>
<div class="p"><!----></div>
<p> where x<sub>i</sub> and y<sub>i</sub> are the known x and y coordinates for this particular past experience.  Define v<sub>&#8722;,i</sub>, S<sub>&#8722;,i</sub> similarly for all negative examples.  In this notation we have for a direction &#947;: the distance along the &#947; direction between the center of positive examples and center of negative examples is: &#947;<sup>T</sup> ( &#8721;<sub>i</sub> v<sub>+,i</sub> / n<sub>+</sub> &#8722; &#8721;<sub>i</sub> v<sub>&#8722;,i</sub> / n<sub>&#8722;</sub>) (where n<sub>+</sub> is the number of positive examples and n<sub>&#8722;</sub> is the number of negative examples).  We would like this quantity to be large.  The degree of spread or variance of the positive examples along the &#947; direction is &#947;<sup>T</sup> (&#8721;<sub>i</sub> S<sub>+,i</sub> / n<sub>+</sub>) &#947;.  The degree of spread or variance of the negative examples along the &#947; direction is &#947;<sup>T</sup> (&#8721;<sub>i</sub> S<sub>&#8722;,i</sub> / n<sub>&#8722;</sub>) &#947;.  We would like the last two quantities to be small.  The linear discriminant is picked to maximize:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn3.png"/><br />
</center></p>
<p>It is a fairly standard observation (involving the Rayleigh quotient) that this form is maximized when:<br />
<center><br />
<a name="eq:lda"><br />
</a><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn4.png"/><br />
</center></p>
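<p> A minimal NumPy sketch of this computation follows.  It uses the standard textbook form of the Fisher discriminant (inverse of the summed per-class covariances applied to the difference of class means), which we assume matches the equations above up to details of notation and scaling; the function names are hypothetical and the data is synthetic.</p>

```python
import numpy as np


def fisher_direction(positives, negatives):
    """Direction maximizing between-class separation over within-class spread."""
    positives = np.asarray(positives, dtype=float)
    negatives = np.asarray(negatives, dtype=float)
    mean_gap = positives.mean(axis=0) - negatives.mean(axis=0)
    # Sum of the per-class covariance matrices (each divided by its own n).
    within = (np.cov(positives, rowvar=False, bias=True)
              + np.cov(negatives, rowvar=False, bias=True))
    return np.linalg.solve(within, mean_gap)


def fisher_criterion(direction, positives, negatives):
    """Squared gap between projected centers over summed projected variances."""
    p = np.asarray(positives, dtype=float) @ direction
    n = np.asarray(negatives, dtype=float) @ direction
    return float((p.mean() - n.mean()) ** 2 / (p.var() + n.var()))


# Synthetic stand-in data (hypothetical, not the data behind the figures).
rng = np.random.default_rng(1)
positives = rng.multivariate_normal([0.6, 0.6], [[0.03, -0.01], [-0.01, 0.02]], size=150)
negatives = rng.multivariate_normal([0.3, 0.3], [[0.02, 0.005], [0.005, 0.03]], size=100)

gamma = fisher_direction(positives, negatives)
best_score = fisher_criterion(gamma, positives, negatives)
# Scores of a few arbitrary competing directions, for comparison.
other_scores = [fisher_criterion(np.array(d), positives, negatives)
                for d in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0])]
```

<p> The criterion is the &#8220;short streaks, far apart centers&#8221; idea in numeric form: no competing direction should score higher than <tt>gamma</tt> on the same data.</p>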
<div class="p"><!----></div>
<p> As we have said, the linear discriminant is very similar to what is returned by a regression or logistic regression.  In fact in our diagrams the regression lines are almost identical to the linear discriminant.  A large part of why regression can be usefully applied in classification comes from its close relationship to the linear discriminant.</p>
<div class="p"><!----></div>
<h3><a name="tth_sEc3.3"><br />
3.3</a>&nbsp;&nbsp;Linear Regression</h3>
<div class="p"><!----></div>
<p> Linear regression is designed to model continuous functions subject to independent normal errors in observation.  Linear regression is incredibly powerful at characterizing and eliminating correlations between the input variables of a model.  While function fitting is different than classification (our example problem), linear regression is so useful whenever there is any suspected correlation (which is almost always the case) that it is an appropriate tool.  In our example, in the positive examples (those that led to sales) there is clearly a historical dependence between the degree of estimated match and amount of discount offered.  Likely this dependence is from past prospects being subject to a (rational) policy of &#8220;the worse the match the higher the offered discount&#8221; (instead of being arranged in a perfect grid-like experiment as in our first diagram: Figure&nbsp;<a href="#fig:IdealFitting">1</a>).  If this dependence is not dealt with we would under-estimate the value of discount because we would think that discounted customers are not signing up at a higher rate (when these prospects are in fact clearly motivated by discount, once you control for the fact that many of the deeply discounted prospects had a much worse degree of match than average).</p>
<div class="p"><!----></div>
<p> For analysis of categorical data linear regression is closely linked to ANOVA (analysis of variance).[<a href="#Agresti" name="CITEAgresti">Agr02</a>] Recall that variance was a major consideration with the linear discriminant analysis, so we should by now be on familiar ground.</p>
<div class="p"><!----></div>
<p>In our notation the standard least-squares regression solution is:<br />
<center><br />
<a name="eq:leastsquares"><br />
</a><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn5.png"/><br />
</center></p>
<p>where y<sub>+,i</sub> = 1 for all i and y<sub>&#8722;,i</sub> = &#8722;1 for all i.</p>
<div class="p"><!----></div>
<p> If we have the same number of positive and negative examples (i.e.  n<sub>+</sub> = n<sub>&#8722;</sub>) then Equation&nbsp;<a href="#eq:lda">1</a> and Equation&nbsp;<a href="#eq:leastsquares">2</a> are identical and we have &#946; = &#947;.  So in this special case the linear discriminant equals the least squares linear regression solution.  We can even ask how the solutions change if the relative proportions of positive and negative training data change.  The linear discriminant is carefully designed not to move, but the regression solution will tilt to an angle that is more compatible with the larger of the example classes and shift to cut less into that class.  The linear regression solution can be fixed (by re-weighting the data) to also be insensitive to the relative proportions of positive and negative examples but does not behave that way &#8220;fresh out of the box.&#8221;</p>
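<p> The balanced case can be checked numerically.  The sketch below is an illustration on synthetic data with hypothetical names; it assumes an intercept term is included in the regression and uses the textbook discriminant (summed class covariances), so it compares the <em>directions</em> of the two solutions rather than their overall scale.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical balanced data: the same number of positive and negative examples.
positives = rng.multivariate_normal([0.6, 0.6], [[0.03, -0.01], [-0.01, 0.02]], size=150)
negatives = rng.multivariate_normal([0.3, 0.3], [[0.02, 0.005], [0.005, 0.03]], size=150)

# Least squares: regress labels +1 / -1 on (1, x, y).
points = np.vstack([positives, negatives])
design = np.column_stack([np.ones(len(points)), points])
labels = np.concatenate([np.ones(len(positives)), -np.ones(len(negatives))])
beta, *_ = np.linalg.lstsq(design, labels, rcond=None)
regression_direction = beta[1:]  # drop the intercept

# Linear discriminant direction from class means and covariances.
mean_gap = positives.mean(axis=0) - negatives.mean(axis=0)
within = (np.cov(positives, rowvar=False, bias=True)
          + np.cov(negatives, rowvar=False, bias=True))
lda_direction = np.linalg.solve(within, mean_gap)


def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


alignment = cosine(regression_direction, lda_direction)
```

<p> With equal class sizes <tt>alignment</tt> comes out at 1 up to rounding.  Re-running with unequal <tt>size</tt> values is a quick way to watch the regression direction tilt away from the discriminant.</p>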
<div class="p"><!----></div>
<h3><a name="tth_sEc3.4"><br />
3.4</a>&nbsp;&nbsp;Logistic Regression</h3>
<div class="p"><!----></div>
<p> While linear regression is designed to pick a function that minimizes the sum of squared errors, logistic regression is designed to pick a separator that maximizes something called <em>the plausibility of the data</em> (more commonly called the likelihood).  In our case, since the data is so well behaved, the logistic regression line is essentially the same as the linear regression line.  It is in fact an important property of logistic regression that there is always a re-weighting (or choice of re-emphasis) of the data that causes some linear regression to pick the same separator as the logistic regression.  Because linear and logistic regression are only identical in specific circumstances it is the job of the statistician to know which of the two is more appropriate for a given data set and given intended use of the resulting model.</p>
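<p> The re-weighting property can be seen directly in one common way of fitting logistic regression: iteratively re-weighted least squares, where each step is a weighted linear regression.  The sketch below is a bare-bones illustration under stated assumptions (synthetic overlapping data, 0/1 labels, no step-size control or safeguards for perfectly separable data); the function name is hypothetical.</p>

```python
import numpy as np


def fit_logistic_irls(points, labels, iterations=25):
    """Fit logistic regression by repeatedly solving weighted least squares."""
    design = np.column_stack([np.ones(len(points)), points])
    beta = np.zeros(design.shape[1])
    for _ in range(iterations):
        prob = 1.0 / (1.0 + np.exp(-design @ beta))
        # Re-weighting of the data used by this step's linear regression.
        weights = prob * (1.0 - prob)
        # Working response that the weighted regression is fit to.
        response = design @ beta + (labels - prob) / weights
        weighted = design * weights[:, None]
        beta = np.linalg.solve(design.T @ weighted, weighted.T @ response)
    return beta


# Synthetic overlapping data (hypothetical); labels are 1 for sale, 0 otherwise.
rng = np.random.default_rng(3)
positives = rng.multivariate_normal([0.55, 0.55], [[0.02, 0.0], [0.0, 0.02]], size=200)
negatives = rng.multivariate_normal([0.35, 0.35], [[0.02, 0.0], [0.0, 0.02]], size=200)
points = np.vstack([positives, negatives])
labels = np.concatenate([np.ones(200), np.zeros(200)])

beta = fit_logistic_irls(points, labels)

# At the maximum of the likelihood the gradient of the log-likelihood vanishes.
design = np.column_stack([np.ones(len(points)), points])
gradient = design.T @ (labels - 1.0 / (1.0 + np.exp(-design @ beta)))
```

<p> The final set of <tt>weights</tt> is one re-weighting for which a linear regression (on the working response) picks the same separator as the logistic fit.</p>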
<div class="p"><!----></div>
<h2><a name="tth_sEc4"><br />
4</a>&nbsp;&nbsp;Other Methods and Techniques</h2>
<div class="p"><!----></div>
<h3><a name="tth_sEc4.1"><br />
4.1</a>&nbsp;&nbsp;Kernelized Regression</h3>
<div class="p"><!----></div>
<p> One way to greatly expand the power of modeling methods is a trick called kernel methods.  Roughly, kernel methods are those methods that increase the power of machine learning by moving from a simple problem space (like ours in variables x and y) to a richer problem space that may be easier to work in.  A lot of ink is spilled about how efficient the kernel methods are (they work in time proportional to the size of the simple space, not the complex one) but this is not their essential feature.  The essential feature is the expanded explanatory power, and this is so important that even the trivial kernel methods (such as directly adjoining additional combinations of variables) pick up most of the power of the method.  Kernel methods are also overly associated with Support Vector Machines, but are just as useful when added to Naive Bayes, linear regression or logistic regression.</p>
<div class="p"><!----></div>
<p> For instance: Figure&nbsp;<a href="#fig:KernelizedRegression">10</a> shows a bow-tie like acceptance region found by using linear regression over the variables x, y, x<sup>2</sup>, y<sup>2</sup> and x y (instead of just x and y).  Note how this result is similar to the full Bayes model (but comes from a different feature set and fitting technique).</p>
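<p> The &#8220;directly adjoin combinations of variables&#8221; version of the trick is easy to sketch.  In the illustration below the feature list (x, y, x<sup>2</sup>, y<sup>2</sup>, x y) comes from the text; everything else is an assumption: the data is synthetic (positives inside a circle, so no straight line separates them), the names are hypothetical, and accuracy is measured on the training data only.</p>

```python
import numpy as np


def linear_features(points):
    """The original problem space: a constant term plus x and y."""
    return np.column_stack([np.ones(len(points)), points])


def expanded_features(points):
    """The richer space: constant, x, y, x^2, y^2 and x*y."""
    x, y = points[:, 0], points[:, 1]
    return np.column_stack([np.ones(len(points)), x, y, x ** 2, y ** 2, x * y])


def regression_accuracy(design, labels):
    """Fit least squares to +1 / -1 labels and score the sign of the fit."""
    beta, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return float(np.mean(np.sign(design @ beta) == labels))


# Hypothetical data: label +1 inside a circle around the origin, -1 outside.
rng = np.random.default_rng(2)
points = rng.uniform(-1.0, 1.0, size=(400, 2))
labels = np.where((points ** 2).sum(axis=1) > 0.5, -1.0, 1.0)

plain_accuracy = regression_accuracy(linear_features(points), labels)
expanded_accuracy = regression_accuracy(expanded_features(points), labels)
```

<p> The fitting method is unchanged between the two calls; only the feature set differs, which is the point being made about where the power of kernel methods comes from.</p>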
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg10"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/kRegression.png" alt="kRegression.png" /></p>
<p></center><center>Figure 10: Kernelized Regression</center><br />
<a name="fig:KernelizedRegression"><br />
</a></p>
<div class="p"><!----></div>
<h3><a name="tth_sEc4.2"><br />
4.2</a>&nbsp;&nbsp;Naive Bayes Model</h3>
<div class="p"><!----></div>
<p> We briefly return to the Bayes model to discuss a more common alternative called &#8220;Naive Bayes.&#8221;  A Naive Bayes model is like a full Bayes model except an additional modeling simplification is introduced in assuming that P(x,y&#124;sale) = P(x&#124;sale)P(y&#124;sale) and P(x,y&#124;non-sale) = P(x&#124;non-sale)P(y&#124;non-sale).  That is we are assuming that the distributions of the x and y measurements are essentially independent (once we know which outcome happened).  This assumption is the opposite of what we do with regression in that we ignore dependencies in the data (instead of modeling and eliminating the dependencies).  However, Naive Bayes methods are quite powerful and very appropriate in sparse-data situations (such as text classification).  The &#8220;naive&#8221; assumption that the input variables are independent greatly reduces the amount of data that needs to be tracked (it is much less work to track values of variables instead of simultaneous values of pairs of variables).  The curved separator from this Naive Bayes model is illustrated in Figure&nbsp;<a href="#fig:NaiveBayesModel">11</a>.</p>
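<p> In code the Naive Bayes simplification amounts to keeping only a mean and variance per variable (no covariance) and multiplying one-dimensional densities.  The sketch below assumes Gaussian per-variable densities and synthetic data; the function names are hypothetical.</p>

```python
import numpy as np


def fit_naive_gaussian(points):
    """Per-variable means and variances; dependence between x and y is ignored."""
    points = np.asarray(points, dtype=float)
    return points.mean(axis=0), points.var(axis=0)


def naive_density(point, model):
    """P(x,y | class) approximated as P(x | class) * P(y | class)."""
    mean, var = model
    point = np.asarray(point, dtype=float)
    per_variable = np.exp(-0.5 * (point - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return float(np.prod(per_variable))


def naive_posterior_sale(point, sale_model, non_sale_model, prior_sale=0.5):
    """Bayes' law applied to the factored (naive) densities."""
    p_sale = naive_density(point, sale_model) * prior_sale
    p_non = naive_density(point, non_sale_model) * (1.0 - prior_sale)
    return p_sale / (p_sale + p_non)


# Synthetic stand-in data (hypothetical, not the data behind the figures).
rng = np.random.default_rng(4)
sales = rng.multivariate_normal([0.6, 0.6], [[0.02, -0.01], [-0.01, 0.02]], size=200)
non_sales = rng.multivariate_normal([0.3, 0.3], [[0.02, 0.0], [0.0, 0.02]], size=200)

sale_model = fit_naive_gaussian(sales)
non_sale_model = fit_naive_gaussian(non_sales)

near_sales = naive_posterior_sale([0.6, 0.6], sale_model, non_sale_model)
near_non_sales = naive_posterior_sale([0.3, 0.3], sale_model, non_sale_model)
```

<p> Each class is summarized by four numbers here instead of the five a full two-dimensional Gaussian needs; the saving grows quickly as the number of variables grows, which is why the method suits sparse, high-dimensional data.</p>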
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg11"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/naiveBayesModel1.png" alt="naiveBayesModel1.png" /></p>
<p></center><center>Figure 11: Naive Bayes Model</center><br />
<a name="fig:NaiveBayesModel"><br />
</a></p>
<div class="p"><!----></div>
<p> The Naive Bayes version of the advice or policy chart is always going to be an axis-aligned parabola as in Figure&nbsp;<a href="#fig:NaiveBayesDecision">12</a>.  Notice how both the linear discriminant and the Naive Bayes model make mistakes (placing some colors on the wrong side of the curve), but they are simple, reliable models that have the desirable property of having connected prediction regions.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg12"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/naiveBayesModel2.png" alt="naiveBayesModel2.png" /></p>
<p></center><center>Figure 12: Naive Bayes Decision</center><br />
<a name="fig:NaiveBayesDecision"><br />
</a></p>
<div class="p"><!----></div>
<h3><a name="tth_sEc4.3"><br />
4.3</a>&nbsp;&nbsp;More Exotic Methods</h3>
<div class="p"><!----></div>
<p> Many of the hot buzzword machine learning and data mining methods we listed earlier are essentially different techniques of fitting a linear separator over data.  These methods seem very different but they all form a family once you realize many of the details of the methods are determined by:</p>
<div class="p"><!----></div>
<ul>
<li> Choice of Loss Function
<div class="p"><!----></div>
<p> This is what notion of &#8220;goodness of fit&#8221; is being used.  It can be normalized mean-variance (linear discriminants), un-normalized variance (linear regression), plausibility (logistic regression), L1 distance (support vector machines, quantile regression), entropy (maximum entropy), probability mass and so on.</p>
<div class="p"><!----></div>
</li>
<li> Choice of Optimization Technique
<div class="p"><!----></div>
<p> For a given loss function we can optimize in many ways (though most authors make the mistake of binding their current favorite optimization method deep into their specification of technique): EM, steepest descent, conjugate gradient, quasi-Newton, linear programming and quadratic programming to name a few.</p>
<div class="p"><!----></div>
</li>
<li> Choice of Regularization Method
<div class="p"><!----></div>
<p> Regularization is the idea of forcing the model not to pick extreme parameter values that over-fit irrelevant artifacts in the training data.  Methods include MDL, controlling energy/entropy, Laplace smoothing, shrinkage, bagging and early termination of optimization.  Non-explicit treatment of regularization is one reason many methods completely specify their optimization procedure (to get some accidental regularization).</p>
<div class="p"><!----></div>
</li>
<li> Choice of Features/Kernelization
<div class="p"><!----></div>
<p> The richness of the feature set the method is applied to is the single largest determinant of model quality.</p>
<div class="p"><!----></div>
</li>
<li> Pre-transformation Tricks
<div class="p"><!----></div>
<p> Some statistical methods are improved by pre-transforming the outcome data to look more normal or be more homoscedastic.<a href="#tthFtNtAAG" name="tthFrefAAG"><sup>6</sup></a></p>
<div class="p"><!----></div>
</li>
</ul>
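<p> To make the first axis concrete, here is a small sketch of three loss functions written as functions of the margin (the +1/&#8722;1 label times the model score).  The squared loss corresponds to linear regression and the logistic loss to logistic regression; the hinge loss is included as the loss usually associated with support vector machines.  The margin convention and the function names are assumptions made for this illustration.</p>

```python
import numpy as np


def squared_loss(margin):
    """Linear regression on +1 / -1 labels: penalizes any margin other than 1."""
    return (1.0 - margin) ** 2


def logistic_loss(margin):
    """Negative log-likelihood of logistic regression for one example."""
    return np.log1p(np.exp(-margin))


def hinge_loss(margin):
    """Zero once an example is on the correct side with margin at least 1."""
    return np.maximum(0.0, 1.0 - margin)


# Margins from badly wrong (-2) to confidently right (+3).
margins = np.array([-2.0, 0.0, 1.0, 3.0])
loss_table = np.column_stack([
    margins,
    squared_loss(margins),
    logistic_loss(margins),
    hinge_loss(margins),
])
```

<p> Reading across a row of <tt>loss_table</tt> shows the trade-off: squared loss charges for confidently correct predictions, while the logistic and hinge losses do not (or barely do).</p>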
<div class="p"><!----></div>
<p> If you think along a few axes like these (instead of evaluating methods by their name and lineage) you tend to see different data mining methods more as embodying different trade-offs than as being unique incompatible disciplines.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<h2><a name="tth_sEc5"><br />
5</a>&nbsp;&nbsp;Conclusion</h2>
<div class="p"><!----></div>
<p> Our goal for this writeup was to fully demonstrate a data mining method and then survey some important data mining and machine learning techniques.  Many of the important considerations are &#8220;too obvious&#8221; to be discussed by statisticians and &#8220;too statistical&#8221; to be comfortably expressed in terms popular with data miners.  The theory and considerations from statistics when combined with the experience and optimism of data-mining/machine-learning truly make possible achieving the important goal of &#8220;learning from data.&#8221;</p>
<div class="p"><!----></div>
<p>This expository writeup is also meant to serve as an example of the<br />
types of research, analysis, software and training supplied by<br />
Win-Vector LLC <a href="http://www.win-vector.com"><tt>http://www.win-vector.com</tt></a> .  Win-Vector LLC<br />
prides itself on depth of research and specializes in identifying,<br />
documenting and implementing the &#8220;simplest technique that can<br />
possibly work&#8221; (which is often the most understandable, maintainable,<br />
robust and reliable).  Win-Vector LLC specializes in research but<br />
has significant experience in delivering full solutions (including<br />
software solutions and integration with existing databases).</p>
<div class="p"><!----></div>
<p><font size="-1"></p>
<h2>References</h2>
<dl compact="compact">
<dt><a href="#CITEAgresti" name="Agresti">[Agr02]</a></dt>
<dd>
Alan Agresti, <em>Categorical data analysis (wiley series in probability and<br />
  statistics)</em>, Wiley-Interscience, July 2002.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBreiman:1997p1133" name="Breiman:1997p1133">[BF97]</a></dt>
<dd>
Leo Breiman and Jerome&nbsp;H Friedman, <em>Predicting multivariate responses in<br />
  multiple linear regression</em>, Journal of the Royal Statistical Society, Series<br />
  B (Methodological) <b>59</b> (1997), no.&nbsp;1, 3-54.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBlei:2003p1063" name="Blei:2003p1063">[BNJ03]</a></dt>
<dd>
David&nbsp;M Blei, Andrew&nbsp;Y Ng, and Michael&nbsp;I Jordan, <em>Latent dirichlet<br />
  allocation</em>, Journal of Machine Learning Research <b>3</b> (2003),<br />
  993-1022.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBreiman:2000p1134" name="Breiman:2000p1134">[Bre00]</a></dt>
<dd>
Leo Breiman, <em>Special invited paper. additive logistic regression: A<br />
  statistical view of boosting: Discussion</em>, Ann. Statist. <b>28</b> (2000),<br />
  no.&nbsp;2, 374-377.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBeigel:2008p1027" name="Beigel:2008p1027">[BRS08]</a></dt>
<dd>
Richard Beigel, Nick Reingold, and Daniel&nbsp;A Spielman, <em>The perceptron<br />
  strikes back</em>, 6.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEkernel1" name="kernel1">[CST00]</a></dt>
<dd>
Nello Cristianini and John Shawe-Taylor, <em>An introduction to support<br />
  vector machines and other kernel-based learning methods</em>, 1 ed., Cambridge<br />
  University Press, March 2000.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEDasgupta:2005p2013" name="Dasgupta:2005p2013">[DKM05]</a></dt>
<dd>
Sanjoy Dasgupta, Adam&nbsp;Tauman Kalai, and Claire Monteleoni, <em>Analysis of<br />
  perceptron-based active learning</em>, CSAIL Tech. Report (2005), 16.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEDudik:2006p954" name="Dudik:2006p954">[DS06]</a></dt>
<dd>
Miroslav Dudik and Robert&nbsp;E Schapire, <em>Maximum entropy distribution<br />
  estimation with generalized regularization</em>, COLT (2006), 15.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEFisher:1936p2576" name="Fisher:1936p2576">[Fis36]</a></dt>
<dd>
Ronald&nbsp;A Fisher, <em>The use of multiple measurements in taxonomic problems</em>,<br />
  Annals of Eugenics <b>7</b> (1936), 179-188.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEFreund:2003p1009" name="Freund:2003p1009">[FISS03]</a></dt>
<dd>
Yoav Freund, Raj Iyer, Robert&nbsp;E Schapire, and Yoram Singer, <em>An efficient<br />
  boosting algorithm for combining preferences</em>, Journal of Machine Learning<br />
  Research <b>4</b> (2003), 933-969.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEstatistics" name="statistics">[FPP07]</a></dt>
<dd>
David Freedman, Robert Pisani, and Roger Purves, <em>Statistics 4th edition</em>,<br />
  W. W. Norton and Company, 2007.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEGrunwald:2005p108" name="Grunwald:2005p108">[Gru05]</a></dt>
<dd>
Peter&nbsp;D Grunwald, <em>Maximum entropy and the glasses you are looking<br />
  through</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEHalevy:2009p2327" name="Halevy:2009p2327">[HNP09]</a></dt>
<dd>
Alon Halevy, Peter Norvig, and Fernando Pereira, <em>The unreasonable<br />
  effectiveness of data</em>, IEEE Intelligent Systems (2009).</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEKlein:2003p261" name="Klein:2003p261">[KM03]</a></dt>
<dd>
Dan Klein and Christopher&nbsp;D Manning, <em>Maxent models, conditional<br />
  estimation, and optimization</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEquantile" name="quantile">[Koe05]</a></dt>
<dd>
Roger Koenker, <em>Quantile regression</em>, Cambridge University Press, May<br />
  2005.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITELewis:1998p105" name="Lewis:1998p105">[Lew98]</a></dt>
<dd>
David&nbsp;D Lewis, <em>Naive (bayes) at forty: The independence assumption in<br />
  information retrieval</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITENYTStat" name="NYTStat">[Loh09]</a></dt>
<dd>
Steve Lohr, <em>For today’s graduate, just one word: Statistics</em>,<br />
  <a href="http://www.nytimes.com/2009/08/06/technology/06stats.html"><tt>http://www.nytimes.com/2009/08/06/technology/06stats.html</tt></a>, August 2009.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITER:Sarkar:2008" name="R:Sarkar:2008">[Sar08]</a></dt>
<dd>
Deepayan Sarkar, <em>Lattice: Multivariate data visualization with R</em>,<br />
  Springer, New York, 2008, ISBN 978-0-387-75968-5.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEStern:1989p1480" name="Stern:1989p1480">[SC89]</a></dt>
<dd>
Hal Stern and Thomas&nbsp;M Cover, <em>Maximum entropy and the lottery</em>, Journal<br />
  of the American Statistical Association <b>84</b> (1989), no.&nbsp;408,<br />
  980-985.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITESchapire:2001p1019" name="Schapire:2001p1019">[Sch01]</a></dt>
<dd>
Robert&nbsp;E Schapire, <em>The boosting approach to machine learning an<br />
  overview</em>, 23.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEkernel2" name="kernel2">[STC04]</a></dt>
<dd>
John Shawe-Taylor and Nello Cristianini, <em>Kernel methods for pattern<br />
  analysis</em>, Cambridge University Press, June 2004.</dd>
</dl>
<p></font></p>
<div class="p"><!----></div>
<p><center><b>APPENDIX</b><br />
</center></p>
<div class="p"><!----></div>
<h2><a name="tth_sEcA"><br />
A</a>&nbsp;&nbsp;Graphs</h2>
<div class="p"><!----></div>
<p>The majority of the graphs in this writeup were produced using &#8220;R&#8221;<br />
<a href="http://www.r-project.org/"><tt>http://www.r-project.org/</tt></a> and Deepayan Sarkar&#8217;s Lattice<br />
package[<a href="#R:Sarkar:2008" name="CITER:Sarkar:2008">Sar08</a>].</p>
<div class="p"><!----></div>
<hr />
<h3>Footnotes:</h3>
<div class="p"><!----></div>
<p><a name="tthFtNtAAB"></a><a href="#tthFrefAAB"><sup>1</sup></a><br />
<a href="mailto:jmount@win-vector.com"><tt>mailto:jmount@win-vector.com</tt></a><br />
<a href="http://www.win-vector.com/"><tt>http://www.win-vector.com/</tt></a><br />
<a href="http://www.win-vector.com/blog/"><tt>http://www.win-vector.com/blog/</tt></a></p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAC"></a><a href="#tthFrefAAC"><sup>2</sup></a>Read P(A &#124; B) as: &#8220;the probability that A will<br />
  happen given that we know B is true.&#8221;</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAD"></a><a href="#tthFrefAAD"><sup>3</sup></a>Technically we are working with densities, not<br />
  probabilities, but we will use probability notation for its<br />
  intuition.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAE"></a><a href="#tthFrefAAE"><sup>4</sup></a>P(sale &#124; x,y) is the probability of<br />
making a sale as a function of what we know about the prospective<br />
customer and our offer.  Whereas P(x,y&#124;sale) was just how likely it is<br />
to see a prospect with the given x and y values, conditioned on knowing we made<br />
a sale to this prospect.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAF"></a><a href="#tthFrefAAF"><sup>5</sup></a> P(sale) and<br />
  P(non-sale) are just the &#8220;prior probabilities&#8221; of a sale or a non-sale: what<br />
  our estimate of our chances of success is before we look at any<br />
  facts about a particular customer.  We can use our historical<br />
  overall success and failure rates as estimates of these quantities.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAG"></a><a href="#tthFrefAAG"><sup>6</sup></a>A situation is homoscedastic if the size (variance) of the errors does not depend on where we are in the parameter space (our x,y or match factor and discount factor).  This property is very important for meaningful fitting/modeling and for interpreting the significance of fits.</p>
<hr /><small>File translated from<br />
T<sub><font size="-1">E</font></sub>X<br />
by <a href="http://hutchinson.belmont.ma.us/tth/"><br />
T<sub><font size="-1">T</font></sub>H</a>,<br />
version 3.85.<br />On 29 Aug 2009, 11:43.</small></p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

