<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Mathematical Bedside Reading</title>
	<atom:link href="http://www.win-vector.com/blog/tag/mathematical-bedside-reading/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Kernel Methods and Support Vector Machines de-Mystified</title>
		<link>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=kernel-methods-and-support-vector-machines-de-mystified</link>
		<comments>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/#comments</comments>
		<pubDate>Sat, 08 Oct 2011 00:17:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Kernel Methods]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Naive Bayes]]></category>
		<category><![CDATA[Support Vector Machines]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1804</guid>
		<description><![CDATA[We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Goals [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/' rel='bookmark' title='Do your tools support production or complexity?'>Do your tools support production or complexity?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We give a simple explanation of the interrelated machine learning techniques called <a target="_blank" href="http://en.wikipedia.org/wiki/Kernel_methods">kernel methods</a> and <a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>.  We hope to characterize and de-mystify some of the properties of these methods.  To do this we work some examples and draw a few analogies.  The familiar, no matter how wonderful, is not perceived as mystical.<span id="more-1804"></span><br />
<h2>Goals of this writeup</h2>
<ol>
<li>De-mystify  kernel methods and support vector machines<br />
<blockquote><p>
Kernel methods and support vector machines have taken mythological proportions in the machine learning imagination. Partly this is because a number of good ideas are overly associated with them: support/non-support training datums, weighting training data, discounting data, regularization, margin and the bounding of generalization error.  My issue is that these are all important enough ideas to stand on their own and are often seen in simpler settings.  The observations that inform my view are as follows:</p>
<ul>
<li>Kernel methods and support vector machines are in fact two good ideas.  Each is important even without the other: kernels are useful all over and support vector machines would be  useful even if we restricted to the trivial identity kernel.</li>
<li>Small scale &#8220;kernel tricks&#8221; are not that different than the classic technique of adding &#8220;interaction variables.&#8221;  Kernels let you escape from the limits of &#8220;linear hypotheses&#8221; (really by moving to a bigger space where things are again linear but look curved from the point of view of your smaller original space).  We demonstrate the linear methods and &#8220;primal kernel tricks&#8221; in <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a>.</li>
<li>Support/non-support is the central issue of nearest neighbor classifiers.</li>
<li>Weighting training data is most famously shared by logistic regression and boosting.</li>
<li>Re-weighting of data by <i>smoothing kernels</i> (a different but related use of the word &#8220;kernel&#8221;) is central to non-parametric statistics (kernel smoothers and splines).</li>
<li>Regularization is of supreme importance in modeling in general.</li>
<li>Most practitioners get tired of &#8220;kernel shopping&#8221; and fall back to the identity, cosine or radial/Gaussian kernels.</li>
</ul>
</blockquote>
</li>
<li>Show concrete examples of what are and what are not kernels.<br />
Few sources give enough theorems to think about kernels abstractly and fewer still work concrete examples.
</li>
<li>
Use as little math as possible (which, unfortunately, turns out to be quite a bit).  We will discuss encodings and stopping conditions (important to understand what is going on) but avoid explaining the optimizers (the most beautiful part of support vector machines, but also the part that is available pre-packaged in libraries).
</li>
<li>
Call out common magical thinking and unreasonable expectations associated with kernel methods and support vector machines.  This should help the reader be in a better position to &#8220;defend their doubts&#8221; regarding machine learning promises.
</li>
<li>
Try to place all techniques in a wider context (if it is usable in only one place it is a trick, if it is usable in multiple places it is a technique).
</li>
<li>Discuss margin and its impact on generalization error.<br />
<blockquote><p>
Generalization error is an effect of &#8220;over fitting&#8221; where a model has learned things that are true about the training examples but do not hold for the overall truth or concept we are trying to learn (i.e. don&#8217;t generalize).   Generalization error is the excess error rate we observe when scoring new examples versus the error rate we saw in learning the training data.  Margin is in fact a <em>posterior</em> observation.  That is: margin is observed after the training data is seen; it is not known before the data is seen (as, for example, <a target="_blank" href="http://en.wikipedia.org/wiki/VC_dimension">VC dimension</a> is).   Margin is useful as it bounds generalization error, but it is not the <em>prior</em> bound it is often portrayed as.  So we assert margin estimates are not much more special than simple cross-validation estimates, which can also be performed once we have data available.
</p></blockquote>
</li>
</ol>
<p>The goals are not to try to indict or try to cut down kernel methods or support vector machines, but just to dump some of the associated baggage so they can be used fluidly and without anxiety.  </p>
<h2>Example Problem</h2>
<p>Consider the following simple (caricature) machine learning problem:  we are given a number of points labeled as circles and triangles and we want to, given a new point, predict the label.  In our example the only input will be two numbers: x and y.  This example is simple enough that we can depict the entire situation in Figure 1 below:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/TruthAndData.png" alt="TruthAndData.png" border="0" width="480" height="480" /></p>
<p>Figure 1: Truth and Training Data</p>
<p></center></p>
<p>We are assuming (for simplicity of exposition) that we have the incredible luck that the label is indeed a deterministic function of x and y.  We portray this &#8220;ground truth&#8221; as the blue parabola: every point found in this region will be considered to be labeled as &#8220;circle&#8221; and every other point (i.e. the red region) will be labeled as &#8220;triangle.&#8221;  We have overlaid 20 example points (with appropriate shape labels).  These points will be our training data.  We will try to learn from the observed training data a good approximation of the unseen blue and red regions.  This process is called learning or training and the ability to correctly predict the label of new points is called generalization.  To make sure our concept is learnable we are going to further guarantee a moat (or margin): the distribution we are picking training and test examples from will never pick a point from the shaded region between the blue and black.  That is, we will never be asked about an example where x*x is very near y.</p>
<p>To promote visual thinking we are going to avoid formulas until the section titled &#8220;precise functions used&#8221; (where we will list the formulas used for each graph).</p>
<h2>Nearest Neighbor Solution</h2>
<p>An interesting model in these days of big data and fast machines is the nearest neighbor model.  Such a model colors each point in space blue or red as it <em>thinks</em> the point should be classified.  In this case each point is colored the color given by the closest known training point.  This induces the type of model seen in Figure 2 whose boundary is piecewise line segments mid-way between training points.  The data points that determine segments of the classification boundary in this way are thought to support the boundary and are called &#8220;support examples&#8221; or &#8220;support vectors.&#8221;  In fact any training point that is not &#8220;supporting&#8221; part of the boundary is &#8220;irrelevant&#8221; (could be removed and we would get the exact same model).  This split between important and non-important training points is considered one of the important observations from  the theory of support vector machines, but we see it even here in the nearest neighbor models.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1NN.png" alt="1NN.png" border="0" width="480" height="480" /></p>
<p>Figure 2: 1 nearest neighbor model</p>
<p></center></p>
<p>Notice that the nearest neighbor model in Figure 2 is &#8220;not half bad.&#8221;  The blue region is much too wide, but near the training data the blue mass is roughly in the correct position.  If we had infinite data and infinite computational resources this sort of model would be hard to beat (as any point we are likely to be tested on would have a lot of very near examples to work from).  We can try to clean up the shape of the model a bit by using a vote of the 2 nearest points to color each point in the plane (see Figure 3) or even the majority of the 3 nearest points (see Figure 4).  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/2NN.png" alt="2NN.png" border="0" width="480" height="480" /></p>
<p>Figure 3: 2 nearest neighbor model</p>
<p></center></p>
<p>The purple region in Figure 3 is a &#8220;region of uncertainty&#8221; where the net vote over the 2 nearest points is zero (they gave inconsistent advice).  We can use this as a hint that predictions taken from this region are less reliable.  As you may notice, we are not getting real improvements.  We should always spend more time getting more features (useful coordinates in addition to x and y) and more data; and spend less time tweaking models.  But if your data and features are fixed all you can work on is your modeling technique (alternately: your modeling technique is all you can prepare before receiving features and data).  So we will continue to discuss features and technique.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/3NN.png" alt="3NN.png" border="0" width="480" height="480" /></p>
<p>Figure 4: 3 nearest neighbor model</p>
<p></center></p>
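<p>The voting rule behind Figures 2 through 4 can be sketched in a few lines of Python.  This is only an illustration: the six training points below are made up (they follow the rule "circle when y &gt; x*x" but are not the 20 points from Figure 1), and the name <code>knn_vote</code> is ours.  Circles are encoded as +1 and triangles as -1, and a return value of 0 marks the "region of uncertainty" where the votes cancel.</p>

```python
import math

def knn_vote(train, z, k):
    """Net vote of the k training points nearest to z.

    train: list of ((x, y), label) pairs, label +1 (circle) or -1 (triangle).
    Returns +1, -1, or 0 when the votes cancel (the region of uncertainty).
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], z))[:k]
    net = sum(label for _, label in nearest)
    return (net > 0) - (net < 0)

# Hypothetical toy data following the rule "circle when y > x*x".
train = [((0.0, 1.0), +1), ((0.5, 1.5), +1), ((-0.5, 1.2), +1),
         ((1.5, 0.5), -1), ((-1.5, 0.2), -1), ((0.0, -1.0), -1)]

print(knn_vote(train, (0.1, 0.9), 1))  # single nearest neighbor
print(knn_vote(train, (0.1, 0.9), 3))  # majority of the 3 nearest
print(knn_vote(train, (1.0, 1.0), 2))  # the 2 nearest disagree here
```

<p>Sorting all of the training data for every query is the simplest thing that works; for large data sets one would normally use a spatial index instead.</p>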
<p>Nearest neighbor classifiers are close to optimal in the sense that with an infinite amount of data the 1-nearest neighbor classifier has an error rate that is at most twice the Bayes error rate (the Bayes error rate being the ideal error observed on identical repetitions, or the theoretical best error rate) and for large k the k-nearest neighbor method approaches the Bayes error rate itself (see for example <a target="_blank" href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">k-nearest neighbor algorithm</a> ).</p>
<p>However the effectiveness of the nearest neighbor classifiers is coming from the very low dimension (2) of our input variables x and y.  If we were attempting to predict from n variables we might need an amount of data exponential in n to get a useful nearest neighbor classifier.  This is an <a target="_blank" href="http://en.wikipedia.org/wiki/Efficiency_(statistics)">inefficient</a> use of the data; what we want is to use all of the data we have effectively.  To do this we further assume there is some relational reason that the position of the point in the plane determines the shape-label (i.e. that we are learning, generalizing, interpolating and extrapolating, not just memorizing).  We set our ambitions above memorizing and move towards functional modeling.  We start thinking in terms of change: how the label density of examples changes as we move is a good clue as to how it will continue to change.  We can do this either in a parametric or primal formulation (where we attempt to directly infer parameters of an assumed functional form as in regression or logistic regression) or in a non-parametric or dual form (where we attempt to learn relations between points and the training data instead of parameters).  We have discussed this before ( <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a> ) and expanded on primal methods ( <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">Learn Logistic Regression (and beyond)</a>, <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/">The Simpler Derivation of Logistic Regression</a> and <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/">The equivalence of logistic regression and maximum entropy models</a>).</p>
<h2>Functional Solution</h2>
<p>Our next idea is: can we use our training data to tease out a functional form for the relation between x,y and shape label?  The first function we will try is chosen to be similar to the nearest neighbor model we just demonstrated.  For this form we say each training point is the center of a <a target="_blank" href="http://en.wikipedia.org/wiki/Gaussian_function">Gaussian</a> hump (say up for blue/circle and down for red/triangle) which we will call a &#8220;discount function&#8221; (or informally a hump).  A &#8220;Gaussian&#8221; is just a function that falls off exponentially in the square of Euclidean distance from a central point (we can see the simple form of such functions in the &#8220;precise functions used&#8221; section).  The contour lines (or levels as seen looking down from above) of one such hump or discount function are shown in figure 5 (all graphs from here on in will be looking down at a height represented by color and contour lines, much like topographic maps).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3ExampleConcept.png" alt="GaussianKernel3ExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 5: Narrow Gaussian example concept</p>
<p></center></p>
<p>This particular discount function has its maximum (or peak) centered on a data point and then rapidly falls as we move away from the training point.  The only property we will use in this section is that we can evaluate the discount easily (so we don&#8217;t at this time need or make use of any restrictive properties like integrability, positive semi-definiteness or even anything like bounded level sets).<br />
We could make our functional model just the sum of all of these up and down humps over all of the training data.  Each point gets a hump of the same height and same radius, pointing up for blue/circle and down for red/triangle.  The net coloring of this sum function is shown in figure 6.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3KernelSum1.png" alt="GaussianKernel3KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 6: Narrow Gaussian sum model</p>
<p></center></p>
<p>The idea is that this would imitate the nearest neighbor model we have already seen.  This is because even though we are summing over all data the closest training examples should usually dominate (since the Gaussian humps we picked fall to zero as we get further from their centers).   The correspondence to nearest neighbor increases if we tighten the radius of our functions (searching for the right radius is a very important statistical problem called &#8220;bandwidth estimation&#8221;).  Figure 7, for example, shows the sum over much narrower Gaussians.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel20KernelSum1.png" alt="GaussianKernel20KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 7: Very narrow Gaussian sum model</p>
<p></center></p>
<p>Figure 7 looks even more like the nearest neighbor solution.  In fact you could think of the discount-weighted sum over all of the data as the natural model (as it is easier to reason about) and the nearest neighbors as a more efficient approximation (summing only over near points).  We also have seen that we can alter the smoothness of our model (which has consequences on model generalization) by changing the steepness of our discount functions.  We can also build intermediate models (such as building a nearest neighbor but using discount weighted sums instead of uniform voting in the neighborhoods).   We haven&#8217;t yet greatly improved on nearest neighbor, but we have identified a couple of obvious avenues for further improvement: mess with the bandwidths (the obvious idea to try and deal with near/far scale) or re-weight the data (as one of the lessons of the nearest neighbor model is that points that are not near a boundary are less important).   </p>
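<p>To make the link between the sum model and nearest neighbor concrete, here is a minimal Python sketch.  It assumes the hump has the form exp(-gamma * distance squared); the parameter name <code>gamma</code>, the helper names and the five training points are our own illustrative choices, not the data or constants behind Figures 6 and 7.</p>

```python
import math

def gaussian_sum(train, z, gamma):
    """Signed sum of equal-height Gaussian humps, one per training point."""
    return sum(label * math.exp(-gamma * math.dist(u, z) ** 2)
               for u, label in train)

def nearest_label(train, z):
    """Label of the single closest training point (the 1 nearest neighbor model)."""
    return min(train, key=lambda item: math.dist(item[0], z))[1]

# Hypothetical data: one triangle (-1) ringed by four circles (+1).
train = [((0.0, 0.0), -1), ((1.0, 0.0), +1), ((0.0, 1.0), +1),
         ((-1.0, 0.0), +1), ((0.0, -1.0), +1)]
z = (0.3, 0.0)

wide = gaussian_sum(train, z, gamma=0.1)     # wide humps: the four circles outvote the triangle
narrow = gaussian_sum(train, z, gamma=20.0)  # narrow humps: the closest point dominates
print(nearest_label(train, z), wide > 0, narrow < 0)
```

<p>With the wide humps the query point is colored by the surrounding majority; with the narrow humps the sign of the sum agrees with the nearest neighbor label, which is the effect described above.</p>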
<h2>Bandwidth Solution</h2>
<p>As a digression let&#8217;s play with bandwidth a bit first.  Suppose simultaneously for each training datum we picked an ideal bandwidth or steepness of the Gaussian.  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/BandwidthModel.png" alt="BandwidthModel.png" border="0" width="480" height="480" /></p>
<p>Figure 8: Simultaneous bandwidth model</p>
<p></center> </p>
<p>This simultaneous bandwidth model depicted in figure 8 was determined by picking a simultaneous assignment of Gaussian widths (each training datum gets its own width) that maximizes the log-sum of the category signed model function on the training data (much like the logistic regression maximum likelihood decoding).  This differs from typical bandwidth selection problems where all points are given a single common &#8220;best bandwidth.&#8221;  We instead picked a different bandwidth for each point (bandwidths picked to maximize how much of the correct category portion of the sum was correct on each training example).  The blue region is reasonable, but does not match the shape of the true concept.   Partly this is because there has not been enough training data to have learned a lot about the true shape (for example there is no empirical evidence for circle/blue in the top-left quadrant).  The other reason is a bias inherent to this type of model.  If one of the bandwidth functions is wider than all of the others (which is almost certain to occur) then it falls slower than all of the others and for points very far away from the center of the training data this one function is most of what remains.  Or equivalently: due to the nature of this modeling technique one of the concepts learned is going to be the union of bounded islands and the other concept will grab all points sufficiently far away from the center of the training data.  This is a strong (and undesirable) bias, but it is a bias also shared by support vector machines with Gaussian kernels (though support vector machines get it from their so-called &#8220;dc-term&#8221; b not being zero; this will be discussed later). Our bandwidth correction was successful, but frankly the optimization problem we solved to estimate the optimal bandwidths was nasty and would not be something we would advise for a large amount of data.</p>
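<p>The far-field bias is easy to see numerically.  The sketch below does not solve the bandwidth optimization; it simply assumes three hand-picked per-point steepness values (all names and numbers are hypothetical) and checks which label wins near the data and far from it.</p>

```python
import math

def bandwidth_sum(train, z):
    """Signed sum of Gaussians where each training point has its own steepness."""
    return sum(label * math.exp(-gamma * math.dist(u, z) ** 2)
               for u, label, gamma in train)

# Hypothetical (point, label, steepness) triples; the triangle at (1, 0)
# has the smallest steepness, i.e. the widest hump.
train = [((0.0, 0.0), +1, 2.0), ((1.0, 0.0), -1, 0.5), ((0.0, 1.0), +1, 1.0)]

near = bandwidth_sum(train, (0.0, 0.1))
far = [bandwidth_sum(train, z) for z in [(6, 6), (-6, 6), (6, -6), (-6, -6)]]
print(near > 0, all(value < 0 for value in far))
```

<p>Close to the circles the sum is positive, but in every far corner the widest hump (here a triangle) is essentially all that remains, so its label claims everything outside a bounded island.</p>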
<h2>Data weighting solution (support vector machine)</h2>
<p>So instead of messing with the bandwidths let us consider re-weighting the Gaussians in our sum.  We will leave the shape of each Gaussian the same but allow each individual training datum to have a unique weight or importance.  Perhaps by picking the right weights we can get a better model.  By &#8220;best&#8221; we will mean &#8220;best margin&#8221; (to be defined later) because with &#8220;best margin&#8221; as our objective the optimization problem of solving for the best data weights has a particularly beautiful form that can be reliably solved at great scale.  This is called a &#8220;support vector model&#8221; (or support vector machine) and we will describe these ideas in detail later.  But first let&#8217;s see the result.  Figure 9 shows the original narrow Gaussians re-summed according to the weights picked by the support vector method (instead of all being 1 as in figure 6).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3SVM1.png" alt="GaussianKernel3SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 9: Narrow Gaussian support vector model</p>
<p></center></p>
<p>Figure 9 is again an improvement.  The figure is formed by taking a signed sum of our discount functions centered at our training points and weighted by the &#8220;support vector machine weights.&#8221;  The signs are picked with one class encoded as +1 and the other as -1.  The colors red and blue are picked if the sum is above or below a constant &#8220;b&#8221; called &#8220;the dc-term&#8221; (part of the support vector solution).  The solution again looks okay: all the training points are in the correct color region (and you can&#8217;t hope to interpolate let alone extrapolate if you can&#8217;t even reproduce your training concept).  Also, as with the bandwidth model, contours are set sensibly (steep/dense where categories are near each other, shallow/sparse where things are safe).   We see the same inductive bias: one of the concepts is the union of bounded islands (this is always going to be the case unless the model picks b=0).  This bias is not obvious from the kernel choice, as it is hidden in the dc-term of the support vector fitting equations (it is not part of the kernel, but can be defeated by choosing unbounded kernels).  An interesting empirical fact about support vector models is they tend to work well with a larger bandwidth.    For example if we use a wider Gaussian we get an even more convincing model (see Figure 10).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel025SVM.png" alt="GaussianKernel025SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 10: Wide Gaussian support vector model</p>
<p></center></p>
<p>In figure 10 we have also called out one of the important features of support vector machines.   In the literature all datums are called &#8220;vectors&#8221; (as they are represented in coordinates) and a subset of these are called the support vectors.  In figure 10 we have drawn the 4 training examples that turn out to be support vectors as large shapes and all other training examples as small shapes.  The support vectors are the datums with non-negligible weights.  The learned model is a function of the support vectors only.  So after fitting we can discard the other 16 training points. This is similar to the fact that nearest neighbor also only needs the training examples that are supporting its model boundary.</p>
<p>Notice that at this point the support vector models are not &#8220;magically&#8221; better, they in fact look less like the truth we are trying to learn than either the nearest neighbor models or the un-weighted sum models.  To fix this we need to do some &#8220;kernel shopping&#8221; or find functions that better respect what we are trying to model.  With the right functions support vector machines can do a very good job at learning the concept (but knowing &#8220;the right functions&#8221; is a huge hint as to what the concept is).  There is nothing wrong with using a hint, but we do have to have a method to produce the hint or have a reasonable number of hints to choose from. </p>
<p>The &#8220;sum over everything with the same weight&#8221; solutions from the earlier section are very similar to Naive Bayes which also sums over all matching features with no weight adjustment.  The support vector &#8220;use the same model but pick better weights&#8221; stands over these sum models in very much the same way Logistic Regression stands over Naive Bayes.  In fact the optimization problems are very related.  If we consider weights over features or coordinates as &#8220;primal&#8221; and weights over data items as &#8220;dual&#8221; we can informally say something like: support vector machines optimize by working over dual variables and inspecting for primal weights, and logistic regression works over primal variables and inspects for data weights.   But this is a bit vague.</p>
<p>At this point we need to get a bit more explicit and precise.</p>
<h2>Precise functions used</h2>
<p>In general we will write our training data as a sequence of n-vectors.  Our points will be named u(1) &#8230; u(m).   For each i = 1..m we also know which category the training point is labeled with.  We will encode this as y(1) &#8230; y(m) where each y() is either +1 or -1 depending on the training label.  So the example we have been working has n=2 and m=20.  Let z represent a n-vector we wish to classify.</p>
<p>Figure 6 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" alt="E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 7 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" alt="9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 8 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" alt="C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" border="0"  height="50" /><br />
</center><br />
where the w(i) are non-negative numbers picked so that<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/54765538-DADD-4353-A59A-EF3B6C42743E.jpg" alt="54765538-DADD-4353-A59A-EF3B6C42743E.jpg" border="0"  height="50" /><br />
</center><br />
is large.</p>
<p>Figure 9 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" alt="CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so &#8220;the margin&#8221; (discussed later) is large.</p>
<p>Figure 10 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" alt="1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so the margin is large.</p>
<p>All of this repetition is to emphasize the commonality of the models.  They are all of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" alt="1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) are non-negative numbers called the &#8220;support weights&#8221; and k(,) is a function mapping pairs of vectors to numbers and is called a &#8220;kernel.&#8221;  Unfortunately there are many incompatible uses of the word kernel in mathematics and statistics.  Here &#8220;kernel&#8221; is being used in the sense of <a target="_blank" href="http://en.wikipedia.org/wiki/Positive-definite_kernel">positive semi-definiteness</a> or that k(u,u) &ge; 0 for all u.   Note that the support vector machine, instead of using the sign of f(z) as its decision, uses which side of a constant b the value f(z) falls on as its category decision.  A consequence is: for support vector machines if b is non-zero (as it almost surely will be) and the kernels all go to zero as we approach infinity fast enough (as they are designed to do) then exactly one of the learned classes is infinite and the other is a union of islands (regardless of whether this was true for the training data).</p>
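<p>As a rough illustration of the common form, the Python sketch below evaluates a signed, weighted kernel sum and compares it against the dc-term b.  We read the shared form as f(z) = sum over i of a(i) y(i) k(u(i),z); that reading, the Gaussian steepness, the four points and the weights are all assumptions made for the example (the weights are hand-picked, not the output of a support vector solver).</p>

```python
import math

def gaussian_kernel(u, v, gamma=1.0):
    """Gaussian kernel; gamma is an assumed steepness parameter."""
    return math.exp(-gamma * math.dist(u, v) ** 2)

def decision(points, labels, weights, z, b=0.0, kernel=gaussian_kernel):
    """Return +1 or -1 depending on which side of b the weighted sum f(z) falls."""
    f = sum(a * y * kernel(u, z) for u, y, a in zip(points, labels, weights))
    return 1 if f - b > 0 else -1

# Hypothetical training points, labels and hand-picked support weights.
points = [(0.0, 1.0), (0.5, 1.5), (1.5, 0.5), (0.0, -1.0)]
labels = [+1, +1, -1, -1]
weights = [0.8, 0.0, 0.8, 0.0]
b = 0.1

# Keep only the points whose weight is non-zero (the support vectors).
support = [(u, y, a) for u, y, a in zip(points, labels, weights) if a > 0]
s_points, s_labels, s_weights = (list(col) for col in zip(*support))

for z in [(0.0, 0.8), (1.4, 0.4), (10.0, 10.0)]:
    full = decision(points, labels, weights, z, b)
    pruned = decision(s_points, s_labels, s_weights, z, b)
    print(z, full, pruned)
```

<p>Dropping the zero-weight points leaves every decision unchanged, and the far away query lands on the -1 side purely because b is positive: the "one class is a union of islands" effect described above.</p>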
<h2>About Kernels</h2>
<p>A lot of awe and mysticism is associated with kernels, but we think they are not that big a deal.  Kernels are a combination of two good ideas, they have one important property and are subject to one major limitation.  It is also unfortunate that support vector machines and kernels are tied so tightly together.   Kernels are the idea of summing functions that imitate similarity (induce a positive-definite encoding of nearness) and support vector machines are the idea of solving a clever dual problem to maximize a quantity called margin.  Each is a useful tool even without the other.</p>
<p>The two good ideas are related and unfortunately treated as if they are the same.  The two good ideas we promised are:</p>
<ol>
<li>The &#8220;kernel trick.&#8221;  Adding new features/variables that are functions of your other input variables can change linearly inseparable problems into linearly separable problems.  For example if our points were encoded not as u(i) = (x(i),y(i)) but as u(i) = (x(i),y(i),x(i)*x(i),y(i)*y(i),x(i)*y(i)) we could easily find the exact concept ( y(i) &gt; x(i)*x(i) ), which is now a linear concept encoded as the vector (0,1,-1,0,0).</li>
<li>Often you don&#8217;t need the coordinates of u(i).  You are only interested in functions of distances ||u(i)-u(j)||^2 and in many cases you can get at these by inner products and relations like ||u(i)-u(j)||^2 = &lt;u(i),u(i)&gt; + &lt;u(j),u(j)&gt; &#8211; 2&lt;u(i),u(j)&gt; .</li>
</ol>
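<p>Both ideas can be checked directly.  The following Python sketch (the function names are ours) expands a point into the five features listed above, classifies with the linear weight vector (0,1,-1,0,0), and then recovers a squared distance using nothing but inner products.</p>

```python
def expand(x, y):
    """Map (x, y) to the expanded features (x, y, x*x, y*y, x*y)."""
    return (x, y, x * x, y * y, x * y)

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Linear concept in the expanded space: 0*x + 1*y - 1*x*x + 0*y*y + 0*x*y.
w = (0, 1, -1, 0, 0)

def is_circle(x, y):
    """True exactly when y > x*x, computed as a linear test on the expanded features."""
    return dot(w, expand(x, y)) > 0

def squared_distance(u, v):
    """||u - v||^2 written using inner products only."""
    return dot(u, u) + dot(v, v) - 2 * dot(u, v)

print(is_circle(0.5, 1.0), is_circle(2.0, 1.0))
print(squared_distance((1, 2), (4, 6)))
```

<p>The second function never looks at individual coordinates of u or v outside of <code>dot</code>, which is the observation that lets a kernel stand in for the inner product.</p>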
<p>We will expand on these issues later.</p>
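<p>As a small illustration of the two ideas, here is a minimal Python sketch (assuming NumPy is available).  The names <code>phi</code>, <code>in_concept</code>, <code>k_phi</code> and <code>squared_distance_from_kernel</code> are our own illustrative choices, and the two sample points are arbitrary.</p>

```python
import numpy as np

def phi(u):
    """Explicit feature map (x, y) -> (x, y, x*x, y*y, x*y)."""
    x, y = u
    return np.array([x, y, x * x, y * y, x * y], dtype=float)

# In the expanded coordinates the concept y > x*x is linear:
# it is the half-space where <w, phi(u)> is positive.
w = np.array([0.0, 1.0, -1.0, 0.0, 0.0])

def in_concept(u):
    return float(np.dot(w, phi(u))) > 0.0

def k_phi(u, v):
    """Inner product in the expanded space."""
    return float(np.dot(phi(u), phi(v)))

def squared_distance_from_kernel(k, u, v):
    """||phi(u) - phi(v)||^2 computed from kernel evaluations only."""
    return k(u, u) + k(v, v) - 2.0 * k(u, v)

u, v = (0.5, 1.0), (2.0, 1.0)
direct = float(np.sum((phi(u) - phi(v)) ** 2))       # uses coordinates
via_kernel = squared_distance_from_kernel(k_phi, u, v)  # uses only k(.,.)
print(in_concept(u), in_concept(v), direct, via_kernel)
```

<p>The second function never looks at coordinates, so it would work unchanged for any function k(.,.) that behaves like an inner product.</p>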
<p>The important property is that kernels look like inner products in a transformed space.  The definition of a kernel is: there exists a magic function phi() such that for all u,v:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" alt="F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" border="0"  height="20" /> .<br />
</center><br />
This means that k(.,.) is behaving like an inner product in some (possibly unknown) space.  The important consequence is the positive semi-definiteness, which implies k(u,u)&ge;0 for all u (and this just follows from the fact about inner products over the real numbers that &lt;z,z&gt;&ge;0 for all z).  This is why optimization problems that use the kernel as their encoding are well formed (such as the optimization problem of maximizing margin which is how support vector machines work).  You can <a target="_blank" href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">&#8220;regularize&#8221;</a> optimization problems with a kernel penalty because it behaves a lot like a norm.  Without the positive semidefinite property all of these optimization problems would be able to &#8220;run to negative infinity&#8221; or use negative terms (which are not possible from a kernel) to hide high error rates.  The limits of the kernel functions (not being able to turn distance penalties into bonuses) help ensure that the result of optimization is actually useful (and not just a flaw in our problem encoding).</p>
<p>And this brings us to the major limitation of kernels.  The phi() transform can be arbitrarily magic, except that when transforming one vector it doesn&#8217;t know what the other vector is, and phi() doesn&#8217;t even know which side of the inner product it is encoding.   That is, kernels are not as powerful as any of the following forms:</p>
<ul>
<li>snooping (knowing the other):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" alt="1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" border="0"  height="20" />
</li>
<li>positional (knowing which part of inner product mapping to):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" alt="D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" border="0" height="20" />
</li>
<li>fully general:<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" alt="845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" border="0" height="20" />
</li>
</ul>
<p>Not everything is a kernel.</p>
<p>Some non-kernels are:</p>
<ol>
<li> k(u,v) = -c for any c&gt;0<br />
<blockquote><p>
This is a consequence of the fact that no negative number can be written as a sum of squares or as the limit of a sum of squares in the reals.
</p></blockquote>
</li>
<li> k(u,v) = ||u-v||<br />
<blockquote><p>
This can be shown as follows.  Suppose k(u,v) is a kernel with k(u,u) = 0 for all u (which is necessary to try and match ||u-v|| as ||u-u|| = 0 for all u).  By the definition<br />
of kernels k(u,u) = &lt; phi(u), phi(u) &gt; for some real vector valued function phi(.).  But, by the properties of the real inner product &lt;.,.&gt;, this means phi(u) is<br />
the zero vector for all u.  So k(u,v) = 0 for all u,v and does not match ||u-v|| for any u,v such that u &ne; v.
</p></blockquote>
</li>
</ol>
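<p>A quick numerical sanity check is possible here.  Any kernel must produce a positive semi-definite Gram matrix on every finite set of points (the matrix is a matrix of inner products of the phi(.) images), so a single negative eigenvalue is enough to rule a candidate out.  The sketch below assumes NumPy; the helper names and the four sample points are our own illustrative choices.</p>

```python
import numpy as np

def gram_matrix(k, points):
    """Matrix of all pairwise evaluations k(u, v)."""
    return np.array([[k(u, v) for v in points] for u in points])

def min_eigenvalue(k, points):
    """Smallest eigenvalue of the (symmetric) Gram matrix."""
    return float(np.linalg.eigvalsh(gram_matrix(k, points)).min())

def negative_constant(u, v):
    return -1.0

def distance(u, v):
    return float(np.linalg.norm(u - v))

def linear(u, v):
    return float(np.dot(u, v))

points = [np.array(p, dtype=float) for p in [(0, 0), (1, 0), (0, 1), (2, 2)]]

for name, k in [("-1", negative_constant), ("||u-v||", distance), ("<u,v>", linear)]:
    print(name, min_eigenvalue(k, points))
```

<p>The first two candidates show a clearly negative eigenvalue on this sample, while the inner product does not (up to rounding).  Passing such a check on one sample does not prove a function is a kernel; failing it does prove it is not.</p>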
<p>There are some obvious kernels:</p>
<ol>
<li> k(u,v) = c for any c &ge; 0 (non-negative constant kernels) </li>
<li> k(u,v) = &lt; u , v &gt;  (the identity kernel) </li>
<li> k(u,v) = f(u) f(v) for any real valued function f(.) </li>
<li> k(u,v) = &lt; f(u) , f(v) &gt; for any real vector valued function f(.) (again, the definition of a kernel)</li>
<li> k(u,v) = transpose(u) B v where B is any symmetric positive semi-definite matrix</li>
</ol>
<p>And there are several subtle ways to build new kernels from old.  If q(.,.) and r(.,.) are kernels then so are:</p>
<ol>
<li> k(u,v) = q(u,v) + r(u,v)</li>
<li> k(u,v) = c q(u,v) for any c &ge; 0 </li>
<li> k(u,v) = q(u,v) r(u,v) </li>
<li> k(u,v) = q(f(u),f(v)) for any real vector valued function f(.)</li>
<li> k(u,v) = lim_{k-> infinity} q_k(u,v) where q_k(u,v) is a sequence of kernels and the limit exists.</li>
<li> k(u,v) = p(q(u,v)) where p(.) is any polynomial with all non-negative coefficients.</li>
<li> k(u,v) = f(q(u,v)) where f(.) is any function with an absolutely convergent Taylor series with all non-negative coefficients.</li>
</ol>
<p>Most of these facts are taken from the excellent book: John Shawe-Taylor and Nello Cristianini&#8217;s <a target="_blank" href="http://www.kernel-methods.net/">&#8220;Kernel Methods for Pattern Analysis&#8221;</a>, Cambridge 2004.  We are allowing<br />
kernels of the form k(u,v) = &lt; phi(u), phi(v) &gt; where phi(.) is mapping into an infinite dimensional vector space (like a series).  Most of these facts can be checked by<br />
imagining how to alter the phi(.) function.  For example to add two kernels just build a larger vector with enough slots for all of the coordinates for the vectors encoding<br />
the phi(.)&#8217;s of the two kernels you are trying to add.  To scale a kernel by c multiply all coordinates of phi() by sqrt(c).  </p>
<p>Multiplying two kernels is the trickiest.  Without loss of generality assume q(u,v) = &lt; f(u) ,f(v) &gt; and r(u,v) = &lt; g(u), g(v) &gt; and f(.) and g(.) are both mapping into the same finite dimensional vector space R^m (any other situation can be simulated or approximated by padding with zeros and/or taking limits).  Imagine a new vector function p(.) that maps into R^{m*m} such that p(z)_{m*(i-1) + j} = f(z)_i g(z)_j .  It is easy to check that k(u,v) = &lt; p(u) , p(v) &gt; is a kernel and k(u,v) = q(u,v) r(u,v). (The reference proof uses tensor notation and the Schur product, but these are big hammers mathematicians use when they don&#8217;t want to mess around with re-encoding indices).</p>
<p>Note that one of the attractions of kernel methods is that you never have to actually implement any of the above constructions.  What you do is think in terms of sub-routines (easy for computer scientists, unpleasant for mathematicians).  For instance: if you had access to two functions q(.,.) and r(.,.) that claim to be kernels and you wanted the product kernel you would, when asked to evaluate the kernel, just get the results for the two sub-kernels and multiply (so you never need to see the space phi(.) is implicitly working in).</p>
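<p>The sub-routine view and the explicit constructions can be compared directly.  The following Python sketch (assuming NumPy) uses two made-up feature maps <code>f</code> and <code>g</code> into R^3 purely for illustration; the combinators <code>sum_kernel</code>, <code>scale_kernel</code> and <code>product_kernel</code> are likewise our own names.</p>

```python
import numpy as np

def f(z):
    """Hypothetical feature map behind the kernel q."""
    x, y = z
    return np.array([x, y, 1.0])

def g(z):
    """Hypothetical feature map behind the kernel r."""
    x, y = z
    return np.array([x * y, x - y, 2.0])

def q(u, v):
    return float(np.dot(f(u), f(v)))

def r(u, v):
    return float(np.dot(g(u), g(v)))

# The "sub-routine" view: combine kernels without ever touching phi(.).
def sum_kernel(q, r):
    return lambda u, v: q(u, v) + r(u, v)

def scale_kernel(c, q):
    return lambda u, v: c * q(u, v)

def product_kernel(q, r):
    return lambda u, v: q(u, v) * r(u, v)

# The explicit constructions described in the text.
def phi_sum(z):
    return np.concatenate([f(z), g(z)])   # slots for both encodings

def phi_scale(c, z):
    return np.sqrt(c) * f(z)              # multiply coordinates by sqrt(c)

def p(z):
    return np.outer(f(z), g(z)).ravel()   # all products f(z)_i * g(z)_j

u, v = (0.3, -1.2), (2.0, 0.7)
implicit_product = product_kernel(q, r)(u, v)
explicit_product = float(np.dot(p(u), p(v)))
implicit_sum = sum_kernel(q, r)(u, v)
explicit_sum = float(np.dot(phi_sum(u), phi_sum(v)))
implicit_scale = scale_kernel(3.0, q)(u, v)
explicit_scale = float(np.dot(phi_scale(3.0, u), phi_scale(3.0, v)))
print(implicit_product, explicit_product)
```

<p>Each implicit value agrees with its explicit counterpart up to rounding, but only the implicit versions would still be usable if f(.) and g(.) were unknown or infinite dimensional.</p>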
<p>This implicitness can be important (though far too much is made of it).  For example the Gaussians we have been using throughout are kernels, but kernels of infinite dimension, so we can not directly represent the space they live in.  To see the Gaussian is a kernel notice the following:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" alt="926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" border="0"  height="20" /> .<br />
</center> </p>
<p>And for all c &ge; 0 the right-hand side is a product of kernels (the first factor because the Taylor series of exp() is absolutely convergent with non-negative coefficients, and the remaining two factors together because they are of the f(u) f(v) form we listed as obvious kernels).  The trick is magic, but the idea is to use the fact that Euclidean squared distance breaks nicely into dot-products ( ||u-v||^2 = &lt; u, u &gt; + &lt; v, v &gt; &#8211; 2 &lt; u , v &gt;) and exp() converts addition to multiplication.  It is rather remarkable that kernels (which are a generalization of inner products that induce half open spaces) can encode bounded concepts (more on this later when we discuss the VC dimension of the Gaussian).</p>
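<p>The factorization can be checked numerically.  The sketch below assumes the Gaussian is written as exp(-c ||u-v||^2) and that the three factors are exp(2c&lt;u,v&gt;), exp(-c&lt;u,u&gt;) and exp(-c&lt;v,v&gt;), which is what the dot-product identity gives; the value c = 0.5 and the random sample points are arbitrary choices of ours.</p>

```python
import numpy as np

def gaussian(u, v, c=0.5):
    """Gaussian kernel exp(-c * ||u - v||^2)."""
    d = u - v
    return float(np.exp(-c * np.dot(d, d)))

def factored(u, v, c=0.5):
    """The same value written as a product of three factors."""
    return float(
        np.exp(2.0 * c * np.dot(u, v))   # exp() of a scaled inner product
        * np.exp(-c * np.dot(u, u))      # depends on u only
        * np.exp(-c * np.dot(v, v))      # depends on v only
    )

rng = np.random.default_rng(0)
points = rng.normal(size=(8, 2))

# Largest disagreement between the two forms over all pairs.
max_gap = max(abs(gaussian(u, v) - factored(u, v)) for u in points for v in points)

# A kernel should give a positive semi-definite Gram matrix.
gram = np.array([[gaussian(u, v) for v in points] for u in points])
min_eig = float(np.linalg.eigvalsh(gram).min())
print(max_gap, min_eig)
```

<p>The two forms agree to rounding error and the Gram matrix has no negative eigenvalue, as expected.</p>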
<h2>Back to Support Vector Machines</h2>
<p>We can now restate what a support vector machine is: it is a method for picking data weights so that the modeling function for a given kernel has a maximum margin.  The margin is the minimal distance between a training example and the model&#8217;s boundary between blue and red.  Notice the 4 support vectors in figure 10 are all not &#8220;tight up against the boundary,&#8221; but are all the same (minimal distance) from the boundary.  This distance is called the margin and the area near the boundary is obviously a place the model has uncertainty (so it makes sense to keep the boundary as far as possible from the training data).  The great benefit of the support vector machine is that with access only to the data labels and the kernel function (in fact only the kernel function evaluated at pairs of training datums) the support vector machine can quickly solve for the optimal margin and data weights achieving this margin.</p>
<p>For fun let&#8217;s plug in a new kernel called the cosine kernel:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" alt="328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" border="0"  height="43" /><br />
</center><br />
(c &ge; 0).</p>
<p>This kernel can be thought of as having a phi(.) function that takes a vector z and adds an extra coordinate of sqrt(c) and then projects the resulting vector onto the unit sphere.  This kernel induces concepts that look like parabolas, as seen in figure 11.</p>
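<p>The phi(.) description translates directly into code.  In the sketch below (assuming NumPy) the closed form in <code>cosine_kernel</code> is our reading of that description, not a transcription of the formula image, and we take c = 1 so the extended vector is never zero.</p>

```python
import numpy as np

def phi_cosine(z, c=1.0):
    """Append sqrt(c) as an extra coordinate, then project onto the unit sphere."""
    extended = np.append(np.asarray(z, dtype=float), np.sqrt(c))
    return extended / np.linalg.norm(extended)

def cosine_kernel(u, v, c=1.0):
    """Closed form implied by phi_cosine (requires c > 0 or non-zero inputs)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    numerator = np.dot(u, v) + c
    denominator = np.sqrt((np.dot(u, u) + c) * (np.dot(v, v) + c))
    return float(numerator / denominator)

u, v = (1.0, 2.0), (-0.5, 3.0)
via_phi = float(np.dot(phi_cosine(u), phi_cosine(v)))
closed_form = cosine_kernel(u, v)
print(via_phi, closed_form, cosine_kernel(u, u))
```

<p>Because phi(.) lands on the unit sphere, k(u,u) is 1 for every u and every kernel value lies between -1 and 1.</p>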
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelExampleConcept.png" alt="CosineKernelExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 11: Cosine example concept</p>
<p></center></p>
<p>Figure 12 shows the sum-concept (add all discount functions with same weight) model. In this case it is far too broad: averaging the cosine concepts (which are not bounded concepts like the Gaussians) creates an overly wide model.  The model not only generalizes poorly (fails to match the shape of the truth diagram), it also gets some of its own training data wrong.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelKernelSum1.png" alt="CosineKernelKernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 12: Cosine kernel sum model</p>
<p></center></p>
<p>And figure 13 shows the excellent fit returned by the support vector machine.  Actually an excellent fit determined by 4 support vectors (indicated by larger labels).  Also notice the support vector machine using unbounded concepts generated a very good unbounded model (unbounded in the sense that both the blue and red regions are infinite).  By changing the kernel or discount functions/concepts we changed the inductive bias.  So any knowledge of what sort of model we want (one class bounded or not) should greatly influence our choice of kernel functions (since the support vector machine can only pick weights for the kernel functions, not tune their shapes or bandwidths).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelSVM.png" alt="CosineKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 13: Cosine Support Vector model</p>
<p></center></p>
<p>Another family of kernels (which I consider afflictions) are the &#8220;something for nothing&#8221; kernels.  These are kernels of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" alt="C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" border="0" height="20" /><br />
</center><br />
or higher powers or other finishing functions.  The idea is that if you were to look at the expansion of these kernels you would see lots of higher order terms (powers of u_i and v_j)<br />
and if these terms were available to the support vector machine it could use them.  It is further claimed that in addition to getting new features for free you don&#8217;t use up degrees<br />
of freedom exploiting them (so you cross-validate as well as for simpler models).   Both these claims are fallacious: you can&#8217;t fully use the higher order terms because they are entangled with other terms that are not orthogonal to the outcome, and the complexity of a kernel is not quite as simple as degrees of freedom (proofs about support vector machines are stated in terms of margin, not in terms of degrees of freedom or even in terms of VC dimension).  The optimizer in the SVM does try hard to make a good hypothesis using the higher order terms, but due to the forced relations among the terms you get best fits like figure 14.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/SquareKernelSVM.png" alt="SquareKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 14: Squared magic kernel support vector model</p>
<p></center></p>
<p>If you want higher order terms I feel you are much better off performing a primal transform so the terms are available in their most general form.   That is re-encode the vector u = (x,y) as the larger vector (x,y,x*y,x*x,y*y).  You have to be honest: if you are trying to fit more complicated functions you are searching a larger hypothesis space so you need more data to falsify the bad hypotheses.   You can be efficient (don&#8217;t add terms you don&#8217;t think you can use as they make generalization harder) but you can&#8217;t get something for nothing (even with kernel methods).</p>
<h2>Back to margin</h2>
<p>Large margin (having a large distance between the decision boundary and all training examples) is a good idea.  At the simplest level it means anything close to a training data point gets classified correctly (so the model is immune to an amount of fuzz proportional to the margin width).  </p>
<p>We get the following remarkable generalization theorem (this one is the hard SVM version, we weaken the result a bit to make it easier to state).  Suppose we train a model using kernel k(.,.) on m training examples u(1),&#8230;,u(m) with training labels y(1),&#8230;,y(m).  Further assume the hidden true concept is separable with respect to our kernel and data distribution with a margin of at least w (that is for any sample we draw we can find a SVM model that classifies all of the training data correctly and sees a margin of at least w).  Or more simply: assume w is behaving like a constant with respect to m (i.e. it is not going down as we increase sample size). Then with probability at least 1-d the generalization error (that is error seen on new test points not seen during training, but drawn from the same distribution as the training examples) is no more than:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" alt="1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" border="0" height="65" /><br />
.<br />
</center><br />
(we take this from Theorem 7.22 of &#8220;Kernel Methods for Pattern Analysis&#8221;, page 215).</p>
<p>If we further assume the expected value of k(u,u) under the training distribution exists (i.e. we don&#8217;t have too many examples of very high kernel norm according to the distribution we are training with respect to) and is stable (not taking a lot of value from rare instances) then we can read off the meaning of this upper bound.  Assuming we can always build a model from the training data of margin at least w then the generalization error (excess error we expect to see on classifying similar points) falls at a rate inversely proportional to the square root of the amount of training data we have.  This is in one sense amazing: we have a bound on how well we are going to do on data we have not seen.  And this bound is also in some sense &#8220;best possible&#8221; because we need around this much data to even confirm such an accuracy (see <a target="_blank" href="http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/">What is a large enough random sample?</a> ).  Also notice how the dimension of the implicit feature space (which can in fact be infinite) does not enter into the bound.</p>
<p>However, the bound is also in some sense to be expected.  It depends on the following strong assumptions (usually not remembered or stated):</p>
<ol>
<li>Test and training data must be generated in the same manner (i.e. come from the same distribution, this is usually stated but not obeyed).</li>
<li>We must not have collapsing margin as the number of data points goes up (so there must actually be a moat between the positive and negative examples such as portrayed in figure 1 where the blurry boundary was a region we never generated training or test examples from).</li>
<li>The expected value of k(u,u) must be bounded for examples drawn from our training distribution.  For example this is not true even for the identity kernel if x is just numbers drawn from a <a target="_blan" href="http://en.wikipedia.org/wiki/Cauchy_distribution">Cauchy distribution</a>.</li>
</ol>
<p>Also it is a <em>posterior</em> estimate of expected generalization error.  That is, we only get a bound after we plug in some facts found from fitting a model to the training data (the margin and the expected value of k(u,u)).  It is not like the famous <em>prior</em> bounds based on VC dimension that depend only on the structure of the kernel k(,) and not the distribution of training data or labels (beyond the concept being expressible by the kernel, which is itself a <em>posterior</em> fact).  The thing to remember for all posterior estimates is you can get such estimates at a small expense in statistical efficiency (i.e. you waste some data) and a moderate computational blow-up by cross validation (and with essentially no math and many fewer assumptions).  In fact the cross validation and hold-out estimates may be much tighter than calculated bounds (and are certainly easier to explain).</p>
<p>When we say we weakened the result we mean we have added some stated conditions that are not strictly needed for the bound to be true.  One of these assumptions is that the margin is not collapsing (the bound remains true even for a collapsing margin; it is just not useful).  We added this assumption because most readings of the bound (or presumed applications) implicitly assume the bound remains useful (which means something near our additional assumptions must also hold).  The reader is really lured to think of the margin w as a constant depending only on the kernel when in fact it depends on the kernel, data distribution and number of training examples.  For example: consider a very simple one dimensional classification problem:  learning the concept x &ge; 0 from labeled training data drawn uniformly from the interval [-0.5,0.5] using a Gaussian kernel.  The support vector models are going to be essentially picking two neighboring points in the observed training sample that have opposite labels and saying the concept boundary is between them.  In this simple case the margin is collapsing: as m (the number of training examples) increases the expected margin width is O(1/m).  This renders the bound calculated above useless (drives it above 1 as the number of training examples increases).  The model is good but the bound is not useful.  This is because the training sample is discovering points closer and closer to the concept boundary (a necessary discontinuity) as the number of training examples grows.  This is why we had to add the additional (weakening) assumption: not only must the concept to be learned be separable with respect to the kernel we are using, but the data-distribution (determining where test and training data come from) must also stay some minimal distance away from the decision surface (i.e. together the data distribution and underlying concept have a moat, as in our figure 1 example).  
We will phrase this as &#8220;the concept is wide margin separable with respect to the training distribution&#8221; (implicitly we need to know the kernel, but it is the training distribution that is critical). This is a very strong assumption (not true for all classification problems) but is often true when the classification task is to &#8220;un-mix a mixture&#8221; (the observed data is coming from two different sources and the training labels are which source they come from, in this case you often do have a margin).  Obviously a soft-margin SVM deals better with this (some of the issue is due to our use of hard-margin)- but it is an interesting exercise to think why you would expect your learned  model to have a large margin if your underlying concept and training data do not.</p>
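<p>The collapsing margin in the one dimensional example is easy to see by simulation.  The sketch below (assuming NumPy) does not fit a support vector machine; as a stand-in it measures the input-space gap between the largest negative sample and the smallest non-negative sample, which is the interval the learned boundary must fall in.  The function names, trial count and seed are our own choices.</p>

```python
import numpy as np

def boundary_gap(m, rng):
    """Gap between the closest oppositely labeled points around x = 0."""
    x = rng.uniform(-0.5, 0.5, size=m)
    negatives, positives = x[x < 0], x[x >= 0]
    if len(negatives) == 0 or len(positives) == 0:
        return float("nan")  # one class is missing, so there is no gap to measure
    return float(positives.min() - negatives.max())

def mean_gap(m, trials=2000, seed=0):
    """Average gap over repeated samples of size m (ignoring degenerate draws)."""
    rng = np.random.default_rng(seed)
    return float(np.nanmean([boundary_gap(m, rng) for _ in range(trials)]))

gaps = {m: mean_gap(m) for m in (10, 100, 1000)}
for m, gap in gaps.items():
    print(m, gap, m * gap)
```

<p>The product of m and the average gap stays roughly constant while the gap itself shrinks, which is the O(1/m) behavior described above.</p>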
<p>The result remains important: support vector machines are in some sense an optimal learning procedure (up to some constants they achieve as tight a generalization error as can even be confirmed for a given training set size).  Another strong point of the support vector result is the result applies even for kernels with unbounded VC dimension (such as the Gaussian kernel).</p>
<h2>The Gaussian (or radial) kernel again</h2>
<p>We said the Gaussian kernel was of infinite VC dimension.  But we have not shown why it is true.   The fact that the only encoding we could think of off the top of our head was a power-series or a limit does not guarantee there isn&#8217;t some more clever encoding that easily demonstrates that the Gaussian kernel is of bounded VC dimension (like the identity kernel and cosine kernels are). </p>
<p>However we can see the Gaussian kernel can not have finite VC dimension because it can place an arbitrary number of bounded patches in the plane.  Thus we can build arbitrarily large sets (by placing all points far apart) that can be shattered (thus falsifying any bounded VC dimension).</p>
<p>One of the wonders of the support vector method is that the theory works even in this situation.</p>
<h2>Back to SVM</h2>
<p>Support vector machines seem nearly magical in their power to pick &#8220;best weights.&#8221;  However, as we have seen, these best weights may not always be much better than typical good weights (for instance: using all uniform weights).  Also it is very important to remember the support vector machine is picking the best weighting of a sum of already picked kernels.  It is not picking the best kernel shape or adjusting things like bandwidth (you must pick the best kernel ahead of time or try several kernels to find a good one).  You must remember support vector machines can only directly adjust dual weights (or relative training data weighting).  This can effect some changes in hypothesis shape, but the support vector machine can not directly implement arbitrary changes in hypothesis shape like a nearest neighbor or a parameterized primal method can (like logistic regression with the kernel trick of adding interaction variables). </p>
<p>Support vector machines are to be much preferred to other combiners like <a target="_blan" href="http://en.wikipedia.org/wiki/Boosting">Boosting</a>.  This is because support vector machines have explicit stopping criteria (conditions known to be true at the optimal solution that characterize the optimal solution) and explicit regularization (control of over fitting).  Boosting relies on a path argument (&#8220;do the computation in these stages&#8221;) so it is much harder to reason about and frankly the folklore that &#8220;early stop prevents over fitting in boosting&#8221; is just false (you are much better off explicitly regularizing).</p>
<p>It is also lore that random kernel transformations (like power series over the cosine kernel) are magic.  They can make data separable but they are unlikely to meet the (unstated but critical) conditions of the support vector theorem (margin being wide, margin not filling in and expectation of k(.,.) being small).</p>
<h3>A comment on optimization</h3>
<p>In this writeup we have skipped the most beautiful part of support vector machines- the form of the optimization problem used to solve them.  We strongly suggest consulting  John Shawe-Taylor and Nello Cristianini&#8217;s &#8220;Kernel Methods for Pattern Analysis&#8221;, Cambridge 2004 for this.  The writing is dense but accurate.  You can literally paste their formulas into a general solver and they work.</p>
<h2>Conclusion</h2>
<p>We have shown how kernel methods and support vector machines fit in a sequence of machine learning methods:</p>
<ol>
<li>nearest neighbor models</li>
<li>uniform sum models</li>
<li>support vector machines.</li>
</ol>
<p>We also showed a number of examples of kernels and non-kernels.  It is good to be able to quickly remember the constant &#8220;-1&#8221; is not a kernel and the constant &#8220;1&#8221; is a kernel (though not a very useful one).</p>
<p>It is our argument that support vector machines are &#8220;dual methods&#8221; (as they work in weights over the data instead of weights over coordinates in the features space) and the move from the primal algorithm Naive Bayes to Logistic regression is similar to the move from uniform kernel sums to support vector machines.</p>
<p>Also, if you want to directly manipulate shape (like picking bandwidth) you must in some sense use a primal method (this isn&#8217;t what support vector machines are for).</p>
<p>We end with: If you have some feeling of both what a method can and can not do then you have some understanding of the method.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What Did Theorists Do Before The Age Of Big Data?</title>
		<link>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-did-theorists-do-before-the-age-of-big-data</link>
		<comments>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/#comments</comments>
		<pubDate>Mon, 02 Aug 2010 18:42:45 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Age of Big Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Mean]]></category>
		<category><![CDATA[Mean of Medians]]></category>
		<category><![CDATA[Median]]></category>
		<category><![CDATA[Median of Means]]></category>
		<category><![CDATA[Theorist]]></category>
		<category><![CDATA[Winsorized mean]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1514</guid>
		<description><![CDATA[We have been living in the age of &#8220;big data&#8221; for some time now. This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: &#8220;The Unreasonable Effectiveness of Data&#8221; Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)). But [...]
]]></description>
			<content:encoded><![CDATA[<p>We have been living in the age of &#8220;big data&#8221; for some time now.  This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: &#8220;The Unreasonable Effectiveness of Data&#8221; Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)).  But I have gotten to thinking about the period before this.   The period before we had easy access to so much data, before most computation was aggregation and before we accepted numerical analysis style convergence as &#8220;efficient.&#8221;  A small problem I needed to solve (as part of a bigger project)  reminded me what theoretical computer scientists did then: we worried about provable worst case efficiency.</p>
<p><span id="more-1514"></span><br />
The problem that got me thinking is this: </p>
<p>Given a sequence of n integers x1 through xn and an integer k (1 &le; k &le; n), find the mean value of all of the medians of the k-sized selections from x1 through xn.  Or as a formula:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/EMedian.png" alt="EMedian.png" border="0" width="220" /><br />
</center></p>
<p>where x_s is defined as the sequence of integers whose indices are in the set s (not necessarily a contiguous sequence).   The median is the &#8220;value in the middle&#8221; (a value such that half of the selected data are above it and half are below) and &#8220;(n choose k)&#8221; is the number of ways to choose k items from a collection of n items (which is just the number: n!/((n-k)! k!)).  So our sum is adding up a number of terms and then dividing by the number of such terms to get the average or mean of the terms.  We will call this sum a &#8220;mean of medians&#8221;.</p>
<p>Some obvious special cases are: for k=1 the<br />
expression simplifies to the sum of the x_i divided by n (or just the mean of all the x_i) and for k=n it simplifies to the median of all of the x_i.  For intermediate values of k it is not immediately obvious how to efficiently compute the value of the sum.  Directly adding all (n choose k)  terms (as the sum is written) would be very slow for large n with even moderate sized k.  Instead we look to find some method that has the same value of the total sum, without directly computing all of the terms.</p>
<p>This gets us to the ad-hoc side of theoretical computer science.  We need a clever idea.  In this case the idea is simple.  To keep things simple: suppose all of the n integers in our sequence have different values and that k is odd (neither of these conditions is important; they just let us avoid some non-essential technicalities).  What values can median(x_s) take on? median(x_s) is always x_i for some i in the subset s.  In fact our sum is equivalent to:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/Sum2.png" alt="Sum2.png" border="0" width="330"  /><br />
</center></p>
<p>This new sum has a reasonable number of terms- so we can actually calculate it directly if we knew the values of the terms.  Without loss of generality assume the x_i are sorted in increasing order.  Then the number of times x_i is the median of some x_s is exactly:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/term.png" alt="term.png" border="0" width="191" /><br />
</center><br />
(and 0 for i &lt; 1+(k-1)/2 or i &gt; n &#8211; (k-1)/2).  This is just the number of ways to place x_i in the center of a subset- by choosing (k-1)/2 smaller neighbors and (k-1)/2 larger neighbors.   The count is given by multiplying the number of ways to choose smaller neighbors by the number of ways to choose the larger neighbors.</p>
<p>The complete solution calculating the mean of medians for distinct sorted x_i is:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/fullsum1.png" alt="fullsum.png" border="0" width="333"  /><br />
</center></p>
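<p>The counting argument above can be sketched in a few lines of Python. This is only an illustration under the same assumptions as the text (distinct values, odd k); the function names are ours and are not from any library. The brute-force version adds up all (n choose k) terms as the sum is written, while the fast version weights each sorted x_i by the number of subsets in which it is the median.</p>

```python
from itertools import combinations
from math import comb
from statistics import median


def mean_of_medians_brute(xs, k):
    """Average the median over every size-k subset (exponentially many terms)."""
    subsets = list(combinations(xs, k))
    return sum(median(s) for s in subsets) / len(subsets)


def mean_of_medians_fast(xs, k):
    """Closed form for distinct values and odd k: weight each sorted x_i by
    the number of size-k subsets in which it is the median."""
    xs = sorted(xs)
    n, h = len(xs), (k - 1) // 2
    total = 0
    for i, x in enumerate(xs, start=1):  # i is the 1-based rank of x
        # choose h smaller neighbors from the i-1 values below x
        # and h larger neighbors from the n-i values above x
        total += comb(i - 1, h) * comb(n - i, h) * x
    return total / comb(n, k)
```

<p>For small inputs the two functions should agree up to floating point rounding; only the second remains practical as n grows, since it needs one sort and n terms.</p>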
<p>A statistician would recognize this expression as a kind of centrally weighted Winsorized mean.  The shape of the graph of weights (in this case for n=10, k=5) is suggestive of<br />
a bounded normal window (though i is a rank, not a free-ranging value):</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/10w5.png" alt="10w5.png" border="0" width="525" height="525" /><br />
</center></p>
<p>Likely we have re-invented a data treatment known to statisticians.  But the above steps were really just combinatorics.  What a theorist does is abstract something down to this sort of problem and think of variations and solutions.   The formal side of theoretical computer science is attempting to organize all of the problems in the world into two classes: problems presumed easy and problems presumed hard.</p>
<p>For example: what if we had wanted to know the median of many means instead of the mean of many medians?<br />
It turns out a small variation of the median of means problem is already known to be difficult.  The hard version of the reversed problem is called &#8220;Kth largest subset&#8221; (this is a different K than we have been using up until now).   The Kth largest subset problem is: given a sequence of integers and constants K and B, do K or more distinct subsets have a sum of no more than B?  The Kth largest subset problem is known to be &#8220;NP hard&#8221; which means that a whole host of other problems thought to be hard could be solved if we had the ability to solve this problem (see &#8220;Computers and Intractability: A Guide to the Theory of NP-Completeness&#8221; Michael R. Garey and David S. Johnson, 1979).  The median of many means is not quite as expressive as the Kth largest subset problem (so we have <em>not</em> proven the median of many means is itself NP hard) but the relation is strong: to even verify that we had the right median we would need to solve a Kth largest subset problem with K=(n choose k)/2 and B= k*the_claimed_median (assuming we modified the subset problem to restrict itself to size k sub-sequences).   In fact even checking if a given number is one of the possible means (let alone if it is the median of the means) is equivalent to the NP Complete knapsack problem.  This correspondence makes us suspect that something as simple as the trick we devised for the mean of medians problem will not solve the median of means problem.  One should not take this argument too seriously though, as the same argument would seem to apply to the pair of problems &#8220;min of means&#8221; and &#8220;mean of mins&#8221;, both of which are in fact easy.  We have not proven the median of means problem to be hard; we have just found relevant algorithms and related problems.  </p>
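<p>To make the verification step concrete, here is a brute-force Python sketch (our own illustrative function names, not a known efficient method). It computes the median of means by enumerating every size-k subset, and counts how many size-k subsets have a sum of no more than a bound B, which is the size-restricted Kth largest subset question described above. Both run in time proportional to (n choose k), which is exactly the cost we would like to avoid.</p>

```python
from itertools import combinations
from statistics import median


def median_of_means_brute(xs, k):
    """Median over all size-k subsets of the subset mean (exponential time)."""
    return median(sum(s) / k for s in combinations(xs, k))


def count_subsets_with_sum_at_most(xs, k, bound):
    """Count size-k subsets (chosen by position) whose sum is no more than bound."""
    return sum(1 for s in combinations(xs, k) if sum(s) <= bound)
```

<p>Checking a claimed median m then amounts to asking whether at least half of the (n choose k) subsets have sum no more than k*m, and this sketch can only answer that by looking at every subset.</p>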
<p>What theorists do is find these analogs and then do the heavy lifting (continue the work even when the solution is not simple) to find a way to either efficiently solve the problem at hand or prove it is in fact as hard as other important problems.  This kind of thing is hard work for small gains, but the value accumulates because each of these gains is permanent.  Finally additional variations of the problem are tried and characterized, to help check we are not &#8220;leaving money on the table&#8221; (missing nearby improvements).  Some of this may seem like pointless work on toy problems, but every once in a while one of these solutions or characterizations is exactly what is needed to take a large system (like a search engine) to the next level of performance.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Gradients via Reverse Accumulation</title>
		<link>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=gradients-via-reverse-accumulation</link>
		<comments>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 00:00:04 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Automatic Differentiation]]></category>
		<category><![CDATA[Conjugate Gradient]]></category>
		<category><![CDATA[Gradient]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Reverse Accumulation]]></category>
		<category><![CDATA[Scala]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1493</guid>
		<description><![CDATA[We extend the ideas from Automatic Differentiation with Scala to include reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients. As the tables, diagrams and equations do not translate well into HTML, our full article is available here in PDF: [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/' rel='bookmark' title='Automatic Detection of Potential Deadlock'>Automatic Detection of Potential Deadlock</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We extend the ideas from <a target="ext" href="http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/">Automatic Differentiation with Scala</a> to include <em>reverse accumulation</em>.  Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients.<span id="more-1493"></span><br />
As the tables, diagrams and equations do not translate well into HTML, our full article is available here in PDF: <a href="http://www.win-vector.com/dfiles/ReverseAccumulation.pdf">http://www.win-vector.com/dfiles/ReverseAccumulation.pdf</a>.</p>
<p>The purpose of our article is to explain reverse accumulation automatic differentiation clearly (and to release some sample code and timing results).  A side effect of the article is to make sense of the following two diagrams:</p>
<p>If the following is a picture of standard or forward differentiation:</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/cutFwd.png" alt="cutFwd.png" border="0" width="408" height="677" /></p>
<p>then the following is a picture of reverse accumulation:</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/cutRev.png" alt="cutRev.png" border="0" width="487" height="739" /></p>
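<p>For readers who want the idea in code before opening the PDF, here is a minimal Python sketch of reverse accumulation. It is not the Scala code from the article; the class <code>Var</code> and the functions <code>sin</code> and <code>backward</code> are hypothetical names chosen for illustration. Each node records its inputs together with the local partial derivative with respect to each input, and a single backwards sweep from the output then yields the derivative with respect to every input at once.</p>

```python
import math


class Var:
    """A node in the computation graph: a value plus links to its inputs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs (input_node, d(self)/d(input_node))
        self.grad = 0.0         # will hold d(output)/d(self) after backward()

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))


def backward(output):
    """One reverse sweep: push d(output)/d(node) from the output to the inputs."""
    order, seen = [], set()

    def visit(node):  # depth-first topological ordering (inputs first)
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # output first, inputs last
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule, accumulated


x, y = Var(2.0), Var(3.0)
f = x * y + sin(x)  # f(x, y) = x*y + sin(x)
backward(f)
# x.grad now holds df/dx = y + cos(x); y.grad holds df/dy = x
```

<p>The point of the design is that the cost of the sweep is proportional to the size of the graph, not to the number of input variables, which is where the speedup for gradients comes from.</p>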
<hr/>
<p>Example code now distributed from: <a target="_blank" href="https://github.com/WinVector/AutoDiff">github.com/WinVector/AutoDiff</a>.</p>
<hr/>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/' rel='bookmark' title='Automatic Detection of Potential Deadlock'>Automatic Detection of Potential Deadlock</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Easy&#8221; Portfolio Allocation</title>
		<link>http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=easy-portfolio-allocation</link>
		<comments>http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/#comments</comments>
		<pubDate>Thu, 14 Jan 2010 20:09:13 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Finance]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Lagrange Multipliers]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Portfolio Theory]]></category>
		<category><![CDATA[Sharpe Ratio]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1342</guid>
		<description><![CDATA[This is an elementary mathematical finance article. This means if you know some math (linear algebra, differential calculus) you can find a quick solution to a simple finance question. The topic was inspired by a recent article in The American Mathematical Monthly (Volume 117, Number 1 January 2010, pp. 3-26): &#8220;Find Good Bets in the [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/08/fast-portfolio-re-balancing-as-a-fractional-linear-program/' rel='bookmark' title='Fast Portfolio re-Balancing as a Fractional Linear Program'>Fast Portfolio re-Balancing as a Fractional Linear Program</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>This is an elementary mathematical finance article. This means if you know some math (linear algebra, differential calculus) you can find a quick solution to a simple finance question. The topic was inspired by a recent article in The American Mathematical Monthly (Volume 117, Number 1 January 2010, pp. 3-26): &#8220;Find Good Bets in the Lottery, and Why You Shouldn&#8217;t Take Them&#8221; by Aaron Abrams and Skip Garibaldi which said optimal asset allocation is now an undergraduate exercise. That may well be, but there are a lot of people with very deep mathematical backgrounds who have yet to see this. We will fill in the details here. The style is terse, but the content should be about what you would expect from one day of lecture in a mathematical finance course.</p>
<p><span id="more-1342"></span></p>
<p>Portfolio allocation is not the &#8220;magic predict the future&#8221; part of finance, it is the scheme for correctly applying magic predictions of the future. The idea is that if you had a prediction of future returns of a number of assets, the naive thing to do would be to invest everything into the asset with the highest predicted return. Portfolio theory, while still taking the predictions at face value, picks an investment pattern that will (in risk-adjusted dollars) outperform the naive strategy even if the predictions are correct and is a bit safer when the predictions are wrong.</p>
<p>Suppose you had <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg1.png" alt="$ n$"> different assets you could invest in. For the <img width="10" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg2.png" alt="$ i$"> -th asset there is an expected excess relative return of <img width="19" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg3.png" alt="$ \mu_i$"> and an estimated variance of <img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg4.png" alt="$ s_i$"> (for a definition of relative return see <a href="http://www.win-vector.com/blog/2010/01/relative-returns-a-banker-versus-trader-paradox/">Relative returns: a banker versus trader paradox</a> and for a definition of variance see <a href="http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/">A Quick Appreciation of the Sharpe Ratio</a>). Let the vector <img width="16" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg5.png" alt="$ w$"> be such that <img width="23" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg6.png" alt="$ X_i$"> represents the number of dollars we invest in the <img width="10" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg2.png" alt="$ i$"> -th asset. If <img width="23" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg6.png" alt="$ X_i$"> is positive then our plan is &#8220;to go long&#8221; or buy some of the <img width="10" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg2.png" alt="$ i$"> -th asset. 
If <img width="23" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg6.png" alt="$ X_i$"> is negative our plan is &#8220;to short&#8221; or sell some of the <img width="10" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg2.png" alt="$ i$"> -th asset to somebody else (It is called going short as we actually sell something we do not have. This is often allowed in finance; as long as we make the same pay-outs to the buyer that the buyer would receive if we really had the item to sell).</p>
<p>When we appeal to the idea of optimizing the portfolio Sharpe Ratio (again, see <a href="http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/">A Quick Appreciation of the Sharpe Ratio</a>) then we say a good portfolio is one that doesn&#8217;t just maximize expected relative returns (which is <img width="39" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg11.png" alt="$ X^{\top} \mu$"> ) but maximizes the ratio of expected relative return to standard deviation:</p>
<div align="center"><img width="73" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg8.png" alt="$\displaystyle \frac{X^{\top} \mu}{\sqrt{X^{\top} C X}} $"></div>
<p>where (for now) <img width="17" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg9.png" alt="$ C$"> is the matrix <img width="30" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg10.png" alt="$ s s^{\top}$"> . This ratio is called a &#8220;risk adjusted return&#8221; (versus the un-adjusted form <img width="39" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg11.png" alt="$ X^{\top} \mu$"> ). Also notice that the ratio is homogeneous in <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> (doubling <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> does not change the ratio as it simultaneously doubles the numerator and the denominator) so an optimal solution <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> describes not how much to invest, but what pattern to invest in. This allows us to introduce an important practical constraint: we are only going to allow ourselves to risk a total of <img width="16" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg13.png" alt="$ T$"> dollars (both long and short). That is: we insist <img width="105" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg14.png" alt="$ \sum_{i=1}^{n} \vert X_i\vert = T$"> . We will ignore this total investment constraint until the end when we can satisfy the constraint by simply re-scaling a partial solution.</p>
<p>To solve for <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> we introduce an old friend: <a href="http://en.wikipedia.org/wiki/Lagrange_multipliers">Lagrange Multipliers</a> (or equivalently the Karush-Kuhn-Tucker conditions of optimality). Since the fraction we are trying to optimize is homogeneous in <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> we can convert the denominator into a constraint and arbitrarily insist that <img width="99" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg15.png" alt="$ \sqrt{X^{\top} C X} = 1$"> without changing the nature of the problem. We are now trying to maximize <img width="39" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg11.png" alt="$ X^{\top} \mu$"> subject to <img width="99" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg15.png" alt="$ \sqrt{X^{\top} C X} = 1$"> . The Lagrangian conditions of optimality state that at the optimum the gradient of the objective must be proportional to the gradient of the constraint, or:</p>
<div align="center"><img width="225" height="40" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg16.png" alt="$\displaystyle \nabla_X X^{\top} \mu = \lambda \nabla_X ( \sqrt{X^{\top} C X} - 1 ) $"></div>
<p>for some (to be determined) constant <img width="13" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg17.png" alt="$ \lambda$"> . Pushing the gradient operator through we get:</p>
<div align="center"><img width="213" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg18.png" alt="$\displaystyle \mu = \lambda (1/2) ( X^{\top} C X )^{-1/2} 2 C X . $"></div>
<p>A similar equation could be gotten by appealing to a Rayleigh Quotient argument.</p>
<p>We do not yet know <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> (that is what we are trying to solve for), so we do not know what <img width="56" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg19.png" alt="$ X^{\top} C X$"> is. However, this is just a scalar and since we are just trying to solve up to a multiple we can throw it out and introduce a new multiple and see that it is enough to solve:</p>
<div align="center"><img width="76" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg20.png" alt="$\displaystyle \mu = \lambda' C X $"></div>
<p>where <img width="18" height="16" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg21.png" alt="$ \lambda'$"> is a new (still unknown) scalar. This means we have:</p>
<div align="center"><img width="121" height="35" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg22.png" alt="$\displaystyle X = (1/\lambda') C^{-1} \mu $"></div>
<p>so our desired solution is some re-scaling of <img width="43" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg23.png" alt="$ C^{-1} \mu$"> .</p>
<p>As we stated earlier we have a total investment constraint of <img width="105" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg14.png" alt="$ \sum_{i=1}^{n} \vert X_i\vert = T$"> . We can achieve this with the following adjusted solution:</p>
<div align="center"><img width="189" height="51" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg24.png" alt="$\displaystyle X = \frac{T}{\sum_{i=1}^{n} \vert(C^{-1} \mu)_i\vert} C^{-1} \mu $"></div>
<p>as our desired optimal portfolio allocation. In the end we can solve for the optimal portfolio by merely solving a linear system (we don&#8217;t need anything as expensive as a general purpose optimizer in this case).</p>
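<p>As a rough illustration of how little machinery this needs, here is a Python sketch that solves the linear system and applies the re-scaling. The function names and any example numbers are ours, not from the article, and the sketch assumes C is a symmetric positive definite matrix (for example a diagonal matrix of per-asset variances, or a full covariance estimate) so that the system has a unique solution. A small Gaussian elimination is written out only to keep the example self-contained; in practice one would call a linear algebra library.</p>

```python
def solve(C, mu):
    """Solve C z = mu by Gaussian elimination with partial pivoting."""
    n = len(mu)
    # augmented matrix [C | mu], copied so the inputs are not modified
    A = [list(map(float, row)) + [float(b)] for row, b in zip(C, mu)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        z[r] = (A[r][n] - sum(A[r][c] * z[c] for c in range(r + 1, n))) / A[r][r]
    return z


def allocate(C, mu, T):
    """Re-scale the solution of C z = mu so the absolute allocations sum to T."""
    z = solve(C, mu)
    scale = T / sum(abs(v) for v in z)
    return [scale * v for v in z]


def sharpe(C, mu, X):
    """Ratio of expected return X'mu to standard deviation sqrt(X'CX)."""
    n = len(X)
    ret = sum(X[i] * mu[i] for i in range(n))
    var = sum(X[i] * C[i][j] * X[j] for i in range(n) for j in range(n))
    return ret / var ** 0.5
```

<p>Calling <code>allocate(C, mu, T)</code> returns the dollar amounts per asset, and <code>sharpe</code> can be used to compare that allocation against the naive choice of putting all T dollars into the single asset with the highest predicted return.</p>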
<p>These are very old results (going back as long as there have been Sharpe Ratios and portfolio theory). A good example reference is: &#8220;The Valuation of Risk Assets and the Selection of Risky Investments in Stock Portfolios and Capital Budgets,&#8221; John Lintner, The Review of Economics and Statistics (1965) vol. 47 (1) pp. 13-37. These results are the basis for advice like: &#8220;diversify.&#8221; Without modeling risk you would tend to put all of your money in the predicted highest paying asset. When modeling risk you tend to put some of your money in each high paying asset and as long as they do not all fail at the same time you have some safety. Another (very different) route to diversification is the Kelly Criterion (discussed in <a href="http://www.win-vector.com/blog/2009/10/what-is-the-gamblers-equivalent-of-amdahls-law/">What is the gambler&#8217;s equivalent of Amdahl&#8217;s Law?</a>).</p>
<p>A very important risk we have not yet modeled is that our assets may have a tendency to fail at the same time (meaning we may not have really diversified usefully). The notion that assets may fail at the same time brings us to the ideas of correlation and covariance. When we took <img width="64" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg25.png" alt="$ C = s s^{\top}$"> we were implicitly assuming (or modeling), without justification, that each possible asset was independent of all the others (that there was no correlation between asset returns). This is, of course, not going to be anywhere near true in practice. Instead we should take <img width="17" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg9.png" alt="$ C$"> to be the <a href="http://en.wikipedia.org/wiki/Covariance_matrix">Covariance Matrix</a> that represents our estimate of the asset-to-asset correlations. In this case the solution methods above all work exactly as before. Companies such as MSCI Barra have made complete businesses out of producing and selling estimates of <img width="17" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg9.png" alt="$ C$"> .</p>
<p>Another issue is when we do not allow ourselves to &#8220;short&#8221; (or take a negative allocation of) assets. In this case we have the additional constraints <img width="48" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg26.png" alt="$ X \ge 0$"> which complicate our solution. For the special case where the asset variances are assumed to be independent (i.e. <img width="64" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg25.png" alt="$ C = s s^{\top}$"> ) it is enough to solve as above and merely replace any negative allocations with zero when inspecting and scaling the final step of the solution. When the covariances are non-trivial (<img width="17" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg9.png" alt="$ C$"> has non-zero off-diagonal entries) this solution may not be optimal. In this case the Karush-Kuhn-Tucker conditions are more complicated and at the point of optimal solution we have the following conditions:</p>
<p></p>
<div align="center">
<table cellpadding="0" align="center">
<tr valign="middle">
<td nowrap align="right"><img width="145" height="60" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg27.png" alt="$\displaystyle \mu + \lambda C X - \sum_{i=1}^{n} \tau_i E^i$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg28.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap>0</td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="19" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg29.png" alt="$\displaystyle X$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg30.png" alt="$\displaystyle \ge$"></td>
<td align="left" nowrap>0</td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="48" height="60" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg31.png" alt="$\displaystyle \sum_{i=1}^{n} X_i$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg28.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="16" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg32.png" alt="$\displaystyle T$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="13" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg33.png" alt="$\displaystyle \tau$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg30.png" alt="$\displaystyle \ge$"></td>
<td align="left" nowrap>0</td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="38" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg34.png" alt="$\displaystyle \tau^{\top} X$"></td>
<td width="10" align="center" nowrap><img width="17" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg28.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap>0</td>
<td width="10" align="right">&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"><br />
where <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> is the allocation vector we wish to solve for, <img width="13" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg17.png" alt="$ \lambda$"> is an unknown scalar, <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg35.png" alt="$ \tau$"> is a new unknown vector and <img width="22" height="16" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg36.png" alt="$ E^i$"> is the vector with <img width="69" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg37.png" alt="$ (E^i)_i = 1$"> and zeroes elsewhere. Using the Karush-Kuhn-Tucker conditions has allowed us to again almost linearize the problem, but we now have sign constraints on <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg12.png" alt="$ X$"> and <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg35.png" alt="$ \tau$"> and what is called a complementarity constraint: <img width="67" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg38.png" alt="$ \tau^{\top} X = 0$"> . This sort of problem is essentially what is called a &#8220;Linear Complementarity Problem&#8221; and is about as hard as solving a linear program (the typical solution method is a variation of the simplex method called &#8220;Lemke&#8217;s algorithm&#8221;). 
(Technically the <img width="13" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg17.png" alt="$ \lambda$"> prevents the problem from being in the right form, but <img width="13" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/EPAimg17.png" alt="$ \lambda$"> can be inspected out of the problem.) The problem can still be solved; you just need a bit more software. If we cannot short assets (or at least simulate shorting assets) we not only eliminate many possible portfolios from consideration (so we likely end up with a less profitable portfolio than we would like), we also make the mathematics and computation a bit harder.</p>
<p>The goal of this writeup has been to show how to systematically convert investment advice like &#8220;this stock is going to really take off&#8221; into an allocation of assets (which in turn implies a pattern of trades). We take as unexamined premises where to get such advice and whether to use the Sharpe ratio or some other notion of risk and/or utility. The point is that even though it may be complicated, from this point it is just calculation and calculation is easy to automate.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/08/fast-portfolio-re-balancing-as-a-fractional-linear-program/' rel='bookmark' title='Fast Portfolio re-Balancing as a Fractional Linear Program'>Fast Portfolio re-Balancing as a Fractional Linear Program</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What is the gambler&#8217;s equivalent of Amdahl&#8217;s Law?</title>
		<link>http://www.win-vector.com/blog/2009/10/what-is-the-gamblers-equivalent-of-amdahls-law/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-is-the-gamblers-equivalent-of-amdahls-law</link>
		<comments>http://www.win-vector.com/blog/2009/10/what-is-the-gamblers-equivalent-of-amdahls-law/#comments</comments>
		<pubDate>Thu, 08 Oct 2009 20:38:21 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Quantitative Finance]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Amdahl's Law]]></category>
		<category><![CDATA[Kelly Criterion]]></category>
		<category><![CDATA[Kraft Inequality]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Statistical Detective]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=878</guid>
		<description><![CDATA[While executing some statistical detective work for a client we had a major &#8220;aha!&#8221; moment and realized something like &#8220;Amdahl&#8217;s Law&#8221; rephrased in terms of probability would solve everything. We finished our work using direct methods and moved on. But it is an interesting question: what is the probabilist&#8217;s (or gambler&#8217;s) equivalent of Amdahl&#8217;s Law? [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2007/06/new-paper/' rel='bookmark' title='New Paper'>New Paper</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>While executing some statistical detective work for a client we had a major &#8220;aha!&#8221; moment and realized  something like &#8220;Amdahl&#8217;s Law&#8221; rephrased in terms of probability would solve everything.  We finished our work using direct methods and moved on.  But it is an interesting question: what is the probabilist&#8217;s (or gambler&#8217;s) equivalent of Amdahl&#8217;s Law?<span id="more-878"></span></p>
<p>Amdahl&#8217;s Law is a famous idea due to computer architect Gene Amdahl.  It is a simple technique that computer scientists use to re-direct their work back to important parts of problems.  Suppose you have a complicated system you wish to speed up.  Suppose this system is spending a p-fraction of its time in an important sub-process and that you have an idea that would speed up the sub-process by a factor of k.  Should you invest the effort?  </p>
<p>Amdahl&#8217;s Law says (by simple arithmetic): the speed-up (the ratio of the old run-time over the new run-time) the entire system would achieve if you implemented your improvement is not the factor of k you would hope for, but instead:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/10/eq11.png" alt="eq1.png" border="0" width="141" height="56" /><br />
</center></p>
<p>For example, if p = 1/3 then you can speed up the overall system by at most a factor of 1.5 (that is, you can cut the run-time by at most 33%), even if your idea is so astoundingly good that you have k=1000.</p>
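<p>The arithmetic is easy to check directly. The following is a minimal Python sketch, assuming the formula pictured above is the standard form of Amdahl&#8217;s Law, speed-up = 1/((1 - p) + p/k). The function name <code>amdahl_speedup</code> is ours, chosen only for illustration.</p>

```python
def amdahl_speedup(p, k):
    """Overall speed-up when a fraction p of run-time is made k times faster.

    Assumes the standard form of Amdahl's Law: 1 / ((1 - p) + p / k).
    """
    return 1.0 / ((1.0 - p) + p / k)


if __name__ == "__main__":
    p = 1.0 / 3.0
    for k in (2, 10, 1000):
        print(f"p = 1/3, k = {k}: overall speed-up {amdahl_speedup(p, k):.3f}")
    # As k grows without bound the speed-up approaches 1 / (1 - p).
    print(f"limit as k grows: {1.0 / (1.0 - p):.3f}")
```

<p>However large k becomes, the overall speed-up stays below 1/(1 - p), which is why the size of p matters more than the size of k.</p>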
<p>Amdahl&#8217;s Law reminds us that speeding up a component you do not lose much time to is not an important accomplishment.  In fact Amdahl&#8217;s Law directly prescribes looking at your most expensive components as being the largest opportunities for improvement.  Appealing to Amdahl&#8217;s Law is an important nerd-tool to end &#8220;color of the bike shed&#8221; arguments (and concentrate only on the design of systems that actually have an impact on outcomes).</p>
<p>It is clear there are similar principles for managing expenses, revenue, effort and so on (such as the Pareto Principle).</p>
<p>But what is the equivalent statement in the harder and more complicated world of probabilities and gambling systems?  There are a lot of candidate statements and theorems (such as &#8220;look for horses not for zebras&#8221;, the Kraft Inequality, Kullback Leibler Distance, Cross Entropy and the Asymptotic Equipartition Principle) but I think the most powerful and direct analogue is: the Kelly Betting System.  The Kelly Betting System is a remarkable system that, like Amdahl&#8217;s Law, tells us exactly what to look at (and surprisingly some things to ignore).</p>
<p>Kelly&#8217;s original paper: &#8220;A New Interpretation of Information Rate&#8221; J. L. Jr Kelly, AT&#038;T Technical Journal (1956) phrases the problem as betting at a horse race.  The technique applies more generally (other forms of gambling, portfolio management, even explaining the preferences of lab-mice) but the clearest example remains a horse race.</p>
<p>We follow the excellent discussion of the problem from Cover and Thomas &#8220;Information Theory&#8221; Wiley (1991).    Consider a simplified horse race where there is only one payoff offered: picking the winning horse.  Suppose the (unknown) true probability of the i-th horse winning is p_i.  Further suppose the track publishes a set of payoffs for each horse such that if you bet a dollar on the i-th horse and it wins: you are given o_i dollars back.   </p>
<p>Now a gambler that has no estimate of the p_i might put all of their money on &#8220;the highest paying horse.&#8221;   That is picking the i such that o_i is maximal (&#8220;going for big score&#8221;).   A somewhat more informed gambler might put all of their money on the &#8220;horse with the best expected return&#8221; that is a horse i that maximizes p_i * o_i.  But this betting strategy &#8220;invites ruin&#8221;:  you have probability of 1 &#8211; p_i of losing all of your money.  Kelly starts with the controversial idea of trying to maximize expected log-return (instead of maximizing expected return).  Maximizing expected log-return avoids ruin, maximizes the exponential rate your wealth grows  and maximizes the median wealth over all outcomes (see: &#8220;The Kelly System Maximizes Median Fortune&#8221; S N Ethier, Journal of Applied Probability (2004) vol. 41 (4) pp. 1230-1236).  Even the observation that you don&#8217;t always want to put all of your money in a &#8220;favorable bet&#8221; (that is one with expectation p_i * o_i >1) is an important one.</p>
<p>To get the next part of Kelly&#8217;s system consider the sum of reciprocals of track offered payoffs:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/10/sum1.png" alt="sum1.png" border="0" width="82" height="68" /><br />
</center></p>
<p>At any real track this sum will be greater than 1 (i.e. the o_i will be small, making the sum large).   The larger the sum the more clearly unfair the track&#8217;s published payoff schedule is.  Let us assume we were at a fantastically generous track where this sum is exactly 1 (admittedly unrealistic, and both the paper and the book work beyond this limitation).  In this case we can write r_i = 1/o_i and we know r_i > 0 and the r_i sum to 1.  That is we can interpret the r_i as the track&#8217;s estimate of the probability of the i-th horse winning.   If o_i = 100 (the track is paying off 100:1) we then can infer they think the i-th horse has no more than a 1 in 100 chance of winning (else they could not afford to offer the bet).  Kelly&#8217;s system gives (and proves correct) the following remarkable advice: if the sum given above is 1 (i.e. the track is paying off at least a fair rate) then you can safely bet all of your money and you should bet a p_i fraction of your money on the i-th horse.  </p>
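<p>As a small illustration of the sum of reciprocals, here is a Python sketch that turns a payoff schedule into the implied probabilities r_i = 1/o_i. Both payoff schedules are hypothetical numbers made up for this example, and the function name is ours.</p>

```python
def implied_probabilities(payoffs):
    """Return (total, r) where r_i = 1 / o_i and total is the sum of the r_i.

    payoffs[i] is o_i: dollars returned per dollar bet on horse i if it wins.
    """
    r = [1.0 / o for o in payoffs]
    return sum(r), r


if __name__ == "__main__":
    # Hypothetical payoff schedules, not taken from any real track.
    fair = [2.0, 4.0, 4.0]       # reciprocals sum to exactly 1
    real = [1.8, 3.5, 3.5]       # smaller payoffs, so the sum exceeds 1
    for name, payoffs in (("fair", fair), ("real", real)):
        total, r = implied_probabilities(payoffs)
        print(name, round(total, 4), [round(x, 4) for x in r])
```

<p>Only when the total is exactly 1 can the r_i be read directly as a probability distribution; when it exceeds 1 the excess is the track&#8217;s margin.</p>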
<p>That is: if you decide the track is paying off so much that it is worth your while to gamble then you should completely ignore the track&#8217;s payoff schedule in making your bet.   You might use the track&#8217;s published payoffs as some of your evidence when trying to estimate the p_i (the probability of each horse winning), but once you have estimated these probabilities you then ignore the track&#8217;s payoff rates in designing your bets.  In fact your expected rate of winning is exactly proportional to how much closer to the true probabilities your estimate is than the track&#8217;s estimate is (Cover/Thomas example 6.1.1; so unless you know something the track does not know, you should not bet).  Also you should bet even on unlikely and underpaying horses to help cover the possibilities (this is because you are making a series of bets, not just a single bet, so each bet&#8217;s value is computed under the assumption that your other bets have failed).  This (provably correct) advice is contrary to many obvious and traditional betting systems.</p>
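<p>The claims above can be checked numerically. The sketch below assumes a fair track (payoffs o_i = 1/r_i) and assumes the gambler&#8217;s p_i are the true win probabilities; all of the numbers and function names are hypothetical, chosen only for illustration. It computes the expected log2-growth per race, the sum of p_i * log2(b_i * o_i), for three ways of splitting your money.</p>

```python
import math


def doubling_rate(bets, probs, payoffs):
    """Expected log2-growth per race: sum of p_i * log2(b_i * o_i).

    bets[i] is the fraction of wealth placed on horse i (fractions sum to 1).
    A zero bet on a horse that can win means ruin, so the rate is -infinity.
    """
    total = 0.0
    for b, p, o in zip(bets, probs, payoffs):
        if p == 0.0:
            continue
        if b == 0.0:
            return float("-inf")
        total += p * math.log2(b * o)
    return total


def kl_divergence(p, r):
    """Kullback-Leibler distance D(p || r) in bits."""
    return sum(pi * math.log2(pi / ri) for pi, ri in zip(p, r) if pi > 0.0)


if __name__ == "__main__":
    # Hypothetical numbers: true probabilities p and a fair track whose
    # payoffs o_i = 1 / r_i encode the track's own estimate r.
    p = [0.6, 0.3, 0.1]
    r = [0.5, 0.25, 0.25]
    o = [1.0 / ri for ri in r]

    print("bet b = p (Kelly):     ", doubling_rate(p, p, o))
    print("bet b = r (track view):", doubling_rate(r, p, o))
    print("all-in on horse 1:     ", doubling_rate([1.0, 0.0, 0.0], p, o))
    print("D(p || r):             ", kl_divergence(p, r))
```

<p>Under these assumptions, betting b = p grows at exactly D(p || r), betting the track&#8217;s own estimate grows at rate zero, and putting everything on one horse has an expected log-return of minus infinity.</p>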
<p>The Kelly System is simultaneously very precise and broadly applicable.  For example: it has been extended to many other games and the stock market (see: &#8220;The Kelly Criterion and the Stock Market&#8221; Louis M Rotando, Edward O Thorp, The American Mathematical Monthly (1992) vol. 99 (10) pp. 922-931).  The Kelly System gives actionable advice (exact amounts to bet or exact amounts of effort to invest) and is very specific in saying what to look at.  </p>
<p>Just as Amdahl&#8217;s law shows us component speedup is a distraction the Kelly System shows us that published rates of return are siren songs.  Thus the Kelly System is the gambler&#8217;s equivalent of Amdahl&#8217;s Law.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2007/06/new-paper/' rel='bookmark' title='New Paper'>New Paper</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/10/what-is-the-gamblers-equivalent-of-amdahls-law/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Good Graphs: Graphical Perception and Data Visualization</title>
		<link>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=good-graphs-graphical-perception-and-data-visualization</link>
		<comments>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 15:40:41 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Cleveland]]></category>
		<category><![CDATA[data exploration]]></category>
		<category><![CDATA[graphical perception]]></category>
		<category><![CDATA[Lattice]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=296</guid>
		<description><![CDATA[What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/12/my-favorite-graphs/' rel='bookmark' title='My Favorite Graphs'>My Favorite Graphs</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective visualization that neither obscures important details nor drowns us in confusing clutter? In 1985, William Cleveland published a text called <a href="http://www.stat.purdue.edu/~wsc/elements.html"><em>The Elements of Graphing Data,</em></a> inspired by Strunk and White&#8217;s classic writing handbook <a href="http://www.amazon.com/Elements-Style-50th-Anniversary/dp/0205632645"><em>The Elements of Style</em></a>. <em>The Elements of Graphing Data</em> puts forward Cleveland&#8217;s philosophy about how to produce good, clear graphs, not only for presenting one&#8217;s experimental results to peers, but also for the purposes of data analysis and exploration. Cleveland&#8217;s approach is based on a theory of graphical perception: how well the human perceptual system accomplishes certain tasks involved in reading a graph. For a given data analysis task, the goal is to align the information being presented with the perceptual tasks the viewer accomplishes the best. <span id="more-296"></span></p>
<blockquote><p>When a graph is made, quantitative and categorical information is encoded by a display method. Then the information is visually decoded. This visual perception is a vital link. No matter how clever the choice of the information, and no matter how technologically impressive the encoding, a visualization fails if the decoding fails. Some display methods lead to efficient, accurate decoding, and others lead to inefficient, inaccurate decoding. It is only through scientific study of visual perception that informed judgments can be made about display methods. The display methods of <em>Elements</em> rest on a foundation of scientific enquiry.</p></blockquote>
<p>— from the preface of <em>The Elements of Graphing Data</em></p>
<p>A revised edition of <em>The Elements of Graphing Data</em> was published in 1994, along with a companion volume, <a href="http://www.stat.purdue.edu/~wsc/visualizing.html"><em>Visualizing Data,</em></a> which is oriented towards the implementation and technical details of different graphing techniques. I highly recommend <em>The Elements of Graphing Data</em> as a guidebook for creating graphs, as well as for its excellent survey of several useful techniques. Cleveland, along with other colleagues at Bell Labs, developed the <a href="http://stat.bell-labs.com/project/trellis/s.html">Trellis display system,</a> a framework for the visualization of multivariable databases, using the ideas developed in his texts. Trellis, in turn, influenced Deepayan Sarkar&#8217;s Lattice graphics system for R. Lattice implements many of Cleveland&#8217;s ideas, and I also recommend Sarkar&#8217;s <a href="http://lmdvr.r-forge.r-project.org/figures/figures.html">Lattice manual</a> if you do data visualization in R.</p>
<p>It&#8217;s important to note here that Cleveland writes for researchers and decision-makers who use graphs to analyze data, or to convey scientific results to colleagues in an (ideally) objective manner. This distinguishes him from Darrell Huff, whose 1954 <a href="http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728"><em>How to Lie with Statistics</em></a> considered the use of graphs (and statistics in general) as rhetorical devices for convincing others of one&#8217;s point of view. Hence, some of Cleveland&#8217;s recommendations and guidelines actually contradict Huff&#8217;s. <a id="refHuff" href="#Huff"><sup>1</sup></a></p>
<p>Edward Tufte also explored the idea that the choice of graphical display should be influenced by the viewer&#8217;s cognitive processes, in his 1990 book <a href="http://www.edwardtufte.com/tufte/books_ei"><em>Envisioning Information</em></a>. Tufte tends to be more broadly concerned with the gestalt of a graph, beyond its use as an analysis tool; he is also more concerned than Cleveland is with aesthetic considerations.</p>
<p>Cleveland&#8217;s philosophy might be summarized as: <em>minimize the mental gymnastics that the viewer must go through to understand the graph</em>. This leads to some obvious advice: avoid clutter and occlusion, make graphing symbols or color-coding unambiguous, use scale-lines on all four sides of the graph, and so on. It also leads to advice that perhaps should be as obvious, but isn&#8217;t: <em>make the aspect of the data that you want to analyze as clear as possible</em>. But what does this mean in practice?</p>
<p><strong>Make important differences large enough to perceive</strong></p>
<p>Weber&#8217;s Law is a well known observation from the psychophysics literature, which states that the &#8220;just noticeable&#8221; change in a stimulus is a constant ratio of the original stimulus. Put another way, people are only capable of detecting a change in a stimulus that is greater than a certain percentage <em>k</em> of the original stimulus. Here, &#8220;stimulus&#8221; can refer to any perceivable physical quantity: weight, intensity, length, orientation. The percentage <em>k</em> will vary with stimulus, and with observer.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/weberslaw.jpg" border="0" alt="weberslaw.jpg" width="488" height="233" /></div>
</td>
</tr>
</tbody>
<caption>Figure 1: From Cleveland, <em>The Elements of Graphing Data</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>Figure 1 shows the application of Weber&#8217;s law to lengths. The bars A and B are of different lengths, but the difference is such a small fraction of the &#8220;base&#8221; length (say, A&#8217;s length, to be specific) that it is difficult to tell whether or not they are different, or which is longer. On the right, the bars have been embedded in frames of identical length, and now it is easy to see that B is longer. Why? Because the difference in lengths of the <em>white</em> intervals is a much larger percentage of the white &#8220;base&#8221; length (say the white A interval). It is easy to see that the white B interval is shorter than the white A interval, and therefore, the black B interval is longer than the black A interval.</p>
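<p>To put rough numbers on this argument, here is a short Python sketch. The bar and frame lengths are hypothetical (they are not measured from Figure 1); the point is only to compare the same absolute difference against a long base and a short base.</p>

```python
def relative_difference(a, b):
    """Difference between two lengths as a fraction of the base length a."""
    return abs(b - a) / a


if __name__ == "__main__":
    # Hypothetical lengths in arbitrary units; not measured from Figure 1.
    bar_a, bar_b = 100.0, 103.0
    frame = 110.0

    # Comparing the black bars directly: small change on a long base.
    print("bars:", relative_difference(bar_a, bar_b))

    # Comparing the white remainders inside identical frames:
    # the same 3-unit change on a much shorter base.
    white_a, white_b = frame - bar_a, frame - bar_b
    print("white intervals:", relative_difference(white_a, white_b))
```

<p>With these made-up lengths the same three-unit difference is 3% of the bar length but 30% of the white interval, so it is far more likely to clear the viewer&#8217;s just-noticeable threshold.</p>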
<p>The moral is that you always want the viewer to be estimating changes or differences with respect to a short base length. You can do this with reference grids, as demonstrated below.</p>
<table border="0" align="center">
<caption>From Cleveland, <em>The Elements of Graphing Data</em></caption>
<tbody>
<tr>
<td><!-- original 319 by 601 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/noreferencegrids.jpg" border="0" alt="noreferencegrids.jpg" width="200" height="400" align="left" /></td>
<td><!-- original 319 by 601 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/referencegrids1.jpg" border="0" alt="referencegrids.jpg" width="200" height="400" align="right" /></td>
</tr>
<tr>
<td align="center">Figure 2</td>
<td align="center">Figure 3</td>
</tr>
</tbody>
</table>
<p>Figure 2 shows eight curves. Which one dips to the lowest minimum? Are the high curves approaching the same value, and which one is rising the fastest? Are the low curves dipping to the same minimum? Are they going to the same steady state? Figure 3 shows the same curves, graphed with identical reference grids. The grids shorten the base lengths that are being compared, and it is now much easier to compare highs, lows, and steady state behavior.</p>
<p>But wouldn&#8217;t it be better to compare the graphs by superposing them? For two or three curves, perhaps. But in this case, eight curves can clutter the graph, and use up the symbol or color space, making it difficult to distinguish the different datasets &#8212; increasing the mental gymnastics.</p>
<p>Reference grids are useful even for a single curve, especially one with slowly varying segments, such as these graphs have. The reference grid makes it easier to answer questions like: does the process return to the initial state, or to a different steady state? Has the process reached steady state, or is it still growing?</p>
<p><strong>Make important shape changes large enough to perceive: Banking to 45 degrees.</strong></p>
<p>The aspect ratio of a graph is important when trying to understand shape. Rate of change information is encoded in the slope of the curve, which the viewer estimates by changes in the orientation of the local tangents at each point of the graph. Weber&#8217;s Law tells us that very small changes in this orientation will be difficult to detect. For a given (physical) curve, the local orientation changes will be dependent on the aspect ratio of its graphical presentation, as shown (to an exaggerated degree) in Figure 4. Here, the same curve (two line segments) is plotted at three different aspect ratios, one that centers the graph at 45 degrees, one that forces the curve to be nearly vertical, and another that forces it to be nearly horizontal. In the last two cases, the change in orientation of the two line segments is so small as to be nearly undetectable.</p>
<table border="0" align="center">
<caption>Figure 4: From Cleveland</caption>
<tbody>
<tr>
<td><!-- original 670 by 630 --></p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/angles.jpg" border="0" alt="angles.jpg" width="446" height="420" align="left" /></div>
</td>
</tr>
</tbody>
</table>
<p>For two line segments with positive, unequal slopes, a simple geometric argument shows that their absolute difference in orientation is maximized by the aspect ratio that sets their average orientation to 45 degrees (the first graph in Figure 4). Empirical studies by Cleveland and others have indeed verified that a viewer&#8217;s ability to judge the relative slopes of line segments on a graph is maximized when the absolute values of the orientations of the segments are centered on 45 degrees.</p>
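<p>The geometric claim can be checked numerically. In the Python sketch below (our own illustration, with hypothetical slopes), changing the aspect ratio is modeled as multiplying both slopes by a common factor a; a simple grid search then finds the factor that maximizes the angle between the two segments.</p>

```python
import math


def orientation_gap(s1, s2, a):
    """Angle in degrees between two segments after slopes are scaled by a."""
    return abs(math.degrees(math.atan(a * s2) - math.atan(a * s1)))


def best_scale(s1, s2, steps=20000, lo=0.01, hi=100.0):
    """Grid-search (on a log scale) the scaling a that maximizes the gap."""
    best_a, best_gap = lo, -1.0
    for i in range(steps + 1):
        a = lo * (hi / lo) ** (i / steps)
        gap = orientation_gap(s1, s2, a)
        if gap > best_gap:
            best_a, best_gap = a, gap
    return best_a, best_gap


if __name__ == "__main__":
    s1, s2 = 0.5, 3.0  # hypothetical positive, unequal slopes
    a, gap = best_scale(s1, s2)
    t1 = math.degrees(math.atan(a * s1))
    t2 = math.degrees(math.atan(a * s2))
    print("best scaling:", a)
    print("orientations:", t1, t2, "average:", (t1 + t2) / 2.0)
```

<p>The search lands where a * a * s1 * s2 = 1, which is exactly the condition that the two orientations sum to 90 degrees, so their average is 45 degrees.</p>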
<p>This result leads to a technique called <em>Banking to 45</em>, whereby the aspect ratio of the graph is chosen so that the average slope of the entire graph is 45 degrees. The details are discussed in Cleveland, and many of the plots in R&#8217;s Lattice package also have an option to bank the graph to 45 degrees.</p>
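<p>As a rough illustration, the Python sketch below picks an aspect ratio from the data. It is a simplified variant and an assumption on our part: it uses the median absolute slope of the segments as a robust stand-in for the &#8220;average slope&#8221;, and it is not claimed to match the exact rule in Cleveland&#8217;s text or in any plotting package.</p>

```python
from statistics import median


def bank_to_45(xs, ys):
    """Return a height/width aspect ratio that makes the median absolute
    slope of the plotted segments equal to 1 (an orientation of 45 degrees).

    Slopes are measured in plot units: each axis is first rescaled by its
    data range, so the result does not depend on the units of x or y.
    """
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    slopes = []
    for i in range(1, len(xs)):
        dx = (xs[i] - xs[i - 1]) / x_range
        dy = (ys[i] - ys[i - 1]) / y_range
        if dx != 0 and dy != 0:
            slopes.append(abs(dy / dx))
    return 1.0 / median(slopes)


if __name__ == "__main__":
    # Hypothetical series: mostly flat with one steep jump at the end.
    xs = [0, 1, 2, 3]
    ys = [0, 1, 2, 12]
    print("suggested height/width:", bank_to_45(xs, ys))
```

<p>A mostly flat series gets a tall, narrow plot so that its gentle segments are drawn near 45 degrees; a straight line filling the plot gets a square one.</p>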
<p>This deliberate exaggeration of slope is something that Darrell Huff deplores. In <em>How to Lie with Statistics</em>, Huff refers to these graphs as &#8220;gee-whiz&#8221; graphs — and in the context of his discussion of statistics as rhetoric, they are:</p>
<table border="0" align="center">
<caption>Figure 5: From Huff, <em>How to Lie With Statistics</em></caption>
<tbody>
<tr>
<td><!-- original 461 by 351 --></p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/geewhiz.jpg" border="0" alt="geewhiz.jpg" width="461" height="351" /></div>
</td>
</tr>
</tbody>
</table>
<p>To insist that a graph should always include a zero line and that units be in proportion may be good advice from a rhetorical perspective; but it is poor advice if the purpose of the graph is data analysis. As Figure 6 below demonstrates, we can lose resolution if we always insist on including the zero. Does the trend line in the left graph increase linearly, superlinearly, or sublinearly? The convexity of the curve is more apparent when it is banked to 45, as on the right. Assuming that the scientist reads the axis and is cognizant of the actual magnitude changes involved, the graph on the right conveys more information.</p>
<table border="0" align="center">
<caption>Figure 6: From Cleveland</caption>
<tbody>
<tr>
<td><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/bank451.jpg" border="0" alt="bank45.jpg" width="500"  /></td>
</tr>
</tbody>
</table>
<p><strong>Make sure all the data is equally well resolved.</strong></p>
<p>It is quite common for positive data —  word frequencies, populations, price distributions, just to name a few examples — to be skewed: most of the data is bunched towards low values, the rest of it is spread out on a very long tail. This long tail squashes the majority of the data into a tiny interval of a very narrow dynamic range, as in Figure 7, making it difficult to evaluate the data.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original size: 990 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/skewed1.gif" border="0" alt="skewed.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 7: Long-tailed distribution of purchase sizes</caption>
</table>
</td>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original = 499 by 675 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/logskewed1.gif" border="0" alt="logskewed.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 8: Distribution of log(purchase size)</caption>
</table>
</td>
</tr>
</tbody>
</table>
<p>Imagine that Figure 7 represents the distribution of average purchase size across an online merchant&#8217;s customers: average purchase size is plotted on the x-axis, and the y-axis represents the fraction of the total customer population whose average purchase size is a given value (the area under the graph integrates to one). According to this graph, most customers make fairly small purchases on average, but there is a long tail of big spenders trailing out into the range of several thousand dollars. Obviously, one would like a little more resolution on the big spike of customers near zero. One could simply &#8220;zoom in&#8221; on this range, by chopping off some long chunk of the tail, but you may potentially lose sight of some global patterns in the data by doing so.</p>
<p>Graphing the distribution of log(purchase size) enables you to increase the resolution near zero, while preserving the global view. Figure 8 shows the distribution of log(purchase size), revealing two spending populations: a population of high spenders who tend to make purchases in the $3000 range (in log space), and another population whose purchases are centered (in log space) around $60. The existence of these two distinct populations is not apparent in the original graph.</p>
<p>Notice that Figure 8 has two x-axis scales: the top axis is marked in log units, while the bottom axis is marked in absolute dollars, spaced on a log scale. This accords with the principle of minimizing mental gymnastics, since the viewer of the graph will typically be concerned about prices in dollars, not log dollars. In fact, it would have been better yet to have plotted the distribution of log<sub>2</sub> or log<sub>10</sub> of the data; the former would allow us to see at a glance the doubling of price ranges, the latter to see price changes in factors of ten.</p>
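<p>The two-scale axis is simple to generate. The following Python sketch (an illustration of ours, with a hypothetical function name and hypothetical dollar ranges) pairs each tick position in log units with a label in plain dollars, for either base 10 or base 2.</p>

```python
import math


def dual_axis_ticks(low, high, base=10):
    """Tick positions in log units paired with labels in original units.

    Returns a list of (position, label) pairs, one per whole power of
    `base` lying between low and high (inclusive).
    """
    # The small epsilon guards against floating-point error in math.log.
    first = math.ceil(math.log(low, base) - 1e-9)
    last = math.floor(math.log(high, base) + 1e-9)
    return [(k, f"${base ** k:,}") for k in range(first, last + 1)]


if __name__ == "__main__":
    # Hypothetical purchase-size range of $1 to $10,000.
    for position, label in dual_axis_ticks(1, 10000, base=10):
        print(position, label)
    # Base 2 shows doublings instead of factors of ten.
    for position, label in dual_axis_ticks(1, 64, base=2):
        print(position, label)
```

<p>The positions go on the log-unit axis and the labels on the dollar axis, so the viewer never has to exponentiate in their head.</p>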
<table border="0" align="center">
<caption>Figure 9: The 14 most abundant elements in meteorites. From Cleveland</caption>
<tbody>
<tr>
<td><!-- original = 543 by 522 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/metals.jpg" border="0" alt="metals.jpg" width="250" /></td>
<td><!-- original = 550 by 600 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/logmetals.jpg" border="0" alt="logmetals.jpg" width="250" /></td>
</tr>
</tbody>
</table>
<p>Figure 9 shows another example: the fourteen most abundant elements in meteorites, specifically the average percent of each of the elements. If we graph the percentages directly, as on the left, we cannot easily distinguish the differences in the elements from aluminum on down. Graphing log<sub>2</sub> of the percentages, as on the right, improves the resolution. Again, we have two x-axes on the graph of the log data.</p>
<p><strong>If you want to analyze the difference between two processes, then graph the difference, not the processes (or graph both).</strong></p>
<p>Suppose that we are comparing the two processes f1 and f2 that are shown in Figure 10. As x increases, the two processes appear to be approaching each other  — that is, the difference between the two seems to be decreasing. In reality, the difference between the two is constant: f2 = f1+1.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original size: 990 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/difference1.gif" border="0" alt="difference.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 10: The illusion of convergence</caption>
</table>
</td>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original = 499 by 675 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/imports.jpg" border="0" alt="imports.jpg" width="250" /></td>
</tr>
</tbody>
<caption>Figure 11: British Imports and Exports. From Cleveland</caption>
</table>
</td>
</tr>
</tbody>
</table>
<p>It turns out that people are good at perceiving the perpendicular difference between two curves, but not the differences in height, which is what we are actually interested in here. When we try to infer the differences from the process graph, we may not only miss key information, we may actually draw incorrect conclusions.</p>
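<p>A small numerical experiment shows how large the effect can be. The Python sketch below uses a hypothetical pair of curves (an exponential and the same exponential shifted up by 1, not the functions plotted in Figure 10) and compares the vertical difference with the shortest distance between the curves.</p>

```python
import math


def f1(x):
    return math.exp(x)          # hypothetical steeply rising process


def f2(x):
    return f1(x) + 1.0          # always exactly 1 above f1


def nearest_distance(x0, samples=20001, lo=-1.0, hi=4.0):
    """Shortest distance from the point (x0, f1(x0)) to the curve f2,
    approximated by scanning densely sampled points on f2."""
    y0 = f1(x0)
    best = float("inf")
    for i in range(samples):
        x = lo + (hi - lo) * i / (samples - 1)
        best = min(best, math.hypot(x - x0, f2(x) - y0))
    return best


if __name__ == "__main__":
    for x0 in (0.0, 1.0, 2.0, 3.0):
        vertical = f2(x0) - f1(x0)
        print(x0, vertical, nearest_distance(x0))
```

<p>The vertical difference stays at 1 throughout, while the nearest-point distance shrinks as the curves get steeper, which is the quantity the eye tends to read.</p>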
<p>A less toy example is given in Figure 11. Here the imports to and exports from England are graphed over the first 80 years of the 18th century. In the difference graph on the bottom, we can see a local peak in (imports-exports) just after 1760; this is not obvious from simply comparing the two processes (top graph).</p>
<p><strong>If you are interested in rate of change, then graph rate of change.</strong></p>
<p>In Figure 12, we see the population figures for a given community from 1990 to 2009. Obviously, the population is steadily increasing, but how quickly? Is the rate of population growth increasing over time, or is it decreasing? If we are interested in these questions, then simply graphing the population over time is not enough. We need to look at the rate of change directly.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<caption>Figure 12</caption>
<tbody>
<tr>
<td><!-- original 998 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/rateofchange1.gif" border="0" alt="rateofchange.gif" width="250" /></td>
</tr>
</tbody>
</table>
</td>
<td>
<table border="0">
<caption>Figure 13</caption>
<tbody>
<tr>
<td><!-- original 720 by 720 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lograteofchange2.gif" border="0" alt="lograteofchange.gif" width="250" /></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The classic way to do this is by graphing the logarithm of the data. In Figure 13, we have graphed log<sub>2</sub> of the population over time, with the log scale printed on the right hand y-axis, and the actual population numbers printed at a log scale on the left hand axis. Now we can see that the population increased at a constant rate from 1990 to 2000, quadrupling approximately every four years, and then slowed down (to a lower constant rate) after 2000.</p>
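<p>Reading a growth rate off a log<sub>2</sub> plot is just measuring a slope. Here is a minimal Python sketch, using hypothetical population readings (not the values in Figure 13) and function names of our own choosing.</p>

```python
import math


def log2_growth_rate(t0, p0, t1, p1):
    """Slope of log2(population) between two observations, in doublings per year."""
    return (math.log2(p1) - math.log2(p0)) / (t1 - t0)


def doubling_time(rate):
    """Years needed to double at a constant log2 growth rate."""
    return 1.0 / rate


if __name__ == "__main__":
    # Hypothetical readings, not taken from Figure 13.
    early = log2_growth_rate(1990, 1000, 2000, 4000)
    late = log2_growth_rate(2000, 4000, 2009, 6000)
    print("1990-2000:", early, "doublings/year, doubling time", doubling_time(early))
    print("2000-2009:", late, "doublings/year, doubling time", doubling_time(late))
```

<p>A straight segment on the log plot means a constant rate, and a flatter segment means slower growth even while the raw population is still rising.</p>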
<p><strong>Graphs as a research tool</strong></p>
<p>Throughout this discussion, we have considered graphs as a tool for data exploration and initial understanding. It is an iterative process &#8212; as questions arise, the data will be reprocessed and re-plotted to highlight the new issues to be examined. A good research graph must display this information directly, with a minimum of mental gymnastics, but &#8212; as with any research tool &#8212; there can be a learning curve. For example, density plots (such as those shown in Figures 7 and 8) are in my opinion more useful than histograms for understanding how numerical data is distributed &#8212; and I am constantly surprised at the amount of explanation that they require when I show them to people who are unfamiliar with them. A number of very useful graphs that are discussed in Cleveland&#8217;s texts meet with the same reaction from people who encounter that style of graph for the first time. This is a disadvantage, relative to using a more fashionable graph, when attempting to communicate results. But the insight into the data that these graphs provide often makes it worth spending the time to educate clients or peers on how to read the graph.</p>
<p>Even so, a good graph still may not be a quick read. As Cleveland writes:</p>
<blockquote><p>While there is a place for rapidly-understood graphs, it is too limiting to make speed a requirement in science and technology, where the use of graphs ranges from detailed in-depth data analysis to quick presentation.<br />
&#8230;</p>
<p>The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.</p></blockquote>
<p>- <em>The Elements of Graphing Data</em>, Chapter 2</p>
<hr /><a id="Huff" href="#refHuff">[Back]</a><sup>1</sup><em>How to Lie with Statistics</em> is an entertaining (if a little dated) discussion of how to read statistical and quantitative claims critically, and is definitely worth a read.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>The Data Enrichment Method</title>
		<link>http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-data-enrichment-method</link>
		<comments>http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/#comments</comments>
		<pubDate>Fri, 01 May 2009 01:03:06 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Data Enrichment]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=80</guid>
		<description><![CDATA[We explore some of the ideas from the seminal paper &#8220;The Data-Enrichment Method&#8221; ( Henry R Lewis, Operations Research (1957) vol. 5 (4) pp. 1-5). The paper explains a technique of improving the quality of statistical inference by increasing the effective size of the data-set. This is called &#8220;Data-Enrichment.&#8221; Now more than ever we must [...]
]]></description>
			<content:encoded><![CDATA[<p>We explore some of the ideas from the seminal paper &#8220;The Data-Enrichment Method&#8221; ( Henry R Lewis, Operations Research (1957) vol. 5 (4) pp. 1-5).  The paper explains a technique of improving the quality of statistical inference by increasing the effective size of the data-set.  This is called &#8220;Data-Enrichment.&#8221;</p>
<p>Now more than ever we must be familiar with the consequences of these important techniques.  Especially if we don&#8217;t know if we might already be a victim of them.</p>
<p><span id="more-80"></span><br />
&#8220;The Data-Enrichment Method&#8221; is an absolutely wonderful 1957 tongue-in-cheek parody of a very tempting method of accidental data falsification.  The method presented is spookily plausible and actually anticipates some very important (and correct) methods later used in the EM, Jackknife, Bootstrap and other resampling techniques (for example see: &#8220;Bootstrap Methods: Another Look at the Jackknife&#8221;, Bradley Efron. Ann. Statist. (1979) vol. 7 (1) pp. 1-26).</p>
<p>The idea is innocently presented with an accompanying data-set: perception of a sound at different presented decibel levels (loudnesses):</p>
<p><center></p>
<table>
<tr>
<th>Source.DB</th>
<th>Detections</th>
<th>Failures</th>
</tr>
<tr>
<td>62</td>
<td>5</td>
<td>40</td>
</tr>
<tr>
<td>65</td>
<td>10</td>
<td>30</td>
</tr>
<tr>
<td>68</td>
<td>15</td>
<td>20</td>
</tr>
<tr>
<td>71</td>
<td>20</td>
<td>10</td>
</tr>
<tr>
<td>74</td>
<td>25</td>
<td>5</td>
</tr>
<tr>
<td>77</td>
<td>30</td>
<td>3</td>
</tr>
</table>
<p></center></p>
<p>From this table it is obvious that the number of detections is increasing (and the number of failures is decreasing) as the sound is presented louder and louder.  This makes sense and puts a quantitative rate to our prior expectation that detection gets easier as loudness increases.  For this data the trend is quite obvious and we can easily plot a regression line that accurately models the effect of Source.DB on detection rate:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/04/sourcedbdetectionrate.gif" alt="SourceDBDetectionRate.gif" border="0" width="400" height="400" /><br />
</center></p>
<p>But we want more.  Can we increase our model precision and confidence by incorporating our domain knowledge?  If we are only trying to accurately estimate the rate that loudness increases the detection level and we are willing to assume that it really does increase, then: could we not pre-prepare the data to use our domain knowledge? </p>
<p>The method suggested is to add in some counterfactuals that we feel confident about.  For example we could (using our assumption that loudness increases detection, just to an unknown degree) notice that the 30 failures at 65 DB certainly would not have been heard if they had been run at 62 DB (even quieter).  By the same reasoning we can assume that the 5 detections at 62 DB would have been heard had they been run at 65 DB, 68 DB, 71 DB, 74 DB or 77 DB.  In this way we have used our starting &#8220;seed data&#8221; and our domain knowledge to boost into a much larger data set that shows the expected relation much more strongly.</p>
<p>The above paragraph is, of course, nonsense.  I am doing the original paper an injustice by summarizing- because in the original paper the procedure seems perfectly plausible (and useful).  It is not until the author works a second example that has a poor initial relation (that actually needs the enrichment) that the joke is revealed.</p>
<p>The second example is coin flipping.  The author applies an inductive bias that &#8220;clearly standing higher up on a staircase increases the chances of a coin flip coming up heads&#8221; and then uses the data enrichment method to enhance the data set.  The original data set is indeed too noisy to show the effect and the enhancement is in fact quite dramatic.  The original data:</p>
<p><center></p>
<table>
<tr>
<th>Stair.Step</th>
<th>Heads</th>
<th>Tails</th>
</tr>
<tr>
<td> 1</td>
<td>4</td>
<td>6</td>
</tr>
<tr>
<td> 2</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td> 3</td>
<td>7</td>
<td>3</td>
</tr>
<tr>
<td> 4</td>
<td>4</td>
<td>6</td>
</tr>
<tr>
<td> 5</td>
<td>6</td>
<td>4</td>
</tr>
<tr>
<td> 6</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td> 7</td>
<td>6</td>
<td>4</td>
</tr>
<tr>
<td> 8</td>
<td>6</td>
<td>4</td>
</tr>
<tr>
<td> 9</td>
<td>3</td>
<td>7</td>
</tr>
<tr>
<td> 10</td>
<td>4</td>
<td>6</td>
</tr>
</table>
<p></center></p>
<p>The enhanced data is much more interesting:</p>
<p><center></p>
<table>
<tr>
<th>Stair.Step</th>
<th>Virtual.Heads</th>
<th>Virtual.Tails</th>
</tr>
<tr>
<td>1</td>
<td>  4</td>
<td> 50</td>
</tr>
<tr>
<td>2</td>
<td>  9</td>
<td> 44</td>
</tr>
<tr>
<td>3</td>
<td> 16</td>
<td> 39</td>
</tr>
<tr>
<td>4</td>
<td> 20</td>
<td> 36</td>
</tr>
<tr>
<td>5</td>
<td> 26</td>
<td> 30</td>
</tr>
<tr>
<td>6</td>
<td> 31</td>
<td> 26</td>
</tr>
<tr>
<td>7</td>
<td> 37</td>
<td> 21</td>
</tr>
<tr>
<td>8</td>
<td> 43</td>
<td> 17</td>
</tr>
<tr>
<td>9</td>
<td> 46</td>
<td> 13</td>
</tr>
<tr>
<td>10</td>
<td> 50</td>
<td>  6</td>
</tr>
</table>
<p></center></p>
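<p>For readers who want to see exactly how little is going on, here is a minimal sketch of the enrichment rule as we read it from the two tables (the function name <code>enrich</code> is ours, not the paper&#8217;s): every head is also counted at each higher step, and every tail is also counted at each lower step, so the &#8220;virtual&#8221; counts are just running sums.</p>

```python
from itertools import accumulate

# Raw coin-flip counts per stair step (steps 1..10), copied from the first table.
heads = [4, 5, 7, 4, 6, 5, 6, 6, 3, 4]
tails = [6, 5, 3, 6, 4, 5, 4, 4, 7, 6]


def enrich(successes, failures):
    """Apply the (bogus) enrichment rule to counts ordered by increasing level.

    A success at one level is re-counted at every higher level, and a
    failure at one level is re-counted at every lower level.
    """
    virtual_successes = list(accumulate(successes))
    virtual_failures = list(accumulate(reversed(failures)))[::-1]
    return virtual_successes, virtual_failures


virtual_heads, virtual_tails = enrich(heads, tails)
for step, (h, t) in enumerate(zip(virtual_heads, virtual_tails), start=1):
    print(step, h, t)
```

<p>The printed rows match the enriched table above, which confirms that the &#8220;new&#8221; data contains nothing but the old data counted many times over.</p>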
<p>It is easier to see what is going on in the following plots (which show measured success rates as a function of number of stairs up the staircase and show a smoothed fit of the relationship).  The original data is a noisy mess:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/04/coinsmoothed.gif" alt="CoinSmoothed.gif" border="0" width="400" height="400" /><br />
</center></p>
<p>And the enriched data is more trend-like:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/04/virtualsmoothed.gif" alt="VirtualSmoothed.gif" border="0" width="400" height="400" /><br />
</center></p>
<p>In fact the regression line fit onto the raw data even has the wrong sign (points down instead of up):</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/04/coinfit.gif" alt="CoinFit.gif" border="0" width="400" height="400" /><br />
</center></p>
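<p>The sign claim is easy to check by hand.  The sketch below assumes an ordinary least-squares line through the per-step heads rate, which may not be the exact fit used to draw the figures; the helper <code>ols_slope</code> is our own illustration.</p>

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


steps = list(range(1, 11))

# Counts copied from the raw and the enriched tables.
heads = [4, 5, 7, 4, 6, 5, 6, 6, 3, 4]
tails = [6, 5, 3, 6, 4, 5, 4, 4, 7, 6]
virtual_heads = [4, 9, 16, 20, 26, 31, 37, 43, 46, 50]
virtual_tails = [50, 44, 39, 36, 30, 26, 21, 17, 13, 6]

# Observed fraction of heads at each step.
raw_rate = [h / (h + t) for h, t in zip(heads, tails)]
virtual_rate = [h / (h + t) for h, t in zip(virtual_heads, virtual_tails)]

raw_slope = ols_slope(steps, raw_rate)
virtual_slope = ols_slope(steps, virtual_rate)
print("raw slope:", raw_slope)
print("enriched slope:", virtual_slope)
```

<p>Under this assumption the raw slope comes out slightly negative while the enriched slope is strongly positive: the procedure has manufactured the trend it assumed.</p>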
<p>Now, obviously this is a joke.  The enhancement procedure did not so much enhance the data as obliterate it.  The procedure makes no sense and it is treating the procedure with undue respect to point out any one feature as being &#8220;what is wrong with it.&#8221;  But the original desire is legitimate: can we use informed assumptions to gain a useful inductive bias?  If we do know something should we not need less data?</p>
<p>The answer is yes- but we have to be careful.  We must read up on the differences between Bayesian, frequentist and empirical methods and decide which set of methods is best for us.  Up until now we have been fitting &#8220;by standard methods&#8221; which is really just minimizing how far the data is from the model (by moving the model around).  That isn&#8217;t the only way to fit (see: &#8220;Controversies In The Foundation Of Statistics&#8221; Bradley Efron, American Mathematical Monthly (1978) vol. 85 (4) pp. 231-246).</p>
<p>For example a Bayesian might say that the goal of model fitting is not to pick a model that is closest to the data (maximizes the data&#8217;s plausibility with respect to the model) but to pick a model that simultaneously maximizes the product of the data&#8217;s plausibility with respect to the model and the model&#8217;s acceptability.  For example we could say all models for coin-flips with negative slopes are unacceptable and pick the best model with a non-negative slope.  However, assigning of degrees of acceptability (or priors) on every possible model is laborious and may require more knowledge than we have from our &#8220;reasonable prior domain knowledge.&#8221;</p>
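<p>As a rough sketch of the &#8220;no negative slopes&#8221; idea: if we assume a least-squares (Gaussian noise) notion of plausibility and an acceptability that is zero for negative slopes and flat otherwise, then the best acceptable line is the least-squares line with its slope clamped at zero.  These modeling choices and the function name are ours, chosen to keep the example small.</p>

```python
def fit_line_nonnegative_slope(xs, ys):
    """Least-squares line through (xs, ys) subject to slope being non-negative.

    For a fixed slope the best intercept is mean_y - slope * mean_x, and the
    remaining error is a parabola in the slope, so the constrained optimum is
    the unconstrained slope clamped at zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = max(0.0, sxy / sxx)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Raw coin-flip data: fraction of heads out of 10 flips at each stair step.
steps = list(range(1, 11))
heads = [4, 5, 7, 4, 6, 5, 6, 6, 3, 4]
rate = [h / 10 for h in heads]

intercept, slope = fit_line_nonnegative_slope(steps, rate)
print("intercept:", intercept, "slope:", slope)
```

<p>On the raw coin data the constraint is active, so the fit is a flat line at the overall heads rate of one half.  The domain knowledge is used, but it cannot conjure a trend the data does not support.</p>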
<p>Another method is to use more sophisticated notions.  One such method is Quantile Regression ( Roger Koenker, Cambridge University Press 2005).  This methodology treats regression as a constrained optimization problem- so it is a simple matter to add in more constraints (like the slope must be positive) without having to assign arbitrary plausibilities to every possible model.  Another (huge) advantage is that Quantile Regression is much more stable and even without any entered constraints recognizes that the coin-flip data is likely trend free.  Here we plot the Quantile Regression analysis of the coin-data (without having added any prior constraints):</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/04/quantileregression.gif" alt="QuantileRegression.gif" border="0" width="400" height="400" /><br />
</center></p>
<p>To be honest: the method got lucky- the fit is better than should be expected.  But Quantile Regression is the perfect framework for adding in domain-constraints.</p>
<p>So: while The Data Enrichment Method is a fraud, there are ways to enhance analysis to incorporate domain knowledge into results.  Instead of saying &#8220;any bias (even useful bias) ruins fitting&#8221; one should have a cookbook of methods ready to be applied.  These cookbooks hide under names like &#8220;Econometric Society Monographs&#8221; (in my opinion the econometricians really own the interface between theoretical statistics and hard-nosed applications).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A Quick Appreciation of the Sharpe Ratio</title>
		<link>http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=a-quick-appreciation-of-the-sharpe-ratio</link>
		<comments>http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/#comments</comments>
		<pubDate>Wed, 01 Oct 2008 03:15:07 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Sharpe Ratio]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=22</guid>
		<description><![CDATA[The current state of the global financial markets has gotten more people than usual worrying about the technical aspects of finance. One method for reasoning about investment returns and risk is a tool called the Sharpe Ratio. It is well worth reviewing this measure and seeing how, if used properly, it doesn&#8217;t favor any of [...]
]]></description>
			<content:encoded><![CDATA[<p>The current state of the global financial markets has gotten more people than usual worrying about the technical aspects of finance.  One method for reasoning about investment returns and risk is a tool called the Sharpe Ratio.  It is well worth reviewing this measure and seeing how, if used properly, it doesn&#8217;t favor any of the mistakes that underlie our current financial crisis.<span id="more-22"></span></p>
<p>The Sharpe ratio is a famous measure of &#8220;risk adjusted return&#8221; and is defined as &#8220;the ratio of the expected excess return from an investment divided by standard deviation of the excess return.&#8221;  It is most easily demonstrated by an example (which we work in pieces).</p>
<p>If an investment is expected to generate a profit of 15% in the next year and an insured bank account would generate 10% profit then the expected excess return is 15% &#8211; 10% = 5%.  A rational investor would never take a risky investment that did not have a positive excess return (else they would expect to make more money at a bank). &#8220;Expected&#8221; is a technical term meaning the return of the investment averaged over all possible outcomes (weighted by the probability of each outcome); we can explain this by working a couple of examples.</p>
<p>Consider investment &#8220;A&#8221; which is a generally good idea that returns a 20% profit in half the possible years and a 10% profit in the other half of the years.  Investment A has an expected return of 0.5*20% + 0.5*10% = 15%.  Investment &#8220;A&#8221; has 15% &#8211; 10% = 5% excess return.</p>
<p>Also consider another investment &#8220;B&#8221; which is a risky bet that returns 20% profit most years (around 95.8% of them) and goes bankrupt in the other years.  The expected return of investment &#8220;B&#8221; is 0.958*20% + 0.042*(-100%) = 14.96%, or essentially 15%.   Investment &#8220;B&#8221; has 15% &#8211; 10% = 5% excess return.</p>
<p>As we can see, &#8220;expectation&#8221; alone cannot tell these two investments apart.  That is why the second component of the Sharpe ratio is something called the standard deviation.  The standard deviation is defined as the square-root of the expected squared deviation of the return from its expected value (here the target value of 15%).  What we do is measure for each possible outcome how far off the return is from the target of 15%, multiply this number by itself (called squaring it), weight the result by the probability of that outcome, and then take the square-root of the sum of all such values.  Again, this is best explained by an example.</p>
<p>Investment &#8220;A&#8221; has a standard deviation of:<br />
square-root(  0.5 * (20% &#8211; 15%)*(20% &#8211; 15%) +  0.5 * (10% &#8211; 15%)*(10% &#8211; 15%)  ) = 5%</p>
<p>And investment &#8220;B&#8221; has a standard deviation of:<br />
square-root( 0.958 *( 20% &#8211; 15%)*( 20% &#8211; 15%) + 0.042*(-100% &#8211; 15%)*(-100% &#8211; 15%) ) = 24%</p>
<p>Just like in the calculation of expectation we are taking every possible situation and summing (weighted by the likelihood) our value of interest (in this case the squared variation).</p>
<p>The standard deviation&#8217;s opinion is that investment &#8220;B&#8221; is about five times riskier than investment &#8220;A.&#8221;  And this is the grace of the Sharpe ratio: it says that investment &#8220;A&#8221;&#8217;s value is (15% &#8211; 10%)/5% = 1 and &#8220;B&#8221;&#8217;s value is (15% &#8211; 10%)/24% = 0.2.</p>
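<p>The arithmetic for both investments can be collected into a few lines.  This is only a sketch: the function names are ours, each investment is written as a list of (probability, return) pairs, and the code uses the exact expected return of &#8220;B&#8221; (14.96%) rather than the rounded 15%.</p>

```python
from math import sqrt

RISK_FREE = 0.10  # the insured bank account rate from the example

# Each investment is a list of (probability, return) outcomes.
investment_a = [(0.5, 0.20), (0.5, 0.10)]
investment_b = [(0.958, 0.20), (0.042, -1.00)]


def expected_return(outcomes):
    """Probability-weighted average return."""
    return sum(p * r for p, r in outcomes)


def std_dev(outcomes):
    """Square root of the probability-weighted squared deviation from the mean."""
    mean = expected_return(outcomes)
    return sqrt(sum(p * (r - mean) ** 2 for p, r in outcomes))


def sharpe_ratio(outcomes, risk_free=RISK_FREE):
    """Expected excess return divided by the standard deviation of the return.

    With a constant risk-free rate, the standard deviation of the excess
    return equals the standard deviation of the return itself.
    """
    return (expected_return(outcomes) - risk_free) / std_dev(outcomes)


for name, outcomes in (("A", investment_a), ("B", investment_b)):
    print(name, round(expected_return(outcomes), 4),
          round(std_dev(outcomes), 4), round(sharpe_ratio(outcomes), 3))
```

<p>Investment &#8220;A&#8221; comes out with a Sharpe ratio of 1 and investment &#8220;B&#8221; with roughly 0.2, matching the hand calculation.</p>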
<p>An interesting feature of the Sharpe ratio is that, unlike Wall Street, it does not believe that leveraging increases profitability.  A common desperation move is to take an investment that has a moderate return and borrow money to simulate larger returns by having larger exposure.  For instance an investment that returns 15% can try to simulate a higher return by borrowing.   If for every $1,000 invested we borrow another $1,000 to invest (paying the risk-free rate of 10% for the money) one can show an apparent rate of return of ($2,000*15% &#8211; $1,000*10%)/$1,000 or 20%.  However, this is not free money- the investor is taking on twice as much risk for only a third more return (20% instead of 15%).  In fact with sufficient leverage (three times, four times, thirty times) one can convert a safe investment into a risky investment that could even go bankrupt.  The Sharpe ratio (by design) is not fooled by this sort of manipulation.  Investing $1,000 in investment A has the exact same Sharpe ratio as investing $1,000 plus $1,000 more borrowed at the risk-free rate (this is part of the cleverness of using excess returns instead of un-adjusted returns).</p>
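<p>A small sketch of the leverage argument, assuming we can borrow at exactly the risk-free rate of 10% (the helper <code>lever</code> and its parameters are our own naming).  Per dollar of our own money, holding L dollars of the asset and borrowing the other L &#8211; 1 dollars turns each return r into L*r &#8211; (L &#8211; 1)*10%.</p>

```python
from math import sqrt

RISK_FREE = 0.10
investment_a = [(0.5, 0.20), (0.5, 0.10)]  # (probability, return) outcomes


def lever(outcomes, leverage, risk_free=RISK_FREE):
    """Returns per dollar of own money when borrowing at the risk-free rate."""
    return [(p, leverage * r - (leverage - 1) * risk_free) for p, r in outcomes]


def expected_return(outcomes):
    return sum(p * r for p, r in outcomes)


def std_dev(outcomes):
    mean = expected_return(outcomes)
    return sqrt(sum(p * (r - mean) ** 2 for p, r in outcomes))


def sharpe_ratio(outcomes, risk_free=RISK_FREE):
    return (expected_return(outcomes) - risk_free) / std_dev(outcomes)


levered_a = lever(investment_a, 2)  # $1,000 of our own plus $1,000 borrowed
print("unlevered:", expected_return(investment_a), std_dev(investment_a),
      sharpe_ratio(investment_a))
print("2x levered:", expected_return(levered_a), std_dev(levered_a),
      sharpe_ratio(levered_a))
```

<p>The levered position shows a 20% expected return, but its standard deviation doubles as well, so the Sharpe ratio is unchanged.  Both the excess return and the standard deviation scale by the same factor L, which is why the ratio cannot be inflated this way.</p>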
<p>Unfortunately to use the Sharpe ratio you need good estimates of three things:</p>
<p>1) The expected return of the investment.</p>
<p>2) The risk-free rate available in the market (to compute the excess return).</p>
<p>3) The standard deviation of the investment.</p>
<p>All three of these facts are about the future, so we don&#8217;t really know any of them.  The historic returns of an investment are not the same thing as the expected returns in the future, interest rates can change and the standard deviation is especially hard to estimate.  However, if you have a model (or at least a theory) of what your investments are supposed to do then you can plug in estimates for these three quantities and use the Sharpe ratio to determine which investments really are best.</p>
<p>If you knew how investment &#8220;A&#8221; worked and could estimate that it returned 20% about half the time and 10% the other times you could estimate its Sharpe ratio as 1.  And if you knew investment &#8220;B&#8221; was a gamble that almost always paid off at 20% with a single rare event that causes bankruptcy you could estimate its Sharpe ratio as 0.2.  Even if your estimates were inaccurate (say you estimate investment &#8220;A&#8221;&#8217;s Sharpe ratio as 0.7 and investment &#8220;B&#8221;&#8217;s Sharpe ratio as 0.3) the indication is to stay away from investment &#8220;B.&#8221;</p>
<p>This is in stark contrast to the conclusion you would draw if you thought of these investments as a &#8220;black box&#8221; (like a fund of funds does) and looked only at their historic performance.  If you looked at around 5 years of historic performance of both investments you would (incorrectly) think the following:</p>
<p>Investment A looks kind of noisy, some years it returns 10% and some years it returns 20%.  You would estimate (correctly) the return as averaging to 15% and you can even get a historic estimate of its standard deviation that is actually about right (5%).</p>
<p>Investment B looks like easy money.  With about 80% chance you would not have seen a bankruptcy, just 5 years of 20% returns.  You would mis-estimate the return as being 20% (all you have ever seen) and further mis-estimate the standard deviation as 0%.</p>
<p>Based on historic data alone you would fire the manager of investment &#8220;A&#8221;, give the manager of investment &#8220;B&#8221; a huge bonus and invest all of your money.  And a few years later you would go bankrupt.</p>
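<p>The &#8220;about 80%&#8221; figure and the misleading track record can be checked directly.  This sketch assumes each year is an independent draw with a 95.8% chance of a good year, as in the description of investment &#8220;B&#8221;; the variable names are ours.</p>

```python
from statistics import mean, pstdev

p_good_year = 0.958  # chance that investment B returns 20% in a given year
years = 5

# Chance that a 5-year track record contains no bankruptcy at all.
p_no_bankruptcy = p_good_year ** years
print("P(no bankruptcy in 5 years):", round(p_no_bankruptcy, 3))

# What that lucky track record looks like to a black-box observer.
history_b = [0.20] * years
naive_return = mean(history_b)
naive_std = pstdev(history_b)
print("naive return estimate:", naive_return)
print("naive standard deviation estimate:", naive_std)
# A Sharpe ratio computed from these numbers would divide by zero:
# the history makes investment B look infinitely good.
```

<p>Roughly four times out of five the history shows nothing but 20% years, so the black-box estimates are a 20% return with no risk at all.</p>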
<p>What is going on is very well explained by Nassim Nicholas Taleb as &#8220;the turkey paradox.&#8221;  Domestic turkeys are all killed at about the exact same age (say 60 days).  For somebody that understands commercial poultry farming there is not any mystery or uncertainty about it.  60 days before you want to sell a turkey carcass you buy a turkey chick.  There is an inevitability and reverse causality- the desire for the turkey&#8217;s carcass funds and causes the turkey&#8217;s start of life 60 days earlier.  Now if the turkey is a statistical empiricist (perhaps with a PhD in machine learning) things look good.  The turkey sets up a model of each day having an unknown chance of being good or bad.  The turkey figures that each day&#8217;s outcome is an independent trial drawn from this single unknown probability.  The turkey collects evidence: every day it gets fed.  Each day is more evidence that all days will be good.  And then on day 60 the turkey gets a nasty surprise.  The turkey&#8217;s life was a bad investment from day one; all of the &#8220;evidence&#8221; the turkey collected along the way was irrelevant because the model was wrong.  And the model was wrong because the turkey guessed at the model instead of investigating the nature of poultry farming.</p>
<p>Much the same is true of many investments.  There are investments that look like investment &#8220;B&#8221; when you open the hood.  Many of them involve writing &#8220;out of the money options&#8221; and &#8220;default swaps.&#8221;  These are essentially selling insurance on events that nobody thinks are likely.  Selling insurance that usually is not used is profitable, until the insurance gets used.   This is why insurance companies (if they are ethical) don&#8217;t treat the entirety of collected payments as profit- but as a stockpile that must be kept to pay the claims that will inevitably some day come true.</p>
<p>It is important to point out the Sharpe ratio will give you incorrect results if you plug bad estimates into it.  Overall the Sharpe ratio prefers good investments and diversification but it can be led astray.  In fact that is the whole point: no amount of smart math will undo the inevitable consequences of wrong models that are used because &#8220;you need something you can solve&#8221; (like the turkey) or &#8220;everybody else is getting rich using them&#8221; (like investment &#8220;B&#8221;).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

