<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Expository Writing</title>
	<atom:link href="http://www.win-vector.com/blog/category/expository-writing/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Ergodic Theory for Interested Computer Scientists</title>
		<link>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ergodic-theory-for-interested-computer-scientists</link>
		<comments>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/#comments</comments>
		<pubDate>Sat, 04 Feb 2012 17:42:07 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Ergodic Theorem]]></category>
		<category><![CDATA[Gibbs Sampler]]></category>
		<category><![CDATA[Markov Chains]]></category>
		<category><![CDATA[Random Sampling]]></category>
		<category><![CDATA[Randomized Algorithms]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1933</guid>
		<description><![CDATA[We describe ergodic theory in modern notation accessible to interested computer scientists. The ergodic theorem (http://en.wikipedia.org/wiki/Ergodic theory (link)) is an important principle of recurrence and averaging in dynamical systems. However, there are some inconsistent uses of the term, much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/' rel='bookmark' title='Hello World: An Instance Of Rhetoric in Computer Science'>Hello World: An Instance Of Rhetoric in Computer Science</a></li>
<li><a href='http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/' rel='bookmark' title='Six Fundamental Methods to Generate a Random Variable'>Six Fundamental Methods to Generate a Random Variable</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We describe ergodic theory in modern notation accessible to interested computer scientists.</p>
<p>The ergodic theorem (<a target="_blank" href="http://en.wikipedia.org/wiki/Ergodic_theory">link</a>) is an important principle of recurrence and averaging in dynamical systems. However, the term is used inconsistently, much of the machinery is intended for deterministic dynamical systems (not probabilistic systems, as is often implied), and the conclusion of the theory is often mis-described as its premises.</p>
<p>By “interested computer scientists” we mean people who know math and work with probabilistic systems, but know not to accept mathematical definitions without some justification (actually a good attitude for mathematicians also).<span id="more-1933"></span>Please click through to read <a target="_blank" href="http://www.win-vector.com/dfiles/ErgodicTheory.pdf">Ergodic Theory for Interested Computer Scientists</a>.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/' rel='bookmark' title='Hello World: An Instance Of Rhetoric in Computer Science'>Hello World: An Instance Of Rhetoric in Computer Science</a></li>
<li><a href='http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/' rel='bookmark' title='Six Fundamental Methods to Generate a Random Variable'>Six Fundamental Methods to Generate a Random Variable</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Correlation and R-Squared</title>
		<link>http://www.win-vector.com/blog/2011/11/correlation-and-r-squared/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=correlation-and-r-squared</link>
		<comments>http://www.win-vector.com/blog/2011/11/correlation-and-r-squared/#comments</comments>
		<pubDate>Tue, 22 Nov 2011 00:04:28 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[correlation]]></category>
		<category><![CDATA[goodness-of-fit]]></category>
		<category><![CDATA[linear regression]]></category>
		<category><![CDATA[R-squared]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1866</guid>
		<description><![CDATA[What is R2? In the context of predictive models (usually linear regression), where y is the true outcome, and f is the model&#8217;s prediction, the definition that I see most often is: In words, R2 is a measure of how much of the variance in y is explained by the model, f. Under &#8220;general conditions&#8221;, [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Living in A Lognormal World'>Living in A Lognormal World</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>What is R<sup>2</sup>? In the context of predictive models (usually linear regression), where <em>y</em> is the true outcome, and <em>f</em> is the model&#8217;s prediction, the definition that I see most often is: </p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/4471BBA8-E9DB-4D30-A9AE-A74F8C773247.jpg" alt="4471BBA8-E9DB-4D30-A9AE-A74F8C773247.jpg" border="0" width="195" /></div>
<p></p>
<p>In words, R<sup>2</sup> is a measure of how much of the variance in <em>y</em> is explained by the model, <em>f</em>.  </p>
<p> Under &#8220;general conditions&#8221;, as Wikipedia says,<br />
R<sup>2</sup> is also the square of the correlation between the actual and predicted outcomes: </p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/A4311540-8DFB-45FB-93F7-65E7B72AE6C8.jpg" alt="A4311540-8DFB-45FB-93F7-65E7B72AE6C8.jpg" border="0" width="336" /></div>
</p>
<p>I prefer the &#8220;squared correlation&#8221; definition, as it gets more directly at what is usually my primary concern: prediction. If R<sup>2</sup> is close to one, then the model&#8217;s predictions mirror the true outcome tightly. If R<sup>2</sup> is low, then either the model does not mirror the true outcome, or it only mirrors it loosely: a &#8220;cloud&#8221; that (hopefully) is oriented in the right direction. Of course, looking at the graph always helps:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/R2_compare.png" alt="R2_compare.png" border="0" width="550" height="450" /></div>
</p>
<p>The question we will address here is: how do you get from R<sup>2</sup> to correlation?</p>
<p><span id="more-1866"></span>
<p>If you look at the two equations for correlation and R<sup>2</sup>, you can see that the relationship between them does not hold for general <em>f</em> and <em>y</em>. In particular, correlation is far more invariant to scaling. For correlation, all of the following relations are true: </p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/3B8B7BE1-9E6B-4F6B-B1F7-7AFF3C2331BD.jpg" alt="3B8B7BE1-9E6B-4F6B-B1F7-7AFF3C2331BD.jpg" border="0" width="143"/></div>
<p>But only the last relation is true for R<sup>2</sup>. So in general, the two cannot be functions of each other.
</p>
<p>However, we are making a specific assumption about <em>f</em>: it is the output of a predictive model. In fact, we are actually making several specific assumptions:</p>
<p>1. <em>f</em> is the model that minimizes squared-error loss <br />
2. Because it is the optimum (in the sense of item 1), there is no shift of <em>f</em> that will improve the fit. <br />
3. Because it is the optimum (in the sense of item 1), there is no scaling of <em>f</em> that will improve the fit. 
</p>
<p>We can express the above assumptions as follows:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/B719DE1C-CE30-46BB-9B19-58111A5BEAD6.jpg" alt="B719DE1C-CE30-46BB-9B19-58111A5BEAD6.jpg" border="0" width="212" /></div>
<p>If we express the last line as</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/25988EED-3859-4732-837B-06106D0F9E90.jpg" alt="25988EED-3859-4732-837B-06106D0F9E90.jpg" border="0" width="210" /></div>
<p>Then loss is optimized at <em>g(1,0)</em>.
</p>
<p>Since <em>g(1,0)</em> is the optimum, the derivatives of <em>g</em> are zero there: </p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/4CA19B8A-E816-4F11-AC9B-6CA80AC5E961.jpg" alt="4CA19B8A-E816-4F11-AC9B-6CA80AC5E961.jpg" border="0" width="317" /></div>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/C2CDE5A4-0BBF-4AEC-9EA5-3B8767BBE33A.jpg" alt="C2CDE5A4-0BBF-4AEC-9EA5-3B8767BBE33A.jpg" border="0" width="290" /></div>
</p>
<p>From the partial with respect to <em>a</em>, we get that</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/787BDA81-F90A-475D-B546-68B931C7909E.jpg" alt="787BDA81-F90A-475D-B546-68B931C7909E.jpg" border="0" width="90" /></div>
<p>and from the partial with respect to <em>b</em>, we get that</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/47B35F40-060C-4284-B9B4-685A3A260F09.jpg" alt="47B35F40-060C-4284-B9B4-685A3A260F09.jpg" border="0" width="43"/></div>
<p>(since the mean is just the normalized sum).
</p>
<p>Now, let&#8217;s shift the coordinate system so that <span style="text-decoration:overline">y</span> (and <span style="text-decoration:overline">f</span>) are equal to zero. This makes the equations much simpler, and doesn&#8217;t affect the generality of the result.
</p>
<p>The equation for R<sup>2</sup> is now</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/6F70842F-4123-4A47-BF47-F5827D52607F.jpg" alt="6F70842F-4123-4A47-BF47-F5827D52607F.jpg" border="0" width="230" /></div>
<p>And the equation for correlation is now</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/BBB57FF2-F3A3-479D-8352-78E019AD2996.jpg" alt="BBB57FF2-F3A3-479D-8352-78E019AD2996.jpg" border="0" width="250" /></div>
<p>And we are done.
</p>
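<p>Because the equations above appear only as images, here is a sketch of the final step written out in LaTeX. The notation is an assumption: sums run over the data points <em>i</em>, both means are zero after the shift, and the condition from the partial with respect to <em>a</em> is taken to be that the sum of f<sub>i</sub>y<sub>i</sub> equals the sum of f<sub>i</sub><sup>2</sup>.</p>

```latex
\begin{align*}
R^2 &= 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i y_i^2}
     = \frac{2\sum_i f_i y_i - \sum_i f_i^2}{\sum_i y_i^2}
     = \frac{\sum_i f_i^2}{\sum_i y_i^2}, \\
\rho^2 &= \frac{\left(\sum_i f_i y_i\right)^2}{\sum_i f_i^2 \, \sum_i y_i^2}
        = \frac{\left(\sum_i f_i^2\right)^2}{\sum_i f_i^2 \, \sum_i y_i^2}
        = \frac{\sum_i f_i^2}{\sum_i y_i^2}.
\end{align*}
```

<p>Both expressions reduce to the same ratio, which is the claimed identity between R<sup>2</sup> and the squared correlation.</p>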
<p>Notice that this result is true for any model fit that meets the assumptions that we outlined above (squared-error loss, optimality under shifting and scaling). Linear regression (with an intercept) meets these criteria, but so can other model-fitting techniques, such as generalized additive models, polynomial fits, decision (regression) trees, and ensemble methods, if the proper loss function is used.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/11/DecisionTree.png" alt="DecisionTree.png" border="0" width="550" height="450" /></div>
</p>
<p>To repeat: for optimal models (under squared-error loss, shift and scale invariance), R<sup>2</sup> is the square of the correlation between the true and predicted outcomes. This relationship is not true for general <em>f</em> and <em>y</em>.</p>
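<p>A quick numerical check can make this concrete. The following is a minimal sketch in plain Python, not code from the original analysis: the helper names (<em>r_squared</em>, <em>correlation</em>, <em>fit_line</em>), the synthetic data, and the seed are all assumptions chosen for illustration. It fits a least-squares line, then compares R<sup>2</sup> with the squared correlation for the fitted predictions and for a rescaled, shifted copy of them.</p>

```python
import random

def mean(v):
    return sum(v) / len(v)

def r_squared(y, f):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def correlation(y, f):
    """Pearson correlation between y and f."""
    y_bar, f_bar = mean(y), mean(f)
    cov = sum((yi - y_bar) * (fi - f_bar) for yi, fi in zip(y, f))
    var_y = sum((yi - y_bar) ** 2 for yi in y)
    var_f = sum((fi - f_bar) ** 2 for fi in f)
    return cov / (var_y * var_f) ** 0.5

def fit_line(x, y):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return slope, y_bar - slope * x_bar

rng = random.Random(0)  # fixed seed (an arbitrary choice) so the run is repeatable
x = [rng.uniform(0.0, 10.0) for _ in range(200)]
y = [2.0 * xi + 1.0 + rng.gauss(0.0, 3.0) for xi in x]  # synthetic data

slope, intercept = fit_line(x, y)
f_opt = [slope * xi + intercept for xi in x]  # optimal under squared-error loss
f_bad = [2.0 * fi + 5.0 for fi in f_opt]      # same predictions, rescaled and shifted

print("optimal f:  R2 =", r_squared(y, f_opt), " corr^2 =", correlation(y, f_opt) ** 2)
print("rescaled f: R2 =", r_squared(y, f_bad), " corr^2 =", correlation(y, f_bad) ** 2)
```

<p>For the fitted line the two numbers should agree up to rounding. For the rescaled copy the squared correlation is unchanged while R<sup>2</sup> drops, which is the sense in which the relationship fails for general <em>f</em>.</p>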
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Living in A Lognormal World'>Living in A Lognormal World</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/11/correlation-and-r-squared/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>An Appreciation of Locality Sensitive Hashing</title>
		<link>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=an-appreciation-of-locality-sensitive-hashing</link>
		<comments>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 16:41:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Locality Sensitive Hashing]]></category>
		<category><![CDATA[Nearest Neighbor]]></category>
		<category><![CDATA[Theorist]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1848</guid>
		<description><![CDATA[We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques.Locality sensitive hashing is awe inspiring in its originality, simplicity, beauty and effectiveness. In addition locality sensitive hashing is a remarkable technique as it works even when drastically abridged and [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques.<span id="more-1848"></span>Locality sensitive hashing is awe-inspiring in its originality, simplicity, beauty and effectiveness.  In addition, locality sensitive hashing is a remarkable technique as it works even when drastically abridged and simplified.  In this <a target="_blank" href="http://www.win-vector.com/dfiles/LocalitySensitiveHashing.pdf">paper (link to pdf)</a> we give a description of conditions where the technique works and a heuristic argument why it works (using only elementary math).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Kernel Methods and Support Vector Machines de-Mystified</title>
		<link>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=kernel-methods-and-support-vector-machines-de-mystified</link>
		<comments>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/#comments</comments>
		<pubDate>Sat, 08 Oct 2011 00:17:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Kernel Methods]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Naive Bayes]]></category>
		<category><![CDATA[Support Vector Machines]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1804</guid>
		<description><![CDATA[We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Goals [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/' rel='bookmark' title='Do your tools support production or complexity?'>Do your tools support production or complexity?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We give a simple explanation of the interrelated machine learning techniques called <a target="_blank" href="http://en.wikipedia.org/wiki/Kernel_methods">kernel methods</a> and <a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>.  We hope to characterize and de-mystify some of the properties of these methods.  To do this we work some examples and draw a few analogies.  The familiar, no matter how wonderful, is not perceived as mystical.<span id="more-1804"></span><br />
<h2>Goals of this writeup</h2>
<ol>
<li>De-mystify  kernel methods and support vector machines<br />
<blockquote><p>
Kernel methods and support vector machines have taken on mythological proportions in the machine learning imagination. Partly this is because a number of good ideas are overly associated with them: support/non-support training datums, weighting training data, discounting data, regularization, margin and the bounding of generalization error.  My issue is that these are all important enough ideas to stand on their own and are often seen in simpler settings.  The observations that inform my view are as follows:</p>
<ul>
<li>Kernel methods and support vector machines are in fact two good ideas.  Each is important even without the other: kernels are useful all over and support vector machines would be  useful even if we restricted to the trivial identity kernel.</li>
<li>Small scale &#8220;kernel tricks&#8221; are not that different from the classic technique of adding &#8220;interaction variables.&#8221;  Kernels let you escape from the limits of &#8220;linear hypotheses&#8221; (really by moving to a bigger space where things are again linear but look curved from the point of view of your smaller original space).  We demonstrate the linear methods and &#8220;primal kernel tricks&#8221; in <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a>.</li>
<li>Support/non-support is the central issue of nearest neighbor classifiers.</li>
<li>Weighting training data is most famously shared by logistic regression and boosting.</li>
<li>Re-weighting of data by <i>smoothing kernels</i> (a different but related use of the word &#8220;kernel&#8221;) is central to non-parametric statistics (kernel smoothers and splines).</li>
<li>Regularization is of supreme importance in modeling in general.</li>
<li>Most practitioners get tired of &#8220;kernel shopping&#8221; and fall back to the identity, cosine or radial/Gaussian kernels.</li>
</ul>
</blockquote>
</li>
<li>Show concrete examples of what are and what are not kernels.<br />
Few sources give enough theorems to think about kernels abstractly and fewer still work concrete examples.
</li>
<li>
Use as little math as possible (which, unfortunately, turns out to be quite a bit).  We will discuss encodings and stopping conditions (important to understand what is going on) but avoid explaining the optimizers (the most beautiful part of support vector machines, but also the part that is available pre-packaged in libraries).
</li>
<li>
Call out common magical thinking and unreasonable expectations associated with kernel methods and support vector machines.  This should help the reader be in a better position to &#8220;defend their doubts&#8221; regarding machine learning promises.
</li>
<li>
Try to place all techniques in a wider context (if it is usable in only one place it is a trick; if it is usable in multiple places it is a technique).
</li>
<li>Discuss margin and its impact on generalization error.<br />
<blockquote><p>
Generalization error is an effect of &#8220;over fitting&#8221; where a model has learned things that are true about the training examples that do not hold for the overall truth or concept we are trying to learn (i.e. don&#8217;t generalize).   Generalization error is the excess error rate we observe when scoring new examples versus the error rate we saw in learning the training data.  Margin is in fact a <em>posterior</em> observation.  That is: margin is observed after the training data is seen; it is not known before the data is seen (as, for example, <a target="_blank" href="http://en.wikipedia.org/wiki/VC_dimension">VC dimension</a> is).   Margin is useful as it bounds generalization error, but it is not the <em>prior</em> bound it is often portrayed as.  So we assert margin estimates are not much more special than simple cross-validation estimates, which can also be performed once we have data available.
</p></blockquote>
</li>
</ol>
<p>The goal is not to indict or cut down kernel methods or support vector machines, but just to dump some of the associated baggage so they can be used fluidly and without anxiety.  </p>
<h2>Example Problem</h2>
<p>Consider the following simple (caricature) machine learning problem:  we are given a number of points labeled as circles and triangles and we want to, given a new point, predict the label.  In our example the only input will be two numbers: x and y.  This example is simple enough that we can depict the entire situation in Figure 1 below:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/TruthAndData.png" alt="TruthAndData.png" border="0" width="480" height="480" /></p>
<p>Figure 1: Truth and Training Data</p>
<p></center></p>
<p>We are assuming (for simplicity of exposition) that we have the incredible luck that the label is indeed a deterministic function of x and y.  We portray this &#8220;ground truth&#8221; as the blue parabola: every point found in this region will be considered to be labeled as &#8220;circle&#8221; and every other point (i.e. the red region) will be labeled as &#8220;triangle.&#8221;  We have overlaid 20 example points (with appropriate shape labels).  These points will be our training data.  We will try to learn from the observed training data a good approximation of the unseen blue and red regions.  This process is called learning or training and the ability to correctly predict the label of new points is called generalization.  To make sure our concept is learnable we are going to further guarantee a moat (or margin): the distribution we are picking training and test examples from will never pick a point from the shaded region between the blue and black.  That is, we will never be asked about an example where x*x is very near y.</p>
<p>To promote visual thinking we are going to avoid formulas until the section titled &#8220;precise functions used&#8221; (where we will list the formulas used for each graph).</p>
<h2>Nearest Neighbor Solution</h2>
<p>An interesting model in these days of big data and fast machines is the nearest neighbor model.  Such a model colors each point in space blue or red as it <em>thinks</em> the point should be classified.  In this case each point is colored the color given by the closest known training point.  This induces the type of model seen in Figure 2 whose boundary is piecewise line segments mid-way between training points.  The data points that determine segments of the classification boundary in this way are thought to support the boundary and are called &#8220;support examples&#8221; or &#8220;support vectors.&#8221;  In fact any training point that is not &#8220;supporting&#8221; part of the boundary is &#8220;irrelevant&#8221; (could be removed and we would get the exact same model).  This split between important and non-important training points is considered one of the important observations from  the theory of support vector machines, but we see it even here in the nearest neighbor models.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1NN.png" alt="1NN.png" border="0" width="480" height="480" /></p>
<p>Figure 2: 1 nearest neighbor model</p>
<p></center></p>
<p>Notice that the nearest neighbor model in Figure 2 is &#8220;not half bad.&#8221;  The blue region is much too wide, but near the training data the blue mass is roughly in the correct position.  If we had infinite data and infinite computational resources this sort of model would be hard to beat (as any point we are likely to be tested on would have a lot of very near examples to work from).  We can try to clean up the shape of the model a bit by using a vote of the 2 nearest points to color each point in the plane (see Figure 3) or even the majority of the 3 nearest points (see Figure 4).  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/2NN.png" alt="2NN.png" border="0" width="480" height="480" /></p>
<p>Figure 3: 2 nearest neighbor model</p>
<p></center></p>
<p>The purple region in Figure 3 is a &#8220;region of uncertainty&#8221; where the net vote over the 2 nearest points is zero (they gave inconsistent advice).  We can use this as a hint that predictions taken from this region are less reliable.  As you may notice, we are not getting real improvements.  We should always spend more time getting more features (useful coordinates in addition to x and y) and more data, and spend less time tweaking models.  But if your data and features are fixed, all you can work on is your modeling technique (alternately: your modeling technique is all you can prepare before receiving features and data).  So we will continue to discuss features and technique.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/3NN.png" alt="3NN.png" border="0" width="480" height="480" /></p>
<p>Figure 4: 3 nearest neighbor model</p>
<p></center></p>
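<p>The voting rule behind Figures 2 through 4 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code used to draw the figures: the function name <em>knn_label</em>, the six toy training points, and the choice that &#8220;circle&#8221; means y above x*x are all assumptions made here. A tie between the top two labels is reported as &#8220;uncertain&#8221;, which plays the role of the purple region.</p>

```python
from collections import Counter

def knn_label(train, query, k):
    """Vote among the k training points nearest to query.

    train is a list of ((x, y), label) pairs. Returns the winning label,
    or "uncertain" when the two most common labels tie.
    """
    def sq_dist(p):
        return (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2

    nearest = sorted(train, key=lambda item: sq_dist(item[0]))[:k]
    votes = Counter(label for _, label in nearest).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return "uncertain"
    return votes[0][0]

# Hypothetical toy training set (not the 20 points of Figure 1).
train = [
    ((0.0, 1.0), "circle"),
    ((0.5, 1.5), "circle"),
    ((-0.5, 1.2), "circle"),
    ((1.5, 0.5), "triangle"),
    ((-1.5, 0.2), "triangle"),
    ((0.0, -1.0), "triangle"),
]

for k in (1, 2, 3):
    print(k, knn_label(train, (0.1, 1.1), k), knn_label(train, (0.8, 0.8), k))
```

<p>On this toy data the query (0.8, 0.8) has one circle and one triangle as its two nearest neighbors, so the 2-nearest-neighbor vote is a tie, while the 3-nearest-neighbor vote breaks it.</p>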
<p>Nearest neighbor classifiers are close to optimal in the sense that with an infinite amount of data the 1-nearest neighbor classifier has an error rate of at most twice the Bayes error rate (the Bayes error rate being the ideal error observed on identical repetitions, or the theoretical best error rate) and for large k the k-nearest neighbor method approaches the Bayes error rate itself (see for example <a target="_blank" href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">k-nearest neighbor algorithm</a> ).</p>
<p>However, the effectiveness of the nearest neighbor classifiers is coming from the very low dimension (2) of our input variables x and y.  If we were attempting to predict from n variables we might need an amount of data exponential in n to get a useful nearest neighbor classifier.  This is an <a target="_blank" href="http://en.wikipedia.org/wiki/Efficiency_(statistics)">inefficient</a> use of the data; what we want is to use all of the data we have effectively.  To do this we further assume there is some relational reason that the position of the point in the plane determines the shape-label (i.e. that we are learning, generalizing, interpolating and extrapolating, not just memorizing).  We set our ambitions above memorizing and move towards functional modeling.  We start thinking in terms of change: how the label density of examples changes as we move is a good clue as to how it will continue to change.  We can do this either in a parametric or primal formulation (where we attempt to directly infer parameters of an assumed functional form as in regression or logistic regression) or in a non-parametric or dual form (where we attempt to learn relations between points and the training data instead of parameters).  We have discussed this before ( <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a> ) and expanded on primal methods ( <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">Learn Logistic Regression (and beyond)</a>, <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/">The Simpler Derivation of Logistic Regression</a> and <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/">The equivalence of logistic regression and maximum entropy models</a>).</p>
<h2>Functional Solution</h2>
<p>Our next idea is: can we use our training data to tease out a functional form for the relation between x,y and shape label?  The first function we will try is chosen to be similar to the nearest neighbor model we just demonstrated.  For this form we say each training point is the center of a <a target="_blank" href="http://en.wikipedia.org/wiki/Gaussian_function">Gaussian</a> hump (say up for blue/circle and down for red/triangle) which we will call a &#8220;discount function&#8221; (or informally a hump).  A &#8220;Gaussian&#8221; is just a function that falls off exponentially in the square of Euclidean distance from a central point (we can see the simple form of such functions in the &#8220;precise functions used&#8221; section).  The contour lines (or levels as seen when looking down from above) of one such hump or discount function are shown in figure 5 (all graphs from here on will be looking down at a height represented by color and contour lines, much like topographic maps).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3ExampleConcept.png" alt="GaussianKernel3ExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 5: Narrow Gaussian example concept</p>
<p></center></p>
<p>This particular discount function has its maximum (or peak) centered on a data point and then rapidly falls as we move away from the training point.  The only property we will use in this section is that we can evaluate the discount easily (so we don&#8217;t at this time need or make use of any restrictive properties like integrability, positive semi-definiteness or even anything like bounded level sets).<br />
We could make our functional model just the sum of all of these up and down humps over all of the training data.  Each point gets a hump of the same height and same radius, pointing up for blue/circle and down for red/triangle.  The net coloring of this sum function is shown in figure 6.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3KernelSum1.png" alt="GaussianKernel3KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 6: Narrow Gaussian sum model</p>
<p></center></p>
<p>The idea is that this would imitate the nearest neighbor model we have already seen.  This is because even though we are summing over all data the closest training examples should usually dominate (since the Gaussian humps we picked fall to zero as we get further from their centers).   The correspondence to nearest neighbor increases if we tighten the radius of our functions (searching for the right radius is a very important statistical problem called &#8220;bandwidth estimation&#8221;).  Figure 7, for example, shows the sum over much narrower Gaussians.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel20KernelSum1.png" alt="GaussianKernel20KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 7: Very narrow Gaussian sum model</p>
<p></center></p>
<p>Figure 7 looks even more like the nearest neighbor solution.  In fact you could think of the discount-weighted sum over all of the data as the natural model (as it is easier to reason about) and the nearest neighbors as a more efficient approximation (summing only over near points).  We also have seen that we can alter the smoothness of our model (which has consequences on model generalization) by changing the steepness of our discount functions.  We can also build intermediate models (such as building a nearest neighbor but using discount weighted sums instead of uniform voting in the neighborhoods).   We haven&#8217;t yet greatly improved on nearest neighbor, but we have identified a couple of obvious avenues for further improvement: mess with the bandwidths (the obvious idea to try and deal with near/far scale) or re-weight the data (as one of the lessons of the nearest neighbor model is that points that are not near a boundary are less important).   </p>
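<p>To make the sum model concrete, here is a minimal sketch in Python.  The names (gaussian_hump, sum_model, classify), the toy data and the two steepness values are our own illustrative assumptions and are not the data or settings behind the figures.</p>

```python
import math

def gaussian_hump(center, z, steepness):
    """Discount function: falls off exponentially in squared Euclidean distance."""
    d2 = sum((ci - zi) ** 2 for ci, zi in zip(center, z))
    return math.exp(-steepness * d2)

def sum_model(points, labels, z, steepness):
    """Signed sum of equal-height humps: +1 labels push up, -1 labels push down."""
    return sum(y * gaussian_hump(u, z, steepness) for u, y in zip(points, labels))

def classify(points, labels, z, steepness):
    """Predict +1 or -1 from the sign of the summed humps."""
    return 1 if sum_model(points, labels, z, steepness) >= 0 else -1

# Hypothetical toy data (not the 20 points used in the figures).
points = [(0.0, 1.0), (0.5, 1.5), (1.0, 0.0), (-1.0, 0.2)]
labels = [1, 1, -1, -1]

# A larger steepness means narrower humps, so the nearest training point
# contributes a larger share of the sum.
for steepness in (3.0, 20.0):
    print(steepness, classify(points, labels, (0.1, 0.9), steepness))
```

<p>Raising the steepness narrows every hump at once, which is the sense in which the sum model drifts toward nearest neighbor behavior.</p>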
<h2>Bandwidth Solution</h2>
<p>As a digression let&#8217;s play with bandwidth a bit first.  Suppose simultaneously for each training datum we picked an ideal bandwidth or steepness of the Gaussian.  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/BandwidthModel.png" alt="BandwidthModel.png" border="0" width="480" height="480" /></p>
<p>Figure 8: Simultaneous bandwidth model</p>
<p></center> </p>
<p>This simultaneous bandwidth model depicted in figure 8 was determined by picking a simultaneous assignment of Gaussian widths (each training datum gets its own width) that maximizes the log-sum of the category signed model function on the training data (much like the logistic regression maximum likelihood decoding).  This differs from typical bandwidth selection problems where all points are given a single common &#8220;best bandwidth.&#8221;  We instead picked a different bandwidth for each point (bandwidths picked to maximize how much of the sum belonged to the correct category on each training example).  The blue region is reasonable, but does not match the shape of the true concept.   Partly this is because there has not been enough training data to have learned a lot about the true shape (for example there is no empirical evidence for circle/blue in the top-left quadrant).  The other reason is a bias inherent to this type of model.  If one of the bandwidth functions is wider than all of the others (which is almost certain to occur) then it falls more slowly than all of the others and for points very far away from the center of the training data this one function is most of what remains.  Or equivalently: due to the nature of this modeling technique one of the concepts learned is going to be the union of bounded islands and the other concept will grab all points sufficiently far away from the center of the training data.  This is a strong (and undesirable) bias, but it is a bias also shared by support vector machines with Gaussian kernels (though support vector machines get it from their so-called &#8220;dc-term&#8221; b not being zero; this will be discussed later). Our bandwidth correction was successful, but frankly the optimization problem we solved to estimate the optimal bandwidths was nasty and would not be something we would advise for a large amount of data.</p>
<h2>Data weighting solution (support vector machine)</h2>
<p>So instead of messing with the bandwidths let us consider re-weighting the Gaussians in our sum.  We will leave the shape of each Gaussian the same but allow each individual training datum to have a unique weight or importance.  Perhaps by picking the best weights we can get a better model.  By &#8220;best&#8221; we will mean &#8220;best margin&#8221; (to be defined later) because with &#8220;best margin&#8221; as our objective the optimization problem of solving for the best data weights has a particularly beautiful form that can be reliably solved at great scale.  This is called a &#8220;support vector model&#8221; (or support vector machine) and we will describe these ideas in detail later.  But first let&#8217;s see the result.  Figure 9 shows the original narrow Gaussians re-summed according to the weights picked by the support vector method (instead of all being 1 as in figure 6).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3SVM1.png" alt="GaussianKernel3SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 9: Narrow Gaussian support vector model</p>
<p></center></p>
<p>Figure 9 is again an improvement.  The figure is formed by taking a signed sum of our discount functions centered at our training points and weighted by the &#8220;support vector machine weights.&#8221;  The signs are picked with one class encoded as +1 and the other as -1.  The colors red and blue are picked if the sum is above or below a constant &#8220;b&#8221; called &#8220;the dc-term&#8221; (part of the support vector solution).  The solution again looks okay: all the training points are in the correct color region (and you can&#8217;t hope to interpolate let alone extrapolate if you can&#8217;t even reproduce your training concept).  Also, as with the bandwidth model, contours are set sensibly (steep/dense where categories are near each other, shallow/sparse where things are safe).   We see the same inductive bias: one of the concepts is the union of bounded islands (this is always going to be the case unless the model picks b=0).  This bias is not obvious from the kernel choice, as it is hidden in the dc-term of the support vector fitting equations (it is not part of the kernel, but can be defeated by choosing unbounded kernels).  An interesting empirical fact about support vector models is they tend to work well with a larger bandwidth.    For example if we use a wider Gaussian we get an even more convincing model (see Figure 10).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel025SVM.png" alt="GaussianKernel025SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 10: Wide Gaussian support vector model</p>
<p></center></p>
<p>In figure 10 we have also called out one of the important features of support vector machines.   In the literature all datums are called &#8220;vectors&#8221; (as they are represented in coordinates) and a subset of these are called the support vectors.  In figure 10 we have drawn the 4 training examples that turn out to be support vectors as large shapes and all other training examples as small shapes.  The support vectors are the datums with non-negligible weights.  The learned model is a function of the support vectors only.  So after fitting we can discard the other 16 training points. This is similar to the fact that nearest neighbor also only needs the training examples that are supporting its model boundary.</p>
<p>Notice that at this point the support vector models are not &#8220;magically&#8221; better; they in fact look less like the truth we are trying to learn than either the nearest neighbor models or the un-weighted sum models.  To fix this we need to do some &#8220;kernel shopping&#8221; or find functions that better respect what we are trying to model.  With the right functions support vector machines can do a very good job at learning the concept (but knowing &#8220;the right functions&#8221; is a huge hint as to what the concept is).  There is nothing wrong with using a hint, but we do have to have a method to produce the hint or have a reasonable number of hints to choose from. </p>
<p>The &#8220;sum over everything with the same weight&#8221; solutions from the earlier section are very similar to Naive Bayes which also sums over all matching features with no weight adjustment.  The support vector &#8220;use the same model but pick better weights&#8221; stands over these sum models in very much the same way Logistic Regression stands over Naive Bayes.  In fact the optimization problems are very related.  If we consider weights over features or coordinates as &#8220;primal&#8221; and weights over data items as &#8220;dual&#8221; we can informally say something like: support vector machines optimize by working over dual variables and inspecting for primal weights, and logistic regression works over primal variables and inspects for data weights.   But this is a bit vague.</p>
<p>At this point we need to get a bit more explicit and precise.</p>
<h2>Precise functions used</h2>
<p>In general we will write our training data as a sequence of n-vectors.  Our points will be named u(1) &#8230; u(m).   For each i = 1..m we also know which category the training point is labeled with.  We will encode this as y(1) &#8230; y(m) where each y() is either +1 or -1 depending on the training label.  So the example we have been working has n=2 and m=20.  Let z represent an n-vector we wish to classify.</p>
<p>Figure 6 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" alt="E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 7 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" alt="9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 8 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" alt="C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" border="0"  height="50" /><br />
</center><br />
where the w(i) are non-negative numbers picked so that<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/54765538-DADD-4353-A59A-EF3B6C42743E.jpg" alt="54765538-DADD-4353-A59A-EF3B6C42743E.jpg" border="0"  height="50" /><br />
</center><br />
is large.</p>
<p>Figure 9 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" alt="CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so &#8220;the margin&#8221; (discussed later) is large.</p>
<p>Figure 10 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" alt="1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so the margin is large.</p>
<p>All of this repetition is to emphasize the commonality of the models.  They are all of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" alt="1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) are non-negative numbers called the &#8220;support weights&#8221; and k(,) is a function mapping pairs of vectors to numbers and is called &#8220;a kernel.&#8221;  Unfortunately there are many incompatible uses of the word kernel in mathematics and statistics.  Here &#8220;kernel&#8221; is being used in the sense of <a target="_blank" href="http://en.wikipedia.org/wiki/Positive-definite_kernel">positive semi-definiteness</a>, which in particular implies k(u,u) &ge; 0 for all u.   Note that the support vector machine, instead of using the sign of f(z) as its decision, uses which side of a constant b the value f(z) falls on as its category decision.  A consequence is: for support vector machines if b is non-zero (as it almost surely will be) and the kernels all go to zero fast enough as we approach infinity (as they are designed to do) then exactly one of the learned classes is infinite and the other is a union of islands (regardless of whether this was true for the training data).</p>
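<p>As a rough sketch of this common form, the following Python assumes the model is f(z) = sum over i of a(i) y(i) k(u(i),z) and that the decision is the sign of f(z) - b.  The function names, the toy points, the weights and the Gaussian steepness c=3 are all hypothetical choices for illustration only.</p>

```python
import math

def kernel_model(points, labels, weights, kernel, z):
    """f(z) = sum_i a(i) * y(i) * k(u(i), z)."""
    return sum(a * y * kernel(u, z) for u, y, a in zip(points, labels, weights))

def decide(points, labels, weights, kernel, z, b=0.0):
    """Category decision: which side of the constant b the value f(z) falls on."""
    return 1 if kernel_model(points, labels, weights, kernel, z) - b >= 0 else -1

def gaussian_kernel(u, v, c=3.0):
    """Gaussian discount function; the steepness c is an assumed value."""
    return math.exp(-c * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

# Hypothetical toy model: a zero weight means the point is not a support vector.
points = [(0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
labels = [1, -1, -1]
weights = [1.0, 1.0, 0.0]

# Far from all data every Gaussian term is essentially zero,
# so the sign of -b alone decides the category.
far_away = (100.0, 100.0)
print(decide(points, labels, weights, gaussian_kernel, far_away, b=0.1))
print(decide(points, labels, weights, gaussian_kernel, far_away, b=-0.1))
```

<p>Whichever class the sign of -b favors claims everything far from the data, which is the &#8220;one infinite class plus islands&#8221; behavior described above.</p>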
<h2>About Kernels</h2>
<p>A lot of awe and mysticism is associated with kernels, but we think they are not that big a deal.  Kernels are a combination of two good ideas; they have one important property and are subject to one major limitation.  It is also unfortunate that support vector machines and kernels are tied so tightly together.   Kernels are the idea of summing functions that imitate similarity (induce a positive-definite encoding of nearness) and support vector machines are the idea of solving a clever dual problem to maximize a quantity called margin.  Each is a useful tool even without the other.</p>
<p>The two good ideas are related and unfortunately treated as if they are the same.  The two good ideas we promised are:</p>
<ol>
<li>The &#8220;kernel trick.&#8221;  Adding new features/variables that are functions of your other input variables can change linearly inseparable problems into linearly separable problems.  For example if our points were encoded not as u(i) = (x(i),y(i)) but as u(i) = (x(i),y(i),x(i)*x(i),y(i)*y(i),x(i)*y(i)) we could easily find the exact concept (y(i) &gt; x(i)*x(i)), which is now a linear concept encoded as the vector (0,1,-1,0,0).</li>
<li>Often you don&#8217;t need the coordinates of u(i).  You are only interested in functions of distances ||u(i)-u(j)||^2 and in many cases you can get at these by inner products and relations like ||u(i)-u(j)||^2 = &lt;u(i),u(i)&gt; + &lt;u(j),u(j)&gt; - 2&lt;u(i),u(j)&gt; .</li>
</ol>
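<p>Both ideas fit in a few lines of Python.  This is only an illustrative sketch; the helper names (lift, dot, above_parabola, squared_distance) are ours and the sample points are made up.</p>

```python
def lift(x, y):
    """Re-encode (x, y) as (x, y, x*x, y*y, x*y)."""
    return (x, y, x * x, y * y, x * y)

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

# In the lifted coordinates the concept y > x*x is a linear test.
w = (0, 1, -1, 0, 0)

def above_parabola(x, y):
    """True when dot(w, lift(x, y)) is positive, that is when y > x*x."""
    return dot(w, lift(x, y)) > 0

def squared_distance(u, v, inner=dot):
    """||u-v||^2 computed only from inner products, never from coordinates."""
    return inner(u, u) + inner(v, v) - 2 * inner(u, v)

print(above_parabola(0.5, 1.0), above_parabola(2.0, 1.0))
print(squared_distance((0.0, 0.0), (3.0, 4.0)))
```

<p>Because squared_distance takes the inner product as a parameter, any kernel can be passed in its place to measure distance in that kernel&#8217;s implicit space.</p>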
<p>We will expand on these issues later.</p>
<p>The important property is that kernels look like inner products in a transformed space.  The definition of a kernel is: there exists a magic function phi() such that for all u,v:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" alt="F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" border="0"  height="20" /> .<br />
</center><br />
This means that k(.,.) is behaving like an inner product in some (possibly unknown) space.  The important consequence is the positive semi-definiteness, which implies k(u,u)&ge;0 for all u (and this just follows from the fact about inner products over the real numbers that &lt;z,z&gt;&ge;0 for all z).  This is why optimization problems that use the kernel as their encoding are well formed (such as the optimization problem of maximizing margin which is how support vector machines work).  You can <a target="_blank" href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">&#8220;regularize&#8221;</a> optimization problems with a kernel penalty because it behaves a lot like a norm.  Without the positive semidefinite property all of these optimization problems would be able to &#8220;run to negative infinity&#8221; or use negative terms (which are not possible from a kernel) to hide high error rates.  The limits of the kernel functions (not being able to turn distance penalties into bonuses) help ensure that the result of optimization is actually useful (and not just a flaw in our problem encoding).</p>
<p>And this brings us to the major limitation of kernels.  The phi() transform can be arbitrarily magic except when transforming one vector it doesn&#8217;t know what the other vector is and phi() doesn&#8217;t even know which side of the inner product it is encoding.   That is, kernels are not as powerful as any of the following forms:</p>
<ul>
<li>snooping (knowing the other):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" alt="1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" border="0"  height="20" />
</li>
<li>positional (knowing which part of inner product mapping to):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" alt="D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" border="0" height="20" />
</li>
<li>fully general:<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" alt="845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" border="0" height="20" />
</li>
</ul>
<p>Not everything is a kernel.</p>
<p>Some non-kernels are:</p>
<ol>
<li> k(u,v) = -c for any c>0</li>
<blockquote><p>
A consequence of the fact that no negative number can be written as a sum of squares or as the limit of a sum of squares in the reals.
</p></blockquote>
<li> k(u,v) = ||u-v||<br />
<blockquote><p>
This can be shown as follows.  Suppose k(u,v) is a kernel with k(u,u) = 0 for all u (which is necessary to try and match ||u-v|| as ||u-u|| = 0 for all u).  By the definition of kernels k(u,u) = &lt; phi(u), phi(u) &gt; for some real vector valued function phi(.).  But, by the properties of the real inner product &lt;.,.&gt;, this means phi(u) is the zero vector for all u.  So k(u,v) = 0 for all u,v and does not match ||u-v|| for any u,v such that u &ne; v.
</p></blockquote>
</li>
</ol>
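<p>A quick numeric sanity check can rule out such candidates.  The sketch below tests two necessary (but not sufficient) conditions that follow from k being an inner product: k(u,u) &ge; 0 and the Cauchy-Schwarz inequality k(u,v)^2 &le; k(u,u) k(v,v).  The function names and the sample pair of points are hypothetical, and passing the test does not prove a function is a kernel.</p>

```python
import math

def passes_two_point_test(kernel, u, v):
    """Necessary conditions on the pair (u, v): non-negative diagonal
    and the Cauchy-Schwarz inequality (with a small numeric tolerance)."""
    kuu, kvv, kuv = kernel(u, u), kernel(v, v), kernel(u, v)
    return kuu >= 0 and kvv >= 0 and kuu * kvv - kuv * kuv >= -1e-12

def distance(u, v):
    """k(u,v) = ||u-v||, which is not a kernel."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def negative_constant(u, v):
    """k(u,v) = -1, which is not a kernel."""
    return -1.0

def linear(u, v):
    """k(u,v) = the ordinary inner product, which is a kernel."""
    return sum(ui * vi for ui, vi in zip(u, v))

u, v = (0.0, 0.0), (1.0, 2.0)
for name, k in [("distance", distance),
                ("negative constant", negative_constant),
                ("linear", linear)]:
    print(name, passes_two_point_test(k, u, v))
```

<p>The distance function fails because its diagonal is zero while its off-diagonal value is not, which is the same contradiction used in the argument above.</p>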
<p>There are some obvious kernels:</p>
<ol>
<li> k(u,v) = c for any c &ge; 0 (non-negative constant kernels) </li>
<li> k(u,v) = &lt; u , v &gt;  (the identity kernel) </li>
<li> k(u,v) = f(u) f(v) for any real valued function f(.) </li>
<li> k(u,v) = &lt; f(u) , f(v) &gt; for any real vector valued function f(.) (again, the definition of a kernel)</li>
<li> k(u,v) = transpose(u) B v where B is any symmetric positive semi-definite matrix</li>
</ol>
<p>And there are several subtle ways to build new kernels from old.  If q(.,.) and r(.,.) are kernels then so are:</p>
<ol>
<li> k(u,v) = q(u,v) + r(u,v)</li>
<li> k(u,v) = c q(u,v) for any c &ge; 0 </li>
<li> k(u,v) = q(u,v) r(u,v) </li>
<li> k(u,v) = q(f(u),f(v)) for any real vector valued function f(.)</li>
<li> k(u,v) = lim_{j-> infinity} q_j(u,v) where q_j(u,v) is a sequence of kernels and the limit exists.</li>
<li> k(u,v) = p(q(u,v)) where p(.) is any polynomial with all non-negative coefficients.</li>
<li> k(u,v) = f(q(u,v)) where f(.) is any function with an absolutely convergent Taylor series with all non-negative coefficients.</li>
</ol>
<p>Most of these facts are taken from the excellent book: John Shawe-Taylor and Nello Cristianini&#8217;s <a target="_blank" href="http://www.kernel-methods.net/">&#8220;Kernel Methods for Pattern Analysis&#8221;</a>, Cambridge 2004.  We are allowing kernels of the form k(u,v) = &lt; phi(u), phi(v) &gt; where phi(.) is mapping into an infinite dimensional vector space (like a series).  Most of these facts can be checked by imagining how to alter the phi(.) function.  For example to add two kernels just build a larger vector with enough slots for all of the coordinates for the vectors encoding the phi(.)&#8217;s of the two kernels you are trying to add.  To scale a kernel by c multiply all coordinates of phi() by sqrt(c).  </p>
<p>Multiplying two kernels is the trickiest.  Without loss of generality assume q(u,v) = &lt; f(u) ,f(v) &gt; and r(u,v) = &lt; g(u), g(v) &gt; and f(.) and g(.) are both mapping into the same finite dimensional vector space R^m (any other situation can be simulated or approximated by padding with zeros and/or taking limits).  Imagine a new vector function p(.) that maps into R^{m*m} such that p(z)_{m*(i-1) + j} = f(z)_i g(z)_j .  It is easy to check that k(u,v) = &lt; p(u) , p(v) &gt; is a kernel and k(u,v) = q(u,v) r(u,v). (The reference proof uses tensor notation and the Schur product, but these are big hammers mathematicians use when they don&#8217;t want to mess around with re-encoding indices).</p>
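<p>The index re-encoding for the product kernel can be checked numerically.  In this sketch the feature maps f(.) and g(.) are made-up examples into R^2 and product_map is our own helper name; with 0-based indices the coordinate m*(i-1)+j becomes position m*i+j of the list.</p>

```python
def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def product_map(f, g):
    """Build p(.) with one coordinate f(z)_i * g(z)_j for every pair (i, j)."""
    def p(z):
        fz, gz = f(z), g(z)
        return [fi * gj for fi in fz for gj in gz]
    return p

# Hypothetical feature maps into R^2, chosen only for illustration.
def f(z):
    return [z[0], z[1]]

def g(z):
    return [z[0] * z[1], 1.0]

p = product_map(f, g)
u, v = (1.0, 2.0), (3.0, -1.0)

q = dot(f(u), f(v))   # q(u,v), inner product under f
r = dot(g(u), g(v))   # r(u,v), inner product under g
k = dot(p(u), p(v))   # inner product under the combined map p
print(q * r, k)
```

<p>The two printed values agree because the double sum over (i, j) factors into the product of the two single sums.</p>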
<p>Note that one of the attractions of kernel methods is that you never have to actually implement any of the above constructions.  What you do is think in terms of sub-routines (easy for computer scientists, unpleasant for mathematicians).  For instance: if you had access to two functions q(.,.) and r(.,.) that claim to be kernels and you wanted the product kernel you would, when asked to evaluate the kernel, just get the results for the two sub-kernels and multiply them (so you never need to see the space phi(.) is implicitly working in).</p>
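<p>In code the sub-routine view looks something like the following sketch.  The combinator names (add, scale, multiply, finish) are hypothetical, and nothing here checks that the inputs really are kernels; it only shows that the closure rules can be applied without ever writing down phi(.).</p>

```python
import math

def linear(u, v):
    """The ordinary inner product, used here as the base kernel."""
    return sum(ui * vi for ui, vi in zip(u, v))

def add(q, r):
    """Sum of two kernels."""
    return lambda u, v: q(u, v) + r(u, v)

def scale(c, q):
    """c * q(u,v), a kernel when c is non-negative."""
    if not c >= 0:
        raise ValueError("c must be non-negative")
    return lambda u, v: c * q(u, v)

def multiply(q, r):
    """Product of two kernels: evaluate both and multiply the results."""
    return lambda u, v: q(u, v) * r(u, v)

def finish(f, q):
    """Apply f(.) to a kernel; the result is a kernel when f has a convergent
    Taylor series with non-negative coefficients (for example exp)."""
    return lambda u, v: f(q(u, v))

# (<u,v> + 1)^2 and exp(<u,v>) built purely by composition.
constant_one = lambda u, v: 1.0
shifted = add(linear, constant_one)
squared = multiply(shifted, shifted)
exp_kernel = finish(math.exp, linear)

u, v = (1.0, 2.0), (0.5, -1.0)
print(squared(u, v), exp_kernel(u, v))
```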
<p>This implicitness can be important (though far too much is made of it).  For example the Gaussians we have been using throughout are kernels, but kernels of infinite dimension, so we can not directly represent the space they live in.  To see the Gaussian is a kernel notice the following:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" alt="926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" border="0"  height="20" /> .<br />
</center> </p>
<p>And for all c &ge; 0 each of the three terms on the right is a kernel (the first because the Taylor series of exp() is absolutely convergent with non-negative coefficients and the second two are the f(u) f(v) form we listed as obvious kernels).  The trick is magic, but the idea is to use the fact that Euclidean squared distance breaks nicely into dot-products ( ||u-v||^2 = &lt; u, u &gt; + &lt; v, v &gt; - 2 &lt; u , v &gt;) and exp() converts addition to multiplication.  It is rather remarkable that kernels (which are a generalization of inner products that induce half open spaces) can encode bounded concepts (more on this later when we discuss the VC dimension of the Gaussian).</p>
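<p>The factorization is easy to confirm numerically.  The sketch below assumes the Gaussian is written exp(-c ||u-v||^2) and splits it as exp(2c&lt;u,v&gt;) exp(-c&lt;u,u&gt;) exp(-c&lt;v,v&gt;), which is one way to write three terms matching the description above; the function names and sample values are ours.</p>

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def gaussian(u, v, c):
    """exp(-c * ||u-v||^2) computed directly from coordinates."""
    return math.exp(-c * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def factored(u, v, c):
    """The same value as a product of three terms built from inner products."""
    return (math.exp(2 * c * dot(u, v))
            * math.exp(-c * dot(u, u))
            * math.exp(-c * dot(v, v)))

# Arbitrary sample values for illustration.
u, v, c = (0.3, -1.2), (1.1, 0.4), 3.0
print(gaussian(u, v, c), factored(u, v, c))
```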
<h2>Back to Support Vector Machines</h2>
<p>We can now restate what a support vector machine is: it is a method for picking data weights so that the modeling function for a given kernel has a maximum margin.  The margin is the minimal distance between a training example and the model&#8217;s boundary between blue and red.  Notice the 4 support vectors in figure 10 are not &#8220;tight up against the boundary,&#8221; but are all the same (minimal) distance from the boundary.  This distance is called the margin and the area near the boundary is obviously a place the model has uncertainty (so it makes sense to keep the boundary as far as possible from the training data).  The great benefit of the support vector machine is that with access only to the data labels and the kernel function (in fact only the kernel function evaluated at pairs of training datums) the support vector machine can quickly solve for the optimal margin and data weights achieving this margin.</p>
<p>For fun let&#8217;s plug in a new kernel called the cosine kernel:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" alt="328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" border="0"  height="43" /><br />
</center><br />
(c &ge; 0).</p>
<p>This kernel can be thought of as having a phi(.) function that takes a vector z and adds an extra coordinate of sqrt(c) and then projects the resulting vector onto the unit sphere.  This kernel induces concepts that look like parabolas, as seen in figure 11.</p>
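<p>Here is a small sketch of that phi(.) description in Python.  The closed form in cosine_kernel is what we get by expanding the inner product of the two projected vectors; it is our reading of the description rather than a transcription of the formula, and it assumes c &gt; 0 so the extended vector is never zero.</p>

```python
import math

def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(z, c):
    """Append sqrt(c) as an extra coordinate, then scale to unit length."""
    extended = list(z) + [math.sqrt(c)]
    norm = math.sqrt(dot(extended, extended))
    return [x / norm for x in extended]

def cosine_kernel(u, v, c):
    """Closed form obtained by expanding dot(phi(u, c), phi(v, c))."""
    return (dot(u, v) + c) / math.sqrt((dot(u, u) + c) * (dot(v, v) + c))

# Arbitrary sample values for illustration.
u, v, c = (1.0, 2.0), (-0.5, 3.0), 1.0
print(cosine_kernel(u, v, c), dot(phi(u, c), phi(v, c)))
```

<p>Since phi(.) lands on the unit sphere, the kernel of any point with itself is 1 and the kernel value is the cosine of the angle between the two extended vectors.</p>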
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelExampleConcept.png" alt="CosineKernelExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 11: Cosine example concept</p>
<p></center></p>
<p>Figure 12 shows the sum-concept (add all discount functions with same weight) model. In this case it is far too broad: averaging the cosine concepts (which are not bounded concepts like the Gaussians) creates an overly wide model.  The model not only generalizes poorly (fails to match the shape of the truth diagram), it also gets some of its own training data wrong.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelKernelSum1.png" alt="CosineKernelKernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 12: Cosine kernel sum model</p>
<p></center></p>
<p>And figure 13 shows the excellent fit returned by the support vector machine, a fit determined by only 4 support vectors (indicated by larger labels).  Also notice the support vector machine using unbounded concepts generated a very good unbounded model (unbounded in the sense that both the blue and red regions are infinite).  By changing the kernel or discount functions/concepts we changed the inductive bias.  So any knowledge of what sort of model we want (one class bounded or not) should greatly influence our choice of kernel functions (since the support vector machine can only pick weights for the kernel functions, not tune their shapes or bandwidths).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelSVM.png" alt="CosineKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 13: Cosine Support Vector model</p>
<p></center></p>
<p>Another family of kernels (which I consider afflictions) are the &#8220;something for nothing&#8221; kernels.  These are kernels of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" alt="C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" border="0" height="20" /><br />
</center><br />
or higher powers or other finishing functions.  The idea is that if you were to look at the expansion of these kernels you would see lots of higher order terms (powers of u_i and v_j) and if these terms were available to the support vector machine it could use them.  It is further claimed that in addition to getting new features for free you don&#8217;t use up degrees of freedom exploiting them (so you cross-validate as well as for simpler models).   Both these claims are fallacious: you can&#8217;t fully use the higher order terms because they are entangled with other terms that are not orthogonal to the outcome, and the complexity of a kernel is not quite as simple as degrees of freedom (proofs about support vector machines are stated in terms of margin, not in terms of degrees of freedom or even in terms of VC dimension).  The optimizer in the SVM does try hard to make a good hypothesis using the higher order terms, but due to the forced relations among the terms you get best fits like figure 14.</p>
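<p>To see where the higher order terms come from, the sketch below expands one such kernel.  We assume the form (&lt;u,v&gt; + 1)^2 on 2-vectors purely for illustration (the kernel behind figure 14 may differ), and the names squared_kernel and phi are ours.  Note the fixed sqrt(2) scalings: the kernel dictates the relative sizes of the features and the support vector machine has no way to adjust them.</p>

```python
import math

def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def squared_kernel(u, v):
    """Assumed form of a squared kernel: (<u,v> + 1)^2."""
    return (dot(u, v) + 1.0) ** 2

def phi(z):
    """Explicit features whose inner product reproduces squared_kernel
    on 2-vectors: x*x, y*y, x*y, x, y and a constant, with fixed scalings."""
    x, y = z
    s = math.sqrt(2.0)
    return [x * x, y * y, s * x * y, s * x, s * y, 1.0]

# Arbitrary sample values for illustration.
u, v = (1.0, 2.0), (0.5, -1.0)
print(squared_kernel(u, v), dot(phi(u), phi(v)))
```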
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/SquareKernelSVM.png" alt="SquareKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 14: Squared magic kernel support vector model</p>
<p></center></p>
<p>If you want higher order terms I feel you are much better off performing a primal transform so the terms are available in their most general form.   That is, re-encode the vector u = (x,y) as the larger vector (x,y,x*y,x*x,y*y).  You have to be honest: if you are trying to fit more complicated functions you are searching a larger hypothesis space so you need more data to falsify the bad hypotheses.   You can be efficient (don&#8217;t add terms you don&#8217;t think you can use as they make generalization harder) but you can&#8217;t get something for nothing (even with kernel methods).</p>
<h2>Back to margin</h2>
<p>Large margin (having a large distance between the decision boundary and all training examples) is a good idea.  At the simplest level it means anything close to a training data point gets classified correctly (so the model is immune to an amount of fuzz proportional to the margin width).  </p>
<p>We get the following remarkable generalization theorem (this one is the hard SVM version; we weaken the result a bit to make it easier to state).  Suppose we train a model using kernel k(.,.) on m training examples u(1),&#8230;,u(m) with training labels y(1)&#8230;y(m).  Further assume the hidden true concept is separable with respect to our kernel and data distribution with a margin of at least w (that is for any sample we draw we can find a SVM model that classifies all of the training data correctly and sees a margin of at least w).  Or more simply: assume w is behaving like a constant with respect to m (i.e. it is not going down as we increase sample size). Then with probability at least 1-d the generalization error (that is error seen on new test points not seen during training, but drawn from the same distribution as the training examples) is no more than:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" alt="1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" border="0" height="65" /><br />
.<br />
</center><br />
(we take this from Theorem 7.22 of &#8220;Kernel Methods for Pattern Analysis,&#8221; page 215).</p>
<p>If we further assume the expected value k(u,u) under the training distribution exists (i.e. we don&#8217;t have too many examples of very high kernel norm according to the distribution we are training with respect to) and is stable (not taking a lot of value from rare instances) then we can read off the meaning of this upper bound.  Assuming we can always build a model from the training data of margin at least w then the generalization error (excess error we expect to see on classifying similar points) is falling at a rate inversely proportional to the square root of the amount of training data we have.  This is in one sense amazing: we have a bound on how well we are going to do on data we have not seen.  And this bound is also in some sense &#8220;best possible&#8221; because we need around this much data to even confirm such an accuracy (see <a target="_blank" href="http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/">What is a large enough random sample?</a> ).  Also notice how the dimension of the implicit feature space (which can in fact be infinite) does not enter into the bound.</p>
<p>However, the bound is also in some sense to be expected.  It depends on the following strong assumptions (usually not remembered or stated):</p>
<ol>
<li>Test and training data must be generated in the same manner (i.e. come from the same distribution, this is usually stated but not obeyed).</li>
<li>We must not have a collapsing margin as the number of data points goes up (so there must actually be a moat between the positive and negative examples, such as portrayed in figure 1, where the blurry boundary was a region we never generated training or test examples from).</li>
<li>The expected value of k(u,u) must be bounded for examples drawn from our training distribution.  For example, this is not true even for the identity kernel if the examples are just numbers drawn from a <a target="_blank" href="http://en.wikipedia.org/wiki/Cauchy_distribution">Cauchy distribution</a>.</li>
</ol>
<p>Also it is a <em>posterior</em> estimate of expected generalization error.  That is, we only get a bound after we plug in some facts found from fitting a model to the training data (the margin and the expected value of k(u,u)).  It is not like the famous <em>prior</em> bounds based on VC dimension that depend only on the structure of the kernel k(.,.) and not the distribution of training data or labels (beyond the concept being expressible by the kernel, which is itself a <em>posterior</em> fact).  The thing to remember for all posterior estimates is that you can get such estimates by cross validation, at a small expense in statistical efficiency (i.e. you waste some data) and a moderate computational blow-up (and with essentially no math and many fewer assumptions).  In fact the cross validation and hold-out estimates may be much tighter than calculated bounds (and are certainly easier to explain).</p>
<p>When we say we weakened the result we mean we have added some stated conditions that are not strictly needed for the bound to be true.  One of these assumptions is that the margin is not collapsing (the bound remains true even for a collapsing margin; it is just not useful).  We added this assumption because most readings of the bound (or presumed applications) implicitly assume the bound remains useful (which means something near our additional assumptions must also hold).  The reader is really lured to think of the margin w as a constant depending only on the kernel when in fact it depends on the kernel, data distribution and number of training examples.  For example: consider a very simple one dimensional classification problem:  learning the concept x &ge; 0 from labeled training data drawn uniformly from the interval [-0.5,0.5] using a Gaussian kernel.  The support vector models are going to be essentially picking two neighboring points in the observed training sample that have opposite labels and saying the concept boundary is between them.  In this simple case the margin is collapsing: as m (the number of training examples) increases, the expected margin width is O(1/m).  This renders the bound calculated above useless (drives it above 1 as the number of training examples increases).  The model is good but the bound is not useful.  This is because the training distribution is discovering points closer and closer to the concept boundary (a necessary discontinuity) as the number of training examples grows.  This is why we had to add the additional (weakening) assumption that not only do we need to assume the concept to be learned is separable with respect to the kernel we are using, but we must further assume the data distribution (determining where test and training data come from) stays some minimal distance away from the decision surface (i.e. together the data distribution and underlying concept have a moat, as in our figure 1 example).  
We will phrase this as &#8220;the concept is wide margin separable with respect to the training distribution&#8221; (implicitly we need to know the kernel, but it is the training distribution that is critical). This is a very strong assumption (not true for all classification problems) but is often true when the classification task is to &#8220;un-mix a mixture&#8221; (the observed data is coming from two different sources and the training labels are which source they come from; in this case you often do have a margin).  Obviously a soft-margin SVM deals better with this (some of the issue is due to our use of hard-margin), but it is an interesting exercise to think why you would expect your learned model to have a large margin if your underlying concept and training data do not.</p>
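<p>The collapsing margin in the one dimensional example can be checked with a small simulation.  This is only a sketch under stated assumptions: it measures the gap in the input space between the closest negative and closest non-negative training points as a stand-in for the margin (the kernel-space margin is not computed), and the function names <code>boundary_gap</code> and <code>mean_gap</code> are ours, introduced for illustration.</p>

```python
import random


def boundary_gap(m, rng):
    """Gap between the closest negative and closest non-negative training point.

    Points are drawn uniformly from [-0.5, 0.5] and labeled by the concept x >= 0.
    Returns None in the rare case that one of the two classes is empty.
    """
    xs = [rng.uniform(-0.5, 0.5) for _ in range(m)]
    negatives = [x for x in xs if x < 0]
    positives = [x for x in xs if x >= 0]
    if not negatives or not positives:
        return None
    return min(positives) - max(negatives)


def mean_gap(m, trials=500, seed=0):
    """Average boundary gap over repeated training samples of size m."""
    rng = random.Random(seed)
    gaps = [boundary_gap(m, rng) for _ in range(trials)]
    gaps = [g for g in gaps if g is not None]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    for m in (10, 100, 1000):
        g = mean_gap(m)
        # m * g stays roughly constant, so the gap shrinks like O(1/m)
        print(m, g, m * g)
```

<p>The product of m and the average gap stays roughly constant as m grows, which is the O(1/m) collapse described above.</p>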
<p>The result remains important: support vector machines are in some sense an optimal learning procedure (up to some constants they achieve as tight a generalization error as can even be confirmed for a given training set size).  Another strong point of the support vector result is the result applies even for kernels with unbounded VC dimension (such as the Gaussian kernel).</p>
<h2>The Gaussian (or radial) kernel again</h2>
<p>We said the Gaussian kernel was of infinite VC dimension.  But we have not shown why it is true.  The fact that the only encoding we could think of off the top of our head was a power series or a limit does not guarantee there isn&#8217;t some more clever encoding that easily demonstrates that the Gaussian kernel is of bounded VC dimension (like the identity kernel and cosine kernels are). </p>
<p>However we can see the Gaussian kernel can not have finite VC dimension because it can place an arbitrary number of bounded patches in the plane.  Thus we can build arbitrarily large sets (by placing all points far apart) that can be shattered (thus falsifying any bounded VC dimension).</p>
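<p>A minimal sketch of the shattering argument, under assumptions of our own: we take the Gaussian kernel in the form exp(-|u-v|<sup>2</sup>/(2 s<sup>2</sup>)) with bandwidth s = 1, place the points 10 units apart, and use the labels themselves as the dual weights (no SVM is trained).  Because far-apart points barely interact, this naive kernel sum already reproduces every possible labeling.</p>

```python
import itertools
import math


def gaussian_kernel(u, v, bandwidth=1.0):
    """Gaussian (radial) kernel; the bandwidth parameterization is an assumption."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))


def classify(x, points, labels):
    """Sign of the kernel sum, using each label (+1 or -1) as that point's weight."""
    score = sum(y * gaussian_kernel(p, x) for p, y in zip(points, labels))
    return 1 if score >= 0 else -1


def is_shattered(points):
    """True if every +1/-1 labeling of the points is reproduced by the kernel sum."""
    for labels in itertools.product((-1, 1), repeat=len(points)):
        if any(classify(p, points, labels) != y for p, y in zip(points, labels)):
            return False
    return True


# Eight points in the plane, spaced far apart relative to the bandwidth.
points = [(10.0 * i, 0.0) for i in range(8)]
print(is_shattered(points))
```

<p>The same construction works for any number of points, as long as they are spread far enough apart, so no finite VC dimension can hold.</p>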
<p>One of the wonders of the support vector method is that the theory works even in this situation.</p>
<h2>Back to SVM</h2>
<p>Support vector machines seem nearly magical in their power to pick &#8220;best weights.&#8221;  However, as we have seen, these best weights may not always be much better than typical good weights (for instance: using all uniform weights).  Also it is very important to remember the support vector machine is picking the best weighting of a sum of already picked kernels.  It is not picking the best kernel shape or adjusting things like bandwidth (you must pick the best kernel ahead of time or try several kernels to find a good one).  You must remember support vector machines can only directly adjust dual weights (or relative training data weighting).  This can effect some changes in hypothesis shape, but the support vector machine cannot directly implement arbitrary changes in hypothesis shape like a nearest neighbor or a parameterized primal method can (like logistic regression with the kernel trick of adding interaction variables). </p>
<p>Support vector machines are to be much preferred to other combiners like <a target="_blank" href="http://en.wikipedia.org/wiki/Boosting">Boosting</a>.  This is because support vector machines have explicit stopping criteria (conditions known to be true at the optimal solution that characterize the optimal solution) and explicit regularization (control of overfitting).  Boosting relies on a path argument (&#8220;do the computation in these stages&#8221;) so it is much harder to reason about, and frankly the folklore that &#8220;early stop prevents overfitting in boosting&#8221; is just false (you are much better off explicitly regularizing).</p>
<p>It is also lore that random kernel transformations (like power series over the cosine kernel) are magic.  They can make data separable but they are unlikely to meet the (unstated but critical) conditions of the support vector theorem (margin being wide, margin not filling in and expectation of k(.,.) being small).</p>
<h3>A comment on optimization</h3>
<p>In this writeup we have skipped the most beautiful part of support vector machines: the form of the optimization problem used to solve them.  We strongly suggest consulting John Shawe-Taylor and Nello Cristianini&#8217;s &#8220;Kernel Methods for Pattern Analysis&#8221;, Cambridge 2004 for this.  The writing is dense but accurate.  You can literally paste their formulas into a general solver and they work.</p>
<h2>Conclusion</h2>
<p>We have shown how kernel methods and support vector machines fit in a sequence of machine learning methods:</p>
<ol>
<li>nearest neighbor models</li>
<li>uniform sum models</li>
<li>support vector machines.</li>
</ol>
<p>We also showed a number of examples of kernels and non-kernels.  It is good to be able to quickly remember the constant &#8220;-1&#8221; is not a kernel and the constant &#8220;1&#8221; is a kernel (though not a very useful one).</p>
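<p>One quick way to remember the constant examples, sketched under the standard assumption that a kernel must give a non-negative quadratic form for every choice of points and real weights (this is the usual positive semi-definiteness condition; the helper name <code>quadratic_form</code> is ours):</p>

```python
def quadratic_form(kernel, points, weights):
    """Sum over all pairs of a_i * a_j * k(x_i, x_j)."""
    return sum(a * b * kernel(u, v)
               for a, u in zip(weights, points)
               for b, v in zip(weights, points))


def minus_one(u, v):
    return -1.0


def plus_one(u, v):
    return 1.0


points = [0.0, 1.0, 2.5]     # arbitrary illustrative points
weights = [1.0, -0.5, 2.0]   # arbitrary weights, summing to 2.5

# For a constant kernel c the form is c * (sum of weights)^2.
print(quadratic_form(minus_one, points, weights))  # -6.25, negative: not a kernel
print(quadratic_form(plus_one, points, weights))   # 6.25, never negative
```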
<p>It is our argument that support vector machines are &#8220;dual methods&#8221; (as they work in weights over the data instead of weights over coordinates in the features space) and the move from the primal algorithm Naive Bayes to Logistic regression is similar to the move from uniform kernel sums to support vector machines.</p>
<p>Also, if you want to directly manipulate shape (like picking bandwidth) you must in some sense use a primal method (this isn&#8217;t what support vector machines are for).</p>
<p>We end with: If you have some feeling of both what a method can and can not do then you have some understanding of the method.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/' rel='bookmark' title='Do your tools support production or complexity?'>Do your tools support production or complexity?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The equivalence of logistic regression and maximum entropy models</title>
		<link>http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-equivalence-of-logistic-regression-and-maximum-entropy-models</link>
		<comments>http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/#comments</comments>
		<pubDate>Fri, 23 Sep 2011 16:21:09 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Calculus of Variations]]></category>
		<category><![CDATA[log-likelihood]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Max-Ent]]></category>
		<category><![CDATA[Maximum Entropy]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1753</guid>
		<description><![CDATA[Nina Zumel recently gave a very clear explanation of logistic regression ( The Simpler Derivation of Logistic Regression ). In particular she called out the central role of log-odds ratios and demonstrated how the &#8220;deviance&#8221; (that mysterious quantity reported by fitting packages) is both a term in &#8220;the pseudo-R^2&#8221; (so directly measures goodness of fit) [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Nina Zumel recently gave a very clear explanation of logistic regression ( <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/">The Simpler Derivation of Logistic Regression</a> ).  In particular she called out the central role of log-odds ratios and demonstrated how the &#8220;deviance&#8221; (that mysterious quantity reported by fitting packages) is both a term in &#8220;the pseudo-R^2&#8221; (so directly measures goodness of fit) and is the quantity that is actually optimized during the fitting procedure.  One great point of the writeup was how simple everything is once you start thinking in terms of derivatives (and that it isn&#8217;t so much the functional form of the sigmoid that is special but its relation to its own derivative that is special).</p>
<p>We adapt these presentation ideas to make explicit the well known equivalence of logistic regression and maximum entropy models.<span id="more-1753"></span> In our new writeup, <a target="_blank" href="http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf">The equivalence of logistic regression and maximum entropy models</a>, we move to multi-category modeling and demonstrate how one invents something as remarkable as logistic regression.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Simpler Derivation of Logistic Regression</title>
		<link>http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-simpler-derivation-of-logistic-regression</link>
		<comments>http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/#comments</comments>
		<pubDate>Wed, 14 Sep 2011 15:36:37 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[likelihood]]></category>
		<category><![CDATA[log-likelihood]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[newton's method]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1740</guid>
		<description><![CDATA[Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) interval); furthermore, those probabilities are well-calibrated when compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression preserves the marginal probabilities of the training data. The coefficients of the model also provide some hint of the relative importance of each input variable.</p>
<p> While you don&#8217;t have to know how to derive logistic regression or how to implement it in order to use it, the details of its derivation give important insights into interpreting and troubleshooting the resulting models. Unfortunately, most derivations (like the ones in [Agresti, 1990] or [Hastie, et.al, 2009]) are too terse for easy comprehension. Here, we give a derivation that is less terse (and less general than Agresti&#8217;s), and we&#8217;ll take the time to point out some details and useful facts that sometimes get lost in the discussion.<span id="more-1740"></span><br/><br/>To make the discussion easier, we will focus on the binary response case. We assume that the case of interest (or &#8220;true&#8221;) is coded to 1, and the alternative case (or &#8220;false&#8221;) is coded to 0.</p>
<p>The logistic regression model assumes that the log-odds of an observation <em>y</em> can be expressed as a linear function of the K input variables <strong>x</strong>:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/B437C723-50ED-4B7B-97B6-56D36C66CD3C.jpg" alt="B437C723-50ED-4B7B-97B6-56D36C66CD3C.jpg" border="0" width="189" /></div>
<p>Here, we include the constant term <em>b<sub>0</sub></em> by setting <em>x<sub>0</sub></em> = 1. This gives us K+1 parameters. The left hand side of the above equation is called the <em>logit</em> of P (hence the name logistic regression). </p>
<p>Let&#8217;s exponentiate both sides of the logit equation.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/B4F430F4-5B17-4151-B7B9-AD0282034582.jpg" alt="B4F430F4-5B17-4151-B7B9-AD0282034582.jpg" border="0" width="200" /></div>
<p>This immediately tells us that logistic models are multiplicative in their inputs (rather than additive, like a linear model), and it gives us a way to interpret the coefficients. The value exp(<em>b<sub>j</sub></em>) is the factor by which the odds of the response being &#8220;true&#8221; are multiplied when <em>x<sub>j</sub></em> increases by one unit, all other things being equal. For example, suppose <em>b<sub>j</sub></em> = 0.693. Then exp(<em>b<sub>j</sub></em>) &#8776; 2. If <em>x<sub>j</sub></em> is a numerical variable (say, age in years), then every year&#8217;s increase in age doubles the odds of the response being true, all other things being equal. If <em>x<sub>j</sub></em> is a binary variable (say, sex, with female coded as 1 and male as 0), then the odds of the response being true for a female subject are twice the odds for a male subject, all other things being equal. Note that doubling the odds is not the same as doubling the probability.</p>
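<p>To make the odds interpretation concrete, here is a small sketch (the helper name <code>shifted_probability</code> and the baseline probabilities are ours, chosen for illustration) showing what a coefficient of 0.693 does to a few baseline probabilities when the input increases by one unit:</p>

```python
import math


def shifted_probability(base_probability, coefficient):
    """Probability after a one-unit increase in an input with this coefficient."""
    base_odds = base_probability / (1.0 - base_probability)
    new_odds = base_odds * math.exp(coefficient)  # the odds are multiplied
    return new_odds / (1.0 + new_odds)


b_j = 0.693  # exp(0.693) is approximately 2
for base_probability in (0.1, 0.5, 0.9):
    print(base_probability, round(shifted_probability(base_probability, b_j), 3))
```

<p>The odds double in every case, but the probabilities move from 0.1 to about 0.18, from 0.5 to about 0.67, and from 0.9 to about 0.95.</p>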
<p>We can also invert the logit equation to get a new expression for P(x):</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/9512CE26-F521-4F55-A9C7-2F951C8CD791.jpg" alt="9512CE26-F521-4F55-A9C7-2F951C8CD791.jpg" border="0" width="141" /></div>
<p><br/>The right hand side of the top equation is the sigmoid of <em>z</em>, which maps the real line to the interval (0, 1), and is approximately linear near the origin. A useful fact about P(<em>z</em>) is that the derivative P&#8217;(<em>z</em>) = P(<em>z</em>) (1 &#8211; P(<em>z</em>)). Here&#8217;s the derivation:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/B2F9E40A-5775-47FE-B582-28B62F19BE99.jpg" alt="B2F9E40A-5775-47FE-B582-28B62F19BE99.jpg" border="0" width="600" /></div>
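<p>The identity can also be sanity-checked numerically.  The sketch below compares a central finite difference of the sigmoid against P(<em>z</em>) (1 &#8211; P(<em>z</em>)) at a few arbitrary points; the step size is our choice.</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def numerical_derivative(f, z, h=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-3.0, -0.5, 0.0, 2.0):
    p = sigmoid(z)
    # The two printed columns should agree to many decimal places.
    print(z, numerical_derivative(sigmoid, z), p * (1.0 - p))
```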
<p>Later, we will want to take the gradient of P with respect to the set of coefficients <strong>b</strong>, rather than <em>z</em>. In that case, by the chain rule, P&#8217;(<em>z</em>) = P(<em>z</em>) (1 &#8211; P(<em>z</em>)) <em>z</em>&#8217;, where &#8217; now denotes the gradient taken with respect to <strong>b</strong>.</p>
<p>The solution to a Logistic Regression problem is the set of parameters <strong>b</strong> that maximizes the likelihood of the data, which is expressed as the product of the predicted probabilities of the N individual observations.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/2FC8A9C0-D435-459A-8D58-5B183EF952F5.jpg" alt="2FC8A9C0-D435-459A-8D58-5B183EF952F5.jpg" border="0" width="340" /></div>
<p>(<em>X, y</em>) is the set of observations; <em>X</em> is a K+1 by N matrix of inputs, where each column corresponds to an observation, and the first row is <strong>1</strong>; <em>y</em> is an N-dimensional vector of responses; and (<strong>x</strong><sub>i</sub>, <em>y<sub>i</sub></em>) are the individual observations.</p>
<p>It&#8217;s generally easier to work with the log of this expression, known (of course) as the log-likelihood.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/B4A3F94A-D4B5-4391-ABFD-C872839F3CFC.jpg" alt="B4A3F94A-D4B5-4391-ABFD-C872839F3CFC.jpg" border="0" width="412" /></div>
<p>Maximizing the log-likelihood will maximize the likelihood. As a side note, the quantity −2*log-likelihood is called the <em>deviance</em> of the model. It is analogous to the residual sum of squares (RSS) of a linear model. Ordinary least squares minimizes RSS; logistic regression minimizes deviance. A useful goodness-of-fit heuristic for a logistic regression model is to compare the deviance of the model with the so-called <em>null deviance</em>: the deviance of the constant model that returns only the global response probability for every data point. One minus the ratio of deviance to null deviance is sometimes called <em>pseudo-R<sup>2</sup></em>, and is used the way one would use R<sup>2</sup> to evaluate a linear model.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/E5A7630A-2EA5-47A3-9A8A-10BF46B64F6A.jpg" alt="E5A7630A-2EA5-47A3-9A8A-10BF46B64F6A.jpg" border="0" width="241" /></div>
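<p>As a sketch of these quantities (the responses and predicted probabilities below are made up for illustration, and we assume both classes are present and every predicted probability lies strictly between 0 and 1):</p>

```python
import math


def log_likelihood(y, p):
    """Sum of log predicted probabilities of the observed responses."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def deviance(y, p):
    return -2.0 * log_likelihood(y, p)


def pseudo_r_squared(y, p):
    """One minus the ratio of model deviance to null deviance."""
    rate = sum(y) / len(y)  # global response probability (the null model)
    null_deviance = deviance(y, [rate] * len(y))
    return 1.0 - deviance(y, p) / null_deviance


y = [1, 0, 1, 1, 0, 0]              # hypothetical observed responses
p = [0.8, 0.3, 0.7, 0.9, 0.2, 0.4]  # hypothetical model probabilities
print(deviance(y, p), pseudo_r_squared(y, p))
```

<p>A model that does no better than the global response rate gets a pseudo-R<sup>2</sup> of zero; a model that assigns probability near 1 to every observed response approaches a pseudo-R<sup>2</sup> of one.</p>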
<p><br/>Traditional derivations of Logistic Regression tend to start by substituting the logit function directly into the log-likelihood equations, and expanding from there. The derivation is much simpler if we don&#8217;t plug the logit function in immediately. To maximize the log-likelihood, we take its gradient with respect to <strong>b</strong>:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/B2673255-4B13-4FF1-88F8-ACF004F28518.jpg" alt="B2673255-4B13-4FF1-88F8-ACF004F28518.jpg" border="0" width="260" /></div>
<p>where P<sub>i</sub> is shorthand for P(<strong>x</strong><sub>i</sub>). The maximum occurs where the gradient is zero.</p>
<p>We can expand this equation further, when we remember that P&#8217; = P(1-P):</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/C6B59E85-25AB-4537-A1C1-3664A98EEE00.jpg" alt="C6B59E85-25AB-4537-A1C1-3664A98EEE00.jpg" border="0" width="352" /></div>
<p>The last line merges the two cases (<em>y<sub>i</sub></em> = 1 and <em>y<sub>i</sub></em> = 0) into a single sum. We can now cancel terms and set the gradient to zero. This gives us the set of simultaneous equations that are true at the optimum:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/9D60BD95-E4E0-40AF-9350-01DA19EFE845.jpg" alt="9D60BD95-E4E0-40AF-9350-01DA19EFE845.jpg" border="0" width="150" /></div>
<p>Notice that the equations to be solved are in terms of the probabilities P (which are a function of <strong>b</strong>), not directly in terms of the coefficients <strong>b</strong> themselves. This means that logistic models are coordinate-free: for a given set of input variables, the probabilities returned by the model will be the same even if the variables are shifted, combined, or rescaled. Only the values of the coefficients will change.</p>
<p>The other thing to notice from the above equations is that the sum of probability mass across each coordinate of the <strong>x</strong><sub>i</sub> vectors is equal to the count of observations with that coordinate value for which the response was true. For example, suppose the jth input variable is 1 if the subject is female, 0 if the subject is male. Then </p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/2357AF36-1D0E-4EF1-88A9-8694220B62E4.jpg" alt="2357AF36-1D0E-4EF1-88A9-8694220B62E4.jpg" border="0" width="211" /></div>
<p>In other words, the summed probability mass for the female subjects equals the count of female subjects with the response &#8220;true&#8221;. It is also true that the sum of all the probability mass over the entire training set will equal the number of &#8220;true&#8221; responses in the training set. This is what we mean when we say that logistic regression preserves the marginal probabilities of the training data.</p>
<p><strong>Solving for the Coefficients</strong></p>
<p>The most straightforward way to solve for the coefficients <strong>b</strong> is Newton&#8217;s method. The Fisher scoring method that is used in most off-the-shelf implementations is a more general variation of Newton&#8217;s method; it works on the same principles. We will describe solving for the coefficients using Newton&#8217;s method.</p>
<p>Suppose you have a vector-valued function <strong>f</strong>: <strong>y = f(b)</strong>.  You want to find the value <strong>b</strong><sub>opt</sub> such that <strong>f</strong>(<strong>b</strong><sub>opt</sub>) = <strong>0</strong>. Assuming that we start with an initial guess <strong>b</strong><sub>0</sub>, we can take the Taylor expansion of <strong>f</strong> around <strong>b</strong><sub>0</sub>:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/DCAFCF7F-1991-4005-8CAD-39F5002B9C6C.jpg" alt="DCAFCF7F-1991-4005-8CAD-39F5002B9C6C.jpg" border="0" width="237" /></div>
<p>Here, <strong>f</strong>&#8217; is a matrix; it is the Jacobian of first derivatives of <strong>f</strong> with respect to <strong>b</strong>. Setting the left hand side to zero, we can solve for &#916; as </p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/7447A740-4D4A-4BC5-AC66-45D9CF2665DD.jpg" alt="7447A740-4D4A-4BC5-AC66-45D9CF2665DD.jpg" border="0" width="176" /></div>
<p>We then update our estimate for <strong>b</strong>:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/6D54A73B-69F1-4B5B-9103-80AC40A0D08D.jpg" alt="6D54A73B-69F1-4B5B-9103-80AC40A0D08D.jpg" border="0" width="108" /></div>
<p>and iterate until convergence.</p>
<p>In our case, <strong>f</strong> is the gradient of the log-likelihood, and its Jacobian is the Hessian (the matrix of second derivatives) of the log-likelihood function.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/F52762C7-618B-45AC-8396-01A228D0A54E.jpg" alt="F52762C7-618B-45AC-8396-01A228D0A54E.jpg" border="0" width="203" /></div>
<p>where <strong>W</strong> is a diagonal matrix of the derivatives P&#8217;<sub>i</sub> = P<sub>i</sub> (1 &#8211; P<sub>i</sub>), and the ith column of <strong>X</strong> corresponds to the ith observation. So we can solve for &#916; at each iteration as</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/16E1E7D7-1EEF-43A6-BA5C-B77CBA2E2EED.jpg" alt="16E1E7D7-1EEF-43A6-BA5C-B77CBA2E2EED.jpg" border="0" width="239" /></div>
<p>where <strong>W</strong> is the current matrix of derivatives, <strong>y</strong> is the vector of observed responses, and <strong>P</strong><sub>k</sub> is the vector of probabilities as calculated by the current estimate of <strong>b</strong>.</p>
<p>Compare this to the solution of a linear regression:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/2B73A3FE-3667-4BCF-BC67-B5A701014785.jpg" alt="2B73A3FE-3667-4BCF-BC67-B5A701014785.jpg" border="0" width="152"/></div>
<p>Comparing the two, we can see that at each iteration, &#916; is the solution of a weighted least squares problem, where the &#8220;response&#8221; is the difference between the observed response and its current estimated probability of being true. This is why the technique for solving logistic regression problems is sometimes referred to as <em>iteratively re-weighted least squares</em>. Generally, the method does not take long to converge (about 6 or so iterations).</p>
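<p>The iteration can be sketched in a few lines of plain Python.  This is a minimal illustration under assumptions of our own, not the implementation used by any package: observations are stored as rows (the transpose of the <strong>X</strong> used above), the data set is made up, there is no step-size control or regularization, and the small linear solver is written out only to keep the example self-contained.</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def predict(rows, b):
    return [sigmoid(sum(bj * xj for bj, xj in zip(b, x))) for x in rows]


def fit_logistic(rows, y, iterations=25, tol=1e-10):
    """Newton's method; each row is an observation starting with the constant 1."""
    k = len(rows[0])
    b = [0.0] * k
    for _ in range(iterations):
        p = predict(rows, b)
        # Gradient of the log-likelihood: sum over observations of x_i (y_i - P_i)
        grad = [sum(x[j] * (yi - pi) for x, yi, pi in zip(rows, y, p))
                for j in range(k)]
        # Weighted cross-product matrix with weights P_i (1 - P_i)
        xwx = [[sum(x[i] * x[j] * pi * (1.0 - pi) for x, pi in zip(rows, p))
                for j in range(k)] for i in range(k)]
        delta = solve(xwx, grad)
        b = [bj + dj for bj, dj in zip(b, delta)]
        if max(abs(d) for d in delta) < tol:
            break
    return b


# Made-up data: columns are (constant, female indicator, age).
rows = [[1.0, 1.0, float(age)] for age in range(1, 6)] + \
       [[1.0, 0.0, float(age)] for age in range(1, 6)]
y = [1, 0, 1, 1, 1, 0, 0, 1, 0, 1]

b = fit_logistic(rows, y)
p = predict(rows, b)
print(b)
print(sum(p), sum(y))  # total probability mass matches the count of "true"
print(sum(pi for x, pi in zip(rows, p) if x[1] == 1.0))  # matches true females
```

<p>At convergence the gradient is zero, so the summed probabilities match the observed counts: this is the marginal-preservation property derived earlier, seen numerically.</p>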
<p>Thinking of logistic regression as a weighted least squares problem immediately tells you a few things that can go wrong, and how. For example, if some of the input variables are correlated, then the Hessian <strong>H</strong> will be ill-conditioned, or even singular. This will result in large error bars (or &#8220;loss of significance&#8221;) around the estimates of certain coefficients. It can also result in coefficients with excessively large magnitudes, and often the wrong sign. If an input perfectly predicts the response for some subset of the data (but not all), then the term P<sub>i</sub> (1 &#8211; P<sub>i</sub>) will be driven to zero for that subset, which will drive the coefficient for that input to infinity (if the input perfectly predicted all the data, then the residual (<strong>y</strong> &#8211; <strong>P</strong><sub>k</sub>) has already gone to zero, which means that you are already at the optimum).</p>
<p>On the other hand, the least squares analogy also gives us the solution to these problems: <em>regularized regression</em>, such as lasso or ridge. Regularized regression penalizes excessively large coefficients, and keeps them bounded. If you are implementing your own logistic regression procedure, rather than using a package, then it is straightforward to implement a regularized least squares for the iteration step (<a href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">as Win-Vector has done</a>). But even if you are using an off-the-shelf implementation, the above discussion will help give you a sense of how to interpret the coefficients of your model, and how to recognize and troubleshoot some issues that might arise.</p>
<p><strong>Conclusion</strong></p>
<p>Here is what you should now know from going through the derivation of logistic regression step by step: </p>
<p>- Logistic regression models are multiplicative in their inputs. <br/></p>
<p>- The exponential of each coefficient is the factor by which a unit change in that input variable multiplies the odds of the response being true (the odds ratio). </p>
<p>- Logistic regression is coordinate-free: translations, rotations, and rescaling of the input variables will not affect the resulting probabilities. </p>
<p>- Logistic regression preserves the marginal probabilities of the training data. <br/></p>
<p>- Overly large coefficient magnitudes, overly large error bars on the coefficient estimates, and the wrong sign on a coefficient could be indications of correlated inputs. </p>
<p>- Coefficients that tend to infinity could be a sign that an input is perfectly correlated with a subset of your responses. Or put another way, it could be a sign that this input is only really useful on a subset of your data, so perhaps it is time to segment the data. </p>
<p>- Pseudo-R<sup>2</sup> is a useful goodness-of-fit heuristic. <br/></p>
<p><strong>References</strong></p>
<p>[Agresti, 1990] Agresti, A. (1990). Categorical Data Analysis.</p>
<p>[Hastie, et.al, 2009] Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning, 2nd Edition.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>A Personal Perspective on Machine Learning</title>
		<link>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=a-personal-perspective-on-machine-learning</link>
		<comments>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/#comments</comments>
		<pubDate>Sun, 31 Oct 2010 21:45:48 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[History]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1551</guid>
		<description><![CDATA[Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence.  I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature.<span id="more-1551"></span><br />
In the early days <a target="_blank" href="http://en.wikipedia.org/wiki/Machine_learning">machine learning</a> and artificial intelligence were famous for promising far too much and delivering far too little.  This has changed.  Artificial decision and reasoning systems are now everywhere.  One of the things masking the breadth and authority of artificial intelligence is the current prejudice: &#8220;if a system is well understood or works then it is no longer called artificial intelligence.&#8221;  A working system becomes a database, expert system, rules engine, machine learning platform, analytics dashboard, pattern recognition system or statistics warehouse.  We clearly have not reached anywhere near building a conversational intelligence (like Hal from 2001 or <a target="_blank" href="http://mzlabs.com/MZLabsJM/page6/Gerty/Gerty.html">Gerty</a> from Moon).  Yet every day machines decide if your credit card is accepted, advise on medical care, route goods, curate information and control vast industrial plants.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Hal-9000.jpg" alt="Hal-9000.jpg" border="0" width="150" height="150" /><br />
<br/>Hal 9000<br />
</center></p>
<p>There have been vast improvements in artificial intelligence.  Much of the improvement has been driven by the engineering effects of Moore&#8217;s Law (resulting in my mobile phone&#8217;s processor having 12 times the clock speed and over 32 times the memory of an $8 million <a target="_blank" href="http://en.wikipedia.org/wiki/Cray-1">Cray 1 supercomputer</a>) and by significant machine learning research results.  These changes in machine size happened during the productive careers of many researchers, so ideas are often evaluated at a series of radically different machine capabilities and data scales.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Cray-1-deutsches-museum.jpg" alt="Cray-1-deutsches-museum.jpg" border="0" width="487" height="536" /><br />
<br/>Cray 1<br />
</center></p>
<p>von Neumann himself commented that scale was a major limiting factor in early computers.  He asked how you could be expected to achieve anything significant even from a roomful of geniuses if (as with his early computers) all notes, communication and memory were limited to less than a single typed page.  von Neumann&#8217;s comment stands in contrast to science fiction scientists and early boosters of artificial intelligence who always seem to be in awe of their own creations.  Computers are certainly much larger now, but we need to be humble and put off deciding if we are yet in the era of large computers (compared to human or animal brains).  Everything we are doing now may still just be artificial intelligence&#8217;s pre-history and prologue.  Feynman in his lectures on computation mentions that RNA transcription can be estimated to take around 100 kT of energy to transcribe a bit while a transistor may easily use 100,000,000 kT energy units to switch states.  This means for the amount of heat the human head dissipates (energy supply and heat dissipation are rapidly becoming the most relevant measures of computational power) you could do a million times more work using RNA techniques (if you knew how) than with transistors.  So computers may not yet be what we should call large (though they are likely getting there).  What we currently call <a target="_blank" href="http://www.win-vector.com/blog/2009/08/on-the-hysteria-over-the-cloud/">&#8220;datacenters&#8221;</a> are in fact block-sized computers (consuming an enormous amount of energy and dissipating a huge amount of heat).</p>
<p><center><br />
<img  target="_blank" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/google_data_center_lenoir.jpg" alt="google_data_center_lenoir.jpg" border="0" width="512" height="355" /><br />
<br/>A datacenter (or a block-sized computer)<br />
</center></p>
<p>Not all improvements in machine intelligence have come from (or are to come from) improvements in hardware.  Many of the improvements came from machine learning research results and these are what I will outline below.</p>
<p>Early machine learning algorithms were driven by analogy.  This led us to perceptrons (1957, fairly early in the history of computer science) and neural nets.  These methods had their successes but were largely overused and were developed before researchers had a good list of desirable properties of a machine learning method.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/220px-Neural_network_example.svg_.png" alt="220px-Neural_network_example.svg.png" border="0" width="220" height="293" /><br />
<br/>Neural Net diagram<br />
</center></p>
<p>These methods live on but are,  in my opinion, not currently competitive.  Some of their important ideas and contributions have been revived from time to time, such as the online update rules becoming what we now call stochastic gradients.</p>
<p>A list of (often incompatible) desirable properties of a machine learning algorithm is the following:</p>
<ul>
<li>Able to represent complicated functions</li>
<li>Good generalization performance (quality predictions on data not seen during training)</li>
<li>Unique optimal model for a given set of data and feature definitions</li>
<li>Efficient and well characterized solution method</li>
<li>Consistent summary statistics</li>
<li>Preference for simple models</li>
</ul>
<p>We divert from this list for a bit of background and context.</p>
<p>The neural net was largely celebrated for its ability to represent complex functions and the perceived efficiency of its newer back-propagation based training method (related to the <a target="_blank" href="http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/">efficient calculation of gradients</a>).  The downsides were you never knew if your neural net was the right one (even assuming you had the right features, layout and training data) and could not be sure you were biasing towards simple models that might perform well on novel queries.  Great effort was expended in extending neural nets based on the supposition they should work as they were an analogy to how we imagined biological neurons might function.  An almost mystic hope was derived from the non-linear nature and special properties of the sigmoid curve (which was in fact a curve already known to statisticians).</p>
<p>Methods other than neural nets also had early success.  The field of information retrieval (which was not &#8220;sexy&#8221; prior to the Web) has had huge success since the 1960s with <a target="_blank" href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Rocchio_Classification">Rocchio Classification</a>, and <a target="_blank" href="http://en.wikipedia.org/wiki/Tf–idf">TF/IDF</a> methods.  The early success of these methods may have in fact delayed research on current hot research areas such as segmentation and author topic models.</p>
<p>Theoretical computer science initially sought to characterize machine learning methods in non-statistical language.  In the 1980s a great amount of ink was spilled on &#8220;learning boolean functions.&#8221;  Papers proving nothing was learnable (by picking a function related to cryptography) alternated with papers proving everything was learnable (for example via amplification techniques like boosting).  Generalization of models to new data remained a theoretical problem that was dealt with by appeals to model complexity and <a target="_blank" href="http://en.wikipedia.org/wiki/Minimum_description_length">MDL</a> (minimum description length).  A major breakthrough in characterizing generalization performance was the <a target="_blank" href="http://en.wikipedia.org/wiki/Probably_approximately_correct_learning">PAC model</a> (probably approximately correct) framework which finally allowed direct treatment of generalization performance.</p>
<p>We now have enough context  to discuss some of the current best of breed machine learning techniques (that address many of the desired properties mentioned above):</p>
<ul>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine">Support Vector Machines</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Kernel_methods">Kernel Methods</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Logistic_regression">Logistic Regression</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Principle_of_maximum_entropy">Maximum Entropy Methods</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">Regularization</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Graphical_model">Graphical Models</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Conditional_random_field">Conditional Random Fields</a></li>
</ul>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/556px-Svm_max_sep_hyperplane_with_margin.png" alt="556px-Svm_max_sep_hyperplane_with_margin.png" border="0" width="278" /><br />
<br/><br />
Typical SVM maximum margin diagram<br />
</center></p>
<p>Not all of these methods are new (Logistic Regression for example dates from 1925 and is itself based on regression which goes back to Gauss).  But the concerns these methods address are all much more statistical than artificial intelligence in nature.  For example we don&#8217;t  suppose that there is some cryptographically obscured combination of features that we need to find to make the best prediction.  We instead worry about detecting which features are useful and note that it is a significant (though solvable) problem to correctly use combinations of useful features (phrased as statistical concerns: feature to feature dependencies and higher order interactions).  Machine learning has always run where statisticians fear to tread.   But more and  more often we are seeing that the methods and concerns of statisticians are what are needed to achieve many of the listed desired properties of machine learning models.</p>
<p>The methods I have singled out for praise are very effective and achieve a number of our listed desired properties.  For example:  both logistic regression and maximum entropy have a unique solution that is easy to find.  They are also both consistent with all summaries known during training.  That is: if 30% of the positive training data has a feature present then 30% of the data also has the feature present when weighted by the model&#8217;s score (so the model score shares a lot of properties with training truth).  Support Vector Machines also have well understood solutions and a theory (called maximum margin) that directly addresses generalization (good predictions on new data).  Kernel Methods (both as used in SVMs and elsewhere) allow controlled introduction of very complex functions.  Graphical Models and Conditional Random Fields also allow the controlled introduction of modeled dependencies in the data.</p>
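<p>The consistency property can be checked numerically. The following is a minimal sketch, assuming a plain (unregularized) logistic regression with an intercept, fit by simple gradient ascent on made-up data with two binary features; none of the names come from a particular library.</p>

```python
import math

# Hypothetical toy data: ((a, b), number of rows, number of positive rows).
# Every (a, b) cell mixes outcomes, so the fitted coefficients stay finite.
cells = [((0, 0), 10, 2), ((1, 0), 10, 5), ((0, 1), 10, 4), ((1, 1), 10, 8)]
rows = []
for (a, b), count, positives in cells:
    rows += [(a, b, 1)] * positives + [(a, b, 0)] * (count - positives)

def score(w, a, b):
    """Model score: estimated probability that the outcome is positive."""
    return 1.0 / (1.0 + math.exp(-(w[0] + w[1] * a + w[2] * b)))

def fit(rows, steps=5000, rate=1.0):
    """Maximize the mean log-likelihood by plain gradient ascent."""
    w = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for a, b, y in rows:
            err = y - score(w, a, b)
            grad[0] += err
            grad[1] += err * a
            grad[2] += err * b
        w = [wj + rate * gj / n for wj, gj in zip(w, grad)]
    return w

w = fit(rows)
scores = [score(w, a, b) for a, b, _ in rows]

# Fraction of the positive training rows that have feature a present
positives_with_a = sum(a for a, _, y in rows if y == 1) / sum(y for _, _, y in rows)
# Fraction of all rows that have feature a present, weighted by model score
weighted_with_a = sum(p * a for p, (a, _, _) in zip(scores, rows)) / sum(scores)
print(positives_with_a, weighted_with_a)
```

<p>On this toy data 13 of the 19 positive rows have the first feature present, and the score-weighted fraction agrees with that value up to the convergence tolerance of the fit.</p>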
<p>It is now common to call what was previously thought of as artificial intelligence or machine learning: &#8220;statistical machine learning.&#8221;  This reflects that the kind of prediction and characterization we expect from machine learning algorithms are in fact statistical concerns that we can deal with if we have enough data and enough computational resources. </p>
<p>The current important issues for statistical machine learning include:</p>
<ul>
<li>Dealing with very large datasets (driving the return of simpler methods like Naive Bayes)</li>
<li>Dealing with lack of training data (driving interest in clustering and manifold regularization methods)</li>
<li>Dealing with unstructured data and text mining (driving interest in information extraction and segmentation via generative models)</li>
</ul>
<p>Just as Wigner famously wrote about &#8220;The Unreasonable Effectiveness of Mathematics&#8221; in the 1960s, Halevy, Norvig and Pereira write about the &#8220;Unreasonable Effectiveness of Data.&#8221;   They argue that we are in the age of big data (or the age of analysts).   Or, as Varian observed: &#8220;it is a good time to supply a good complementary to data&#8221; (i.e. it is a good time to be an analyst).  I would temper this by noting that we are likely in the age of unmarked data and unstructured data.  Less often are we asked to automate a known prediction and more often we are asked to cluster, characterize and segment wild data. In my opinion the hard problem in machine learning has moved from prediction to characterization.  With enough marked training data (that is, data for which we know both the observables and desired outcome) it is now quite possible to use standard techniques and libraries to build a very good predictive model.  However, it is still hard to characterize, segment or extract useful information from the wealth of unstructured and unmarked data that is upon us.  And this is where a lot of the current research in statistical machine learning is directed.  </p>
<p>Of course characterization and clustering have their own infamous history.  Rota wrote: &#8220;&#8230; Or a subject is important, but nobody understands what is going on; such is the case with quantum field theory, the distribution of primes, pattern recognition and cluster analysis.&#8221;  Artificial intelligence may be moving from areas where computer scientists have over-promised to areas where statisticians have over-promised.  But this is not a disaster: the most valuable research tends to be done in hectic times in messy fields, not in calm times in neat fields.  And the already large scale adoption of statistical machine learning techniques means there is immediate great client value in even seemingly small improvements in understanding, explanation, documentation, training, tools, libraries and techniques.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Xbarst1.jpg" alt="Xbarst1.jpg" border="0" width="384" height="398" /><br />
<br/><br />
Classic attempt to add structure to text<br />
</center></p>
<p>(images from Wikipedia)</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>What Did Theorists Do Before The Age Of Big Data?</title>
		<link>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-did-theorists-do-before-the-age-of-big-data</link>
		<comments>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/#comments</comments>
		<pubDate>Mon, 02 Aug 2010 18:42:45 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Age of Big Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Mean]]></category>
		<category><![CDATA[Mean of Medians]]></category>
		<category><![CDATA[Median]]></category>
		<category><![CDATA[Median of Means]]></category>
		<category><![CDATA[Theorist]]></category>
		<category><![CDATA[Winsorized mean]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1514</guid>
		<description><![CDATA[We have been living in the age of &#8220;big data&#8221; for some time now. This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: &#8220;The Unreasonable Effectiveness of Data&#8221; Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)). But [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We have been living in the age of &#8220;big data&#8221; for some time now.  This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: &#8220;The Unreasonable Effectiveness of Data&#8221; Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)).  But I have gotten to thinking about the period before this.   The period before we had easy access to so much data, before most computation was aggregation and before we accepted numerical analysis style convergence as &#8220;efficient.&#8221;  A small problem I needed to solve (as part of a bigger project)  reminded me what theoretical computer scientists did then: we worried about provable worst case efficiency.</p>
<p><span id="more-1514"></span><br />
The problem that got me thinking is this: </p>
<p>Given a sequence of n integers x1 through xn and an integer k (1 &le; k &le; n), find the mean value of all of the medians of the k-sized selections from x1 through xn.  Or as a formula:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/EMedian.png" alt="EMedian.png" border="0" width="220" /><br />
</center></p>
<p>where x_s is defined as the sequence of integers whose indices are in the set s (not necessarily a contiguous sequence).   The median is the &#8220;value in the middle&#8221; (a value such that half of the selected data are above it and half are below) and &#8220;(n choose k)&#8221; is the number of ways to choose k items from a collection of n items (which is just the number: n!/((n-k)! k!)).  So our sum is adding up a number of terms and then dividing by the number of such terms to get the average or mean of the terms.  We will call this sum a &#8220;mean of medians&#8221;.</p>
<p>Some obvious special cases are: for k=1 the<br />
expression simplifies to the sum of the x_i divided by n (or just the mean of all the x_i) and for k=n it simplifies to the median of all of the x_i.  For intermediate values of k it is not immediately obvious how to efficiently compute the value of the sum.  Directly adding all (n choose k)  terms (as the sum is written) would be very slow for large n with even moderate sized k.  Instead we look to find some method that has the same value of the total sum, without directly computing all of the terms.</p>
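<p>The definition and the two special cases can be written down directly. This is a minimal brute-force sketch (the function name is made up for the example); it enumerates every k-sized selection, so it is only usable for small n.</p>

```python
from itertools import combinations
from statistics import median

def mean_of_medians_brute(xs, k):
    """Average the median over every k-sized selection from xs."""
    # combinations() selects by position, so repeated values count separately
    subsets = list(combinations(xs, k))
    return sum(median(s) for s in subsets) / len(subsets)

xs = [3, 1, 4, 1, 5, 9, 2, 6]
print(mean_of_medians_brute(xs, 1))        # k=1: the plain mean of xs
print(mean_of_medians_brute(xs, len(xs)))  # k=n: the median of xs
print(mean_of_medians_brute(xs, 3))        # an intermediate k
```

<p>The number of selections is (n choose k), which is what makes this direct approach impractical as n grows.</p>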
<p>This gets us to the ad-hoc side of theoretical computer science.  We need a clever idea.  In this case the idea is simple.  To keep things simple: suppose all of the n integers in our sequence have different values and that k is odd (neither of these conditions is important; they just let us avoid some non-essential technicalities).  What values can median(x_s) take on? median(x_s) is always x_i for some i in the subset s.  In fact our sum is equivalent to:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/Sum2.png" alt="Sum2.png" border="0" width="330"  /><br />
</center></p>
<p>This new sum has a reasonable number of terms, so we could calculate it directly if we knew the values of the terms.  Without loss of generality assume the x_i are sorted in increasing order.  Then the number of times x_i is the median of some x_s is exactly:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/term.png" alt="term.png" border="0" width="191" /><br />
</center><br />
(and 0 for i &lt; 1+(k-1)/2 or i &gt; n &#8211; (k-1)/2).  This is just the number of ways to place x_i in the center of a subset, by choosing (k-1)/2 smaller neighbors and (k-1)/2 larger neighbors.   The count is given by multiplying the number of ways to choose smaller neighbors by the number of ways to choose the larger neighbors.</p>
<p>The complete solution calculating the mean of medians for distinct sorted x_i is:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/fullsum1.png" alt="fullsum.png" border="0" width="333"  /><br />
</center></p>
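<p>The counting argument translates into a short routine. This is a sketch assuming odd k, as in the text; the function names are hypothetical. With 0-based positions in the sorted list, the value at position i has i smaller neighbors and n-1-i larger neighbors to choose from.</p>

```python
from math import comb

def median_weights(n, k):
    """Number of k-sized selections in which each sorted position is the median."""
    if k % 2 == 0 or k > n:
        raise ValueError("this sketch assumes odd k with k no larger than n")
    m = (k - 1) // 2
    # comb() returns 0 when there are too few neighbors on either side
    return [comb(i, m) * comb(n - 1 - i, m) for i in range(n)]

def mean_of_medians(xs, k):
    """Mean of the medians of all k-sized selections, without enumerating them."""
    s = sorted(xs)
    weights = median_weights(len(s), k)
    return sum(x * w for x, w in zip(s, weights)) / comb(len(s), k)

print(median_weights(10, 5))
print(mean_of_medians(range(1, 11), 5))
```

<p>After the sort, the work is one pass over the n values rather than a sum over (n choose k) selections, and the weights add up to (n choose k) as they should.</p>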
<p>A statistician would recognize this expression as a kind of centrally weighted Winsorized mean.  The shape of the graph of weights (in this case for n=10, k=5) is suggestive of a bounded normal window (though i is a rank, not a free-ranging value):</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/10w5.png" alt="10w5.png" border="0" width="525" height="525" /><br />
</center></p>
<p>Likely we have re-invented a data treatment known to statisticians.  But the above steps were really just combinatorics.  What a theorist does is abstract something down to this sort of problem and think of variations and solutions.   The formal side of theoretical computer science is attempting to organize all of the problems in the world into two classes: problems presumed easy and problems presumed hard.</p>
<p>For example, what if we had wanted to know the median of many means instead of the mean of many medians?
It turns out a small variation of the median of means problem is already known to be difficult.  The hard version of the reversed problem is called &#8220;Kth largest subset&#8221; (this is a different K than we have been using up until now).   The Kth largest subset problem is: given a sequence of integers and constants K and B, do K or more distinct subsets have a sum of no more than B?  The Kth largest subset problem is known to be &#8220;NP hard&#8221;, which means that a whole host of other problems thought to be hard could be solved if we had the ability to solve this problem (see &#8220;Computers and Intractability: A Guide to the Theory of NP-Completeness&#8221; Michael R. Garey and David S. Johnson, 1979).  The median of many means is not quite as expressive as the Kth largest subset problem (so we have <em>not</em> proven the median of many means is itself NP hard) but the relation is strong: to even verify that we had the right median we would need to solve a Kth largest subset problem with K=(n choose k)/2 and B= k*the_claimed_median (assuming we modified the subset problem to restrict itself to size k sub-sequences).   In fact even checking if a given number is one of the possible means (let alone if it is the median of the means) is equivalent to the NP Complete knapsack problem.  This correspondence makes us suspect that something as simple as the trick we devised for the mean of medians problem will not solve the median of means problem.  One should not take this argument too seriously though, as the same argument would seem to apply to the pair of problems &#8220;min of means&#8221; and &#8220;mean of mins&#8221;, both of which are in fact easy.  We have not proven the median of means problem to be hard; we have just found relevant algorithms and related problems.  </p>
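<p>One way to see that &#8220;min of means&#8221; and &#8220;mean of mins&#8221; are easy is the following sketch. It reuses the counting idea from the mean of medians; the function names and the derivation are supplied here for illustration and are not taken from a reference.</p>

```python
from math import comb

def mean_of_mins(xs, k):
    """Mean of the minimum over all k-sized selections from xs."""
    s = sorted(xs)
    n = len(s)
    # The value at sorted position i (0-based) is the minimum exactly when
    # the other k-1 members are chosen from the n-1-i larger positions.
    total = sum(x * comb(n - 1 - i, k - 1) for i, x in enumerate(s))
    return total / comb(n, k)

def min_of_means(xs, k):
    """Smallest mean of any k-sized selection: the mean of the k smallest values."""
    return sum(sorted(xs)[:k]) / k

print(mean_of_mins([1, 2, 3, 10], 2))
print(min_of_means([1, 2, 3, 10], 2))
```

<p>Both routines need only a sort and a single pass, even though each is defined over (n choose k) selections.</p>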
<p>What theorists do is find these analogs and then do the heavy lifting (continue the work even when the solution is not simple) to find a way to either efficiently solve the problem at hand or prove it is in fact as hard as other important problems.  This kind of thing is hard work for small gains, but the value accumulates because each of these gains is permanent.  Finally additional variations of the problem are tried and characterized, to help check we are not &#8220;leaving money on the table&#8221; (missing nearby improvements).  Some of this may seem like pointless work on toy problems, but every once in a while one of these solutions or characterizations is exactly what is needed to take a large system (like a search engine) to the next level of performance.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Living in A Lognormal World</title>
		<link>http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=living-in-a-lognormal-world</link>
		<comments>http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/#comments</comments>
		<pubDate>Wed, 03 Feb 2010 23:46:37 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[customer value]]></category>
		<category><![CDATA[lognormal distribution]]></category>
		<category><![CDATA[long tail theory]]></category>
		<category><![CDATA[McPhee's Theory of Exposure]]></category>
		<category><![CDATA[median versus mean]]></category>
		<category><![CDATA[power law distribution]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1388</guid>
		<description><![CDATA[Recently, we had a client come to us with (among other things) the following question: Who is more valuable, Customer Type A, or Customer Type B? This client already tracked the net profit and loss generated by every customer who used his services, and had begun to analyze his customers by group. He was especially [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/' rel='bookmark' title='Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/' rel='bookmark' title='&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures'>&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Recently, we had a client come to us with (among other things) the following question:<br />
Who is more valuable, Customer Type A, or Customer Type B?</p>
<p>This client already tracked the net profit and loss generated by every customer who used his services, and had begun to analyze his customers by group. He was especially interested in Customer Type A; his gut instinct told him that Type A customers were quite profitable compared to the others (Type B) and he wanted to back up this feeling with numbers.</p>
<p>He found that, on average, Type A customers generate about $92 profit per month, and Type B customers average about $115 per month (The data and figures that we are using in this discussion aren&#8217;t actual client data, of course, but a notional example). He also found that while Type A customers make up about 4% of the customer base, they generate less than 4% of the net profit per month. So Type A customers actually seem to be less profitable than Type B customers. Apparently, our client was mistaken.</p>
<p>Or was he? <span id="more-1388"></span></p>
<p>A little more elementary statistics revealed that the median profit generated by Type A customers is $65; that is, half the customers from group A generate more than $65 profit per month. The median for Type B customers is about $4.80, so half the customers from group B generate less than five dollars profit every month. Maybe our client&#8217;s instincts aren&#8217;t completely off-base.</p>
<p>Let&#8217;s look at the distribution of net profit across both customer populations:</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/densityAll.png" border="0" alt="densityAll.png" width="500" /></div>
</td>
</tr>
</tbody>
<caption><em>Figure 1: Distribution of net profit for Type A customers (blue) and Type B customers (red). The x-axis gives the net profit or loss, and the y-axis gives the fraction of the population that generates a given net profit. </em><br />
</caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>This pattern is typical among the customers of many businesses. The majority of customers generate relatively moderate profit (or loss); but there is an important minority of large-profit and large-loss customers out on both tails. In this case, the monthly customer value actually ranges from losses in the tens of thousands to profits of several hundred thousands (I clipped the graph, for &#8220;clarity&#8221;).</p>
<p>I hesitate to call these large magnitude customers &#8220;outliers&#8221; because that term implies anomalous, possibly erroneous, data. In this case, the &#8220;outliers&#8221; are relatively rare, but important, customers who can potentially make the difference between a company that is in the black or in the red. Still, they are the exception and their behavior doesn&#8217;t necessarily tell you anything about the behavior of your typical customer. Knowing the mean profitability of a given customer group is important, of course, but the estimate will be dominated by your exceptionally profitable or lossy customers in that group, and as we&#8217;ve seen, that hides information about the majority of your customers.</p>
<p>You might remember from our <a href="http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/">Good Graphs article</a> that if you have positively skewed data with a wide dynamic range, graphing the data on a log scale helps you see phenomena across the entire range of data that you might miss on the ordinary graph. Unfortunately, we have data here in both the positive and the negative range. So let&#8217;s split the customers into three groups: profitable, unprofitable, and break-even. About 5-6% of the customer base is break-even, roughly the same proportion in Groups A and B; we&#8217;ll ignore them for now, and look at the profitable customers first (over 80% of the customers, in both groups).</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/02/positiveCusts.png" border="0" alt="positiveCusts.png" width="500" /></div>
</td>
</tr>
</tbody>
<caption><em>Figure 2: Distribution of profit from profitable Type A customers (blue) and Type B customers (red). The x-axis gives net profit on a log 10 scale, so every labelled tick corresponds to a change by a factor of 100 (e.g., 10^0 = $1, 10^2 = $100, and so on). The y-axis represents the fraction of the profitable customer base that generates a given profit.<br />
</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>Now we can clearly see that (among profitable customers) the typical Type A customer is in fact more profitable than the typical Type B customer. The mean profit from profitable Type A customers is about $227, and the median profit is about $93 (shown by the dashed blue line). About 2/3 of the profitable Type A customers generate between $21 and $400 in profit, and over 95% of them generate between $5 and $1721 in profit. We can call that 95% the set of &#8220;typical&#8221; profitable Type A customers. That&#8217;s not a standard definition, but it&#8217;s an intuitive one, and useful for this discussion.</p>
<p>Approximately 2.5% of profitable Type A customers generate profits greater than $1721; let&#8217;s call them the Type A &#8220;best-customers,&#8221; some of whom generate profits in the tens of thousands. They are responsible for 30% of the profit that comes from profitable Type A customers, and 3% of the profit that comes from all profitable customers (even though they only make up 0.2% of that population).</p>
<p>Profitable Type B customers generate $148 mean profit, and about $7.67 median profit (the red dashed line). A typical profitable Type B customer generates between six cents and $1031 in profit — a lower range than what the typical Type A customer generates, although the very highest-performing Type B customers are competitive with the highest-performing Type A customers (about 130 Type B customers outperform all the Type A customers).</p>
<p>Unfortunately, when Type A customers are unprofitable, they are typically more unprofitable than those of Type B. This is another reason why the mean profit from Type A customers overall was so low. Our client correctly perceived that Type A customers are typically quite profitable, but there is a small population of real clunkers in the group, too.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/02/negativeCusts.png" border="0" alt="negativeCusts.png" width="500" /></div>
</td>
</tr>
</tbody>
<caption><em>Figure 3: Distribution of loss from unprofitable Type A customers (blue) and Type B customers (red). The x-axis gives loss on a log 10 scale; further to the right on the graph means a larger loss. An unprofitable Type A customer loses a median of $137 a month, and a mean of $1180. Unprofitable Type B customers lose a median of $4.80, and a mean of $210.<br />
</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>We can do a similar analysis for the entire base of profitable customers. We would find that the typical profitable customer generates between six cents and $1200 in profit every month (median $8.65, mean $153), and that the 2.5% of best-customers generate over 60% of the profits.</p>
<p><strong>The Lognormal Distribution</strong></p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/02/lognormalComp.png" border="0" alt="lognormalComp.png" width="536" height="270" /></div>
</td>
</tr>
</tbody>
<caption><em> Figure 4: (Left) Distribution of profitable customers (graph clipped at $10,000). The x-axis gives the net profit, and the y-axis gives the fraction of the population that generates a given net profit. (Right) Distribution of profitable customers plotted on a log scale.<br />
</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>The distribution of highly skewed positive data, like the value of profitable customers, incomes, sales, or stock prices, can often be modelled as a <a href="http://en.wikipedia.org/wiki/Log-normal_distribution">lognormal distribution</a>: that is, the log of the data is distributed in a bell-shaped curve centered at the log of the median of the data (remember, for a normal curve, the median and the mean are the same). In our case, both the profits (seen above, in Figure 4) and the losses are distributed approximately lognormally. For lognormal populations, the mean is generally much higher than the median, and the bulk of the contribution towards the mean will be made by a small population of highest-valued data points. <em>If you use the mean as a stand-in for value, you will overstate the value of most of your customers.</em></p>
<p>If your customer value data is distributed approximately lognormally, then you can quickly estimate the range of values that 95% of your customers will fall into. About 95% of normally distributed data will fall within plus/minus two standard deviations of the mean, and taking logarithms converts multiplication into addition, so adding or subtracting two standard deviations in log space is the same as multiplying or dividing by a fixed factor twice. So: if <em>sd</em> is the standard deviation of the natural log of your customer value data, <em>M</em> is the median profit, and <em>k</em> = exp(<em>sd</em>), then about 95% of your customers will fall in the value range (<em>M/(k*k)</em>, <em>M*k*k</em>). The 2.5% of customers who generate more than <em>M*k*k</em> profit are your best-customers, who often drive a majority of your profit.</p>
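<p>As a rough illustration, the range estimate above can be sketched in a few lines of Python. The function name <em>typical_range</em> and the sample profits are made up for this example (they are not the client data), and the sketch assumes every value is strictly positive so that the logarithm is defined.</p>

```python
import math
import statistics

def typical_range(values, n_sd=2.0):
    """Estimate the (low, high) range M/k**n_sd .. M*k**n_sd.

    Assumes values are strictly positive and roughly lognormal.
    M is the median and k = exp(sd), where sd is the standard
    deviation of the natural log of the values.
    """
    logs = [math.log(v) for v in values]
    sd = statistics.stdev(logs)          # spread in log space
    median = statistics.median(values)   # M in the text
    k = math.exp(sd)                     # multiplicative "one sd" factor
    return median / k ** n_sd, median * k ** n_sd

# Hypothetical monthly profits, in dollars (illustrative only).
profits = [1.5, 3.0, 8.0, 20.0, 55.0, 150.0, 400.0, 1100.0]
low, high = typical_range(profits)
print("typical range: %.2f to %.2f" % (low, high))
```

<p>Calling the same function with <em>n_sd=1.0</em> gives the narrower band that should hold roughly two thirds of the customers, which is the kind of band quoted for the Type A customers earlier.</p>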
<p><strong>Long Tail Theory</strong></p>
<p>The distribution of customers above sounds a lot like Chris Anderson&#8217;s <a href="http://www.wired.com/wired/archive/12.10/tail.html">Long Tail Theory</a> of consumer goods. Most of the revenue of (for example) a bookseller or a music store comes from a few &#8220;hits&#8221;, or blockbusters, with the rest of the merchant&#8217;s inventory out along the tail of Figure 5, moving a relatively small volume per title.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/LongTailComp.png" border="0" alt="LongTailComp.png" width="522" height="644" /></div>
</td>
</tr>
</tbody>
<caption><em> Figure 5: (Top) A notional long tail curve. The y-axis represents sales volume, and the x-axis represents goods ranked from most to least popular. The highest selling goods are to the left. Note that this figure represents the sales curve differently from how the distribution of customer value is represented on the left side of Figure 4. (Bottom) The customer value data (top 10,000 customers) from Figure 4, plotted in the style above. The y-axis has been limited to $50,000 for clarity.<br />
</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>Anderson generally assumes that sales of such goods are distributed as a power law distribution, rather than a lognormal; the log of power law data isn&#8217;t distributed symmetrically, but actually has a longer tail to the right. This means that even for the log of the data, the mean is higher than the median. In fact, in some cases, the mean of a power law distribution can be infinite. If sales volume is power law distributed, then top-selling hits are responsible for an even larger percentage of total sales volume than would be the case with a lognormal.</p>
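<p>To make the two claims above concrete, here is a small sketch for the Pareto distribution, one common power law. It relies only on standard closed-form facts: the mean is finite only when the shape parameter is greater than 1, and the log of a Pareto variable is a shifted exponential, whose mean exceeds its median. The function names and the parameter names <em>alpha</em> and <em>x_min</em> are choices made for this illustration.</p>

```python
import math

def pareto_mean(alpha, x_min=1.0):
    """Mean of a Pareto distribution with shape alpha and scale x_min.

    The mean is infinite when alpha is at most 1.
    """
    if alpha > 1.0:
        return alpha * x_min / (alpha - 1.0)
    return math.inf

def log_pareto_mean_and_median(alpha, x_min=1.0):
    """Mean and median of log(X) when X is Pareto distributed.

    log(X) equals log(x_min) plus an exponential variable with rate
    alpha, so its mean is 1/alpha and its median is ln(2)/alpha
    above the shift.
    """
    shift = math.log(x_min)
    return shift + 1.0 / alpha, shift + math.log(2.0) / alpha

print(pareto_mean(0.8))                 # heavy tail: infinite mean
print(pareto_mean(2.0))                 # finite mean
print(log_pareto_mean_and_median(1.5))  # mean of the log exceeds its median
```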
<p>The <a href="http://en.wikipedia.org/wiki/Pareto_distribution">Pareto Distribution</a>, which is one form of a power law distribution, has been proposed as an alternative to the lognormal for modelling income distribution and other similar phenomena. Researchers have debated whether lognormal or Pareto is a better model for income distribution since at least the 1950s. Qualitatively, the two distributions have similar behavior. There are certain estimation and forecasting tasks where it does make a difference if your data follows a power law rather than a lognormal, but for the purposes of this discussion, it doesn&#8217;t really matter. For those who are interested, Michael Mitzenmacher has a <a href="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.im/1089229510">fairly approachable discussion</a> about the difference between power laws and lognormal distributions.</p>
<p>Back to Long Tail Theory. Historically, merchants tend to concentrate on high-volume items, due to space limitations and the cost of holding inventory. Overall, however, the sum total of tail-product sales will add up to a respectable volume, especially for web retailers who have unlimited &#8220;floor space&#8221; — or so the Long Tail theory goes. A retailer must then decide whether to follow the traditional &#8220;hits-oriented&#8221; strategy, or a more &#8220;tail-oriented&#8221; strategy that caters to the numerous niche markets.</p>
<p>If we draw an analogy with customer value, then best-customers are &#8220;hits.&#8221; Obviously, our client would like to &#8220;fire&#8221; his unprofitable customers while retaining his best-performing customers, and even attract more customers like them. But what about his little customers — the 95% of customers in the typical range? If his retention and growth strategy focuses primarily on attracting and retaining big customers, he is following a hits-oriented strategy. If his campaign also includes reaching out to little customers, then he is following something analogous to a tail strategy.</p>
<p>Not all business works like a music or book seller; the appropriate strategy will vary. Still, we can think of a few reasons why keeping little customers happy is a good idea.</p>
<p>For one thing, big customers are not only rare, but they are the ones that your competitors covet the most. Little customers, meanwhile, can still add up to a respectable chunk of change (close to 40% of net profit in our example above). A solid cushion of smaller customers may soften the blow to your profit margin, should a few of your bigger customers defect.</p>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/logos1.png" border="0" alt="logos.png" width="369" height="336" /></div>
<p>Consider computer sales. Microsoft and Dell serve both the corporate and consumer markets. To judge from their past marketing practices, they consider business customers to be the more valuable segment (see <a href="http://www.win-vector.com/blog/2009/07/microsoft-store-again/">here</a> for a rant somewhat related to this topic). But business IT sales have declined in the current moribund economic climate; analysts attribute the growth in computer sales for the last quarter of 2009 <a href="http://www.cultofmac.com/apple-saw-24-growth-in-q4-2009-as-computer-market-bounces-back/26184">primarily to consumer spending</a>. Dell&#8217;s market growth for that last quarter was much lower than that of HP, Acer, and Apple, which are more consumer-oriented companies. It&#8217;s also worth noting that Microsoft saw a 14% <a href="http://www.neowin.net/news/main/09/10/23/windows-and-xbox-help-microsoft-earnings-beat-predictions">decline in revenue</a> for the quarter ending September 30, 2009, compared to the year-ago quarter (and their earnings were in large part due to sales of the Xbox, a consumer product), while at the same time, consumer-oriented Apple saw a <a href="http://www.cultofmac.com/apple-saw-24-growth-in-q4-2009-as-computer-market-bounces-back/26184">24% increase in revenue</a> from its year-ago quarter.</p>
<p>Your pool of little customers is also a pool of potential future best-customers. And <a href="http://insight.kellogg.northwestern.edu/index.php/Kellogg/article/predicting_customer_lifetimevalue">you can&#8217;t always guess which ones</a>. So a wise strategy might be to allocate part of your retention and growth campaign to providing loyalty incentives to smaller customers, and educating them about how your higher-end services or products might benefit them. Those little customers who have the means or opportunity to move on to the next level might very well appreciate your efforts, and stay with you, rather than defecting to a competitor.</p>
<p><strong>Optimizing Sales vs. Optimizing Customers</strong></p>
<p>One last thought about retail hits and high-value customers. McPhee&#8217;s Theory of Exposure, which is cited by Anita Elberse in her Harvard Business Review article <a href="http://hbr.org/2008/07/should-you-invest-in-the-long-tail/ar/1">&#8220;Should You Invest in the Long Tail?&#8221;</a>, states that the popularity of music, film, TV or books is largely driven by &#8220;marginal audience participants&#8221;: the casual, or light, consumers. Casual consumers gravitate to already popular products because they have limited exposure to alternatives, and hence limited knowledge of them. Consumers of more obscure products, on the other hand, tend to be heavy (and knowledgeable) consumers: voracious readers, dedicated music or film buffs, or enthusiasts of specific genres, like science-fiction or horror.</p>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/albums.png" border="0" alt="albums.png" width="300" /></div>
<p>McPhee&#8217;s research was done in 1963, using subjects who had a fairly small range of choices, compared to internet scale. Elberse found, however, that the phenomena McPhee described still held for the internet merchants that she studied. She uses this observation (along with McPhee&#8217;s companion theory of <a href="http://en.wikipedia.org/wiki/Double_jeopardy_(marketing)">Double Jeopardy</a>) to argue that retailers should not substantially alter their traditional hits-based strategies. There is an alternative interpretation:</p>
<p><em>If your business follows McPhee&#8217;s theory, then hit products disproportionately attract low-value (low-volume) customers, and vice-versa. </em></p>
<p>So an overly hits-oriented strategy will skew you towards a base of low-value customers. Indeed, <a href="http://sethgodin.typepad.com/seths_blog/2009/12/its-not-the-rats-you-need-to-worry-about.html">Seth Godin argues</a> that iTunes and Amazon, who are in a better position to implement a more tail-oriented strategy, are thriving at the expense of physical stores exactly because they have been able to steal the quality (high-volume) customers away.</p>
<p>The moral is that both sales and customer value live in a lognormal world, where blockbuster products are marketed to a large cloud of low revenue customers, and high revenue best-customers are supported by large catalogues of low volume products. Fail to serve one side of this relationship, and you risk losing the other side.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/' rel='bookmark' title='Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/' rel='bookmark' title='&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures'>&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Statistics to English Translation, Part 2b: Calculating Significance</title>
		<link>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=statistics-to-english-translation-part-2b-calculating-significance</link>
		<comments>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 07:02:40 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[F-test]]></category>
		<category><![CDATA[significance]]></category>
		<category><![CDATA[t-test]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1281</guid>
		<description><![CDATA[In the previous installment of the Statistics to English Translation, we discussed the technical meaning of the term &#8221;significant&#8221;. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/' rel='bookmark' title='Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/' rel='bookmark' title='&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures'>&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Living in A Lognormal World'>Living in A Lognormal World</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-’significant’-doesn’t-always-mean-’important’/">previous installment</a> of the <a href="http://www.win-vector.com/blog/category/statistics-to-english-translation/">Statistics to English Translation</a>, we discussed the technical meaning of the term &#8221;significant&#8221;. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance in research papers: statements like &#8220;<!-- MATH  $(F(2, 864) = 6.6, p = 0.0014)$  --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg4.png" border="0" alt="$ (F(2, 864) = 6.6, p = 0.0014)$" width="238" height="37" align="middle" /> &#8221;.</p>
<p>As in the <a href="http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-’significant’-doesn’t-always-mean-’important’/">last article</a>, we will concentrate on situations where we want to test the difference of means. You should read that previous article first, so you are familiar with the terminology that we use in this one.</p>
<p>A pdf version of this current article can be found <a href="http://win-vector.com/dfiles/ste2b_calculatesig.pdf">here</a>.<br />
<span id="more-1281"></span></p>
<h1><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">How is Significance Determined?</a></h1>
<p>Generally speaking, we calculate significance by computing a <em>test statistic</em> from the data. If we assume a specific null hypothesis, then we know that this test statistic will be distributed in a certain way. We can then compute how likely it is to observe our value of the test statistic, if we assume that the null hypothesis is true.</p>
<p>We&#8217;ll explain the use of a test statistic with our Sneetch example from the last installment.</p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">The t-test for Difference of Means</a></h1>
<p>Suppose that the test scores for both Star-Bellies and Plain-Bellies are normally distributed, with the means and standard deviations as given in the table below.</p>
<div align="center">
<table cellpadding="3" border="1">
<tr>
<td align="center">&nbsp;</td>
<td align="center"><img width="16" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg5.png" alt="$ n$"> (number of subjects)</td>
<td align="center"><img width="21" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg6.png" alt="$ m$"> (mean score)</td>
<td align="center"><img width="14" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg7.png" alt="$ s$"> (standard deviation)</td>
</tr>
<tr>
<td align="center">Star-Bellies</td>
<td align="center">50</td>
<td align="center">78</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">Plain-Bellies</td>
<td align="center">40</td>
<td align="center">74</td>
<td align="center">8</td>
</tr>
</table>
</div>
<p>Remember from the previous installment that we can estimate the true population means <img width="24" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg8.png" alt="$ \mu_1$"> and <img width="24" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg9.png" alt="$ \mu_2$"> as normally distributed around the empirical population means <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg10.png" alt="$ m_1$"> and <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg11.png" alt="$ m_2$"> respectively, with variances<br />
<img width="52" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg12.png" alt="$ \sigma^2/{n_1}$"> and<br />
<img width="52" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg13.png" alt="$ \sigma^2/{n_2}$"> . This is shown in Figure <a href="#fig:twomeans">1</a>. Informally speaking, there is no significant difference in the two populations if the shaded overlap area in Figure <a href="#fig:twomeans">1</a> is large.</p>
<div align="center"><a name="fig:twomeans" id="fig:twomeans"></a><a name="36"></a></p>
<table>
<caption align="bottom"><strong>Figure 1:</strong> The estimates of the means for two populations</caption>
<tr>
<td>
<div align="center"><img width="282" height="204" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./overlap.png" alt="Image overlap"></div>
</td>
</tr>
</table>
</div>
<p>Calculating this area is somewhat involved. Instead, we calculate the <em>t-statistic</em>:</p>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="126" height="62" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg14.png" alt="$\displaystyle t = \frac{(m_2 - m_1)}{s_D}$"></td>
<td nowrap width="10" align="right">(1)</td>
</tr>
</table>
</div>
<p><br clear="all"><br />
where <img width="26" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg15.png" alt="$ s_D$"> is the standard error of the difference between the two means; its square is computed from the <em>pooled variance</em> of the two populations.</p>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="325" height="64" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg16.png" alt="$\displaystyle {s_D}^2 = \frac{n_1\cdot {s_1}^2 + n_2\cdot {s_2}^2}{n_1 + n_2 - 2} \cdot (1/n_1 + 1/n_2)$"></td>
<td nowrap width="10" align="right">(2)</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p>For our Sneetch example, <img width="75" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg17.png" alt="$ s_D = 1.6$"> , and <img width="79" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg18.png" alt="$ t=2.499$"> , or the negative of that, depending on which group is Group 1. There are<br />
<img width="142" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg19.png" alt="$ 50 + 40 - 2 = 88$"> degrees of freedom.</p>
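<p>As a check on the arithmetic, Equations (1) and (2) can be evaluated directly from the table. This is a minimal sketch; the function name <em>pooled_t</em> is ours, and taking Plain-Bellies as Group 1 is an arbitrary choice that only affects the sign of the result.</p>

```python
import math

def pooled_t(n1, m1, s1, n2, m2, s2):
    """Return (t, s_d, degrees_of_freedom) following Equations (1) and (2)."""
    dof = n1 + n2 - 2
    # Equation (2): pooled variance times (1/n1 + 1/n2) gives s_D squared.
    pooled_var = (n1 * s1 ** 2 + n2 * s2 ** 2) / dof
    s_d = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    # Equation (1): difference of means in units of s_D.
    t = (m2 - m1) / s_d
    return t, s_d, dof

# Group 1 = Plain-Bellies, Group 2 = Star-Bellies (values from the table).
t, s_d, dof = pooled_t(40, 74, 8, 50, 78, 7)
print("s_D = %.2f, t = %.3f, degrees of freedom = %d" % (s_d, t, dof))
```

<p>Swapping the two groups flips the sign of the t-statistic but leaves its magnitude and the degrees of freedom unchanged.</p>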
<p>If the null hypothesis is true, and the two populations are identical, then <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> is distributed according to <em>Student&#8217;s distribution with<br />
<img width="105" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg20.png" alt="$ N_1 + N_2 - 2$"> degrees of freedom</em>. Student&#8217;s distribution is sort of a &#8220;stretched out&#8221; bell curve; as the degrees of freedom increase (<br />
<img width="122" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg21.png" alt="$ N_1 + N_2 \rightarrow \infty$"> ), Student&#8217;s distribution approaches the standard normal distribution, <img width="63" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg22.png" alt="$ N(0, 1)$"> <a name="tex2html2" href="#foot209" id="tex2html2"><sup>1</sup></a>.</p>
<p>In other words, if the null hypothesis is true, <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> should be near zero. The probability of seeing a <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> of a certain magnitude or greater under the null hypothesis is given by the area under the tails of Student&#8217;s distribution:</p>
<div align="center"><a name="57"></a>
<table>
<caption align="bottom"><strong>Figure 2:</strong> The area under the tails for a given <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"></caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./twotailedtest.jpg" alt="Image twotailedtest"></div>
</td>
</tr>
</table>
</div>
<p>This area is <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> . For the Sneetch example, <img width="82" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg28.png" alt="$ p = 0.014$"> .</p>
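<p>The area under the tails can be computed directly from the density of Student&#8217;s distribution. The following is a minimal Python sketch, not the code used to produce the figures in this article: it integrates the density numerically with Simpson&#8217;s rule using only the standard library, and the function names (<tt>t_pdf</tt>, <tt>t_central_area</tt>, <tt>two_tailed_p</tt>) are our own. In practice you would call the CDF routine of a statistics package instead.</p>

```python
import math

def t_pdf(x, df):
    """Density of Student's distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_central_area(t, df, steps=2000):
    """Area under the density between -|t| and +|t| (composite Simpson's rule)."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The density is symmetric, so double the area on [0, |t|].
    return 2 * total * h / 3

def two_tailed_p(t, df):
    """Probability of a t at least this large in magnitude under the null hypothesis."""
    return 1.0 - t_central_area(t, df)

# Sneetch example from the text: t = 2.499 with 88 degrees of freedom.
p = two_tailed_p(2.499, 88)
print(round(p, 3))
```

<p>The printed value should agree closely with the p-value quoted above for the Sneetch example. Because the area is symmetric, the sign of t (that is, which group is called Group 1) does not matter.</p>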
<p>The further out on the tails <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> is, the stronger the evidence that you should reject the null hypothesis. If you know for some reason that the mean of one population will be greater than or equal to the other, then you can use the <em>one-tailed test</em>:</p>
<div align="center"><a name="64"></a>
<table>
<caption align="bottom"><strong>Figure 3:</strong> The one-tailed test for a given <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"></caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./onetailedtest.jpg" alt="Image onetailedtest"></div>
</td>
</tr>
</table>
</div>
<p>This test halves the p-value as compared to the two-tailed test, making a given <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> value twice as significant. When in doubt about which to use, the two-tailed test is more conservative against false positives<a name="tex2html5" href="#foot210" id="tex2html5"><sup>2</sup></a>.</p>
<p>In discussions of t-tests, you will often see statements of the form:</p>
<blockquote><p>The t-test rejects the hypothesis that two means are equal if</p></blockquote>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="88" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg31.png" alt="$\displaystyle \vert t\vert &gt; t_{\alpha/2, \nu}$"></td>
<td nowrap width="10" align="right">&nbsp;&nbsp;&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<blockquote><p>for a two-tailed test, or</p></blockquote>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="64" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg32.png" alt="$\displaystyle t &gt; t_{\alpha, \nu}$"></td>
<td nowrap width="10" align="right">&nbsp;&nbsp;&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<blockquote><p>for a (right-sided) one-tailed test.</p></blockquote>
<p>The quantities on the right-hand side of the two inequalities above are called the <em>critical values</em> for a given significance level <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> (usually,<br />
<img width="75" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg34.png" alt="$ \alpha = 0.05$"> ) and <img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg35.png" alt="$ \nu$"> degrees of freedom. The one-tailed critical value is the value for which the area of the right-hand tail is equal to <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> .</p>
<div align="center"><a name="211"></a>
<table>
<caption align="bottom"><strong>Figure 4:</strong> Critical value for a one-tailed test. Reject the null hypothesis if<br />
<img width="66" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg2.png" alt="$ t &gt; t_{crit}$"></caption>
<tr>
<td>
<div align="center"><img width="385" height="252" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./onetailedcritval.png" alt="Image onetailedcritval"></div>
</td>
</tr>
</table>
</div>
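<p>For a two-tailed test, you must halve the area under a single tail, so that the two tails together have area <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> .</p>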
<div align="center"><a name="212"></a>
<table>
<caption align="bottom"><strong>Figure 5:</strong> Critical value for a two-tailed test. Reject the null hypothesis if<br />
<img width="77" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg3.png" alt="$ \vert t\vert &gt; t_{crit}$"></caption>
<tr>
<td>
<div align="center"><img width="384" height="248" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./twotailedcritval.png" alt="Image twotailedcritval"></div>
</td>
</tr>
</table>
</div>
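<p>If you do need a critical value, it can be recovered from the tail area by a simple search. The sketch below is one way to do this in Python with only the standard library: it integrates the density of Student&#8217;s distribution with Simpson&#8217;s rule and then bisects for the value of t whose right-hand tail has the requested area. The function names are our own, and a statistics package would normally supply an inverse CDF for this purpose.</p>

```python
import math

def t_pdf(x, df):
    """Density of Student's distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def right_tail_area(t, df, steps=2000):
    """Area to the right of t (for t >= 0), via Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def critical_value(tail_area, df):
    """Bisect for the t whose right-hand tail has the requested area (0 to 0.5)."""
    lo, hi = 0.0, 1.0
    while right_tail_area(hi, df) > tail_area:   # widen until the root is bracketed
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if right_tail_area(mid, df) > tail_area:
            lo = mid   # tail still too large: the critical value lies to the right
        else:
            hi = mid
    return (lo + hi) / 2

alpha, df = 0.05, 88
one_tailed_crit = critical_value(alpha, df)       # right tail has area alpha
two_tailed_crit = critical_value(alpha / 2, df)   # each tail has area alpha / 2
print(round(one_tailed_crit, 3), round(two_tailed_crit, 3))
```

<p>The two-tailed critical value is always the larger of the two, which is another way of seeing that the two-tailed test is the more conservative one.</p>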
<p>This convention dates back to the time when computational resources were scarce, and researchers had to use pre-computed tables of critical values, rather than calculating <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> directly. Today, general statistical packages such as R or Matlab can compute the CDFs of any number of standard distributions; once you can compute the CDF, directly computing <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> (the area under the tails) is straightforward. Despite this, many tutorials of the t-test (and of the F-test, and other significance tests) still adhere to the convention of comparing test statistics to critical values. This tends to needlessly ritualize the whole process, and make it seem more complicated and mysterious than it actually is, at least in my opinion.</p>
<p>David Freedman was very much against the continued practice of using critical values, rather than reporting the actual p-value. The last chapter of Freedman, Pisani and Purves [<a href="#Freedman07">FPP07</a>] is worth reading for its discussion of this, and other potential pitfalls of significance tests.</p>
<p>Some standard packages for evaluating t-tests, F-tests, or the ANOVA also present analysis results in terms of critical values. Most of them also print the actual p-value, along with the value of the test statistic and the degrees of freedom. Most researchers rightfully report the test statistics along with the actual significance levels: &#8220;we conclude that there is a significant difference in mathematical performance (t(88) = 2.499, p = 0.014)&#8230; .&#8221; Here, 88 gives the degrees of freedom, <img width="45" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg36.png" alt="$ t(88)$"> = 2.499 is the value of the t-statistic, and <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> is of course the p-value.</p>
<p>Similar comments apply to the F-test, discussed in more detail below.</p>
<h2><a name="SECTION00021000000000000000" id="SECTION00021000000000000000">Assumptions</a></h2>
<p>Strictly speaking, the t-test is only valid for normally distributed data where both populations have equal variance. However, the test is fairly robust to non-normal data [<a href="#Box53">Box53</a>]. You can verify that the sample variances are &#8220;equal enough&#8221; &#8211; that is, they could plausibly both be sampled observations from populations with the same variance, by using the <em>F-test</em>. The F-statistic</p>
<div align="center"><img width="102" height="40" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg37.png" alt="$\displaystyle F = {s_1}^2/{s_2}^2 $"></div>
<p>is distributed, when the two population variances are equal, according to the <em>F distribution with<br />
<img width="131" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg38.png" alt="$ (n_1 - 1,n_2 - 1)$"> degrees of freedom</em>.</p>
<div align="center"><a name="104"></a>
<table>
<caption align="bottom"><strong>Figure 6:</strong> The F distribution</caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./Ftest.jpg" alt="Image Ftest"></div>
</td>
</tr>
</table>
</div>
<p>In practice, the larger variance is usually put in the numerator, so <img width="54" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg39.png" alt="$ F &gt; 1$"> . The test should still be two-tailed, so you should double the area under the right-hand tail<a name="tex2html9" href="#foot107" id="tex2html9"><sup>3</sup></a>. In this situation, you want to check if you should accept the null hypothesis (that<br />
<img width="54" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg44.png" alt="$ F \approx 1$"> ) at a given significance level. If so, then you can go ahead and apply the t-test.</p>
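<p>The variance check just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the sample variances and group sizes below are hypothetical, the function names are our own, the tail area is obtained by integrating the F density with Simpson&#8217;s rule (standard library only), and the density routine assumes at least 2 numerator degrees of freedom.</p>

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom, d1 >= 2."""
    if x == 0.0:
        return 1.0 if d1 == 2 else 0.0
    log_pdf = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2) - math.lgamma(d2 / 2)
               + (d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - (d1 + d2) / 2 * math.log1p(d1 * x / d2))
    return math.exp(log_pdf)

def f_right_tail(f, d1, d2, steps=4000):
    """Area to the right of f, via Simpson's rule on [0, f]."""
    h = f / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(f, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1.0 - total * h / 3

def variance_ratio_test(var_a, n_a, var_b, n_b):
    """Two-tailed F-test for equal variances, larger variance in the numerator."""
    if var_a >= var_b:
        f, d1, d2 = var_a / var_b, n_a - 1, n_b - 1
    else:
        f, d1, d2 = var_b / var_a, n_b - 1, n_a - 1
    p = min(1.0, 2 * f_right_tail(f, d1, d2))   # double the right-hand tail
    return f, (d1, d2), p

# Hypothetical sample variances and group sizes, for illustration only.
f, dof, p = variance_ratio_test(64.0, 50, 49.0, 40)
print(round(f, 3), dof, round(p, 3))
```

<p>A large p-value here means the data give no reason to doubt that the variances are equal, which is the condition under which the ordinary t-test applies.</p>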
<p>There is a variation of the t-test for populations with unequal variances, called Welch&#8217;s t-test [<a href="#WikiWelch">Wikc</a>]. In this case, you are only checking if the means are equal, not that the distributions are the same.</p>
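<p>For reference, Welch&#8217;s statistic divides the difference of the means by the square root of the sum of the two squared standard errors, and evaluates it against Student&#8217;s distribution with a (generally non-integer) number of degrees of freedom given by the Welch-Satterthwaite formula. The sketch below uses the standard textbook formulas, which are not spelled out in this article; the function name and the input numbers are hypothetical.</p>

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic and its approximate degrees of freedom."""
    se1 = var1 / n1   # squared standard error of the first mean
    se2 = var2 / n2   # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical group summaries, for illustration only.
t, df = welch_t(84.0, 64.0, 50, 80.0, 49.0, 40)
print(round(t, 3), round(df, 1))
```

<p>When the two groups have the same size and the same sample variance, the degrees of freedom reduce to the familiar count from the equal-variance t-test; otherwise they are smaller, which makes the test more conservative.</p>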
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">The F-test for Analysis of Variance (ANOVA)</a></h1>
<p>ANOVA is an extension of the difference of means test above to the case of more than two populations. The null hypothesis in this case is that all the population means are equal &#8211; or more strictly, that all the treatment groups are drawn from the same population.</p>
<p>The simplest version of the ANOVA is the <em>one-way ANOVA</em>, where there are <img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg45.png" alt="$ k$"> <em>treatment groups</em> (populations) with <img width="21" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg46.png" alt="$ n_i$"> subjects (or repetitions, or replications) each, for a total of <img width="22" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg47.png" alt="$ N$"> subjects. Each population corresponds to a different level of a single factor (a treatment or a condition: for example, a type of medicine, or a Star-Bellied Sneetch vs. a Plain-Bellied Sneetch vs. a Grinch). Two- or three-way ANOVAs correspond to varying two or three different factors combinatorially. For example, we could do a two-way ANOVA of Sneetch math performance by considering both the belly type and the gender of the Sneetches.</p>
<div align="center"><a name="115"></a>
<table>
<caption align="bottom"><strong>Figure 7:</strong> Table for a Two-way ANOVA of Sneetch math performance</caption>
<tr>
<td>
<div align="center"><img width="203" height="243" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./twowayANOVA.png" alt="Image twowayANOVA"></div>
</td>
</tr>
</table>
</div>
<p>We will only discuss one-way ANOVA in this article, since that covers all the relevant ideas about calculating significance.</p>
<p>For a one-way ANOVA, we have the population means <img width="27" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg48.png" alt="$ m_i$"> and variances <img width="27" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg49.png" alt="$ {s_i}^2$"> . We can also calculate the overall mean <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg50.png" alt="$ m_0$"> , over the entire aggregate population.</p>
<p>The <em>between-groups mean sum of squares</em>, which is an estimate of the <em>between-groups variance</em>, is given by</p>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="260" height="58" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg51.png" alt="$\displaystyle {s_B}^2 = \frac{1}{k-1} \sum_i {n_i \cdot (m_i - m_0)^2}$"></td>
<td nowrap width="10" align="right">(3)</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><img width="33" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg52.png" alt="$ {s_B}^2$"> is sometimes designated <img width="48" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg53.png" alt="$ MS_B$"> . It is a measure of how the population means vary with respect to the grand mean.</p>
<p>The <em>within-group mean sum of squares</em> is an estimate of the <em>within-group variance</em>:</p>
<div align="center"><a name="eqn:varw" id="eqn:varw"></a>
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="256" height="77" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg54.png" alt="$\displaystyle {s_W}^2 = \frac{1}{N-k} \sum_i^k \sum_j^{n_i} (x_{ij} - m_i)^2$"></td>
<td nowrap width="10" align="right">(4)</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><img width="37" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg55.png" alt="$ {s_W}^2$"> is sometimes designated <img width="52" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg56.png" alt="$ MS_W$"> . It is a measure of the &#8220;average population variance&#8221;.</p>
<div align="center"><a name="142"></a>
<table>
<caption align="bottom"><strong>Figure 8:</strong> Within-group and between-group variance</caption>
<tr>
<td>
<div align="center"><img width="322" height="214" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./sigmas.png" alt="Image sigmas"></div>
</td>
</tr>
</table>
</div>
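<p>Equations 3 and 4 translate almost line for line into code. The following Python sketch computes both mean sums of squares for a small, made-up data set; the function name <tt>mean_squares</tt> and the numbers are hypothetical and chosen only so that the arithmetic is easy to follow by hand.</p>

```python
def mean_squares(groups):
    """Return (between-groups, within-group) mean sums of squares."""
    k = len(groups)                                   # number of treatment groups
    n_total = sum(len(g) for g in groups)             # N, the total number of subjects
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Equation 3: spread of the group means around the grand mean.
    ms_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means)) / (k - 1)
    # Equation 4: spread of the observations around their own group mean.
    ms_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g) / (n_total - k)
    return ms_between, ms_within

# Hypothetical scores for three treatment groups of three subjects each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
ms_b, ms_w = mean_squares(groups)
print(ms_b, ms_w)
```

<p>In this toy data set the group means are 2, 3 and 6, while every group has the same spread around its own mean, so the between-groups estimate comes out much larger than the within-group estimate.</p>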
<p>If the null hypothesis is true, then</p>
<div align="center"><img width="114" height="40" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg57.png" alt="$\displaystyle F = {s_B}^2/{s_W}^2 $"></div>
<p>is distributed according to the F distribution with<br />
<img width="116" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg58.png" alt="$ (k-1, n-k)$"> degrees of freedom.</p>
<div align="center"><a name="150"></a>
<table>
<caption align="bottom"><strong>Figure 9:</strong> p-value for the one-tailed F-test</caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./Ftest.jpg" alt="Image Ftest"></div>
</td>
</tr>
</table>
</div>
<p>That is, under the null hypothesis, the within-group and between-group variances should be about equal:<br />
<img width="54" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg44.png" alt="$ F \approx 1$"> . If <img width="54" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg59.png" alt="$ F &lt; 1$"> , then some of the treatment groups overlap other groups substantially, so practically speaking, one might as well accept the null hypothesis. Hence, a one-sided F test is good enough. As with the t-test, research papers usually give the value of the F statistic, the degrees of freedom, and the p-value: &#8220;<br />
<img width="238" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg4.png" alt="$ (F(2, 864) = 6.6, p = 0.0014)$"> &#8221;. In this example, the test statistic value is 6.6, and it was evaluated against the F distribution with (2, 864) degrees of freedom, which means that<br />
<img width="122" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg60.png" alt="$ k = 3, n = 867$"> . The p-value is 0.0014.</p>
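<p>The quoted p-value can be checked by computing the area under the right-hand tail of the F distribution. The sketch below does this in Python with the standard library only, integrating the F density with Simpson&#8217;s rule; the function names are our own, and the density routine assumes at least 2 numerator degrees of freedom. A statistics package would normally provide this tail area directly.</p>

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom, d1 >= 2."""
    if x == 0.0:
        return 1.0 if d1 == 2 else 0.0
    log_pdf = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2) - math.lgamma(d2 / 2)
               + (d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - (d1 + d2) / 2 * math.log1p(d1 * x / d2))
    return math.exp(log_pdf)

def f_right_tail(f, d1, d2, steps=4000):
    """One-tailed p-value: area to the right of f, via Simpson's rule on [0, f]."""
    h = f / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(f, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1.0 - total * h / 3

# The reported result from the text: F(2, 864) = 6.6.
p = f_right_tail(6.6, 2, 864)
print(round(p, 4))
```

<p>Only the right-hand tail is used, in line with the one-sided test described above: a small F is never taken as evidence against the null hypothesis.</p>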
<h2><a name="SECTION00031000000000000000" id="SECTION00031000000000000000">Assumptions</a></h2>
<p>Like the t-test, ANOVA assumes that the data is normally distributed with equal variances. According to Box [<a href="#Box53">Box53</a>], ANOVA is fairly robust to unequal variances when the population sizes are about the same, but you might want to check anyway. If all the populations are the same size (all the <img width="21" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg46.png" alt="$ n_i$"> are the same), the easiest way to check for equality of variances is an F-test of the statistic<br />
<img width="140" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg61.png" alt="$ F = {s_{max}}^2/{s_{min}}^2$"> with <img width="49" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg62.png" alt="$ n-1$"> degrees of freedom [<a href="#Sachs84">Sac84</a>]. In other cases, you can use Bartlett&#8217;s test [<a href="#WikiBartlett">Wika</a>] or Levene&#8217;s test [<a href="#WikiLevene">Wikb</a>]. Bartlett&#8217;s test uses a test statistic that is distributed as the <img width="24" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg63.png" alt="$ \chi^2$"> distribution, and Levene&#8217;s test uses one that is distributed as the F distribution. Levene&#8217;s test does not assume normally distributed data.</p>
<p>If the data are not normally distributed, or have unequal variance, often they can be transformed to a form that is closer to obeying the assumptions of ANOVA. The following table of transformations is based on [<a href="#Sachs84">Sac84</a>, p. 517], and other sources [<a href="#ndsu">Hor</a>].</p>
<div align="center"><a name="177"></a>
<table>
<caption align="bottom"><strong>Figure 10:</strong> Table of Transformations</caption>
<tr>
<td><img width="500" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg64.png" alt="\begin{figure}\begin{center} \begin{tabular}{\vert p{2.5in}\vert p{3.5in}\vert} ... ...} \ $\sigma \approx k\mu$\ &amp; \ \hline \end{tabular} \end{center}\end{figure}"></td>
</tr>
</table>
</div>
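<p>As a small illustration of why a transformation can help, consider the common situation where the standard deviation of each group grows in proportion to its mean. Taking logarithms is the usual textbook remedy for that pattern; the Python sketch below uses two made-up groups to show the effect and is not meant to reproduce the entries of the table above.</p>

```python
import math
import statistics

# Hypothetical groups: the second is the first scaled up by a factor of ten,
# so its standard deviation is ten times larger as well.
group_small = [8.0, 10.0, 12.5]
group_large = [80.0, 100.0, 125.0]

raw_ratio = statistics.stdev(group_large) / statistics.stdev(group_small)

log_small = [math.log(x) for x in group_small]
log_large = [math.log(x) for x in group_large]
log_ratio = statistics.stdev(log_large) / statistics.stdev(log_small)

print(round(raw_ratio, 3), round(log_ratio, 3))
```

<p>On the raw scale the two standard deviations differ by the same factor as the means; on the log scale a change of scale becomes a shift, so the two groups have the same spread and the equal-variance assumption is much closer to being met.</p>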
<p>Jim Deacon from the University of Edinburgh lists some suggestions as well [<a href="#deacon07">Dea</a>]. He also reminds us that running ANOVA on the transformed data will identify significant differences in the <em>transformed</em> data. This is <em>not</em> the same as saying there are significant differences in the original data!</p>
<h1><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Once the Null Hypothesis is Rejected</a></h1>
<p>If you are able to reject the ANOVA null hypothesis, you will usually want to know which population means are significantly different from the rest. Often, in fact, you are primarily interested in which population had the highest mean. For example, if you are comparing the efficacy of a new medicine A against existing medicines B and C, you are probably not too concerned about whether B and C perform significantly differently from each other, only about whether A is significantly better than both.</p>
<p>If all you care about is whether the highest mean is significantly higher than the others, you can simply test where the statistic</p>
<div align="center"><img width="211" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg65.png" alt="$\displaystyle (m_1 - m_2)/\sqrt{{s_W}^2 \cdot \frac{n_1 + n_2}{n_1\cdot n_2}} $"></div>
<p>falls on the Student-t distribution with <img width="50" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg66.png" alt="$ n-k$"> degrees of freedom. Here, <img width="37" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg55.png" alt="$ {s_W}^2$"> is the within-group variance, as calculated in Equation <a href="#eqn:varw">4</a>, <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg10.png" alt="$ m_1$"> and <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg11.png" alt="$ m_2$"> are the highest and second highest population means, <img width="16" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg5.png" alt="$ n$"> is the total number of samples (<br />
<img width="81" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg67.png" alt="$ n = \sum{n_i}$"> ), and <img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg45.png" alt="$ k$"> is the number of treatment groups.</p>
<p>This test is usually written</p>
<div align="center"><img width="409" height="67" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg68.png" alt="$\displaystyle m_1 - m_2 &gt; t_{(n-k, \alpha/2)} \cdot \sqrt{{s_W}^2 \cdot \frac{n_1 + n_2}{n_1\cdot n_2}} = LSD_{(1,2)} $"></div>
<p>where<br />
<img width="75" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg69.png" alt="$ t_{(n-k, \alpha/2)}$"> is the (two-sided) critical value for significance level <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> and <img width="50" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg66.png" alt="$ n-k$"> is the number of degrees of freedom to use. This quantity is called the <em>least significant difference (LSD)</em> between the highest and second highest means, and the test is usually called the <em>LSD test</em>.</p>
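<p>To make the LSD test concrete, the following Python sketch computes the least significant difference for a small hypothetical example. All of the input numbers are made up for illustration, the function names are our own, and the critical value is found by numerically integrating the density of Student&#8217;s distribution and bisecting, using only the standard library.</p>

```python
import math

def t_pdf(x, df):
    """Density of Student's distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def right_tail_area(t, df, steps=2000):
    """Area to the right of t (for t >= 0), via Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def critical_value(tail_area, df):
    """Bisect for the t whose right-hand tail has the requested area (0 to 0.5)."""
    lo, hi = 0.0, 1.0
    while right_tail_area(hi, df) > tail_area:
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if right_tail_area(mid, df) > tail_area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical inputs, for illustration only.
s_w_squared = 1.0      # within-group variance, as in Equation 4
n1, n2 = 3, 3          # sizes of the two groups with the highest means
m1, m2 = 6.0, 3.0      # highest and second highest group means
n_total, k = 9, 3      # total number of samples and number of groups
alpha = 0.05

t_crit = critical_value(alpha / 2, n_total - k)   # two-sided critical value
lsd = t_crit * math.sqrt(s_w_squared * (n1 + n2) / (n1 * n2))
significant = (m1 - m2) > lsd
print(round(t_crit, 3), round(lsd, 3), significant)
```

<p>With these numbers the difference between the two highest means exceeds the least significant difference, so the highest mean would be judged significantly higher than the runner-up at the 0.05 level.</p>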
<p>If you want to test all the population differences <img width="73" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg70.png" alt="$ m_i - m_j$"> for significance (or test the highest value against all of the others explicitly), then you need to take some care with the LSD test. Remember that a significance level of <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> means that, when the null hypothesis is true, each test has probability <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> of producing a false positive. Testing all possible population differences takes <img width="22" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg71.png" alt="$ K$"> = (<img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg45.png" alt="$ k$"> choose <img width="14" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg72.png" alt="$ 2$"> ) comparisons, or <img width="90" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg73.png" alt="$ K = k-1$"> comparisons if you sort all the means in descending order and compare adjacent ones. Testing the highest mean against all the lower values also takes <img width="90" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg73.png" alt="$ K = k-1$"> comparisons. This means you have up to a<br />
<img width="48" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg74.png" alt="$ K \cdot \alpha$"> probability of making at least one false positive error. So if you want the overall significance level to be at most <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> , each individual comparison should use a stricter significance threshold<br />
<img width="78" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg75.png" alt="$ p \leq \alpha/K$"> .</p>
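<p>The bookkeeping for multiple comparisons is easy to get wrong by hand, so here is a short Python sketch of it. The number of groups is hypothetical and the function name is our own. For comparison, the sketch also prints the chance of at least one false positive if the K tests were independent, which is an idealization, since pairwise comparisons that share groups are not independent.</p>

```python
from math import comb

def per_comparison_threshold(alpha, num_comparisons):
    """Split the overall significance level evenly across the comparisons."""
    return alpha / num_comparisons

alpha = 0.05
k = 5                      # hypothetical number of treatment groups
all_pairs = comb(k, 2)     # every difference m_i - m_j
adjacent_or_best = k - 1   # sorted neighbours, or the best group against each other group

for label, K in (("all pairs", all_pairs), ("k - 1 comparisons", adjacent_or_best)):
    bound = min(1.0, K * alpha)           # upper bound on the chance of any false positive
    independent = 1 - (1 - alpha) ** K    # the same chance if the K tests were independent
    print(label, K, round(bound, 3), round(independent, 3),
          per_comparison_threshold(alpha, K))
```

<p>Dividing the significance level by the number of comparisons keeps the overall false positive rate at or below the intended level, at the cost of making each individual comparison harder to pass.</p>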
<p>A preferred way to compare multiple means for significance (once the ANOVA null hypothesis has been rejected) is to use a <em>multiple range test</em> [<a href="#deacon07">Dea</a>] or <em>Tukey&#8217;s method</em> [<a href="#nistTukey">oST06</a>], rather than the LSD test. Tukey&#8217;s method tests all pairwise comparisons simultaneously, while the multiple range test starts with the broadest range (the highest and the lowest means) and works its way in until significance is lost.</p>
<h1><a name="SECTION00050000000000000000" id="SECTION00050000000000000000">Conclusion</a></h1>
<p>We&#8217;ve skimmed over many complications in this discussion. Hopefully, though, what we have gone over is enough to demystify much of the statistical discussion in research papers. Perhaps it will demystify the output of standard ANOVA and t-test packages for you as well.</p>
<p>Chong-ho Yu&#8217;s site [<a href="#yu09">hY</a>] gives a brief discussion of some of the issues that I&#8217;ve skimmed over. It also lists a few common non-parametric tests. These are tests that do not make assumptions about how the data is distributed, and so they may be more appropriate for data that is very non-normal, or for discrete data. They tend to have less power than parametric tests (that is, they have a lower true positive rate); so if the data is at all normal-like, parametric tests are preferred.</p>
<p>Significance tests are used in other applications beyond testing the difference in means or variances. They are used for testing whether events follow an expected distribution, for testing if there is a correlation between two variables, and for evaluating the coefficients of a regression analysis. We hope to cover some of these applications in future installments of this series.</p>
<h2><a name="SECTION00060000000000000000" id="SECTION00060000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Box53" id="Box53">Box53</a></dt>
<dd>G.E.P. Box, <i>Non-normality and tests on variances</i>, Biometrika <b>40</b> (1953), no.&nbsp;3/4, 318-335.</dd>
<dt><a name="deacon07" id="deacon07">Dea</a></dt>
<dd>Jim Deacon, <i>A multiple range test for comparing means in an analysis of variance</i>, <a href="http://www.biology.ed.ac.uk/research/groups/jdeacon/statistics/tress7.html">http://www.biology.ed.ac.uk/research/groups/jdeacon/statistics/tress7.html</a>.</dd>
<dt><a name="Freedman07" id="Freedman07">FPP07</a></dt>
<dd>David Freedman, Robert Pisani, and Roger Purves, <i>Statistics</i>, 4th ed., W. W. Norton &amp; Company, New York, 2007.</dd>
<dt><a name="ndsu" id="ndsu">Hor</a></dt>
<dd>Rich Horsley, <i>Transformations</i>, <tt><a name="tex2html14" href="http://www.ndsu.nodak.edu/ndsu/horsley/Transfrm.pdf" id="tex2html14">http://www.ndsu.nodak.edu/ndsu/horsley/Transfrm.pdf</a></tt>, Class notes, Plant Sciences 724, North Dakota State University.</dd>
<dt><a name="yu09" id="yu09">hY</a></dt>
<dd>Chong ho&nbsp;Yu, <i>Parametric tests</i>, <a href="http://www.creative-wisdom.com/teaching/WBI/parametric_test.shtml">http://www.creative-wisdom.com/teaching/WBI/parametric_test.shtml</a>.</dd>
<dt><a name="nistTukey" id="nistTukey">oST06</a></dt>
<dd>National&nbsp;Institute of&nbsp;Standards and Technology, <i>Tukey&#8217;s method</i>, NIST/SEMATECH e-Handbook of Statistical Methods, 2006, <a href="http://itl.nist.gov/div898/handbook/prc/section4/prc471.htm">http://itl.nist.gov/div898/handbook/prc/section4/prc471.htm</a>.</dd>
<dt><a name="Sachs84" id="Sachs84">Sac84</a></dt>
<dd>Lothar Sachs, <i>Applied statistics: A handbook of techniques</i>, 2nd ed., Springer-Verlag, New York, 1984.</dd>
<dt><a name="WikiBartlett" id="WikiBartlett">Wika</a></dt>
<dd>Wikipedia, <i>Bartlett&#8217;s test</i>, <tt><a name="tex2html15" href="http://en.wikipedia.org/wiki/Bartlett's_test" id="tex2html15">http://en.wikipedia.org/wiki/Bartlett's_test</a></tt>.</dd>
<dt><a name="WikiLevene" id="WikiLevene">Wikb</a></dt>
<dd>Wikipedia, <i>Levene&#8217;s test</i>, <tt><a name="tex2html16" href="http://en.wikipedia.org/wiki/Levene's_test" id="tex2html16">http://en.wikipedia.org/wiki/Levene's_test</a></tt>.</dd>
<dt><a name="WikiWelch" id="WikiWelch">Wikc</a></dt>
<dd>Wikipedia, <i>Welch&#8217;s t test</i>, <tt><a name="tex2html17" href="http://en.wikipedia.org/wiki/Welch's_t_test" id="tex2html17">http://en.wikipedia.org/wiki/Welch's_t_test</a></tt>.</dd>
</dl>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot209" id="foot209">&#8230;</a><a href="#tex2html2"><sup>1</sup></a></dt>
<dd>Remember from the last installment that when you are estimating the mean of a distribution with unknown mean <img width="16" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg23.png" alt="$ \mu$"> and unknown variance <img width="24" height="19" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg24.png" alt="$ \sigma^2$"> , the 95% confidence interval around your estimate is<br />
<img width="115" height="39" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg25.png" alt="$ m \pm 2\cdot \sigma/\sqrt{n}$"> . Intuitively speaking, Student&#8217;s distribution is what you get if you calculate confidence intervals using the estimated standard deviation <img width="14" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg7.png" alt="$ s$"> instead of the true but unknown standard deviation <img width="16" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg26.png" alt="$ \sigma$"> . The distribution is stretched out compared to the normal distribution to reflect this increased uncertainty.</dd>
<dt><a name="foot210" id="foot210">&#8230; positives</a><a href="#tex2html5"><sup>2</sup></a></dt>
<dd>In his textbook <em>Statistics</em>, Freedman tells an anecdote about a study that was published in the <em>Journal of the AMA</em>, claiming to demonstrate that cholesterol causes heart attacks. The treatment group that took a cholesterol-reducing drug had &#8220;significantly fewer&#8221; heart attacks than the control group (<br />
<img width="82" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg29.png" alt="$ p \approx 0.035$">). A closer reading revealed that the researchers used a one-tailed test, which is equivalent to <em>assuming</em> that the treatment group was going to have fewer heart attacks. What if the drug had <em>increased</em> the risk of heart attack? The proper two-tailed significance of their results would have been<br />
<img width="73" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg30.png" alt="$ p \approx 0.07$">, which is higher than <em>JAMA</em>&#8217;s strict significance threshold of 0.05. [<a href="#Freedman07">FPP07</a>, p. 550]</dd>
<dt><a name="foot107" id="foot107">&#8230; tail</a><a href="#tex2html9"><sup>3</sup></a></dt>
<dd>The area to the right of <img width="19" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg40.png" alt="$ F$"> with <img width="45" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg41.png" alt="$ (a,b)$"> degrees of freedom is equal to the area to the left of <img width="38" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg42.png" alt="$ 1/F$"> with <img width="45" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg43.png" alt="$ (b,a)$"> degrees of freedom.</dd>
</dl>
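<p>Two of the footnotes above can be checked numerically. The following is a minimal sketch, not the method used in the post: it uses only the Python standard library, it assumes a normally distributed test statistic for the one-tailed versus two-tailed comparison (the study's actual test is not specified here), and the degrees of freedom (4, 8) and the value F = 2.5 are arbitrary illustrative choices. The helper names <code>f_pdf</code> and <code>f_cdf</code> are hypothetical.</p>

```python
import math
from statistics import NormalDist


def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x == 0.0:
        return 0.0  # valid here because the example only uses d1 above 2
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = (0.5 * d1 * math.log(d1 * x) + 0.5 * d2 * math.log(d2)
               - 0.5 * (d1 + d2) * math.log(d1 * x + d2)
               - math.log(x) - log_beta)
    return math.exp(log_pdf)


def f_cdf(x, d1, d2, steps=2000):
    """Area to the left of x, by composite Simpson's rule on [0, x]."""
    h = x / steps  # steps must be even for Simpson's rule
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3


# Footnote 3: right tail of F with (a, b) degrees of freedom
# versus left tail of 1/F with (b, a) degrees of freedom.
a, b, F = 4, 8, 2.5  # illustrative values, not taken from the post
right_tail = 1.0 - f_cdf(F, a, b)
left_tail = f_cdf(1.0 / F, b, a)
print("right tail of F(a,b):  ", right_tail)
print("left tail of 1/F (b,a):", left_tail)

# Footnote 2: for a symmetric test statistic (assumed normal here),
# the two-tailed significance is twice the one-tailed significance.
one_tailed = 0.035
z = NormalDist().inv_cdf(1.0 - one_tailed)
two_tailed = 2.0 * (1.0 - NormalDist().cdf(z))
print("two-tailed significance:", two_tailed)
```

<p>The two tail areas should agree up to the error of the numerical integration, and the two-tailed significance comes out at twice the one-tailed value, which is how 0.035 becomes 0.07 in the anecdote.</p>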
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

