<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Expository Writing</title>
	<atom:link href="http://www.win-vector.com/blog/category/expository-writing/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Thu, 29 Jul 2010 17:09:49 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Living in A Lognormal World</title>
		<link>http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=living-in-a-lognormal-world</link>
		<comments>http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/#comments</comments>
		<pubDate>Wed, 03 Feb 2010 23:46:37 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[customer value]]></category>
		<category><![CDATA[lognormal distribution]]></category>
		<category><![CDATA[long tail theory]]></category>
		<category><![CDATA[McPhee's Theory of Exposure]]></category>
		<category><![CDATA[median versus mean]]></category>
		<category><![CDATA[power law distribution]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1388</guid>
		<description><![CDATA[Recently, we had a client come to us with (among other things) the following question: Who is more valuable, Customer Type A, or Customer Type B? This client already tracked the net profit and loss generated by every customer who used his services, and had begun to analyze his customers by group. He was especially [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Permanent Link: Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Permanent Link: Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Recently, we had a client come to us with (among other things) the following question:<br />
Who is more valuable, Customer Type A, or Customer Type B?</p>
<p>This client already tracked the net profit and loss generated by every customer who used his services, and had begun to analyze his customers by group. He was especially interested in Customer Type A; his gut instinct told him that Type A customers were quite profitable compared to the others (Type B) and he wanted to back up this feeling with numbers.</p>
<p>He found that, on average, Type A customers generate about $92 profit per month, and Type B customers average about $115 per month (The data and figures that we are using in this discussion aren&#8217;t actual client data, of course, but a notional example). He also found that while Type A customers make up about 4% of the customer base, they generate less than 4% of the net profit per month. So Type A customers actually seem to be less profitable than Type B customers. Apparently, our client was mistaken.</p>
<p>Or was he? <span id="more-1388"></span></p>
<p>A little more elementary statistics revealed that the median profit generated by Type A customers is $65; that is, half the customers from group A generate more than $65 profit per month. The median for Type B customers is about $4.80, so half the customers from group B generate less than five dollars profit every month. Maybe our client&#8217;s instincts aren&#8217;t completely off-base.</p>
<p>Let&#8217;s look at the distribution of net profit across both customer populations:</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/densityAll.png" border="0" alt="densityAll.png" width="500" /></div>
</td>
</tr>
</tbody>
<caption><em>Figure 1: Distribution of net profit for Type A customers (blue) and Type B customers (red). The x-axis gives the net profit or loss, and the y-axis gives the fraction of the population that generates a given net profit. </em><br />
</caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>This pattern is typical among the customers of many businesses. The majority of customers generate relatively moderate profit (or loss); but there is an important minority of large-profit and large-loss customers out on both tails. In this case, the monthly customer value actually ranges from losses in the tens of thousands to profits of several hundred thousand (I clipped the graph, for &#8220;clarity&#8221;).</p>
<p>I hesitate to call these large magnitude customers &#8220;outliers&#8221; because that term implies anomalous, possibly erroneous, data. In this case, the &#8220;outliers&#8221; are relatively rare, but important, customers who can potentially make the difference between a company that is in the black or in the red. Still, they are the exception and their behavior doesn&#8217;t necessarily tell you anything about the behavior of your typical customer. Knowing the mean profitability of a given customer group is important, of course, but the estimate will be dominated by your exceptionally profitable or lossy customers in that group, and as we&#8217;ve seen, that hides information about the majority of your customers.</p>
<p>You might remember from our <a href="http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/">Good Graphs article</a> that if you have positively skewed data with a wide dynamic range, graphing the data on a log scale helps you see phenomena across the entire range of data that you might miss on an ordinary graph. Unfortunately, a log scale only works for positive values, and we have data here in both the positive and negative range. So let&#8217;s split the customers into three groups: profitable, unprofitable, and break-even. About 5-6% of the customer base is break-even, roughly the same proportion in Groups A and B; we&#8217;ll ignore them for now, and look at the profitable customers first (over 80% of the customers, in both groups).</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/02/positiveCusts.png" border="0" alt="positiveCusts.png" width="500" /></div>
</td>
</tr>
</tbody>
<caption><em>Figure 2: Distribution of profit from profitable Type A customers (blue) and Type B customers (red). The x-axis gives net profit on a log 10 scale, so every labelled tick corresponds to a change by a factor of 100 (e.g., 10^0 = $1, 10^2 = $100, and so on). The y-axis represents the fraction of the profitable customer base that generates a given profit.<br />
</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>Now we can clearly see that (among profitable customers) the typical Type A customer is in fact more profitable than the typical Type B customer. The mean profit from profitable Type A customers is about $227, and the median profit is about $93 (shown by the dashed blue line). About 2/3 of the profitable Type A customers generate between $21 and $400 in profit, and over 95% of them generate between $5 and $1721 in profit. We can call that 95% the set of &#8220;typical&#8221; profitable Type A customers. That&#8217;s not a standard definition, but it&#8217;s an intuitive one, and useful for this discussion.</p>
<p>Approximately 2.5% of profitable Type A customers generate profits greater than $1721; let&#8217;s call them the Type A &#8220;best-customers,&#8221; some of whom generate profits in the tens of thousands. They are responsible for 30% of the profit that comes from profitable Type A customers, and 3% of the profit that comes from all profitable customers (even though they only make up 0.2% of that population).</p>
<p>Profitable Type B customers generate $148 mean profit, and about $7.67 median profit (the red dashed line). A typical profitable Type B customer generates between six cents and $1031 in profit — a lower range than what the typical Type A customer generates, although the very highest-performing Type B customers are competitive with the highest-performing Type A customers (about 130 Type B customers outperform all the Type A customers).</p>
<p>Unfortunately, when Type A customers are unprofitable, they are typically more unprofitable than those of Type B. This is another reason why the mean profit from Type A customers overall was so low. Our client correctly perceived that Type A customers are typically quite profitable, but there is a small population of real clunkers in the group, too.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/02/negativeCusts.png" border="0" alt="negativeCusts.png" width="500" /></div>
</td>
</tr>
</tbody>
<caption><em>Figure 3: Distribution of loss from unprofitable Type A customers (blue) and Type B customers (red). The x-axis gives loss on a log 10 scale; further to the right on the graph means a larger loss. An unprofitable Type A customer loses a median of $137 a month, and a mean of $1180. Unprofitable Type B customers lose a median of $4.80, and a mean of $210.<br />
</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>We can do a similar analysis for the entire base of profitable customers. We would find that the typical profitable customer generates between six cents and $1200 in profit every month (median $8.65, mean $153), and that the 2.5% of best-customers generate over 60% of the profits.</p>
<p><strong>The Lognormal Distribution</strong></p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/02/lognormalComp.png" border="0" alt="lognormalComp.png" width="536" height="270" /></div>
</td>
</tr>
</tbody>
<caption><em> Figure 4: (Left) Distribution of profitable customers (graph clipped at $10,000). The x-axis gives the net profit, and the y-axis gives the fraction of the population that generates a given net profit. (Right) Distribution of profitable customers plotted on a log scale.<br />
</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>The distribution of highly skewed positive data, like the value of profitable customers, incomes, sales, or stock prices, can often be modelled as a <a href="http://en.wikipedia.org/wiki/Log-normal_distribution">lognormal distribution</a>: that is, the log of the data is distributed in a bell-shaped curve centered at the log of the median of the data (remember, for a normal curve, the median and the mean are the same). In our case, both the profits (seen above, in Figure 4) and the losses are distributed approximately lognormally. For lognormal populations, the mean is generally much higher than the median, and the bulk of the contribution towards the mean will be made by a small population of highest-valued data points. <em>If you use the mean as a stand-in for value, you will overstate the value of most of your customers.</em></p>
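<p>To make the gap between mean and median concrete, here is a minimal simulation sketch in Python. It is not the analysis behind the figures above: the function names, the sample size, and the parameters (a median of $8.65 and a log-space standard deviation of 2.4, picked only to loosely resemble the notional profitable-customer numbers) are all assumptions for illustration.</p>

```python
import math
import random
import statistics


def simulate_profits(n, median, sigma, seed=0):
    """Draw n lognormal values with the given median and log-space sd.

    All parameters are illustrative assumptions, not client data.
    """
    rng = random.Random(seed)
    mu = math.log(median)  # the log of the median is the center in log space
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def top_share(values, fraction):
    """Share of the total contributed by the top `fraction` of values."""
    ranked = sorted(values, reverse=True)
    cutoff = int(len(ranked) * fraction)
    return sum(ranked[:cutoff]) / sum(ranked)


profits = simulate_profits(n=100_000, median=8.65, sigma=2.4)
print("median:", round(statistics.median(profits), 2))
print("mean:", round(statistics.mean(profits), 2))
print("share of total from top 2.5%:", round(top_share(profits, 0.025), 2))
```

<p>In a run like this, the mean should land far above the median, and a small top slice of the simulated customers should account for a large share of the total, which is the qualitative pattern described above.</p>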
<p>If your customer value data is distributed approximately lognormally, then you can quickly estimate the range of values that 95% of your customers will fall into. About 95% of normally distributed data will fall within plus/minus two standard deviations of the mean, and taking logarithms converts multiplication into addition. So: if <em>sd</em> is the standard deviation of the natural log of your customer value data,  <em>M</em> is the median profit, and <em>k</em> = exp(<em>sd</em>), then 95% of your customers will fall in the value range (<em>M/(k*k)</em>, <em>M*k*k</em>). The 2.5% of customers who generate more than <em>M*k*k</em> profit are your best-customers, who often drive a majority of your profit.</p>
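<p>The rule of thumb above can be sketched in a few lines of Python. This is only a sketch under the assumption that the data is positive and roughly lognormal; the function name <em>typical_range</em> and the toy input are hypothetical, and the sample standard deviation is used for <em>sd</em>.</p>

```python
import math
import statistics


def typical_range(values):
    """Return (low, median, high) for positive, roughly lognormal data.

    low = M/(k*k) and high = M*k*k, where M is the median and
    k = exp(sd of the natural log of the data).
    """
    logs = [math.log(v) for v in values]
    sd = statistics.stdev(logs)  # sample standard deviation in log space
    median = statistics.median(values)
    k = math.exp(sd)
    return median / (k * k), median, median * k * k


# Toy data, chosen only so the arithmetic is easy to follow.
low, median, high = typical_range([1, 10, 100])
print(low, median, high)
```

<p>With the toy input, the standard deviation of the natural logs is ln(10), so <em>k</em> is about 10 and the range comes out to roughly 0.1 through 1000 around a median of 10.</p>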
<p><strong>Long Tail Theory</strong></p>
<p>The distribution of customers above sounds a lot like Chris Anderson&#8217;s <a href="http://www.wired.com/wired/archive/12.10/tail.html">Long Tail Theory</a> of consumer goods. Most of the revenue of (for example) a bookseller or a music store comes from a few &#8220;hits&#8221;, or blockbusters, with the rest of the merchant&#8217;s inventory out along the tail of Figure 5, moving a relatively small volume per title.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/LongTailComp.png" border="0" alt="LongTailComp.png" width="522" height="644" /></div>
</td>
</tr>
</tbody>
<caption><em> Figure 5: (Top) A notional long tail curve. The y-axis represents sales volume, and the x-axis represents goods ranked from most to least popular. The highest selling goods are to the left. Note that this figure represents the sales curve differently from how the distribution of customer value is represented on the left side of Figure 4. (Bottom) The customer value data (top 10,000 customers) from Figure 4, plotted in the style above. The y-axis has been limited to $50,000 for clarity.<br />
</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>Anderson generally assumes that sales of such goods are distributed as a power law distribution, rather than a lognormal; the log of power law data isn&#8217;t distributed symmetrically, but actually has a longer tail to the right. This means that even for the log of the data, the mean is higher than the median. In fact, in some cases, the mean of a power law distribution can be infinite. If sales volume is power law distributed, then top-selling hits are responsible for an even larger percentage of total sales volume than would be the case with a lognormal.</p>
<p>The <a href="http://en.wikipedia.org/wiki/Pareto_distribution">Pareto Distribution</a>, which is one form of a power law distribution, has been proposed as an alternative to the lognormal for modelling income distribution and other similar phenomena. Researchers have debated whether lognormal or Pareto is a better model for income distribution since at least the 1950s. Qualitatively, the two distributions have similar behavior. There are certain estimation and forecasting tasks where it does make a difference if your data follows a power law rather than a lognormal, but for the purposes of this discussion, it doesn&#8217;t really matter. For those who are interested, Michael Mitzenmacher has a <a href="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.im/1089229510">fairly approachable discussion</a> about the difference between power laws and lognormal distributions.</p>
<p>Back to Long Tail Theory. Historically, merchants tend to concentrate on high-volume items, due to space limitations and the cost of holding inventory. Overall, however, the sum total of tail-product sales will add up to a respectable volume, especially for web retailers who have unlimited &#8220;floor space&#8221; — or so the Long Tail theory goes. A retailer must then decide whether to follow the traditional &#8220;hits-oriented&#8221; strategy, or a more &#8220;tail-oriented&#8221; strategy that caters to the numerous niche markets.</p>
<p>If we draw an analogy with customer value, then best-customers are &#8220;hits.&#8221; Obviously, our client would like to &#8220;fire&#8221; his unprofitable customers while retaining his best-performing customers, and even attract more customers like them. But what about his little customers — the 95% of customers in the typical range? If his retention and growth strategy focuses primarily on attracting and retaining big customers, he is following a hits-oriented strategy. If his campaign also includes reaching out to little customers, then he is following something analogous to a tail strategy.</p>
<p>Not all business works like a music or book seller; the appropriate strategy will vary. Still, we can think of a few reasons why keeping little customers happy is a good idea.</p>
<p>For one thing, big customers are not only rare, but they are the ones that your competitors covet the most. Little customers, meanwhile, can still add up to a respectable chunk of change (close to 40% of net profit in our example above). A solid cushion of smaller customers may soften the blow to your profit margin, should a few of your bigger customers defect.</p>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/logos1.png" border="0" alt="logos.png" width="369" height="336" /></div>
<p>Consider computer sales. Microsoft and Dell serve both the corporate and consumer markets. To judge from their past marketing practices, they consider business customers to be the more valuable segment (see <a href="http://www.win-vector.com/blog/2009/07/microsoft-store-again/">here</a> for a rant somewhat related to this topic). But business IT sales have declined in the current moribund economic climate; analysts attribute the growth in computer sales for the last quarter of 2009 <a href="http://www.cultofmac.com/apple-saw-24-growth-in-q4-2009-as-computer-market-bounces-back/26184">primarily to consumer spending</a>. Dell&#8217;s market growth for that last quarter was much lower than that of HP, Acer, and Apple, which are more consumer-oriented companies. It&#8217;s also worth noting that Microsoft saw a 14% <a href="http://www.neowin.net/news/main/09/10/23/windows-and-xbox-help-microsoft-earnings-beat-predictions">decline in revenue</a> for the quarter ending September 30, 2009, compared to the year-ago quarter (and their earnings were in large part due to sales of the Xbox, a consumer product), while at the same time, consumer-oriented Apple saw a <a href="http://www.cultofmac.com/apple-saw-24-growth-in-q4-2009-as-computer-market-bounces-back/26184">24% increase in revenue</a> from its year-ago quarter.</p>
<p>Your pool of little customers is also a pool of potential future best-customers. And <a href="http://insight.kellogg.northwestern.edu/index.php/Kellogg/article/predicting_customer_lifetimevalue">you can&#8217;t always guess which ones</a>. So a wise strategy might be to allocate part of your retention and growth campaign to providing loyalty incentives to smaller customers, and educating them about how your higher-end services or products might benefit them. Those little customers who have the means or opportunity to move on to the next level might very well appreciate your efforts, and stay with you, rather than defecting to a competitor.</p>
<p><strong>Optimizing Sales vs. Optimizing Customers</strong></p>
<p>One last thought about retail hits and high-value customers. McPhee&#8217;s Theory of Exposure, which is cited by Anita Elberse in her Harvard Business Review article <a href="http://hbr.org/2008/07/should-you-invest-in-the-long-tail/ar/1">&#8220;Should You Invest in the Long Tail?&#8221;</a>, states that the popularity of music, film, TV or books is largely driven by &#8220;marginal audience participants&#8221;: the casual, or light, consumer. Casual consumers gravitate to already popular products because they have limited exposure to alternatives, and hence limited knowledge of them. Consumers of more obscure products, on the other hand, tend to be heavy (and knowledgeable) consumers: voracious readers, dedicated music or film buffs, or enthusiasts of specific genres, like science-fiction or horror.</p>
<div style="text-align: center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2010/01/albums.png" border="0" alt="albums.png" width="300" /></div>
<p>McPhee&#8217;s research was done in 1963, using subjects who had a fairly small range of choices, compared to internet scale. Elberse found, however, that the phenomena McPhee described still held for the internet merchants that she studied. She uses this observation (along with McPhee&#8217;s companion theory of <a href="http://en.wikipedia.org/wiki/Double_jeopardy_(marketing)">Double Jeopardy</a>) to argue that retailers should not substantially alter their traditional hits-based strategies. There is an alternative interpretation:</p>
<p><em>If your business follows McPhee&#8217;s theory, then hit products disproportionately attract low-value (low-volume) customers, and vice-versa. </em></p>
<p>So an overly hits-oriented strategy will skew you towards a base of low-value customers. Indeed, <a href="http://sethgodin.typepad.com/seths_blog/2009/12/its-not-the-rats-you-need-to-worry-about.html">Seth Godin argues</a> that iTunes and Amazon, which are in a better position to implement a more tail-oriented strategy, are thriving at the expense of physical stores exactly because they have been able to steal the quality (high-volume) customers away.</p>
<p>The moral is that both sales and customer value live in a lognormal world, where blockbuster products are marketed to a large cloud of low revenue customers, and high revenue best-customers are supported by large catalogues of low volume products. Fail to serve one side of this relationship, and you risk losing the other side.</p>


]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Statistics to English Translation, Part 2b: Calculating Significance</title>
		<link>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=statistics-to-english-translation-part-2b-calculating-significance</link>
		<comments>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 07:02:40 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[F-test]]></category>
		<category><![CDATA[significance]]></category>
		<category><![CDATA[t-test]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1281</guid>
		<description><![CDATA[In the previous installment of the Statistics to English Translation, we discussed the technical meaning of the term &#8220;significant&#8221;. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/' rel='bookmark' title='Permanent Link: Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/' rel='bookmark' title='Permanent Link: &#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures'>&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Permanent Link: Living in A Lognormal World'>Living in A Lognormal World</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-’significant’-doesn’t-always-mean-’important’/">previous installment</a> of the <a href="http://www.win-vector.com/blog/category/statistics-to-english-translation/">Statistics to English Translation</a>, we discussed the technical meaning of the term &#8220;significant&#8221;. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance in research papers: statements like &#8220;<!-- MATH  $(F(2, 864) = 6.6, p = 0.0014)$  --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg4.png" border="0" alt="$ (F(2, 864) = 6.6, p = 0.0014)$" width="238" height="37" align="middle" /> &#8221;.</p>
<p>As in the <a href="http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-’significant’-doesn’t-always-mean-’important’/">last article</a>, we will concentrate on situations where we want to test the difference of means. You should read that previous article first, so you are familiar with the terminology that we use in this one.</p>
<p>A pdf version of this current article can be found <a href="http://win-vector.com/dfiles/ste2b_calculatesig.pdf">here</a>.<br />
<span id="more-1281"></span></p>
<h1><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">How is Significance Determined?</a></h1>
<p>Generally speaking, we calculate significance by computing a <em>test statistic</em> from the data. If we assume a specific null hypothesis, then we know that this test statistic will be distributed in a certain way. We can then compute how likely it is to observe our value of the test statistic, if we assume that the null hypothesis is true.</p>
<p>We&#8217;ll explain the use of a test statistic with our Sneetch example from the last installment.</p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">The t-test for Difference of Means</a></h1>
<p>Suppose that the test scores for both Star-Bellies and Plain-Bellies are normally distributed, with the means and standard deviations as given in the table below.</p>
<div align="center">
<table cellpadding="3" border="1">
<tr>
<td align="center">&nbsp;</td>
<td align="center"><img width="16" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg5.png" alt="$ n$"> (number of subjects)</td>
<td align="center"><img width="21" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg6.png" alt="$ m$"> (mean score)</td>
<td align="center"><img width="14" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg7.png" alt="$ s$"> (standard deviation)</td>
</tr>
<tr>
<td align="center">Star-Bellies</td>
<td align="center">50</td>
<td align="center">78</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">Plain-Bellies</td>
<td align="center">40</td>
<td align="center">74</td>
<td align="center">8</td>
</tr>
</table>
</div>
<p>Remember from the previous installment that we can estimate the true population means <img width="24" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg8.png" alt="$ \mu_1$"> and <img width="24" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg9.png" alt="$ \mu_2$"> as normally distributed around the empirical population means <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg10.png" alt="$ m_1$"> and <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg11.png" alt="$ m_2$"> respectively, with variances<br />
<img width="52" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg12.png" alt="$ \sigma^2/{n_1}$"> and<br />
<img width="52" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg13.png" alt="$ \sigma^2/{n_2}$"> . This is shown in Figure <a href="#fig:twomeans">1</a>. Informally speaking, there is no significant difference in the two populations if the shaded overlap area in Figure <a href="#fig:twomeans">1</a> is large.</p>
<div align="center"><a name="fig:twomeans" id="fig:twomeans"></a><a name="36"></a>
<table>
<caption align="bottom"><strong>Figure 1:</strong> The estimates of the means for two populations</caption>
<tr>
<td>
<div align="center"><img width="282" height="204" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./overlap.png" alt="Image overlap"></div>
</td>
</tr>
</table>
</div>
<p>Calculating this area is somewhat involved. Instead, we calculate the <em>t-statistic</em>:</p>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="126" height="62" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg14.png" alt="$\displaystyle t = \frac{(m_2 - m_1)}{s_D}$"></td>
<td nowrap width="10" align="right">(1)</td>
</tr>
</table>
</div>
<p><br clear="all"><br />
where <img width="26" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg15.png" alt="$ s_D$"> is the pooled estimate of the standard error of the difference between the two means; its square, the <em>pooled variance</em>, is given by Equation (2):</p>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="325" height="64" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg16.png" alt="$\displaystyle {s_D}^2 = \frac{n_1\cdot {s_1}^2 + n_2\cdot {s_2}^2}{n_1 + n_2 - 2} \cdot (1/n_1 + 1/n_2)$"></td>
<td nowrap width="10" align="right">(2)</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p>For our Sneetch example, <img width="75" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg17.png" alt="$ s_D = 1.6$"> , and <img width="79" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg18.png" alt="$ t=2.499$"> , or the negative of that, depending on which group is Group 1. There are<br />
<img width="142" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg19.png" alt="$ 50 + 40 - 2 = 88$"> degrees of freedom.</p>
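<p>As a rough sketch of Equations 1 and 2, the calculation might look like the following in Python. The function names and the input numbers are my own, chosen only for illustration; they are not the Sneetch data, and I assume the s values are the sample standard deviations as used in Equation 2.</p>

```python
import math

def pooled_standard_error(n1, s1, n2, s2):
    """s_D from Equation 2; s1 and s2 are the two sample standard deviations."""
    pooled_variance = (n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_variance * (1.0 / n1 + 1.0 / n2))

def t_statistic(m1, m2, s_d):
    """t from Equation 1."""
    return (m2 - m1) / s_d

# Hypothetical group sizes, means and standard deviations (illustration only).
n1, m1, s1 = 50, 70.0, 8.0
n2, m2, s2 = 40, 74.0, 7.0

s_d = pooled_standard_error(n1, s1, n2, s2)
t = t_statistic(m1, m2, s_d)
dof = n1 + n2 - 2
print(f"s_D = {s_d:.3f}, t = {t:.3f}, degrees of freedom = {dof}")
```

<p>Swapping which group is Group 1 only flips the sign of t, as noted above.</p>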
<p>If the null hypothesis is true, and the two populations are identical, then <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> is distributed according to <em>Student&#8217;s distribution with<br />
<img width="105" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg20.png" alt="$ N_1 + N_2 - 2$"> degrees of freedom</em>. Student&#8217;s distribution is sort of a &#8220;stretched out&#8221; bell curve; as the degrees of freedom increase (<br />
<img width="122" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg21.png" alt="$ N_1 + N_2 \rightarrow \infty$"> ), Student&#8217;s distribution approaches the standard normal distribution, <img width="63" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg22.png" alt="$ N(0, 1)$"> <a name="tex2html2" href="#foot209" id="tex2html2"><sup>1</sup></a>.</p>
<p>In other words, if the null hypothesis is true, <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> should be near zero. The probability of seeing a <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> of a certain magnitude or greater under the null hypothesis is given by the area under the tails of Student&#8217;s distribution:</p>
<div align="center"><a name="57"></a></p>
<table>
<caption align="bottom"><strong>Figure 2:</strong> The area under the tails for a given <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"></caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./twotailedtest.jpg" alt="Image twotailedtest"></div>
</td>
</tr>
</table>
</div>
<p>This area is <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> . For the Sneetch example, <img width="82" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg28.png" alt="$ p = 0.014$"> .</p>
<p>The further out on the tails <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> is, the stronger the evidence that you should reject the null hypothesis. If you know for some reason that the mean of one population will be greater than or equal to the other, then you can use the <em>one-tailed test</em>:</p>
<div align="center"><a name="64"></a></p>
<table>
<caption align="bottom"><strong>Figure 3:</strong> The one-tailed test for a given <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"></caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./onetailedtest.jpg" alt="Image onetailedtest"></div>
</td>
</tr>
</table>
</div>
<p>This test halves the p-value as compared to the two-tailed test, making a given <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> value twice as significant. When in doubt about which to use, the two-tailed test is more conservative against false positives<a name="tex2html5" href="#foot210" id="tex2html5"><sup>2</sup></a>.</p>
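<p>To make the tail areas concrete, here is a minimal sketch that integrates the density of Student&#8217;s distribution numerically, using only the Python standard library. The function names and the choice of Simpson&#8217;s rule are mine; a statistics package would use its built-in CDF instead. Fed the Sneetch values (t = 2.499 with 88 degrees of freedom), the two-tailed result should come out close to the 0.014 quoted above.</p>

```python
import math

def student_t_pdf(x, dof):
    """Density of Student's distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2.0 * math.log1p(x * x / dof))

def one_tailed_p(t, dof, steps=2000):
    """Area under one tail beyond |t|; `steps` must be even (Simpson's rule)."""
    upper = abs(t)
    if upper == 0.0:
        return 0.5
    h = upper / steps
    total = student_t_pdf(0.0, dof) + student_t_pdf(upper, dof)
    for i in range(1, steps):
        weight = 4 if i % 2 == 1 else 2
        total += weight * student_t_pdf(i * h, dof)
    area_zero_to_t = total * h / 3.0
    # Each half of the symmetric density has area 0.5.
    return 0.5 - area_zero_to_t

def two_tailed_p(t, dof):
    """Area under both tails beyond |t|."""
    return 2.0 * one_tailed_p(t, dof)

p_two = two_tailed_p(2.499, 88)
p_one = one_tailed_p(2.499, 88)
print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```

<p>The one-tailed value is exactly half the two-tailed value, which is the halving described above.</p>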
<p>In discussions of t-tests, you will often see statements of the form:</p>
<blockquote><p>The t-test rejects the hypothesis that two means are equal if</p></blockquote>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="88" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg31.png" alt="$\displaystyle \vert t\vert &gt; t_{\alpha/2, \nu}$"></td>
<td nowrap width="10" align="right">&nbsp;&nbsp;&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<blockquote><p>for a two-tailed test, or</p></blockquote>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="64" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg32.png" alt="$\displaystyle t &gt; t_{\alpha, \nu}$"></td>
<td nowrap width="10" align="right">&nbsp;&nbsp;&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<blockquote><p>for a (right-sided) one-tailed test.</p></blockquote>
<p>The quantities on the right hand side of the two equations above are called the <em>critical values</em> for a given significance level <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> (usually,<br />
<img width="75" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg34.png" alt="$ \alpha = 0.05$"> ) and <img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg35.png" alt="$ \nu$"> degrees of freedom. The critical values are the values for which the area of the right hand tail is equal to <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> .</p>
<div align="center"><a name="211"></a></p>
<table>
<caption align="bottom"><strong>Figure 4:</strong> Critical value for a one-tailed test. Reject the null hypothesis if<br />
<img width="66" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg2.png" alt="$ t &gt; t_{crit}$"></caption>
<tr>
<td>
<div align="center"><img width="385" height="252" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./onetailedcritval.png" alt="Image onetailedcritval"></div>
</td>
</tr>
</table>
</div>
<p>For a two-tailed test, you must halve the area under a single tail.</p>
<div align="center"><a name="212"></a></p>
<table>
<caption align="bottom"><strong>Figure 5:</strong> Critical value for a two-tailed test. Reject the null hypothesis if<br />
<img width="77" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg3.png" alt="$ \vert t\vert &gt; t_{crit}$"></caption>
<tr>
<td>
<div align="center"><img width="384" height="248" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./twotailedcritval.png" alt="Image twotailedcritval"></div>
</td>
</tr>
</table>
</div>
<p>This convention dates back to the time when computational resources were scarce, and researchers had to use pre-computed tables of critical values, rather than calculating <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> directly. Today, general statistical packages such as R or Matlab can compute the CDFs of any number of standard distributions; once you can compute the CDF, directly computing <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> (the area under the tails) is straightforward. Despite this, many tutorials of the t-test (and of the F-test, and other significance tests) still adhere to the convention of comparing test statistics to critical values. This tends to needlessly ritualize the whole process, and make it seem more complicated and mysterious than it actually is, at least in my opinion.</p>
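<p>For comparison, here is a small sketch of where tabulated critical values come from. It leans on the fact mentioned earlier that Student&#8217;s distribution approaches N(0, 1) as the degrees of freedom grow, so it uses the normal quantile from Python&#8217;s standard library as a large-sample stand-in. The function name is mine, and for small degrees of freedom the true Student critical values are somewhat larger than these.</p>

```python
from statistics import NormalDist

def approx_critical_value(alpha, two_tailed=True):
    """Critical value from N(0, 1), the large-sample limit of Student's distribution."""
    # A two-tailed test splits alpha between the two tails.
    tail_area = alpha / 2.0 if two_tailed else alpha
    return NormalDist().inv_cdf(1.0 - tail_area)

alpha = 0.05
one_sided = approx_critical_value(alpha, two_tailed=False)
two_sided = approx_critical_value(alpha, two_tailed=True)
print(f"one-tailed critical value ~ {one_sided:.3f}")
print(f"two-tailed critical value ~ {two_sided:.3f}")
```

<p>Comparing the test statistic to one of these values answers only &#8220;is p below alpha?&#8221;, whereas the tail area itself says how far below.</p>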
<p>David Freedman was very much against the continued practice of using critical values, rather than reporting the actual p-value. The last chapter of Freedman, Pisani and Purves [<a href="#Freedman07">FPP07</a>] is worth reading for its discussion of this, and other potential pitfalls of significance tests.</p>
<p>Some standard packages for evaluating t-tests, F-tests, or the ANOVA also present analysis results in terms of critical values. Most of them also print the actual p-value, along with the value of the test statistic and the degrees of freedom. Most researchers rightfully report the test statistics along with the actual significance levels: &#8220;we conclude that there is a significant difference in mathematical performance (t(88) = 2.499, p = 0.014)&#8230; .&#8221; Here, 88 gives the degrees of freedom, <img width="45" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg36.png" alt="$ t(88)$"> is the value of the t-statistic, and <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> is of course the p-value.</p>
<p>Similar comments apply to the F-test, discussed in more detail below.</p>
<h2><a name="SECTION00021000000000000000" id="SECTION00021000000000000000">Assumptions</a></h2>
<p>Strictly speaking, the t-test is only valid for normally distributed data where both populations have equal variance. However, the test is fairly robust to non-normal data [<a href="#Box53">Box53</a>]. You can verify that the sample variances are &#8220;equal enough&#8221; &#8211; that is, they could plausibly both be sampled observations from populations with the same variance, by using the <em>F-test</em>. The F-statistic</p>
<div align="center"><img width="102" height="40" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg37.png" alt="$\displaystyle F = {s_1}^2/{s_2}^2 $"></div>
<p>is distributed according to the <em>F distribution with<br />
<img width="131" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg38.png" alt="$ (n_1 - 1,n_2 - 1)$"> degrees of freedom</em>.</p>
<div align="center"><a name="104"></a></p>
<table>
<caption align="bottom"><strong>Figure 6:</strong> The F distribution</caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./Ftest.jpg" alt="Image Ftest"></div>
</td>
</tr>
</table>
</div>
<p>In practice, the larger variance is usually put in the numerator, so <img width="54" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg39.png" alt="$ F &gt; 1$"> . The test should still be two-tailed, so you should double the area under the right-hand tail<a name="tex2html9" href="#foot107" id="tex2html9"><sup>3</sup></a>. In this situation, you want to check if you should accept the null hypothesis (that<br />
<img width="54" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg44.png" alt="$ F \approx 1$"> ) at a given significance level. If so, then you can go ahead and apply the t-test.</p>
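<p>Here is one way the variance check could be sketched in Python without a statistics package. The right-hand tail of the F distribution is computed from the regularized incomplete beta function, evaluated with the usual continued-fraction expansion. All function names are mine, and the sample variances and sizes are made up for illustration.</p>

```python
import math

def _beta_continued_fraction(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even and odd coefficients of the continued fraction for this m.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def regularized_incomplete_beta(a, b, x):
    """I_x(a, b) for 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b

def f_upper_tail(f, d1, d2):
    """P(F > f) for the F distribution with (d1, d2) degrees of freedom."""
    return regularized_incomplete_beta(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))

def variance_ratio_test(s1_sq, n1, s2_sq, n2):
    """Two-tailed F-test with the larger variance in the numerator."""
    if s1_sq >= s2_sq:
        f, d1, d2 = s1_sq / s2_sq, n1 - 1, n2 - 1
    else:
        f, d1, d2 = s2_sq / s1_sq, n2 - 1, n1 - 1
    # Double the right-hand tail, capping the result at 1.
    return f, (d1, d2), min(1.0, 2.0 * f_upper_tail(f, d1, d2))

# Hypothetical sample variances and sizes (illustration only).
f, dof, p = variance_ratio_test(s1_sq=64.0, n1=50, s2_sq=49.0, n2=40)
print(f"F = {f:.3f}, degrees of freedom = {dof}, two-tailed p = {p:.3f}")
```

<p>A p-value above your chosen significance level means the variances are &#8220;equal enough&#8221; to proceed with the t-test.</p>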
<p>There is a variation of the t-test for distributions of unequal variance, called Welch&#8217;s t-test [<a href="#WikiWelch">Wikc</a>]. In this case, you are only checking if the means are equal, not that the distributions are the same.</p>
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">The F-test for Analysis of Variance (ANOVA)</a></h1>
<p>ANOVA is an extension of the difference of means test above to the case of more than two populations. The null hypothesis in this case is that all the sample means are equal &#8211; or more strictly, that all the treatment groups are drawn from the same population.</p>
<p>The simplest version of the ANOVA is the <em>one-way ANOVA</em>, where there are <img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg45.png" alt="$ k$"> <em>treatment groups</em> (populations) with <img width="21" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg46.png" alt="$ n_i$"> subjects (or repetitions, or replications) each, for a total of <img width="22" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg47.png" alt="$ N$"> subjects. Each population corresponds to a different single factor (a treatment or a condition: for example, a type of medicine, or a Star-Bellied Sneetch vs. a Plain-Bellied Sneetch vs. a Grinch). Two- or three-way ANOVAs correspond to varying two or three different factors combinatorially. For example, we could do a two-way ANOVA of Sneetch math performance by considering both the belly type and the gender of the Sneetches.</p>
<div align="center"><a name="115"></a></p>
<table>
<caption align="bottom"><strong>Figure 7:</strong> Table for a Two-way ANOVA of Sneetch math performance</caption>
<tr>
<td>
<div align="center"><img width="203" height="243" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./twowayANOVA.png" alt="Image twowayANOVA"></div>
</td>
</tr>
</table>
</div>
<p>We will only discuss one-way ANOVA in this article, since that covers all the relevant ideas about calculating significance.</p>
<p>For a one-way ANOVA, we have the population means <img width="27" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg48.png" alt="$ m_i$"> and variances <img width="27" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg49.png" alt="$ {s_i}^2$"> . We can also calculate the overall mean <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg50.png" alt="$ m_0$"> , over the entire aggregate population.</p>
<p>The <em>between-groups mean sum of squares</em>, which is an estimate of the <em>between-groups variance</em>, is given by</p>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="260" height="58" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg51.png" alt="$\displaystyle {s_B}^2 = \frac{1}{k-1} \sum_i {n_i \cdot (m_i - m_0)^2}$"></td>
<td nowrap width="10" align="right">(3)</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><img width="33" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg52.png" alt="$ {s_B}^2$"> is sometimes designated <img width="48" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg53.png" alt="$ MS_B$"> . It is a measure of how the population means vary with respect to the grand mean.</p>
<p>The <em>within-group mean sum of squares</em> is an estimate of the <em>within-group variance</em>:</p>
<div align="center"><a name="eqn:varw" id="eqn:varw"></a></p>
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="256" height="77" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg54.png" alt="$\displaystyle {s_W}^2 = \frac{1}{N-k} \sum_i^k \sum_j^{n_i} {(x_{ij} - m_i)}^2$"></td>
<td nowrap width="10" align="right">(4)</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><img width="37" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg55.png" alt="$ {s_W}^2$"> is sometimes designated <img width="52" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg56.png" alt="$ MS_W$"> . It is a measure of the &#8220;average population variance&#8221;.</p>
<div align="center"><a name="142"></a></p>
<table>
<caption align="bottom"><strong>Figure 8:</strong> Within-group and between-group variance</caption>
<tr>
<td>
<div align="center"><img width="322" height="214" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./sigmas.png" alt="Image sigmas"></div>
</td>
</tr>
</table>
</div>
<p>If the null hypothesis is true, then</p>
<div align="center"><img width="114" height="40" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg57.png" alt="$\displaystyle F = {s_B}^2/{s_W}^2 $"></div>
<p>is distributed according to the F distribution with<br />
<img width="116" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg58.png" alt="$ (k-1, n-k)$"> degrees of freedom.</p>
<div align="center"><a name="150"></a></p>
<table>
<caption align="bottom"><strong>Figure 9:</strong> p-value for the one-tailed F-test</caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./Ftest.jpg" alt="Image Ftest"></div>
</td>
</tr>
</table>
</div>
<p>That is, under the null hypothesis, the within-group and between-group variances should be about equal:<br />
<img width="54" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg44.png" alt="$ F \approx 1$"> . If <img width="54" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg59.png" alt="$ F &lt; 1$"> , then some of the treatment groups overlap other groups substantially, so practically speaking, one might as well accept the null hypothesis. Hence, a one-sided F test is good enough. As with the t-test, research papers usually give the value of the F statistic, the degrees of freedom, and the p-value: &#8220;<br />
<img width="238" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg4.png" alt="$ (F(2, 864) = 6.6, p = 0.0014)$"> &#8221;. In this example, the test statistic value is 6.6, and it was evaluated against the F distribution with (2, 864) degrees of freedom, which means that<br />
<img width="122" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg60.png" alt="$ k = 3, n = 866$"> . The p-value is 0.0014.</p>
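<p>Putting Equations 3 and 4 together, a one-way ANOVA can be sketched in a few lines of Python. The function names and the tiny data set are mine, for illustration only. The second function is a shortcut that holds only when the first degrees-of-freedom parameter is 2 (that is, k = 3), where the right-hand tail of the F distribution has a simple closed form; I use it here just to check the reported example, which should come out near 0.0014.</p>

```python
def one_way_anova(groups):
    """Return (F, (k - 1, N - k)) for a list of groups of observations."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    total_n = sum(sizes)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / total_n

    # Equation 3: between-groups mean sum of squares.
    ms_between = sum(n * (m - grand_mean) ** 2
                     for n, m in zip(sizes, means)) / (k - 1)
    # Equation 4: within-group mean sum of squares.
    ms_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g) / (total_n - k)
    return ms_between / ms_within, (k - 1, total_n - k)

def f_upper_tail_three_groups(f, d2):
    """P(F > f) for (2, d2) degrees of freedom; valid only when k = 3."""
    return (1.0 + 2.0 * f / d2) ** (-d2 / 2.0)

# Hypothetical scores for three treatment groups (illustration only).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, (d1, d2) = one_way_anova(groups)
print(f"F({d1}, {d2}) = {f_stat:.2f}, p = {f_upper_tail_three_groups(f_stat, d2):.4f}")

# The example quoted in the text: F(2, 864) = 6.6.
print(f"reported example: p = {f_upper_tail_three_groups(6.6, 864):.4f}")
```

<p>For other values of k you would need the general F distribution tail from a statistics package.</p>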
<h2><a name="SECTION00031000000000000000" id="SECTION00031000000000000000">Assumptions</a></h2>
<p>Like the t-test, ANOVA assumes that the data is normally distributed with equal variances. According to Box [<a href="#Box53">Box53</a>], ANOVA is fairly robust to unequal variances when the population sizes are about the same, but you might want to check anyway. If all the populations are the same size (all the <img width="21" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg46.png" alt="$ n_i$"> are the same), the easiest way to check for equality of variances is an F-test of the statistic<br />
<img width="140" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg61.png" alt="$ F = {s_{max}}^2/{s_{min}}^2$"> with <img width="49" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg62.png" alt="$ n-1$"> degrees of freedom [<a href="#Sachs84">Sac84</a>]. In other cases, you can use Bartlett&#8217;s Test [<a href="#WikiBartlett">Wika</a>] or Levene&#8217;s Test [<a href="#WikiLevene">Wikb</a>]. Bartlett&#8217;s test uses a test statistic that is distributed as the <img width="24" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg63.png" alt="$ \chi^2$"> distribution, and Levene&#8217;s test uses one that is distributed as the F distribution. Levene&#8217;s test does not assume normally distributed data.</p>
<p>If the data are not normally distributed, or have unequal variance, often they can be transformed to a form that is closer to obeying the assumptions of ANOVA. The following table of transformations is based on [<a href="#Sachs84">Sac84</a>, p. 517], and other sources [<a href="#ndsu">Hor</a>].</p>
<div align="center"><a name="177"></a></p>
<table>
<caption align="bottom"><strong>Figure 10:</strong> Table of Transformations</caption>
<tr>
<td><img width="500" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg64.png" alt="\begin{figure}\begin{center} \begin{tabular}{\vert p{2.5in}\vert p{3.5in}\vert} ... ...} \ $\sigma \approx k\mu$\ &amp; \ \hline \end{tabular} \end{center}\end{figure}"></td>
</tr>
</table>
</div>
<p>Jim Deacon from the University of Edinburgh lists some suggestions as well [<a href="#deacon07">Dea</a>]. He also reminds us that running ANOVA on the transformed data will identify significant differences in the <em>transformed</em> data. This is <em>not</em> the same as saying there are significant differences in the original data!</p>
<h1><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Once the Null Hypothesis is Rejected</a></h1>
<p>If you are able to reject the ANOVA null hypothesis, you will usually want to know which population means are significantly different from the rest. Often, in fact, you are primarily interested in which population had the highest mean. For example, if you are comparing the efficacy of a new medicine A against existing medicines B and C, you are probably not too concerned about whether B and C perform significantly differently from each other, only about whether A is significantly better than both.</p>
<p>If all you care about is whether the highest mean is significantly higher than the others, you can simply test where the statistic</p>
<div align="center"><img width="211" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg65.png" alt="$\displaystyle (m_1 - m_2)/\sqrt{{s_W}^2 \frac{n_1 + n_2}{n_1\cdot n_2}} $"></div>
<p>falls on the Student-t distribution with <img width="50" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg66.png" alt="$ n-k$"> degrees of freedom. Here, <img width="37" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg55.png" alt="$ {s_W}^2$"> is the within-group variance, as calculated in Equation <a href="#eqn:varw">4</a>, <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg10.png" alt="$ m_1$"> and <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg11.png" alt="$ m_2$"> are the highest and second highest population means, <img width="16" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg5.png" alt="$ n$"> is the total number of samples (<br />
<img width="81" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg67.png" alt="$ n = \sum{n_i}$"> ), and <img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg45.png" alt="$ k$"> is the number of treatment groups.</p>
<p>This test is usually written</p>
<div align="center"><img width="409" height="67" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg68.png" alt="$\displaystyle m_1 - m_2 &gt; t_{(n-k, \alpha/2)} \cdot \sqrt{{s_W}^2 \cdot \frac{n_1 + n_2}{n_1\cdot n_2}} = LSD_{(1,2)} $"></div>
<p>where<br />
<img width="75" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg69.png" alt="$ t_{(n-k, \alpha/2)}$"> is the (two-sided) critical value for significance level <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> and <img width="50" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg66.png" alt="$ n-k$"> is the number of degrees of freedom to use. This quantity is called the <em>least significant difference (LSD)</em> between the highest and second highest means, and the test is usually called the <em>LSD test</em>.</p>
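<p>As a small sketch, the LSD comparison might be coded as follows. The function name and all the numbers are hypothetical. In particular, the critical value is passed in as a plain argument: in practice you would look it up (or compute it) for n - k degrees of freedom, and the 2.0 used here is only a stand-in.</p>

```python
import math

def least_significant_difference(t_crit, s_w_sq, n1, n2):
    """LSD for two groups of sizes n1 and n2, given the within-group variance."""
    return t_crit * math.sqrt(s_w_sq * (n1 + n2) / (n1 * n2))

# Hypothetical inputs (illustration only).
t_crit = 2.0          # stand-in critical value, not computed here
s_w_sq = 4.0          # within-group variance from Equation 4
n1, n2 = 8, 8         # sizes of the two groups being compared
m1, m2 = 12.5, 9.0    # highest and second-highest group means

lsd = least_significant_difference(t_crit, s_w_sq, n1, n2)
print(f"LSD = {lsd:.3f}, difference = {m1 - m2:.3f}, significant: {m1 - m2 > lsd}")
```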
<p>If you want to test all the population differences <img width="73" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg70.png" alt="$ m_i - m_j$"> for significance (or test the highest value against all of the others explicitly), then you need to take some care with the LSD test. Remember that a significance level of <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> means that, when the null hypothesis is true, you will make a false positive error with probability <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> . Testing all possible population differences takes <img width="22" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg71.png" alt="$ K$"> = (<img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg45.png" alt="$ k$"> choose <img width="14" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg72.png" alt="$ 2$"> ) comparisons, or <img width="90" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg73.png" alt="$ K = k-1$"> comparisons if you sort all the means in descending order and compare adjacent ones. Testing the highest mean against all the lower values also takes <img width="90" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg73.png" alt="$ K = k-1$"> comparisons. This means the probability of making at least one false positive error can be as high as roughly<br />
<img width="48" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg74.png" alt="$ K \cdot \alpha$"> . So if you want the overall significance level to be <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> , each individual comparison should use a stricter significance threshold<br />
<img width="78" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg75.png" alt="$ p \leq \alpha/K$"> .</p>
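<p>The arithmetic of this correction is simple enough to sketch directly. The function names below are mine, and the choice of five groups is arbitrary.</p>

```python
import math

def comparisons_all_pairs(k):
    """Number of comparisons when testing every pair of k group means."""
    return math.comb(k, 2)

def comparisons_adjacent(k):
    """Number of comparisons when testing only adjacent sorted means."""
    return k - 1

def per_comparison_threshold(alpha, num_comparisons):
    """Stricter threshold for each comparison, alpha / K."""
    return alpha / num_comparisons

k, alpha = 5, 0.05  # hypothetical number of groups and overall level
for label, big_k in (("all pairs", comparisons_all_pairs(k)),
                     ("adjacent means", comparisons_adjacent(k))):
    threshold = per_comparison_threshold(alpha, big_k)
    print(f"{label}: K = {big_k}, require p <= {threshold:.4f}")
```

<p>Note how quickly the all-pairs threshold tightens as the number of groups grows.</p>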
<p>A preferred way to compare multiple means for significance (once the ANOVA null hypothesis has been rejected) is to use a <em>multiple range test</em> [<a href="#deacon07">Dea</a>] or <em>Tukey&#8217;s method</em> [<a href="#nistTukey">oST06</a>], rather than the LSD test. Tukey&#8217;s method tests all pairwise comparisons simultaneously, and the multiple range test starts with the broadest range (the highest and the lowest means), and works its way in until significance is lost.</p>
<h1><a name="SECTION00050000000000000000" id="SECTION00050000000000000000">Conclusion</a></h1>
<p>We&#8217;ve skimmed over many complications in this discussion. Hopefully, though, what we have gone over is enough to demystify much of the statistical discussion in research papers. Perhaps it will demystify the output of standard ANOVA and t-test packages for you, as well.</p>
<p>Chong-ho Yu&#8217;s site [<a href="#yu09">hY</a>] gives a brief discussion of some of the issues that I&#8217;ve skimmed over. It also lists a few common non-parametric tests. These are tests that do not make assumptions about how the data is distributed, and so they may be more appropriate for data that is very non-normal, or for discrete data. They tend to have less power than parametric tests (that is, they have a lower true positive rate); so if the data is at all normal-like, parametric tests are preferred.</p>
<p>Significance tests are used in other applications beyond testing the difference in means or variances. They are used for testing whether events follow an expected distribution, for testing if there is a correlation between two variables, and for evaluating the coefficients of a regression analysis. We hope to cover some of these applications in future installments of this series.</p>
<h2><a name="SECTION00060000000000000000" id="SECTION00060000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Box53" id="Box53">Box53</a></dt>
<dd>G.E.P. Box, <i>Non-normality and tests on variances</i>, Biometrika <b>40</b> (1953), no.&nbsp;3/4, 318-335.</dd>
<dt><a name="deacon07" id="deacon07">Dea</a></dt>
<dd>Jim Deacon, <i>A multiple range test for comparing means in an analysis of variance</i>, <a href="http://www.biology.ed.ac.uk/research/groups/jdeacon/statistics/tress7.html">http://www.biology.ed.ac.uk/research/groups/jdeacon/statistics/tress7.html</a>.</dd>
<dt><a name="Freedman07" id="Freedman07">FPP07</a></dt>
<dd>David Freedman, Robert Pisani, and Roger Purves, <i>Statistics</i>, 4th ed., W. W. Norton &amp; Company, New York, 2007.</dd>
<dt><a name="ndsu" id="ndsu">Hor</a></dt>
<dd>Rich Horsley, <i>Transformations</i>, <tt><a name="tex2html14" href="http://www.ndsu.nodak.edu/ndsu/horsley/Transfrm.pdf" id="tex2html14">http://www.ndsu.nodak.edu/ndsu/horsley/Transfrm.pdf</a></tt>, Class notes, Plant Sciences 724, North Dakota State University.</dd>
<dt><a name="yu09" id="yu09">hY</a></dt>
<dd>Chong ho&nbsp;Yu, <i>Parametric tests</i>, <a href="http://www.creative-wisdom.com/teaching/WBI/parametric_test.shtml">http://www.creative-wisdom.com/teaching/WBI/parametric_test.shtml</a>.</dd>
<dt><a name="nistTukey" id="nistTukey">oST06</a></dt>
<dd>National&nbsp;Institute of&nbsp;Standards and Technology, <i>Tukey&#8217;s method</i>, NIST/SEMATECH e-Handbook of Statistical Methods, 2006, <a href="http://itl.nist.gov/div898/handbook/prc/section4/prc471.htm">http://itl.nist.gov/div898/handbook/prc/section4/prc471.htm</a>.</dd>
<dt><a name="Sachs84" id="Sachs84">Sac84</a></dt>
<dd>Lothar Sachs, <i>Applied statistics: A handbook of techniques</i>, 2nd ed., Springer-Verlag, New York, 1984.</dd>
<dt><a name="WikiBartlett" id="WikiBartlett">Wika</a></dt>
<dd>Wikipedia, <i>Bartlett&#8217;s test</i>, <tt><a name="tex2html15" href="http://en.wikipedia.org/wiki/Bartlett's_test" id="tex2html15">http://en.wikipedia.org/wiki/Bartlett's_test</a></tt>.</dd>
<dt><a name="WikiLevene" id="WikiLevene">Wikb</a></dt>
<dd>Wikipedia, <i>Levene&#8217;s test</i>, <tt><a name="tex2html16" href="http://en.wikipedia.org/wiki/Levene's_test" id="tex2html16">http://en.wikipedia.org/wiki/Levene's_test</a></tt>.</dd>
<dt><a name="WikiWelch" id="WikiWelch">Wikc</a></dt>
<dd>Wikipedia, <i>Welch&#8217;s t test</i>, <tt><a name="tex2html17" href="http://en.wikipedia.org/wiki/Welch's_t_test" id="tex2html17">http://en.wikipedia.org/wiki/Welch's_t_test</a></tt>.</dd>
</dl>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot209" id="foot209">&#8230;</a><a href="#tex2html2"><sup>1</sup></a></dt>
<dd>Remember from the last installment that when you are estimating the mean of a distribution with unknown mean <img width="16" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg23.png" alt="$ \mu$"> and unknown variance <img width="24" height="19" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg24.png" alt="$ \sigma^2$">, the 95% confidence interval around your estimate is approximately
<img width="115" height="39" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg25.png" alt="$ m \pm 2\cdot \sigma/\sqrt{n}$">. Intuitively speaking, Student&#8217;s distribution is what you get if you calculate confidence intervals using the estimated standard deviation <img width="14" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg7.png" alt="$ s$"> instead of the true but unknown standard deviation <img width="16" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg26.png" alt="$ \sigma$">. The distribution is stretched out compared to the normal distribution to reflect this increased uncertainty.</dd>
<dt><a name="foot210" id="foot210">&#8230; positives</a><a href="#tex2html5"><sup>2</sup></a></dt>
<dd>In his textbook <em>Statistics</em>, Freedman tells an anecdote about a study that was published in the <em>Journal of the AMA</em>, claiming to demonstrate that cholesterol causes heart attacks. The treatment group that took a cholesterol reducing drug had &#8220;significantly fewer&#8221; heart attacks than the control group (<br />
<img width="82" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg29.png" alt="$ p \approx 0.035$"> ). A closer reading revealed that the researchers used a one-tailed test, which is equivalent to <em>assuming</em> that the treatment group was going to have fewer heart attacks. What if the drug had <em>increased</em> the risk of heart attack? The proper two-tailed significance of their results would have been<br />
<img width="73" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg30.png" alt="$ p \approx 0.07$">, which is higher than <em>JAMA</em>&#8217;s strict significance threshold of 0.05. [<a href="#Freedman07">FPP07</a>, p. 550]</dd>
<dt><a name="foot107" id="foot107">&#8230; tail</a><a href="#tex2html9"><sup>3</sup></a></dt>
<dd>The area to the right of <img width="19" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg40.png" alt="$ F$"> with <img width="45" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg41.png" alt="$ (a,b)$"> degrees of freedom is equal to the area to the left of <img width="38" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg42.png" alt="$ 1/F$"> , with <img width="45" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg43.png" alt="$ (b,a)$"> degrees of freedom.</dd>
</dl>
<p></p>
<hr />


]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</title>
		<link>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=statistics-to-english-translation-part-2a-%25e2%2580%2599significant%25e2%2580%2599-doesn%25e2%2580%2599t-always-mean-%25e2%2580%2599important%25e2%2580%2599</link>
		<comments>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/#comments</comments>
		<pubDate>Fri, 04 Dec 2009 20:39:20 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[effect size]]></category>
		<category><![CDATA[hypothesis testing]]></category>
		<category><![CDATA[significance]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1186</guid>
		<description><![CDATA[In this installment of our ongoing Statistics to English Translation series, we will look at the technical meaning of the term &#8220;significant&#8221;. As you might expect, what it means in statistics is not exactly what it means in everyday language. As always, a pdf version of this article is available as well. Does too much [...]


]]></description>
			<content:encoded><![CDATA[<p>In this installment of our ongoing Statistics to English Translation series<a name="tex2html1" href="#foot133" id="tex2html1"><sup>1</sup></a>, we will look at the technical meaning of the term &#8220;significant&#8221;. As you might expect, what it means in statistics is not exactly what it means in everyday language.</p>
<p>As always, a <a href="http://www.win-vector.com/dfiles/ste2a_significance.pdf">pdf version of this article</a> is available as well.<span id="more-1186"></span></p>
<blockquote><p>Does too much salt cause high blood pressure, or doesn&#8217;t it? That debate has raged for decades, with a slew of studies finding &#8220;yes&#8221; and a slew of others finding &#8220;no.&#8221; Two new studies out today in the journal <em>Hypertension</em> tip the scales in favor of reducing sodium &#8211; particularly for those 1 in 4 Americans who have high blood pressure. One study found that reducing salt intake from 9,700 milligrams a day to 6,500 milligrams decreased blood pressure significantly in blacks, Asians, and whites who had untreated mild hypertension. Another study found that switching to a lower-salt diet helped lower blood pressure in folks with treatment-resistant hypertension.<br />
- &#8220;10 salt shockers that could make hypertension worse,&#8221; <em>U.S. News &amp; World Report</em> [<a href="#Kotz09">Kot09</a>]</p></blockquote>
<p>&#8220;Great!&#8221; you think. &#8220;Who needs to spend money on high blood pressure meds? I can just cut down my salt!&#8221; Well, maybe so, maybe not. To come to that conclusion, you need more information than you were given in that paragraph. What was the &#8220;significant&#8221; decrease in blood pressure? What were the &#8220;before&#8221; and the &#8220;after&#8221;? Does &#8220;significant&#8221; mean important, or useful? And why has there been so much controversy over this?</p>
<p>Let&#8217;s discuss the important points with an example.</p>
<div align="center"><img width="211" height="236" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./sneetches.jpg" alt="Image sneetches"></div>
<p>Suppose that we wanted to test for a difference in intelligence between two groups, say Star-Bellied Sneetches and Plain-Bellied Sneetches<a name="tex2html3" href="#foot134" id="tex2html3"><sup>2</sup></a>. We take a group of 50 Star-Bellies and a group of 40 Plain-Bellies, and give them both a series of tests designed to measure their mathematical, linguistic, and problem-solving abilities. After evaluating the data, we conclude that there is &#8220;a significant difference in mathematical performance (t(88) = 2.499, p = 0.014) between the two groups&#8221;. The mean mathematics score of the Star-Bellies is 78, with a standard deviation of 7, and the mean mathematics score of the Plain-Bellies is 74, with a standard deviation of 8, for a difference of 4 points<a name="tex2html4" href="#foot135" id="tex2html4"><sup>3</sup></a>.</p>
<p>Should we interpret this result to mean that Star-Bellied Sneetches are better than Plain-Bellied ones at math? It depends.</p>
<h1><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">How Hypothesis Tests Work</a></h1>
<p>The Sneetch example above and the blood-pressure study cited earlier are both examples of <em>hypothesis tests</em>. In hypothesis testing, researchers set their proposed hypothesis (that there is an effect or a relationship) against the <em>null hypothesis</em> that there is no effect or relationship. In this article, we consider proposed relationships of the form</p>
<blockquote><p>The mean value of X measured for group A is different from the mean value of X measured for group B.</p></blockquote>
<p>In this case, the null hypothesis is</p>
<blockquote><p>The mean value of X is the same for groups A and B, and any difference observed in the data is only by observational chance.</p></blockquote>
<p>In fact, we are actually testing the stricter null hypothesis:</p>
<blockquote><p>The distribution of X is the same for groups A and B, and any difference observed is only by observational chance.</p></blockquote>
<p>A and B are sometimes called <em>treatment groups</em>; this terminology comes from the original applications of hypothesis testing procedures, in agriculture and medicine. In the blood pressure study above, the treatment is daily salt intake. One group ingests about 9,700 milligrams of sodium a day, the other group about 6,500 milligrams a day. The question of interest is: does the difference in sodium intake make a difference in the average blood pressure of the two groups? The null hypothesis is &#8220;No.&#8221;</p>
<h2><a name="SECTION00011000000000000000" id="SECTION00011000000000000000">Significance</a></h2>
<p>We call an observed difference <em>significant</em> &#8211; meaning that a difference as large as we observed is probably not due to chance &#8211; if the value <img width="40" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg3.png" alt="$ 1-p$"> is &#8220;high enough.&#8221; In the Sneetch example, <img width="70" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg4.png" alt="$ p = 0.014$"> is the <em>significance level</em> (the p-value) of the result. To interpret the p-value, suppose the null hypothesis is true: there is truly no difference between Star-Bellied math scores and Plain-Bellied math scores. If this is so, then there is only a 0.014 (1.4%) chance that the difference in the average scores of the two groups will be 4 points or larger. In other words, if the null hypothesis is true, and we administer this same test to different groups of 50 Star-Bellies and 40 Plain-Bellies a hundred times, then the difference in scores will be 4 points or more only about once or twice.</p>
<p>We interpret the fact that we have seen a difference that should be rare to be evidence that the null hypothesis <em>isn&#8217;t</em> true. So we <em>reject the null hypothesis</em> and say that there is a &#8220;significant difference&#8221; in the performance of the two groups. Alternatively, we could say that Star-Bellied Sneetches performed &#8220;significantly better&#8221; than Plain-Bellied Sneetches on the math test.</p>
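<p>The &#8220;administer the test a hundred times&#8221; reading of the p-value can be checked with a small simulation. The sketch below is only an illustration, not the calculation behind the reported result: it assumes both groups are normally distributed with a common mean of 76 and a common standard deviation of 7.5 (values chosen to sit between the reported group statistics), and the function name <code>simulate_null</code> is hypothetical.</p>

```python
import random


def simulate_null(n_a=50, n_b=40, mean=76.0, sd=7.5,
                  threshold=4.0, trials=5000, seed=0):
    """Fraction of trials in which two groups drawn from the SAME
    distribution differ in sample mean by at least `threshold` points.

    mean and sd are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        group_a = [rng.gauss(mean, sd) for _ in range(n_a)]
        group_b = [rng.gauss(mean, sd) for _ in range(n_b)]
        diff = sum(group_a) / n_a - sum(group_b) / n_b
        if abs(diff) >= threshold:
            hits += 1
    return hits / trials


fraction = simulate_null()
print(f"fraction of null trials with a difference of 4+ points: {fraction:.3f}")
```

<p>Under these assumptions the fraction comes out on the order of one percent, which is the same ballpark as the reported p-value. It will not match exactly, because the assumed standard deviation and the normal model are simplifications.</p>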
<h2><a name="SECTION00012000000000000000" id="SECTION00012000000000000000">Effect Size</a></h2>
<p>Four points (or about a 5% difference) is the <em>effect size</em> of the comparison. The effect size represents what might be called the &#8220;practical significance&#8221; of the result. In general, the larger the effect size, the better. In this example, Star-Bellies might truly outperform Plain-Bellies by about four points on average, but if we were to examine the relationship between math scores and real-life math performance (say, how well college-attending Sneetches do in their math and science courses), we might discover that it takes a test score difference of ten points or more to reliably predict which Sneetches will do better. In that case, a four point average difference would not be a practical difference.</p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">Evaluating a Result</a></h1>
<p>When evaluating a result, you should look at both its significance and its effect size. In practice, researchers usually consider a finding to be significant if
<img width="62" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg5.png" alt="$ p \leq 0.05$">. This is actually a pretty large <img width="12" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg6.png" alt="$ p$">; it means that even if the null hypothesis is true, you would still observe a difference as large as the one that you observed about five times out of every one hundred trials. In fact, Sachs noted that
<img width="77" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg7.png" alt="$ p &lt; 0.0027$"> used to be the commonly used threshold for significance ([<a href="#Sachs84">Sac84</a>, p. 114]).</p>
<p>Sometimes results are reported using an asterisk convention: (*) means
<img width="62" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg5.png" alt="$ p \leq 0.05$">, (**) means
<img width="62" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg8.png" alt="$ p \leq 0.01$">, and (***) means
<img width="70" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg9.png" alt="$ p \leq 0.001$">. Hopefully, the actual significance level is reported (it isn&#8217;t always), as well as the actual effect size (it isn&#8217;t always).</p>
<div align="center"><img width="240" height="180" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./cup_of_coffee.jpg" alt="Image cup_of_coffee"></div>
<p>The effect size in medical studies is often reported in the popular press with statements like &#8220;those who abstained from coffee had triple the risk of contracting colon cancer compared to those who drank three or more cups a day.&#8221; Does that mean that all confirmed Lapsang Souchong drinkers and the uncaffeinated should run out and learn to embrace Starbucks? Well, no. First of all, ask yourself: what is the baseline risk of colon cancer? If abstaining from coffee triples the risk from 0.01% to 0.03%, well, it probably isn&#8217;t worth worrying about. On the other hand, if the risk triples from 5% to 15%, perhaps that is a reason to take up espressos. You should also check who the subjects of the study were, and how similar they are to you. Suppose the study was done on Caucasian males in the U.S., ages 55-65, with no family history of colon cancer. If you are a young white American male, it&#8217;s possible that this study says something about your future health. If you are female or non-Caucasian or not living in the U.S., the finding may or may not be relevant to you. It depends on the mechanism that drives the relationship, and whether or not it applies to you as well as to the subjects of the study.</p>
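<p>The baseline-risk point is simple arithmetic, and a short sketch makes the contrast concrete. The two baselines are the ones used in the coffee example above; the helper name <code>risk_change</code> and the per-10,000 framing are our own additions for illustration.</p>

```python
def risk_change(baseline, relative_risk):
    """Return (new_risk, absolute_increase) for a given baseline risk
    and relative risk, both expressed as probabilities."""
    new_risk = baseline * relative_risk
    return new_risk, new_risk - baseline


# Same relative risk (tripling), very different absolute consequences.
for baseline in (0.0001, 0.05):
    new_risk, increase = risk_change(baseline, 3.0)
    extra_cases = increase * 10_000
    print(f"baseline {baseline:.2%} -> {new_risk:.2%}: "
          f"about {extra_cases:.0f} extra cases per 10,000 people")
```

<p>The same &#8220;triple the risk&#8221; headline corresponds to roughly 2 extra cases per 10,000 people in the first scenario and roughly 1,000 in the second.</p>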
<h2><a name="SECTION00021000000000000000" id="SECTION00021000000000000000">&#8220;Significant&#8221; is not the same as &#8220;Important&#8221;</a></h2>
<blockquote><p>With a large sample, even a small difference can be &#8220;statistically significant&#8221;&#8230; . This doesn&#8217;t necessarily make it important. Conversely, an important difference may not be statistically significant if the sample size is too small.<br />
- Freedman, Pisani and Purves, <em>Statistics</em> [<a href="#Freedman07">FPP07</a>, p. 550]</p></blockquote>
<p>The ability of a study to detect a significant difference depends almost entirely on its size. When a researcher designs a study, she has to decide how much risk of error &#8211; and what type of error &#8211; she is willing to tolerate.</p>
<blockquote><p>How big a risk [of inventing a difference] between two indistinguishable treatments are we willing to put up with? This risk is known as the significance level <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg10.png" alt="$ \alpha$"> . [<a href="#Sachs84">Sac84</a>, p. 214]</p></blockquote>
<p><img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg10.png" alt="$ \alpha$"> is the probability of rejecting a null hypothesis that should be accepted. This is a Type I error (a false positive). <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg10.png" alt="$ \alpha$"> enters the design of the study as the threshold for p-values that the researcher will accept as significant.</p>
<blockquote><p>How big a risk do we allow of missing a substantial difference between two treatments? &#8230; This risk is called <img width="14" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg11.png" alt="$ \beta$"> . [<a href="#Sachs84">Sac84</a>, p. 214]</p></blockquote>
<p><img width="14" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg11.png" alt="$ \beta$"> is the probability of accepting a null hypothesis that should have been rejected. This is a Type II error (a false negative). The quantity <img width="41" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg12.png" alt="$ 1-\beta$"> is known as the <em>power</em> of the test: the probability that the test will correctly reject the null hypothesis when the alternative hypothesis is true.</p>
<blockquote><p>How small a difference should still be recognized as significant? This difference is called <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> . [<a href="#Sachs84">Sac84</a>, p. 214]</p></blockquote>
<p><img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> is the minimum effect size that we are willing to consider &#8220;practically significant.&#8221;</p>
<p>It is important to consider <em>all three</em> of <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg10.png" alt="$ \alpha$">, <img width="14" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg11.png" alt="$ \beta$">, and <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> when determining an appropriate sample size for a trial. For a given true difference, the power of a test and the significance of a result both increase as the sample size <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> increases. So if <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> is not specified, <b>any difference, however small, can appear significant with a large enough <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"></b>, even if the difference is too small to matter in practice.</p>
<h3><a name="SECTION00021100000000000000" id="SECTION00021100000000000000">The Central Limit Theorem</a></h3>
<p>To see why the above statement is true, we need a few more facts about estimating the mean. Suppose we have a random variable <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg14.png" alt="$ X$"> that is normally (or nearly normally) distributed, with a true mean <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> and (unknown) variance <img width="21" height="16" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg15.png" alt="$ \sigma^2$">. You want to estimate <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> by drawing <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> samples; the sample mean <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg16.png" alt="$ \bar{x}$"> gives you an estimate of <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $">. According to the <em>Central Limit Theorem</em>, if you were to repeat this experiment over and over again, you would see that the estimated <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg16.png" alt="$ \bar{x}$"> has a normal distribution, with mean <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> and variance
<img width="38" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg17.png" alt="$ \sigma^2/n$"> . So <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg16.png" alt="$ \bar{x}$"> is a good estimate of <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> , one that improves with a larger sample size <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> .</p>
<p>Another fact about normal distributions is that a little over 95% of the probability mass is within <img width="24" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg18.png" alt="$ \pm 2$"> standard deviations of the mean. So, for a single experiment, we can reason that the true mean <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> is in the interval
<img width="81" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg19.png" alt="$ \bar{x} \pm 2 \sigma/\sqrt{n}$"> with 95% probability<a name="tex2html5" href="#foot136" id="tex2html5"><sup>4</sup></a>.</p>
<div align="center"><a name="86"></a>
<table>
<caption align="bottom"><strong>Figure 1:</strong> Confidence bounds on the estimate of <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> for different values of <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"></caption>
<tr>
<td>
<div align="center"><img width="370" height="183" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./fig1.png" alt="Image fig1"></div>
</td>
</tr>
</table>
</div>
<p>So, as <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> gets larger, we zoom in on <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> <a name="tex2html7" href="#foot89" id="tex2html7"><sup>5</sup></a>.</p>
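<p>To make the &#8220;zooming in&#8221; concrete, here is a minimal sketch of how the half-width of the approximate 95% interval, 2&sigma;/&radic;n, shrinks as the sample grows. The value &sigma; = 8 is an assumption chosen only for illustration, and <code>half_width</code> is a hypothetical helper name.</p>

```python
import math


def half_width(sigma, n):
    """Half-width of the approximate 95% interval: 2 * sigma / sqrt(n)."""
    return 2.0 * sigma / math.sqrt(n)


sigma = 8.0  # assumed standard deviation, for illustration only
for n in (10, 40, 160, 640):
    print(f"n = {n:4d}: mean estimate is within +/- {half_width(sigma, n):.2f}")
```

<p>Each fourfold increase in the sample size halves the width of the interval, so precision improves only with the square root of the amount of data collected.</p>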
<p>Now, back to the problem of checking for the difference of means. We&#8217;ll take <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> samples from population <img width="16" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg22.png" alt="$ A$"> and <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> from population <img width="17" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg23.png" alt="$ B$"> . Let&#8217;s assume for now that the variances are equal.</p>
<div align="center"><a name="93"></a>
<table>
<caption align="bottom"><strong>Figure 2:</strong> Confidence bounds overlap; means may not be truly different</caption>
<tr>
<td>
<div align="center"><img width="273" height="211" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./fig2.png" alt="Image fig2"></div>
</td>
</tr>
</table>
</div>
<p>With 95% probability,
<img width="131" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg24.png" alt="$ \mu_A \in \bar{x}_A \pm 2\sigma/\sqrt{n}$">, and
<img width="132" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg25.png" alt="$ \mu_B \in \bar{x}_B \pm 2\sigma/\sqrt{n}$">. If
<img width="72" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg26.png" alt="$ \vert\bar{x}_A - \bar{x}_B\vert$"> is small compared to
<img width="53" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg27.png" alt="$ 4 \sigma/\sqrt{n}$">, then the two confidence intervals overlap substantially, and we cannot reject the null hypothesis that
<img width="66" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg28.png" alt="$ \mu_A = \mu_B$">.</p>
<p>If, on the other hand,
<img width="72" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg26.png" alt="$ \vert\bar{x}_A - \bar{x}_B\vert$"> is large compared to
<img width="53" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg27.png" alt="$ 4 \sigma/\sqrt{n}$">:</p>
<div align="center"><a name="109"></a>
<table>
<caption align="bottom"><strong>Figure 3:</strong> Confidence bounds don&#8217;t overlap; means are significantly different</caption>
<tr>
<td>
<div align="center"><img width="331" height="193" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./fig3.png" alt="Image fig3"></div>
</td>
</tr>
</table>
</div>
<p>then the confidence intervals are well separated, and we can reject the null hypothesis.</p>
<p>So <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$">, the minimum significant distance &#8211; the &#8220;resolution&#8221; of the experiment &#8211; is about the distance at which the two confidence intervals touch:
<img width="53" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg27.png" alt="$ 4 \sigma/\sqrt{n}$">, if our desired significance level is 0.05.</p>
<div align="center"><a name="116"></a>
<table>
<caption align="bottom"><strong>Figure 4:</strong> Minimum significant distance for a given sample size <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"></caption>
<tr>
<td>
<div align="center"><img width="304" height="229" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./fig4.png" alt="Image fig4"></div>
</td>
</tr>
</table>
</div>
<p>If <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> is too large, the experiment may be unable to detect important differences because the confidence intervals overlap too soon. This means that the sample size was too small (the test didn&#8217;t have enough power), and the experiment should be repeated with a larger test population.</p>
<p>If <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> is too small, then the experiment will potentially detect statistically significant differences that are, for all practical intents and purposes, meaningless. To go back to the Sneetch example, if the math exam has one hundred questions, then an effect size of two points would correspond to one group answering two additional questions correctly, on average. Practically speaking, that&#8217;s probably not a very big difference. But if we made the experiment big enough, about 250 Sneetches in each group, it would be a <em>statistically</em> significant difference, to the 0.05 level. In theory, we could even make a difference of less than one point statistically significant! That is why knowing the effect size of a significant result is important.</p>
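<p>The rule of thumb &delta; &asymp; 4&sigma;/&radic;n can be inverted to ask how many subjects per group would make a given difference statistically significant. The sketch below uses only that rough &#8220;intervals just touch&#8221; rule from this article, not a formal power calculation, and it assumes a common standard deviation of 8 points (the larger of the two Sneetch groups). The function names are hypothetical.</p>

```python
import math


def min_significant_distance(sigma, n):
    """Rule-of-thumb resolution of the experiment: 4 * sigma / sqrt(n)."""
    return 4.0 * sigma / math.sqrt(n)


def samples_per_group(sigma, delta):
    """Smallest n per group for which a difference of `delta` reaches
    the rule-of-thumb threshold, i.e. n >= (4 * sigma / delta) ** 2."""
    return math.ceil((4.0 * sigma / delta) ** 2)


sigma = 8.0  # assumed common standard deviation, for illustration only
for delta in (4.0, 2.0, 1.0):
    print(f"difference of {delta:.0f} points: "
          f"about {samples_per_group(sigma, delta)} Sneetches per group")
```

<p>With this assumed &sigma;, a two-point difference needs 256 Sneetches per group, in line with the &#8220;about 250&#8221; quoted above, and halving the difference to one point quadruples the requirement.</p>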
<h2><a name="SECTION00022000000000000000" id="SECTION00022000000000000000">&#8220;Significant&#8221; is not the same as &#8220;True&#8221;</a></h2>
<p>The power and significance level of a test play similar roles to the sensitivity and specificity of a diagnostic test. You&#8217;ll remember from Part 1 of this series<a name="tex2html11" href="#foot137" id="tex2html11"><sup>6</sup></a> that sensitivity and specificity are properties of the test, <em>not</em> of how the test performs in a given population. To know the practical accuracy of a screening test, you must know the underlying prevalence of the condition that it is screening for. If it is crucial that the screening not miss any positive cases, then the test will be designed to be highly sensitive, possibly at the cost of specificity. In that case, a large share of the test&#8217;s positive results will be false positives if the condition is relatively rare. And yet, this same screening test will produce a smaller share of false positives when used in a population where the condition is more prevalent.</p>
<p>The same is true for hypothesis tests. The probability that a statistically significant result is actually <em>true</em> depends on the underlying probability that results &#8220;of that type&#8221; tend to be true in the domain of study. It also depends on whether the researcher was trying to minimize the chance of a false positive error, or a false negative error.</p>
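<p>One way to make this concrete is to apply Bayes&#8217; rule, treating the power of the test like sensitivity and the significance level like the false positive rate. The sketch below is only an illustration: the prior probabilities, the power of 0.8 and the function name are all assumed values of our own, not figures from any particular study.</p>

```python
def prob_significant_result_is_true(prior, power, alpha):
    """Bayes' rule: P(effect is real | test came out significant)."""
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)

# All three inputs are assumed values for illustration.
for prior in (0.5, 0.1, 0.01):
    chance = prob_significant_result_is_true(prior, power=0.8, alpha=0.05)
    print(prior, round(chance, 3))
```

<p>Under these assumptions the same test that is right about 94% of the time in a domain where half of all hypotheses are true is right only about 14% of the time in a domain where one hypothesis in a hundred is true.</p>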
<p>You should also be careful interpreting the results of exploratory work, where the researchers have run a series of several different studies, but only highlight the &#8220;significant&#8221; ones. Running twenty experiments and having one of them return a significant result to the <img width="62" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg29.png" alt="$ p=0.05$"> level is actually not significant at all.</p>
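<p>The arithmetic behind the twenty-experiment warning is short enough to check directly. The sketch below assumes that the experiments are independent and that no real effect exists in any of them; the function name is our own.</p>

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests is
    'significant' at level alpha when no real effect exists anywhere."""
    return 1.0 - (1.0 - alpha) ** num_tests

print(round(prob_at_least_one_false_positive(20), 2))
```

<p>Under those assumptions the chance of at least one spurious &#8220;significant&#8221; result among twenty experiments is about 0.64, so a single hit is closer to the expected outcome than to a discovery.</p>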
<p>John Ioannidis discusses these points (and a few others) in his 2005 essay &#8220;Why Most Published Research Findings are False&#8221; [<a href="#Ion05">Ioa05</a>]. The essay made a few waves at the time of its publication, and it is still available online. We recommend that you read it, along with the 2007 followup article by Moonesinghe et al. [<a href="#Moon07">MKJ07</a>]. Now that you&#8217;ve read the first two installments of the Statistics to English translation, both essays should be a breeze!</p>
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">Some Points to Remember</a></h1>
<ul>
<li>&#8220;Significant&#8221; is a statistical statement that an observed relationship is unlikely to be by chance. It is not necessarily a statement about the magnitude or the importance (or the truth!) of the relationship.</li>
<li>Knowing the effect size of a significant result will help you decide if the relationship is &#8220;practically significant.&#8221;</li>
<li>With a large enough sample size, even a very small difference in means can be statistically significant, whether or not it matters in practice.</li>
</ul>
<p>You now have a general idea what a &#8220;statistically significant result&#8221; is. The next installment will go into a little more technical detail of how significance is calculated. You should read that installment if you want to decipher statements in research papers like &#8220;<!-- MATH<br />
 $(F(2, 864) = 6.6, p = 0.0014)$<br />
 --><br />
<img width="202" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg30.png" alt="$ (F(2, 864) = 6.6, p = 0.0014)$"> &#8221; &#8212; or if you are simply curious.</p>
<h2><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Freedman07" id="Freedman07">FPP07</a></dt>
<dd>David Freedman, Robert Pisani, and Roger Purves, <i>Statistics</i>, 4th ed., W. W. Norton &amp; Company, New York, 2007.</dd>
<dt><a name="Ion05" id="Ion05">Ioa05</a></dt>
<dd>John P.&nbsp;A. Ioannidis, <i>Why most published research findings are false</i>, PLoS Med <b>2</b> (2005), no.&nbsp;8, e124, Available as <a href="http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124">http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124</a>.</dd>
<dt><a name="Kotz09" id="Kotz09">Kot09</a></dt>
<dd>Deborah Kotz, <i>10 salt shockers that could make hypertension worse</i>, U.S. News &amp; World Report (2009), Online as <a href="http://health.usnews.com/articles/health/heart/2009/07/20/10-salt-shockers-that-could-make-hypertension-worse.html"> http://health.usnews.com/articles/health/heart/2009/07/20/10-salt-shockers-that-could-make-hypertension-worse.html</a>.</dd>
<dt><a name="Moon07" id="Moon07">MKJ07</a></dt>
<dd>Ramal Moonesinghe, Muin&nbsp;J Khoury, and A.&nbsp;Cecile J.&nbsp;W Janssens, <i>Most published research findings are false &#8212; but a little replication goes a long way</i>, PLoS Med <b>4</b> (2007), no.&nbsp;2, e28, Available as <a href="http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0040028">http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0040028</a>.</dd>
<dt><a name="Sachs84" id="Sachs84">Sac84</a></dt>
<dd>Lothar Sachs, <i>Applied statistics: A handbook of techniques</i>, 2nd ed., Springer-Verlag, New York, 1984.</dd>
<dt><a name="Spiegel08" id="Spiegel08">SS99</a></dt>
<dd>Murray&nbsp;R. Spiegel and Larry&nbsp;J. Stephens, <i>Schaum&#8217;s outline of statistics</i>, 4th ed., McGraw-Hill, New York, 1999.</dd>
</dl>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot133" id="foot133">&#8230; series</a><a href="#tex2html1"><sup>1</sup></a></dt>
<dd><tt><a name="tex2html2" href="http://www.win-vector.com/blog/category/statistics-to-english-translation/" id="tex2html2">http://www.win-vector.com/blog/category/statistics-to-english-translation/</a></tt></dd>
<dt><a name="foot134" id="foot134">&#8230; Sneetches</a><a href="#tex2html3"><sup>2</sup></a></dt>
<dd>&#8220;The Sneetches,&#8221; from <em>The Sneetches and Other Stories</em> by Dr. Seuss.<br />
<a href="http://www.youtube.com/watch?v=Ln3V0HgW4eM">http://www.youtube.com/watch?v=Ln3V0HgW4eM</a><br />
 and <a href="http://www.youtube.com/watch?v=s0LgMpfLD1Y">http://www.youtube.com/watch?v=s0LgMpfLD1Y</a>
</dd>
<dt><a name="foot135" id="foot135">&#8230; points</a><a href="#tex2html4"><sup>3</sup></a></dt>
<dd>This example is based on Exercise 10.17 in [<a href="#Spiegel08">SS99</a>]; the original exercise did not, unfortunately, involve Sneetches.</dd>
<dt><a name="foot136" id="foot136">&#8230; probability</a><a href="#tex2html5"><sup>4</sup></a></dt>
<dd>The correct way to state this is that for a given (unknown) <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> , the estimate <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg16.png" alt="$ \bar{x}$"> falls in the interval <!-- MATH<br />
 $\mu<br />
\pm 2 \sigma/\sqrt{n}$<br />
 --><br />
<img width="82" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg20.png" alt="$ \mu \pm 2 \sigma/\sqrt{n}$"> just over 95% of the time. This gets awkward to reason about. Luckily, symmetry arguments let us center the appropriate confidence interval around <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg16.png" alt="$ \bar{x}$"> instead.</dd>
<dt><a name="foot89" id="foot89">&#8230;</a><a href="#tex2html7"><sup>5</sup></a></dt>
<dd>Of course, we don&#8217;t actually know <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg21.png" alt="$ \sigma$"> , so we don&#8217;t know exactly how fast we zoom in. That doesn&#8217;t affect our argument, though, since only <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> changes.</dd>
<dt><a name="foot137" id="foot137">&#8230; series</a><a href="#tex2html11"><sup>6</sup></a></dt>
<dd><a href="http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/">http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/</a></dd>
</dl>
<p></p>
<hr />
<address>Nina Zumel 2009-12-04</address>


]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The Local to Global Principle</title>
		<link>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=the-local-to-global-principle</link>
		<comments>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 16:37:53 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Dynamic Programming]]></category>
		<category><![CDATA[Local to Global]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[PageRank]]></category>
		<category><![CDATA[Problem Solving]]></category>
		<category><![CDATA[Speech Recognition]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1123</guid>
		<description><![CDATA[We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it [...]
]]></description>
			<content:encoded><![CDATA[<p>We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods. We have produced both a stand-alone <a href="http://www.win-vector.com/dfiles/LocalToGlobal.pdf">PDF</a> (more legible) and an HTML/blog form (more skimmable).<br />
<span id="more-1123"></span></p>
<h1 align="center">The Local to Global Principle</h1>
<p align="center"><strong>John Mount<a name="tex2html3" href="#foot21" id="tex2html3"><sup>1</sup></a></strong></p>
<p></p>
<p align="center"><b>Date:</b> November 11, 2009</p>
<hr />
<h3>Abstract:</h3>
<div>We describe &#8220;the local to global principle.&#8221; It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods.</div>
<p></p>
<h2><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">Contents</a></h2>
<p><!--Table of Contents--></p>
<ul>
<li><a name="tex2html32" href="#SECTION00020000000000000000" id="tex2html32">Introduction</a></li>
<li><a name="tex2html33" href="#SECTION00030000000000000000" id="tex2html33">The Examples</a>
<ul>
<li><a name="tex2html34" href="#SECTION00031000000000000000" id="tex2html34">Web Page Link Analysis</a></li>
<li><a name="tex2html35" href="#SECTION00032000000000000000" id="tex2html35">Natural Language Processing</a></li>
<li><a name="tex2html36" href="#SECTION00033000000000000000" id="tex2html36">Machine Learning</a></li>
</ul>
<p></li>
<li><a name="tex2html37" href="#SECTION00040000000000000000" id="tex2html37">Some Methods</a>
<ul>
<li><a name="tex2html38" href="#SECTION00041000000000000000" id="tex2html38">Local Methods</a></li>
<li><a name="tex2html39" href="#SECTION00042000000000000000" id="tex2html39">Globalization Methods</a></li>
</ul>
<p></li>
<li><a name="tex2html40" href="#SECTION00050000000000000000" id="tex2html40">Conclusion</a></li>
<li><a name="tex2html41" href="#SECTION00060000000000000000" id="tex2html41">Bibliography</a></li>
<li><a name="tex2html42" href="#SECTION00070000000000000000" id="tex2html42">Acknowledgement</a></li>
</ul>
<p><!--End of Table of Contents--></p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">Introduction</a></h1>
<p><font>A common vain hope of computer scientists and algorithm designers is that a domain expert has already &#8220;boiled down&#8221; a problem to a precise, but unsolved, algorithmic core. On this point the mathematician Gian-Carlo Rota wrote:</font></p>
<blockquote><p><font>One of the rarest mathematical talents is the talent for applied mathematics, for picking out of a maze of experimental data the two or three parameters that are relevant, and to discard all other data. This talent is rare. It is taught only at the shop level.[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;A Mathematician&#8217;s Gossip&#8221;]</font></p></blockquote>
<p><font>We describe a useful tool for designing algorithmic applications and solutions which we call &#8220;the local to global principle.&#8221; The local to global principle is the method of deriving applications and solutions by specifying &#8220;local&#8221; (and deliberately myopic) heuristics, critiques and methods followed by using a powerful general method to &#8220;globalize&#8221; this specification into a complete solution.</font></p>
<p><font>There are many important problem solving prescriptions and methods of thought already systematically described and taught:</font></p>
<ul>
<li>Bacon&#8217;s &#8220;New Organon&#8221; and Mill&#8217;s principles of inductive logic.[<a href="#Mill">Mil02</a>]</li>
<li>Feynman&#8217;s genius method.[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;Ten Lessons I Wish I Had Been Taught&#8221;]</li>
<li>Reductionism (top down and bottom up).</li>
<li>Divide and conquer.[<a href="#IntroductionToAlgorithms">CLRS09</a>]</li>
<li>Forward deduction, backwards induction.</li>
<li>Root Cause Analysis.</li>
<li>Polya&#8217;s heuristic and conjecture and prove patterns [<a href="#citeulike:679515">Pol71</a>,<a href="#Polya1">Pol54a</a>,<a href="#Polya2">Pol54b</a>]</li>
<li>Doron Zeilberger&#8217;s &#8220;Method of Undetermined Generalization and Specialization.&#8221; [<a href="#Zeilberger:1995p277">Zei95</a>]</li>
<li>Zbigniew Michalewicz and David B. Fogel&#8217;s presentation of evolutionary algorithms.[<a href="#HTSMH">MF00</a>]</li>
</ul>
<p><font>The local to global principle is more of an organizational pattern than &#8220;computer aided technique&#8221; as no one specific species of software or family of notation is required.</font></p>
<p><font>The local to global principle can be identified in a number of previous important applications, but it is not currently an identified principle.<a name="tex2html4" href="#foot244" id="tex2html4"><sup>2</sup></a> The principle is very general, so any succinct description of it is going to be painfully vague. Instead, we explain the principle by discussing some example applications and methods.  For each of our example applications we deliberately use a different globalization technique. The effective algorithmist or practitioner must in fact come to each problem already familiar with a reasonably large set of already known local and global techniques, so we conclude with some appropriate fields of study and preparation.</font></p>
<p><font>The local to global principle is divided into two parts: local encoding of the problem followed by a globalization step that uses the encoding. The guiding feature of local encodings is that they are usually easy to compute from the data at hand. Any extension that looks like enumeration, search or optimization is best left to the global step. The local step is essentially the translation of your problem into an abstract language that is ready for the globalization step. In contrast, globalization methods are often &#8220;off the shelf&#8221; in that once you abstract and encode the particulars of your problem you can look for pre-existing useful methods or software to finish your solution. The idea of globalization is to find a best overall or global compromise between competing local criteria. The local step does not so much have to avoid conflicts as &#8220;price&#8221; them. There is also an important trade-off: sophisticated local techniques allow the use of simpler globalization methods, and more powerful globalization methods allow the use of simpler local techniques.</font></p>
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">The Examples</a></h1>
<p><font>To demonstrate the breadth of the local to global principle we choose a diverse collection of example applications: web page link analysis, natural language processing and machine learning. For each example application we will set up the problem, introduce a reasonable set of local criteria and pick an appropriate globalization technique. We will favor finishing each example without describing the globalization technique in detail, as this would distract from our point and is best left to the given references. These examples are previously solved problems; our contribution is demonstrating the shared underlying principle.</font></p>
<h2><a name="SECTION00031000000000000000" id="SECTION00031000000000000000">Web Page Link Analysis</a></h2>
<p><font>For our first example application we demonstrate web page link analysis in the form of the famous PageRank score.[<a href="#Page:1998p2689">PBMW98</a>]</font></p>
<p><font>One of the many good ideas leading up to the early Google search engine was the design of a non-text based measure of importance or interestingness of web pages. A search engine that could fold &#8220;interestingness&#8221; or popularity into its notion of relevance could better sort important pages into the search user&#8217;s view. When the web got so large that there were many pages that were exact matches to any common user query, popularity became a critical consideration. A link based notion of popularity exploits what is important about the web (the link structure, for example see [<a href="#Kleinberg:1997p32">Kle97</a>]) and avoids having to depend on a lot of natural language understanding technology. This technique also uses authority outside of the given page, so it has some hope of being resistant (though not immune) to web-spam.</font></p>
<p><font>Taken all at once, designing a score of page importance is a daunting task. However, by working in stages (as the local to global principle prescribes) we can quickly derive interesting scores including the famous PageRank score. We start with the idea that popularity (or the amount of web traffic a page receives) is (loosely) correlated with importance. So for our first approximation step we decide to try to estimate popularity (or web traffic) and use this estimate as our importance score. Accurately estimating web traffic is itself a hard problem and a big industry (just a few of the major companies involved in this are: Google/Urchin, Quantcast, Nielsen, comScore, Alexa, Hitwise and LookSmart). For our second approximation step we are going to try to estimate popularity from the link structure<a name="tex2html6" href="#foot43" id="tex2html6"><sup>4</sup></a> of the web (using no other measurements or historic data) and use this as our score. This link based estimate is unlikely to completely reproduce real web surfing patterns, but it is very interesting in its own right and has been proven in the market to be a useful score.</font></p>
<p><font>Now the problem is to try to estimate the popularity of a web page from the link structure of the web. We claim: we can generate a useful (but not necessarily accurate) estimate of web traffic from the web&#8217;s link structure alone. Consider Figure&nbsp;<a href="#fig:Links1">1</a> where we have a universe of three web pages A, B and C that link to each other in the pattern illustrated by what is called a graph.<a name="tex2html7" href="#foot45" id="tex2html7"><sup>5</sup></a></font></p>
<div align="center"><a name="fig:Links1" id="fig:Links1"></a><a name="50"></a></p>
<table>
<caption align="bottom"><strong>Figure 1:</strong> A set of Mutually Linked Web Pages</caption>
<tr>
<td>
<div align="center"><img width="300" height="436" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/Links1.png" alt="Image Links1"></div>
</td>
</tr>
</table>
</div>
<p><font>In Figure&nbsp;<a href="#fig:Links1">1</a> we can consider each link to another page as evidence the other page is interesting or popular. One idea is to simulate a very simple web surfer who clicks on the links on a page uniformly at random. This is called &#8220;the random surfer model&#8221; and even a model this simple allows us to read some useful information from the link structure of the web. For instance, we could ask what fraction of their time the random surfer spends on each web page, with an eye to the idea that the pages the random surfer visits more often are the more important ones. Let <img width="35" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg2.png" alt="$ p(A)$"> denote the proportion of time the random web surfer spends on page A (and define <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg3.png" alt="$ p(B)$"> and <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg4.png" alt="$ p(C)$"> similarly). While we do not know any of <!-- MATH<br />
 $p(A), p(B)$<br />
 --><br />
<img width="76" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg5.png" alt="$ p(A), p(B)$"> or <img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg4.png" alt="$ p(C)$"> we can derive some relationships between them by inspecting the link graph:</font></p>
<p></p>
<div align="center"><!-- MATH<br />
 \begin{eqnarray*}<br />
p(A) &#038; = &#038; \frac{1}{2} P(B) + P(C) \\<br />
p(B) &#038; = &#038; \frac{1}{2} P(A) \\<br />
p(C) &#038; = &#038; \frac{1}{2} P(A) + \frac{1}{2} P(B) .<br />
\end{eqnarray*}<br />
 --></p>
<table cellpadding="0" align="center" width="100%">
<tr valign="middle">
<td nowrap align="right"><img width="35" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg6.png" alt="$\displaystyle p(A)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="109" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg8.png" alt="$\displaystyle \frac{1}{2} P(B) + P(C)$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg9.png" alt="$\displaystyle p(B)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="52" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg10.png" alt="$\displaystyle \frac{1}{2} P(A)$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
<tr valign="middle">
<td nowrap align="right"><img width="36" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg11.png" alt="$\displaystyle p(C)$"></td>
<td width="10" align="center" nowrap><img width="16" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg7.png" alt="$\displaystyle =$"></td>
<td align="left" nowrap><img width="125" height="49" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg12.png" alt="$\displaystyle \frac{1}{2} P(A) + \frac{1}{2} P(B) .$"></td>
<td width="10" align="right">&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><font>The first equation is just reading from the graph that: all visits on page-A must come from pages B and C, half of the visitors on page-B continue on to A and all of the visitors on page-C continue on to A. The second and third equations are the appropriate summaries of how traffic is routed to pages B and C. We can insist that <!-- MATH<br />
 $P(A) + P(B)<br />
+ P(C) = 1$<br />
 --><br />
<img width="183" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg13.png" alt="$ P(A) + P(B) + P(C) = 1$"> as we want these numbers to represent the fraction of time the random web surfer spends on each page. A more sophisticated model would add more features<a name="tex2html9" href="#foot245" id="tex2html9"><sup>6</sup></a> to get a more useful result.</font></p>
<p><font>It turns out we have already encoded enough local rules to completely determine <!-- MATH<br />
 $P(A), P(B)$<br />
 --><br />
<img width="85" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg14.png" alt="$ P(A), P(B)$"> and <img width="40" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg15.png" alt="$ P(C)$"> . In this example application an algorithmist already familiar with linear algebra&nbsp;[<a href="#Strang">Str76</a>] would recognize these local conditions as &#8220;a system of linear equations.&#8221; Solving even web-scale systems of linear equations is considered easy with modern techniques and modern computers. For our small example the solution is: <!-- MATH<br />
 $p(A) = \frac{4}{9}$<br />
 --><br />
<img width="67" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg16.png" alt="$ p(A) = \frac{4}{9}$"> , <!-- MATH<br />
 $p(B) = \frac{2}{9}$<br />
 --><br />
<img width="68" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg17.png" alt="$ p(B) = \frac{2}{9}$"> , and <!-- MATH<br />
 $p(C) = \frac{3}{9}$<br />
 --><br />
<img width="67" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg18.png" alt="$ p(C) = \frac{3}{9}$"> . The role of the local steps was to reduce a new problem (estimating the importance or popularity of a web page from the link structure) to something with <em>already known</em> techniques (like solving a linear system as illustrated in Figure&nbsp;<a href="#fig:LinAlg">2</a>).</font></p>
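<p><font>For readers who would rather check the arithmetic by machine, here is a minimal sketch that solves the same system exactly with Python fractions. The elimination routine and its name are our own illustration, not the method used by any production search engine. Because the three traffic equations are linearly dependent, we keep two of them and add the condition that the proportions sum to one.</font></p>

```python
from fractions import Fraction

def solve_linear_system(rows):
    """Gauss-Jordan elimination on an augmented matrix of Fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n = len(m)
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero pivot.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [row[-1] for row in m]

half = Fraction(1, 2)
# Unknowns are p(A), p(B), p(C); the last column is the right-hand side.
system = [
    [-half, 1, 0, 0],      # p(B) = 1/2 p(A)
    [-half, -half, 1, 0],  # p(C) = 1/2 p(A) + 1/2 p(B)
    [1, 1, 1, 1],          # p(A) + p(B) + p(C) = 1
]
p_a, p_b, p_c = solve_linear_system(system)
print(p_a, p_b, p_c)
```

<p><font>The printed values are 4/9, 2/9 and 1/3 (that is, 3/9), and they also satisfy the equation for p(A) that we left out of the system.</font></p>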
<div align="center"><a name="fig:LinAlg" id="fig:LinAlg"></a><a name="79"></a></p>
<table>
<caption align="bottom"><strong>Figure 2:</strong> Linear Algebra Solution: As Taught in School</caption>
<tr>
<td>
<div align="center"><img width="400" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LinAlg.jpg" alt="Image LinAlg"></div>
</td>
</tr>
</table>
</div>
<p><font>So page-A is the most important page by the PageRank measure.</font></p>
<p><font>In this example application the local step was setting up the system of linear equalities (which are easy to derive from the web link graph) and the global step was solving the entire system for the final scores (which were not obvious). You spend most of your time encoding the problem and then use a known technique (in this case solving a linear system) to finish the solution.</font></p>
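<p><font>The same scores can also be approximated without solving the system explicitly, by letting the random surfer take many clicks and tracking where the probability ends up. The sketch below assumes the link structure that we read off the three equations (A links to B and C, B links to A and C, C links to A); the function name and the choice of 100 clicks are our own.</font></p>

```python
# Link structure as read off the three traffic equations (an assumption
# of this sketch): each page maps to the pages it links to.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}

def random_surfer_step(p):
    """Move the surfer one click: each page splits its probability
    evenly among the pages it links to."""
    nxt = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = p[page] / len(targets)
        for target in targets:
            nxt[target] += share
    return nxt

# Start on page A and follow 100 clicks (the count is arbitrary).
p = {"A": 1.0, "B": 0.0, "C": 0.0}
for _ in range(100):
    p = random_surfer_step(p)

print({page: round(value, 4) for page, value in p.items()})
```

<p><font>The proportions settle at roughly 0.4444, 0.2222 and 0.3333, matching the exact solution. Here the local rule (split traffic evenly over outgoing links) is unchanged and only the globalization method differs.</font></p>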
<h2><a name="SECTION00032000000000000000" id="SECTION00032000000000000000">Natural Language Processing</a></h2>
<p><font>Our next example application is natural language processing&nbsp;[<a href="#CharniakBook">Cha96</a>,<a href="#Charniak:1997p1484">Cha97</a>]. Speech recognition (the alignment or transcription of recognized intelligible segments of sound to written text) is an important problem in natural language processing. An example problem is the need to find the most likely text matching a sequence of sounds such as is shown in Figure&nbsp;<a href="#fig:SoundSeq1">3</a>.</font></p>
<div align="center"><a name="fig:SoundSeq1" id="fig:SoundSeq1"></a><a name="89"></a></p>
<table>
<caption align="bottom"><strong>Figure 3:</strong> A Sequence of Sounds</caption>
<tr>
<td>
<div align="center"><img width="500" height="69" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq1.png" alt="Image SoundSeq1"></div>
</td>
</tr>
</table>
</div>
<p><font>Consider Figure&nbsp;<a href="#fig:SoundSeq3">4</a> (which shows a bad transcription) and Figure&nbsp;<a href="#fig:SoundSeq2">5</a> (which shows a good transcription).</font></p>
<div align="center"><a name="fig:SoundSeq3" id="fig:SoundSeq3"></a><a name="98"></a></p>
<table>
<caption align="bottom"><strong>Figure 4:</strong> A Bad Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq3.png" alt="Image SoundSeq3"></div>
</td>
</tr>
</table>
</div>
<div align="center"><a name="fig:SoundSeq2" id="fig:SoundSeq2"></a><a name="105"></a></p>
<table>
<caption align="bottom"><strong>Figure 5:</strong> A Good Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeq2.png" alt="Image SoundSeq2"></div>
</td>
</tr>
</table>
</div>
<p><font>Our claim: we can (given access to training data, and this is the age of data&nbsp;[<a href="#Halevy:2009p2327">HNP09</a>]) solve this problem with a local step that is a set of simple criticisms of proposed transcriptions. A good starting point is a database of previous sounds to text transcriptions. This database allows the construction of a series of tables that give the historic frequency (or probability) of all of the following:</font></p>
<ul>
<li>Prior probability of each sound</li>
<li>Probability of each sound given the immediately previous sound</li>
<li>Prior probability of each word</li>
<li>Probability of each word given the immediately previous word</li>
<li>Which combinations of word fragments are legitimate words</li>
<li>Probability of each sound being assigned to each word fragment (syllables, phonemes and so on).</li>
</ul>
<p><font>These tables encode a &#8220;speech model&#8221; (the rules involving sounds only), a language model (the rules involving text or words only) and the linkage between the two models. These models are deliberately simple in that they capture only local interactions (like probability of a word given the word before it) but no long range interactions (like subject predicate agreement).</font></p>
<p><font>Each box, nested box and arrow on our diagram represents one possible local critique. For each item in our diagram (again, the boxes and arrows) we can use our tables to assign a goodness or plausibility score. For instance bad word to word transitions (like &#8220;won&#8221; <!-- MATH<br />
 $\rightarrow$<br />
 --><br />
<img width="19" height="13" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg19.png" alt="$ \rightarrow$"> &#8220;won&#8221;) will be rare in our historic tables, so simply looking up probabilities from the tables (or, better, using the logarithms of probabilities) gives us a &#8220;plausibility score&#8221; that prefers known patterns of language. Then a score for the overall transcription can be derived by multiplying all of the local scores together (or by adding them, if we are working with logarithms). These local scores (though simple) already encode enough evidence to prefer the good transcription to the bad transcription <em>without</em> requiring any deep knowledge of speech, text or the meaning of the text. This is because the bad transcription has a series of obvious flaws such as unlikely sound to word fragment assignments and unlikely word to word transitions.</font></p>
<div align="center"><a name="fig:SoundSeqPartial" id="fig:SoundSeqPartial"></a><a name="116"></a></p>
<table>
<caption align="bottom"><strong>Figure 6:</strong> Naively Extending a Partial Transcription</caption>
<tr>
<td>
<div align="center"><img width="500" height="142" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/SoundSeqPartial.png" alt="Image SoundSeqPartial"></div>
</td>
</tr>
</table>
</div>
<p><font>For example consider Figure&nbsp;<a href="#fig:SoundSeqPartial">6</a> where a naive solver is considering the word &#8220;one&#8221; as the third word to fill in. The <em>only</em> local critiques it needs to consider are:</font></p>
<ul>
<li>how likely the word &#8220;one&#8221; is in general (call this <img width="49" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg20.png" alt="$ P[one]$"> )</li>
<li>how likely the word &#8220;one&#8221; is to follow the word &#8220;nine&#8221; (call this <!-- MATH<br />
 $P[one | nine]$<br />
 --><br />
<img width="86" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg21.png" alt="$ P[one \vert nine]$"> )</li>
<li>how likely the letter sequence &#8220;o&#8221; is given the sound &#8220;w&#8221; (call this <!-- MATH<br />
 $P[o | \text{w\textschwa}]$<br />
 --><br />
<img width="55" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg24.png" alt="$P[o \vert \text{w\textschwa}]$"> )</li>
<li>how likely the letter sequence &#8220;ne&#8221; is given the sound &#8220;n&#8221; (call this <!-- MATH<br />
 $P[ne | \text{n}]$<br />
 --><br />
<img width="41" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg25.png" alt="$ P[ne \vert$">&nbsp; &nbsp;n<img width="7" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg23.png" alt="$ ]$"> ).</li>
</ul>
<p><font>So the local plausibility of the fill-in word &#8220;one&#8221; is: <!-- MATH<br />
 $P[one]<br />
\times P[one | nine] \times P[o | \text{w\textschwa}] \times P[ne |<br />
\text{n}]$<br />
 --><br />
<img width="292" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg28.png" alt="$P[one] \times P[one \vert nine] \times P[o \vert \text{w\textschwa}] \times P[ne \vert \text{n}]$"> . We will call this the critique of &#8220;one&#8221; in position 3 and write it as <!-- MATH<br />
 $C_3(w_2,one)$<br />
 --><br />
<img width="84" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg29.png" alt="$ C_3(w_2,one)$"> where <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg30.png" alt="$ w_2$"> is the word known to be in position 2. Similarly we can generate all of the possible critiques <img width="53" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg31.png" alt="$ C_1(w_1)$"> , <!-- MATH<br />
 $C_2(w_1,w_2)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg32.png" alt="$ C_2(w_1,w_2)$"> , <!-- MATH<br />
 $C_3(w_2,w_3)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg33.png" alt="$ C_3(w_2,w_3)$"> , <!-- MATH<br />
 $C_4(w_3,w_4)$<br />
 --><br />
<img width="78" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg34.png" alt="$ C_4(w_3,w_4)$"> and the overall critique of a sequence <!-- MATH<br />
 $w_1 \; w_2 \; w_3 \; w_4$<br />
 --><br />
<img width="77" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg35.png" alt="$ w_1 \; w_2 \; w_3 \; w_4$"> : <!-- MATH<br />
 $C_1(w_1)<br />
\times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$<br />
 --><br />
<img width="336" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg36.png" alt="$ C_1(w_1) \times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$"> from our pre-computed tables of probabilities. Notice that for all of these critiques only the immediately previous word and the nearby sounds were used to determine the plausibility of the word we are attempting to fit in. Instead of using these critiques to directly fill in a possible solution (or using search) we will package up these critiques (in the form of the <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> ) and pass them on to a powerful separate globalization step called Dynamic Programming&nbsp;[<a href="#DynamicProgramming">Bel57</a>].</font></p>
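<p><font>To make the idea of a local critique concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the table names, the function name <code>critique</code>, the sound label <code>w_schwa</code> and all of the probabilities are made up, not taken from any real transcription database. It only shows that a critique is a product of a few table lookups, and that taking logarithms turns the product into a sum.</font></p>

```python
import math

# Hypothetical probability tables; every number is invented for illustration.
P_WORD = {"one": 0.03, "won": 0.01}
P_WORD_GIVEN_PREV = {("nine", "one"): 0.20, ("nine", "won"): 0.01}
P_FRAGMENT_GIVEN_SOUND = {
    ("o", "w_schwa"): 0.30,
    ("ne", "n"): 0.40,
    ("wo", "w_schwa"): 0.25,
    ("n", "n"): 0.50,
}


def critique(prev_word, word, alignment):
    """Local plausibility of `word` directly following `prev_word`.

    `alignment` lists the (letter_fragment, sound) pairs proposed for `word`.
    Only the previous word and the nearby sounds are consulted.
    """
    score = P_WORD[word] * P_WORD_GIVEN_PREV[(prev_word, word)]
    for fragment, sound in alignment:
        score *= P_FRAGMENT_GIVEN_SOUND[(fragment, sound)]
    return score


# Two candidate fill-ins for the third word, both following "nine".
c_one = critique("nine", "one", [("o", "w_schwa"), ("ne", "n")])
c_won = critique("nine", "won", [("wo", "w_schwa"), ("n", "n")])

# Logarithms turn products into sums, which avoids underflow on long sequences.
log_c_one = math.log(c_one)
```

<p><font>With these made-up tables the critique of &#8220;one&#8221; is larger than the critique of &#8220;won&#8221;, so the local score alone already prefers the familiar word-to-word transition.</font></p>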
<p><font>The globalization or finding of a best overall transcription is not trivial even though our score is simple. This is because the overall <em>best</em> sequence could depend on clever non-local fill-ins (like deliberately picking a less likely first word to allow a later favored transition to a fantastically good third word). Dynamic Programming does not fill in the transcription from left to right, but instead uses a table of scores derived from the left to right arrows and the <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> . In our example Dynamic Programming consists of building a table of information as shown in Figure&nbsp;<a href="#fig:DynBackFill">7</a>. Let <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> represent the word position we are looking at (so <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> ranges from 1 to 4) and let <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> be a variable that ranges over every word in the dictionary.
Our table is indexed by <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> and <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> and when filled in <img width="51" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg40.png" alt="$ T(i,w)$"> stores the highest &#8220;plausibility score&#8221; of any partial sequence of words where words 1 through <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> have been filled in and the <img width="9" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg38.png" alt="$ i$"> -th word is <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> .</font></p>
<div align="center"><a name="fig:DynBackFill" id="fig:DynBackFill"></a><a name="134"></a></p>
<table>
<caption align="bottom"><strong>Figure 7:</strong> Dynamic Programming: Back Chaining in <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> for a Solution</caption>
<tr>
<td>
<div align="center"><img width="300" height="298" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/DynTableBackFill.png" alt="Image DynTableBackFill"></div>
</td>
</tr>
</table>
</div>
<p><font>If we already had this magic table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> we could find a best possible sequence by &#8220;back chaining.&#8221; We start by finding a fourth word (<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg41.png" alt="$ w_4$"> ) such that <img width="61" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg42.png" alt="$ T(4,w_4)$"> is maximal (in this case &#8220;one&#8221;). We then find a best third word (<img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg43.png" alt="$ w_3$"> ) by enumerating all words and picking <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg43.png" alt="$ w_3$"> such that <!-- MATH<br />
 $T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$<br />
 --><br />
<img width="234" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg44.png" alt="$ T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$"> . We continue back until we have found words <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg30.png" alt="$ w_2$"> and <img width="22" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg45.png" alt="$ w_1$"> to get a complete best sequence. Notice that we work from right to left (backwards) and, except for the starting step, we pick each word to match the calculation we are trying to un-roll, not to be the maximal entry in the column. For instance we pick <!-- MATH<br />
 $w_1 = dial$<br />
 --><br />
<img width="70" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg46.png" alt="$ w_1 = dial$"> even though it does not have the highest score, because <!-- MATH<br />
 $T(1,dial) C_2(dial,nine)<br />
C_3(nine,one) C_4(one,one) = T(4,one)$<br />
 --><br />
<img width="433" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg47.png" alt="$ T(1,dial) C_2(dial,nine) C_3(nine,one) C_4(one,one) = T(4,one)$"> is the maximal complete chain.</font></p>
<p><font>Of course, we don&#8217;t start with the table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"> already filled in, so we need a procedure to build it. This procedure is the heart of the Dynamic Programming method (for more examples see: &#8220;Introduction to Algorithms&#8221;&nbsp;[<a href="#IntroductionToAlgorithms">CLRS09</a>]). Notice that <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg48.png" alt="$ T(1,w)$"> can be filled in for all <img width="15" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg39.png" alt="$ w$"> just by plugging in words and computing the critiques <img width="46" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg49.png" alt="$ C_1(w)$"> (i.e. <!-- MATH<br />
 $T(1,w) = C_1(w)$<br />
 --><br />
<img width="118" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg50.png" alt="$ T(1,w) = C_1(w)$"> ). Once all the <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg48.png" alt="$ T(1,w)$"> are filled in we can fill in the <img width="54" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg51.png" alt="$ T(2,w)$"> with the general (and slightly trickier) formula:</font></p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w)<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="249" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg52.png" alt="$\displaystyle T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w) $"></div>
<p><font>as we illustrate for <img width="74" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg53.png" alt="$ T(2,nine)$"> in Figure&nbsp;<a href="#fig:DynTable">8</a>.</font></p>
<div align="center"><a name="fig:DynTable" id="fig:DynTable"></a><a name="145"></a></p>
<table>
<caption align="bottom"><strong>Figure 8:</strong> Dynamic Programming: Building the Table <img width="27" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg1.png" alt="$ T()$"></caption>
<tr>
<td>
<div align="center"><img width="400" height="261" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/DynTableCalculate.png" alt="Image DynTableCalculate"></div>
</td>
</tr>
</table>
</div>
<p><font>The magic of the Dynamic Programming technique is: by being careful to not store too much in the table <img width="51" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg40.png" alt="$ T(i,w)$"> we avoid an explosion in record keeping that would render the method inefficient. Dynamic Programming exploits the small dependence structure encoded in <img width="32" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg37.png" alt="$ C_i()$"> (each box in our diagram depending on only a few arrows) and, as we have shown, can find &#8220;clever&#8221; solutions (such as taking a sub-optimal first word to get better transitions into preferred later words). For those who want more detail on solving this problem we recommend [<a href="#CharniakBook">Cha96</a>] (as our goal is not to fully explain Dynamic Programming, but to demonstrate how it could be applied to the transcription problem as a pre-packaged globalizer).</font></p>
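<p><font>The table-filling rule and the back chaining step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the figures: the function name <code>best_sequence</code>, the three-word vocabulary and every score below are hypothetical, and for brevity the toy critique ignores the position argument and the sounds entirely.</font></p>

```python
def best_sequence(words, c_first, c_next, length):
    """Maximize C_1(w_1) * C_2(w_1, w_2) * ... over word sequences.

    table[i][w] holds the best score of any partial sequence that fills
    positions 0..i and ends with word w (positions are 0-based here).
    """
    table = [{w: c_first(w) for w in words}]
    for i in range(1, length):
        row = {}
        for w in words:
            # T(i+1, w) = max over v of T(i, v) * C_{i+1}(v, w)
            row[w] = max(table[i - 1][v] * c_next(i, v, w) for v in words)
        table.append(row)

    # Back chaining: start from the best final word, then repeatedly pick
    # the predecessor that reproduces the stored score.
    last = max(words, key=lambda w: table[-1][w])
    sequence = [last]
    for i in range(length - 1, 0, -1):
        w = sequence[-1]
        prev = max(words, key=lambda v: table[i - 1][v] * c_next(i, v, w))
        sequence.append(prev)
    sequence.reverse()
    return table[-1][last], sequence


# Hypothetical scores, invented for illustration only.
WORDS = ["dial", "nine", "one"]
FIRST = {"dial": 0.3, "nine": 0.5, "one": 0.2}
TRANSITION = {("dial", "nine"): 0.8, ("nine", "one"): 0.7,
              ("nine", "nine"): 0.1, ("one", "one"): 0.2}


def toy_first(w):
    return FIRST[w]


def toy_next(i, v, w):
    # A real critique would also depend on position i (through the sounds).
    return TRANSITION.get((v, w), 0.01)


score, sequence = best_sequence(WORDS, toy_first, toy_next, 3)
```

<p><font>With these made-up numbers the best sequence starts with &#8220;dial&#8221; even though &#8220;nine&#8221; has the higher first-position score, which is the kind of non-greedy choice described above. The work is proportional to the sequence length times the square of the vocabulary size, rather than growing exponentially as enumerating every sequence would.</font></p>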
<p><font>In this example the local step was the graph link based critiques and the globalization step was Dynamic Programming. The separation of concerns from the local scoring to the globalizing step is a strength of the local to global principle.</font></p>
<h2><a name="SECTION00033000000000000000" id="SECTION00033000000000000000">Machine Learning</a></h2>
<p><font>Our final example application is machine learning. Machine learning is loosely defined as computer programs that adapt or learn from data. Thomas Mitchell helps distinguish this activity as a specialty of artificial intelligence that concentrates on &#8220;well-posed learning problems.&#8221;&nbsp;[<a href="#MitchellML">Mit97</a>] Trevor Hastie, Robert Tibshirani and Jerome Friedman emphasize the relation to statistics (versus more traditional symbolic AI)&nbsp;[<a href="#TibHat">TH09</a>]. A simple demonstration can be found in [<a href="#MLArt">Mou09b</a>].</font></p>
<p><font>Machine learning is perhaps the strongest example of the local to global principle, and our treatment of it is inspired by the work of Kristin P. Bennett and Emilio Parrado-Hernandez&nbsp;[<a href="#Bennett:2006p400">BPH06</a>]. In hindsight many machine learning algorithms (each of which has had a turn at being &#8220;the most exciting breakthrough ever&#8221; for a while) can be seen as the pairing of a performance criterion (which we call a local criterion as it applies to one specific set of parameter values at a time) and an optimization method (what we have been calling the globalization step). The work of Bennett and Parrado-Hernandez calls this distinction out and shows that it is more productive to break machine learning systems into an objective function and an optimizer than to present them as unique named monolithic units. This allows both the choice of better optimizers (such as replacing the inferior gradient descent method wherever it occurs) and explicit control of important concepts such as hypothesis regularization and control of over-fitting (which some algorithms claim to achieve by deliberately exiting early from an inferior optimizer).</font></p>
<p><font>At a &#8220;30,000 feet level&#8221; we can build a table of common machine learning techniques and name what is commonly used to implement their local and global steps. When a machine learning algorithm is defined by what conditions are meant to be true at the optimum we are no longer bound by details of the original implementation and can examine, fix and improve the components.<a name="tex2html17" href="#foot154" id="tex2html17"><sup>7</sup></a> Table&nbsp;<a href="#fig:MachineLearning">1</a> is a crude summary of a wide selection of machine learning algorithms that may be more likely to offend everybody than just offend somebody. But this is also the point: it is the algorithmist&#8217;s job to think fluidly (beyond given names and provenances) and to invent scaffolding to convert partial analogies into practical correspondences.</font></p>
<p></p>
<div align="center"><a name="190"></a></p>
<table>
<caption><strong>Table 1:</strong> Various Machine Learning Techniques</caption>
<tr>
<td>
<div align="center">
<table cellpadding="3" border="1" align="center">
<tr>
<td align="left" valign="top" width="180"><font size="-1">Machine Learning Method</font></td>
<td align="left" valign="top" width="144"><font size="-1">Local Criterion</font></td>
<td align="left" valign="top" width="144"><font size="-1">Globalization Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Linear Regression [<a href="#Breiman:1997p1133">BF97</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Linear Discriminant Analysis [<a href="#Fisher:1936p2576">Fis36</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Logistic Regression [<a href="#Komarek:2008p1742">Kom08</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">logit penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Newton&#8217;s Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Perceptron [<a href="#Beigel:1991p1027">BRS91</a>] [<a href="#Blum:2002p1867">BD02</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">error rate</font></td>
<td align="left" valign="top" width="144"><font size="-1">error based update</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Naive Bayes [<a href="#Maron:2000p2553">MK00</a>] [<a href="#Maron:1961p2566">Mar61</a>] [<a href="#Lewis:1998p105">Lew98</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">frequency tables</font></td>
<td align="left" valign="top" width="144"><font size="-1">arithmetic</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Nearest Neighbor [<a href="#Ailon:2006p872">AC06</a>] [<a href="#Indyk:1999p166">IM99</a>] [<a href="#Andoni:2006p52">AI06</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">Kernel Methods</font></td>
<td align="left" valign="top" width="144"><font size="-1">enumeration,<br />
projection</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Decision Trees [<a href="#bfso:1984">BFSO84</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">information theory</font></td>
<td align="left" valign="top" width="144"><font size="-1">partitioning</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">clustering [<a href="#Cilibrasi:2005p8">CV05</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">square error</font></td>
<td align="left" valign="top" width="144"><font size="-1">partitioning</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">MaxEnt [<a href="#Grunwald:2000p108">Gru00</a>] [<a href="#Grunwald:2004p739">GD04</a>] [<a href="#Skilling:1988p780">Ski88</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">entropy penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Newton&#8217;s Method</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Neural Net with Back Propagation [<a href="#NNCPE">Hus99</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">sigmoid penalty function</font></td>
<td align="left" valign="top" width="144"><font size="-1">Automatic Differentiation,<br />
steepest descent</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Winnow [<a href="#Kivinen:1995p1836">KWA95</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">error rate</font></td>
<td align="left" valign="top" width="144"><font size="-1">multiplicative error based update</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Boosting [<a href="#Freund:1999p1015">FS99</a>] [<a href="#Breiman:2000p1134">Bre00</a>] [<a href="#Collins:2002p1008">CSS02</a>] [<a href="#Trevisan:2008p2166">TTV08</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">weighted errors,<br />
data re-weighting</font></td>
<td align="left" valign="top" width="144"><font size="-1">Conjugate Gradient</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">HMM [<a href="#Kristjansson:2004p545">KCVM04</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">probability penalty</font></td>
<td align="left" valign="top" width="144"><font size="-1">Gibbs Sampler</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Latent Dirichlet Allocation [<a href="#Blei:2003p1063">BNJ03</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">KL divergence</font></td>
<td align="left" valign="top" width="144"><font size="-1">Variational Methods</font></td>
</tr>
<tr>
<td align="left" valign="top" width="180"><font size="-1">Support Vector Machine [<a href="#Joachims:1998p406">Joa98</a>] [<a href="#SVMBook">STC00</a>]</font></td>
<td align="left" valign="top" width="144"><font size="-1">L1 Margin,<br />
Kernel Methods</font></td>
<td align="left" valign="top" width="144"><font size="-1">Quadratic Optimization</font></td>
</tr>
</table>
</div>
<p><a name="fig:MachineLearning" id="fig:MachineLearning"></a></td>
</tr>
</table>
</div>
<p></p>
<p><font>This table is a necessarily crude summary. For example: notice that several known techniques can not even be distinguished from each other by the local and global columns of the table.</font></p>
<p><font>There are a few points we would like to make. Back propagation was considered unique to Neural Nets for quite a while: it was so entwined with the technique that it was not recognized as the simple application of Automatic Differentiation&nbsp;[<a href="#Rall:1996p2473">RC96</a>] that it is. Support Vector Machines (SVM) are remarkable for their uniformly very good choice of component methods (maximum L1 margin objective regularization, Kernel Methods&nbsp;[<a href="#KernBook">STC04</a>] and sophisticated optimization methods&nbsp;[<a href="#Joachims:2006p403">Joa06</a>]). Many of the machine learning methods that SVM outperforms become competitive again when they adopt some of SVM&#8217;s technologies (especially using kernel methods to produce synthetic features).</font></p>
<p><font>Beyond these points we invoke a &#8220;globalizers are pre-packaged&#8221; principle and leave the discussion of machine learning and optimization to our reference: [<a href="#Bennett:2006p400">BPH06</a>]. In this example the local step is a per-example score or penalty and the globalization step is optimization.</font></p>
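<p><font>The separation of objective function from optimizer can be illustrated with the first row of the table. The Python sketch below is only an illustration under stated assumptions: the data points, function names, step size and iteration count are all made up. It fits a line by square error twice, once with closed-form linear algebra and once with gradient descent, to show that the local criterion fixes the answer and the globalizer can be swapped freely.</font></p>

```python
def square_error(params, data):
    """Local criterion: scores one specific (slope, intercept) pair."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in data)


def solve_by_linear_algebra(data):
    """Globalizer 1: solve the 2x2 normal equations in closed form."""
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    det = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / det
    b = (sy * sxx - sx * sxy) / det
    return a, b


def solve_by_gradient_descent(data, rate=0.01, steps=5000):
    """Globalizer 2: slower, but minimizes the very same objective."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = sum(2 * (a * x + b - y) * x for x, y in data)
        grad_b = sum(2 * (a * x + b - y) for x, y in data)
        a -= rate * grad_a
        b -= rate * grad_b
    return a, b


# Hypothetical data, invented for illustration only.
DATA = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]
exact = solve_by_linear_algebra(DATA)
approx = solve_by_gradient_descent(DATA)
```

<p><font>Both globalizers arrive at (to within rounding) the same slope and intercept, because the optimum is defined by the objective and not by the procedure used to reach it. The step size of 0.01 is an assumption that happens to be small enough for this data; a different data set could need a different value.</font></p>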
<h1><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Some Methods</a></h1>
<p><font>The application of the local to global principle is similar to the Feynman &#8220;genius method.&#8221; Feynman&#8217;s method is to always have in mind a list of problems and a list of solution methods. The genius step is: any time you see a new problem or a new solution method, immediately try it against every item from the complementary list.&nbsp;[<a href="#IndiscreteThoughts">Rot97</a>, &#8220;Ten Lessons I Wish I Had Been Taught&#8221;] This deliberate retention and activity greatly increases your problem solving ability. The power of the local to global principle is itself proportional to the number of local methods times the number of globalization strategies. Of course, to even start, the practitioner must already have available a number of candidate local and globalization methods. We list some methods and some guidance on variation and invention.</font></p>
<h2><a name="SECTION00041000000000000000" id="SECTION00041000000000000000">Local Methods</a></h2>
<p><font><img width="100" height="100" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/nails.jpg" alt="Image nails"> Good sources of ideas and analogies for local methods include:</font></p>
<ul>
<li>Introduce a Graph Structure
<p>A graph structure is a network of nodes connected by edges. Use of graphs was demonstrated both in the natural language processing and web page link analysis examples. We can dress up how we solved these problems and say we used a &#8220;Hidden Markov Model&#8221;, but the real power was we encoded our problem in a simple graph. Some problems (especially those from logic or those involving time) are essentially solved once they are translated out of their original form and into graph notation (for an example see: [<a href="#Mount:2000p360">Mou00</a>]).</p>
</li>
<li>Appeal to Physical Conservation Laws
<p>A good example physical law is Kirchhoff&#8217;s law or conservation of flow. All of the web page link analysis&#8217;s equations were derived by saying that the attention at a node is essentially the sum of attentions from other nodes (more sophisticated versions of the analysis actually do create and destroy flow, but they do it in a principled way).</p>
</li>
<li>Encode the Problem into an Objective Function
<p>This method is essentially your declaration that you intend to use an optimizer for the globalization step. In operations research this specific technique has long been the practice (with no disrespect: a very productive part of operations research has been translating different problems into linear programs so the simplex method can be applied, for an example see [<a href="#TradeArt">Mou09a</a>]).</p>
</li>
<li>Gradient Like Computations
<p>Includes Gradients, Secants, Lagrangians and other ideas from calculus. Gradients can drive optimizer based globalizers, and techniques like Lagrangians are often powerful enough to use mere inspection as the globalization step.</p>
</li>
<li>Violation Driven Updates
<p>This method is particularly effective when your problem is not amenable to continuous optimization. A good example is the Lin-Kernighan heuristic for solving the traveling salesman problem.[<a href="#Lin:1973p2739">LK73</a>] This heuristic looks at subsets of the problem and suggests improving &#8220;surgeries&#8221; (until no more such improvements are possible).</p>
</li>
<li>Introduction of Symbols
<p>Often, as with the web page link analysis example, you can not specify specific values for the unknowns, but you can specify relationships. You often can then solve for the symbols or introduce additional conditions and use an optimizer to complete the solution (see for example the maximum entropy method as described in [<a href="#Skilling:1988p780">Ski88</a>]).</p>
</li>
<li>Over Specification
<p>If we anticipate using a global step like search, enumeration, summation or integration then over specification is a good local idea.</p>
<p>For example: consider computing the probability that a fair coin flipped 10 times comes up heads exactly 3 times. The easiest way to perform this calculation is to specify exactly which 3 coins come up heads (the local over-specification) and then sum over all choices of 3 out of 10 coins (the global step). In mathematical notation this is:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
P[\text{exactly 3 heads out of 10 flips}] = \binom{10}{3} 2^{-10} \approx 0.117<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="20" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg54.png" alt="$\displaystyle P[$">exactly 3 heads out of 10 flips<img width="157" height="54" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg55.png" alt="$\displaystyle ] = \binom{10}{3} 2^{-10} \approx 0.117 $"></div>
<p>or just under 12%.</p>
</li>
<li>Under Specification
<p>One of the core principles of Dynamic Programming is to forget as much as possible about partial solutions, keeping only partial solution cost and just enough information to extend partial solutions. If you anticipate using something like Dynamic Programming as your globalization step then your goal should be to under specify.</p>
</li>
<li>Tables
<p>A key step of the natural language processing example was the use of tables of past experience to determine which sounds likely corresponded to which words, which words likely followed each other (and so on). Encoding domain knowledge or expertise as probability tables is a very effective problem solving strategy (especially if the globalization strategy is going to be search or Dynamic Programming). In natural language processing examples tables and statistics are <em>much</em> easier to manage than comprehensive rules or grammars.</p>
</li>
<li>Set up as Ranking or Machine Learning Problem
<p>This tactic is especially appropriate if your solution success metric is counts, frequencies or probabilities (instead of having to always be correct or always be optimal).</p>
</li>
</ul>
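<p>The coin-flip calculation from the Over Specification item above can be checked numerically. The following is a minimal Python sketch (the function name <code>prob_exact_heads</code> is ours, introduced only for illustration): it names exactly which flips come up heads (the local over-specification), sums the probability of each such choice (the global step), and compares the result with the closed form.</p>

```python
from itertools import combinations
from math import comb


def prob_exact_heads(n_flips, n_heads):
    """Probability of exactly n_heads heads in n_flips flips of a fair coin."""
    # Every fully specified head/tail sequence is equally likely.
    p_one_outcome = 0.5 ** n_flips
    total = 0.0
    # Local step: say exactly which flips are heads.
    # Global step: sum over every such choice.
    for _heads_positions in combinations(range(n_flips), n_heads):
        total += p_one_outcome
    return total


p = prob_exact_heads(10, 3)
closed_form = comb(10, 3) * 2.0 ** -10
print(p, closed_form)
```

<p>The enumeration is far slower than the closed form, but it makes the division of labor explicit: the local step is trivial once the outcome is over-specified, and all of the work moves into the summation.</p>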
<h2><a name="SECTION00042000000000000000" id="SECTION00042000000000000000">Globalization Methods</a></h2>
<p><font><img width="100" height="100" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/hammer.jpg" alt="Image hammer"> The universe of possible globalization methods is very diverse (in particular globalization is not always optimization).</font></p>
<ul>
<li>Search / Enumeration
<p>Search can be slow, but it is always an option to consider. If your problem translates naturally into a graph structure, or your solutions are naturally seen as being composed of small pieces, search should be considered. One of the big advantages of using the local phase to formally encode your problem&#8217;s structure and putting search off to the global phase is that you can use advanced search techniques. Once you are freed from your specific problem details it becomes much easier to consider search techniques like branch and bound, A*, game theoretic search and general speed-up techniques like hashing and caching.</p>
</li>
<li>Dynamic Programming
<p>If your problem has a bit more structure (in that partial solutions summarize and compose easily) then you can likely replace search with Dynamic Programming. The advantage is that Dynamic Programming typically offers an incredible speed up when compared to search.</p>
</li>
<li>Optimization
<p>If your problem is continuous (involves numbers instead of discrete or categorical decisions), can be encoded as a reasonable objective function (linear, positive definite quadratic) and has reasonable constraints (linear or convex) then you can immediately apply an optimizer as your globalization step. Typical optimization methods include: conjugate gradient, Newton methods, quasi Newton methods, linear programming and quadratic programming.</p>
</li>
<li>Combinatorial Optimization
<p>If your problem includes &#8220;discrete variables&#8221; (that is, variables that take on one of a fixed set of values instead of values from a numeric range) then you may not be able to apply standard optimization techniques. At this point you may want to use more expensive combinatorial optimization techniques like integer linear programming or constraint satisfaction.</p>
</li>
<li>Fixed Point Methods / Iteration
<p>Fixed point methods are based on the idea: &#8220;incrementally improve until there is no incremental improvement possible.&#8221; If the problem is continuous this is similar to steepest descent. If the problem is discrete then this is similar to the Lin-Kernighan heuristic.</p>
</li>
<li>Linear Algebra
<p>The web page link analysis and optimization examples were essentially solved once we reduced them to linear algebra. If you can write your problem as a linear relationship between unknowns or as the fixed-point of a linear operator (i.e. an <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg56.png" alt="$ x$"> such that <img width="54" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/LTGimg57.png" alt="$ A x = x$"> ) then you can immediately use linear algebra to solve the problem at very large scale (e.g. web scale).</p>
</li>
<li>Sampling / Problem Kernels
<p>A very successful line of attack on large problems is to reduce to a smaller problem containing most of the essential difficulty. David Karger has produced a number of effective algorithms for graph cuts and flows using a theory of sampling&nbsp;[<a href="#Karger:1998p556">Kar98</a>]. Rod Downey and M. Fellows have demonstrated an effective theory of &#8220;problem kernels&#8221; that finds solutions by focusing on smaller sub-problems (on which we can afford to use more expensive procedures).[<a href="#DF98">DF98</a>]</p>
</li>
<li>Amortized Analysis / Economic Mechanism Methods
<p>Daniel Sleator and Robert Tarjan&#8217;s ideas of amortized analysis&nbsp;[<a href="#Sleator:1985p168">ST85</a>] allow approximation schemes similar to problem kernels. The method is to approximately optimize by pairing a bunch of unavoidable large penalties (conditions we can&#8217;t meet) with some accounting credits (say bonuses from other conditions we are meeting very well). We then isolate these paired items and optimize the rest of the problem exactly. The technique often works by showing the approximation can not be too bad because, due to the pairing of large penalties to good credits, there can not be too many large penalties. An informal example is: if it is impossible to pick someplace where all of an office will eat for lunch, perhaps you can solve the problem by paying one person to accept a restaurant they do not like (if the removal of their objection opens up a venue that is acceptable to everybody else).</p>
</li>
<li>Relaxation / Homotopic methods
<p>These methods involve changing hard constraints to soft penalties (so allowing the constraints to be violated, but at a slowly increasing cost). After such a relaxation the homotopic (or continuous deformation) method is to increase the cost of violation and re-solve to try and get a trajectory of better and better nearly acceptable solutions that point to a possible overall solution.</p>
</li>
</ul>
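<p>As an illustration of the Linear Algebra item above, the fixed point of a linear operator (an x such that A x = x) can be approximated by repeatedly applying the operator. The following is a minimal Python sketch, not the method used by any particular system: the 3 by 3 matrix is made up, the function name <code>fixed_point</code> is ours, and convergence is assumed (it holds for this matrix, but not for every A).</p>

```python
def fixed_point(A, iterations=200):
    """Approximate x with A x = x by repeatedly applying A (power iteration)."""
    n = len(A)
    x = [1.0 / n] * n  # start from the uniform vector
    for _ in range(iterations):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(y)
        x = [v / total for v in y]  # rescale so the entries sum to 1
    return x


# Hypothetical column-stochastic matrix: column j holds the odds of
# moving from state j to each state i (each column sums to 1).
A = [
    [0.5, 0.25, 0.0],
    [0.5, 0.5, 0.5],
    [0.0, 0.25, 0.5],
]
x = fixed_point(A)
# How far A x is from x, entry by entry.
residual = max(
    abs(sum(A[i][j] * x[j] for j in range(3)) - x[i]) for i in range(3)
)
print(x, residual)
```

<p>At large scale the same loop would be run with a sparse matrix representation, since each pass only needs one matrix-vector product.</p>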
<h1><a name="SECTION00050000000000000000" id="SECTION00050000000000000000">Conclusion</a></h1>
<p><font>The purpose of this article has been to make more visible an idea we call the local to global principle. This principle is an organizing tool useful both in designing and analyzing a wide variety of applications. Essentially the whole point of this writeup is to set up enough framework to quickly write down a table of advice such as Table&nbsp;<a href="#fig:ProblemTable">2</a> (and for such a table to mean something).</font></p>
<p></p>
<div align="center"><a name="227"></a></p>
<table>
<caption><strong>Table 2:</strong> Various Applications, Local Steps and Global Steps</caption>
<tr>
<td>
<div align="center">
<table cellpadding="3" border="1" align="center">
<tr>
<td align="left"><font size="-1">Example</font></td>
<td align="left"><font size="-1">Local Step</font></td>
<td align="left"><font size="-1">Global Step</font></td>
</tr>
<tr>
<td align="left"><font size="-1">speech transcription</font></td>
<td align="left"><font size="-1">tables</font></td>
<td align="left"><font size="-1">Dynamic Programming</font></td>
</tr>
<tr>
<td align="left"><font size="-1">PageRank</font></td>
<td align="left"><font size="-1">graph structure, linear equations</font></td>
<td align="left"><font size="-1">Linear Algebra</font></td>
</tr>
<tr>
<td align="left"><font size="-1">machine learning</font></td>
<td align="left"><font size="-1">objective function</font></td>
<td align="left"><font size="-1">optimization</font></td>
</tr>
</table>
</div>
<p><a name="fig:ProblemTable" id="fig:ProblemTable"></a></td>
</tr>
</table>
</div>
<p></p>
<p><font>The principle is not universal; not everything can be fit into such a table. For example the local to global decoupling is <em>not</em> a feature of the famous EM algorithm&nbsp;[<a href="#Dempster:1977p761">DLR77</a>], which depends on mixing predictions and corrections.</font></p>
<p><font>To conclude: the recipe is as follows. If you come to a problem with a large shopping bag of possible ways to build local criteria and powerful globalization procedures then you stand a very good chance of solving the problem quickly. Also, if you keep the local to global principle in mind you are more likely to identify and retain potential local tricks and globalizers when you see them, and thus have a larger, more nimble set of tools available to solve problems when the time comes.</font></p>
<h2><a name="SECTION00060000000000000000" id="SECTION00060000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Ailon:2006p872" id="Ailon:2006p872">AC06</a></dt>
<dd>Nir Ailon and Bernard Chazelle, <i>Approximate nearest neighbors and the fast johnson-lindenstrauss transform</i>, STOC (2006).</dd>
<dt><a name="Andoni:2006p52" id="Andoni:2006p52">AI06</a></dt>
<dd>Alexandr Andoni and Piotr Indyk, <i>Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions</i>.</dd>
<dt><a name="Blum:2002p1867" id="Blum:2002p1867">BD02</a></dt>
<dd>Avrim Blum and John Dunagan, <i>Smoothed analysis of the perceptron algorithm for linear programming</i>, SODA (2002), 11.</dd>
<dt><a name="DynamicProgramming" id="DynamicProgramming">Bel57</a></dt>
<dd>Richard Bellman, <i>Dynamic programming</i>, Princeton University Press, 1957.</dd>
<dt><a name="Breiman:1997p1133" id="Breiman:1997p1133">BF97</a></dt>
<dd>Leo Breiman and Jerome&nbsp;H Friedman, <i>Predicting multivariate responses in multiple linear regression</i>, Journal of the Royal Statistical Society, Series B (Methodological) <b>59</b> (1997), no.&nbsp;1, 3-54.</dd>
<dt><a name="bfso:1984" id="bfso:1984">BFSO84</a></dt>
<dd>Leo Breiman, Jerome Friedman, Charles&nbsp;J. Stone, and R.&nbsp;A. Olshen, <i>Classification and regression trees</i>, Chapman &amp; Hall/CRC, January 1984.</dd>
<dt><a name="Blei:2003p1063" id="Blei:2003p1063">BNJ03</a></dt>
<dd>David&nbsp;M Blei, Andrew&nbsp;Y Ng, and Michael&nbsp;I Jordan, <i>Latent dirichlet allocation</i>, Journal of Machine Learning Research <b>3</b> (2003), 993-1022.</dd>
<dt><a name="Bennett:2006p400" id="Bennett:2006p400">BPH06</a></dt>
<dd>Kristin&nbsp;P. Bennett and Emilio Parrado-Hernandez, <i>The interplay of optimization and machine learning research</i>, Journal of Machine Learning Research <b>7</b> (2006), 1265-1281.</dd>
<dt><a name="Breiman:2000p1134" id="Breiman:2000p1134">Bre00</a></dt>
<dd>Leo Breiman, <i>Special invited paper. additive logistic regression: A statistical view of boosting: Discussion</i>, Ann. Statist. <b>28</b> (2000), no.&nbsp;2, 374-377.</dd>
<dt><a name="Beigel:1991p1027" id="Beigel:1991p1027">BRS91</a></dt>
<dd>R&nbsp;Beigel, N&nbsp;Reingold, and D&nbsp;Spielman, <i>The perceptron strikes back</i>, Structure in Complexity Theory Conference <b>6</b> (1991), 286-291.</dd>
<dt><a name="CharniakBook" id="CharniakBook">Cha96</a></dt>
<dd>Eugene Charniak, <i>Statistical language learning</i>, MIT Press, 1996.</dd>
<dt><a name="Charniak:1997p1484" id="Charniak:1997p1484">Cha97</a></dt>
<dd>Eugene Charniak, <i>Statistical techniques for natural language parsing</i>, AI Magazine <b>18</b> (1997), no.&nbsp;4, 33-44.</dd>
<dt><a name="IntroductionToAlgorithms" id="IntroductionToAlgorithms">CLRS09</a></dt>
<dd>Thomas&nbsp;H. Cormen, Charles&nbsp;E. Leiserson, Ronald&nbsp;L. Rivest, and Clifford Stein, <i>Introduction to algorithms</i>, MIT Press, 2009.</dd>
<dt><a name="Collins:2002p1008" id="Collins:2002p1008">CSS02</a></dt>
<dd>Michael Collins, Robert&nbsp;E Schapire, and Yoram Singer, <i>Logistic regression, adaboost and bregman distances</i>, Machine Learning <b>48</b> (2002), no.&nbsp;1/2/3, 30.</dd>
<dt><a name="Cilibrasi:2005p8" id="Cilibrasi:2005p8">CV05</a></dt>
<dd>Rudi Cilibrasi and Paul&nbsp;M.B Vitanyi, <i>Clustering by compression</i>, IEEE Transactions on Information Theory <b>51</b> (2005), no.&nbsp;4, 1523-1545.</dd>
<dt><a name="DF98" id="DF98">DF98</a></dt>
<dd>Rod&nbsp;G. Downey and M.&nbsp;R. Fellows, <i>Parameterized complexity</i>, Monographs in Computer Science, Springer, November 1998.</dd>
<dt><a name="Dempster:1977p761" id="Dempster:1977p761">DLR77</a></dt>
<dd>A&nbsp;P Dempster, N&nbsp;M Laird, and D&nbsp;B Rubin, <i>Maximum likelihood from incomplete data via the em algorithm</i>, Journal of the Royal Statistical Society, Series B (Methodological) <b>39</b> (1977), no.&nbsp;1, 1-38.</dd>
<dt><a name="Fisher:1936p2576" id="Fisher:1936p2576">Fis36</a></dt>
<dd>Ronald&nbsp;A Fisher, <i>The use of multiple measurements in taxonomic problems</i>, Annals of Eugenics <b>7</b> (1936), 179-188.</dd>
<dt><a name="Freund:1999p1015" id="Freund:1999p1015">FS99</a></dt>
<dd>Yoav Freund and Robert&nbsp;E Schapire, <i>A short introduction to boosting</i>, Journal of Japanese Society for Artificial Intelligence <b>14</b> (1999), no.&nbsp;5, 771-780.</dd>
<dt><a name="Grunwald:2004p739" id="Grunwald:2004p739">GD04</a></dt>
<dd>Peter&nbsp;D Grunwald and A&nbsp;Philip Dawid, <i>Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory</i>, Ann. Statist. <b>32</b> (2004), no.&nbsp;4, 1367-1433.</dd>
<dt><a name="Grunwald:2000p108" id="Grunwald:2000p108">Gru00</a></dt>
<dd>PD&nbsp;Grunwald, <i>Maximum entropy and the glasses you are looking through</i>, Conference on Uncertainty in Artificial Intelligence (2000), 238-246.</dd>
<dt><a name="Halevy:2009p2327" id="Halevy:2009p2327">HNP09</a></dt>
<dd>Alon Halevy, Peter Norvig, and Fernando Pereira, <i>The unreasonable effectiveness of data</i>, IEEE Intelligent Systems (2009).</dd>
<dt><a name="NNCPE" id="NNCPE">Hus99</a></dt>
<dd>Dirk Husmeier, <i>Neural networks for conditional probability estimation</i>, Springer, 1999.</dd>
<dt><a name="Indyk:1999p166" id="Indyk:1999p166">IM99</a></dt>
<dd>Piotr Indyk and Rajeev Motwani, <i>Approximate nearest neighbors: Towards removing the curse of dimensionality</i>.</dd>
<dt><a name="Joachims:1998p406" id="Joachims:1998p406">Joa98</a></dt>
<dd>Thorsten Joachims, <i>Making large-scale svm learning practical</i>, Advances in Kernel Methods &#8211; Support Vector Learning (1998).</dd>
<dt><a name="Joachims:2006p403" id="Joachims:2006p403">Joa06</a></dt>
<dd>Thorsten Joachims, <i>Training linear svms in linear time</i>, KDD (2006).</dd>
<dt><a name="Karger:1998p556" id="Karger:1998p556">Kar98</a></dt>
<dd>David&nbsp;R Karger, <i>Randomization in graph optimization problems: A survey</i>, Optima: Mathematical Programming Society Newsletter <b>58</b> (1998).</dd>
<dt><a name="Kristjansson:2004p545" id="Kristjansson:2004p545">KCVM04</a></dt>
<dd>Trausti Kristjansson, Aron Culotta, Paul Viola, and Andrew&nbsp;Kachites McCallum, <i>Interactive information extraction with constrained conditional random fields</i>, AAAI (2004).</dd>
<dt><a name="Kleinberg:1997p32" id="Kleinberg:1997p32">Kle97</a></dt>
<dd>Jon&nbsp;M Kleinberg, <i>Authoritative sources in a hyperlinked environment</i>, ACM SIAM Symposium on Discrete Algorithms (1997).</dd>
<dt><a name="Komarek:2008p1742" id="Komarek:2008p1742">Kom08</a></dt>
<dd>Paul Komarek, <i>Logistic regression for data mining and high-dimensional classification</i>, CMU CS Thesis (2008), 138.</dd>
<dt><a name="Kivinen:1995p1836" id="Kivinen:1995p1836">KWA95</a></dt>
<dd>J&nbsp;Kivinen, Manfred&nbsp;K Warmuth, and P&nbsp;Auer, <i>The perceptron algorithm vs. winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant</i>, COLT (1995), 289-296.</dd>
<dt><a name="Lewis:1998p105" id="Lewis:1998p105">Lew98</a></dt>
<dd>David&nbsp;D Lewis, <i>Naive (bayes) at forty: The independence assumption in information retrieval</i>, find journal (1998).</dd>
<dt><a name="Lin:1973p2739" id="Lin:1973p2739">LK73</a></dt>
<dd>S&nbsp;Lin and BW&nbsp;Kernighan, <i>An effective heuristic algorithm for the traveling-salesman problem</i>, Operations Research (1973), 498-516.</dd>
<dt><a name="Maron:1961p2566" id="Maron:1961p2566">Mar61</a></dt>
<dd>M&nbsp;E Maron, <i>Automatic indexing: An experimental inquiry</i>, RAND Technical Report (1961), 404-417.</dd>
<dt><a name="HTSMH" id="HTSMH">MF00</a></dt>
<dd>Zbigniew Michalewicz and David&nbsp;B. Fogel, <i>How to solve it: Modern heuristics</i>, Springer, 2000.</dd>
<dt><a name="Mill" id="Mill">Mil02</a></dt>
<dd>John&nbsp;Stuart Mill, <i>A system of logic</i>, University Press of the Pacific, 2002.</dd>
<dt><a name="MitchellML" id="MitchellML">Mit97</a></dt>
<dd>Thomas Mitchell, <i>Machine learning</i>, McGraw-Hill, 1997.</dd>
<dt><a name="Maron:2000p2553" id="Maron:2000p2553">MK00</a></dt>
<dd>M&nbsp;E Maron and J&nbsp;L Kuhns, <i>On relevance, probabilistic indexing and information retrieval</i>, 1960 (2000), 1-29.</dd>
<dt><a name="Mount:2000p360" id="Mount:2000p360">Mou00</a></dt>
<dd>John&nbsp;A Mount, <i>Automatic detection of potential deadlock</i>, Dr. Dobbs Journal (2000).</dd>
<dt><a name="TradeArt" id="TradeArt">Mou09a</a></dt>
<dd>John Mount, <i>Automatic generation and testing of un-rolls for profitable technical trades</i>, <a href="http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/">http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/</a>, 2009.</dd>
<dt><a name="MLArt" id="MLArt">Mou09b</a></dt>
<dd>John Mount, <i>A demonstration of data mining</i>, <a href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/</a>, 2009.</dd>
<dt><a name="Page:1998p2689" id="Page:1998p2689">PBMW98</a></dt>
<dd>Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, <i>The pagerank citation ranking: Bringing order to the web</i>, <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768</a> (1998).</dd>
<dt><a name="Polya1" id="Polya1">Pol54a</a></dt>
<dd>G.&nbsp;Polya, <i>Induction and analogy in mathematics</i>, Princeton University Press, 1954.</dd>
<dt><a name="Polya2" id="Polya2">Pol54b</a></dt>
<dd>G.&nbsp;Polya, <i>Patterns of plausible inference</i>, Princeton University Press, 1954.</dd>
<dt><a name="citeulike:679515" id="citeulike:679515">Pol71</a></dt>
<dd>G.&nbsp;Polya, <i>How to solve it</i>, Princeton University Press, November 1971.</dd>
<dt><a name="Rall:1996p2473" id="Rall:1996p2473">RC96</a></dt>
<dd>Louis&nbsp;B Rall and George&nbsp;F Corliss, <i>An introduction to automatic differentiation</i>, SIAM: Computational Differentiation: Techniques, Applications and Tools (1996), 1-18.</dd>
<dt><a name="IndiscreteThoughts" id="IndiscreteThoughts">Rot97</a></dt>
<dd>Gian-Carlo Rota, <i>Indiscrete thoughts</i>, Birkhauser, 1997.</dd>
<dt><a name="Skilling:1988p780" id="Skilling:1988p780">Ski88</a></dt>
<dd>John Skilling, <i>The axioms of maximum entropy</i>, Maximum Entropy and Bayesian Methods in Science and Engineering <b>1</b> (1988), 173-187.</dd>
<dt><a name="Sleator:1985p168" id="Sleator:1985p168">ST85</a></dt>
<dd>Daniel&nbsp;Dominic Sleator and Robert&nbsp;Endre Tarjan, <i>Amortized efficiency of list update and paging rules</i>, Communications of the ACM <b>28</b> (1985), no.&nbsp;2.</dd>
<dt><a name="SVMBook" id="SVMBook">STC00</a></dt>
<dd>John Shawe-Taylor and Nello Cristianini, <i>Support vector machines</i>, Cambridge University Press, 2000.</dd>
<dt><a name="KernBook" id="KernBook">STC04</a></dt>
<dd>John Shawe-Taylor and Nello Cristianini, <i>Kernel methods for pattern analysis</i>, Cambridge University Press, 2004.</dd>
<dt><a name="Strang" id="Strang">Str76</a></dt>
<dd>Gilbert Strang, <i>Linear algebra and its applications</i>, Academic Press, Inc., 1976.</dd>
<dt><a name="TibHat" id="TibHat">TH09</a></dt>
<dd>Trevor&nbsp;Hastie, Robert&nbsp;Tibshirani, and Jerome&nbsp;Friedman, <i>The elements of statistical learning: Data mining, inference and prediction</i>, Springer, 2009.</dd>
<dt><a name="Trevisan:2008p2166" id="Trevisan:2008p2166">TTV08</a></dt>
<dd>Luca Trevisan, Madhur Tulsiani, and Salil Vadhan, <i>Regularity, boosting, and efficiently simulating every high-entropy distribution</i>, Electronic Colloquium on Computational Complexity (2008), 18.</dd>
<dt><a name="Zeilberger:1995p277" id="Zeilberger:1995p277">Zei95</a></dt>
<dd>Doron Zeilberger, <i>The method of undetermined generalization and specialization illustrated with fred galvin&#8217;s amazing proof of the dinitz conjecture</i>, <a href="http://arxiv.org/abs/math/9506215">http://arxiv.org/abs/math/9506215</a>, 1995.</dd>
</dl>
<h1><a name="SECTION00070000000000000000" id="SECTION00070000000000000000">Acknowledgement</a></h1>
<p><font><font>A thank you to readers who supplied help and comments on earlier drafts.</font></font></p>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot21" id="foot21">&#8230; Mount</a><a href="#tex2html3"><sup>1</sup></a></dt>
<dd>email: <tt><a name="tex2html1" href="mailto:jmount@win-vector.com" id="tex2html1">mailto:jmount@win-vector.com</a></tt> web: <tt><a name="tex2html2" href="http://www.win-vector.com/" id="tex2html2">http://www.win-vector.com/</a></tt></dd>
<dt><a name="foot244" id="foot244">&#8230; principle.</a><a href="#tex2html4"><sup>2</sup></a></dt>
<dd>The pre-existing practice that comes closest to the local to global principle is found in operations research, where encoding a problem to be solved by an optimizer is a central technique. We claim the natural statement of the local to global principle is more general than <font><em>always</em> encoding constraints for a particular optimizer (in particular globalization is not always optimization).</font></dd>
<dt><font><a name="foot43" id="foot43">&#8230; structure</a><a href="#tex2html6"><sup>4</sup></a></font></dt>
<dd><font>By &#8220;link structure&#8221; we mean which web pages link to which other web pages.</font></dd>
<dt><font><a name="foot45" id="foot45">&#8230; graph</a><a href="#tex2html7"><sup>5</sup></a></font></dt>
<dd><font>Remember, a graph is a diagram consisting of nodes and edges (here depicted as arrows).</font></dd>
<dt><font><a name="foot245" id="foot245">&#8230; features</a><a href="#tex2html9"><sup>6</sup></a></font></dt>
<dd><font>For example the model could account for:</font></p>
<ul>
<li>surfers entering and leaving the model</li>
<li>link odds that vary where they are on a page</li>
<li>surfers staying on a page proportional to how much text is on the page</li>
<li>matching known traffic and click behavior where we have such data.</li>
</ul>
<p><font>For simplicity we will just stick with the given example.</font></dd>
<dt><font><a name="foot154" id="foot154">&#8230; components.</a><a href="#tex2html17"><sup>7</sup></a></font></dt>
<dd><font>When a system is named and defined as an exact set of procedures the system can, by definition, not be improved. This is because with any change in procedure we have a new system that no longer matches the original definition and therefore requires a new name.</font></dd>
</dl>
<p><font><br /></font></p>
<hr />
<address><font>John Mount 2009-11-11</font></address>


]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures</title>
		<link>http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures</link>
		<comments>http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/#comments</comments>
		<pubDate>Tue, 03 Nov 2009 16:14:00 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[Accuracy Measures]]></category>
		<category><![CDATA[Classifiers]]></category>
		<category><![CDATA[Diagnostic Tests]]></category>
		<category><![CDATA[Precision and Recall]]></category>
		<category><![CDATA[ROC Curves]]></category>
		<category><![CDATA[Sensitivity and Specificity]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1050</guid>
		<description><![CDATA[Scientists, engineers, and statisticians share similar concerns about evaluating the accuracy of their results, but they don&#8217;t always talk about it in the same language. This can lead to misunderstandings when reading across disciplines, and the problem is exacerbated when technical work is communicated to and by the popular media. The &#8220;Statistics to English Translation&#8221; [...]


]]></description>
			<content:encoded><![CDATA[<p>Scientists, engineers, and statisticians share similar concerns about evaluating the accuracy of their results, but they don&#8217;t always talk about it in the same language. This can lead to misunderstandings when reading across disciplines, and the problem is exacerbated when technical work is communicated to and by the popular media.</p>
<p>The &#8220;Statistics to English Translation&#8221; series is a new set of articles that we will be posting from time to time, as an attempt to bridge the language gaps. Our goal is to increase statistical literacy: we hope that you will find it easier to read and understand the statistical results in research papers, even if you can&#8217;t replicate the analyses. We also hope that you will be able to read popular media accounts of statistical and scientific results more critically, and to recognize common misunderstandings when they occur.</p>
<p>The first installment discusses some different accuracy measures that are commonly used in various research communities, and how they are related to each other. There is also a more legible PDF version of the article <a href="http://win-vector.com/dfiles/StatisticsToEnglishPart1_Accuracy.pdf">here</a>.</p>
<p><span id="more-1050"></span></p>
<h2><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">The Basics</a></h2>
<p>In informal language and in popular press articles, &#8220;accuracy&#8221; is often discussed as if it were a one-dimensional property of a diagnostic test or a classifier.</p>
<div align="center"><img width="613" height="454" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/MyMoney.png" alt="Image MyMoney"/></div>
<p>In general though, a single number is not enough. A test or classifier should detect what&#8217;s interesting, and ignore what&#8217;s not. How well it accomplishes these two tasks is related to the two kinds of mistakes that a test or classifier can make: false negatives, and false positives.</p>
<p>For a classification task, <em>positive</em> means that an instance is labeled as belonging to the class of interest: we may want to automatically gather all news articles about Microsoft out of a news feed, or identify fraudulent credit card transactions. For a screening test, positive means that the test detects whatever it was designed to look for: an HIV test detects the presence of human immunodeficiency virus, for example, while an allergy test detects the presence of an allergic reaction. A <em>negative</em> is obviously the opposite of a positive.</p>
<p>A <em>false positive</em> is concluding that something is positive when it is not. False positives are sometimes called <em>Type I errors</em>. A <em>false negative</em> is concluding that something is negative when it is not. False negatives are sometimes called <em>Type II errors</em>. The terms &#8220;Type I error&#8221; and &#8220;Type II error&#8221; are not terribly mnemonic, but they are commonly used, and therefore worth knowing.</p>
<p>For binary classification or binary test procedures, the <em>False Positive Rate</em>, <img width="45" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img1.png" alt="$ FPR$"/> , is the fraction of negative instances that are erroneously classified as positive.</p>
<div align="center"><a name="eqn:fp" id="eqn:fp"></a><!-- MATH<br />
 \begin{equation}<br />
FPR = \frac{\mbox{\#false positives}} {\mbox{all negative instances}}<br />
                     = \frac{\mbox{\#false positives}} {\mbox{\#false positives} + \mbox{\#true negatives}}<br />
\end{equation}<br />
 --></p>
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="522" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img2.png" alt="$\displaystyle FPR = \frac{\mbox{\#false positives}} {\mbox{all negative instances}} = \frac{\mbox{\#false positives}} {\mbox{\#false positives} + \mbox{\#true negatives}}$"/></td>
<td nowrap width="10" align="right">(1)</td>
</tr>
</table>
</div>
<p><br clear="all"/></p>
<p>Likewise, the <em>False Negative Rate</em>, <img width="47" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img3.png" alt="$ FNR$"/> , is the fraction of positive instances that are erroneously classified as negative.</p>
<div align="center"><a name="eqn:fn" id="eqn:fn"></a><!-- MATH<br />
 \begin{equation}<br />
FNR = \frac {\mbox{\#false negatives}} {\mbox{all positive instances}}<br />
                     = \frac{\mbox{\#false negatives}} {\mbox{\#false negatives} + \mbox{\#true positives}}<br />
\end{equation}<br />
 --></p>
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="520" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img4.png" alt="$\displaystyle FNR = \frac{\mbox{\#false negatives}} {\mbox{all positive instances}} = \frac{\mbox{\#false negatives}} {\mbox{\#false negatives} + \mbox{\#true positives}}$"/></td>
<td nowrap width="10" align="right">(2)</td>
</tr>
</table>
</div>
<p><br clear="all"/></p>
<p>The <em>True Positive Rate</em>, <img width="44" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img5.png" alt="$ TPR$"/> , is the fraction of positive instances that are correctly identified as such. It follows from Definition <a href="#eqn:fn">2</a> above that <!-- MATH<br />
 $TPR = 1 - FNR$<br />
 --><br />
<img width="140" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img6.png" alt="$ TPR = 1 - FNR$"/> .</p>
<p>The <em>True Negative Rate</em>, <img width="46" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img7.png" alt="$ TNR$"/> , is the fraction of negative instances that are correctly identified as such. It follows from Definition <a href="#eqn:fp">1</a> above that <!-- MATH<br />
 $TNR = 1 - FPR$<br />
 --><br />
<img width="140" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img8.png" alt="$ TNR = 1 - FPR$"/> .</p>
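<p>As a minimal sketch of these four definitions (the function name <tt>rates</tt> and the counts below are made up for illustration), here is how the rates fall out of the raw counts in Python:</p>

```python
def rates(tp, fp, tn, fn):
    """Return the four rates from raw confusion-matrix counts."""
    positives = tp + fn  # all truly positive instances
    negatives = fp + tn  # all truly negative instances
    return {
        "FPR": fp / negatives,
        "FNR": fn / positives,
        "TPR": tp / positives,
        "TNR": tn / negatives,
    }

# Made-up counts, for illustration only.
r = rates(tp=90, fp=20, tn=880, fn=10)
print(r)
```

<p>Each pair of complementary rates sums to one, which is just the two identities above restated in terms of counts.</p>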
<h2><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">Sensitivity and Specificity</a></h2>
<div align="center"><img width="180" height="180" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/screening.jpg" alt="Image screening"/></div>
<p>The terms sensitivity and specificity generally refer to diagnostic or screening procedures, such as HIV or allergy tests. The <em>sensitivity</em> of a test is its true positive rate; the <em>specificity</em> is its true negative rate, although it can be more intuitive to think of specificity as the complement of the false positive rate: <em>Specificity</em> = <!-- MATH<br />
 $TNR = (1 - FPR)$<br />
 --><br />
<img width="153" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img9.png" alt="$ TNR = (1 - FPR)$"/> .</p>
<p>The Wikipedia entry on Sensitivity and Specificity [<a href="#wikiSS">Wik</a>] uses a nice example to illustrate the difference: think of a drug-sniffing dog as a screening test for illicit drugs. If the dog&#8217;s nose is highly <em>sensitive</em> to the smell of drugs, then it will detect all the hidden packets of drugs; if it is less sensitive, then it will fail to detect some of the packets. At the same time, the dog should react <em>specifically</em> to drugs, and not, say, jambalaya or doggie biscuits. If the dog is highly specific in its reactions, it will only react to drugs; if it is less specific, then it will react to the occasional care package of yummy home cooking from Mom.</p>
<p>Screening tests may trade off specificity for sensitivity (and vice-versa). To go back to our drug-sniffing example, we might treat every suitcase and bag that comes through the airport as if it contained drugs; this procedure is perfectly sensitive (it will detect every packet of drugs, for sure), but not specific at all. Or, we might assume that no one is carrying drugs. This is perfectly specific (we will never make a false accusation), but not sensitive at all.</p>
<p>A more realistic example, inspired by a discussion of mandatory AIDS testing by Joshua Rosenau [<a href="#Rosenau">Ros06</a>], is the use of the ELISA screening test to detect HIV-infected blood donations. The ELISA test is designed to be very sensitive: it detects 99.7% of the cases of HIV-infection, which gives a false negative rate of <!-- MATH<br />
 $3 \times 10^{-3}$<br />
 --><br />
<img width="70" height="39" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img10.png" alt="$ 3 \times 10^{-3}$"/> . On the other hand, it is not very specific: it has a 1.9% false positive rate<a name="tex2html2" href="#foot208" id="tex2html2"><sup>1</sup></a>. If you assume that the incidence of HIV-positive individuals in the general population is about 448 out of every 100,000 people [<a href="#CDC">Hig08</a>], then a positive test result is correct only about 19% of the time: one case of true infection out of every five positives. This error rate may be appropriate for screening blood donations, since it is better to discard four perfectly good pints of blood, &#8220;just in case&#8221;, than to allow a pint of HIV-infected blood into the blood bank. But it is <em>not</em> appropriate to assume that all five of those poor blood donors are HIV-positive, without followup tests to increase the specificity of the screening procedure.</p>
<h3><a name="SECTION00031000000000000000" id="SECTION00031000000000000000">Sensitivity, Specificity, and Prevalence</a></h3>
<p>The example above brings up an important point. Sensitivity and specificity are properties <em>of the test itself</em>, not of <em>how the test performs in a given population</em>. <b>The absolute accuracy</b> (as the term is commonly understood) <b>of a screening test will change, depending on the prevalence of the condition that the test is screening for.</b></p>
<p>Let&#8217;s imagine the ELISA test described above as an HIV-screening daemon, who uses two coins to generate uncertainty. When the daemon is shown a pint of infected blood, she flips an unfair quarter. If the quarter comes up heads (which it does 3 times out of every 1000 flips), then she lies and says the blood is uninfected, otherwise she tells the truth. When the daemon is shown a pint of uninfected blood, she flips a silver dollar that comes up heads about 2 times out of every 100 flips. If the silver dollar comes up heads, she lies and says the blood is infected, or else she tells the truth. The quarter and the silver dollar encode ELISA&#8217;s sensitivity and specificity, respectively.</p>
<div align="center"><a name="76"></a></p>
<table>
<caption align="bottom"><strong>Figure 1:</strong> The ELISA daemon screening an uninfected pint of blood</caption>
<tr>
<td>
<div align="center"><img width="518" height="333" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ELISAflip.png" alt="Image ELISAflip"/></div>
</td>
</tr>
</table>
</div>
<p>Suppose ELISA looks at the blood of 1000 people a day, drawn from the general population. We can expect that about 5 of them are truly infected. That means that ELISA flips her silver dollar 995 times; it will come up heads about 20 times. That&#8217;s about twenty false positives a day. She&#8217;ll flip the quarter about 5 times, and with high probability, won&#8217;t ever see a head. That&#8217;s near zero false negatives a day. In total, ELISA will read positive for about 25 pints of blood every day, and she will be wrong for 80% of those cases.</p>
<p>But suppose ELISA looks at the blood of 1000 people from a high-risk population, where one out of four people is infected. Then ELISA flips her silver dollar about 750 times, and it will come up heads about 15 times: 15 false positives. She&#8217;ll also flip the quarter 250 times; the coin just might come up heads one time. Let&#8217;s say it does. Then ELISA will read positive for 249+15 = 264 pints of blood, and she&#8217;ll be wrong for only about 6% of those cases &#8211; plus that case of infection that she missed.</p>
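<p>The two scenarios above can be checked with a few lines of Python. This is only a sketch of the expected counts, using the quoted rates of 0.997 and 0.019 rather than the rounded coin odds, so the numbers differ slightly from the whole-flip arithmetic in the text; the function name <tt>expected_screen</tt> is our own.</p>

```python
def expected_screen(n, prevalence, tpr=0.997, fpr=0.019):
    """Expected outcome of screening n people at a given prevalence."""
    infected = n * prevalence
    uninfected = n - infected
    true_pos = tpr * infected          # infected pints correctly flagged
    false_neg = infected - true_pos    # infected pints missed
    false_pos = fpr * uninfected       # clean pints wrongly flagged
    flagged = true_pos + false_pos
    return {
        "flagged": flagged,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "share_wrong": false_pos / flagged,
    }

general = expected_screen(1000, 448 / 100000)
high_risk = expected_screen(1000, 0.25)
print(general)
print(high_risk)
```

<p>With these inputs, roughly 81% of the positives are wrong in the general population, against roughly 5% in the high-risk one.</p>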
<p><b>Same test, same sensitivity and specificity, but different overall accuracy.</b> The percentage of positives that are actually true positives in a given population is called the <em>positive predictive value</em> (<img width="45" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img11.png" alt="$ PPV$"/> ) of the test within that population; it is the probability <em>for that population</em> that a positive test result correctly predicts a positive instance.</p>
<div align="center"><!-- MATH<br />
 \begin{equation}<br />
PPV = \frac{TPR \times P(+)}{TPR \times P(+) + FPR \times P(-)}<br />
\end{equation}<br />
 --></p>
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="298" height="58" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img12.png" alt="$\displaystyle PPV = \frac{TPR \times P(+)}{TPR \times P(+) + FPR \times P(-)}$"/></td>
<td nowrap width="10" align="right">(3)</td>
</tr>
</table>
</div>
<p><br clear="all"/><br />
where <img width="45" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img13.png" alt="$ P(+)$"/> is the probability of a positive instance, or in other words the prevalence of the condition in the population. <img width="45" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img14.png" alt="$ P(-)$"/> is the probability of a negative instance, and of course <!-- MATH<br />
 $P(+) + P(-) =<br />
1$<br />
 --><br />
<img width="139" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img15.png" alt="$ P(+) + P(-) = 1$"/> .<a name="tex2html4" href="#foot86" id="tex2html4"><sup>2</sup></a></p>
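<p>Equation (3) translates directly into code. A sketch, with <tt>ppv</tt> as a hypothetical helper name and the ELISA figures from the example above as inputs:</p>

```python
def ppv(tpr, fpr, prevalence):
    """Positive predictive value, following Equation (3)."""
    p_pos = prevalence          # P(+)
    p_neg = 1.0 - prevalence    # P(-)
    return (tpr * p_pos) / (tpr * p_pos + fpr * p_neg)

# Same test, two populations: general (448 per 100,000) and high risk (1 in 4).
for label, prev in [("general", 448 / 100000), ("high risk", 0.25)]:
    print(label, round(ppv(0.997, 0.019, prev), 3))
```

<p>For the two populations discussed above this gives a positive predictive value of roughly 0.19 and 0.95 respectively, even though the sensitivity and specificity passed in are identical.</p>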
<h3><a name="SECTION00032000000000000000" id="SECTION00032000000000000000">Likelihood Ratios</a></h3>
<p>Likelihood ratios are another measure of diagnostic test accuracy. The <em>positive likelihood ratio</em> is the true positive rate over the false positive rate: <!-- MATH<br />
 $LR_P = TPR/FPR$<br />
 --><br />
<img width="152" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img16.png" alt="$ LR_P = TPR/FPR$"/> . For our example ELISA test, the positive likelihood ratio is 0.997/0.019 = 52.47. The <em>negative likelihood ratio</em> is the false negative rate over the true negative rate, <!-- MATH<br />
 $LR_N =FNR/TNR$<br />
 --><br />
<img width="159" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img17.png" alt="$ LR_N =FNR/TNR$"/> . For our ELISA example, the negative likelihood ratio is 0.003/0.981 = 0.003058.</p>
<p>Likelihood ratios are a property of the screening test, independent of the prevalence of the condition in the population. If you know the odds of infection for the population of interest, <!-- MATH<br />
 $odds_{pop} =<br />
P(+)/P(-)$<br />
 --><br />
<img width="173" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img18.png" alt="$ odds_{pop} = P(+)/P(-)$"/> , then you can calculate the posterior odds of infection for someone who has tested positive:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
odds_{post}  =  LR_P \times odds_{pop}<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="201" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img19.png" alt="$\displaystyle odds_{post} = LR_P \times odds_{pop} $"/></div>
<p>and the posterior odds of infection for someone who has tested negative:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
odds_{post} = LR_N \times odds_{pop}<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="202" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img20.png" alt="$\displaystyle odds_{post} = LR_N \times odds_{pop} $"/></div>
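<p>As a sketch of the two updates above (the helper names <tt>odds</tt> and <tt>prob</tt> are ours), here are the ELISA likelihood ratios applied to the general-population prevalence of 448 per 100,000:</p>

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def prob(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)

tpr, fpr = 0.997, 0.019
lr_pos = tpr / fpr                 # positive likelihood ratio, TPR / FPR
lr_neg = (1 - tpr) / (1 - fpr)     # negative likelihood ratio, FNR / TNR

prior = odds(448 / 100000)         # odds of infection in the population
post_pos = lr_pos * prior          # odds of infection after a positive test
post_neg = lr_neg * prior          # odds of infection after a negative test

print(round(lr_pos, 2), round(lr_neg, 6))
print(round(prob(post_pos), 3), prob(post_neg))
```

<p>Converting the posterior odds for a positive test back to a probability gives about 0.19, the same figure quoted earlier for how often a positive ELISA result is correct; odds and probabilities are two kinds of bookkeeping for the same calculation.</p>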
<p>It&#8217;s been argued that likelihood ratios make it easier for non-statistically-minded practitioners to interpret the results of a test than sensitivity and specificity do [<a href="#JAMA94">JGS94</a>]. It&#8217;s also been argued the other way [<a href="#AIM05">PSBtR05</a>]. Which framework makes more sense depends on whether you prefer thinking in odds or probabilities. In either case you should be leery of &#8220;guidelines&#8221; of the sort: &#8220;<img width="81" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img21.png" alt="$ LR_P &gt; 10$"/> indicates large and often conclusive increase in the likelihood of the disease.&#8221; There is certainly a large increase in the posterior likelihood of infection if <img width="81" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img21.png" alt="$ LR_P &gt; 10$"/> , but as the ELISA coin-flipping example should have made clear, this posterior likelihood can still be quite small, if the disease is sufficiently rare.</p>
<p>I occasionally see something called the <em>diagnostic odds ratio</em>. It was developed as &#8220;a single indicator of test performance,&#8221; and I&#8217;ve seen it described as &#8220;the odds of the true positive rate divided by the odds of the false positive rate&#8221; [<a href="#diagodds">HC07</a>]. I could give you the actual formula here, but frankly &#8211; it makes no sense. The whole point of having two measures for accuracy is that one is not enough, and if you absolutely must have one number, you are better off using something like the <img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img22.png" alt="$ F_1$"/> measure that we describe in the next section.</p>
<h2><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Precision and Recall</a></h2>
<div align="center"><img width="644" height="420" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/istock_library.jpg" alt="Image istock_library"/></div>
<p>Precision and recall are similar (but not identical) to sensitivity and specificity. The measures are popular in the information retrieval and machine learning communities.</p>
<p><em>Recall</em> is the same as sensitivity, or the true positive rate: the fraction of positive instances that are correctly classified as such. <em>Precision</em> is defined as the fraction of instances classified as positive that really are positive:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
\mbox{precision} = \frac{\mbox{\# true positives}}{\mbox{\# true positives + \# false positives}}<br />
\end{displaymath}<br />
 --></p>
<div align="center">&nbsp; &nbsp;precision<img width="299" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img23.png" alt="$\displaystyle = \frac{\mbox{\char93 true positives}}{\mbox{\char93 true positives + \char93 false positives}} $"/></div>
<p>This is <em>not</em> the same as specificity; it is the same as the positive predictive value, and is a joint property of the classifier and the population that it was evaluated on.</p>
<p>Information retrieval research concerns itself with efficient discovery of relevant documents from document collections, and that domain motivates the definitions of precision and recall. A library patron queries the library catalog for books on a given topic; the catalog&#8217;s search engine should return all of the books relevant to her query, and only those books. Recall is a measure of how well the search delivers &#8220;all of the relevant books&#8221;, and precision is a measure of how well it delivers &#8220;only the relevant books&#8221;. If recall is poor, then the patron will miss finding many relevant books; if precision is poor, then she will be inundated with a bunch of book suggestions that have nothing to do with her search.</p>
<p>As with diagnostic procedures, classifiers may trade precision for recall, and vice-versa. Suppose our library patron is looking for novels about vampires. She could request all novels with the word &#8220;vampire&#8221; in the title. This search would have almost perfect precision, since presumably a novel with the word &#8220;vampire&#8221; in the title is going to be about vampires. It would not have perfect recall, since many novels about vampires &#8211; like <em>Dracula</em>, or the books from the <em>Twilight</em> series &#8211; don&#8217;t announce themselves quite that blatantly. Now suppose she is only interested in novels from Anne Rice&#8217;s <em>Vampire Chronicles</em> series. She could request all novels authored by Anne Rice. This search would have perfect recall, but not perfect precision, since Ms. Rice did in fact write several novels that are not about vampires.</p>
<p>These examples show that neither high precision nor high recall alone guarantees a useful classifier. It is the tension between achieving high precision and high recall that leads to good classifiers.</p>
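<p>The two library searches can be mimicked on a toy catalog. The six-book catalog below is invented for illustration, and <tt>precision_recall</tt> is a hypothetical helper; it assumes both sets are non-empty.</p>

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant set."""
    hits = len(retrieved.intersection(relevant))  # true positives
    return hits / len(retrieved), hits / len(relevant)

# Toy catalog: title -> (author, is the novel about vampires?)
catalog = {
    "Interview with the Vampire": ("Anne Rice", True),
    "The Vampire Lestat": ("Anne Rice", True),
    "The Witching Hour": ("Anne Rice", False),
    "Dracula": ("Bram Stoker", True),
    "Twilight": ("Stephenie Meyer", True),
    "Emma": ("Jane Austen", False),
}

# Query 1: looking for vampire novels, searching by title keyword.
relevant = {t for t, (_, vampires) in catalog.items() if vampires}
by_title = {t for t in catalog if "vampire" in t.lower()}
print(precision_recall(by_title, relevant))

# Query 2: looking for Vampire Chronicles novels, searching by author.
chronicles = {"Interview with the Vampire", "The Vampire Lestat"}
by_author = {t for t, (author, _) in catalog.items() if author == "Anne Rice"}
print(precision_recall(by_author, chronicles))
```

<p>On this catalog the title search has perfect precision but finds only half of the vampire novels, while the author search has perfect recall but only two of its three hits are relevant.</p>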
<p>As we discussed above, the primary difference between precision and specificity is that precision is a property of <em>the algorithm and the population</em>. One could argue that precision is a more appropriate measure than specificity for many classification and machine learning tasks, especially those related to text or natural language. The fundamental assumption, after all, is that such algorithms are trained on data that is representative of the population that the classifier will be deployed in.</p>
<h3><a name="SECTION00041000000000000000" id="SECTION00041000000000000000">If you insist: Single Score Measures</a></h3>
<p>There is another measure called <img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img22.png" alt="$ F_1$"/> , the harmonic mean of precision and recall:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
F_1 = \frac{2}{(1/\mbox{precision} + 1/\mbox{recall})} =<br />
2 \times \frac{\mbox{precision} \times \mbox{recall}}{\mbox{precision + recall}}.<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="422" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img24.png" alt="$\displaystyle F_1 = \frac{2}{(1/\mbox{precision} + 1/\mbox{recall})} = 2 \times \frac{\mbox{precision} \times \mbox{recall}}{\mbox{precision + recall}}. $"/></div>
<p><img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img22.png" alt="$ F_1$"/> is near one when both precision and recall are high, and near zero when either one is low. It is a convenient single score to characterize overall accuracy, especially for comparing the performance of different classifiers.</p>
<p>Using <img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img22.png" alt="$ F_1$"/> to compare classifiers assumes that precision and recall are equally important for the application. If one criterion is more important than the other, then one can also use the weighted harmonic mean:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
F_\alpha = (1 + \alpha)(\mbox{precision} \times \mbox{recall})/(\alpha \mbox{precision +<br />
  recall}).<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="110" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img25.png" alt="$\displaystyle F_\alpha = (1 + \alpha)($"/>precision<img width="18" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img26.png" alt="$\displaystyle \times$"/>&nbsp; &nbsp;recall<img width="38" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img27.png" alt="$\displaystyle )/(\alpha$"/>&nbsp; &nbsp;precision + recall<img width="16" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img28.png" alt="$\displaystyle ). $"/></div>
<p><img width="15" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img29.png" alt="$ \alpha$"/> describes how much more important recall is than precision: use <img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img30.png" alt="$ F_2$"/> if recall is twice as important as precision, <img width="33" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img31.png" alt="$ F_{0.5}$"/> if precision is twice as important as recall.</p>
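<p>A short sketch of the formula as written above, with <tt>f_alpha</tt> as a hypothetical function name and made-up precision and recall values:</p>

```python
def f_alpha(precision, recall, alpha=1.0):
    """Weighted harmonic mean of precision and recall, as defined above."""
    return (1 + alpha) * precision * recall / (alpha * precision + recall)

p, r = 0.9, 0.5  # made-up precision and recall
print(f_alpha(p, r))        # alpha = 1 gives the plain harmonic mean, F_1
print(f_alpha(p, r, 2.0))   # weights recall more heavily
print(f_alpha(p, r, 0.5))   # weights precision more heavily
```

<p>Because recall is the weaker of the two numbers here, weighting it more heavily lowers the score and weighting precision more heavily raises it. One caution if you compare against other tools: many libraries parameterize this family as F<sub>&beta;</sub> with &beta;<sup>2</sup> in place of &alpha;, so their F<sub>2</sub> corresponds to &alpha; = 4 in the notation used here.</p>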
<p>It is still better to have separate target goals for precision and recall that a candidate classifier must meet. Nonetheless, <img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img22.png" alt="$ F_1$"/> and <img width="25" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img32.png" alt="$ F_\alpha$"/> are found in the literature, so they are presented here.</p>
<h2><a name="SECTION00050000000000000000" id="SECTION00050000000000000000">ROC Curves</a></h2>
<p>Not all diagnostic tests or classifiers return a simple &#8220;yes-or-no&#8221; answer. In fact, most probably don&#8217;t. Generally, a classification or diagnostic procedure will return a score along a continuum; ideally, the positive instances score towards one end of the scale, and the negative examples towards the other end. It is up to the scientist or the analyst to set a threshold on that score that separates what is considered a positive result from what is considered a negative result. The Receiver Operating Characteristic Curve, or <em>ROC Curve</em>, is a tool that helps set the best threshold.</p>
<div align="center"><a name="fig:densityplots" id="fig:densityplots"></a><a name="129"></a></p>
<table>
<caption align="bottom"><strong>Figure 2:</strong> Plot of score distributions for positive and negative instances (Class 10 is positive)</caption>
<tr>
<td>
<div align="center"><img width="525" height="525" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ScoreDensityplots.png" alt="Image ScoreDensityplots"/></div>
</td>
</tr>
</table>
</div>
<p>Suppose we are trying to classify a set of instances into one of two classes, positive and negative<a name="tex2html6" href="#foot133" id="tex2html6"><sup>3</sup></a>. We&#8217;ve gathered a test set of representative samples, and we&#8217;ve developed a scoring procedure to try to separate them. Positives tend to score on the high end of the scale, negatives toward the low end. We want to pick a threshold value.</p>
<p>Figure <a href="#fig:densityplots">2</a> shows what happens when we score the test set. We can see that the scores of the positive instances (class 10) are in a cluster centered just above 7, and the scores of the negatives (class 0) are in a cluster centered near 5. Still, there is an interval where the two clusters overlap substantially. If we pick a threshold to the right of that interval (say, <img width="49" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img33.png" alt="$ T = 7$"/> ), almost everything that scores greater than <img width="17" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img34.png" alt="$ T$"/> will be truly positive (high precision/specificity), but we miss a lot of positives, too (low recall/sensitivity). If we pick a threshold to the left of that interval (say <img width="49" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img35.png" alt="$ T = 5$"/> ), we will catch almost all the positives (high recall/sensitivity), but we will also pick up a lot of negatives (low specificity/precision). So we want the threshold to be somewhere in the overlap interval, but where?</p>
<div align="center"><a name="fig:ROC" id="fig:ROC"></a><a name="214"></a></p>
<table>
<caption align="bottom"><strong>Figure 3:</strong> ROC Curve corresponding to Figure <a href="#fig:densityplots">2</a>. Selected thresholds are marked on the curve.</caption>
<tr>
<td>
<div align="center"><img width="525" height="525" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ROC.png" alt="Image ROC"/></div>
</td>
</tr>
</table>
</div>
<p>ROC curves plot the false positive rate on the x-axis and the true positive rate on the y-axis, as we vary the threshold. The point <img width="43" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img36.png" alt="$ (0,0)$"/> corresponds to rejecting everything; the point <img width="43" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img37.png" alt="$ (1,1)$"/> corresponds to accepting everything. The ideal point is <img width="43" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img38.png" alt="$ (0,1)$"/> : accept all positive instances and reject all negative instances. The line <img width="46" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img39.png" alt="$ x=y$"/> corresponds to random guessing: that is, a procedure that assigns each instance a score uniformly drawn from (in this example) the interval <img width="39" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img40.png" alt="$ [1,8]$"/> without even checking if the instance is positive or negative.</p>
<p>The ROC curve represents the tradeoff between true positives and false positives that we make as we increase the threshold from accepting everything to rejecting everything. Figure <a href="#fig:ROC">3</a> gives the ROC curve for our example, with a few example thresholds marked on the curve.</p>
<p>The area between the ROC curve and the <img width="46" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img39.png" alt="$ x=y$"/> line can be considered a measure of accuracy; the smaller that area, the more the scoring procedure is like random guessing. The larger the area, the better separated the two classes are. We can use the curve to help us decide how to set a threshold that will give us the most acceptable tradeoff between true positives and false positives. For this example, we would probably want to select a threshold somewhere between <img width="13" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img41.png" alt="$ 6$"/> and <img width="26" height="18" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img42.png" alt="$ 6.5$"/> .</p>
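<p>To make the construction concrete, here is a minimal sketch that sweeps the threshold over a small set of scores and collects the resulting (false positive rate, true positive rate) pairs. The scores and labels are made up and are not the data behind the figures; <tt>roc_points</tt> and <tt>area_under</tt> are hypothetical helper names, and the sketch assumes at least one positive and one negative instance.</p>

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per candidate threshold.

    An instance is called positive when its score is >= the threshold.
    labels are True for positive instances, False for negative ones.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: reject everything
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points  # the last point is (1, 1): accept everything

def area_under(points):
    """Trapezoid-rule area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up scores: positives tend to score higher than negatives.
scores = [7.4, 7.1, 6.8, 6.3, 6.1, 5.9, 5.5, 5.2, 4.9, 4.6]
labels = [True, True, True, False, True, True, False, False, False, False]
pts = roc_points(scores, labels)
print(pts)
print(area_under(pts) - 0.5)  # area between the curve and the x = y line
```

<p>Lowering the threshold one score at a time moves the curve up when the newly accepted instance is a positive and to the right when it is a negative, which is why a well-separated scoring procedure hugs the upper left corner.</p>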
<h2><a name="SECTION00060000000000000000" id="SECTION00060000000000000000">In Conclusion</a></h2>
<p>Some points to remember:</p>
<ul>
<li>Classifier and diagnostic test performance are not one-dimensional.</li>
<li>Different fields use different (but related) measures of accuracy.</li>
<li>Classifier and diagnostic test performance depend on the relative cost of Type I and Type II errors, as well as on the proportion of positive and negative instances in the population of interest.</li>
</ul>
<h2><a name="SECTION00070000000000000000" id="SECTION00070000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="diagodds" id="diagodds">HC07</a></dt>
<dd>Childrens&nbsp;Mercy Hospitals and Clinics, <i>Stats: Meta-analysis for a diagnostic test</i>, <tt><a name="tex2html8" href="http://www.childrens-mercy.org/stats/model/diagnostic.asp" id="tex2html8">http://www.childrens-mercy.org/stats/model/diagnostic.asp</a></tt>, 2007.</dd>
<dt><a name="CDC" id="CDC">Hig08</a></dt>
<dd>Liz Highleyman, <i>CDC updates estimates of HIV prevalence in the United States</i>, <tt><a name="tex2html9" href="http://www.hivandhepatitis.com/recent/2008/100708_a.html" id="tex2html9">http://www.hivandhepatitis.com/recent/2008/100708_a.html</a></tt>, 2008.</dd>
<dt><a name="JAMA94" id="JAMA94">JGS94</a></dt>
<dd>R.&nbsp;Jaeschke, GH&nbsp;Guyatt, and DL&nbsp;Sackett, <i>Users&#8217; guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? The evidence-based medicine working group</i>, JAMA <b>271</b> (1994), no.&nbsp;6, 703-707.</dd>
<dt><a name="who" id="who">Org04</a></dt>
<dd>World&nbsp;Health Organization, <i>HIV assays: Operational characteristics (phase 1); Report 15 antigen/antibody ELISAs</i>, <tt><a name="tex2html10" href="http://whqlibdoc.who.int/publications/2004/9241592370.pdf" id="tex2html10">http://whqlibdoc.who.int/publications/2004/9241592370.pdf</a></tt>, 2004.</dd>
<dt><a name="AIM05" id="AIM05">PSBtR05</a></dt>
<dd>MA&nbsp;Puhan, J.&nbsp;Steurer, LM&nbsp;Bachmann, and G.&nbsp;ter Riet, <i>&#8220;A randomized trial of ways to describe test accuracy: the effect on physicians&#8217; post-test probability estimates&#8221;</i>, Annals of Internal Medicine <b>143</b> (2005), no.&nbsp;3, 184-189.</dd>
<dt><a name="Rosenau" id="Rosenau">Ros06</a></dt>
<dd>Joshua Rosenau, <i>AIDS testing</i>, <tt><a name="tex2html11" href="http://scienceblogs.com/tfk/2006/08/aids_testing.php" id="tex2html11">http://scienceblogs.com/tfk/2006/08/aids_testing.php</a></tt>, 2006.</dd>
<dt><a name="wikiSS" id="wikiSS">Wik</a></dt>
<dd>Wikipedia, <i>Sensitivity and specificity</i>, <tt><a name="tex2html12" href="http://en.wikipedia.org/wiki/Sensitivity_and_specificity" id="tex2html12">http://en.wikipedia.org/wiki/Sensitivity_and_specificity</a></tt>.</dd>
</dl>
<h2><a name="SECTION00080000000000000000" id="SECTION00080000000000000000">Appendix: Glossary of Accuracy Terms</a></h2>
<h3><a name="SECTION00081000000000000000" id="SECTION00081000000000000000">Basic Terms</a></h3>
<dl>
<dt><strong>Accuracy</strong></dt>
<dd>
<p><!-- MATH<br />
 \begin{displaymath}<br />
\frac{\mbox{\#true positives} + \mbox{\#true negatives}}{\mbox{all instances}}<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="267" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img43.png" alt="$\displaystyle \frac{\mbox{\char93 true positives} + \mbox{\char93 true negatives}}{\mbox{all instances}} $"/></div>
</dd>
<dt><strong>Type I error</strong></dt>
<dd>False Positive: to conclude something is a positive instance when it is not.</dd>
<dt><strong>Type II error</strong></dt>
<dd>False Negative: to conclude something is a negative instance when it is not.</dd>
<dt><strong>False Positive Rate (<img width="45" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img1.png" alt="$ FPR$"/> )</strong></dt>
<dd>The fraction of negative instances that are misclassified as positive.</p>
<div align="center"><!-- MATH<br />
 \begin{equation*}<br />
FPR = \frac{\mbox{\#false positives}} {\mbox{all negative instances}}<br />
                     = \frac{\mbox{\#false positives}} {\mbox{\#false positives} + \mbox{\#true negatives}}<br />
\end{equation*}<br />
 --></p>
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="522" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img2.png" alt="$\displaystyle FPR = \frac{\mbox{\char93 false positives}} {\mbox{all negative i... ...se positives}} {\mbox{\char93 false positives} + \mbox{\char93 true negatives}}$"/></td>
<td nowrap width="10" align="right">&nbsp;&nbsp;&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"/></dd>
<dt><strong>False Negative Rate (<img width="47" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img3.png" alt="$ FNR$"/> )</strong></dt>
<dd>The fraction of positive instances that are misclassified as negative.</p>
<div align="center"><!-- MATH<br />
 \begin{equation*}<br />
FNR = \frac {\mbox{\#false negatives}} {\mbox{all positive instances}}<br />
                     = \frac{\mbox{\#false negatives}} {\mbox{\#false negatives} + \mbox{\#true positives}}<br />
\end{equation*}<br />
 --></p>
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="520" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img4.png" alt="$\displaystyle FNR = \frac {\mbox{\char93 false negatives}} {\mbox{all positive ... ...se negatives}} {\mbox{\char93 false negatives} + \mbox{\char93 true positives}}$"/></td>
<td nowrap width="10" align="right">&nbsp;&nbsp;&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"/></dd>
<dt><strong>True Positive Rate (<img width="44" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img5.png" alt="$ TPR$"/> )</strong></dt>
<dd>The fraction of positive instances that are correctly identified as such. <!-- MATH<br />
 $TPR = 1 - FNR$<br />
 --><br />
<img width="140" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img6.png" alt="$ TPR = 1 - FNR$"/> .</dd>
<dt><strong>True Negative Rate (<img width="46" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img7.png" alt="$ TNR$"/> )</strong></dt>
<dd>The fraction of negative instances that are correctly identified as such. <!-- MATH<br />
 $TNR = 1 - FPR$<br />
 --><br />
<img width="140" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img8.png" alt="$ TNR = 1 - FPR$"/> .</dd>
<dt><strong>Prevalence <img width="45" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img13.png" alt="$ P(+)$"/></strong></dt>
<dd>The proportion of positive instances in the population, or the probability that someone drawn from the population at random is a positive instance. <img width="45" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img14.png" alt="$ P(-)$"/> is the probability of drawing a negative instance from the population at random. The <em>odds of a positive</em> is the ratio of <img width="45" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img13.png" alt="$ P(+)$"/> to <img width="45" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img14.png" alt="$ P(-)$"/> .</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
odds_{pop} = P(+)/P(-)<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="173" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img44.png" alt="$\displaystyle odds_{pop} = P(+)/P(-) $"/></div>
</dd>
</dl>
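<p>As a minimal sketch (not part of the original glossary), the basic rates above can be computed directly from the four confusion-matrix counts. The function name <code>rates</code> and the example counts are hypothetical, chosen only for illustration.</p>

```python
def rates(tp, fp, tn, fn):
    """Compute the glossary's basic rates from confusion-matrix counts."""
    positives = tp + fn            # all positive instances
    negatives = fp + tn            # all negative instances
    total = positives + negatives  # all instances
    fpr = fp / negatives           # false positive rate
    fnr = fn / positives           # false negative rate
    prevalence = positives / total # P(+)
    return {
        "accuracy": (tp + tn) / total,
        "FPR": fpr,
        "FNR": fnr,
        "TPR": 1 - fnr,            # TPR = 1 - FNR
        "TNR": 1 - fpr,            # TNR = 1 - FPR
        "prevalence": prevalence,
        "odds_pop": prevalence / (1 - prevalence),  # P(+)/P(-)
    }

# Hypothetical counts: 100 positive and 900 negative instances.
example = rates(tp=90, fp=50, tn=850, fn=10)
```

<p>With these made-up counts the classifier catches 90% of the positives, while the population odds of a positive are 1 to 9.</p>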
<h3><a name="SECTION00082000000000000000" id="SECTION00082000000000000000">Accuracy Terms</a></h3>
<p>Terms to describe the accuracy of diagnostic tests are conventionally given in terms of sensitivity and specificity. They have been rephrased here in terms of true positive rate, false positive rate, etc., for clarity.</p>
<dl>
<dt><strong><img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img22.png" alt="$ F_1$"/></strong></dt>
<dd>Single score measure of accuracy. The harmonic mean of precision and recall.</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
F_1 = \frac{2}{(1/\mbox{precision} + 1/\mbox{recall})} =<br />
2 \times \frac{\mbox{precision} \times \mbox{recall}}{\mbox{precision + recall}}.<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="422" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img24.png" alt="$\displaystyle F_1 = \frac{2}{(1/\mbox{precision} + 1/\mbox{recall})} = 2 \times \frac{\mbox{precision} \times \mbox{recall}}{\mbox{precision + recall}}. $"/></div>
<p><img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img22.png" alt="$ F_1$"/> is near 1 for high accuracy, near 0 for low accuracy. Also see <em>Precision, Recall</em>.</dd>
<dt><strong><img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img30.png" alt="$ F_2$"/></strong></dt>
<dd>Single score measure of accuracy when recall is twice as important as precision. Also see <em><img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img22.png" alt="$ F_1$"/> , Precision, Recall</em>.</dd>
<dt><strong><img width="36" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img45.png" alt="$ F_0.5$"/></strong></dt>
<dd>Single score measure of accuracy when precision is twice as important as recall. Also see <em><img width="23" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img22.png" alt="$ F_1$"/> , Precision, Recall</em>.</dd>
<dt><strong>Likelihood-ratio (negative)</strong></dt>
<dd><!-- MATH<br />
 $LR_N = FNR/TNR$<br />
 --><br />
<img width="159" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img17.png" alt="$ LR_N =FNR/TNR$"/> . In diagnostic screening, used for calculating the posterior odds of a positive for a subject who has tested negative:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
odds_{post}  =  LR_N \times odds_{pop}<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="202" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img20.png" alt="$\displaystyle odds_{post} = LR_N \times odds_{pop} $"/></div>
</dd>
<dt><strong>Likelihood-ratio (positive)</strong></dt>
<dd><!-- MATH<br />
 $LR_P = TPR/FPR$<br />
 --><br />
<img width="152" height="36" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img16.png" alt="$ LR_P = TPR/FPR$"/> . In diagnostic screening, used for calculating the posterior odds of a true positive for a subject who has tested positive:</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
odds_{post} = LR_P \times odds_{pop}<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="201" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img19.png" alt="$\displaystyle odds_{post} = LR_P \times odds_{pop} $"/></div>
</dd>
<dt><strong>Positive Predictive Value</strong></dt>
<dd>Probability (with respect to a specific assumed prevalence rate) that a positive result from a diagnostic or screening procedure is a true positive. Same as precision.</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
PPV = \frac{TPR \times P(+)}{TPR \times P(+) + FPR \times P(-)}<br />
\end{displaymath}<br />
 --></p>
<div align="center"><img width="298" height="58" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img12.png" alt="$\displaystyle PPV = \frac{TPR \times P(+)}{TPR \times P(+) + FPR \times P(-)}$"/></div>
<p>Also see <em>Precision</em></dd>
<dt><strong>Precision</strong></dt>
<dd>In information retrieval, the fraction of returned documents that are actually relevant to the query. In classification, the fraction of all instances classified as class A that are truly in class A. The same as Positive Predictive Value.</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
\mbox{precision} = \frac{\mbox{\# true positives}}{\mbox{\# true positives + \# false positives}}<br />
\end{displaymath}<br />
 --></p>
<div align="center">&nbsp; &nbsp;precision<img width="299" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img23.png" alt="$\displaystyle = \frac{\mbox{\char93 true positives}}{\mbox{\char93 true positives + \char93 false positives}} $"/></div>
<p>Also see <em>Positive Predictive Value</em></dd>
<dt><strong>Recall</strong></dt>
<dd>In information retrieval, the fraction of relevant documents that are returned by the query. In classification, the fraction of all true instances of class A that are classified into class A. The same as sensitivity.</p>
<p><!-- MATH<br />
 \begin{displaymath}<br />
\mbox{recall} = \frac{\mbox{\# true positives}}{\mbox{\# true positives + \# false negatives}} = TPR<br />
\end{displaymath}<br />
 --></p>
<div align="center">&nbsp; &nbsp;recall<img width="366" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img46.png" alt="$\displaystyle = \frac{\mbox{\char93 true positives}}{\mbox{\char93 true positives + \char93 false negatives}} = TPR $"/></div>
<p>Also see <em>Sensitivity</em></dd>
<dt><strong>ROC Curve</strong></dt>
<dd>Plot of true positive rate versus false positive rate for a diagnostic test or binary classifier, as the decision threshold is varied.</dd>
<dt><strong>Sensitivity</strong></dt>
<dd>The true positive rate <img width="44" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img5.png" alt="$ TPR$"/> of a diagnostic or screening procedure. Also see <em>Recall</em>.</dd>
<dt><strong>Specificity</strong></dt>
<dd>The true negative rate <!-- MATH<br />
 $TNR = 1 - FPR$<br />
 --><br />
<img width="140" height="32" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img8.png" alt="$ TNR = 1 - FPR$"/> (or the complement of the false positive rate) of a diagnostic or screening procedure.</dd>
</dl>
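<p>The accuracy terms above can be sketched in a few lines. This is an illustration under stated assumptions, not code from the original article: the function names and the example rates are hypothetical, and <code>f_beta</code> uses the standard weighted harmonic mean, which the glossary gives explicitly only for the case beta = 1. The last lines check that converting the posterior odds LR_P times odds_pop back to a probability reproduces the PPV formula.</p>

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F_1."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def ppv(tpr, fpr, prevalence):
    """Positive predictive value at an assumed prevalence P(+)."""
    return tpr * prevalence / (tpr * prevalence + fpr * (1 - prevalence))

def posterior_odds(likelihood_ratio, prevalence):
    """odds_post = LR * odds_pop, where odds_pop = P(+)/P(-)."""
    return likelihood_ratio * prevalence / (1 - prevalence)

# Hypothetical test characteristics and prevalence.
tpr, fpr, prevalence = 0.9, 0.05, 0.1

lr_p = tpr / fpr                       # positive likelihood ratio
odds = posterior_odds(lr_p, prevalence)
ppv_from_odds = odds / (1 + odds)      # convert odds back to a probability
ppv_direct = ppv(tpr, fpr, prevalence) # same quantity from the PPV formula
```

<p>Calling <code>f_beta</code> with beta = 2 or beta = 0.5 gives the recall-weighted and precision-weighted scores described above.</p>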
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot208" id="foot208">&#8230; rate</a><a href="#tex2html2"><sup>1</sup></a></dt>
<dd>The ELISA sensitivity and specificity numbers are from WHO&#8217;s report on the operational characteristics of HIV Assays [<a href="#who">Org04</a>, p. 18], using the lower bounds of the confidence interval. They are slightly different from the numbers Rosenau uses.</dd>
<dt><a name="foot86" id="foot86">&#8230;.</a><a href="#tex2html4"><sup>2</sup></a></dt>
<dd>The definition of <img width="45" height="15" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/ste1img11.png" alt="$ PPV$"/> is conventionally given in terms of sensitivity and specificity (similarly for the likelihood ratios discussed in the following section). The definitions in this article are given in terms of false positive rate, etc., since that is clearer for people reading outside their discipline.</dd>
<dt><a name="foot133" id="foot133">&#8230; negative</a><a href="#tex2html6"><sup>3</sup></a></dt>
<dd>We are using a classifier in our example, but a diagnostic test would work the same way.</dd>
</dl>
<p></p>
<hr />


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/' rel='bookmark' title='Permanent Link: Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Permanent Link: Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Permanent Link: Living in A Lognormal World'>Living in A Lognormal World</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Google AdSense Channels IDs and the Cramer Rao Inequality</title>
		<link>http://www.win-vector.com/blog/2009/10/google-adsense-channels-ids-and-the-cramer-rao-inequality/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=google-adsense-channels-ids-and-the-cramer-rao-inequality</link>
		<comments>http://www.win-vector.com/blog/2009/10/google-adsense-channels-ids-and-the-cramer-rao-inequality/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 22:07:07 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Administrativia]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[AdSense]]></category>
		<category><![CDATA[AdSense Channel]]></category>
		<category><![CDATA[Channel ID]]></category>
		<category><![CDATA[Cramer-Rao]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=898</guid>
		<description><![CDATA[&#8220;Comparing Apples and Oranges: Two Examples of the Limits of Statistical Inference, With an Application to Google Advertising Markets&#8221; is our analysis of Google AdSense Channel IDs and our use of the Cramer Rao bound to show that these IDs fundamentally limit what participants in the Google online advertising market can measure (and therefore in [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/06/yaygda-yet-another-yahoo-google-deal-article/' rel='bookmark' title='Permanent Link: YAYGDA (Yet Another Yahoo Google Deal Article)'>YAYGDA (Yet Another Yahoo Google Deal Article)</a></li>
<li><a href='http://www.win-vector.com/blog/2007/06/new-paper/' rel='bookmark' title='Permanent Link: New Paper'>New Paper</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Permanent Link: Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.win-vector.com/SelectedPapers/files/ComparingApplesAndOrangesProblemsWithAdsense.pdf">&#8220;Comparing Apples and Oranges: Two Examples of the Limits of Statistical Inference, With an Application to Google Advertising Markets&#8221;</a> is our analysis of Google AdSense Channel IDs and our use of the Cramer Rao bound to show that these IDs fundamentally limit what participants in the Google online advertising market can measure (and therefore in turn limit what these players can do).<br />
<span id="more-898"></span><br />
We also include an entry-level exposition and examples of what the Cramer Rao Inequality is and how it works.</p>
<p>This is a repost of an older paper, but a few people have pointed out they were put off by the incredibly uninformative title of the original post &#8220;<a href="http://www.win-vector.com/blog/2007/06/new-paper/">New Paper</a>.&#8221;</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/06/yaygda-yet-another-yahoo-google-deal-article/' rel='bookmark' title='Permanent Link: YAYGDA (Yet Another Yahoo Google Deal Article)'>YAYGDA (Yet Another Yahoo Google Deal Article)</a></li>
<li><a href='http://www.win-vector.com/blog/2007/06/new-paper/' rel='bookmark' title='Permanent Link: New Paper'>New Paper</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Permanent Link: Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/10/google-adsense-channels-ids-and-the-cramer-rao-inequality/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What is the gambler&#8217;s equivalent of Amdahl&#8217;s Law?</title>
		<link>http://www.win-vector.com/blog/2009/10/what-is-the-gamblers-equivalent-of-amdahls-law/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=what-is-the-gamblers-equivalent-of-amdahls-law</link>
		<comments>http://www.win-vector.com/blog/2009/10/what-is-the-gamblers-equivalent-of-amdahls-law/#comments</comments>
		<pubDate>Thu, 08 Oct 2009 20:38:21 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Quantitative Finance]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Amdahl's Law]]></category>
		<category><![CDATA[Kelly Criterion]]></category>
		<category><![CDATA[Kraft Inequality]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Statistical Detective]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=878</guid>
		<description><![CDATA[While executing some statistical detective work for a client we had a major &#8220;aha!&#8221; moment and realized something like &#8220;Amdahl&#8217;s Law&#8221; rephrased in terms of probability would solve everything. We finished our work using direct methods and moved on. But it is an interesting question: what is the probabilist&#8217;s (or gambler&#8217;s) equivalent of Amdahl&#8217;s Law? [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='Permanent Link: A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2008/05/betting-best-of-series/' rel='bookmark' title='Permanent Link: Betting Best-Of Series'>Betting Best-Of Series</a></li>
<li><a href='http://www.win-vector.com/blog/2008/06/how-market-designs-set-prices/' rel='bookmark' title='Permanent Link: How Market Designs Set Prices'>How Market Designs Set Prices</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>While executing some statistical detective work for a client we had a major &#8220;aha!&#8221; moment and realized  something like &#8220;Amdahl&#8217;s Law&#8221; rephrased in terms of probability would solve everything.  We finished our work using direct methods and moved on.  But it is an interesting question: what is the probabilist&#8217;s (or gambler&#8217;s) equivalent of Amdahl&#8217;s Law?<span id="more-878"></span></p>
<p>Amdahl&#8217;s Law is a famous idea due to computer architect Gene Amdahl.  It is a simple technique that computer scientists use to re-direct their work back to important parts of problems.  Suppose you have a complicated system you wish to speed up.  Suppose this system is spending a p-fraction of its time in an important sub-process and that you have an idea that would speed up the sub-process by a factor of k.  Should you invest the effort?  </p>
<p>Amdahl&#8217;s Law says (by simple arithmetic): the speed-up (the ratio of the old run-time over the new run-time) the entire system would achieve if you implemented your improvement is not the factor of k you would hope for, but instead:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/10/eq11.png" alt="eq1.png" border="0" width="141" height="56" /><br />
</center></p>
<p>For example, if p = 1/3 then you can cut the overall run-time by at most 33% (a speed-up factor of at most 1.5), even if your idea is so astoundingly good that you have k=1000.</p>
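<p>A minimal sketch of this arithmetic, assuming the formula in the image above is the usual statement of Amdahl&#8217;s Law, speed-up = 1 / ((1 - p) + p/k). The function name <code>amdahl_speedup</code> is ours, not the original post&#8217;s.</p>

```python
def amdahl_speedup(p, k):
    """Overall speed-up when a fraction p of the run-time is sped up by a factor k.

    Assumes the standard form of Amdahl's Law: 1 / ((1 - p) + p / k).
    """
    return 1.0 / ((1.0 - p) + p / k)

# The example from the text: one third of the time, sped up a thousandfold.
speedup = amdahl_speedup(p=1/3, k=1000)

# Fraction of the original run-time that is saved.
time_saved = 1.0 - 1.0 / speedup

# The limit as k grows without bound is 1 / (1 - p).
limit = 1.0 / (1.0 - 1/3)
```

<p>Here <code>speedup</code> comes out just under the limiting factor of 1.5, and <code>time_saved</code> just under one third, no matter how large k is made.</p>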
<p>Amdahl&#8217;s Law reminds us that speeding up a component you do not lose much time to is not an important accomplishment.  In fact Amdahl&#8217;s Law directly prescribes looking at your most expensive components as being the largest opportunities for improvement.  Appealing to Amdahl&#8217;s Law is an important nerd-tool to end &#8220;color of the bike shed&#8221; arguments (and concentrate only on the design of systems that actually have an impact on outcomes).</p>
<p>It is clear there are similar principles for managing expenses, revenue, effort and so on (such as the Pareto Principle).</p>
<p>But what is the equivalent statement in the harder and more complicated world of probabilities and gambling systems?  There are a lot of candidate statements and theorems (such as &#8220;look for horses not for zebras&#8221;, the Kraft Inequality, Kullback Leibler Distance, Cross Entropy and the Asymptotic Equipartition Principle) but I think the most powerful and direct analogue is: the Kelly Betting System.  The Kelly Betting System is a remarkable system that, like Amdahl&#8217;s Law, tells us exactly what to look at (and surprisingly some things to ignore).</p>
<p>Kelly&#8217;s original paper: &#8220;A New Interpretation of Information Rate&#8221; J. L. Jr Kelly, AT&#038;T Technical Journal (1956) phrases the problem as betting at a horse race.  The technique applies more generally (other forms of gambling, portfolio management, even explaining the preferences of lab-mice) but the clearest example remains a horse race.</p>
<p>We follow the excellent discussion of the problem from Cover and Thomas &#8220;Information Theory&#8221; Wiley (1991).    Consider a simplified horse race where there is only one payoff offered: picking the winning horse.  Suppose the (unknown) true probability of the i-th horse winning is p_i.  Further suppose the track publishes a set of payoffs for each horse such that if you bet a dollar on the i-th horse and it wins: you are given o_i dollars back.   </p>
<p>Now a gambler that has no estimate of the p_i might put all of their money on &#8220;the highest paying horse.&#8221;   That is picking the i such that o_i is maximal (&#8220;going for the big score&#8221;).   A somewhat more informed gambler might put all of their money on the &#8220;horse with the best expected return&#8221;, that is, a horse i that maximizes p_i * o_i.  But this betting strategy &#8220;invites ruin&#8221;:  you have probability of 1 - p_i of losing all of your money.  Kelly starts with the controversial idea of trying to maximize expected log-return (instead of maximizing expected return).  Maximizing expected log-return avoids ruin, maximizes the exponential rate at which your wealth grows and maximizes the median wealth over all outcomes (see: &#8220;The Kelly System Maximizes Median Fortune&#8221; S N Ethier, Journal of Applied Probability (2004) vol. 41 (4) pp. 1230-1236).  Even the observation that you don&#8217;t always want to put all of your money in a &#8220;favorable bet&#8221; (that is one with expectation p_i * o_i >1) is an important one.</p>
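<p>To make the difference between expected return and expected log-return concrete, here is a small sketch. The probabilities and payoffs are hypothetical numbers chosen for illustration, and the function names are ours. It assumes all money is bet each race, so your wealth is multiplied by b_i * o_i when horse i wins.</p>

```python
import math

def expected_return(bets, p, o):
    """Expected wealth multiple after one race when fraction bets[i] is on horse i."""
    return sum(p_i * b_i * o_i for p_i, b_i, o_i in zip(p, bets, o))

def expected_log_return(bets, p, o):
    """Expected log of the wealth multiple; -inf if a possible winner has no bet."""
    total = 0.0
    for p_i, b_i, o_i in zip(p, bets, o):
        if p_i == 0:
            continue              # this horse cannot win, so it contributes nothing
        if b_i == 0:
            return float("-inf")  # ruin has positive probability
        total += p_i * math.log(b_i * o_i)
    return total

p = [0.6, 0.25, 0.15]      # hypothetical true win probabilities
o = [2.0, 4.0, 4.0]        # hypothetical payoffs per dollar bet
all_in = [1.0, 0.0, 0.0]   # everything on the horse maximizing p_i * o_i
spread = p                 # a p_i fraction on each horse
```

<p>With these numbers the all-in bet has the higher expected return (1.2 versus 1.06) but an expected log-return of minus infinity, while the spread bet has a small positive expected log-return.</p>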
<p>To get the next part of Kelly&#8217;s system consider the sum of reciprocals of track offered payoffs:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/10/sum1.png" alt="sum1.png" border="0" width="82" height="68" /><br />
</center></p>
<p>At any real track this sum will be greater than 1 (i.e. the o_i will be small, making the sum large).   The larger the sum the more clearly unfair the track&#8217;s published payoff schedule is.  Let us assume we were at a fantastically generous track where this sum is exactly 1 (admittedly unrealistic, and both the paper and the book work beyond this limitation).  In this case we can write r_i = 1/o_i and we know r_i > 0 and the r_i sum to 1.  That is we can interpret the r_i as the track&#8217;s estimate of the probability of the i-th horse winning.   If o_i = 100 (the track is paying off 100:1) we then can infer they think the i-th horse has no more than a 1 in 100 chance of winning (else they could not afford to offer the bet).  Kelly&#8217;s system gives (and proves correct) the following remarkable advice: if the sum given above is 1 (i.e. the track is paying off at least a fair rate) then you can safely bet all of your money and you should bet a p_i fraction of your money on the i-th horse.  </p>
<p>That is: if you decide the track is paying off so much that it is worth your while to gamble then you should then completely ignore the track&#8217;s payoff schedule in making your bet.   You might use the track&#8217;s published payoffs as some of your evidence when trying to estimate the p_i (the probability of each horse winning), but once you have estimated these probabilities you then ignore the track&#8217;s payoff rates in designing your bets.  In fact your expected rate of winning is exactly proportional to how much closer to the true probabilities your estimate is than the track&#8217;s estimate is (Cover/Thomas example 6.1.1), so unless you know something the track does not know you should not bet.  Also you should bet even on unlikely and underpaying horses to help cover the possibilities (this is because you are making a series of bets, not just a single bet, so each bet&#8217;s value is computed under the assumption that your other bets have failed).  This (provably correct) advice is contrary to many obvious and traditional betting systems.</p>
<p>The Kelly System is simultaneously very precise and broadly applicable.  For example: it has been extended to many other games and the stock market (see: &#8220;The Kelly Criterion and the Stock Market&#8221; Louis M Rotando, Edward O Thorp, The American Mathematical Monthly (1992) vol. 99 (10) pp. 922-931).  The Kelly System gives actionable advice (exact amounts to bet or exact amounts of effort to invest) and is very specific in saying what to look at.  </p>
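<p>The advice above can be checked numerically. The sketch below uses hypothetical probabilities and payoffs (with reciprocal payoffs summing to exactly 1) and our own function names. It compares the expected log growth per race of three betting schemes and confirms that, when you bet your p_i, the growth rate equals the Kullback Leibler divergence between the true probabilities and the track&#8217;s implied probabilities r_i = 1/o_i.</p>

```python
import math

def growth_rate(bets, p, o):
    """Expected log wealth multiple per race (natural log), all money bet."""
    return sum(p_i * math.log(b_i * o_i) for p_i, b_i, o_i in zip(p, bets, o))

def kl_divergence(p, r):
    """Kullback Leibler divergence D(p || r) in nats."""
    return sum(p_i * math.log(p_i / r_i) for p_i, r_i in zip(p, r))

p = [0.6, 0.25, 0.15]          # hypothetical true win probabilities
o = [2.0, 4.0, 4.0]            # hypothetical payoffs; reciprocals sum to 1
r = [1.0 / o_i for o_i in o]   # the track's implied probabilities

kelly = growth_rate(p, p, o)                 # bet a p_i fraction on horse i
track = growth_rate(r, p, o)                 # bet the track's r_i instead
tilted = growth_rate([0.7, 0.2, 0.1], p, o)  # lean toward the best expected return
```

<p>Betting the track&#8217;s own probabilities gives zero growth, and leaning toward the horse with the best expected return grows wealth more slowly than betting the p_i, even though the payoffs never entered the Kelly bet.</p>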
<p>Just as Amdahl&#8217;s law shows us component speedup is a distraction the Kelly System shows us that published rates of return are siren songs.  Thus the Kelly System is the gambler&#8217;s equivalent of Amdahl&#8217;s Law.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='Permanent Link: A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2008/05/betting-best-of-series/' rel='bookmark' title='Permanent Link: Betting Best-Of Series'>Betting Best-Of Series</a></li>
<li><a href='http://www.win-vector.com/blog/2008/06/how-market-designs-set-prices/' rel='bookmark' title='Permanent Link: How Market Designs Set Prices'>How Market Designs Set Prices</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/10/what-is-the-gamblers-equivalent-of-amdahls-law/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Good Graphs: Graphical Perception and Data Visualization</title>
		<link>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=good-graphs-graphical-perception-and-data-visualization</link>
		<comments>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 15:40:41 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Cleveland]]></category>
		<category><![CDATA[data exploration]]></category>
		<category><![CDATA[graphical perception]]></category>
		<category><![CDATA[Lattice]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=296</guid>
		<description><![CDATA[What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Permanent Link: Living in A Lognormal World'>Living in A Lognormal World</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='Permanent Link: The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective visualization that neither obscures important details, nor drowns us in confusing clutter? In 1985, William Cleveland published a text called <a href="http://www.stat.purdue.edu/~wsc/elements.html"><em>The Elements of Graphing Data,</em></a> inspired by Strunk and White&#8217;s classic writing handbook <a href="http://www.amazon.com/Elements-Style-50th-Anniversary/dp/0205632645"><em>The Elements of Style</em></a> . <em>The Elements of Graphing Data</em> puts forward Cleveland&#8217;s philosophy about how to produce good, clear graphs, not only for presenting one&#8217;s experimental results to peers, but also for the purposes of data analysis and exploration. Cleveland&#8217;s approach is based on a theory of graphical perception: how well the human perceptual system accomplishes certain tasks involved in reading a graph. For a given data analysis task, the goal is to align the information being presented with the perceptual tasks the viewer accomplishes the best. <span id="more-296"></span></p>
<blockquote><p>When a graph is made, quantitative and categorical information is encoded by a display method. Then the information is visually decoded. This visual perception is a vital link. No matter how clever the choice of the information, and no matter how technologically impressive the encoding, a visualization fails if the decoding fails. Some display methods lead to efficient, accurate decoding, and others lead to inefficient, inaccurate decoding. It is only through scientific study of visual perception that informed judgments can be made about display methods. The display methods of <em>Elements</em> rest on a foundation of scientific enquiry.</p></blockquote>
<p>— from the preface of <em>The Elements of Graphing Data</em></p>
<p>A revised edition of <em>The Elements of Graphing Data</em> was published in 1994, along with a companion volume, <a href="http://www.stat.purdue.edu/~wsc/visualizing.html"><em>Visualizing Data,</em></a> which is oriented towards the implementation and technical details of different graphing techniques. I highly recommend <em>The Elements of Graphing Data</em> as a guidebook for creating graphs, as well as for its excellent survey of several useful techniques. Cleveland, along with other colleagues at Bell Labs, developed the <a href="http://stat.bell-labs.com/project/trellis/s.html">Trellis display system,</a> a framework for the visualization of multivariable databases, using the ideas developed in his texts. Trellis, in turn, influenced Deepayan Sarkar&#8217;s Lattice graphics system for R. Lattice implements many of Cleveland&#8217;s ideas, and I also recommend Sarkar&#8217;s <a href="http://lmdvr.r-forge.r-project.org/figures/figures.html">Lattice manual</a> if you do data visualization in R.</p>
<p>It&#8217;s important to note here that Cleveland writes for researchers and decision-makers who use graphs to analyze data, or to convey scientific results to colleagues in an (ideally) objective manner. This distinguishes him from Darrell Huff, whose 1954 <a href="http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728"><em>How to Lie with Statistics</em></a> considered the use of graphs (and statistics in general) as rhetorical devices for convincing others of one&#8217;s point of view. Hence, some of Cleveland&#8217;s recommendations and guidelines actually contradict Huff&#8217;s. <a id="refHuff" href="#Huff"><sup>1</sup></a></p>
<p>Edward Tufte also explored the idea that the choice of graphical display should be influenced by the viewer&#8217;s cognitive processes, in his 1990 book <a href="http://www.edwardtufte.com/tufte/books_ei"><em>Envisioning Information</em></a>. Tufte tends to be more broadly concerned with the gestalt of a graph, beyond its use as an analysis tool; he is also more concerned than Cleveland is with aesthetic considerations.</p>
<p>Cleveland&#8217;s philosophy might be summarized as: <em>minimize the mental gymnastics that the viewer must go through to understand the graph</em>. This leads to some obvious advice: avoid clutter and occlusion, make graphing symbols or color-coding unambiguous, use scale-lines on all four sides of the graph, and so on. It also leads to advice that perhaps should be as obvious, but isn&#8217;t: <em>make the aspect of the data that you want to analyze as clear as possible</em>. But what does this mean in practice?</p>
<p><strong>Make important differences large enough to perceive</strong></p>
<p>Weber&#8217;s Law is a well-known observation from the psychophysics literature, which states that the &#8220;just noticeable&#8221; change in a stimulus is a constant ratio of the original stimulus. Put another way, people are only capable of detecting a change in a stimulus that is greater than a certain percentage <em>k</em> of the original stimulus. Here, &#8220;stimulus&#8221; can refer to any perceivable physical quantity: weight, intensity, length, orientation. The percentage <em>k</em> varies with the type of stimulus and with the observer.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/weberslaw.jpg" border="0" alt="weberslaw.jpg" width="488" height="233" /></div>
</td>
</tr>
</tbody>
<caption>Figure 1: From Cleveland, <em>The Elements of Graphing Data</em></caption>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
<p>Figure 1 shows the application of Weber&#8217;s law to lengths. The bars A and B are of different lengths, but the difference is such a small fraction of the &#8220;base&#8221; length (say, A&#8217;s length, to be specific) that it is difficult to tell whether or not they are different, or which is longer. On the right, the bars have been embedded in frames of identical length, and now it is easy to see that B is longer. Why? Because the difference in lengths of the <em>white</em> intervals is a much larger percentage of the white &#8220;base&#8221; length (say the white A interval). It is easy to see that the white B interval is shorter than the white A interval, and therefore, the black B interval is longer than the black A interval.</p>
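<p>The arithmetic behind the framed bars can be sketched in a few lines of Python. The lengths here are hypothetical (they are not measured from Figure 1); the point is only that the same absolute difference is a far larger fraction of the short white interval than of the long black bar.</p>

```python
def relative_difference(length_a, length_b):
    """Difference between two lengths as a fraction of the first (base) length."""
    return abs(length_b - length_a) / length_a

# Hypothetical measurements in arbitrary units (not taken from Figure 1).
frame = 100.0                      # both frames have the same total length
black_a, black_b = 90.0, 92.0      # the two bars being compared
white_a, white_b = frame - black_a, frame - black_b

black_ratio = relative_difference(black_a, black_b)
white_ratio = relative_difference(white_a, white_b)

print(f"black bars: {black_ratio:.1%} of the base length")
print(f"white gaps: {white_ratio:.1%} of the base length")
```

<p>Whether either ratio exceeds a given viewer&#8217;s threshold <em>k</em> is an empirical question; the sketch only shows why the framed version has the better chance.</p>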
<p>The moral is that you always want the viewer to be estimating changes or differences with respect to a short base length. You can do this with reference grids, as demonstrated below.</p>
<table border="0" align="center">
<caption>From Cleveland, <em>The Elements of Graphing Data</em></caption>
<tbody>
<tr>
<td><!-- original 319 by 601 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/noreferencegrids.jpg" border="0" alt="noreferencegrids.jpg" width="200" height="400" align="left" /></td>
<td><!-- original 319 by 601 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/referencegrids1.jpg" border="0" alt="referencegrids.jpg" width="200" height="400" align="right" /></td>
</tr>
<tr>
<td align="center">Figure 2</td>
<td align="center">Figure 3</td>
</tr>
</tbody>
</table>
<p>Figure 2 shows eight curves. Which one dips to the lowest minimum? Are the high curves approaching the same value, and which one is rising the fastest? Are the low curves dipping to the same minimum? Are they going to the same steady state? Figure 3 shows the same curves, graphed with identical reference grids. The grids shorten the base lengths that are being compared, and it is now much easier to compare highs, lows, and steady state behavior.</p>
<p>But wouldn&#8217;t it be better to compare the graphs by superposing them? For two or three curves, perhaps. But in this case, eight curves can clutter the graph, and use up the symbol or color space, making it difficult to distinguish the different datasets &#8212; increasing the mental gymnastics.</p>
<p>Reference grids are useful even for a single curve, especially one with slowly varying segments, such as these graphs have. The reference grid makes it easier to answer questions like: does the process return to the initial state, or to a different steady state? Has the process reached steady state, or is it still growing?</p>
<p><strong>Make important shape changes large enough to perceive: Banking to 45 degrees.</strong></p>
<p>The aspect ratio of a graph is important when trying to understand shape. Rate of change information is encoded in the slope of the curve, which the viewer estimates by changes in the orientation of the local tangents at each point of the graph. Weber&#8217;s Law tells us that very small changes in this orientation will be difficult to detect. For a given (physical) curve, the local orientation changes will be dependent on the aspect ratio of its graphical presentation, as shown (to an exaggerated degree) in Figure 4. Here, the same curve (two line segments) is plotted at three different aspect ratios, one that centers the graph at 45 degrees, one that forces the curve to be nearly vertical, and another that forces it to be nearly horizontal. In the last two cases, the change in orientation of the two line segments is so small as to be nearly undetectable.</p>
<table border="0" align="center">
<caption>Figure 4: From Cleveland</caption>
<tbody>
<tr>
<td><!-- original 670 by 630 -->
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/angles.jpg" border="0" alt="angles.jpg" width="446" height="420" align="left" /></div>
</td>
</tr>
</tbody>
</table>
<p>For two line segments with positive, unequal slopes, a simple geometric argument shows that their absolute difference in orientation is maximized by the aspect ratio that sets their average orientation to 45 degrees (the first graph in Figure 4). Empirical studies by Cleveland and others have indeed verified that a viewer&#8217;s ability to judge the relative slopes of line segments on a graph is maximized when the absolute values of the orientations of the segments are centered on 45 degrees.</p>
<p>This result leads to a technique called <em>Banking to 45</em>, whereby the aspect ratio of the graph is chosen so that the average slope of the entire graph is 45 degrees. The details are discussed in Cleveland, and many of the plots in R&#8217;s Lattice package also have an option to bank the graph to 45 degrees.</p>
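<p>The two-segment geometric argument can be checked numerically. Changing the aspect ratio multiplies every slope by a common factor, called <code>scale</code> here, and the angle between the two segments is largest when <code>scale * scale * s1 * s2 = 1</code>. The slopes below are hypothetical, and this is a sketch of the two-segment case only, not Cleveland&#8217;s full procedure or the Lattice implementation.</p>

```python
import math

def orientation_gap(s1, s2, scale):
    """Angle in degrees between two segments after both slopes are multiplied by `scale`."""
    return abs(math.degrees(math.atan(scale * s2) - math.atan(scale * s1)))

def banking_scale(s1, s2):
    """Scale factor that maximizes the gap for two positive, unequal slopes."""
    return 1.0 / math.sqrt(s1 * s2)

s1, s2 = 0.02, 0.05    # hypothetical slopes in data units
best = banking_scale(s1, s2)

# At the optimum the two orientations average to 45 degrees.
mean_orientation = (
    math.degrees(math.atan(best * s1)) + math.degrees(math.atan(best * s2))
) / 2

for scale in (best / 20, best, best * 20):
    print(f"scale={scale:9.2f}  gap={orientation_gap(s1, s2, scale):6.2f} degrees")
print(f"mean orientation at the optimum: {mean_orientation:.1f} degrees")
```

<p>The first and last rows of the loop correspond to the nearly horizontal and nearly vertical plots in Figure 4, where the gap between the segments almost vanishes.</p>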
<p>This deliberate exaggeration of slope is something that Darrell Huff deplores. In <em>How to Lie with Statistics</em>, Huff refers to these graphs as &#8220;gee-whiz&#8221; graphs — and in the context of his discussion of statistics as rhetoric, they are:</p>
<table border="0" align="center">
<caption>Figure 5: From Huff, <em>How to Lie With Statistics</em></caption>
<tbody>
<tr>
<td><!-- original 461 by 351 -->
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/geewhiz.jpg" border="0" alt="geewhiz.jpg" width="461" height="351" /></div>
</td>
</tr>
</tbody>
</table>
<p>To insist that a graph should always include a zero line and that units be in proportion may be good advice from a rhetorical perspective; but it is poor advice if the purpose of the graph is data analysis. As Figure 6 below demonstrates, we can lose resolution if we always insist on including the zero. Does the trend line in the left graph increase linearly, superlinearly, or sublinearly? The convexity of the curve is more apparent when it is banked to 45, as on the right. Assuming that the scientist reads the axis and is cognizant of the actual magnitude changes involved, the graph on the right conveys more information.</p>
<table border="0" align="center">
<caption>Figure 6: From Cleveland</caption>
<tbody>
<tr>
<td><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/bank451.jpg" border="0" alt="bank45.jpg" width="500"  /></td>
</tr>
</tbody>
</table>
<p><strong>Make sure all the data is equally well resolved.</strong></p>
<p>It is quite common for positive data —  word frequencies, populations, price distributions, just to name a few examples — to be skewed: most of the data is bunched towards low values, the rest of it is spread out on a very long tail. This long tail squashes the majority of the data into a tiny interval of a very narrow dynamic range, as in Figure 7, making it difficult to evaluate the data.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original size: 990 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/skewed1.gif" border="0" alt="skewed.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 7: Long-tailed distribution of purchase sizes</caption>
</table>
</td>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original = 499 by 675 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/logskewed1.gif" border="0" alt="logskewed.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 8: Distribution of log(purchase size)</caption>
</table>
</td>
</tr>
</tbody>
</table>
<p>Imagine that Figure 7 represents the distribution of average purchase size across an online merchant&#8217;s customers: average purchase size is plotted on the x-axis, and the y-axis represents the fraction of the total customer population whose average purchase size is a given value (the area under the graph integrates to one). According to this graph, most customers make fairly small purchases on average, but there is a long tail of big spenders trailing out into the range of several thousand dollars. Obviously, one would like a little more resolution on the big spike of customers near zero. One could simply &#8220;zoom in&#8221; on this range by chopping off a long chunk of the tail, but doing so risks losing sight of global patterns in the data.</p>
<p>Graphing the distribution of log(purchase size) enables you to increase the resolution near zero, while preserving the global view. Figure 8 shows the distribution of log(purchase size), revealing two spending populations: a population of high spenders who tend to make purchases in the $3000 range (in log space), and another population whose purchases are centered (in log space) around $60. The existence of these two distinct populations is not apparent in the original graph.</p>
<p>Notice that Figure 8 has two x-axis scales: the top axis is marked in log units, while the bottom axis is marked in absolute dollars, spaced on a log scale. This accords with the principle of minimizing mental gymnastics, since the viewer of the graph will typically be concerned about prices in dollars, not log dollars. In fact, it would have been better yet to have plotted the distribution of log<sub>2</sub> or log<sub>10</sub> of the data; the former would allow us to see at a glance the doubling of price ranges, the latter to see price changes in factors of ten.</p>
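<p>A rough way to quantify the resolution gained by the log transform is to ask how much of the data lands in the lowest tenth of the plotted axis. The sketch below uses synthetic data: two made-up lognormal populations centered near $60 and $3000, with mixing proportions and spreads that are assumptions, not the data behind Figures 7 and 8.</p>

```python
import math
import random

def share_in_lowest_tenth(values):
    """Fraction of values that fall in the lowest 10% of the plotted range."""
    low, high = min(values), max(values)
    cutoff = low + 0.1 * (high - low)
    return sum(1 for v in values if cutoff >= v) / len(values)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic purchase sizes: two hypothetical lognormal populations.
small = [rng.lognormvariate(math.log(60), 0.5) for _ in range(900)]
large = [rng.lognormvariate(math.log(3000), 0.5) for _ in range(100)]
purchases = small + large

raw_share = share_in_lowest_tenth(purchases)
log_share = share_in_lowest_tenth([math.log10(p) for p in purchases])

print(f"raw scale: {raw_share:.0%} of customers in the lowest tenth of the axis")
print(f"log scale: {log_share:.0%} of customers in the lowest tenth of the axis")
```

<p>Using <code>math.log10</code> here follows the suggestion above: one unit on the transformed axis is a factor of ten in dollars.</p>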
<table border="0" align="center">
<caption>Figure 9: The 14 most abundant elements in meteorites. From Cleveland</caption>
<tbody>
<tr>
<td><!-- original = 543 by 522 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/metals.jpg" border="0" alt="metals.jpg" width="250" /></td>
<td><!-- original = 550 by 600 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/logmetals.jpg" border="0" alt="logmetals.jpg" width="250" /></td>
</tr>
</tbody>
</table>
<p>Figure 9 shows another example: the fourteen most abundant elements in meteorites, specifically the average percent of each of the elements. If we graph the percentages directly, as on the left, we cannot easily distinguish the differences in the elements from aluminum on down. Graphing log<sub>2</sub> of the percentages, as on the right, improves the resolution. Again, we have two x-axes on the graph of the log data.</p>
<p><strong>If you want to analyze the difference between two processes, then graph the difference, not the processes (or graph both).</strong></p>
<p>Suppose that we are comparing the two processes f1 and f2 that are shown in Figure 10. As x increases, the two processes appear to be approaching each other  — that is, the difference between the two seems to be decreasing. In reality, the difference between the two is constant: f2 = f1+1.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original size: 990 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/difference1.gif" border="0" alt="difference.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 10: The illusion of convergence</caption>
</table>
</td>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original = 499 by 675 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/imports.jpg" border="0" alt="imports.jpg" width="250" /></td>
</tr>
</tbody>
<caption>Figure 11: British Imports and Exports. From Cleveland</caption>
</table>
</td>
</tr>
</tbody>
</table>
<p>It turns out that people are good at perceiving the perpendicular difference between two curves, but not the differences in height, which is what we are actually interested in here. When we try to infer the differences from the process graph, we may not only miss key information, we may actually draw incorrect conclusions.</p>
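<p>The illusion can be reproduced numerically. In the sketch below, <code>f1</code> is a hypothetical steeply rising curve (the process behind Figure 10 is not specified) and <code>f2 = f1 + 1</code>, as in the text. The perpendicular gap is approximated by treating the two curves as locally parallel lines, and it assumes both axes are drawn at the same scale.</p>

```python
import math

def f1(x):
    return math.exp(x)        # hypothetical steeply rising process

def f2(x):
    return f1(x) + 1.0        # constant vertical offset, as in the text

def perpendicular_gap(x, vertical_gap=1.0):
    """Approximate perpendicular distance between the two curves near x.

    Treats both curves as parallel lines with the slope of f1 at x.
    """
    slope = math.exp(x)       # derivative of f1 at x
    return vertical_gap / math.sqrt(1.0 + slope ** 2)

for x in (0.0, 1.0, 2.0, 3.0):
    vertical = f2(x) - f1(x)
    print(f"x={x:.0f}  vertical gap={vertical:.2f}  perpendicular gap~{perpendicular_gap(x):.2f}")
```

<p>The vertical gap stays at 1 while the perpendicular gap shrinks as the curves steepen, which is what the eye reads as convergence.</p>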
<p>A less toy example is given in Figure 11. Here the imports to and exports from England are graphed over the first 80 years of the 18th century. In the difference graph on the bottom, we can see a local peak in (imports-exports) just after 1760; this is not obvious from simply comparing the two processes (top graph).</p>
<p><strong>If you are interested in rate of change, then graph rate of change.</strong></p>
<p>In Figure 12, we see the population figures for a given community from 1990 to 2009. Obviously, the population is steadily increasing, but how quickly? Is the rate of population growth increasing over time, or is it decreasing? If we are interested in these questions, then simply graphing the population over time is not enough. We need to look at the rate of change directly.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<caption>Figure 12</caption>
<tbody>
<tr>
<td><!-- original 998 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/rateofchange1.gif" border="0" alt="rateofchange.gif" width="250" /></td>
</tr>
</tbody>
</table>
</td>
<td>
<table border="0">
<caption>Figure 13</caption>
<tbody>
<tr>
<td><!-- original 720 by 720 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lograteofchange2.gif" border="0" alt="lograteofchange.gif" width="250" /></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The classic way to do this is by graphing the logarithm of the data. In Figure 13, we have graphed log<sub>2</sub> of the population over time, with the log scale printed on the right hand y-axis, and the actual population numbers printed at a log scale on the left hand axis. Now we can see that the population increased at a constant rate from 1990 to 2000, quadrupling approximately every four years, and then slowed down (to a lower constant rate) after 2000.</p>
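<p>The reason the log plot works is that a constant growth rate becomes a straight line, and on a log<sub>2</sub> scale the slope of that line is the number of doublings per unit time. A minimal sketch, using made-up population figures rather than values read from Figure 13:</p>

```python
import math

def doubling_time(t0, p0, t1, p1):
    """Time needed to double, assuming a constant growth rate between the two points."""
    # Slope of log2(population) against time = doublings per unit time.
    log2_slope = (math.log2(p1) - math.log2(p0)) / (t1 - t0)
    return 1.0 / log2_slope

# Hypothetical figures: a 32-fold increase over ten years.
years = doubling_time(1990, 1000, 2000, 32000)
print(f"doubling time: {years:.1f} years")
```

<p>A doubling time of two years is the same statement as quadrupling every four years, so the two can be read interchangeably from a log<sub>2</sub> axis.</p>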
<p><strong>Graphs as a research tool</strong></p>
<p>Throughout this discussion, we have considered graphs as a tool for data exploration and initial understanding. It is an iterative process: as questions arise, the data will be reprocessed and re-plotted to highlight the new issues to be examined. A good research graph must display this information directly, with a minimum of mental gymnastics, but, as with any research tool, there can be a learning curve. For example, density plots (such as those shown in Figures 7 and 8) are in my opinion more useful than histograms for understanding how numerical data is distributed, and I am constantly surprised at the amount of explanation that they require when I show them to people who are unfamiliar with them. A number of very useful graphs that are discussed in Cleveland&#8217;s texts meet with the same reaction from people who encounter that style of graph for the first time. This is a disadvantage, relative to using a more fashionable graph, when attempting to communicate results. But the insight into the data that these graphs provide often makes it worth spending the time to educate clients or peers on how to read the graph.</p>
<p>Even so, a good graph still may not be a quick read. As Cleveland writes:</p>
<blockquote><p>While there is a place for rapidly-understood graphs, it is too limiting to make speed a requirement in science and technology, where the use of graphs ranges from detailed in-depth data analysis to quick presentation.<br />
&#8230;</p>
<p>The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.</p></blockquote>
<p>- <em>The Elements of Graphing Data</em>, Chapter 2</p>
<hr /><a id="Huff" href="#refHuff">[Back]</a><sup>1</sup><em>How to Lie with Statistics</em> is an entertaining (if a little dated) discussion of how to read statistical and quantitative claims critically, and is definitely worth a read.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Permanent Link: Living in A Lognormal World'>Living in A Lognormal World</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='Permanent Link: The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A Demonstration of Data Mining</title>
		<link>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=a-demonstration-of-data-mining</link>
		<comments>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/#comments</comments>
		<pubDate>Thu, 20 Aug 2009 01:16:27 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[Regression]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=252</guid>
		<description><![CDATA[REPOST (now in HTML in addition to the original PDF). This paper demonstrates and explains some of the basic techniques used in data mining. It also serves as an example of some of the kinds of analyses and projects Win Vector LLC engages in. August 19, 2009 John Mount1 A Demonstration of Data Mining 1&#160;&#160;Introduction [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='Permanent Link: The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Permanent Link: Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Permanent Link: Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>REPOST (now in HTML in addition to the original  <a href="http://www.win-vector.com/dfiles/ADemonstrationOfDataMining.pdf"> PDF</a>).</p>
<p>This paper  demonstrates and explains some of the basic techniques used in data mining.  It also serves as an example of some of the kinds of analyses and projects Win Vector LLC engages in.<span id="more-252"></span>
<div class="p"><!----></div>
<h3 align="center">August 19, 2009 </h3>
<h3 align="center">John Mount<a href="#tthFtNtAAB" name="tthFrefAAB"><sup>1</sup></a> </h3>
<h1 align="center">A Demonstration of Data Mining </h1>
<div class="p"><!----></div>
<h2><a name="tth_sEc1"><br />
1</a>&nbsp;&nbsp;Introduction</h2>
<div class="p"><!----></div>
<p> A major industry in our time is the collection of large data sets in preparation for the magic of data mining [<a href="#NYTStat" name="CITENYTStat">Loh09</a>,<a href="#Halevy:2009p2327" name="CITEHalevy:2009p2327">HNP09</a>].  There is extreme excitement about both the possible applications (identifying important customers, identifying medical risks, targeting advertising, designing auctions and so on) and the various methods for data mining and machine learning.  To some extent these methods are classic statistics presented in a new bottle.  Unfortunately, the concerns, background and language of the modern data-mining practitioner differ from those of the classic statistician, so some demonstration and translation is required.  In this writeup we will show how much of the magic of current data mining and machine learning can be explained in terms of statistical regression techniques and show how the statistician&#8217;s view is useful in choosing techniques.</p>
<div class="p"><!----></div>
<p> Too often data mining is used as a black box. It is quite possible to use statistics to clearly understand the meaning and mechanisms of data mining.</p>
<div class="p"><!----></div>
<h2><a name="tth_sEc2"><br />
2</a>&nbsp;&nbsp;The Example Problem</h2>
<div class="p"><!----></div>
<p> Throughout this writeup we will work on a single idealized example problem.  For our problem we will assume we are working with a company that sells items and that this company has recorded its past sales visits.  We assume they recorded how well the prospect matched the product offering (we will call this &#8220;match factor&#8221;), how much of a discount was offered to the prospect (we will call this &#8220;discount factor&#8221;) and if the prospect became a customer or not (this is our determination of positive or negative outcome).  The goal is to use this past record as &#8220;training data&#8221; and build a model to predict the odds of making a new sale as a function of the match factor and the discount factor.  In a perfect world the historic data would look a lot like Figure&nbsp;<a href="#fig:IdealFitting">1</a>.  In Figure&nbsp;<a href="#fig:IdealFitting">1</a> each icon represents a past sales-visit, the red diamonds are non-sales and the green disks are successful sales.  Each icon is positioned horizontally to correspond to the discount factor used and vertically to correspond to the degree of product match estimated during the prospective customer visit.  This data is literally too good to be true in at least three qualities: the past data covers a large range of possibilities, every possible combination has already been tried in an orderly fashion and the good and bad events &#8220;are linearly separable.&#8221;  The job of the modeler would then be to draw the separating line (shown in Figure&nbsp;<a href="#fig:IdealFitting">1</a>) and label every situation above and to the right of the separating line as good (or positive) and every situation below and to the left as bad (or negative).</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg1"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/IdealFitting.png" alt="IdealFitting.png" /></p>
<p></center><center>Figure 1: Ideal Fitting Situation</center><br />
<a name="fig:IdealFitting"><br />
</a></p>
<div class="p"><!----></div>
<p> In reality past data is subject to what prospects were available (so you are unlikely to have good range and an orderly layout of past sales calls) and also heavily affected by past policy.  An example policy might be that potential customers with a good product match factor were never offered a significant discount in the past, so we would have no data from that situation.  Finally each outcome is a unique event that depends on a lot more than the two quantities we are recording, so it is too much to hope that the good prospects are simply separable from the bad ones.</p>
<div class="p"><!----></div>
<p> Figure&nbsp;<a href="#fig:IdealFitting">1</a> is a mere cartoon or caricature of the modeling process, but it represents the initial intuition behind data mining.  Again: the flaws in Figure&nbsp;<a href="#fig:IdealFitting">1</a> represent the implicit hopes of the data miner.  The data miner wishes that the past experiments are laid out in an orderly manner, data covers most of the combinations of possibilities and there is a perfect and simple concept ready to be learned.</p>
<div class="p"><!----></div>
<p> Frankly, an experienced data miner would feel incredibly fortunate if the past data looked anything like what is shown in Figure&nbsp;<a href="#fig:EmpiricalData">2</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg2"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/empirical1.png" alt="empirical1.png" /></p>
<p></center><center>Figure 2: Empirical Data</center><br />
<a name="fig:EmpiricalData"><br />
</a></p>
<div class="p"><!----></div>
<p> The green disks (representing good past prospects) and the red diamonds (representing bad past prospects) are intermingled (which is bad).  There is some evidence that past policy was to lower the discount offered as the match factor increased (as seen in the diagonal spread of the green disks).  Finally we see the red diamonds are also distributed differently than the green disks. This is both good and bad.  The good is that the center of mass of the red diamonds differs from the center of mass of the green disks.  The bad is that the density of red diamonds does not fall any faster as it passes into the green disks than it falls in any other direction.  This indicates there is something important and different (and not measured in our two variables) about at least some of the bad prospects.  It is the data miner&#8217;s job to be aware of this and to press on.</p>
<div class="p"><!----></div>
<h3><a name="tth_sEc2.1"><br />
2.1</a>&nbsp;&nbsp;The Trendy Now</h3>
<div class="p"><!----></div>
<p> In truth data miners often rush where classical statisticians fear to tread.  Right now the temptation is to immediately select from any number of &#8220;red hot&#8221; techniques, methods or software packages.  My short list of super-star method buzzwords includes:</p>
<div class="p"><!----></div>
<ul>
<li> Boosting[<a href="#Schapire:2001p1019" name="CITESchapire:2001p1019">Sch01</a>,<a href="#Breiman:2000p1134" name="CITEBreiman:2000p1134">Bre00</a>,<a href="#Freund:2003p1009" name="CITEFreund:2003p1009">FISS03</a>]
<div class="p"><!----></div>
</li>
<li> Latent Dirichlet Allocation[<a href="#Blei:2003p1063" name="CITEBlei:2003p1063">BNJ03</a>]
<div class="p"><!----></div>
</li>
<li> Linear Regression[<a href="#statistics" name="CITEstatistics">FPP07</a>,<a href="#Agresti" name="CITEAgresti">Agr02</a>]
<div class="p"><!----></div>
</li>
<li> Linear Discriminant Analysis[<a href="#Fisher:1936p2576" name="CITEFisher:1936p2576">Fis36</a>]
<div class="p"><!----></div>
</li>
<li> Logistic Regression[<a href="#Agresti" name="CITEAgresti">Agr02</a>,<a href="#Klein:2003p261" name="CITEKlein:2003p261">KM03</a>]
<div class="p"><!----></div>
</li>
<li> Kernel Methods[<a href="#kernel1" name="CITEkernel1">CST00</a>,<a href="#kernel2" name="CITEkernel2">STC04</a>]
<div class="p"><!----></div>
</li>
<li> Maximum Entropy[<a href="#Klein:2003p261" name="CITEKlein:2003p261">KM03</a>,<a href="#Grunwald:2005p108" name="CITEGrunwald:2005p108">Gru05</a>,<a href="#Stern:1989p1480" name="CITEStern:1989p1480">SC89</a>,<a href="#Dudik:2006p954" name="CITEDudik:2006p954">DS06</a>]
<div class="p"><!----></div>
</li>
<li> Naive Bayes[<a href="#Lewis:1998p105" name="CITELewis:1998p105">Lew98</a>]
<div class="p"><!----></div>
</li>
<li> Perceptrons[<a href="#Beigel:2008p1027" name="CITEBeigel:2008p1027">BRS08</a>,<a href="#Dasgupta:2005p2013" name="CITEDasgupta:2005p2013">DKM05</a>]
<div class="p"><!----></div>
</li>
<li> Quantile Regression[<a href="#quantile" name="CITEquantile">Koe05</a>]
<div class="p"><!----></div>
</li>
<li> Ridge Regression[<a href="#Breiman:1997p1133" name="CITEBreiman:1997p1133">BF97</a>]
<div class="p"><!----></div>
</li>
<li> Support Vector Machines[<a href="#kernel1" name="CITEkernel1">CST00</a>]
<div class="p"><!----></div>
</li>
</ul>
<div class="p"><!----></div>
<p> Based on some of the above referenced writing and analysis I would first pick &#8220;logistic regression&#8221; as I am confident that, when used properly, it is just about as powerful as any of the modern data mining techniques (despite its somewhat less than trendy status).  Using logistic regression I immediately get just about as close to a separating line as this data set will support: Figure&nbsp;<a href="#fig:LinearSepartor">3</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg3"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lin1.png" alt="lin1.png" /></p>
<p></center><center>Figure 3: Linear Separator</center><br />
<a name="fig:LinearSepartor"><br />
</a></p>
<div class="p"><!----></div>
<p> The separating line actually encodes a simple rule of the form: &#8220;if 2.2*DiscountFactor + 3.1*MatchFactor &#8805; 1 then we have a good chance of a sale.&#8221;  This is classic black-box data mining magic.  The purpose of this writeup is to look deeper into how to actually derive and understand something like this.</p>
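<p> Written as code, the quoted rule is a single weighted sum and a threshold.  The coefficients below are the ones quoted in the rule; the function name and the example inputs are hypothetical, and the sketch only applies the rule without showing how it was fitted.</p>

```python
def likely_sale(discount_factor, match_factor):
    """Apply the linear rule quoted in the text: a weighted sum compared with 1."""
    score = 2.2 * discount_factor + 3.1 * match_factor
    return score >= 1.0

# Hypothetical prospects; the input values are made up for illustration.
print(likely_sale(discount_factor=0.3, match_factor=0.2))
print(likely_sale(discount_factor=0.1, match_factor=0.1))
```

<p> Every prospect on one side of the line in Figure&nbsp;<a href="#fig:LinearSepartor">3</a> gets the same answer, however close to or far from the line it sits.</p>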
<div class="p"><!----></div>
<h2><a name="tth_sEc3"><br />
3</a>&nbsp;&nbsp;Explanation</h2>
<div class="p"><!----></div>
<p> What is really going on?  Why is our magic formula at all sensible advice, why did this work at all and what motivates the analysis?  It turns out regression (be it linear regression or logistic regression) works in this case because it somewhat imitates the methodology of linear discriminant analysis (described in: [<a href="#Fisher:1936p2576" name="CITEFisher:1936p2576">Fis36</a>]).  In fact in many cases it would be a better idea to perform a linear discriminant analysis or perform an analysis of variance than to immediately appeal to a complicated method.  I will first step through the process of linear discriminant analysis and then relate it to our logistic regression.  Stepping through understandable stages lets us see where we were lucky in modeling and what limits and opportunities for improvement we have.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg4"><br />
</a><br />
<center></p>
<table>
<tr>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/posDat.png" alt="posDat.png" /></td>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/negDat.png" alt="negDat.png" />
</td>
</tr>
</table>
<p></center><center>Figure 4: Separate Plots</center><br />
<a name="fig:SeparatePlots"><br />
</a></p>
<div class="p"><!----></div>
<p> Our data initially looks very messy (the good and bad groups are fairly mixed together).  But if we examine our data in separate groups we can see we are actually incredibly lucky in that the data is easy to describe.  As we can see in Figure&nbsp;<a href="#fig:SeparatePlots">4</a>: the data, when separated by outcome (plotting only all of the good green disks or only all of the bad red diamonds), is grouped in simple blobs without bends, intrusions or other odd (and more work to model) configurations.</p>
<div class="p"><!----></div>
<p> We can plot the idealizations of these data distributions (or densities) as &#8220;contour maps&#8221; (as if we are looking down on the elevations of a mountain on a map) which gives us Figure&nbsp;<a href="#fig:SeparateDistributions">5</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg5"><br />
</a><br />
<center></p>
<table>
<tr>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/posDist.png" alt="posDist.png" /></td>
<td> <img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/negDist.png" alt="negDist.png" />
</td>
</tr>
</table>
<p></center><center>Figure 5: Separate Distributions</center><br />
<a name="fig:SeparateDistributions"><br />
</a></p>
<div class="p"><!----></div>
<h3><a name="tth_sEc3.1"><br />
3.1</a>&nbsp;&nbsp;Full Bayes Model</h3>
<div class="p"><!----></div>
<p> From Figure&nbsp;<a href="#fig:SeparateDistributions">5</a> we can see that while our data is not separable there are significant differences between the groups.  The difference in the groups is more obvious if we plot the difference of the densities on the same graph as in Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a>.  Here we are visualizing the distribution of positive examples as a connected pair of peaks (colored green) and the distribution of negative examples as a deep valley (colored red) located just below and to the left of the peaks.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg6"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/diff1.png" alt="diff1.png" /></p>
<p></center><center>Figure 6: Difference in Density</center><br />
<a name="fig:DifferenceInDensity"><br />
</a></p>
<div class="p"><!----></div>
<p> This difference graph demonstrates how the two densities or distributions (positive and negative) reach into different regions of the plane.  The white areas are where the difference in densities is very small, which includes the areas in the corners (where there is little of either distribution) and the area between the blobs (where there is a lot of mass from both distributions competing).  This view is a bit closer to what a statistician wants to see: how the distributions of successes and failures differ (this is a step to take before even guessing at or looking for causes and explanations).</p>
<div class="p"><!----></div>
<p> Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a> is already an actionable model: we can predict the odds a new prospect will buy or not at a given discount by looking where they fall on Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a> and checking if they fall in a region of strong red or strong green color.  We can also recommend a discount for a given potential customer by drawing a line at the height determined by their degree of match and tracing from left to right until we first hit a strong green region.  We could hand out a simplified Figure&nbsp;<a href="#fig:FullBayesModel">7</a> as a sales rulebook.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg7"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/bayesModel1.png" alt="bayesModel1.png" /></p>
<p></center><center>Figure 7: Full Bayes Model</center><br />
<a name="fig:FullBayesModel"><br />
</a></p>
<div class="p"><!----></div>
<p> This model is a full Bayes model (but not a Naive Bayes model, which is oddly more famous and which we will cover later).  The steps we took were: first we summarized or idealized our known data into two Gaussian blobs (as depicted in Figure&nbsp;<a href="#fig:SeparateDistributions">5</a>).  Once we had estimated the centers, widths and orientations of these blobs we could then, for any new point, say how likely the point is under the modeled distribution of sales and how likely the point is under the modeled distribution of non-sales.  Mathematically we claim we can estimate P(x,y &#124;sale)<a href="#tthFtNtAAC" name="tthFrefAAC"><sup>2</sup></a> and P(x,y &#124; non-sale) (where x is our discount factor and y is our matching factor).<a href="#tthFtNtAAD" name="tthFrefAAD"><sup>3</sup></a> Neither of these is what we are actually interested in (we want: P(sale &#124; x,y)<a href="#tthFtNtAAE" name="tthFrefAAE"><sup>4</sup></a>).  We can, however, use these values to calculate what we want to know.  Bayes&#8217; law is a law of probability that says if we know P(x,y &#124; sale), P(x,y &#124; non-sale), P(sale) and P(non-sale)<a href="#tthFtNtAAF" name="tthFrefAAF"><sup>5</sup></a> then:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn1.png"/><br />
</center></p>
<p>Figure&nbsp;<a href="#fig:FullBayesModel">7</a> depicts a central hourglass shaped region (colored green) that represents the region of x, y values where P(sale &#124;x,y) is estimated to be at least 0.5; the remaining darker red region represents the situations predicted to be less favorable.  Here we are using priors of P(sale) = P(non-sale) = 0.5.  For different priors and thresholds we would get different graphs.</p>
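<p> The full Bayes calculation can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the means and covariances below are made-up placeholders (the fitted values behind the figures are not given in the text), and the function names are hypothetical.  The sketch evaluates each class's two-dimensional Gaussian density at a point and combines the two densities with the priors using Bayes' law.</p>

```python
import math


def gaussian_density(point, mean, cov):
    """Density of a 2-D Gaussian; cov is ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form d^T cov^{-1} d via the closed-form 2x2 inverse.
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def posterior_sale(point, sale_model, non_sale_model, prior_sale=0.5):
    """P(sale | x,y) from P(x,y | sale), P(x,y | non-sale) and the priors."""
    p_xy_sale = gaussian_density(point, *sale_model)
    p_xy_non_sale = gaussian_density(point, *non_sale_model)
    numerator = p_xy_sale * prior_sale
    return numerator / (numerator + p_xy_non_sale * (1.0 - prior_sale))


# Hypothetical (mean, covariance) pairs, chosen only for illustration.
SALE_MODEL = ((0.6, 0.6), ((0.04, 0.0), (0.0, 0.04)))
NON_SALE_MODEL = ((0.3, 0.3), ((0.04, 0.0), (0.0, 0.04)))
```

<p> Coloring every point of the plane by whether <tt>posterior_sale</tt> is at least 0.5 would give a decision chart in the spirit of Figure&nbsp;<a href="#fig:FullBayesModel">7</a>; with the equal placeholder covariances used here the boundary would be a straight line rather than an hourglass.</p>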
<div class="p"><!----></div>
<p> Even at this early stage in the analysis we have already accidentally introduced what we call &#8220;an inductive bias.&#8221;  By modeling both distributions as Gaussians we have guaranteed that our acceptance region will be an hourglass figure (as we saw in Figure&nbsp;<a href="#fig:FullBayesModel">7</a>).  One undesirable consequence is the prediction that sales become unlikely when both match factor and discount factor are very large.  This is partly a consequence of our modeling technique (though the fact that the negative data does not fall off quickly as it passes into the green region also contributes).  This unrealistic (or &#8220;not physically plausible&#8221;) prediction is called an artifact (of the technique and of the data) and it is the statistician&#8217;s job to see this, confirm they don&#8217;t want it and eliminate it (by deliberately introducing a &#8220;useful modeling bias&#8221;).</p>
<div class="p"><!----></div>
<h3><a name="tth_sEc3.2"><br />
3.2</a>&nbsp;&nbsp;Linear Discriminant</h3>
<div class="p"><!----></div>
<p> To get around the bad predictions of our model in the upper-right quadrant we &#8220;apply domain knowledge&#8221; and introduce a useful modeling bias as follows.  Let us insist that our model be monotone: that if moving in some direction is good then moving further in the same direction is better.  In fact let&#8217;s insist that our model be a half-plane (instead of two parabolas).  We want a nice straight separating cut, which brings us to linear discriminant analysis.  We have enough information to apply Fisher&#8217;s linear discriminant technique and find a separator that maximizes the variance of data across categories while minimizing the variance of data within one category and within the other category.  This is called the linear discriminant and it is shown in Figure&nbsp;<a href="#fig:LinearDiscriminant">8</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg8"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lda1.png" alt="lda1.png" /></p>
<p></center><center>Figure 8: Linear Discriminant</center><br />
<a name="fig:LinearDiscriminant"><br />
</a></p>
<div class="p"><!----></div>
<p> The blue line is the linear discriminant (similar to the logistic regression line depicted earlier on the data-slide).  Everything above or to the right of the blue line is considered good and everything below or to the left of the blue line is considered bad.  Notice that this advice, while not quite as accurate as the Bayes Model near the boundary between the two distributions, is much more sensible about the upper right corner of the graph.</p>
<div class="p"><!----></div>
<p> To evaluate a separator we collapse all variation parallel to the separating cut (as shown in Figure&nbsp;<a href="#fig:collapse">9</a>).  We then see that each distribution becomes a small interval or streak.  A separator is good if these resulting streaks are both short (the collapse packs the blobs) and the two centers of the streaks are far apart (and on opposite sides of the separator).  In Figure&nbsp;<a href="#fig:collapse">9</a> the streaks are fairly short and despite some overlap we do have some usable separation between the two centers.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg9"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/collapse2.png" alt="collapse2.png" /></p>
<p></center><center>Figure 9: Evaluating Quality of Separating Cut</center><br />
<a name="fig:collapse"><br />
</a></p>
<div class="p"><!----></div>
<p> To make the above precise we switch to mathematical notation.  For the i-th positive training example form the vector v<sub>+,i</sub> and the matrix S<sub>+,i</sub> where</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn2.png"/><br />
</center></p>
<div class="p"><!----></div>
<p> where x<sub>i</sub> and y<sub>i</sub> are the known x and y coordinates for this particular past experience.  Define v<sub>&#8722;,i</sub>, S<sub>&#8722;,i</sub> similarly for all negative examples.  In this notation we have for a direction &#947;: the distance along the &#947; direction between the center of positive examples and center of negative examples is: &#947;<sup>T</sup> ( &#8721;<sub>i</sub> v<sub>+,i</sub> / n<sub>+</sub> &#8722; &#8721;<sub>i</sub> v<sub>&#8722;,i</sub> / n<sub>&#8722;</sub>) (where n<sub>+</sub> is the number of positive examples and n<sub>&#8722;</sub> is the number of negative examples).  We would like this quantity to be large.  The degree of spread or variance of the positive examples along the &#947; direction is &#947;<sup>T</sup> (&#8721;<sub>i</sub> S<sub>+,i</sub> / n<sub>+</sub>) &#947;.  The degree of spread or variance of the negative examples along the &#947; direction is &#947;<sup>T</sup> (&#8721;<sub>i</sub> S<sub>&#8722;,i</sub> / n<sub>&#8722;</sub>) &#947;.  We would like the last two quantities to be small.  The linear discriminant is picked to maximize:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn3.png"/><br />
</center></p>
<p>It is a fairly standard observation (involving the Rayleigh quotient) that this form is maximized when:<br />
<center><br />
<a name="eq:lda"><br />
</a><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn4.png"/><br />
</center></p>
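<p> The computation described above can be sketched numerically.  The sketch uses the usual textbook form of Fisher's discriminant, taking the direction proportional to the inverse of the summed per-class covariances times the difference of class means, and scoring a direction by the squared gap between projected centers divided by the summed projected spreads.  That form is an assumption here: it may differ in normalization from the equations shown as images.  The toy points and function names are likewise hypothetical.</p>

```python
def class_summary(pos, neg):
    """Return (dx, dy, a, b, c): mean difference and summed covariance entries."""
    def mean(points):
        n = len(points)
        return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

    def covariance(points, center):
        n = len(points)
        sxx = sum((x - center[0]) ** 2 for x, _ in points) / n
        sxy = sum((x - center[0]) * (y - center[1]) for x, y in points) / n
        syy = sum((y - center[1]) ** 2 for _, y in points) / n
        return sxx, sxy, syy

    mean_pos, mean_neg = mean(pos), mean(neg)
    pxx, pxy, pyy = covariance(pos, mean_pos)
    nxx, nxy, nyy = covariance(neg, mean_neg)
    dx, dy = mean_pos[0] - mean_neg[0], mean_pos[1] - mean_neg[1]
    return dx, dy, pxx + nxx, pxy + nxy, pyy + nyy


def fisher_direction(pos, neg):
    """Direction (summed covariance)^{-1} (mean_pos - mean_neg), 2x2 case."""
    dx, dy, a, b, c = class_summary(pos, neg)
    det = a * c - b * b
    return ((c * dx - b * dy) / det, (a * dy - b * dx) / det)


def fisher_criterion(direction, pos, neg):
    """Squared gap between projected centers over summed projected spread."""
    gx, gy = direction
    dx, dy, a, b, c = class_summary(pos, neg)
    gap = gx * dx + gy * dy
    spread = gx * gx * a + 2.0 * gx * gy * b + gy * gy * c
    return gap * gap / spread


# Toy data, made up for illustration only.
positives = [(0.6, 0.7), (0.8, 0.5), (0.7, 0.9), (0.9, 0.8)]
negatives = [(0.2, 0.3), (0.4, 0.1), (0.3, 0.5), (0.1, 0.2)]
gamma = fisher_direction(positives, negatives)
```

<p> Evaluating <tt>fisher_criterion</tt> at <tt>gamma</tt> and at a few other directions is a quick way to check that the closed-form direction is at least as good as the alternatives on this toy data.</p>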
<div class="p"><!----></div>
<p> As we have said, the linear discriminant is very similar to what is returned by a regression or logistic regression.  In fact in our diagrams the regression lines are almost identical to the linear discriminant.  A large part of why regression can be usefully applied in classification comes from its close relationship to the linear discriminant.</p>
<div class="p"><!----></div>
<h3><a name="tth_sEc3.3"><br />
3.3</a>&nbsp;&nbsp;Linear Regression</h3>
<div class="p"><!----></div>
<p> Linear regression is designed to model continuous functions subject to independent normal errors in observation.  Linear regression is incredibly powerful at characterizing and eliminating correlations between the input variables of a model.  While function fitting is different from classification (our example problem), linear regression is so useful whenever there is any suspected correlation (which is almost always the case) that it is an appropriate tool.  In our example, among the positive examples (those that led to sales), there is clearly a historical dependence between the degree of estimated match and amount of discount offered.  Likely this dependence is from past prospects being subject to a (rational) policy of &#8220;the worse the match the higher the offered discount&#8221; (instead of being arranged in a perfect grid-like experiment as in our first diagram: Figure&nbsp;<a href="#fig:IdealFitting">1</a>).  If this dependence is not dealt with we would under-estimate the value of discount because we would think that discounted customers are not signing up at a higher rate (when these prospects are in fact clearly motivated by discount, once you control for the fact that many of the deeply discounted prospects had a much worse degree of match than average).</p>
<div class="p"><!----></div>
<p> For analysis of categorical data linear regression is closely linked to ANOVA (analysis of variance).[<a href="#Agresti" name="CITEAgresti">Agr02</a>] Recall that variance was a major consideration with the linear discriminant analysis, so we should by now be on familiar ground.</p>
<div class="p"><!----></div>
<p>In our notation the standard least-squares regression solution is:<br />
<center><br />
<a name="eq:leastsquares"><br />
</a><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn5.png"/><br />
</center></p>
<p>where y<sub>+,i</sub> = 1 for all i and y<sub>&#8722;,i</sub> = &#8722;1 for all i.</p>
<div class="p"><!----></div>
<p> If we have the same number of positive and negative examples (i.e.  n<sub>+</sub> = n<sub>&#8722;</sub>) then Equation&nbsp;<a href="#eq:lda">1</a> and Equation&nbsp;<a href="#eq:leastsquares">2</a> are identical and we have &#946; = &#947;.  So in this special case the linear discriminant equals the least squares linear regression solution.  We can even ask how the solutions change if the relative proportions of positive and negative training data change.  The linear discriminant is carefully designed not to move, but the regression solution will tilt to an angle that is more compatible with the larger of the example classes and shift to cut less into that class.  The linear regression solution can be fixed (by re-weighting the data) to also be insensitive to the relative proportions of positive and negative examples but does not behave that way &#8220;fresh out of the box.&#8221;</p>
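<p> The balanced case can be checked numerically with a small sketch.  It fits a least-squares regression of the labels (+1 for positive, &#8722;1 for negative) on x and y with an intercept, computes a textbook Fisher direction from the same points, and compares the two directions.  The intercept, the centered form of the normal equations, the toy data and the function names are all assumptions for illustration; the exact normalization may differ from the equations shown as images.</p>

```python
def least_squares_slope(points, labels):
    """Slope (bx, by) of the least-squares fit label ~ b0 + bx*x + by*y."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    mean_l = sum(labels) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxl = sum((x - mean_x) * (l - mean_l) for (x, _), l in zip(points, labels))
    syl = sum((y - mean_y) * (l - mean_l) for (_, y), l in zip(points, labels))
    det = sxx * syy - sxy * sxy
    # Solve the 2x2 centered normal equations in closed form.
    return ((syy * sxl - sxy * syl) / det, (sxx * syl - sxy * sxl) / det)


def fisher_direction(pos, neg):
    """Textbook Fisher direction: (cov_pos + cov_neg)^{-1} (mean_pos - mean_neg)."""
    def mean(points):
        n = len(points)
        return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

    def covariance(points, center):
        n = len(points)
        sxx = sum((x - center[0]) ** 2 for x, _ in points) / n
        sxy = sum((x - center[0]) * (y - center[1]) for x, y in points) / n
        syy = sum((y - center[1]) ** 2 for _, y in points) / n
        return sxx, sxy, syy

    mean_pos, mean_neg = mean(pos), mean(neg)
    pxx, pxy, pyy = covariance(pos, mean_pos)
    nxx, nxy, nyy = covariance(neg, mean_neg)
    a, b, c = pxx + nxx, pxy + nxy, pyy + nyy
    dx, dy = mean_pos[0] - mean_neg[0], mean_pos[1] - mean_neg[1]
    det = a * c - b * b
    return ((c * dx - b * dy) / det, (a * dy - b * dx) / det)


# Balanced toy data (four positives, four negatives), made up for illustration.
positives = [(0.6, 0.7), (0.8, 0.5), (0.7, 0.9), (0.9, 0.8)]
negatives = [(0.2, 0.3), (0.4, 0.1), (0.3, 0.5), (0.1, 0.2)]
labels = [1] * len(positives) + [-1] * len(negatives)
beta = least_squares_slope(positives + negatives, labels)
gamma = fisher_direction(positives, negatives)
# A zero cross product means the two directions are parallel.
cross = beta[0] * gamma[1] - beta[1] * gamma[0]
```

<p> On balanced data <tt>cross</tt> should be zero up to rounding, so the two methods agree on the direction of the separator (they may still differ in overall scale).  Dropping some of the negative points and re-running is a simple way to watch the regression direction drift away from the discriminant.</p>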
<div class="p"><!----></div>
<h3><a name="tth_sEc3.4"><br />
3.4</a>&nbsp;&nbsp;Logistic Regression</h3>
<div class="p"><!----></div>
<p> While linear regression is designed to pick a function that minimizes the sum of squared errors, logistic regression is designed to pick a separator that maximizes something called <em>the plausibility of the data</em> (what statisticians call the likelihood).  In our case since the data is so well behaved the logistic regression line is essentially the same as the linear regression line.  It is in fact an important property of logistic regression that there is always a re-weighting (or choice of re-emphasis) of the data that causes some linear regression to pick the same separator as the logistic regression.  Because linear and logistic regression are only identical in specific circumstances it is the job of the statistician to know which of the two is more appropriate for a given data set and given intended use of the resulting model.</p>
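<p> To make &#8220;maximizing the plausibility of the data&#8221; concrete, here is a minimal sketch that fits a logistic regression by plain gradient ascent on the log-likelihood.  Gradient ascent is chosen only for brevity and is not claimed to be the optimizer used for the figures; the step size, iteration count, toy data and function names are all assumptions.</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(weights, points, labels):
    """Log of the probability the model assigns to the observed labels (1 or 0)."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        p = sigmoid(weights[0] + weights[1] * x + weights[2] * y)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total


def fit_logistic(points, labels, step=0.5, iterations=2000):
    """Gradient ascent on the average log-likelihood; weights are (b0, bx, by)."""
    weights = [0.0, 0.0, 0.0]
    n = len(points)
    for _ in range(iterations):
        grad = [0.0, 0.0, 0.0]
        for (x, y), label in zip(points, labels):
            p = sigmoid(weights[0] + weights[1] * x + weights[2] * y)
            error = label - p
            grad[0] += error
            grad[1] += error * x
            grad[2] += error * y
        weights = [w + step * g / n for w, g in zip(weights, grad)]
    return weights


# Overlapping toy data (not linearly separable), made up for illustration.
points = [(0.6, 0.7), (0.8, 0.5), (0.7, 0.9), (0.9, 0.8), (0.3, 0.4),
          (0.2, 0.3), (0.4, 0.1), (0.3, 0.5), (0.1, 0.2), (0.7, 0.6)]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
fitted = fit_logistic(points, labels)
```

<p> The separating line is the set of points where <tt>fitted[0] + fitted[1]*x + fitted[2]*y</tt> equals zero, which is where the modeled probability of a sale is exactly 0.5.</p>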
<div class="p"><!----></div>
<h2><a name="tth_sEc4"><br />
4</a>&nbsp;&nbsp;Other Methods and Techniques</h2>
<div class="p"><!----></div>
<h3><a name="tth_sEc4.1"><br />
4.1</a>&nbsp;&nbsp;Kernelized Regression</h3>
<div class="p"><!----></div>
<p> One way to greatly expand the power of modeling methods is a trick called kernel methods.  Roughly, kernel methods are those methods that increase the power of machine learning by moving from a simple problem space (like ours in variables x and y) to a richer problem space that may be easier to work in.  A lot of ink is spilled about how efficient the kernel methods are (they work in time proportional to the size of the simple space, not the complex one) but this is not their essential feature.  The essential feature is the expanded explanatory power and this is so important that even the trivial kernel methods (such as directly adjoining additional combinations of variables) pick up most of the power of the method.  Kernel methods are also overly associated with Support Vector Machines, but are just as useful when added to Naive Bayes, linear regression or logistic regression.</p>
<div class="p"><!----></div>
<p> For instance: Figure&nbsp;<a href="#fig:KernelizedRegression">10</a> shows a bow-tie like acceptance region found by using linear regression over the variables x, y, x<sup>2</sup>, y<sup>2</sup> and x y (instead of just x and y).  Note how this result is similar to the full Bayes model (but comes from a different feature set and fitting technique).</p>
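<p> The &#8220;trivial kernel&#8221; used here amounts to a feature expansion, sketched below.  The feature list x, y, x<sup>2</sup>, y<sup>2</sup>, x y is from the text; the function names and the example weights are hypothetical.  Any linear method can then be run on the expanded features without other changes.</p>

```python
def quadratic_features(x, y):
    """Expand (x, y) into the features x, y, x^2, y^2 and x*y."""
    return [x, y, x * x, y * y, x * y]


def quadratic_score(weights, intercept, x, y):
    """Linear score over the expanded features (curved in the original x, y)."""
    features = quadratic_features(x, y)
    return intercept + sum(w * f for w, f in zip(weights, features))


# Hypothetical weights: the score x^2 - y^2 is linear in the expanded
# features but its zero set is a pair of crossing lines in the (x, y) plane.
EXAMPLE_WEIGHTS = [0.0, 0.0, 1.0, -1.0, 0.0]
```

<p> With these example weights the region where the score is non-negative is a bow-tie centered on the origin, which shows how a straight cut in the richer space can become a curved acceptance region in the original variables.</p>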
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg10"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/kRegression.png" alt="kRegression.png" /></p>
<p></center><center>Figure 10: Kernelized Regression</center><br />
<a name="fig:KernelizedRegression"><br />
</a></p>
<div class="p"><!----></div>
<h3><a name="tth_sEc4.2"><br />
4.2</a>&nbsp;&nbsp;Naive Bayes Model</h3>
<div class="p"><!----></div>
<p> We briefly return to the Bayes model to discuss a more common alternative called &#8220;Naive Bayes.&#8221;  A Naive Bayes model is like a full Bayes model except an additional modeling simplification is introduced in assuming that P(x,y&#124;sale) = P(x&#124;sale)P(y&#124;sale) and P(x,y&#124;non-sale) = P(x&#124;non-sale)P(y&#124;non-sale).  That is we are assuming that the distributions of the x and y measurements are essentially independent (once we know which outcome happened).  This assumption is the opposite of what we do with regression in that we ignore dependencies in the data (instead of modeling and eliminating the dependencies).  However, Naive Bayes methods are quite powerful and very appropriate in sparse-data situations (such as text classification).  The &#8220;naive&#8221; assumption that the input variables are independent greatly reduces the amount of data that needs to be tracked (it is much less work to track values of variables instead of simultaneous values of pairs of variables).  The curved separator from this Naive Bayes model is illustrated in Figure&nbsp;<a href="#fig:NaiveBayesModel">11</a>.</p>
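<p> The Naive Bayes simplification can be sketched by replacing each class's joint density with a product of one-dimensional densities.  Modeling each coordinate as a normal distribution is an assumption made for this illustration, as are the placeholder means and variances and the function names.</p>

```python
import math


def normal_density(value, mean, variance):
    """One-dimensional normal density."""
    scale = math.sqrt(2.0 * math.pi * variance)
    return math.exp(-0.5 * (value - mean) ** 2 / variance) / scale


def naive_class_density(point, means, variances):
    """P(x,y | class) approximated as P(x | class) * P(y | class)."""
    return (normal_density(point[0], means[0], variances[0])
            * normal_density(point[1], means[1], variances[1]))


def naive_posterior_sale(point, sale_params, non_sale_params, prior_sale=0.5):
    """P(sale | x,y) by Bayes' law using the naive per-class densities."""
    p_sale = naive_class_density(point, *sale_params) * prior_sale
    p_non = naive_class_density(point, *non_sale_params) * (1.0 - prior_sale)
    return p_sale / (p_sale + p_non)


# Hypothetical per-class (means, variances), chosen only for illustration.
SALE_PARAMS = ((0.6, 0.6), (0.04, 0.02))
NON_SALE_PARAMS = ((0.3, 0.3), (0.03, 0.05))
```

<p> Note that each class needs only two means and two variances here, with no covariance term, which is the reduction in bookkeeping the paragraph above describes.</p>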
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg11"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/naiveBayesModel1.png" alt="naiveBayesModel1.png" /></p>
<p></center><center>Figure 11: Naive Bayes Model</center><br />
<a name="fig:NaiveBayesModel"><br />
</a></p>
<div class="p"><!----></div>
<p> The Naive Bayes version of the advice or policy chart is always going to be an axis-aligned parabola as in Figure&nbsp;<a href="#fig:NaiveBayesDecision">12</a>.  Notice how both the linear discriminant and the Naive Bayes model make mistakes (place some colors on the wrong side of the curve), but they are simple, reliable models that have the desirable property of having connected prediction regions.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg12"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/naiveBayesModel2.png" alt="naiveBayesModel2.png" /></p>
<p></center><center>Figure 12: Naive Bayes Decision</center><br />
<a name="fig:NaiveBayesDecision"><br />
</a></p>
<div class="p"><!----></div>
<h3><a name="tth_sEc4.3"><br />
4.3</a>&nbsp;&nbsp;More Exotic Methods</h3>
<div class="p"><!----></div>
<p> Many of the hot buzzword machine learning and data mining methods we listed earlier are essentially different techniques of fitting a linear separator over data.  These methods seem very different but they all form a family once you realize many of the details of the methods are determined by:</p>
<div class="p"><!----></div>
<ul>
<li> Choice of Loss Function
<div class="p"><!----></div>
<p> This is what notion of &#8220;goodness of fit&#8221; is being used.  It can be normalized mean-variance (linear discriminants), un-normalized variance (linear regression), plausibility (logistic regression), L1 distance (support vector machines, quantile regression), entropy (maximum entropy), probability mass and so on.</p>
<div class="p"><!----></div>
</li>
<li> Choice of Optimization Technique
<div class="p"><!----></div>
<p> For a given loss function we can optimize in many ways (though most authors make the mistake of binding their current favorite optimization method deep into their specification of technique): EM, steepest descent, conjugate gradient, quasi-Newton, linear programming and quadratic programming to name a few.</p>
<div class="p"><!----></div>
</li>
<li> Choice of Regularization Method
<div class="p"><!----></div>
<p> Regularization is the idea of forcing the model to not pick extreme values of parameters to over-fit irrelevant artifacts in training data.  Methods include MDL, controlling energy/entropy, Laplace smoothing, shrinkage, bagging and early termination of optimization.  Non-explicit treatment of regularization is one reason many methods completely specify their optimization procedure (to get some accidental regularization).</p>
<div class="p"><!----></div>
</li>
<li> Choice of Features/Kernelization
<div class="p"><!----></div>
<p> The richness of the feature set the method is applied to is the single largest determinant of model quality.</p>
<div class="p"><!----></div>
</li>
<li> Pre-transformation Tricks
<div class="p"><!----></div>
<p> Some statistical methods are improved by pre-transforming the outcome data to look more normal or be more homoscedastic.<a href="#tthFtNtAAG" name="tthFrefAAG"><sup>6</sup></a></p>
<div class="p"><!----></div>
</li>
</ul>
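<p> As a small illustration of the first axis in the list above, here are three common loss functions written as functions of the &#8220;margin&#8221; (the label, coded as +1 or &#8722;1, times the model's score).  This is a sketch under assumptions: the function names are hypothetical, and the hinge loss is used as the usual piecewise-linear loss associated with support vector machines.</p>

```python
import math


def squared_loss(margin):
    """Least-squares loss: (label - score)^2 equals (1 - margin)^2 for +/-1 labels."""
    return (1.0 - margin) ** 2


def logistic_loss(margin):
    """Negative log-likelihood of one example under logistic regression."""
    return math.log(1.0 + math.exp(-margin))


def hinge_loss(margin):
    """Piecewise-linear loss commonly used by support vector machines."""
    return max(0.0, 1.0 - margin)
```

<p> Comparing the three at a large positive margin shows one of the trade-offs: the squared loss penalizes a confidently correct prediction (a margin of 3 costs 4), while the hinge loss charges nothing and the logistic loss charges very little.</p>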
<div class="p"><!----></div>
<p> If you think along a few axes like these (instead of evaluating methods by their name and lineage) you tend to see different data mining methods more as embodying different trade-offs than as being unique incompatible disciplines.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<h2><a name="tth_sEc5"><br />
5</a>&nbsp;&nbsp;Conclusion</h2>
<div class="p"><!----></div>
<p> Our goal for this writeup was to fully demonstrate a data mining method and then survey some important data mining and machine learning techniques.  Many of the important considerations are &#8220;too obvious&#8221; to be discussed by statisticians and &#8220;too statistical&#8221; to be comfortably expressed in terms popular with data miners.  The theory and considerations from statistics when combined with the experience and optimism of data-mining/machine-learning truly make possible achieving the important goal of &#8220;learning from data.&#8221;</p>
<div class="p"><!----></div>
<p>This expository writeup is also meant to serve as an example of the<br />
types of research, analysis, software and training supplied by<br />
Win-Vector LLC <a href="http://www.win-vector.com"><tt>http://www.win-vector.com</tt></a> .  Win-Vector LLC<br />
prides itself on depth of research and specializes in identifying,<br />
documenting and implementing the &#8220;simplest technique that can<br />
possibly work&#8221; (which is often the most understandable, maintainable,<br />
robust and reliable).  Win-Vector LLC specializes in research but<br />
has significant experience in delivering full solutions (including<br />
software solutions and integration with existing databases).</p>
<div class="p"><!----></div>
<p><font size="-1"></p>
<h2>References</h2>
<dl compact="compact">
<dt><a href="#CITEAgresti" name="Agresti">[Agr02]</a></dt>
<dd>
Alan Agresti, <em>Categorical data analysis (Wiley series in probability and<br />
  statistics)</em>, Wiley-Interscience, July 2002.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBreiman:1997p1133" name="Breiman:1997p1133">[BF97]</a></dt>
<dd>
Leo Breiman and Jerome&nbsp;H Friedman, <em>Predicting multivariate responses in<br />
  multiple linear regression</em>, Journal of the Royal Statistical Society, Series<br />
  B (Methodological) <b>59</b> (1997), no.&nbsp;1, 3-54.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBlei:2003p1063" name="Blei:2003p1063">[BNJ03]</a></dt>
<dd>
David&nbsp;M Blei, Andrew&nbsp;Y Ng, and Michael&nbsp;I Jordan, <em>Latent Dirichlet<br />
  allocation</em>, Journal of Machine Learning Research <b>3</b> (2003),<br />
  993-1022.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBreiman:2000p1134" name="Breiman:2000p1134">[Bre00]</a></dt>
<dd>
Leo Breiman, <em>Special invited paper. additive logistic regression: A<br />
  statistical view of boosting: Discussion</em>, Ann. Statist. <b>28</b> (2000),<br />
  no.&nbsp;2, 374-377.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBeigel:2008p1027" name="Beigel:2008p1027">[BRS08]</a></dt>
<dd>
Richard Beigel, Nick Reingold, and Daniel&nbsp;A Spielman, <em>The perceptron<br />
  strikes back</em>, 6.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEkernel1" name="kernel1">[CST00]</a></dt>
<dd>
Nello Cristianini and John Shawe-Taylor, <em>An introduction to support<br />
  vector machines and other kernel-based learning methods</em>, 1 ed., Cambridge<br />
  University Press, March 2000.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEDasgupta:2005p2013" name="Dasgupta:2005p2013">[DKM05]</a></dt>
<dd>
Sanjoy Dasgupta, Adam&nbsp;Tauman Kalai, and Claire Monteleoni, <em>Analysis of<br />
  perceptron-based active learning</em>, CSAIL Tech. Report (2005), 16.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEDudik:2006p954" name="Dudik:2006p954">[DS06]</a></dt>
<dd>
Miroslav Dudik and Robert&nbsp;E Schapire, <em>Maximum entropy distribution<br />
  estimation with generalized regularization</em>, COLT (2006), 15.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEFisher:1936p2576" name="Fisher:1936p2576">[Fis36]</a></dt>
<dd>
Ronald&nbsp;A Fisher, <em>The use of multiple measurements in taxonomic problems</em>,<br />
  Annals of Eugenics <b>7</b> (1936), 179-188.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEFreund:2003p1009" name="Freund:2003p1009">[FISS03]</a></dt>
<dd>
Yoav Freund, Raj Iyer, Robert&nbsp;E Schapire, and Yoram Singer, <em>An efficient<br />
  boosting algorithm for combining preferences</em>, Journal of Machine Learning<br />
  Research <b>4</b> (2003), 933-969.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEstatistics" name="statistics">[FPP07]</a></dt>
<dd>
David Freedman, Robert Pisani, and Roger Purves, <em>Statistics 4th edition</em>,<br />
  W. W. Norton and Company, 2007.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEGrunwald:2005p108" name="Grunwald:2005p108">[Gru05]</a></dt>
<dd>
Peter&nbsp;D Grunwald, <em>Maximum entropy and the glasses you are looking<br />
  through</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEHalevy:2009p2327" name="Halevy:2009p2327">[HNP09]</a></dt>
<dd>
Alon Halevy, Peter Norvig, and Fernando Pereira, <em>The unreasonable<br />
  effectiveness of data</em>, IEEE Intelligent Systems (2009).</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEKlein:2003p261" name="Klein:2003p261">[KM03]</a></dt>
<dd>
Dan Klein and Christopher&nbsp;D Manning, <em>Maxent models, conditional<br />
  estimation, and optimization</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEquantile" name="quantile">[Koe05]</a></dt>
<dd>
Roger Koenker, <em>Quantile regression</em>, Cambridge University Press, May<br />
  2005.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITELewis:1998p105" name="Lewis:1998p105">[Lew98]</a></dt>
<dd>
David&nbsp;D Lewis, <em>Naive (Bayes) at forty: The independence assumption in<br />
  information retrieval</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITENYTStat" name="NYTStat">[Loh09]</a></dt>
<dd>
Steve Lohr, <em>For today’s graduate, just one word: Statistics</em>,<br />
  <a href="http://www.nytimes.com/2009/08/06/technology/06stats.html"><tt>http://www.nytimes.com/2009/08/06/technology/06stats.html</tt></a>, August 2009.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITER:Sarkar:2008" name="R:Sarkar:2008">[Sar08]</a></dt>
<dd>
Deepayan Sarkar, <em>Lattice: Multivariate data visualization with R</em>,<br />
  Springer, New York, 2008, ISBN 978-0-387-75968-5.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEStern:1989p1480" name="Stern:1989p1480">[SC89]</a></dt>
<dd>
Hal Stern and Thomas&nbsp;M Cover, <em>Maximum entropy and the lottery</em>, Journal<br />
  of the American Statistical Association <b>84</b> (1989), no.&nbsp;408,<br />
  980-985.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITESchapire:2001p1019" name="Schapire:2001p1019">[Sch01]</a></dt>
<dd>
Robert&nbsp;E Schapire, <em>The boosting approach to machine learning an<br />
  overview</em>, 23.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEkernel2" name="kernel2">[STC04]</a></dt>
<dd>
John Shawe-Taylor and Nello Cristianini, <em>Kernel methods for pattern<br />
  analysis</em>, Cambridge University Press, June 2004.</dd>
</dl>
<p></font></p>
<div class="p"><!----></div>
<p><center><b>APPENDIX</b><br />
</center></p>
<div class="p"><!----></div>
<h2><a name="tth_sEcA"><br />
A</a>&nbsp;&nbsp;Graphs</h2>
<div class="p"><!----></div>
<p>The majority of the graphs in this writeup were produced using &#8220;R&#8221;<br />
<a href="http://www.r-project.org/"><tt>http://www.r-project.org/</tt></a> and Deepayan Sarkar&#8217;s Lattice<br />
package[<a href="#R:Sarkar:2008" name="CITER:Sarkar:2008">Sar08</a>].</p>
<div class="p"><!----></div>
<hr />
<h3>Footnotes:</h3>
<div class="p"><!----></div>
<p><a name="tthFtNtAAB"></a><a href="#tthFrefAAB"><sup>1</sup></a><br />
<a href="mailto:jmount@win-vector.com"><tt>mailto:jmount@win-vector.com</tt></a><br />
<a href="http://www.win-vector.com/"><tt>http://www.win-vector.com/</tt></a><br />
<a href="http://www.win-vector.com/blog/"><tt>http://www.win-vector.com/blog/</tt></a></p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAC"></a><a href="#tthFrefAAC"><sup>2</sup></a>Read P(A &#124; B) as: &#8220;the probability that A will<br />
  happen given that we know B is true.&#8221;</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAD"></a><a href="#tthFrefAAD"><sup>3</sup></a>Technically we are working with densities, not<br />
  probabilities, but we will use probability notation for its<br />
  intuition.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAE"></a><a href="#tthFrefAAE"><sup>4</sup></a>P(sale &#124; x,y) is the probability of<br />
making a sale as a function of what we know about the prospective<br />
customer and our offer.  Whereas P(x,y&#124;sale) was just how likely it is<br />
to see a prospect with the given x and y values, conditioned on knowing we made<br />
a sale to this prospect.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAF"></a><a href="#tthFrefAAF"><sup>5</sup></a> P(sale) and<br />
  P(non-sale) are just the &#8220;prior probabilities&#8221; of a sale and a non-sale: our<br />
  estimates of our chances of success and failure before we look at any<br />
  facts about a particular customer.  We can use our historical<br />
  overall success and failure rates as estimates of these quantities.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAG"></a><a href="#tthFrefAAG"><sup>6</sup></a>A situation is homoscedastic if the variance of the errors does not depend on where we are in the parameter space (our x,y or match factor and discount factor).  This property is very important for meaningful fitting/modeling and for interpreting the significance of fits.</p>
<hr /><small>File translated from<br />
T<sub><font size="-1">E</font></sub>X<br />
by <a href="http://hutchinson.belmont.ma.us/tth/"><br />
T<sub><font size="-1">T</font></sub>H</a>,<br />
version 3.85.<br />On 29 Aug 2009, 11:43.</small></p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='Permanent Link: The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Permanent Link: Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Permanent Link: Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>On The Hysteria Over &#8220;The Cloud&#8221;</title>
		<link>http://www.win-vector.com/blog/2009/08/on-the-hysteria-over-the-cloud/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=on-the-hysteria-over-the-cloud</link>
		<comments>http://www.win-vector.com/blog/2009/08/on-the-hysteria-over-the-cloud/#comments</comments>
		<pubDate>Thu, 13 Aug 2009 23:12:53 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Data Centers]]></category>
		<category><![CDATA[Mainframes]]></category>
		<category><![CDATA[PC Revolution]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=237</guid>
		<description><![CDATA[On The Hysteria Over &#8220;The Cloud&#8221; The frenzy of anticipation and opinion about &#8220;The Cloud&#8221; is so intense and so pointless it becomes &#8220;parody proof.&#8221; It is as Jerry Holkins and Mike Krahulik wrote (regarding a different situation): It&#8217;s like trying to make fun of a clown. What, are you going to make fun of [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/02/postels-law-not-sure-who-to-be-angry-with/' rel='bookmark' title='Permanent Link: Postel&#8217;s Law: Not Sure Who To Be Angry With'>Postel&#8217;s Law: Not Sure Who To Be Angry With</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='Permanent Link: I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>On The Hysteria Over &#8220;The Cloud&#8221;<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/180px-Lenticular_Cloud_in_Wyoming_0034b.jpg" alt="180px-Lenticular_Cloud_in_Wyoming_0034b.jpg" border="0" width="180" height="120" /><br />
</center></p>
<p />
The frenzy of anticipation and opinion about &#8220;The Cloud&#8221; is so intense and so pointless it becomes &#8220;parody proof.&#8221;<br />
<span id="more-237"></span>It is as Jerry Holkins and Mike Krahulik wrote (regarding a different situation):</p>
<blockquote><p>
It&#8217;s like trying to make fun of a clown.  What, are you going to make fun of his tiny car?  His floppy shoes? It just doesn&#8217;t work.
</p></blockquote>
<p />
I would like to point out that (by computer science standards) the cloud is not new and has for some time been considered inevitable.</p>
<p />
But what is &#8220;The Cloud?&#8221; What the cloud is depends a bit on what conversation you are being drawn into.  If the conversation is about computing then the cloud is remote computers, software and services like Wikipedia, GMail, SalesForce.com, Google Docs, Amazon EC2/S3 and Google App Engine.  If the conversation is about human interaction then the cloud is ecosystems like Facebook, Twitter and RSS.  Each of these is a facet of an important longer term trend, but for individual companies and technologies the pendulum is about as fast on the down-swing as it was on the up-swing.  At this time we can safely declare a number of recent important players dead: Friendster, AltaVista, WSDL, Usenet, IRC and Web2.0.  </p>
<p />
<p>It is true that the network itself is more useful than the computer, but this idea is not new to our third millennium.  The people currently getting rich promoting this idea did not invent it; they grew up in its shadow.  The early big thinkers on computers had big plans, plans much larger than Tetris, payroll processing, COBOL and punched cards.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/Hollerith_punched_card.jpg" alt="Hollerith_punched_card.jpg" border="0" width="434" height="246" /><br />
</center></p>
<p />
<p>Take the article <cite>&#8220;As We May Think&#8221; (by Vannevar Bush, The Atlantic Monthly (1945))</cite>.  In it Vannevar Bush writes:
<p />
<blockquote><p>
Consider a future device for individual use, which is a sort of mechanized private file and library.  It needs a name, and, to coin one at random, &#8220;memex&#8221; will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.
</p></blockquote>
<p /> At first this sounds like nothing more than <cite>&#8220;Danny Dunn and the Homework Machine&#8221; (by Jay Williams (1964), Scholastic Press)</cite><br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/DannyDunnHomeworkMachine.jpg" alt="DannyDunnHomeworkMachine.jpg" border="0" width="240" height="240" />.<br />
</center></p>
<p /> But in his essay Vannevar Bush uses the phrase &#8220;it can presumably be operated from a distance&#8221; and ends with a long section on how many professions would benefit from a memex (we show only one here):
<p />
<blockquote><p>
Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified. The lawyer has at his touch the associated opinions and decisions of his whole experience, and of the experience of friends and authorities.
</p></blockquote>
<p /> Obviously we are reading this with a modern eye, but here we have the antecedents of hypertext and the Wikipedia.
<p />
<p>We can trace this thread further forward to <cite>&#8220;Augmenting Human Intellect: A Conceptual Framework&#8221; (by Douglas C Engelbart (1962))</cite> and the famous <a href="http://sloan.stanford.edu/MouseSite/1968Demo.html">1968 demo</a>.</p>
<p>And we can further trace the ideas passing through: <cite> &#8220;Literary Machines: The report on, and of, Project Xanadu concerning word processing, electronic publishing, hypertext, thinkertoys, tomorrow&#8217;s intellectual revolution, and certain other topics including knowledge, education and freedom&#8221; (by Ted Nelson (1981), Mindful Press, Sausalito, California.) </cite>  </p>
<p />
<p>These works were all about knowledge engineering, information storage, networking and communication.  There was an extreme urgency in these works.  Both Engelbart and Nelson felt we had a limited window to gain the ability to organize the world&#8217;s information before some catastrophic error or misunderstanding eliminated us all.  This feeling of urgency and doom came from another exciting application of real time networked computers: <a href="http://en.wikipedia.org/wiki/Semi_Automatic_Ground_Environment">SAGE</a>.  SAGE was the &#8220;Semi Automatic Ground Environment&#8221; first made operational in 1959.  It involved networked computers, light pen based operator terminals and was the system that the United States had ready to fight World War III.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/File-SAGE_control_room.png" alt="File-SAGE_control_room.png" border="0" width="180" height="232" /><br />
</center></p>
<p>This was the era of near infinite budgets, block sized computer complexes, massive mainframes and IT priesthoods that ran the whole show.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/180px-Sage_typical_building.jpg" alt="180px-Sage_typical_building.jpg" border="0" width="180" height="136" /><br />
</center></p>
<p>The inevitable march was on. Some large fraction of the GDP would be forever dedicated to building and maintaining monument-sized networked computing facilities.  Your degree of relevance and power in society would be directly determined by how close you could get to these facilities.  Then something happened and distracted everyone.  The distraction was so immediate and so complete that by the time the inevitable march restarted (block-sized Google data centers and a <a href="http://green.yahoo.com/blog/ecogeek/1125/yahoo-data-center-will-be-powered-by-niagara-falls.html">proposed Yahoo data center to be built attached to Niagara Falls</a>) everyone thought it was a new thing.</p>
<p>What happened was the 1958 demonstration of successful integrated circuits.  This and the transistor started an era of micro-miniaturization that took the world by storm.  By 1971 Intel had released a single-chip CPU (the 4004) as a commercial product.  This chip implemented the core of a computer in a fingertip-sized package that contained 2300 transistors.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/180px-Intel_4004.jpg" alt="180px-Intel_4004.jpg" border="0" width="180" height="173" /><br />
</center></p>
<p>From here on everything was desktop calculators, pocket calculators and digital watches.  And then the personal computer and the personal computer revolution hit.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/180px-Popular_Electronics_Cover_Jan_1975.jpg" alt="180px-Popular_Electronics_Cover_Jan_1975.jpg" border="0" width="180" height="240" /><br />
</center></p>
<p>IBM kicked the PC revolution into high gear when they pushed into the market in 1981.  The personal computer was a supreme distraction that pulled attention away from the monolithic computers for fifteen years.  And for a long while networking and shared information were both nearly forgotten. Computers were for spreadsheets, desktop publishing and other non-networked tasks.
<p />
<p>However, out of public view the monolithic network continued to develop.  The Internet started as ARPANET and has grown, connecting universities and defense contractors, from 1969 to the present.  The messaging formats (it is inappropriate to use the more common term &#8220;technology&#8221; to describe HTTP and HTML) we call &#8220;The World Wide Web&#8221; were invented (without much fanfare) in 1989.  Netscape was founded in 1994 and made the World Wide Web and Internet available to the PC.  And then the Internet hit like a tsunami.  Electronic commerce and speculation funded the initial burst.  Then on-line advertising took over and we are back to building new encyclopedias, tracking everyone and once again building city-block-sized computers (now called data centers).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/google_data_center_lenoir.jpg" alt="google_data_center_lenoir.jpg" border="0" width="512" height="355" /><br />
</center></p>
<p>Once again we are being told our data is too important to be locked in our desk (or PC) and everything is migrating back to the mainframe (now called &#8220;the cloud&#8221;).
<p />
<p>Will the cycle reverse?  If applications are moving into the cloud now will they ever move back out?
<p />
<p>Moore&#8217;s law has a way of shrinking things (a current smart phone outperforms many early mainframes, supercomputers and data centers).  Will individual PCs once again be more important than the network?  Some of the more useful parts of the Internet (like Wikipedia) are small enough to put on current PCs.  The data centers and networks will not go away any time soon, but excitement and attention could move on to something else.  Devices that you can carry everywhere and that have intermittent or expensive connections to the Internet might gain an advantage from being able to cache some of the Internet.  And excitement follows what is new, so a stable pervasive cloud would likely be taken for granted (like roads, power, telephone and other utilities).
<p />
<p>Another thing that could migrate applications back out of the cloud (assuming they migrate in) is if access to the user becomes too important to delegate to the cloud.  eCommerce applications take user access when they can get it, but many other applications may depend more on immediate access to the user than on grabbing fresh data from the network.  For example, a pacemaker is likely to run most of its application from an embedded computer; this computer might talk to the cloud when it can, but the application will be designed to stand alone as long as possible.
<p />
<p>In the end evangelizing the coming triumph of factory scale computing and networking is pointless.  It is already here and has no great need for cheerleaders.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/02/postels-law-not-sure-who-to-be-angry-with/' rel='bookmark' title='Permanent Link: Postel&#8217;s Law: Not Sure Who To Be Angry With'>Postel&#8217;s Law: Not Sure Who To Be Angry With</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='Permanent Link: I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/on-the-hysteria-over-the-cloud/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
