<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; significance</title>
	<atom:link href="http://www.win-vector.com/blog/tag/significance/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Thu, 29 Jul 2010 17:09:49 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Statistics to English Translation, Part 2b: Calculating Significance</title>
		<link>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=statistics-to-english-translation-part-2b-calculating-significance</link>
		<comments>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 07:02:40 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[F-test]]></category>
		<category><![CDATA[significance]]></category>
		<category><![CDATA[t-test]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1281</guid>
		<description><![CDATA[In the previous installment of the Statistics to English Translation, we discussed the technical meaning of the term &#8220;significant&#8221;. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance [...]


]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-’significant’-doesn’t-always-mean-’important’/">previous installment</a> of the <a href="http://www.win-vector.com/blog/category/statistics-to-english-translation/">Statistics to English Translation</a>, we discussed the technical meaning of the term &#8220;significant&#8221;. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance in research papers: statements like &#8220;<!-- MATH  $(F(2, 864) = 6.6, p = 0.0014)$  --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg4.png" border="0" alt="$ (F(2, 864) = 6.6, p = 0.0014)$" width="238" height="37" align="middle" /> &#8221;.</p>
<p>As in the <a href="http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-’significant’-doesn’t-always-mean-’important’/">last article</a>, we will concentrate on situations where we want to test the difference of means. You should read that previous article first, so you are familiar with the terminology that we use in this one.</p>
<p>A pdf version of this current article can be found <a href="http://win-vector.com/dfiles/ste2b_calculatesig.pdf">here</a>.<br />
<span id="more-1281"></span></p>
<h1><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">How is Significance Determined?</a></h1>
<p>Generally speaking, we calculate significance by computing a <em>test statistic</em> from the data. If we assume a specific null hypothesis, then we know that this test statistic will be distributed in a certain way. We can then compute how likely it is to observe our value of the test statistic, if we assume that the null hypothesis is true.</p>
<p>We&#8217;ll explain the use of a test statistic with our Sneetch example from the last installment.</p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">The t-test for Difference of Means</a></h1>
<p>Suppose that the test scores for both Star-Bellies and Plain-Bellies are normally distributed, with the means and standard deviations as given in the table below.</p>
<div align="center">
<table cellpadding="3" border="1">
<tr>
<td align="center">&nbsp;</td>
<td align="center"><img width="16" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg5.png" alt="$ n$"> (number of subjects)</td>
<td align="center"><img width="21" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg6.png" alt="$ m$"> (mean score)</td>
<td align="center"><img width="14" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg7.png" alt="$ s$"> (standard deviation)</td>
</tr>
<tr>
<td align="center">Star-Bellies</td>
<td align="center">50</td>
<td align="center">78</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">Plain-Bellies</td>
<td align="center">40</td>
<td align="center">74</td>
<td align="center">8</td>
</tr>
</table>
</div>
<p>Remember from the previous installment that we can estimate the true population means <img width="24" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg8.png" alt="$ \mu_1$"> and <img width="24" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg9.png" alt="$ \mu_2$"> as normally distributed around the empirical population means <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg10.png" alt="$ m_1$"> and <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg11.png" alt="$ m_2$"> respectively, with variances<br />
<img width="52" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg12.png" alt="$ \sigma^2/{n_1}$"> and<br />
<img width="52" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg13.png" alt="$ \sigma^2/{n_2}$"> . This is shown in Figure <a href="#fig:twomeans">1</a>. Informally speaking, there is no significant difference in the two populations if the shaded overlap area in Figure <a href="#fig:twomeans">1</a> is large.</p>
<div align="center"><a name="fig:twomeans" id="fig:twomeans"></a><a name="36"></a></p>
<table>
<caption align="bottom"><strong>Figure 1:</strong> The estimates of the means for two populations</caption>
<tr>
<td>
<div align="center"><img width="282" height="204" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./overlap.png" alt="Image overlap"></div>
</td>
</tr>
</table>
</div>
<p>Calculating this area is somewhat involved. Instead, we calculate the <em>t-statistic</em>:</p>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="126" height="62" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg14.png" alt="$\displaystyle t = \frac{(m_2 - m_1)}{s_D}$"></td>
<td nowrap width="10" align="right">(1)</td>
</tr>
</table>
</div>
<p><br clear="all"><br />
where <img width="26" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg15.png" alt="$ s_D$"> is the standard error of the difference of the means, computed from the <em>pooled variance</em> of the two populations.</p>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="325" height="64" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg16.png" alt="$\displaystyle {s_D}^2 = \frac{n_1\cdot {s_1}^2 + n_2\cdot {s_2}^2}{n_1 + n_2 - 2} \cdot (1/n_1 + 1/n_2)$"></td>
<td nowrap width="10" align="right">(2)</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p>For our Sneetch example, <img width="75" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg17.png" alt="$ s_D = 1.6$"> , and <img width="79" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg18.png" alt="$ t=2.499$"> , or the negative of that, depending on which group is Group 1. There are<br />
<img width="142" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg19.png" alt="$ 50 + 40 - 2 = 88$"> degrees of freedom.</p>
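<p>As a quick check, the numbers above can be reproduced from the table with a few lines of Python. This is only a sketch: the variable names are ours, and the Star-Bellies are taken as the first term of the difference so that the statistic comes out positive.</p>

```python
import math

# Summary statistics from the table above.
n1, m1, s1 = 50, 78.0, 7.0   # Star-Bellies
n2, m2, s2 = 40, 74.0, 8.0   # Plain-Bellies

# Equation (2): pooled variance, scaled by (1/n1 + 1/n2).
pooled = (n1 * s1**2 + n2 * s2**2) / (n1 + n2 - 2)
s_d = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))

# Equation (1), with the groups ordered so that t is positive.
t = (m1 - m2) / s_d
dof = n1 + n2 - 2

print(round(s_d, 1), round(t, 3), dof)
```

<p>Swapping the two groups only flips the sign of the statistic.</p>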
<p>If the null hypothesis is true, and the two populations are identical, then <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> is distributed according to <em>Student&#8217;s distribution with<br />
<img width="105" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg20.png" alt="$ N_1 + N_2 - 2$"> degrees of freedom</em>. Student&#8217;s distribution is sort of a &#8220;stretched out&#8221; bell curve; as the degrees of freedom increase (<br />
<img width="122" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg21.png" alt="$ N_1 + N_2 \rightarrow \infty$"> ), Student&#8217;s distribution approaches the standard normal distribution, <img width="63" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg22.png" alt="$ N(0, 1)$"> <a name="tex2html2" href="#foot209" id="tex2html2"><sup>1</sup></a>.</p>
<p>In other words, if the null hypothesis is true, <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> should be near zero. The probability of seeing a <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> of a certain magnitude or greater under the null hypothesis is given by the area under the tails of Student&#8217;s distribution:</p>
<div align="center"><a name="57"></a></p>
<table>
<caption align="bottom"><strong>Figure 2:</strong> The area under the tails for a given <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"></caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./twotailedtest.jpg" alt="Image twotailedtest"></div>
</td>
</tr>
</table>
</div>
<p>This area is <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> . For the Sneetch example, <img width="82" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg28.png" alt="$ p = 0.014$"> .</p>
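<p>The tail area can be approximated without a statistics package by integrating the density of Student&#8217;s distribution numerically. The sketch below uses a plain midpoint rule and only the standard library; the function names are ours, and a statistical package would normally compute this from the CDF directly.</p>

```python
import math

def t_pdf(x, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2.0 * math.log(1.0 + x * x / dof))

def two_tailed_p(t, dof, steps=4000):
    """Area under both tails beyond |t|, by midpoint-rule integration."""
    width = abs(t) / steps
    # Area between 0 and |t|; the distribution is symmetric about zero.
    central = sum(t_pdf((i + 0.5) * width, dof) for i in range(steps)) * width
    return 1.0 - 2.0 * central

p = two_tailed_p(2.499, 88)
print(round(p, 3))
```

<p>For the Sneetch values this agrees with the p-value quoted above to three decimal places.</p>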
<p>The further out on the tails <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> is, the stronger the evidence that you should reject the null hypothesis. If you know for some reason that the mean of one population will be greater than or equal to the other, then you can use the <em>one-tailed test</em>:</p>
<div align="center"><a name="64"></a></p>
<table>
<caption align="bottom"><strong>Figure 3:</strong> The one-tailed test for a given <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"></caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./onetailedtest.jpg" alt="Image onetailedtest"></div>
</td>
</tr>
</table>
</div>
<p>This test halves the p-value as compared to the two-tailed test, making a given <img width="12" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg1.png" alt="$ t$"> value twice as significant. When in doubt about which to use, the two-tailed test is more conservative against false positives<a name="tex2html5" href="#foot210" id="tex2html5"><sup>2</sup></a>.</p>
<p>In discussions of t-tests, you will often see statements of the form:</p>
<blockquote><p>The t-test rejects the hypothesis that two means are equal if</p></blockquote>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="88" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg31.png" alt="$\displaystyle \vert t\vert &gt; t_{\alpha/2, \nu}$"></td>
<td nowrap width="10" align="right">&nbsp;&nbsp;&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<blockquote><p>for a two-tailed test, or</p></blockquote>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="64" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg32.png" alt="$\displaystyle t &gt; t_{\alpha, \nu}$"></td>
<td nowrap width="10" align="right">&nbsp;&nbsp;&nbsp;</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<blockquote><p>for a (right-sided) one-tailed test.</p></blockquote>
<p>The quantities on the right hand side of the two equations above are called the <em>critical values</em> for a given significance level <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> (usually,<br />
<img width="75" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg34.png" alt="$ \alpha = 0.05$"> ) and <img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg35.png" alt="$ \nu$"> degrees of freedom. The critical values are the values for which the area of the right hand tail is equal to <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> .</p>
<div align="center"><a name="211"></a></p>
<table>
<caption align="bottom"><strong>Figure 4:</strong> Critical value for a one-tailed test. Reject the null hypothesis if<br />
<img width="66" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg2.png" alt="$ t &gt; t_{crit}$"></caption>
<tr>
<td>
<div align="center"><img width="385" height="252" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./onetailedcritval.png" alt="Image onetailedcritval"></div>
</td>
</tr>
</table>
</div>
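<p>For a two-tailed test, the critical value is the one for which the area under each single tail is half the significance level, so that the two tails together add up to the full significance level.</p>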
<div align="center"><a name="212"></a></p>
<table>
<caption align="bottom"><strong>Figure 5:</strong> Critical value for a two-tailed test. Reject the null hypothesis if<br />
<img width="77" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg3.png" alt="$ \vert t\vert &gt; t_{crit}$"></caption>
<tr>
<td>
<div align="center"><img width="384" height="248" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./twotailedcritval.png" alt="Image twotailedcritval"></div>
</td>
</tr>
</table>
</div>
<p>This convention dates back to the time when computational resources were scarce, and researchers had to use pre-computed tables of critical values, rather than calculating <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> directly. Today, general statistical packages such as R or Matlab can compute the CDFs of any number of standard distributions; once you can compute the CDF, directly computing <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> (the area under the tails) is straightforward. Despite this, many tutorials of the t-test (and of the F-test, and other significance tests) still adhere to the convention of comparing test statistics to critical values. This tends to needlessly ritualize the whole process, and make it seem more complicated and mysterious than it actually is, at least in my opinion.</p>
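<p>To illustrate how the two views relate, the entries of a critical-value table can be recovered from the tail area by a simple search. The sketch below is one way to do it, assuming a midpoint-rule integral for the tail area and a bisection search; the function names and the search bound of 50 are our own choices, not part of any package.</p>

```python
import math

def t_pdf(x, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2.0 * math.log(1.0 + x * x / dof))

def two_tailed_p(t, dof, steps=2000):
    """Area under both tails beyond |t|, by midpoint-rule integration."""
    width = abs(t) / steps
    central = sum(t_pdf((i + 0.5) * width, dof) for i in range(steps)) * width
    return 1.0 - 2.0 * central

def critical_value(alpha, dof, tails=2):
    """Value of t whose tail area equals alpha, found by bisection."""
    # A one-tailed area of alpha is a two-tailed area of 2 * alpha.
    target = alpha if tails == 2 else 2.0 * alpha
    low, high = 0.0, 50.0  # assumes the critical value lies below 50
    for _ in range(60):
        mid = (low + high) / 2.0
        if two_tailed_p(mid, dof) > target:
            low = mid    # tails still too large: move outward
        else:
            high = mid
    return (low + high) / 2.0

t_crit_two = critical_value(0.05, 88, tails=2)
t_crit_one = critical_value(0.05, 88, tails=1)
reject_two_tailed = abs(2.499) > t_crit_two
print(round(t_crit_two, 3), round(t_crit_one, 3), reject_two_tailed)
```

<p>Comparing the statistic with the critical value and comparing the p-value with the significance level always give the same decision; the p-value simply carries more information.</p>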
<p>David Freedman was very much against the continued practice of using critical values, rather than reporting the actual p-value. The last chapter of Freedman, Pisani and Purves [<a href="#Freedman07">FPP07</a>] is worth reading for its discussion of this, and other potential pitfalls of significance tests.</p>
<p>Some standard packages for evaluating t-tests, F-tests, or the ANOVA also present analysis results in terms of critical values. Most of them do usually print the actual p value as well, along with the value of the test statistic and the degrees of freedom. Most researchers rightfully report the test statistics along with the actual significance levels: &#8220;we conclude that there is a significant difference in mathematical performance (t(88) = 2.499, p = 0.014)&#8230; .&#8221; Here, 88 gives the degrees of freedom, <img width="45" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg36.png" alt="$ t(88)$"> is the value of the t-statistic, and <img width="14" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg27.png" alt="$ p$"> is of course the p-value.</p>
<p>Similar comments apply to the F-test, discussed in more detail below.</p>
<h2><a name="SECTION00021000000000000000" id="SECTION00021000000000000000">Assumptions</a></h2>
<p>Strictly speaking, the t-test is only valid for normally distributed data where both populations have equal variance. However, the test is fairly robust to non-normal data [<a href="#Box53">Box53</a>]. You can verify that the sample variances are &#8220;equal enough&#8221; &#8211; that is, they could plausibly both be sampled observations from populations with the same variance, by using the <em>F-test</em>. The F-statistic</p>
<div align="center"><img width="102" height="40" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg37.png" alt="$\displaystyle F = {s_1}^2/{s_2}^2 $"></div>
<p>is distributed according to the <em>F distribution with<br />
<img width="131" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg38.png" alt="$ (n_1 - 1,n_2 - 1)$"> degrees of freedom</em></p>
<div align="center"><a name="104"></a></p>
<table>
<caption align="bottom"><strong>Figure 6:</strong> The F distribution</caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./Ftest.jpg" alt="Image Ftest"></div>
</td>
</tr>
</table>
</div>
<p>In practice, the larger variance is usually put in the numerator, so <img width="54" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg39.png" alt="$ F &gt; 1$"> . The test should still be two-tailed, so you should double the area under the right-hand tail<a name="tex2html9" href="#foot107" id="tex2html9"><sup>3</sup></a>. In this situation, you want to check if you should accept the null hypothesis (that<br />
<img width="54" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg44.png" alt="$ F \approx 1$"> ) at a given significance level. If so, then you can go ahead and apply the t-test.</p>
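<p>As an illustration, here is the equal-variance check applied to the Sneetch standard deviations from the table. The sketch integrates the F density with a midpoint rule, which assumes at least two degrees of freedom in the numerator so that the density stays bounded near zero; the function names are ours.</p>

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom, for x above 0."""
    log_pdf = (math.lgamma((d1 + d2) / 2.0) - math.lgamma(d1 / 2.0)
               - math.lgamma(d2 / 2.0) + (d1 / 2.0) * math.log(d1 / d2)
               + (d1 / 2.0 - 1.0) * math.log(x)
               - ((d1 + d2) / 2.0) * math.log(1.0 + d1 * x / d2))
    return math.exp(log_pdf)

def f_right_tail(f, d1, d2, steps=20000):
    """Area to the right of f, by midpoint-rule integration from 0 to f."""
    width = f / steps
    body = sum(f_pdf((i + 0.5) * width, d1, d2) for i in range(steps)) * width
    return 1.0 - body

s_star, n_star = 7.0, 50     # Star-Bellies
s_plain, n_plain = 8.0, 40   # Plain-Bellies (larger variance goes on top)

f_stat = s_plain**2 / s_star**2
# Two-tailed test: double the area under the right-hand tail.
p_equal_var = 2.0 * f_right_tail(f_stat, n_plain - 1, n_star - 1)
print(round(f_stat, 3), round(p_equal_var, 2))
```

<p>A p-value well above the usual 0.05 level here means the two sample variances are plausibly draws from populations with the same variance, so the ordinary t-test is reasonable.</p>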
<p>There is a variation of the t-test for distributions of unequal variance, called Welch&#8217;s t-test [<a href="#WikiWelch">Wikc</a>]. In this case, you are only checking if the means are equal, not that the distributions are the same.</p>
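<p>For comparison, Welch&#8217;s statistic for the Sneetch data can be sketched as follows. The formulas used here are the standard ones (each variance divided by its own sample size, and the Welch-Satterthwaite approximation for the degrees of freedom); they are not derived in this article, and the sketch treats the table&#8217;s standard deviations as unbiased sample estimates.</p>

```python
import math

n1, m1, s1 = 50, 78.0, 7.0   # Star-Bellies
n2, m2, s2 = 40, 74.0, 8.0   # Plain-Bellies

# Each group contributes its own variance of the mean; nothing is pooled.
v1 = s1**2 / n1
v2 = s2**2 / n2

t_welch = (m1 - m2) / math.sqrt(v1 + v2)

# Welch-Satterthwaite approximation; generally not an integer.
dof_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(round(t_welch, 3), round(dof_welch, 1))
```

<p>With these numbers the statistic is close to the pooled value, and the effective degrees of freedom drop from 88 to about 78.</p>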
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">The F-test for Analysis of Variance (ANOVA)</a></h1>
<p>ANOVA is an extension of the difference of means test above to the case of more than two populations. The null hypothesis in this case is that all the population means are equal &#8211; or more strictly, that all the treatment groups are drawn from the same population.</p>
<p>The simplest version of the ANOVA is the <em>one-way ANOVA</em>, where there are <img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg45.png" alt="$ k$"> <em>treatment groups</em> (populations) with <img width="21" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg46.png" alt="$ n_i$"> subjects (or repetitions, or replications) each, for a total of <img width="22" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg47.png" alt="$ N$"> subjects. Each population corresponds to a different single factor (a treatment or a condition: for example, a type of medicine, or a Star-Bellied Sneetch vs. a Plain-Bellied Sneetch vs. a Grinch). Two- or three-way ANOVAs correspond to varying two or three different factors combinatorially. For example, we could do a two-way ANOVA of Sneetch math performance by considering both the belly type and the gender of the Sneetches.</p>
<div align="center"><a name="115"></a></p>
<table>
<caption align="bottom"><strong>Figure 7:</strong> Table for a Two-way ANOVA of Sneetch math performance</caption>
<tr>
<td>
<div align="center"><img width="203" height="243" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./twowayANOVA.png" alt="Image twowayANOVA"></div>
</td>
</tr>
</table>
</div>
<p>We will only discuss one-way ANOVA in this article, since that covers all the relevant ideas about calculating significance.</p>
<p>For a one-way ANOVA, we have the population means <img width="27" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg48.png" alt="$ m_i$"> and variances <img width="27" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg49.png" alt="$ {s_i}^2$"> . We can also calculate the overall mean <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg50.png" alt="$ m_0$"> , over the entire aggregate population.</p>
<p>The <em>between-groups mean sum of squares</em>, which is an estimate of the <em>between-groups variance</em>, is given by</p>
<div align="center">
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="260" height="58" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg51.png" alt="$\displaystyle {s_B}^2 = \frac{1}{k-1} \sum_i {n_i \cdot (m_i - m_0)^2}$"></td>
<td nowrap width="10" align="right">(3)</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><img width="33" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg52.png" alt="$ {s_B}^2$"> is sometimes designated <img width="48" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg53.png" alt="$ MS_B$"> . It is a measure of how the population means vary with respect to the grand mean.</p>
<p>The <em>within-group mean sum of squares</em> is an estimate of the <em>within-group variance</em>:</p>
<div align="center"><a name="eqn:varw" id="eqn:varw"></a></p>
<table cellpadding="0" width="100%" align="center">
<tr valign="middle">
<td nowrap align="center"><img width="256" height="77" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg54.png" alt="$\displaystyle {s_W}^2 = \frac{1}{N-k} \sum_i^k \sum_j^{n_i} {(x_{ij} - m_i)}^2$"></td>
<td nowrap width="10" align="right">(4)</td>
</tr>
</table>
</div>
<p><br clear="all"></p>
<p><img width="37" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg55.png" alt="$ {s_W}^2$"> is sometimes designated <img width="52" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg56.png" alt="$ MS_W$"> . It is a measure of the &#8220;average population variance&#8221;.</p>
<div align="center"><a name="142"></a></p>
<table>
<caption align="bottom"><strong>Figure 8:</strong> Within-group and between-group variance</caption>
<tr>
<td>
<div align="center"><img width="322" height="214" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./sigmas.png" alt="Image sigmas"></div>
</td>
</tr>
</table>
</div>
<p>If the null hypothesis is true, then</p>
</p>
<div align="center"><img width="114" height="40" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg57.png" alt="$\displaystyle F = {s_B}^2/{s_W}^2 $"></div>
<p>is distributed according to the F distribution with<br />
<img width="116" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg58.png" alt="$ (k-1, n-k)$"> degrees of freedom.</p>
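<p>The two mean sums of squares and their ratio are straightforward to compute from raw data. The sketch below follows Equations (3) and (4) on a small made-up data set; the scores are hypothetical and chosen only so the arithmetic is easy to follow.</p>

```python
# Hypothetical scores for k = 3 treatment groups (illustrative data only).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]

k = len(groups)
n_total = sum(len(g) for g in groups)
means = [sum(g) / len(g) for g in groups]
grand_mean = sum(sum(g) for g in groups) / n_total

# Equation (3): between-groups mean sum of squares.
s_b2 = sum(len(g) * (m - grand_mean) ** 2
           for g, m in zip(groups, means)) / (k - 1)

# Equation (4): within-group mean sum of squares.
s_w2 = sum((x - m) ** 2
           for g, m in zip(groups, means) for x in g) / (n_total - k)

f_stat = s_b2 / s_w2
dof = (k - 1, n_total - k)
print(round(f_stat, 3), dof)
```

<p>Here the group means (2, 3 and 6) are spread far more widely than the scores within any one group, so the ratio comes out much larger than 1.</p>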
<div align="center"><a name="150"></a></p>
<table>
<caption align="bottom"><strong>Figure 9:</strong> p-value for the one-tailed F-test</caption>
<tr>
<td>
<div align="center"><img width="514" height="365" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./Ftest.jpg" alt="Image Ftest"></div>
</td>
</tr>
</table>
</div>
<p>That is, under the null hypothesis, the within-group and between-group variances should be about equal:<br />
<img width="54" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg44.png" alt="$ F \approx 1$"> . If <img width="54" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg59.png" alt="$ F &lt; 1$"> , then some of the treatment groups overlap other groups substantially, so practically speaking, one might as well accept the null hypothesis. Hence, a one-sided F test is good enough. As with the t-test, research papers usually give the value of the F statistic, the degrees of freedom, and the p-value: &#8220;<br />
<img width="238" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg4.png" alt="$ (F(2, 864) = 6.6, p = 0.0014)$"> &#8221;. In this example, the test statistic value is 6.6, and it was evaluated against the F distribution with (2, 864) degrees of freedom, which means that<br />
<em>k</em> = 3 and <em>n</em> = 867 (because <em>n</em> - <em>k</em> = 864). The p-value is 0.0014.</p>
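<p>The reported p-value can be checked from the F statistic and its degrees of freedom alone. The sketch below integrates the F density with a midpoint rule, which assumes at least two degrees of freedom in the numerator; the function names are ours, and a statistics package would give the same number from its F CDF.</p>

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom, for x above 0."""
    log_pdf = (math.lgamma((d1 + d2) / 2.0) - math.lgamma(d1 / 2.0)
               - math.lgamma(d2 / 2.0) + (d1 / 2.0) * math.log(d1 / d2)
               + (d1 / 2.0 - 1.0) * math.log(x)
               - ((d1 + d2) / 2.0) * math.log(1.0 + d1 * x / d2))
    return math.exp(log_pdf)

def f_right_tail(f, d1, d2, steps=20000):
    """One-tailed p-value: area to the right of f under the F density."""
    width = f / steps
    body = sum(f_pdf((i + 0.5) * width, d1, d2) for i in range(steps)) * width
    return 1.0 - body

# The reported result: F(2, 864) = 6.6.
p_anova = f_right_tail(6.6, 2, 864)
print(round(p_anova, 4))
```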
<h2><a name="SECTION00031000000000000000" id="SECTION00031000000000000000">Assumptions</a></h2>
<p>Like the t-test, ANOVA assumes that the data is normally distributed with equal variances. According to Box [<a href="#Box53">Box53</a>], ANOVA is fairly robust to unequal variances when the population sizes are about the same, but you might want to check anyway. If all the populations are the same size (all the <img width="21" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg46.png" alt="$ n_i$"> are the same), the easiest way to check for equality of variances is an F-test of the statistic<br />
<img width="140" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg61.png" alt="$ F = {s_{max}}^2/{s_{min}}^2$"> with <img width="49" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg62.png" alt="$ n-1$"> degrees of freedom [<a href="#Sachs84">Sac84</a>]. In other cases, you can use Bartlett&#8217;s test [<a href="#WikiBartlett">Wika</a>] or Levene&#8217;s test [<a href="#WikiLevene">Wikb</a>]. Bartlett&#8217;s test uses a test statistic that is distributed as the <img width="24" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg63.png" alt="$ \chi^2$"> distribution, and Levene&#8217;s test uses one that is distributed as the F distribution. Levene&#8217;s test does not assume normally distributed data.</p>
<p>If the data are not normally distributed, or have unequal variance, often they can be transformed to a form that is closer to obeying the assumptions of ANOVA. The following table of transformations is based on [<a href="#Sachs84">Sac84</a>, p. 517], and other sources [<a href="#ndsu">Hor</a>].</p>
<div align="center"><a name="177"></a></p>
<table>
<caption align="bottom"><strong>Figure 10:</strong> Table of Transformations</caption>
<tr>
<td><img width="500" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg64.png" alt="\begin{figure}\begin{center} \begin{tabular}{\vert p{2.5in}\vert p{3.5in}\vert} ... ...} \ $\sigma \approx k\mu$\ &amp; \ \hline \end{tabular} \end{center}\end{figure}"></td>
</tr>
</table>
</div>
<p>Jim Deacon from the University of Edinburgh lists some suggestions as well [<a href="#deacon07">Dea</a>]. He also reminds us that running ANOVA on the transformed data will identify significant differences in the <em>transformed</em> data. This is <em>not</em> the same as saying there are significant differences in the original data!</p>
<h1><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Once the Null Hypothesis is Rejected</a></h1>
<p>If you are able to reject the ANOVA null hypothesis, you will usually want to know which population means are significantly different from the rest. Often, in fact, you are primarily interested in which population had the highest mean. For example, if you are comparing the efficacy of a new medicine A against existing medicines B and C, you are probably not too concerned about whether B and C perform significantly differently from each other, only about whether A is significantly better than both.</p>
<p>If all you care about is whether the highest mean is significantly higher than the others, you can simply test where the statistic (the difference between the two means divided by its standard error)</p>
<div align="center"><img width="211" height="56" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg65.png" alt="$\displaystyle (m_1 - m_2)/\sqrt{{s_W}^2 \frac{n_1 + n_2}{n_1\cdot n_2}} $"></div>
<p>falls on the Student-t distribution with <img width="50" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg66.png" alt="$ n-k$"> degrees of freedom. Here, <img width="37" height="38" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg55.png" alt="$ {s_W}^2$"> is the within-group variance, as calculated in Equation <a href="#eqn:varw">4</a>, <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg10.png" alt="$ m_1$"> and <img width="29" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg11.png" alt="$ m_2$"> are the highest and second highest population means, <em>n</em><sub>1</sub> and <em>n</em><sub>2</sub> are the sizes of those two groups, <img width="16" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg5.png" alt="$ n$"> is the total number of samples (<img width="81" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg67.png" alt="$ n = \sum{n_i}$">), and <img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg45.png" alt="$ k$"> is the number of treatment groups.</p>
<p>This test is usually written</p>
<div align="center"><img width="409" height="67" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg68.png" alt="$\displaystyle m_1 - m_2 &gt; t_{(n-k, \alpha/2)} \cdot \sqrt{{s_W}^2 \cdot \frac{n_1 + n_2}{n_1\cdot n_2}} = LSD_{(1,2)} $"></div>
<p>where <img width="75" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg69.png" alt="$ t_{(n-k, \alpha/2)}$"> is the (two-sided) critical value for significance level <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> and <img width="50" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg66.png" alt="$ n-k$"> is the number of degrees of freedom to use. This quantity is called the <em>least significant difference (LSD)</em> between the highest and second highest means, and the test is usually called the <em>LSD test</em>.</p>
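<p>As a rough illustration, the LSD comparison can be sketched in a few lines of Python. The function name and every number below are made up for the example; in particular, the critical value is assumed to have been looked up in a t table for the appropriate degrees of freedom rather than computed.</p>

```python
import math

def least_significant_difference(t_crit, s_w2, n1, n2):
    """LSD = t_crit * sqrt(s_W^2 * (n1 + n2) / (n1 * n2))."""
    return t_crit * math.sqrt(s_w2 * (n1 + n2) / (n1 * n2))

# Hypothetical values, for illustration only.
t_crit = 2.0         # assumed two-sided critical value, read from a t table
s_w2 = 50.0          # assumed within-group variance
n1, n2 = 25, 25      # assumed sizes of the two groups being compared
m1, m2 = 80.0, 75.0  # assumed highest and second highest means

lsd = least_significant_difference(t_crit, s_w2, n1, n2)
is_significant = (m1 - m2) > lsd
print(f"LSD = {lsd:.2f}, difference = {m1 - m2:.2f}, significant: {is_significant}")
```

<p>The difference between the two means is declared significant only when it exceeds the LSD.</p>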
<p>If you want to test all the population differences <img width="73" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg70.png" alt="$ m_i - m_j$"> for significance (or test the highest value against all of the others explicitly), then you need to take some care with the LSD test. Remember that a significance level of <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> means that with probability <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$"> you will make a false positive error on a single comparison. Testing all possible population differences requires <img width="22" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg71.png" alt="$ K$"> = (<img width="15" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg45.png" alt="$ k$"> choose <img width="14" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg72.png" alt="$ 2$">) comparisons, or <img width="90" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg73.png" alt="$ K = k-1$"> comparisons if you sort all the means in descending order and compare adjacent ones. Testing the highest mean against all the lower values is also <img width="90" height="34" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg73.png" alt="$ K = k-1$"> comparisons. Across all of these comparisons, the probability of making at least one false positive error can be as high as about <img width="48" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg74.png" alt="$ K \cdot \alpha$">. So if you want the overall significance level to be <img width="17" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg33.png" alt="$ \alpha$">, each individual comparison should use a stricter significance threshold <img width="78" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg75.png" alt="$ p \leq \alpha/K$">.</p>
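<p>The arithmetic behind this adjustment can be sketched as follows. The helper names and the choice of five groups are illustrative, and the family-wise error calculation assumes the comparisons are independent, which is a simplification that the discussion above does not make.</p>

```python
import math

def comparison_counts(k):
    """Number of comparisons among k groups: (all pairs, adjacent pairs)."""
    return math.comb(k, 2), k - 1

def familywise_error(alpha, K):
    """Chance of at least one false positive in K tests,
    ASSUMING the tests are independent (a simplification)."""
    return 1 - (1 - alpha) ** K

alpha = 0.05
k = 5  # hypothetical number of treatment groups

all_pairs, adjacent = comparison_counts(k)
per_test = alpha / all_pairs  # stricter per-comparison threshold, alpha / K

uncorrected = familywise_error(alpha, all_pairs)
corrected = familywise_error(per_test, all_pairs)

print(f"{all_pairs} pairwise comparisons, {adjacent} adjacent comparisons")
print(f"per-comparison threshold: {per_test:.4f}")
print(f"overall false positive chance, uncorrected: {uncorrected:.3f}")
print(f"overall false positive chance, corrected:   {corrected:.3f}")
```

<p>Under the independence assumption the uncorrected overall error stays below the simple bound of K times the per-test level, and dividing the threshold by K brings it back under the level originally intended.</p>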
<p>A preferred way to compare multiple means for significance (once the ANOVA null hypothesis has been rejected) is to use a <em>multiple range test</em> [<a href="#deacon07">Dea</a>] or <em>Tukey&#8217;s method</em> [<a href="#nistTukey">oST06</a>], rather than the LSD test. Tukey&#8217;s method tests all pairwise comparisons simultaneously, and the multiple range test starts with the broadest range (the highest and the lowest means), and works its way in until significance is lost.</p>
<h1><a name="SECTION00050000000000000000" id="SECTION00050000000000000000">Conclusion</a></h1>
<p>We&#8217;ve skimmed over many complications in this discussion. Hopefully, though, what we have gone over is enough to demystify much of the statistical discussion in research papers. Perhaps it will demystify the output of standard ANOVA and t-test packages for you as well.</p>
<p>Chong-ho Yu&#8217;s site [<a href="#yu09">hY</a>] gives a brief discussion of some of the issues that I&#8217;ve skimmed over. It also lists a few common non-parametric tests. These are tests that do not make assumptions about how the data is distributed, and so they may be more appropriate for data that is very non-normal, or for discrete data. They tend to have less power than parametric tests (that is, they have a lower true positive rate); so if the data is at all normal-like, parametric tests are preferred.</p>
<p>Significance tests are used in other applications beyond testing the difference in means or variances. They are used for testing whether events follow an expected distribution, for testing if there is a correlation between two variables, and for evaluating the coefficients of a regression analysis. We hope to cover some of these applications in future installments of this series.</p>
<h2><a name="SECTION00060000000000000000" id="SECTION00060000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Box53" id="Box53">Box53</a></dt>
<dd>G.E.P. Box, <i>Non-normality and tests on variances</i>, Biometrika <b>40</b> (1953), no.&nbsp;3/4, 318-335.</dd>
<dt><a name="deacon07" id="deacon07">Dea</a></dt>
<dd>Jim Deacon, <i>A multiple range test for comparing means in an analysis of variance</i>, <a href="http://www.biology.ed.ac.uk/research/groups/jdeacon/statistics/tress7.html">http://www.biology.ed.ac.uk/research/groups/jdeacon/statistics/tress7.html</a>.</dd>
<dt><a name="Freedman07" id="Freedman07">FPP07</a></dt>
<dd>David Freedman, Robert Pisani, and Roger Purves, <i>Statistics</i>, 4th ed., W. W. Norton &amp; Company, New York, 2007.</dd>
<dt><a name="ndsu" id="ndsu">Hor</a></dt>
<dd>Rich Horsley, <i>Transformations</i>, <tt><a name="tex2html14" href="http://www.ndsu.nodak.edu/ndsu/horsley/Transfrm.pdf" id="tex2html14">http://www.ndsu.nodak.edu/ndsu/horsley/Transfrm.pdf</a></tt>, Class notes, Plant Sciences 724, North Dakota State University.</dd>
<dt><a name="yu09" id="yu09">hY</a></dt>
<dd>Chong ho&nbsp;Yu, <i>Parametric tests</i>, <a href="http://www.creative-wisdom.com/teaching/WBI/parametric_test.shtml">http://www.creative-wisdom.com/teaching/WBI/parametric_test.shtml</a>.</dd>
<dt><a name="nistTukey" id="nistTukey">oST06</a></dt>
<dd>National&nbsp;Institute of&nbsp;Standards and Technology, <i>Tukey&#8217;s method</i>, NIST/SEMATECH e-Handbook of Statistical Methods, 2006, <a href="http://itl.nist.gov/div898/handbook/prc/section4/prc471.htm">http://itl.nist.gov/div898/handbook/prc/section4/prc471.htm</a>.</dd>
<dt><a name="Sachs84" id="Sachs84">Sac84</a></dt>
<dd>Lothar Sachs, <i>Applied statistics: A handbook of techniques</i>, 2nd ed., Springer-Verlag, New York, 1984.</dd>
<dt><a name="WikiBartlett" id="WikiBartlett">Wika</a></dt>
<dd>Wikipedia, <i>Bartlett&#8217;s test</i>, <tt><a name="tex2html15" href="http://en.wikipedia.org/wiki/Bartlett's_test" id="tex2html15">http://en.wikipedia.org/wiki/Bartlett's_test</a></tt>.</dd>
<dt><a name="WikiLevene" id="WikiLevene">Wikb</a></dt>
<dd>Wikipedia, <i>Levene&#8217;s test</i>, <tt><a name="tex2html16" href="http://en.wikipedia.org/wiki/Levene's_test" id="tex2html16">http://en.wikipedia.org/wiki/Levene's_test</a></tt>.</dd>
<dt><a name="WikiWelch" id="WikiWelch">Wikc</a></dt>
<dd>Wikipedia, <i>Welch&#8217;s t test</i>, <tt><a name="tex2html17" href="http://en.wikipedia.org/wiki/Welch's_t_test" id="tex2html17">http://en.wikipedia.org/wiki/Welch's_t_test</a></tt>.</dd>
</dl>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot209" id="foot209">&#8230;</a><a href="#tex2html2"><sup>1</sup></a></dt>
<dd>Remember from the last installment that when you are estimating the mean of a distribution with unknown mean <img width="16" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg23.png" alt="$ \mu$"> and unknown variance <img width="24" height="19" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg24.png" alt="$ \sigma^2$">, the 95% confidence interval around your estimate is <img width="115" height="39" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg25.png" alt="$ m \pm 2\cdot \sigma/\sqrt{n}$">. Intuitively speaking, Student&#8217;s distribution is what you get if you calculate confidence intervals using the estimated standard deviation <img width="14" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg7.png" alt="$ s$"> instead of the true but unknown standard deviation <img width="16" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg26.png" alt="$ \sigma$">. The distribution is stretched out compared to the normal distribution to reflect this increased uncertainty.</dd>
<dt><a name="foot210" id="foot210">&#8230; positives</a><a href="#tex2html5"><sup>2</sup></a></dt>
<dd>In his textbook <em>Statistics</em>, Freedman tells an anecdote about a study that was published in the <em>Journal of the AMA</em>, claiming to demonstrate that cholesterol causes heart attacks. The treatment group that took a cholesterol reducing drug had &#8220;significantly fewer&#8221; heart attacks than the control group (<br />
<img width="82" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg29.png" alt="$ p \approx 0.035$"> ). A closer reading revealed that the researchers used a one-tailed test, which is equivalent to <em>assuming</em> that the treatment group was going to have fewer heart attacks. What if the drug had <em>increased</em> the risk of heart attack? The proper two-tailed significance of their results would have been<br />
<img width="73" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg30.png" alt="$ p \approx 0.07$"> , which is higher than <em>JAMA</em>&#8217;s strict significance threshold of 0.05. [<a href="#Freedman07">FPP07</a>, p. 550]</dd>
<dt><a name="foot107" id="foot107">&#8230; tail</a><a href="#tex2html9"><sup>3</sup></a></dt>
<dd>The area to the right of <img width="19" height="17" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg40.png" alt="$ F$"> with <img width="45" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg41.png" alt="$ (a,b)$"> degrees of freedom is equal to the area to the left of <img width="38" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg42.png" alt="$ 1/F$"> , with <img width="45" height="37" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2bimg43.png" alt="$ (b,a)$"> degrees of freedom.</dd>
</dl>
<p></p>
<hr />


]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</title>
		<link>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=statistics-to-english-translation-part-2a-%25e2%2580%2599significant%25e2%2580%2599-doesn%25e2%2580%2599t-always-mean-%25e2%2580%2599important%25e2%2580%2599</link>
		<comments>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/#comments</comments>
		<pubDate>Fri, 04 Dec 2009 20:39:20 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[effect size]]></category>
		<category><![CDATA[hypothesis testing]]></category>
		<category><![CDATA[significance]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1186</guid>
		<description><![CDATA[In this installment of our ongoing Statistics to English Translation series, we will look at the technical meaning of the term &#8220;significant&#8221;. As you might expect, what it means in statistics is not exactly what it means in everyday language. As always, a pdf version of this article is available as well. Does too much [...]


]]></description>
			<content:encoded><![CDATA[<p>In this installment of our ongoing Statistics to English Translation series<a name="tex2html1" href="#foot133" id="tex2html1"><sup>1</sup></a>, we will look at the technical meaning of the term &#8220;significant&#8221;. As you might expect, what it means in statistics is not exactly what it means in everyday language.</p>
<p>As always, a <a href="http://www.win-vector.com/dfiles/ste2a_significance.pdf">pdf version of this article</a> is available as well.<span id="more-1186"></span></p>
<blockquote><p>Does too much salt cause high blood pressure, or doesn&#8217;t it? That debate has raged for decades, with a slew of studies finding &#8220;yes&#8221; and a slew of others finding &#8220;no.&#8221; Two new studies out today in the journal <em>Hypertension</em> tip the scales in favor of reducing sodium &#8211; particularly for those 1 in 4 Americans who have high blood pressure. One study found that reducing salt intake from 9,700 milligrams a day to 6,500 milligrams decreased blood pressure significantly in blacks, Asians, and whites who had untreated mild hypertension. Another study found that switching to a lower-salt diet helped lower blood pressure in folks with treatment-resistant hypertension.<br />
- &#8220;10 salt shockers that could make hypertension worse,&#8221; <em>U.S. News &amp; World Report</em> [<a href="#Kotz09">Kot09</a>]</p></blockquote>
<p>&#8220;Great!&#8221; you think. &#8220;Who needs to spend money on high blood pressure meds? I can just cut down my salt!&#8221; Well, maybe so, maybe not. To come to that conclusion, you need more information than you were given in that paragraph. What was the &#8220;significant&#8221; decrease in blood pressure? What was the &#8220;before&#8221; and the &#8220;after&#8221;? Does &#8220;significant&#8221; mean important, or useful? And why has there been so much controversy over this?</p>
<p>Let&#8217;s discuss the important points with an example.</p>
<div align="center"><img width="211" height="236" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./sneetches.jpg" alt="Image sneetches"></div>
<p>Suppose that we wanted to test for a difference in intelligence between two groups, say Star-Bellied Sneetches and Plain-Bellied Sneetches<a name="tex2html3" href="#foot134" id="tex2html3"><sup>2</sup></a>. We take a group of 50 Star-Bellies and a group of 40 Plain-Bellies, and give them both a series of tests designed to measure their mathematical, linguistic, and problem-solving abilities. After evaluating the data, we conclude that there is &#8220;a significant difference in mathematical performance (t(88) = 2.499, p = 0.014) between the two groups&#8221;. The mean mathematics score of the Star-Bellies is 78, with a standard deviation of 7, and the mean mathematics score of the Plain-Bellies is 74, with a standard deviation of 8, for a difference of 4 points<a name="tex2html4" href="#foot135" id="tex2html4"><sup>3</sup></a>.</p>
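<p>For readers who want to see where a statement like &#8220;t(88)&#8221; comes from, here is a minimal sketch that computes a two-sample t statistic from the summary numbers above. It assumes the equal-variance (pooled) form of the test, which is an assumption on our part, and the function name is made up for the example. The figures quoted in the text were presumably computed from the raw scores, so a calculation from the rounded summaries should land close to the reported value but need not reproduce it exactly.</p>

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic from summary statistics,
    ASSUMING equal variances in both groups (pooled form)."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df

# Summary statistics from the Sneetch example:
# 50 Star-Bellies (mean 78, sd 7) and 40 Plain-Bellies (mean 74, sd 8).
t_stat, df = pooled_t(78, 7, 50, 74, 8, 40)
print(f"t({df}) = {t_stat:.2f}")
```

<p>The number in parentheses is the degrees of freedom, here the two group sizes added together minus two.</p>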
<p>Should we interpret this result to mean that Star-Bellied Sneetches are better than Plain-Bellied ones at math? It depends.</p>
<h1><a name="SECTION00010000000000000000" id="SECTION00010000000000000000">How Hypothesis Tests Work</a></h1>
<p>The Sneetch example above and the blood-pressure study cited earlier are both examples of <em>hypothesis tests</em>. In hypothesis testing, researchers set their proposed hypothesis (that there is an effect or a relationship) against the <em>null hypothesis</em> that there is no effect or relationship. In this article, we consider proposed relationships of the form</p>
<blockquote><p>The mean value of X measured for group A is different from the mean value of X measured for group B.</p></blockquote>
<p>In this case, the null hypothesis is</p>
<blockquote><p>The mean value of X is the same for groups A and B, and any difference observed in the data is only by observational chance.</p></blockquote>
<p>In fact, we are actually testing the stricter null hypothesis:</p>
<blockquote><p>The distribution of X is the same for groups A and B, and any difference observed is only by observational chance.</p></blockquote>
<p>A and B are sometimes called <em>treatment groups</em>; this terminology comes from the original applications of hypothesis testing procedures, in agriculture and medicine. In the blood pressure study above, the treatment is daily salt intake. One group ingests about 9,700 milligrams of sodium a day, the other group about 6,500 milligrams a day. The question of interest is: does the difference in sodium intake make a difference in the average blood pressure of the two groups? The null hypothesis is &#8220;No.&#8221;</p>
<h2><a name="SECTION00011000000000000000" id="SECTION00011000000000000000">Significance</a></h2>
<p>We call an observed difference <em>significant</em> &#8211; meaning that a difference as large as we observed is probably not by chance &#8211; if the value <img width="40" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg3.png" alt="$ 1-p$"> is &#8220;high enough.&#8221; In the Sneetch example, <img width="70" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg4.png" alt="$ p = 0.014$"> is the <em>significance level</em> of the result. To interpret the p-value, suppose the null hypothesis is true: there is truly no difference between Star-Bellied math scores and Plain-Bellied math scores. If this is so, then there is only a 0.014 (1.4%) chance that the difference in the average scores of the two groups will be 4 points or larger. In other words, if the null hypothesis is true, and we administer this same test to different groups of 50 Star-Bellies and 40 Plain-Bellies a hundred times, then the difference in scores will be 4 points or more only about once or twice.</p>
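<p>The &#8220;repeat the test many times&#8221; reading of the p-value can be checked with a small simulation. This is only a sketch under stated assumptions: both groups are drawn from the same normal distribution, the common mean of 76 and standard deviation of 7.5 are values we picked to resemble the example, and a difference counts if it is at least 4 points in either direction.</p>

```python
import random

def simulate_null(n_star=50, n_plain=40, mean=76.0, sd=7.5,
                  threshold=4.0, trials=10000, seed=0):
    """Fraction of trials in which two groups drawn from the SAME
    distribution differ in sample mean by at least `threshold`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        star = sum(rng.gauss(mean, sd) for _ in range(n_star)) / n_star
        plain = sum(rng.gauss(mean, sd) for _ in range(n_plain)) / n_plain
        if abs(star - plain) >= threshold:
            hits += 1
    return hits / trials

fraction = simulate_null()
print(f"fraction of null trials with a gap of 4+ points: {fraction:.4f}")
```

<p>The fraction printed should be small, on the order of one or two trials in a hundred, which is the sense in which a 4 point gap is &#8220;rare&#8221; when there is no real difference.</p>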
<p>We interpret the fact that we have seen a difference that should be rare to be evidence that the null hypothesis <em>isn&#8217;t</em> true. So we <em>reject the null hypothesis</em> and say that there is a &#8220;significant difference&#8221; in the performance of the two groups. Alternatively, we could say that Star-Bellied Sneetches performed &#8220;significantly better&#8221; than Plain-Bellied Sneetches on the math test.</p>
<h2><a name="SECTION00012000000000000000" id="SECTION00012000000000000000">Effect Size</a></h2>
<p>Four points (or about a 5% difference) is the <em>effect size</em> of the comparison. The effect size represents what might be called the &#8220;practical significance&#8221; of the result. In general, the larger the effect size, the better. In this example, Star-Bellies might truly outperform Plain-Bellies by about four points on average, but if we were to examine the relationship between math scores and real-life math performance (say, how well college-attending Sneetches do in their math and science courses), we might discover that it takes a test score difference of ten points or more to reliably predict which Sneetches will do better. In that case, a four point average difference would not be a practical difference.</p>
<h1><a name="SECTION00020000000000000000" id="SECTION00020000000000000000">Evaluating a Result</a></h1>
<p>When evaluating a result, you should look both for its significance and its effect size. In practice, researchers usually consider a finding to be significant if <img width="62" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg5.png" alt="$ p \leq 0.05$">. This is actually a pretty large <img width="12" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg6.png" alt="$ p$">; it means that even if the null hypothesis is true, you would still observe a difference at least as large as the one that you observed about five times out of every one hundred trials. In fact, Sachs noted that <img width="77" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg7.png" alt="$ p &lt; 0.0027$"> used to be the commonly used threshold for significance ([<a href="#Sachs84">Sac84</a>, p. 114]).</p>
<p>Sometimes results are reported using an asterisk convention: (*) means <img width="62" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg5.png" alt="$ p \leq 0.05$">, (**) means <img width="62" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg8.png" alt="$ p \leq 0.01$">, and (***) means <img width="70" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg9.png" alt="$ p \leq 0.001$">. Hopefully, the actual significance level is reported (it isn&#8217;t always), as well as the actual effect size (it isn&#8217;t always).</p>
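<p>The asterisk convention is simple enough to write down as a lookup. The function name is ours, and returning an empty string for results above 0.05 is an assumption about how a non-significant result would be marked.</p>

```python
def significance_stars(p):
    """Map a p-value to the asterisk convention described above."""
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return ""  # assumed marking for a non-significant result

for p in (0.2, 0.014, 0.004, 0.0005):
    print(f"p = {p}: {significance_stars(p) or '(none)'}")
```

<p>Note how much information the stars discard: the Sneetch result, with p = 0.014, gets the same single asterisk as a result with p = 0.049.</p>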
<div align="center"><img width="240" height="180" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./cup_of_coffee.jpg" alt="Image cup_of_coffee"></div>
<p>The effect size in medical studies is often reported in the popular press with statements like &#8220;those who abstained from coffee had triple the risk of contracting colon cancer compared to those who drank three or more cups a day.&#8221; Does that mean that all confirmed Lapsang Souchong drinkers and the uncaffeinated should run out and learn to embrace Starbucks? Well, no. First of all, ask yourself: what is the baseline risk of colon cancer? If abstaining from coffee triples the risk from 0.01% to 0.03%, well, it probably isn&#8217;t worth worrying about. On the other hand, if the risk triples from 5% to 15%, perhaps that is a reason to take up espressos. You should also see who were the subjects of the study, and how similar they are to you. Suppose the study was done on Caucasian males in the U.S., ages 55-65, with no family history of colon cancer. If you are a young white American male, it&#8217;s possible that this study says something about your future health. If you are female or non-Caucasian or not living in the U.S., the finding may or may not be relevant to you. It depends on the mechanism that drives the relationship, and whether or not it applies to you as well as to the subjects of the study.</p>
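<p>The two coffee scenarios above share the same relative risk but have very different absolute risks. A short sketch makes the gap explicit; the function name is invented for the example, and the baselines are the hypothetical ones from the paragraph above, not figures from any study.</p>

```python
def risk_change(baseline, relative_risk):
    """Return (new risk, absolute increase) for a given baseline risk
    and relative risk, both expressed as fractions."""
    new_risk = baseline * relative_risk
    return new_risk, new_risk - baseline

# Hypothetical baselines from the text: 0.01% and 5%, each tripled.
for baseline in (0.0001, 0.05):
    new_risk, increase = risk_change(baseline, 3)
    print(f"baseline {baseline:.2%} becomes {new_risk:.2%}, "
          f"an absolute increase of {increase * 100:.2f} percentage points")
```

<p>&#8220;Triple the risk&#8221; describes both rows equally well, which is why the baseline has to be known before the headline means anything.</p>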
<h2><a name="SECTION00021000000000000000" id="SECTION00021000000000000000">&#8220;Significant&#8221; is not the same as &#8220;Important&#8221;</a></h2>
<blockquote><p>With a large sample, even a small difference can be &#8220;statistically significant&#8221;&#8230; . This doesn&#8217;t necessarily make it important. Conversely, an important difference may not be statistically significant if the sample size is too small.<br />
- Freedman, Pisani and Purves, <em>Statistics</em> [<a href="#Freedman07">FPP07</a>, p. 550]</p></blockquote>
<p>The ability of a study to detect a significant difference depends almost entirely on its size. When a researcher designs a study, she has to decide how much risk of error &#8211; and what type of error &#8211; she is willing to tolerate.</p>
<blockquote><p>How big a risk [of inventing a difference] between two indistinguishable treatments are we willing to put up with? This risk is known as the significance level <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg10.png" alt="$ \alpha$"> . [<a href="#Sachs84">Sac84</a>, p. 214]</p></blockquote>
<p><img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg10.png" alt="$ \alpha$"> is the probability of rejecting a null hypothesis that should be accepted. This is a Type I error (a false positive). <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg10.png" alt="$ \alpha$"> enters the design of the study as the threshold for p-values that the researcher will accept as significant.</p>
<blockquote><p>How big a risk do we allow of missing a substantial difference between two treatments? &#8230; This risk is called <img width="14" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg11.png" alt="$ \beta$"> . [<a href="#Sachs84">Sac84</a>, p. 214]</p></blockquote>
<p><img width="14" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg11.png" alt="$ \beta$"> is the probability of accepting a null hypothesis that should have been rejected. This is a Type II error (a false negative). The quantity <img width="41" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg12.png" alt="$ 1-\beta$"> is known as the <em>power</em> of the test: the probability that the test will correctly reject the null hypothesis when the alternative hypothesis is true.</p>
<blockquote><p>How small a difference should still be recognized as significant? This difference is called <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> . [<a href="#Sachs84">Sac84</a>, p. 214]</p></blockquote>
<p><img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> is the minimum effect size that we are willing to consider &#8220;practically significant.&#8221;</p>
<p>It is important to consider <em>all three</em> of <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg10.png" alt="$ \alpha$">, <img width="14" height="29" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg11.png" alt="$ \beta$">, and <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> when determining an appropriate sample size for a trial. The power of a test and the significance of a result both increase as the sample size <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> increases. So if <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> is not specified, <b>any difference, however small, can appear significant with a large enough <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"></b>, even if the difference is too small to matter in practice.</p>
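<p>To make the interplay of the three quantities concrete, here is a rough power calculation. It is a sketch, not a substitute for a proper power analysis: it uses a normal approximation to the two-sample test, assumes equal group sizes and a known common standard deviation, and the numbers (a 4 point difference, a standard deviation of 7.5) are borrowed loosely from the Sneetch example.</p>

```python
import math
from statistics import NormalDist

def approx_power(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-sample test.

    ASSUMES a normal approximation, equal group sizes, and a known
    common standard deviation sigma.
    """
    z = NormalDist()
    std_err = sigma * math.sqrt(2.0 / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / std_err
    # Probability that the test statistic lands beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (10, 50, 200, 1000):
    power = approx_power(delta=4.0, sigma=7.5, n_per_group=n)
    print(f"n per group = {n:4d}: power = {power:.3f}")
```

<p>With the difference and the significance level held fixed, the power climbs toward 1 as the group size grows, so a large enough study will flag almost any nonzero difference.</p>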
<h3><a name="SECTION00021100000000000000" id="SECTION00021100000000000000">The Central Limit Theorem</a></h3>
<p>To see why the above statement is true, we need a few more facts about estimating the mean. Suppose we have a random variable <img width="19" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg14.png" alt="$ X$"> that is normally (or nearly normally) distributed, with a true mean <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> and (unknown) variance <img width="21" height="16" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg15.png" alt="$ \sigma^2$"> . You want to estimate <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> by drawing <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> samples; the sample mean <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg16.png" alt="$ \bar{x}$"> gives you an estimate of <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> . According to the <em>Central Limit Theorem</em>, if you were to repeat this experiment over and over again, you would see that the estimated <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg16.png" alt="$ \bar{x}$"> has a normal distribution, with mean <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> and variance <!-- MATH<br />
 $\sigma^2/n$<br />
 --><br />
<img width="38" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg17.png" alt="$ \sigma^2/n$"> . So <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg16.png" alt="$ \bar{x}$"> is a good estimate of <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> , one that improves with a larger sample size <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> .</p>
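<p>The "repeat this experiment over and over" idea can be checked with a small simulation. This is only a sketch: the function name <tt>sample_means</tt>, the seed, and the values of the mean and standard deviation are assumptions chosen for illustration.</p>

```python
import random
import statistics

def sample_means(mu, sigma, n, trials, seed=0):
    """Repeat the 'draw n samples and average them' experiment `trials` times."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(trials)
    ]

if __name__ == "__main__":
    mu, sigma = 75.0, 8.0   # illustrative values, not from the article
    for n in (10, 100, 400):
        means = sample_means(mu, sigma, n, trials=1000)
        # Columns: n, mean of the estimates, their observed variance, sigma^2 / n
        print(n,
              round(statistics.fmean(means), 2),
              round(statistics.variance(means), 4),
              sigma ** 2 / n)
```

<p>The observed variance of the estimates should track the last column, sigma squared over <em>n</em>, shrinking as <em>n</em> grows.</p>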
<p>Another fact about normal distributions is that a little over 95% of the probability mass is within <img width="24" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg18.png" alt="$ \pm 2$"> standard deviations of the mean. So, for a single experiment, we can reason that the true mean <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> is in the interval <!-- MATH<br />
 $\bar{x} \pm 2 \sigma/\sqrt{n}$<br />
 --><br />
<img width="81" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg19.png" alt="$ \bar{x} \pm 2 \sigma/\sqrt{n}$"> with 95% probability<a name="tex2html5" href="#foot136" id="tex2html5"><sup>4</sup></a>.</p>
<div align="center"><a name="86"></a>
<table>
<caption align="bottom"><strong>Figure 1:</strong> Confidence bounds on the estimate of <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> for different values of <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"></caption>
<tr>
<td>
<div align="center"><img width="370" height="183" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./fig1.png" alt="Image fig1"></div>
</td>
</tr>
</table>
</div>
<p>So, as <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> gets larger, we zoom in on <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> <a name="tex2html7" href="#foot89" id="tex2html7"><sup>5</sup></a>.</p>
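<p>Both facts used here, the probability mass within two standard deviations and the shrinking interval, are easy to compute. The sketch below uses only the Python standard library; the function names and the sample numbers (a sample mean of 74, a standard deviation of 8) are hypothetical, and sigma is treated as known for simplicity.</p>

```python
import math

def mass_within(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

def confidence_interval(x_bar, sigma, n, k=2.0):
    """Interval x_bar +/- k * sigma / sqrt(n)."""
    half_width = k * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

if __name__ == "__main__":
    print(round(mass_within(2), 4))   # a little over 0.95
    # Hypothetical sample mean and sigma; only n changes between rows
    for n in (16, 64, 256):
        print(n, confidence_interval(x_bar=74.0, sigma=8.0, n=n))
```

<p>Each fourfold increase in <em>n</em> halves the width of the interval, which is the "zooming in" shown in Figure 1.</p>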
<p>Now, back to the problem of checking for the difference of means. We&#8217;ll take <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> samples from population <img width="16" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg22.png" alt="$ A$"> and <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> from population <img width="17" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg23.png" alt="$ B$"> . Let&#8217;s assume for now that the variances are equal.</p>
<div align="center"><a name="93"></a>
<table>
<caption align="bottom"><strong>Figure 2:</strong> Confidence bounds overlap; means may not be truly different</caption>
<tr>
<td>
<div align="center"><img width="273" height="211" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./fig2.png" alt="Image fig2"></div>
</td>
</tr>
</table>
</div>
<p>With 95% probability, <!-- MATH<br />
 $\mu_A \in \bar{x}_A \pm 2\sigma/\sqrt{n}$<br />
 --><br />
<img width="131" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg24.png" alt="$ \mu_A \in \bar{x}_A \pm 2\sigma/\sqrt{n}$"> , and <!-- MATH<br />
 $\mu_B \in \bar{x}_B \pm 2\sigma/\sqrt{n}$<br />
 --><br />
<img width="132" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg25.png" alt="$ \mu_B \in \bar{x}_B \pm 2\sigma/\sqrt{n}$"> . If <!-- MATH<br />
 $|\bar{x}_A -<br />
\bar{x}_B|$<br />
 --><br />
<img width="72" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg26.png" alt="$ \vert\bar{x}_A - \bar{x}_B\vert$"> is small compared to <!-- MATH<br />
 $4 \sigma/\sqrt{n}$<br />
 --><br />
<img width="53" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg27.png" alt="$ 4 \sigma/\sqrt{n}$"> , then the two confidence intervals overlap substantially, and we cannot reject the null hypothesis that <!-- MATH<br />
 $\mu_A = \mu_B$<br />
 --><br />
<img width="66" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg28.png" alt="$ \mu_A = \mu_B$"> .</p>
<p>If, on the other hand, <!-- MATH<br />
 $|\bar{x}_A - \bar{x}_B|$<br />
 --><br />
<img width="72" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg26.png" alt="$ \vert\bar{x}_A - \bar{x}_B\vert$"> is large compared to <!-- MATH<br />
 $4 \sigma/\sqrt{n}$<br />
 --><br />
<img width="53" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg27.png" alt="$ 4 \sigma/\sqrt{n}$"> :</p>
<div align="center"><a name="109"></a>
<table>
<caption align="bottom"><strong>Figure 3:</strong> Confidence bounds don&#8217;t overlap; means are significantly different</caption>
<tr>
<td>
<div align="center"><img width="331" height="193" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./fig3.png" alt="Image fig3"></div>
</td>
</tr>
</table>
</div>
<p>then the confidence intervals are well separated, and we can reject the null hypothesis.</p>
<p>So <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> , the minimum significant distance &#8211; the &#8220;resolution&#8221; of the experiment &#8211; is roughly the distance at which the two confidence intervals just touch: <!-- MATH<br />
 $4 \sigma/\sqrt{n}$<br />
 --><br />
<img width="53" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg27.png" alt="$ 4 \sigma/\sqrt{n}$"> , if our desired significance level is 0.05.</p>
<div align="center"><a name="116"></a>
<table>
<caption align="bottom"><strong>Figure 4:</strong> Minimum significant distance for a given sample size <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"></caption>
<tr>
<td>
<div align="center"><img width="304" height="229" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/./fig4.png" alt="Image fig4"></div>
</td>
</tr>
</table>
</div>
<p>If <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> is too large, the experiment may be unable to detect important differences because the confidence intervals overlap too soon. This means that the sample size was too small (the test didn&#8217;t have enough power), and the experiment should be repeated with a larger test population.</p>
<p>If <img width="12" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg13.png" alt="$ \delta$"> is too small, then the experiment will potentially detect statistically significant differences that are, for all practical intents and purposes, meaningless. To go back to the Sneetch example, if the math exam has one hundred questions, then an effect size of two points would correspond to one group answering two additional questions correctly, on average. Practically speaking, that&#8217;s probably not a very big difference. But if we made the experiment big enough, about 250 Sneetches in each group, it would be a <em>statistically</em> significant difference at the 0.05 level. In theory, we could even make a difference of less than one point statistically significant! That is why knowing the effect size of a significant result is important.</p>
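<p>The touching-intervals rule of thumb can be turned around to ask how many samples a given resolution requires. The sketch below does that arithmetic. The function names are made up for illustration, and the standard deviation of 8 points is an assumption (it is not stated in this section), chosen because it roughly reproduces the "about 250" figure.</p>

```python
import math

def min_significant_distance(sigma, n):
    """Rule of thumb from the text: the gap at which two +/- 2*sigma/sqrt(n) intervals just touch."""
    return 4 * sigma / math.sqrt(n)

def required_sample_size(sigma, delta):
    """Smallest n per group for which 4 * sigma / sqrt(n) does not exceed delta."""
    return math.ceil((4 * sigma / delta) ** 2)

if __name__ == "__main__":
    sigma = 8.0   # assumed standard deviation of exam scores, in points
    for n in (16, 64, 256):
        print(n, min_significant_distance(sigma, n))
    print(required_sample_size(sigma, delta=2.0))   # group size needed to resolve 2 points
    print(required_sample_size(sigma, delta=1.0))   # group size needed to resolve 1 point
```

<p>Halving the resolution quadruples the required group size. Note that a formal two-sample test compares the difference in sample means to its own standard error, sigma times the square root of 2/<em>n</em>, so it would flag somewhat smaller gaps; the touching-intervals rule is a conservative approximation.</p>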
<h2><a name="SECTION00022000000000000000" id="SECTION00022000000000000000">&#8220;Significant&#8221; is not the same as &#8220;True&#8221;</a></h2>
<p>The power and significance level of a test play similar roles to the sensitivity and specificity of a diagnostic test. You&#8217;ll remember from Part 1 of this series<a name="tex2html11" href="#foot137" id="tex2html11"><sup>6</sup></a> that sensitivity and specificity are properties of the test, <em>not</em> of how the test performs in a given population. To know the practical accuracy of a screening test, you must know the underlying prevalence of the condition that it is screening for. If it is crucial that the screening not miss any positive cases, then the test will be designed to be highly sensitive, possibly at the cost of specificity. In that case, a large share of the test&#8217;s positive results will be false positives if the condition is relatively rare. And yet, this same screening test will produce a smaller share of false positives when used in a population where the condition is more prevalent.</p>
<p>The same is true for hypothesis tests. The probability that a statistically significant result is actually <em>true</em> depends on the underlying probability that results &#8220;of that type&#8221; tend to be true in the domain of study. It also depends on whether the researcher was trying to minimize the chance of a false positive error, or a false negative error.</p>
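<p>The dependence on the underlying rate of true results is the same Bayes-rule arithmetic used for screening tests. The following sketch shows it for hypothesis tests; the function name and the example priors are hypothetical, and it ignores complications such as bias or multiple testing.</p>

```python
def prob_significant_result_is_true(prior, power, alpha=0.05):
    """Fraction of significant results that reflect a real effect.

    prior: share of tested hypotheses in the field that are actually true
    power: probability of a significant result when the effect is real (1 - beta)
    alpha: probability of a significant result when there is no effect
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

if __name__ == "__main__":
    # Hypothetical fields of study, from "most hypotheses are true" to "very few are"
    for prior in (0.5, 0.1, 0.01):
        print(prior, round(prob_significant_result_is_true(prior, power=0.8), 3))
```

<p>With the power and significance level held fixed, the share of significant results that are actually true drops sharply as true hypotheses become rarer in the domain.</p>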
<p>You should also be careful interpreting the results of exploratory work, where the researchers have run a series of several different studies, but only highlight the &#8220;significant&#8221; ones. Running twenty experiments and having one of them return a significant result to the <img width="62" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg29.png" alt="$ p=0.05$"> level is actually not significant at all.</p>
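<p>The twenty-experiments warning follows from a one-line calculation. The sketch below assumes the experiments are independent and that no real effect exists in any of them; the function name is made up for illustration.</p>

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests is significant when no effect exists."""
    return 1 - (1 - alpha) ** num_tests

if __name__ == "__main__":
    for num_tests in (1, 5, 20):
        print(num_tests, round(prob_at_least_one_false_positive(num_tests), 3))
```

<p>For twenty tests at the 0.05 level the chance is roughly 64%, so a single "significant" result out of twenty is about what chance alone would produce.</p>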
<p>John Ioannidis discusses these points (and a few others) in his 2005 essay &#8220;Why Most Published Research Findings are False&#8221; [<a href="#Ion05">Ioa05</a>]. The essay made a few waves at the time of its publication, and it is still available online. We recommend that you read it, along with the 2007 followup article by Moonesinghe et al. [<a href="#Moon07">MKJ07</a>]. Now that you&#8217;ve read the first two installments of the Statistics to English translation, both essays should be a breeze!</p>
<h1><a name="SECTION00030000000000000000" id="SECTION00030000000000000000">Some Points to Remember</a></h1>
<ul>
<li>&#8220;Significant&#8221; is a statistical statement that an observed relationship is unlikely to be by chance. It is not necessarily a statement about the magnitude or the importance (or the truth!) of the relationship.</li>
<li>Knowing the effect size of a significant result will help you decide if the relationship is &#8220;practically significant.&#8221;</li>
<li>With a large enough sample size, any difference in means can appear significant, even when it is by chance.</li>
</ul>
<p>You now have a general idea what a &#8220;statistically significant result&#8221; is. The next installment will go into a little more technical detail of how significance is calculated. You should read that installment if you want to decipher statements in research papers like &#8220;<!-- MATH<br />
 $(F(2, 864) = 6.6, p = 0.0014)$<br />
 --><br />
<img width="202" height="31" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg30.png" alt="$ (F(2, 864) = 6.6, p = 0.0014)$"> &#8221; &#8212; or if you are simply curious.</p>
<h2><a name="SECTION00040000000000000000" id="SECTION00040000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Freedman07" id="Freedman07">FPP07</a></dt>
<dd>David Freedman, Robert Pisani, and Roger Purves, <i>Statistics</i>, 4th ed., W. W. Norton &amp; Company, New York, 2007.</dd>
<dt><a name="Ion05" id="Ion05">Ioa05</a></dt>
<dd>John P.&nbsp;A. Ioannidis, <i>Why most published research findings are false</i>, PLoS Med <b>2</b> (2005), no.&nbsp;8, e124, Available as <a href="http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124">http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124</a>.</dd>
<dt><a name="Kotz09" id="Kotz09">Kot09</a></dt>
<dd>Deborah Kotz, <i>10 salt shockers that could make hypertension worse</i>, U.S. News &amp; World Report (2009), Online as <a href="http://health.usnews.com/articles/health/heart/2009/07/20/10-salt-shockers-that-could-make-hypertension-worse.html">http://health.usnews.com/articles/health/heart/2009/07/20/10-salt-shockers-that-could-make-hypertension-worse.html</a>.</dd>
<dt><a name="Moon07" id="Moon07">MKJ07</a></dt>
<dd>Ramal Moonesinghe, Muin&nbsp;J Khoury, and A.&nbsp;Cecile J.&nbsp;W Janssens, <i>Most published research findings are false &#8212; but a little replication goes a long way</i>, PLoS Med <b>4</b> (2007), no.&nbsp;2, e28, Available as <a href="http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0040028">http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0040028</a>.</dd>
<dt><a name="Sachs84" id="Sachs84">Sac84</a></dt>
<dd>Lothar Sachs, <i>Applied statistics: A handbook of techniques</i>, 2nd ed., Springer-Verlag, New York, 1984.</dd>
<dt><a name="Spiegel08" id="Spiegel08">SS99</a></dt>
<dd>Murray&nbsp;R. Spiegel and Larry&nbsp;J. Stephens, <i>Schaum&#8217;s outline of statistics</i>, 4th ed., McGraw-Hill, New York, 1999.</dd>
</dl>
<p></p>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot133" id="foot133">&#8230; series</a><a href="#tex2html1"><sup>1</sup></a></dt>
<dd><tt><a name="tex2html2" href="http://www.win-vector.com/blog/category/statistics-to-english-translation/" id="tex2html2">http://www.win-vector.com/blog/category/statistics-to-english-translation/</a></tt></dd>
<dt><a name="foot134" id="foot134">&#8230; Sneetches</a><a href="#tex2html3"><sup>2</sup></a></dt>
<dd>&#8220;The Sneetches,&#8221; from <em>The Sneetches and Other Stories</em> by Dr. Seuss.<br />
<a href="http://www.youtube.com/watch?v=Ln3V0HgW4eM">http://www.youtube.com/watch?v=Ln3V0HgW4eM</a><br />
 and <a href="http://www.youtube.com/watch?v=s0LgMpfLD1Y">http://www.youtube.com/watch?v=s0LgMpfLD1Y</a>
</dd>
<dt><a name="foot135" id="foot135">&#8230; points</a><a href="#tex2html4"><sup>3</sup></a></dt>
<dd>This example is based on Exercise 10.17 in [<a href="#Spiegel08">SS99</a>]; the original exercise did not, unfortunately, involve Sneetches.</dd>
<dt><a name="foot136" id="foot136">&#8230; probability</a><a href="#tex2html5"><sup>4</sup></a></dt>
<dd>The correct way to state this is that for a given (unknown) <img width="14" height="28" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg1.png" alt="$ \mu $"> , the estimate <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg16.png" alt="$ \bar{x}$"> falls in the interval <!-- MATH<br />
 $\mu<br />
\pm 2 \sigma/\sqrt{n}$<br />
 --><br />
<img width="82" height="33" align="middle" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg20.png" alt="$ \mu \pm 2 \sigma/\sqrt{n}$"> just over 95% of the time. This gets awkward to reason about. Luckily, symmetry arguments let us center the appropriate confidence interval around <img width="13" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg16.png" alt="$ \bar{x}$"> instead.</dd>
<dt><a name="foot89" id="foot89">&#8230;</a><a href="#tex2html7"><sup>5</sup></a></dt>
<dd>Of course, we don&#8217;t actually know <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg21.png" alt="$ \sigma$"> , so we don&#8217;t know exactly how fast we zoom in. That doesn&#8217;t affect our argument, though, since only <img width="14" height="14" align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/ste2aimg2.png" alt="$ n$"> changes.</dd>
<dt><a name="foot137" id="foot137">&#8230; series</a><a href="#tex2html11"><sup>6</sup></a></dt>
<dd><a href="http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/">http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/</a></dd>
</dl>
<p></p>
<hr />
<address>Nina Zumel 2009-12-04</address>


]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
