<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog</title>
	<atom:link href="http://www.win-vector.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Mon, 20 May 2013 15:28:27 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Practical Data Science with R news</title>
		<link>http://www.win-vector.com/blog/2013/05/practical-data-science-with-r-news/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=practical-data-science-with-r-news</link>
		<comments>http://www.win-vector.com/blog/2013/05/practical-data-science-with-r-news/#comments</comments>
		<pubDate>Mon, 20 May 2013 15:27:48 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Administrativia]]></category>
		<category><![CDATA[Practical Data Science]]></category>
		<category><![CDATA[Practical Data Science with R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=2424</guid>
		<description><![CDATA[We have some great news for &#8220;Practical Data Science with R&#8221;: We are the Manning Deal of the Day May 21, 2013: Half off &#8220;Practical Data Science with R.&#8221; Use code dotd0521au at www.manning.com/zumel/ Our good friends at r-bloggers.com have been really helping promote the book! We have started an announcement page to point direct [...]<div class='yarpp-related-rss'>

Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2013/02/data-science-project-planning/' rel='bookmark' title='Data science project planning'>Data science project planning</a></li>
<li><a href='http://www.win-vector.com/blog/2013/04/data-science-machine-learning-and-statistics-what-is-in-a-name/' rel='bookmark' title='Data Science, Machine Learning, and Statistics: what is in a name?'>Data Science, Machine Learning, and Statistics: what is in a name?</a></li>
<li><a href='http://www.win-vector.com/blog/2012/04/setting-expectations-in-data-science-projects/' rel='bookmark' title='Setting expectations in data science projects'>Setting expectations in data science projects</a></li>
</ol>
</div>
]]></description>
				<content:encoded><![CDATA[<p>We have some great news for &#8220;Practical Data Science with R&#8221;:</p>
<ul>
<li>We are the Manning Deal of the Day May 21, 2013: Half off &#8220;Practical Data Science with R.&#8221; Use code <strong>dotd0521au</strong> at <a target="_blank" href="http://affiliate.manning.com/idevaffiliate.php?id=1273_360">www.manning.com/zumel/</a></li>
</ul>
<p><span id="more-2424"></span>
<ul>
<li>Our good friends at <a target="_blank" href="http://www.r-bloggers.com">r-bloggers.com</a> have been really helping promote the book!<br />
<a target="_blank" href="http://www.r-bloggers.com/big-news-practical-data-science-with-r-meap-launched/"><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2013/05/rbloggers.png" alt="Rbloggers" title="rbloggers.png" border="0" width="600" height="347" /></a></li>
<li>We have started <a target="_blank" href="http://www.win-vector.com/blog/practical-data-science-with-r/">an announcement page</a> to direct readers to the book, book forums, data, and<br />
the free preview chapter.</li>
<li>We have a new shorter URL <a target="_blank" href="http://practicaldatascience.com">practicaldatascience.com</a> to link to news and updates (we haven&#8217;t quite settled on where the link will point in the end; right now it points to our announcement page).</li>
</ul>
<p>As you can tell, this is going to be a very public book.  We are going to share as much as we can and listen as much as we can as we put this great book together.</p>
<div class='yarpp-related-rss'>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2013/02/data-science-project-planning/' rel='bookmark' title='Data science project planning'>Data science project planning</a></li>
<li><a href='http://www.win-vector.com/blog/2013/04/data-science-machine-learning-and-statistics-what-is-in-a-name/' rel='bookmark' title='Data Science, Machine Learning, and Statistics: what is in a name?'>Data Science, Machine Learning, and Statistics: what is in a name?</a></li>
<li><a href='http://www.win-vector.com/blog/2012/04/setting-expectations-in-data-science-projects/' rel='bookmark' title='Setting expectations in data science projects'>Setting expectations in data science projects</a></li>
</ol></p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2013/05/practical-data-science-with-r-news/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Big News! &#8220;Practical Data Science with R&#8221; MEAP launched!</title>
		<link>http://www.win-vector.com/blog/2013/05/big-news-practical-data-science-with-r-meap-launched/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=big-news-practical-data-science-with-r-meap-launched</link>
		<comments>http://www.win-vector.com/blog/2013/05/big-news-practical-data-science-with-r-meap-launched/#comments</comments>
		<pubDate>Wed, 15 May 2013 14:27:25 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Administrativia]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[book]]></category>
		<category><![CDATA[Manning Publications]]></category>
		<category><![CDATA[Practical Data Science]]></category>
		<category><![CDATA[Practical Data Science with R]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=2415</guid>
		<description><![CDATA[Nina Zumel and I ( John Mount ) have been working very hard on producing an exciting new book called &#8220;Practical Data Science with R.&#8221; The book has now entered Manning Early Access Program (MEAP) which allows you to subscribe to chapters as they become available and give us feedback before the book goes into [...]<div class='yarpp-related-rss'>

Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2012/04/setting-expectations-in-data-science-projects/' rel='bookmark' title='Setting expectations in data science projects'>Setting expectations in data science projects</a></li>
<li><a href='http://www.win-vector.com/blog/2013/04/data-science-machine-learning-and-statistics-what-is-in-a-name/' rel='bookmark' title='Data Science, Machine Learning, and Statistics: what is in a name?'>Data Science, Machine Learning, and Statistics: what is in a name?</a></li>
<li><a href='http://www.win-vector.com/blog/2013/02/data-science-project-planning/' rel='bookmark' title='Data science project planning'>Data science project planning</a></li>
</ol>
</div>
]]></description>
				<content:encoded><![CDATA[<p>Nina Zumel and I (John Mount) have been working very hard on producing an exciting new book called &#8220;<a target="_blank" href="http://affiliate.manning.com/idevaffiliate.php?id=1273_360">Practical Data Science with R</a>.&#8221;  The book has now entered the <a target="_blank" href="http://affiliate.manning.com/idevaffiliate.php?id=1273">Manning</a> Early Access Program (MEAP), which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.</p>
<p><a target="_blank" href="http://affiliate.manning.com/idevaffiliate.php?id=1273_360"><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2013/05/Zumel-PDSwithR-3.jpg" alt="Zumel PDSwithR 3" title="Zumel-PDSwithR-3.jpg" border="0"  /></a></p>
<p>Please subscribe to our book; your support now will help us improve it.  Please also forward this offer to your friends and colleagues (and please ask them to subscribe and forward it as well).<span id="more-2415"></span>
<p/>
<p><strike><a href="http://affiliate.manning.com/idevaffiliate.php?id=1273">Manning</a> is sharing a 50% off promotion code active until May 18, 2013:  <strong>pdswrco</strong> . </strike></p>
<p>Deal of the Day May 21 2013: Half off <a target="_blank" href="http://affiliate.manning.com/idevaffiliate.php?id=1273_360">Practical Data Science with R</a>. Use code <strong>dotd0521au</strong>.</p>
<p>Please subscribe to our <a target="_blank" href="http://affiliate.manning.com/idevaffiliate.php?id=1273_360">MEAP</a>!</p>
<div class='yarpp-related-rss'>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2012/04/setting-expectations-in-data-science-projects/' rel='bookmark' title='Setting expectations in data science projects'>Setting expectations in data science projects</a></li>
<li><a href='http://www.win-vector.com/blog/2013/04/data-science-machine-learning-and-statistics-what-is-in-a-name/' rel='bookmark' title='Data Science, Machine Learning, and Statistics: what is in a name?'>Data Science, Machine Learning, and Statistics: what is in a name?</a></li>
<li><a href='http://www.win-vector.com/blog/2013/02/data-science-project-planning/' rel='bookmark' title='Data science project planning'>Data science project planning</a></li>
</ol></p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2013/05/big-news-practical-data-science-with-r-meap-launched/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Bayesian and Frequentist Approaches: Ask the Right Question</title>
		<link>http://www.win-vector.com/blog/2013/05/bayesian-and-frequentist-approaches-ask-the-right-question/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=bayesian-and-frequentist-approaches-ask-the-right-question</link>
		<comments>http://www.win-vector.com/blog/2013/05/bayesian-and-frequentist-approaches-ask-the-right-question/#comments</comments>
		<pubDate>Mon, 06 May 2013 16:04:18 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[bayesian]]></category>
		<category><![CDATA[frequentist]]></category>
		<category><![CDATA[parameter estimation]]></category>
		<category><![CDATA[probability]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[significance]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=2404</guid>
		<description><![CDATA[It occurred to us recently that we don&#8217;t have any articles about Bayesian approaches to statistics here. I&#8217;m not going to get into the &#8220;Bayesian versus Frequentist&#8221; war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you [...]<div class='yarpp-related-rss'>

Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-significant-doesnt-always-mean-important/' rel='bookmark' title='Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2013/04/worry-about-correctness-and-repeatability-not-p-values/' rel='bookmark' title='Worry about correctness and repeatability, not p-values'>Worry about correctness and repeatability, not p-values</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
</ol>
</div>
]]></description>
				<content:encoded><![CDATA[<p>It occurred to us recently that we don&#8217;t have any articles about Bayesian approaches to statistics here. I&#8217;m not going to get into the &#8220;Bayesian versus Frequentist&#8221; war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you have the right question, then the right approach will naturally suggest itself to you. It could be a frequentist approach, it could be a bayesian one, it could be both &#8212; even while solving the same problem.</p>
<p>Let&#8217;s take the example that Bayesians love to hate: significance testing, especially in clinical trial style experiments. Clinical trial experiments are designed to answer questions of the form &#8220;Does treatment X have a discernible effect on condition Y, on average?&#8221; To be specific, let&#8217;s use the question &#8220;Does drugX reduce hypertension, on average?&#8221; Assuming that your experiment does show a positive effect, the statistical significance tests that you run should check for the sorts of problems that John discussed in our previous article, <a href="http://www.win-vector.com/blog/2013/04/worry-about-correctness-and-repeatability-not-p-values/">Worry about correctness and repeatability, not p-values</a>: What are the chances that an ineffective drug could produce the results that I saw? How likely is it that another researcher could replicate my results with the same size trial?</p>
<p>We can argue about whether or not the question we are answering is the <em>correct</em> question &#8212; but given that it <em>is</em> the question, the procedure to answer it and to verify the statistical validity of the results is perfectly appropriate.</p>
<p>So what is the correct question? From your family doctor&#8217;s viewpoint, a clinical trial answers the question &#8220;If I prescribe drugX to all my hypertensive patients, will their blood pressure improve, on average?&#8221; That isn&#8217;t the question (hopefully) that your doctor actually asks, though possibly your insurance company does. Your doctor should be asking &#8220;If I prescribe drugX to <em>this patient</em>, the one sitting in my examination room, will the patient&#8217;s blood pressure improve?&#8221; There is only one patient, so there is no such thing as &#8220;on average.&#8221;</p>
<p>If your doctor has a master&#8217;s degree in statistics, the question might be phrased as &#8220;If I prescribe drugX to this patient, what is the posterior probability that the patient&#8217;s blood pressure will improve?&#8221; And that&#8217;s a bayesian question.<span id="more-2404"></span></p>
<p>Let&#8217;s run through a small toy example. We will run a 500-patient clinical trial on drugX. All the patients have &#8220;moderately high&#8221; blood pressure and are similar in age, health, family history, and so on. We will measure whether or not drugX reduces their blood pressure to &#8220;normal&#8221; &#8212; somewhere in the region of 120/80. The control group will be on the sort of diet recommended for hypertensive patients (say a low-sodium, low-cholesterol, high-fiber diet) and will take a placebo. The treatment group will be on the same diet, plus drugX.</p>
<p>Now suppose that the diet alone will normalize blood pressure in about 10% of the population. And also suppose that (unknown to the researchers) there is a hidden factor HF (a genetic factor, perhaps) that moderates whether or not drugX actually works. There are two types of people. 90% of the population are HFA, and drugX has no effect on them. 10% of the population are HFB, and drugX completely normalizes blood pressure for 95% of the HFB population.</p>
<p>So you, the omniscient readers of this article, now know that drugX is only effective on about 9.5% of the general population (95% of the 10% who are HFB), although an overlapping 10% of the general population will show improvement from diet alone. This gives you the luxury of comparing the &#8220;right answer&#8221; with what could happen in an actual experiment. Now let&#8217;s see what might be observed.</p>
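<p>As a quick check on that arithmetic, here is a minimal sketch of the long-run improvement rates these assumptions imply. It assumes that the diet effect and the drug effect act independently of each other; the variable names are just for illustration.</p>
<pre># assumed population parameters, taken from the description above
p_diet = 0.1          # diet alone normalizes blood pressure
p_hfb = 0.1           # fraction of the population that is HFB
p_drug_hfb = 0.95     # drug effectiveness within HFB

# an HFB patient on the drug improves if the diet works OR the drug works
p_improve_hfb_drug = 1 - (1 - p_diet)*(1 - p_drug_hfb)   # 0.955

# HFA patients on the drug only get the diet effect
p_improve_drug = (1 - p_hfb)*p_diet + p_hfb*p_improve_hfb_drug
p_improve_control = p_diet

p_improve_drug     # 0.1855
p_improve_control  # 0.1</pre>
<p>So, under these assumptions, we should expect roughly 18.5% of the treatment group and 10% of the control group to improve in a very large trial.</p>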
<p>Here&#8217;s some R code to simulate the trial.</p>
<pre>#
# 2 populations: HFA, HFB. A not affected by the drug
#

n = 500;
spontaneous = 0.1 # effectiveness of diet alone
effectiveness = c(0, 0.95)
names(effectiveness) = c("HFA", "HFB")

# set the HF for the population
hfcoin = runif(n)
hf = ifelse(hfcoin &lt; 0.9, "HFA", "HFB")

# assign control and treatment groups
group = runif(n)
group = ifelse(group&lt;0.5, "control", "drug")

# assign outcomes
spontcoin = runif(n)
drugcoin = runif(n)
outcome = ( (spontcoin &lt; spontaneous) | 
            (drugcoin &lt; effectiveness[hf]*(group=="drug")) )

expframe=data.frame(group=group,hf=hf, improved=outcome)</pre>
<p>Here are the summaries I got when I ran the code:</p>
<pre>&gt; summary(expframe)
     group       hf       improved      
 control:255   HFA:449   Mode :logical  
 drug   :245   HFB: 51   FALSE:437      
                         TRUE :63       
                         NA's :0    

&gt; with(expframe[group=="control",], table(hf, improved))
     improved
hf    FALSE TRUE
  HFA   209   16
  HFB    26    4

&gt; with(expframe[group=="drug",], table(hf, improved))
     improved
hf    FALSE TRUE
  HFA   201   23
  HFB     1   20

&gt; tab = with(expframe, table(group, improved))
&gt; tab
         improved
group     FALSE TRUE
  control   235   20
  drug      202   43</pre>
<p>The last contingency table, <code>tab</code>, is the only one of the above summaries known to the researchers. From it, you can see that the drug group had a 100*43/(202+43) = 17.5% improvement rate, and the control group had a 100*20/(235+20) = 7.8% improvement rate. So, empirically, drugX more than doubled the probability of improvement (17.5/7.8 = 2.24 &#8212; this is called the <em>risk ratio</em>). If you think in odds like a gambler does (odds of improvement are 20 to 235 for the control group), then we have also more than doubled the odds of improvement ( (43/202)/(20/235) = 2.5 &#8212; this is called the <em>odds ratio</em>). Now we want to test if these results are real (and not a fluke).</p>
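<p>The risk ratio and odds ratio can be computed directly from the counts. This is only a sketch: <code>tab_check</code> is a hand-typed copy of the contingency table (so that the snippet runs on its own), and the other names are made up for illustration.</p>
<pre># counts copied from the contingency table above
tab_check = matrix(c(235, 202, 20, 43), nrow=2,
                   dimnames=list(group=c("control", "drug"),
                                 improved=c("FALSE", "TRUE")))

# improvement rate (risk) in each group
risk = tab_check[, "TRUE"] / rowSums(tab_check)
# odds of improvement in each group
odds = tab_check[, "TRUE"] / tab_check[, "FALSE"]

risk_ratio = risk[["drug"]] / risk[["control"]]   # about 2.24
odds_ratio = odds[["drug"]] / odds[["control"]]   # about 2.50</pre>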
<p><strong>Frequentist Approach</strong></p>
<p>One way to check the significance of the results (from a frequentist viewpoint) is to check whether the rows and columns of the contingency table <code>tab</code> are independent. Under the null hypothesis that improvement is independent of whether or not the patient took the drug, the odds of improvement should be the same for both the control and the drug groups. We can test this using Fisher&#8217;s Exact Test for Count Data (or we can use the chi-squared test, which is a large-sample approximation to Fisher&#8217;s exact test). In Fisher&#8217;s test, the null hypothesis is that the odds ratio is 1.</p>
<pre>&gt; fisher.test(tab)

	Fisher's Exact Test for Count Data

data:  tab 
p-value = 0.001158
alternative hypothesis: true odds ratio is not equal to 1 
95 percent confidence interval:
 1.384802 4.635614 
sample estimates:
odds ratio 
  2.496745</pre>
<p>So now we know that our results are significant at the 0.05 level (in fact, at the 0.01 level: if the drug had no effect, we would see a result this good or better no more than 1% of the time). We also have a 95% confidence interval for the odds ratio that runs from about 1.38 to 4.63 &#8212; definitely greater than one. Loosely speaking, if our estimate is about right, other researchers who repeat our experiment should usually see an odds ratio greater than one as well. So we can reject the null hypothesis and assume that drugX will increase the improvement rate in the population, relative to diet alone.</p>
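<p>For comparison, here is a sketch that runs both Fisher&#8217;s exact test and the chi-squared test on the same counts. Again <code>tab_check</code> is a hand-typed copy of the table, used so the snippet is self-contained. Note that the odds ratio reported by <code>fisher.test()</code> is a conditional maximum likelihood estimate, which is why it differs slightly from the sample odds ratio of about 2.5.</p>
<pre># counts copied from the contingency table above
tab_check = matrix(c(235, 202, 20, 43), nrow=2,
                   dimnames=list(group=c("control", "drug"),
                                 improved=c("FALSE", "TRUE")))

ft = fisher.test(tab_check)
# for 2x2 tables chisq.test() applies a continuity correction by default
ct = chisq.test(tab_check)

ft$p.value    # Fisher's exact test
ct$p.value    # chi-squared approximation
ft$conf.int   # 95 percent confidence interval for the odds ratio</pre>
<p>Both tests should lead to the same conclusion here; the exact p-values will differ somewhat because one is exact and the other is an approximation.</p>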
<p><strong>Bayesian Approach</strong></p>
<p>But what about the poor patient sitting in the doctor&#8217;s examination room? What are the chances that <em>his</em> blood pressure will improve if he takes drugX? Roughly 17%, which is better than the 10% chance from dietary changes alone, but still isn&#8217;t very high. Let&#8217;s verify this statement using the bayesian approach.</p>
<p>The bayesian approach assumes that the quantity that you are interested in, in this case the rate of improvement <em>p</em>, is distributed according to some distribution Prior(<em>p</em>). Once you have a set of observations, <em>x</em>, you update your estimate of the distribution to</p>
<p>Posterior(<em>p</em> | <em>x</em>) = C * Prior(<em>p</em>) * f(<em>x</em> | <em>p</em>),</p>
<p>where f is the probability of the data conditioned on the parameter, and C is a normalizing constant: one over the total probability of the data, summed over all possible settings of the parameter. Usually, calculating C is hard. Fortunately, for some common scenarios, like coin-flipping, calculating the posterior is quite easy.</p>
<p>Estimating the improvement rate <em>p</em> of drugX is a coin-flipping problem, where <em>p</em> is the (unknown) probability of the coin coming up heads. If you model the coin as a binomial distribution, and the distribution of <em>p</em> as a Beta distribution:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2013/05/NewImage3.png" alt="NewImage" title="NewImage.png" border="0"  /></p>
<p>then the posterior is also a Beta distribution, with α&#8217; = α + nheads and β&#8217; = β + ntails.</p>
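<p>If you would like to convince yourself of this update rule without doing the algebra, here is a small numerical sketch. The prior parameters and coin flips are hypothetical values chosen only for illustration; the check multiplies the prior by the binomial likelihood on a grid, normalizes numerically, and compares against the closed-form Beta posterior.</p>
<pre># hypothetical prior and data, chosen only for illustration
a = 2; b = 3            # Beta prior parameters
nheads = 7; ntails = 3  # observed coin flips

# grid of midpoints on (0, 1)
ngrid = 1000
p = (seq_len(ngrid) - 0.5)/ngrid

# Prior(p) * f(x | p), then normalize numerically so it integrates to 1
unnorm = dbeta(p, a, b) * dbinom(nheads, nheads + ntails, p)
post_grid = unnorm / (sum(unnorm)/ngrid)

# closed-form conjugate posterior
post_exact = dbeta(p, a + nheads, b + ntails)

max(abs(post_grid - post_exact))  # should be very close to zero</pre>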
<p>The mode of the distribution (which is what is usually used as a point estimate for <em>p</em>) is</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" alt="NewImage" src="http://www.win-vector.com/blog/wp-content/uploads/2013/05/NewImage.png" border="0" /></p>
<p>The mean of the distribution (which is close to the mode when α and β are large) is</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" alt="NewImage" src="http://www.win-vector.com/blog/wp-content/uploads/2013/05/NewImage2.png" border="0" /></p>
<p>Now back to our problem. Suppose that we already knew (never mind how) that the hypertension improvement rate from diet alone was about 10%. We can set the prior to have a mean value of 0.1 by setting α = 0.1 and β = 0.9. That looks like this:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" alt="Prior dist" src="http://www.win-vector.com/blog/wp-content/uploads/2013/05/prior_dist.png" border="0" /></p>
<p>It&#8217;s a nasty prior &#8212; notice it goes to infinity at both 0 and 1 &#8212; but it spreads the probability mass all along the unit interval, which is what we want, since we don&#8217;t want to start with a very strong bias about the improvement rate. Another common prior is the Jeffreys prior: α = β = 0.5. The Jeffreys prior is maximally uninformative (or minimally biased) and has a mean of 0.5.</p>
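<p>With a few hundred observations, the choice between these two weak priors should matter very little. Here is a quick sketch comparing the posterior mean for the drug group under each; the counts are copied from the contingency table and <code>posterior_mean</code> is a helper made up for this illustration.</p>
<pre>improved = 43       # drug group, from the contingency table
notimproved = 202

# mean of the Beta posterior after adding the observed counts to the prior
posterior_mean = function(alpha, beta) {
  (alpha + improved)/(alpha + beta + improved + notimproved)
}

posterior_mean(0.1, 0.9)  # prior with mean 0.1: about 0.1752
posterior_mean(0.5, 0.5)  # Jeffreys prior:      about 0.1768</pre>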
<p>Now let&#8217;s calculate the posterior, its mean and its mode, in R:</p>
<pre># The mean of the Beta distribution
beta_mean = function(alpha, beta) {
  alpha/(alpha+beta)
}

# The mode of the Beta Distribution
# (only defined when alpha and beta are both greater than 1)
beta_mode = function(alpha, beta) {
  (alpha-1)/(alpha+beta-2)
}

#  prior, mean 0.1, mode not defined
alpha = 0.1
beta = 0.9

# The values from the contingency table for the experiment
improved.control = tab[1,2]     # 20
notimproved.control = tab[1,1]  # 235
improved.drug = tab[2,2]        # 43
notimproved.drug = tab[2,1]     # 202

# update the distribution for the treatment group
alpha.drug = alpha + improved.drug
beta.drug = beta + notimproved.drug

# calculate the mean and the mode for the treatment group
beta_mean(alpha.drug, beta.drug) # 0.1752033
beta_mode(alpha.drug, beta.drug) # 0.172541

# update the distribution for the control group
alpha.control = alpha + improved.control
beta.control = beta + notimproved.control

# calculate the mean and the mode for the control group
beta_mean(alpha.control, beta.control) # 0.07851563
beta_mode(alpha.control, beta.control) # 0.07519685

# plot both distributions to compare
# the function dbeta() returns the value of the distribution
# at point x, for a given alpha and beta
library(reshape2)  # for melt()
library(ggplot2)   # for ggplot()
x=seq(from=0.0, to=0.3,by=0.005)
frame=melt(data.frame(x=x,
                      control=dbeta(x,alpha.control,beta.control),
                      drug=dbeta(x,alpha.drug,beta.drug)),
           measure.vars=c("control", "drug"),
           variable.name="treatment",
           value.name="y")
ggplot(frame, aes(x=x,y=y,color=treatment)) + geom_line()</pre>
<p><img style="display: block; margin-left: auto; margin-right: auto;" alt="Post compare" src="http://www.win-vector.com/blog/wp-content/uploads/2013/05/post_compare.png" border="0" /></p>
<p>The means and modes of both distributions are about where we estimated them from the naive calculations directly on the contingency table; if we use the mode as our point estimate of the improvement rates for both groups, then the spreads of the distributions give us the uncertainty around that estimate, based on the size of our data sample. The distributions don&#8217;t overlap much (the result we expected, based on our frequentist analysis); the two populations do in fact have different improvement rates. The difference (mostly a philosophical one, perhaps) is that this analysis gives us <em>directly</em> what our family doctor wants: an estimate of the posterior probability of a patient&#8217;s blood pressure improving when prescribed drugX. We can calculate what is called the <em>95% credible interval</em> for each distribution: the interval that with 95% probability contains the true improvement rate:</p>
<pre>credible_interval= function(conf, alpha, beta){
  p = (1-conf)
  lower = p/2
  upper = 1-lower
  c(qbeta(lower, alpha, beta), qbeta(upper, alpha, beta))
}

credible_interval(0.95, alpha.drug, beta.drug) 
# 0.1303853 0.2250201

credible_interval(0.95, alpha.control, beta.control)
# 0.04887709 0.11437549</pre>
<p>Based on this data, if you take drugX for your hypertension, the probability of normalizing your blood pressure is likely somewhere in the range of about 13 to 23 percent, compared to about 5 to 11 percent from diet alone. So you will improve your chance of normalizing your blood pressure &#8212; but it&#8217;s more likely that your blood pressure will remain high.</p>
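<p>Another question the posteriors can answer directly is &#8220;how probable is it that the drug group&#8217;s improvement rate really is higher than the control group&#8217;s?&#8221; Here is one possible sketch, using simulation rather than a closed form. It assumes the two posteriors are independent, uses the same α = 0.1, β = 0.9 prior, and hard-codes the counts from the contingency table; the seed and number of draws are arbitrary.</p>
<pre>set.seed(2013)  # arbitrary seed, for repeatability
ndraws = 100000

# posterior parameters: prior (0.1, 0.9) plus the observed counts
draws_drug = rbeta(ndraws, 0.1 + 43, 0.9 + 202)
draws_control = rbeta(ndraws, 0.1 + 20, 0.9 + 235)

# fraction of draws where the drug rate exceeds the control rate
mean(draws_drug &gt; draws_control)</pre>
<p>Since the two posterior distributions barely overlap, this fraction should come out very close to 1.</p>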
<p>The credible interval, by the way, is what most people <em>think</em> the confidence interval is. With 95% probability (based on the available evidence), the true improvement rate is in the 95% credible interval. The <em>95% confidence interval</em> is the interval that is produced by a construction procedure such that, if you repeated the experiment again and again, the constructed confidence interval contains the true improvement rate 95% of the time. This still makes it likely that you&#8217;ve bracketed the true improvement rate, and in practice, the confidence interval is probably a good stand-in for the credible interval. It&#8217;s just not really answering the question you actually asked, philosophically speaking.</p>
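<p>To see how close the two kinds of interval are in practice, here is a sketch that puts them side by side for the drug group. The confidence interval comes from <code>binom.test()</code>, which is just one of several ways to build a confidence interval for a proportion; the credible interval uses the same prior as before, with the counts copied from the contingency table.</p>
<pre># frequentist 95% confidence interval for the drug group's improvement rate
ci = binom.test(43, 43 + 202, conf.level=0.95)$conf.int

# bayesian 95% credible interval, using the prior alpha = 0.1, beta = 0.9
cred = qbeta(c(0.025, 0.975), 0.1 + 43, 0.9 + 202)

rbind(confidence=as.numeric(ci), credible=cred)</pre>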
<p><strong>The Hidden Factor</strong></p>
<p>Suppose the researchers had suspected that the hidden factor HF might be implicated in the drug&#8217;s performance, and had been able to measure it in the experiment.</p>
<pre>tabfull = aggregate(numeric(dim(expframe)[1])+1,
        by=list(expframe$group, expframe$hf, expframe$improved), FUN=sum)

&gt; tabfull
  Group.1 Group.2 Group.3   x
1 control     HFA   FALSE 209
2    drug     HFA   FALSE 201
3 control     HFB   FALSE  26
4    drug     HFB   FALSE   1
5 control     HFA    TRUE  16
6    drug     HFA    TRUE  23
7 control     HFB    TRUE   4
8    drug     HFB    TRUE  20</pre>
<p>In this case, we can also estimate the posterior probabilities of improvement for each group, using the bayesian approach. I&#8217;ll just give you the graph.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" alt="Post withhf" src="http://www.win-vector.com/blog/wp-content/uploads/2013/05/post_withhf.png" border="0" /></p>
<p>From this evidence, HFB people taking drugX have better than 75% probability of improving their blood pressure; everyone else has probability less than 25%. Just looking at the modes of the distributions, you might naively think that HFA people also have a higher improvement rate when they are taking drugX, or that HFB people have a higher improvement rate than HFA people even in the control group. But the distributions overlap substantially; there is no real evidence that the three groups on the left of the graph have different improvement rates. In other words, if your family doctor knows that you are type HFB, it would make sense to prescribe drugX for your high blood pressure; if you are type HFA, then it doesn&#8217;t.</p>
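<p>The numbers behind a graph like that one can be sketched as follows. This is not necessarily the exact code used for the plot: it simply applies the same α = 0.1, β = 0.9 prior to each of the four subgroups separately, with the counts typed in by hand from <code>tabfull</code>. The names <code>counts</code>, <code>prior.alpha</code> and <code>prior.beta</code> are made up for illustration.</p>
<pre># counts copied from tabfull above
counts = data.frame(group=c("control", "drug", "control", "drug"),
                    hf=c("HFA", "HFA", "HFB", "HFB"),
                    improved=c(16, 23, 4, 20),
                    notimproved=c(209, 201, 26, 1))

# same prior as before, updated separately for each subgroup
prior.alpha = 0.1
prior.beta = 0.9
counts$alpha.post = prior.alpha + counts$improved
counts$beta.post = prior.beta + counts$notimproved

# posterior mean and 95% credible interval for each subgroup
counts$mean = counts$alpha.post/(counts$alpha.post + counts$beta.post)
counts$lower = qbeta(0.025, counts$alpha.post, counts$beta.post)
counts$upper = qbeta(0.975, counts$alpha.post, counts$beta.post)
counts</pre>
<p>The HFB patients on drugX should stand well apart from the other three subgroups, whose intervals overlap one another.</p>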
<p>This is the kind of reasoning promoted by the <a href="http://en.wikipedia.org/wiki/Personalized_medicine">personalized medicine</a> movement. In fact it is what your family doctor already tries to do, by taking into account your family and previous health history, and so on. So far, your doctor can only do this in a negative way &#8212; if you have a family history of colon cancer, then start your annual colonoscopies sooner, otherwise, don&#8217;t bother &#8212; and as far as I know (though I&#8217;m not a doctor or a medical researcher) most published medical research isn&#8217;t designed to help doctors make &#8220;bayesian type&#8221; assessments in a more positive way.</p>
<p><strong>But Don&#8217;t Throw Out Frequentism</strong></p>
<p>So we&#8217;ve established that determining individual patient outcomes is a bayesian question. You might then wonder why anyone would use the frequentist approach at all. But some problems really are frequentist. A medical practitioner who is in public health rather than in a direct patient care practice is interested in the effects of treatments over entire populations, rather than on individuals. Similarly, an insurance company that is deciding whether or not to approve coverage for drugX is interested in whether the drug helps anyone, at all, or if the drug is no better than diet alone. In those situations, a frequentist analysis of drugX does in fact answer the question that is being asked.</p>
<div class='yarpp-related-rss'>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-significant-doesnt-always-mean-important/' rel='bookmark' title='Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2013/04/worry-about-correctness-and-repeatability-not-p-values/' rel='bookmark' title='Worry about correctness and repeatability, not p-values'>Worry about correctness and repeatability, not p-values</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
</ol></p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2013/05/bayesian-and-frequentist-approaches-ask-the-right-question/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>A pathological glm() problem that doesn&#8217;t issue a warning</title>
		<link>http://www.win-vector.com/blog/2013/05/a-pathological-glm-problem-that-doesnt-issue-a-warning/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=a-pathological-glm-problem-that-doesnt-issue-a-warning</link>
		<comments>http://www.win-vector.com/blog/2013/05/a-pathological-glm-problem-that-doesnt-issue-a-warning/#comments</comments>
		<pubDate>Wed, 01 May 2013 14:11:18 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[diverge]]></category>
		<category><![CDATA[generalized linear model]]></category>
		<category><![CDATA[GLM]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Newton-Raphson]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regularization]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=2395</guid>
		<description><![CDATA[I know I have already written a lot about technicalities in logistic regression (see for example: How robust is logistic regression? and Newton-Raphson can compute an average). But I just ran into a simple case where R&#8216;s glm() implementation of logistic regression seems to fail without issuing a warning message. Yes the data is a [...]
]]></description>
				<content:encoded><![CDATA[<p>I know I have already written a lot about technicalities in logistic regression (see for example: <a target="_blank" href="http://www.win-vector.com/blog/2012/08/how-robust-is-logistic-regression/">How robust is logistic regression?</a> and <a target="_blank" href="http://www.win-vector.com/blog/2012/08/newton-raphson-can-compute-an-average/">Newton-Raphson can compute an average</a>).  But I just ran into a simple case where <a target="_blank" href="http://cran.r-project.org">R</a>&#8217;s glm() implementation of logistic regression seems to fail without issuing a warning message.  Yes, the data is a bit pathological, but one would hope for a diagnostic or warning message from the fitter.<span id="more-2395"></span>Consider the following synthetic data set and glm() logistic regression fit (using &#8220;R version 3.0.0 (2013-04-03) &#8212; &#8216;Masked Marvel&#8217;&#8221; on OSX Mountain Lion):</p>
<pre>
> d &lt;- data.frame(x=c(rep(1,200),rep(0,25)),y=c(rep(1,24),rep(0,176),rep(1,25)))
> table(y=d$y,x=d$x)
   x
y     0   1
  0   0 176
  1  25  24
> m &lt;- glm(y~x,data=d,family=binomial(link='logit'))
> print(summary(m))

Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = d)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5056  -0.5056  -0.5056  -0.5056   2.0593  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    18.57    1304.53   0.014    0.989
x             -20.56    1304.53  -0.016    0.987

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 235.84  on 224  degrees of freedom
Residual deviance: 146.77  on 223  degrees of freedom
AIC: 150.77

Number of Fisher Scoring iterations: 17
</pre>
<p>Notice that no coefficient achieved significance and the error bars are in fact gigantic.  It is almost like when logistic regression tries to run a coefficient to infinity on linearly separable data.  In fact that is what is going on: for x=0 the y data is all in one class (separated or pseudo-separated), and any model of the form intercept (dc-term) = B and x-coefficient = -B - 1.9924 is a good model for large positive B.  Such a model is always correct on the x=1 distribution of y&#8217;s (the log-odds there are B + (-B - 1.9924) = -1.9924, which is log(24/176)), and it gets better at reproducing the x=0 distribution of y&#8217;s as B goes to infinity.  An ideal optimizer would run B to +infinity.  Likely the Newton method in glm() stopped because the Hessian became numerically ill-conditioned or the gradient became near zero (but in a region where the loss function was flat, so not a good indication of an optimum).  But that is the problem: glm() didn&#8217;t inform us of any issue.  It should have run to infinity (bad), but instead it just stopped without diagnostic signaling (also bad).</p>
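<p>The drift toward infinity can be checked directly.  The following is a minimal sketch, not part of the original analysis: <code>devianceAt()</code> is a helper name made up for this illustration, and it evaluates the deviance of the model with intercept B and x-coefficient -B - 1.9924 (written as <code>qlogis(24/200) - B</code>) for a few values of B.</p>
<pre>
# Rebuild the synthetic data from the post.
d = data.frame(x=c(rep(1,200),rep(0,25)),
               y=c(rep(1,24),rep(0,176),rep(1,25)))

# Deviance of the model: intercept = B, x-coefficient = qlogis(24/200) - B.
# (devianceAt is a hypothetical helper, not an R built-in.)
devianceAt = function(B) {
  eta = B + (qlogis(24/200) - B)*d$x
  -2*sum(dbinom(d$y, size=1, prob=plogis(eta), log=TRUE))
}

Bs = c(1,5,10,20)
devs = sapply(Bs, devianceAt)
print(data.frame(B=Bs, deviance=devs))
</pre>
<p>Under these assumptions the deviance should keep falling as B grows and level off near the 146.77 residual deviance reported above, which is the flat region the optimizer stopped in.</p>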
<p>The model is in fact good; it is the error bars that are a problem.  It should not be hard to bound coefficients that are running to infinity away from zero.  The standard error estimates (even if right) are in this case unable to show that the coefficients are nowhere near zero (the usual use of the standard error estimates from a model summary!).  It has always been a strange feature of logistic regression that it has problems with (and has to be defended from) data that is &#8220;too good.&#8221;  Many other methods (like decision trees) do not have this issue.</p>
<p>The solution is simple: add a regularization term in the optimizer (or add a reasonable prior if you are a Bayesian).  Regularizing of course spoils the coefficient error-bar calculations, but you could either work out the math for error bars on a regularized estimate or estimate the error bars empirically with some sort of bootstrap or other re-sampling scheme.</p>
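<p>One way to sketch the regularization idea in base R is to minimize the negative log-likelihood plus a quadratic penalty with <code>optim()</code>.  The penalty strength <code>lambda = 0.1</code>, the choice to penalize both coefficients, and the helper name <code>penalizedLoss</code> are all assumptions made for this illustration; this is not what glm() does internally.</p>
<pre>
# Rebuild the synthetic data from the post.
d = data.frame(x=c(rep(1,200),rep(0,25)),
               y=c(rep(1,24),rep(0,176),rep(1,25)))
X = cbind(1, d$x)   # design matrix: intercept column and x

lambda = 0.1        # hypothetical penalty strength, not from the post

# Negative log-likelihood plus an L2 penalty on both coefficients.
# plogis(..., log.p=TRUE) gives log-probabilities in a numerically stable way.
penalizedLoss = function(beta) {
  eta = as.numeric(X %*% beta)
  logLik = sum(d$y*plogis(eta, log.p=TRUE) + (1-d$y)*plogis(-eta, log.p=TRUE))
  -logLik + lambda*sum(beta^2)
}

fit = optim(c(0,0), penalizedLoss, method='BFGS')
print(fit$par)          # intercept, x-coefficient
print(fit$convergence)  # 0 means optim reports convergence
</pre>
<p>Because the penalty grows without bound as the coefficients grow, the penalized loss has a finite minimizer even though the unpenalized likelihood does not.  The error bars for such an estimate would still have to come from separate math or from re-sampling, as noted above.</p>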
<p>Unfortunately trying to regularize by adding some fuzzy data does not work until we add a fairly significant perturbation to the data:</p>
<pre>
> d$wt &lt;- 1
> fuzz &lt;- data.frame(x=c(1,1,0,0),y=c(1,0,1,0))
> fuzz$wt &lt;- 0.5
> d2 &lt;- rbind(d,fuzz)
> m2 &lt;- glm(y~x,data=d2,family=binomial(link='logit'),weights=wt)
Warning message:
In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
> summary(m2)

Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = d2, 
    weights = wt)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9878  -0.5099  -0.5099  -0.5099   2.0516  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)    3.932      1.428   2.754  0.00589 ** 
x             -5.906      1.444  -4.090 4.31e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.37  on 228  degrees of freedom
Residual deviance: 153.95  on 227  degrees of freedom
AIC: 151.75

Number of Fisher Scoring iterations: 6
</pre>
<p>But to even try these fixes you would have to know you have a problem.  Right now the only sign of a problem is the enormous error bars.</p>
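<p>Since glm() stays silent here, a crude after-the-fact check can be sketched as follows.  The thresholds (standard errors above 100, fitted probabilities within 1e-6 of 0 or 1) are arbitrary values chosen for this illustration, not established cutoffs.</p>
<pre>
# Rebuild the synthetic data and the original fit from the post.
d = data.frame(x=c(rep(1,200),rep(0,25)),
               y=c(rep(1,24),rep(0,176),rep(1,25)))
m = glm(y~x, data=d, family=binomial(link='logit'))

se = sqrt(diag(vcov(m)))   # coefficient standard errors
p = fitted(m)              # fitted probabilities

# Flag huge standard errors or fitted probabilities pinned near 0 or 1.
# Both cutoffs are hypothetical choices for this sketch.
hugeSE = any(se > 100)
pinned = 1e-6 > min(pmin(p, 1-p))
suspicious = hugeSE || pinned
print(c(hugeSE=hugeSE, pinned=pinned))
</pre>
<p>A check like this only reports symptoms of separation; inspecting <code>table(y=d$y,x=d$x)</code> for empty cells, as done at the top of this post, remains the more direct diagnosis.</p>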
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2013/05/a-pathological-glm-problem-that-doesnt-issue-a-warning/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Prefer = for assignment in R</title>
		<link>http://www.win-vector.com/blog/2013/04/prefer-for-assignment-in-r/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=prefer-for-assignment-in-r</link>
		<comments>http://www.win-vector.com/blog/2013/04/prefer-for-assignment-in-r/#comments</comments>
		<pubDate>Wed, 24 Apr 2013 03:38:10 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[assignment]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[style]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=2392</guid>
		<description><![CDATA[We share our opinion that = should be preferred to the more standard &#60;- for assignment in R. This is from a draft of the appendix of our upcoming book. This has the risk of becoming an R version of Javascript&#8217;s semicolon controversy, but here you have it. R has five common assignment operators: &#8220;=&#8220;, [...]
]]></description>
				<content:encoded><![CDATA[<p>We share our opinion that <code>=</code> should be preferred to the more standard <code>&lt;-</code> for assignment in <a target="_blank" href="http://cran.r-project.org">R</a>.  This is from a draft of the appendix of our upcoming book.  This has the risk of becoming an R version of <a target="_blank" href="http://christianheilmann.com/2012/04/16/of-parser-fetishists-and-semi-colons/">Javascript&#8217;s semicolon controversy</a>, but here you have it.<span id="more-2392"></span>
<p/>
<p>R has five common assignment operators: &#8220;<code>=</code>&#8221;, &#8220;<code>&lt;-</code>&#8221;, &#8220;<code>-&gt;</code>&#8221;, &#8220;<code>&lt;&lt;-</code>&#8221; and &#8220;<code>-&gt;&gt;</code>&#8221;.  Traditionally in R <code>&lt;-</code> is the preferred assignment operator and <code>=</code> is thought of as an amateurish alias for it.</p>
<p>The <code>&lt;-</code> notation is preferred by some for the very good reason that <code>&lt;-</code> always means assignment, whereas <code>=</code> can mean assignment, function argument binding or case statement depending on context.  However, in our opinion, R allows you to type <code>&lt;-</code> in too many places (such as inside expressions), and it is usually an easier bug to find when you typed <code>=</code> where you meant <code>&lt;-</code> than the other way around.</p>
<p>We prefer to get into the habit of never typing <code>&lt;-</code>, because accidentally typing <code>&lt;-</code> instead of <code>=</code> in a function call can cause a non-reported error.  Consider the following code fragment demonstrating how we can use <code>=</code> to bind values to function arguments:</p>
<pre>
> divide = function(numerator,denominator) { numerator/denominator }
> divide(1,2)
[1] 0.5
> divide(2,1)
[1] 2
> divide(denominator=2,numerator=1)
[1] 0.5
</pre>
<p>Now consider the following (deliberate) error, where by habit we typed <code>&lt;-</code> instead of <code>=</code>:</p>
<pre>
> divide(denominator&lt;-2,numerator&lt;-1)
[1] 2
> denominator
[1] 2
</pre>
<p>We quietly get the wrong answer and contaminate the values of <code>numerator</code> and <code>denominator</code> in the global name space.  This is a simple example of where typing <code>&lt;-</code> where <code>=</code> was intended causes a non-signaling bug.  We don&#8217;t know of any simple example (other than building examples that intend side-effects) where typing <code>=</code> where you meant <code>&lt;-</code> is an error.  So we prefer <code>=</code>.</p>
<p>The <code>-&gt;</code> operator is just a left-to-right assignment that lets you write things like <code>5 -&gt; x</code>. It is cute, but not game changing.  The <code>&lt;&lt;-</code> and <code>-&gt;&gt;</code> operators are to be avoided unless you actually need their special abilities. They undo one of the important safety points about functions. When a variable is assigned inside a function this assignment is local to the function. That is, nobody outside of the function ever sees the effect, so the function can safely use variables to store intermediate calculations without clobbering same-named outside variables. The <code>&lt;&lt;-</code> and <code>-&gt;&gt;</code> operators reach outside of this protected scope and cause outside side effects. Side effects seem great when you need them, but on balance they make code maintenance, debugging and documentation much harder.</p>
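<p>The scoping difference can be seen in a small sketch.  The names <code>total</code>, <code>addLocal</code> and <code>addOuter</code> are invented for this illustration, and the example uses the right-facing <code>-&gt;&gt;</code> form, which has the same effect as the more common left-facing one.</p>
<pre>
total = 0

# Ordinary assignment inside a function only creates a local variable.
addLocal = function(v) { total = total + v; total }
r1 = addLocal(5)
afterLocal = total    # the outer total as seen after the local assignment

# The scope-escaping operator writes to the variable outside the function.
addOuter = function(v) { total + v ->> total }
addOuter(5)
afterOuter = total    # the outer total as seen after the escaping assignment

print(c(r1=r1, afterLocal=afterLocal, afterOuter=afterOuter))
</pre>
<p>The first function returns its result and leaves the outer <code>total</code> alone; the second silently changes it, which is exactly the kind of side effect that makes later debugging harder.</p>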
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2013/04/prefer-for-assignment-in-r/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Data Science, Machine Learning, and Statistics: what is in a name?</title>
		<link>http://www.win-vector.com/blog/2013/04/data-science-machine-learning-and-statistics-what-is-in-a-name/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=data-science-machine-learning-and-statistics-what-is-in-a-name</link>
		<comments>http://www.win-vector.com/blog/2013/04/data-science-machine-learning-and-statistics-what-is-in-a-name/#comments</comments>
		<pubDate>Fri, 19 Apr 2013 17:48:18 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[data science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Pragmatic Data Science]]></category>
		<category><![CDATA[Pragmatic Machine Learning]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[data science project planning]]></category>
		<category><![CDATA[information science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Project Management]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=2386</guid>
		<description><![CDATA[A fair complaint when seeing yet another &#8220;data science&#8221; article is to say: &#8220;this is just medical statistics&#8221; or &#8220;this is already part of bioinformatics.&#8221; We certainly label many articles as &#8220;data science&#8221; on this blog. Probably the complaint is slightly cleaner if phrased as &#8220;this is already known statistics.&#8221; But the essence of the [...]
]]></description>
				<content:encoded><![CDATA[<p>A fair complaint when seeing yet another &#8220;data science&#8221; article is to say: &#8220;this is just medical statistics&#8221; or &#8220;this is already part of bioinformatics.&#8221;  We certainly label many articles as &#8220;data science&#8221; on this blog.  Probably the complaint is slightly cleaner if phrased as &#8220;this is already known statistics.&#8221;  But the essence of the complaint is a feeling that novelty is being claimed for putting old wine in new bottles.   Rob Tibshirani nailed this type of distinction in his famous <a target="_blank" href="http://www-stat.stanford.edu/~tibs/stat315a/glossary.pdf">machine learning versus statistics glossary</a>. </p>
<p>I&#8217;ve written about <a target="_blank" href="http://www.win-vector.com/blog/2012/05/the-differing-perspectives-of-statistics-and-machine-learning/">statistics</a> vs. <a target="_blank" href="http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/">machine learning</a>, but I would like to explain why we (the authors of this blog) often use the term data science.  <a target="_blank" href="http://www.win-vector.com/Staff/NinaZumel/NinaZumel.html">Nina Zumel</a> <a target="_blank" href="http://www.win-vector.com/blog/2012/09/on-being-a-data-scientist/">explained being a data scientist</a> very well; I am going to take a swipe at explaining data science.</p>
<p>We (the authors of this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered as parts of a larger end-to-end process.  The process we are interested in is the deployment of useful data-driven models into production.   The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques and applying statistical criticism.   The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use.  I tend to use it a lot, because while I love the tools and techniques, my true loyalty is to the whole process (and I want to emphasize this to our readers).</p>
<p>The phrase &#8220;data science&#8221; as we use it today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher).  I myself worked in a &#8220;computational sciences&#8221; group in the mid-1990s (this group emphasized simulation-based modeling of small molecules and their biological interactions; the naming was an attempt to emphasize computation over computers).  So for me &#8220;data science&#8221; seems like a good term when your work is driven by data (versus driven by computer simulations).  For some people data science is considered a new calling and for others it is a faddish misrepresentation of work that has already been done.  I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature.  In this article I will try to describe (but not fully defend) my opinion.<span id="more-2386"></span>
<p/>
<p>My breakdown of the different information sciences is given below (I try to treat each with the respect it deserves, so I am certain to offend all).  For this article I am most interested in the fields that lean towards modeling, so I will tend to move on quickly from fields that are not centered on it.</p>
<h2>The nature of statistics</h2>
<p>Statistics is the original computing with data.  It is the field that deals with data with the most portability (it isn&#8217;t dependent on one type of physical model) and rigor.  Statistics can be a pessimistic field: statisticians are the masters of anticipating what can go wrong with experiments and what fallacies can be drawn from naive uses of data.  Statistics has enough techniques to solve just about any problem, but it also has an inherent conservatism to it.</p>
<p>I often say the best source of good statistical work is bad experiments.  If all experiments were well conducted, we wouldn&#8217;t need a lot of statistics.  However, we live in the real world; most experiments have significant shortcomings and statistics is incredibly valuable.</p>
<p>Another aspect of statistics is it is the only field that really emphasizes the risks of small data.  There are many other potential data problems statistics describes well (like <a href="http://en.wikipedia.org/wiki/Simpson's_paradox">Simpson&#8217;s paradox</a>), but statistics is fairly unique in the information sciences in emphasizing the risks of trying to reason from small datasets.  This is actually very important: datasets that are expensive to produce (such as drug trials) are necessarily small.  </p>
<p>It is only recently that minimally curated big datasets became perceived as being inherently valuable (the earlier attitude being closer to <a target="_blank" href="http://en.wikipedia.org/wiki/Garbage_in,_garbage_out">GIGO</a>).  And in some cases big data is promoted as valuable only because it is the cheapest to produce.  Often a big dataset (such as logs of all clicks seen on a search engine) is useful largely because it is a good proxy for a smaller dataset that is too expensive to actually produce (such as interviewing a good cross section of search engine users as to their actual intent).</p>
<p>If your business is directly producing truly valuable data (not just producing useful proxy data) you likely have small data issues.  If you have any hint of a small data issue, you want to consult with a good statistician.</p>
<h2>The nature of machine learning</h2>
<p>In some sense machine learning rushes where statisticians fear to tread.  Machine learning does have some concept of small data issues (such as knowing about over-fitting), but it is an essentially optimistic field.  </p>
<p>The goal of machine learning is to create a predictive model that is indistinguishable from a correct model.  This is an operational attitude that tends to offend statisticians who want a model that not only appears to be accurate but is in fact correct (i.e. also has some explanatory value).</p>
<p>My opinion is that the best machine learning work is an attempt to re-phrase prediction as an optimization problem (see for example:  Bennett, K. P., &#038; Parrado-Hernandez, E. (2006). The Interplay of Optimization and Machine Learning Research. Journal of Machine Learning Research, 7, 1265–1281).  Good machine learning papers use good optimization techniques and bad machine learning papers (most of them in fact) use bad, out-of-date, ad hoc optimization techniques.</p>
<h2>The nature of data mining</h2>
<p>Data mining is a term that was quite hyped and is now somewhat derided.  One of the reasons more people use the term &#8220;data science&#8221; nowadays is that they are loath to say &#8220;data mining&#8221; (though in my opinion the two activities have different goals).  </p>
<p>The goal of data mining is to find relations in data, not necessarily to make predictions or come up with explanations.   Data mining is often what I call &#8220;an x&#8217;s only enterprise&#8221; (meaning you have many driver or &#8220;independent&#8221; variables but no pre-ordained outcome or &#8220;dependent&#8221; variables) and some of the typical goals are clustering, outlier detection and characterization.</p>
<p>There is a sense that when it was called exploratory statistics it was considered boring, but when it was called data mining it was considered sexy.  Actual exploratory statistics (as defined by Tukey) is exciting and always an important &#8220;get your hands into the data&#8221; step of any predictive analytics project.</p>
<h2>The nature of informatics</h2>
<p>Informatics and in particular bioinformatics are very hot terms.  A lot of good data scientists (a term I will explain later) come from the bioinformatics field.</p>
<p>Once we separate out the portions of bioinformatics that are in fact statistics and the ones that are in fact biology we are left with data infrastructure and matching algorithms.  We have the creation and management of data stores and databases, and the design of efficient matching and query algorithms.  This isn&#8217;t meant to be a left-handed compliment: algorithms are a first love of mine and some of the matching algorithms bioinformaticians use (like <a target="_blank" href="http://en.wikipedia.org/wiki/Ukkonen%27s_algorithm">online suffix trees</a>) are quite brilliant.</p>
<h2>The nature of big data</h2>
<p>Big data is a white-hot topic.  The thing to remember is: it is just the infrastructure (MapReduce, Hadoop, noSQL and so on).  It is the platform you perform modeling (or usually just report generation) on top of.</p>
<h2>The nature of predictive analytics</h2>
<p>Wikipedia defines <a target="_blank" href="http://en.wikipedia.org/wiki/Predictive_analytics">Predictive analytics</a> as the &#8220;&#8230; variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events.&#8221;  It is a set of goals and techniques emphasizing making models.  It is very close to what is also meant by data science.</p>
<p>I don&#8217;t tend to use the term predictive analytics because I come from a probability, simulation, algorithms and machine learning background and not from an analytics background.  To my ear analytics is more associated with visualization, reporting and summarization than with modeling.  I also try to use the term modeling over prediction (when I remember) as prediction often in non-technical English implies something like forecasting into the future (which is but one modeling task).</p>
<h2>The nature of data science</h2>
<p>The Wikipedia defines <a target="_blank" href="http://en.wikipedia.org/wiki/Data_science">data science</a> as a field that &#8220;incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.&#8221;</p>
<p>Data science is a term I use to represent the ownership and management of the entire modeling process: discovering the true business need, collecting data, managing data, building models and deploying models into production.  </p>
<h2>Conclusion</h2>
<p>Machine learning and statistics may be the stars, but data science is the whole show.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2013/04/data-science-machine-learning-and-statistics-what-is-in-a-name/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Checking claims in published statistics papers</title>
		<link>http://www.win-vector.com/blog/2013/04/checking-claims-in-published-statistics-papers/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=checking-claims-in-published-statistics-papers</link>
		<comments>http://www.win-vector.com/blog/2013/04/checking-claims-in-published-statistics-papers/#comments</comments>
		<pubDate>Mon, 08 Apr 2013 21:30:38 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[linear regression]]></category>
		<category><![CDATA[synthetic dataset]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=2385</guid>
		<description><![CDATA[When finishing Worry about correctness and repeatability, not p-values I got to thinking a bit more about what can you actually check when reading a paper, especially when you don&#8217;t have access to the raw data. Some of the fellow scientists I admire most have a knack for back of the envelope calculations and dimensional [...]
]]></description>
				<content:encoded><![CDATA[<p>When finishing <a target="_blank" href="http://www.win-vector.com/blog/2013/04/worry-about-correctness-and-repeatability-not-p-values/">Worry about correctness and repeatability, not p-values</a> I got to thinking a bit more about what you can actually check when reading a paper, especially when you don&#8217;t have access to the raw data.  Some of the fellow scientists I admire most have a knack for back-of-the-envelope calculations and dimensional-analysis style calculations.  They could always read a few facts off a presentation that the presenter may not have meant to share.  There is a <a target="_blank" href="http://www.win-vector.com/blog/2009/05/the-joy-of-calculation/">joy in calculation</a> and figuring, so I decided it would be a fun challenge to see if you could check any of the claims of &#8220;Association between muscular strength and mortality in men: prospective cohort study,&#8221; Ruiz et al. <a target="_blank" href="http://www.bmj.com/content/337/bmj.a439">BMJ 2008;337:a439</a> from just the summary tables supplied in the paper itself.<span id="more-2385"></span>
<p/>
<p>The main summary you can extract from the paper is the distribution of the categorical variables.  By combining the numbers from table 1 and table 2 you can compile a small overall table like the following:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2013/04/Summaries.png" alt="Summaries" title="Summaries.png" border="0" width="572" height="187" /></p>
<p>For each of the ten conditions listed as row headers we can find in the paper: the total number of deceased subjects who had the condition, the total number of subjects who had the condition, and how subjects with the condition are distributed among the muscular strength groups.  This is in fact enough information to recover the complete linear regression design matrix for mortality models involving the three muscular strength groups and at most one of the remaining seven categorical variables.  From this you can fit coefficients, find p-values, perform an ANOVA and so on.</p>
<p>We wanted to see if we could do a bit more.  Could we build a synthetic dataset that obeyed all of the roll-ups (or margins) as shown in the above table?  If we could build a complete synthetic data set that claimed to have 8762 individuals in it and matched all of the known summaries then we could try to reproduce some of the paper results, without having to monkey with the details of fitting (we could use the standard fitters already found in <a target="_blank" href="http://cran.us.r-project.org">R</a>).</p>
<p>To do this we decided to think about how many possible types of individual could be distinguished in this kind of study.  Since we have the three exclusive (and complete) muscular groups, seven more binary conditions and a single binary outcome there are exactly 3*(2^7)*2 = 768 possible types of individual.   Our idea was to assign to each of these 768 possible individual signatures a weight representing how many individuals with the given signature are in our synthetic dataset.  These weights would completely determine our data.  All we would have to do is ensure:</p>
<ol>
<li>The weights are all non-negative.</li>
<li>The weights sum up to 8762.</li>
<li>The 768 possible signatures when summed up proportional to the weights match all of the summaries in our table.</li>
</ol>
<p>These are just linear constraints (on the weights) together with non-negativity constraints.  That is a linear program, solvable by a <a target="_blank" href="http://www.win-vector.com/blog/2012/11/yet-another-java-linear-programming-library/">linear programming package</a>.  Now the weights are not uniquely determined (because the pairwise correlations and higher-order correlations between the non-muscular factors are not known).  But we can ask the solver to favor various directions to get different synthetic data sets.</p>
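<p>The bookkeeping behind this can be sketched in a few lines of R.  This is an illustration only, not the code used for this article: the column names are the ones that appear in the fits further down, and the uniform weights are a placeholder assumption, not a solution that matches the paper&#8217;s table.</p>
<pre>
# enumerate every possible individual signature:
# 3 exclusive strength groups x 7 binary conditions x 1 binary outcome
conditions = c('sedentary','current.smoker','five.drinks.weekly',
   'diabetes.millitus','hypertension','hyercholesterolaemia',
   'family.cardiovascular')
choices = c(list(MuscularStrength=c('lower','middle','upper')),
   setNames(rep(list(0:1),length(conditions)),conditions),
   list(deceased=0:1))
signatures = expand.grid(choices,stringsAsFactors=FALSE)
nSig = dim(signatures)[[1]]
print(nSig)   # 3*(2^7)*2

# placeholder weights (assumption: uniform, NOT fit to the paper)
wt = rep(8762/nSig,nSig)

# every summary in the table is a linear function of the weights
total = sum(wt)
nSedentary = sum(wt*signatures$sedentary)
nSedentaryDeceased = sum(wt*signatures$sedentary*signatures$deceased)
nSedentaryLower = sum(wt*signatures$sedentary*
   (signatures$MuscularStrength=='lower'))
print(c(total,nSedentary,nSedentaryDeceased,nSedentaryLower))
</pre>
<p>A linear programming solver is then asked for any non-negative weight vector that makes sums like these equal the corresponding table entries.</p>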
<p>Below is a typical ideal sample.  Notice it only used 34 of the individual types (out of 768 possible) and fills out to 8762 individuals by repetitions proportional to the &#8220;wt&#8221; column.  Notice also some weights are fractional, so if we want an un-weighted data set with 8762 rows we are going to have to round the weights to integers (which will perturb all of the counts slightly).</p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2013/04/sample.png" alt="Sample" title="sample.png" border="0" width="600" height="546" /></p>
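<p>Turning a weighted sample into an un-weighted one is mostly a matter of repeating rows.  A minimal sketch, using a made-up three row sample (the weights here are hypothetical and not taken from our solution):</p>
<pre>
# toy weighted sample; the weights are made up for illustration
s = data.frame(MuscularStrength=c('lower','middle','upper'),
   deceased=c(1,0,0),
   wt=c(2.4,3.4,2.2))
s$n = round(s$wt)   # integer number of copies of each signature
idx = rep(seq_len(dim(s)[[1]]),times=s$n)
rows = s[idx,c('MuscularStrength','deceased')]
print(sum(s$wt))        # weighted size
print(dim(rows)[[1]])   # un-weighted size after rounding
</pre>
<p>The two sizes need not agree, which is the small perturbation of counts mentioned above.</p>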
<p>The above synthetic data set found the same association between low muscle strength and death as the original study.  But we in fact generated 10 of these synthetic samples (by randomly favoring different signatures).  A few of these synthetic data sets did not reproduce the results.  For example the synthetic data set we call group-2 shows the claimed relation when the muscle strength levels are the only variables, but loses the relation when the other conditions are added to the model.</p>
<pre>
&gt; print(summary(lm(deceased~0
   +MuscularStrength.lower
   +MuscularStrength.middle
   +MuscularStrength.upper,data=dg)))

Call:
lm(formula = deceased ~ 0 + MuscularStrength.lower + MuscularStrength.middle + 
    MuscularStrength.upper, data = dg)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.07329 -0.07329 -0.04995 -0.04896  0.95104 

Coefficients:
                        Estimate Std. Error t value Pr(&gt;|t|)    
MuscularStrength.lower  0.073288   0.004300   17.04   &lt;2e-16 ***
MuscularStrength.middle 0.048956   0.004299   11.39   &lt;2e-16 ***
MuscularStrength.upper  0.049949   0.004298   11.62   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2324 on 8761 degrees of freedom
Multiple R-squared:  0.0596,	Adjusted R-squared:  0.05927 
F-statistic: 185.1 on 3 and 8761 DF,  p-value: &lt; 2.2e-16
</pre>
<p>That is, for this synthetic data set the conclusions of the paper do not hold (see below).  </p>
<pre>
&gt; print(summary(lm(deceased~0
   +MuscularStrength.lower
   +MuscularStrength.middle
   +MuscularStrength.upper
   +sedentary
   +current.smoker
   +five.drinks.weekly
   +diabetes.millitus
   + hypertension
   +hyercholesterolaemia 
   +family.cardiovascular,data=dg)))

Call:
lm(formula = deceased ~ 0 + MuscularStrength.lower + MuscularStrength.middle + 
    MuscularStrength.upper + sedentary + current.smoker + five.drinks.weekly + 
    diabetes.millitus + hypertension + hyercholesterolaemia + 
    family.cardiovascular, data = dg)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22271 -0.06749 -0.03845 -0.02006  0.98879 

Coefficients:
                         Estimate Std. Error t value Pr(&gt;|t|)    
MuscularStrength.lower   0.012554   0.006748   1.860  0.06287 .  
MuscularStrength.middle -0.005834   0.006327  -0.922  0.35650    
MuscularStrength.upper  -0.004142   0.006069  -0.682  0.49499    
sedentary               -0.029898   0.009790  -3.054  0.00226 ** 
current.smoker           0.002652   0.009193   0.289  0.77296    
five.drinks.weekly       0.025898   0.005587   4.636 3.61e-06 ***
diabetes.millitus        0.131324   0.016718   7.855 4.46e-15 ***
hypertension             0.068976   0.006191  11.141  &lt; 2e-16 ***
hyercholesterolaemia     0.042745   0.006680   6.399 1.64e-10 ***
family.cardiovascular    0.069631   0.008404   8.285  &lt; 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2289 on 8754 degrees of freedom
Multiple R-squared:  0.08821,	Adjusted R-squared:  0.08717 
F-statistic: 84.69 on 10 and 8754 DF,  p-value: &lt; 2.2e-16
</pre>
<p>Now of course the paper is claiming the conclusion holds on a single real dataset, not that it would hold on all datasets with the same summary statistics.  But it does mean: the tables given in the paper are not specific enough to entail the claimed result.  To confirm the paper&#8217;s result you would need more detailed access to their data.  In the future refereeing will improve to the point where you can&#8217;t expect to publish claims without releasing the confirming data and procedures (ideas like IPython notebooks and so on).  But, unfortunately, that isn&#8217;t the standard of today (so we really shouldn&#8217;t criticize).</p>
<p>We did all of this in fun.  We think the muscle strength paper is a good paper.  But we also want to be ready and able to kick the tires on papers.  Along those lines we are sharing code that reads a summary table (formatted like our first table) and produces synthetic datasets: <a href="https://github.com/WinVector/ExperimentInspector">ExperimentInspector</a>.  All code and data used to write this article are shared there.</p>
<p>If you have a serious need for a more production hardened tool of this nature, please get in touch.   We would love to help with things like clinical oversight, experiment planning, regulatory compliance and fraud detection.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2013/04/checking-claims-in-published-statistics-papers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Spring 2013 Win Vector LLC marketing drive</title>
		<link>http://www.win-vector.com/blog/2013/04/spring-2013-win-vector-llc-marketing-drive/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=spring-2013-win-vector-llc-marketing-drive</link>
		<comments>http://www.win-vector.com/blog/2013/04/spring-2013-win-vector-llc-marketing-drive/#comments</comments>
		<pubDate>Sat, 06 Apr 2013 19:28:59 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Administrativia]]></category>
		<category><![CDATA[marketing drive]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=2380</guid>
		<description><![CDATA[Dear readers, I am asking for your help promoting Win Vector LLC and the Win Vector LLC blog ( http://www.win-vector.com/blog/ ). We here at Win Vector LLC try hard to provide quality content and always benefit from more contacts and readers. If you have any possible leads or can make any introductions to companies that [...]<div class='yarpp-related-rss'>

Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2012/09/on-being-a-data-scientist/' rel='bookmark' title='On Being a Data Scientist'>On Being a Data Scientist</a></li>
<li><a href='http://www.win-vector.com/blog/2009/06/public-service-article-jstor-and-other-useful-research-archives/' rel='bookmark' title='Public Service Article: JSTOR and other Useful Research Archives'>Public Service Article: JSTOR and other Useful Research Archives</a></li>
</ol>
</div>
]]></description>
				<content:encoded><![CDATA[<p>Dear readers,</p>
<p>I am asking for your help promoting Win Vector LLC and the Win Vector LLC blog ( <a target="_blank" href="http://www.win-vector.com/blog/">http://www.win-vector.com/blog/</a> ).  We here at Win Vector LLC try hard to provide quality content and always benefit from more contacts and readers.</p>
<p>If you have any possible leads or can make any introductions to companies that may want some data science consulting I would love to hear from you (email: <a href="mailto:contact@win-vector.com">contact@win-vector.com</a> ).</p>
<p>Also, please subscribe to our data science blog (RSS: <a href="http://www.win-vector.com/blog/feed/">http://www.win-vector.com/blog/feed/</a>) and new Twitter account ( <a target="_blank" href="http://twitter.com/WinVectorLLC/">http://twitter.com/WinVectorLLC/</a> ).  Better yet please share our blog and Twitter account with anybody you think would be interested (and please ask them to do the same).</p>
<p>Thank you!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2013/04/spring-2013-win-vector-llc-marketing-drive/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Worry about correctness and repeatability, not p-values</title>
		<link>http://www.win-vector.com/blog/2013/04/worry-about-correctness-and-repeatability-not-p-values/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=worry-about-correctness-and-repeatability-not-p-values</link>
		<comments>http://www.win-vector.com/blog/2013/04/worry-about-correctness-and-repeatability-not-p-values/#comments</comments>
		<pubDate>Fri, 05 Apr 2013 23:58:56 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[data science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[p-values]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[significance]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=2379</guid>
		<description><![CDATA[In data science work you often run into cryptic sentences like the following: Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P &#60; 0.01 for linear [...]<div class='yarpp-related-rss'>

Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2012/10/level-fit-summaries-can-be-tricky-in-r/' rel='bookmark' title='Level fit summaries can be tricky in R'>Level fit summaries can be tricky in R</a></li>
<li><a href='http://www.win-vector.com/blog/2012/12/how-to-test-xcom-dice-rolls-for-fairness/' rel='bookmark' title='How to test XCOM &#8220;dice rolls&#8221; for fairness'>How to test XCOM &#8220;dice rolls&#8221; for fairness</a></li>
<li><a href='http://www.win-vector.com/blog/2013/03/a-bit-more-on-sample-size/' rel='bookmark' title='A bit more on sample size'>A bit more on sample size</a></li>
</ol>
</div>
]]></description>
				<content:encoded><![CDATA[<p>In data science work you often run into cryptic sentences like the following:</p>
<blockquote><p>
Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P &lt; 0.01 for linear trend).</p>
<p>(From &#8220;Association between muscular strength and mortality in men: prospective cohort study,&#8221; Ruiz et al. <a target="_blank" href="http://www.bmj.com/content/337/bmj.a439">BMJ 2008;337:a439</a>.)
</p></blockquote>
<p>The accepted procedure is to recognize &#8220;p&#8221; or &#8220;p-value&#8221; as shorthand for &#8220;<a target="_blank" href="http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-significant-doesnt-always-mean-important/">significance</a>,&#8221; keep your mouth shut and hope the paper explains what is actually claimed somewhere later on.  We know the writer is claiming significance, but despite the technical terminology they have not said which test they actually ran (lm(), glm(), contingency table, normal test, t-test, F-test, G-test, chi-squared, permutation test, exact test and so on).  I am going to go out on a limb here and say these types of sentences are gibberish and nobody actually understands them.  From experience we know generally what to expect, but it isn&#8217;t until we read further that we can pin down precisely what is being claimed.  This isn&#8217;t the authors&#8217; fault: they are likely good scientists, good statisticians, and good writers, but this incantation is required by publishing tradition and reviewers.</p>
<p>We argue you should worry about the correctness of your results (how likely a bad result could look like yours, the subject of frequentist significance) and repeatability (how much variance is in your estimation procedure, as measured by procedures like the bootstrap).  p-values and significance are important in how they help structure the above questions.</p>
<p>The legitimate purpose of technical jargon is to make conversations quicker and more precise.  However, saying &#8220;p&#8221; is not much shorter than saying &#8220;significance&#8221; and there are many different procedures that return p-values (so saying &#8220;p&#8221; does not narrow you down to exactly one procedure like a good acronym might).  At best the savings in time would be needing a mere 30 seconds to read about the &#8220;p&#8221; instead of spending 10 minutes thinking about which interpretation of significance is most appropriate to the actual problem at hand.  However, if you don&#8217;t have 10 minutes to consider whether the entire result of a paper is likely an observation artifact due to chance or noise (the subject of significance) then you really don&#8217;t care much about the paper.</p>
<p>In our opinion &#8220;p-values&#8221; have degenerated from a useful jargon into a secretive argot.  We are going to discuss thinking about significance as &#8220;worrying about correctness&#8221; (a fundamental concern) instead of as a cut and dried statistical procedure you should automate out of view (uncritically copying reported p&#8217;s from fitters).  Yes &#8220;p&#8221;s are significances, but there is no reason to not just say what sort of error you are claiming is unlikely.<span id="more-2379"></span>
<p/>
<p>We started with bad writing on significance, let&#8217;s share an example of good writing:</p>
<blockquote><p>
Suppose that in an agricultural experiment four different chemical treatments of soil produced mean wheat yields of 28, 22, 18, and 24 bushels per acre, respectively.  Is there a significant difference in these means, or is the observed spread due simply to chance?</p>
<p>(From Schaum&#8217;s Outlines, &#8220;Statistics&#8221; Fourth Edition, Murray R. Spiegel and Larry J. Stephens, McGraw-Hill, 2008.)
</p></blockquote>
<p>From the above paragraph you have some idea of what is going on and why you should care.  Imagine you were asked to choose one of these soil treatments for your farm.  You would want the one that actually is the best, not one that managed to fool you once.  You care about actual correctness.  The mathematician Gian-Carlo Rota called out an earlier version of this text as being the one that finally explained to him what the purpose of analysis of variance was.  Things like this are why this book has been in continuous print since 1961.  </p>
<p>From this same book:</p>
<blockquote><p>
When the first edition was introduced in 1961, the p-value was not as widely used as it is today, because it is often difficult to determine without the aid of computer software.  Today p-values are routinely provided by statistical software packages since the computer software computation of p-values is often a trivial matter.
</p></blockquote>
<p>People knew how to do statistics properly before 1961, so they probably had interesting methods that work around the need for explicit p-values.  In this article we will demonstrate R code to take all of the small steps to organize our data and produce summaries demonstrating the nature, correctness and repeatability of experimental results.  There are things you should look for (small counts, large error bars and overlapping distributions) that give diagnostic clues long before you make it to the p-values.</p>
<p>There is nothing intrinsically wrong with p-values, but I hold that the slavish copying of them from computer results into reports has distracted us away from thinking about important issues of correctness, repeatability and significance.</p>
<p>What is significance?  Significance is usually an estimate of the probability of some event you don&#8217;t want to happen looking like a favorable event you saw in your experimental data.  It is a frequentist idea and you introduce a straw-man explanation of the data (that you hope to falsify or knock down) called the &#8220;null hypothesis.&#8221;  For the soil treatment example it could be the probability that what we are identifying as the best soil treatment is actually an inferior soil treatment that &#8220;got lucky&#8221; during the measurements.  We wish to show that the odds of this kind of error are low, and such low odds of error are called &#8220;high significance.&#8221;  High significance does not guarantee you are not making a mistake (for one your modeling assumptions could be wrong).  However, low significance is usually very bad.  It says even assuming everything is the way you hoped, there is still a significant chance you are wrong.  That should matter to you.</p>
<p>Even if you think significance doesn&#8217;t matter to you, it will matter to your clients, managers and peers.  If you promote work that has low significance (or worse yet, work where you haven&#8217;t checked some form of significance) you are promoting work that not only may fail, but may have an obviously large chance of failure.  That can go over poorly in a project post-mortem.  You should always work with the feeling that someday &#8220;the truth will out.&#8221;  Not only will more data be collected in the future, but it will be obvious if you had enough evidence to justify the decisions you made earlier in a project.  For example in Bob Shaw&#8217;s short story &#8220;Burden of Proof&#8221; a crime is committed in front of a device that will play the crime back years in the future.  In the story there is no way to get the playback sooner, but knowing that someday the truth will out intensifies the detective&#8217;s dilemma: they can&#8217;t just make a convincing case, they must make the right case.  In the real world you can&#8217;t always be right, so you should measure and share your level of uncertainty.  This is why you calculate significance and why you want to effectively communicate what significance means to your possibly non-technical partners.</p>
<p>Let&#8217;s work through the significance claim from the paper relating muscular strength to mortality rates that we started with.  The claim of the paper is that, in a study of 8,762 men over an average follow-up period of 19 years, weakness in certain muscular strength tests was a statistically significant predictor of mortality rate.  The paper goes on to claim that this relation is significant even when accounting for age and disease conditions.  We can check what the paper claims on the relation between strength and mortality (as we can see the raw numbers in their reported tables), but we can&#8217;t check if the effect remains when controlling for other conditions as we don&#8217;t have enough of their data to reproduce that second analysis.   So let&#8217;s end this article on a concrete note by exploring the relation between strength and mortality using the statistical package <a target="_blank" href="http://cran.us.r-project.org">R</a>.</p>
<p>From the paper&#8217;s tables 1 and 2 we can find the number of people in each of the muscular strength groups (called &#8220;lower&#8221;, &#8220;middle&#8221;, and &#8220;upper&#8221;) and the number of deaths in each of these groups.  The raw numbers are (typed by hand into R):</p>
<pre>
> # from tables 1 and 2 of http://www.bmj.com/content/337/bmj.a439
> d = data.frame(MuscularStrength=c('lower','middle','upper'),
      count=c(2920,2919,2923),deaths=c(214,143,146))
> # make upper strength the reference level
> d$MuscularStrength = relevel(as.factor(d$MuscularStrength),
    ref='upper')
> print(d)
  MuscularStrength count deaths
1            lower  2920    214
2           middle  2919    143
3            upper  2923    146
</pre>
<p>The obvious thing to look at is the death rates:</p>
<pre>
> # quickly look at rates and typical deviations (sqrt of variance)
> #  http://en.wikipedia.org/wiki/Binomial_distribution
> d$MortalityRate = d$deaths/d$count
> d$MortalityRateDevEst = sqrt(d$MortalityRate*(1-d$MortalityRate)/d$count)
> d$MortalityRateDevBound = sqrt(0.25/d$count)
> print(d)
  MuscularStrength count deaths MortalityRate MortalityRateDevEst MortalityRateDevBound
1            lower  2920    214    0.07328767         0.004822769           0.009252915
2           middle  2919    143    0.04898938         0.003995090           0.009254500
3            upper  2923    146    0.04994868         0.004029222           0.009248166
</pre>
<p>It looks like the lower strength group has a mortality rate of about 7% over the 19 year interval, which is above the 5% rate we see for the other two groups.  This result is likely significant because each of these estimates has a standard error of around 0.5%, much lower than the 2% difference between groups we are seeing.  In relative terms it looks like you can cut your mortality rate by about 30% (from roughly 7.3% to roughly 5%) by not being in the lower muscle performance group.  Notice also that these rates look radically different from the 38.9, 25.9 and 26.6 deaths per 10,000 person years (0.389%, 0.259% and 0.266% per year) reported in the summary.  This is because we are using different units (in our case deaths per study individual over the whole follow-up, in theirs age adjusted deaths per person year) and the actual study also breaks deaths up into different causes.</p>
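<p>The change of units can be checked with a little arithmetic.  This is only a rough sketch: it assumes every subject contributes about 19 years of follow-up and ignores the age adjustment in the paper&#8217;s figure, so close agreement is not guaranteed.</p>
<pre>
# reported all-cause rate for the lower strength group,
# in deaths per 10,000 person years (from the quoted summary)
reportedPer10k = 38.9
followUpYears = 19   # assumption: approximate average follow-up
approxCumulative = reportedPer10k/10000*followUpYears
observed = 214/2920  # our deaths per study individual
print(c(approxCumulative,observed))
</pre>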
<p>From a business perspective at this point, using the rule of thumb that at least 3 standard errors is significant, we are done.  The table above is exactly the right thing to show the client.  They can see the important things: the size of the study, the general mortality rates, the size of the effect, and the likely error in measurement.  The likely errors in measurement we have added to the table are the likely errors we would see in re-running a study such as this one.  That is: they describe the distribution of results a second researcher trying to reproduce our result might see even when our estimates were exactly right.  If these estimated deviations are large then we know even if we were right our work is unlikely to be reproduced with sample sizes similar to what we used (which is bad).  If the estimated deviations are small then <em>assuming we are right</em> others should be able to reproduce our work (which is good). </p>
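<p>The rule of thumb can be made explicit by dividing the difference between two rates by the standard error of that difference.  A minimal sketch using the same counts as above (combining the two standard errors in quadrature assumes the groups are independent):</p>
<pre>
count = c(lower=2920,middle=2919,upper=2923)
deaths = c(lower=214,middle=143,upper=146)
rate = deaths/count
se = sqrt(rate*(1-rate)/count)
# difference between the lower and upper groups, in standard errors
diffLU = rate[['lower']] - rate[['upper']]
seDiff = sqrt(se[['lower']]^2 + se[['upper']]^2)
print(diffLU/seDiff)
</pre>
<p>A value above 3 passes the rule of thumb.</p>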
<p>We can produce a graphical version of this table as follows.  </p>
<pre>
# plot likely observed distributions of deaths, assuming 
# rates are exactly as measured
> library('ggplot2') # plotting
> plotFrame = c()
> deaths=0:sum(d$deaths)
> for(i in 1:(dim(d)[[1]])) {
  row = as.list(d[i,])
  plotData = data.frame(deaths=deaths,
     DeathRate=deaths/row$count,
     MuscularStrength=row$MuscularStrength,
     density=dbinom(deaths,size=row$count,prob=row$deaths/row$count))
  plotFrame = rbind(plotFrame,plotData)
}
> ggplot() + 
   geom_line(data=plotFrame,
      aes(x=DeathRate,y=density,color=MuscularStrength,
      linetype=MuscularStrength)) +
   geom_vline(data=d,aes(xintercept=deaths/count,color=MuscularStrength,
      linetype=MuscularStrength))
</pre>
<p>This produces the following plot which shows, assuming we have measured the mortality rates for the three groups correctly, how follow-up studies (using the same data-set size we used) would likely look.  The important thing to notice is how little the lower group&#8217;s distribution overlaps with the other two (and how little of each distribution crosses the center line of the others).  </p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2013/04/DeathRate1.png" alt="DeathRate" title="DeathRate.png" border="0" width="600" height="480" /></p>
<p>At this point we have addressed the variance of our estimation procedure (an issue the bootstrap method also works on) which speaks to the repeatability of our work.  We have not yet touched on the correctness (such as measuring the variance of a null hypothesis would help with) or specific-ness of our result (such as forming Bayesian estimates of the posterior distributions of the three mortality rates would help with).</p>
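<p>For comparison, here is a minimal sketch of what a bootstrap estimate of the same deviation could look like for the lower strength group.  The seed and the number of resamples are arbitrary choices made for this illustration.</p>
<pre>
set.seed(2013)   # arbitrary seed, only so the sketch is repeatable
# lower strength group outcomes: 1 = died, 0 = survived
outcomes = c(rep(1,214),rep(0,2920-214))
# re-sample the group with replacement and re-measure the rate
bootRates = replicate(2000,mean(sample(outcomes,replace=TRUE)))
print(sd(bootRates))
# the binomial estimate used in the table, for comparison
print(sqrt(mean(outcomes)*(1-mean(outcomes))/length(outcomes)))
</pre>
<p>The two numbers should be close; the bootstrap earns its keep on statistics where no simple formula is available.</p>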
<p>While we haven&#8217;t quite gotten to the traditional frequentist significance interpretation yet, we are very close.  In the frequentist notion of statistics instead of assuming our measurements are correct (which is a dangerous habit to get into) we pick something we don&#8217;t want to happen and try to show our data is very unlikely under such an assumption.  For example we could assume that the lower physical strength group has the same mortality rate (around 5%) as the other groups (which is bad as it implies our 7% measurement is then wrong).  The frequentist significance then calculates how often a 5% mortality rate group would return a sample with a 7% mortality rate.  This fact is already represented on our graph as how much of the middle and upper distributions cross the lower group&#8217;s center line (in this case almost none of the density does this).  So we do in fact have strong evidence of both a reproducible and significant result already in our graph and table.</p>
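<p>That tail calculation can be sketched directly with the binomial distribution.  The assumption here is that the null mortality rate is known exactly and equals the upper group&#8217;s measured rate; ignoring the uncertainty in that rate makes the result look more extreme than a proper two-sample test would.</p>
<pre>
# assumption: take the upper group's measured rate as the null rate
nullRate = 146/2923
# chance a group of 2920 with the null rate shows 214 or more deaths
pTail = pbinom(213,size=2920,prob=nullRate,lower.tail=FALSE)
print(pTail)
</pre>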
<p>In this case it is okay to leave significance un-calculated and informal.  The important follow up questions are ones of practicality and causation: can we change people&#8217;s strength group and would changing their strength group change their mortality rate?  These are the important questions, but they would have to be answered by new experiments as they can&#8217;t be addressed with just the data at hand.   And that is why I don&#8217;t suggest spending too long on significance with this data: observational error (the topic of significance) is, in this particular case, only one worry among many.</p>
<p>Of course not all studies work out this well.  Many experiments generate results that are near the boundary of significance and non-significance.   So we very much need to know how to precisely estimate significance.  The easiest way to do this is to apply the appropriate model (in this case logistic regression) and copy the significances from the model&#8217;s supplied summary report.  Before we can fit a model we must re-shape our data into an appropriate format.   We could use the melt and cast operators from <a href="http://had.co.nz">Hadley Wickham</a>&#8217;s <a target="_blank" href="http://cran.r-project.org/web/packages/reshape2/index.html">reshape2 package</a> (as illustrated in <a target="_blank" href="http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/">Your Data is Never the Right Shape</a>), but we will instead use the join operator in his <a target="_blank" href="http://cran.r-project.org/web/packages/plyr/index.html">plyr package</a>.  We feel in this case the notion of joining more accurately expresses the fluid transformations you need to be able to perform on data.  The commands to create the new data format are as follows:</p>
<pre>
# convert data into a longer format and get at same facts as in a model
> library('plyr')    # joining data
> outcomes = data.frame(outcome=c('survived','died'))
> outcomes$dummy = 'a'
> d$dummy='a'
> joined = join(d,outcomes,by=c('dummy'),type='full',match='all')
> joined$count = ifelse(joined$outcome=='survived',
                      joined$count-joined$deaths,
                      joined$deaths)
> data = subset(joined,select=c(MuscularStrength,count,outcome))
> print(data)
  MuscularStrength count  outcome
1            lower  2706 survived
2            lower   214     died
3           middle  2776 survived
4           middle   143     died
5            upper  2777 survived
6            upper   146     died
</pre>
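<p>The dummy-column join above is a cross product: every strength group is paired with every outcome.  As a minimal sketch of the same reshaping in base R, <code>merge</code> with <code>by=NULL</code> produces that Cartesian product directly.  The data frame <code>d</code> below is an assumption, reconstructed from the printed table (count taken as survived plus died); it is not the original data-loading step.</p>
<pre>
# assumed input, rebuilt from the printed output above
d = data.frame(MuscularStrength=c('lower','middle','upper'),
               count=c(2920,2919,2923),
               deaths=c(214,143,146))
outcomes = data.frame(outcome=c('survived','died'))
# by=NULL asks merge for the Cartesian product of the two frames
crossed = merge(d,outcomes,by=NULL)
# count becomes the number of subjects with the given outcome
crossed$count = ifelse(crossed$outcome=='survived',
                       crossed$count-crossed$deaths,
                       crossed$deaths)
long = subset(crossed,select=c(MuscularStrength,count,outcome))
# sanity check: reshaping must not create or lose subjects
stopifnot(sum(long$count)==sum(d$count))
print(long)
</pre>
<p>The row order may differ from the plyr result, which does not matter for model fitting.</p>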
<p>Now that the data is prepared, the step of interest is essentially a one-liner:</p>
<pre>
> model = glm(outcome=='died'~MuscularStrength,
   weights=data$count,family=binomial(link='logit'),data=data)
> summary(model)

Call:
glm(formula = outcome == "died" ~ MuscularStrength, family = binomial(link = "logit"), 
    data = data, weights = data$count)

Deviance Residuals: 
     1       2       3       4       5       6  
-20.30   33.44  -16.70   29.37  -16.87   29.58  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -2.94552    0.08491 -34.691  < 2e-16 ***
MuscularStrengthlower   0.40827    0.11069   3.688 0.000226 ***
MuscularStrengthmiddle -0.02040    0.12068  -0.169 0.865746    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3851.3  on 5  degrees of freedom
Residual deviance: 3831.6  on 3  degrees of freedom
AIC: 3837.6

Number of Fisher Scoring iterations: 6
</pre>
<p>What we are looking for is the value in the column named "Pr(>|z|)" for the row MuscularStrengthlower.  This is the so-called p-value (notice not even the fitting software is so rude as to use just "p"), and in this case it is 0.000226.  That says there is about a one in four thousand chance of seeing a difference at least this large between the lower strength group and the reference (upper) group if there were in fact no relation between muscular strength and mortality (the so-called null hypothesis).  The result is in fact statistically significant.</p>
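<p>Rather than copying the number from the printed summary by eye, the p-value can be pulled out of the coefficient table.  The sketch below is self-contained under two assumptions: the data frame is retyped from the table printed earlier, and <code>MuscularStrength</code> is given 'upper' as its first factor level so that the coefficient names match the summary above.</p>
<pre>
# assumed data, retyped from the printed long-format table
dat = data.frame(
   MuscularStrength=factor(rep(c('lower','middle','upper'),each=2),
                           levels=c('upper','lower','middle')),
   count=c(2706,214,2776,143,2777,146),
   outcome=rep(c('survived','died'),3))
fit = glm(outcome=='died'~MuscularStrength,
   weights=count,family=binomial(link='logit'),data=dat)
coefs = coef(summary(fit))   # matrix with one row per coefficient
pLower = coefs['MuscularStrengthlower','Pr(>|z|)']
print(pLower)
print(1/pLower)   # the "one in N" reading of the p-value
</pre>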
<p>As a word of warning, look at the residual deviance and null deviance reported in this model (3831.6 and 3851.3, respectively).  Treating deviance as an analogy for variance, we see our model explains only 0.5% of the variation in outcomes (who dies and who lives).  This is why you should not rely too much on global variance-style measures when evaluating models: they behave too much like <a target="_blank" href="http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/">accuracy</a> and miss a lot of what is going on.  Notice instead that this model identifies a group of around 1/3rd of the subjects for which, if muscle weakness was in fact causing mortality (it might not be), perhaps exercises could be used to reduce the mortality rate by 40%.  The hope is this would be an opportunity to cut down the mortality rate of the overall population by as much as 13%, which would be huge.  The original authors know this; it is why they made it the title of their study.  And this is the kind of thing you miss if you just look at modeling summaries instead of getting your hands in the data.</p>
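<p>The arithmetic behind these figures can be sketched as follows.  The deviances are copied from the summary above and the counts from the printed table; the variable names are illustrative only.  Which baseline you divide by changes the headline percentage, so the sketch prints the raw rates alongside the relative changes.</p>
<pre>
# fraction of deviance explained, by analogy with R-squared
nullDeviance = 3851.3
residualDeviance = 3831.6
fracExplained = 1 - residualDeviance/nullDeviance
print(fracExplained)   # about 0.005, the 0.5% quoted above

# mortality rate per group, counts taken from the printed table
died = c(lower=214,middle=143,upper=146)
survived = c(lower=2706,middle=2776,upper=2777)
rate = died/(died+survived)
print(rate)

# hypothetical scenario: the lower group attains the upper group's rate
rateGap = rate[['lower']]-rate[['upper']]
relReductionLower = rateGap/rate[['lower']]
deathsAvoided = (died[['lower']]+survived[['lower']])*rateGap
relReductionOverall = deathsAvoided/sum(died)
print(c(lowerGroup=relReductionLower,overall=relReductionOverall))
</pre>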
<p>There are a few additional points to share here.  In principle a statistician would see the estimates and standard errors as mere details of parameterization to be integrated out along the path to computing significance.  To a business the actual values of the estimates, relative rates in different groups and sizes of expected errors are all of vital interest.  Significance is an important check on whether these values are right.  But we also want the actual values available for discussion and possible use.  We feel working more with the data in small fluid steps is of more benefit to the data scientist and client than submitting data (hopefully in the right format) to a monolithic tool that quickly returns a single answer without exposing the intermediate tables, graphs and calculations.  You lose a lot of opportunities to notice an anomaly when you don't look at the intermediate results.</p>
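<p>As one illustration of keeping the estimates and standard errors in view, the coefficient for the lower strength group can be turned into an odds ratio with an approximate 95% interval.  This is a minimal sketch using the numbers printed in the summary above and the usual Wald (normal) approximation, which is an assumption of this example rather than a step from the original analysis.</p>
<pre>
# estimate and standard error copied from the model summary
est = 0.40827
se = 0.11069
z = qnorm(0.975)                 # about 1.96 for a 95% interval
oddsRatio = exp(est)             # odds of death, lower vs. upper group
interval = exp(c(est-z*se,est+z*se))
print(oddsRatio)
print(interval)
</pre>
<p>An interval that excludes 1 tells the same story as the small p-value, but it also shows the client how large or small the effect could plausibly be.</p>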
<p>In conclusion: you don't directly care about p-values; you care about correctness, repeatability and significance.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2013/04/worry-about-correctness-and-repeatability-not-p-values/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Win Vector LLC now tweets</title>
		<link>http://www.win-vector.com/blog/2013/04/win-vector-llc-now-tweets/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=win-vector-llc-now-tweets</link>
		<comments>http://www.win-vector.com/blog/2013/04/win-vector-llc-now-tweets/#comments</comments>
		<pubDate>Wed, 03 Apr 2013 17:00:28 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Administrativia]]></category>
		<category><![CDATA[Twitter]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=2376</guid>
		<description><![CDATA[Win-Vector LLC now tweets as WinVectorLLC. We will announce news and articles with appropriate hashtags. Please follow us!
]]></description>
				<content:encoded><![CDATA[<p>Win-Vector LLC now tweets as <a target="_blank" href="https://twitter.com/WinVectorLLC">WinVectorLLC</a>.  We will announce news and articles with appropriate hashtags.  Please follow us!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2013/04/win-vector-llc-now-tweets/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

