<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; R</title>
	<atom:link href="http://www.win-vector.com/blog/tag/r/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>My Favorite Graphs</title>
		<link>http://www.win-vector.com/blog/2011/12/my-favorite-graphs/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=my-favorite-graphs</link>
		<comments>http://www.win-vector.com/blog/2011/12/my-favorite-graphs/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 00:59:19 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[boxplots]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[graphical perception]]></category>
		<category><![CDATA[linear regression]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[statistical graphs]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1886</guid>
		<description><![CDATA[The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all. &#8211; William Cleveland, The Elements of Graphing Data, [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<blockquote><p>The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.</p></blockquote>
<p>&#8211; William Cleveland, <em>The Elements of Graphing Data</em>, Chapter 2</p>
<p>In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.</p>
<p>I tend to follow Cleveland&#8217;s philosophy, quoted above; these graphs show me, and hopefully you, aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that will either be useful to you directly or give you some ideas of your own.</p>
<p><span id="more-1886"></span></p>
<p>The graphs are all produced in <code>R</code>, using the <code>ggplot2</code> package. While <code>ggplot2</code> has a fairly steep learning curve, it is the most flexible of the <code>R</code> graphing packages that I have encountered, and I&#8217;ve been able to create rich graphics more quickly and easily than I could with the <code>R</code> base graphics or with other graphics packages.</p>
<p>Let&#8217;s start with some exploratory analysis. We will use the <code>AdultUCI</code> dataset that is included in the <code>arules</code> package.</p>
<pre><code>
library(arules)
data("AdultUCI")
dframe = AdultUCI[, c("education", "hours-per-week")]
colnames(dframe) = c("education", "hours_per_week")
         # get rid of the annoying minus signs in the column names
</code></pre>
<p>We want to compare the distribution of work-week length across education levels, using a box-and-whisker plot that is overlaid on a jittered scatterplot of the data.</p>
<pre><code>
library(ggplot2)
ggplot(dframe, aes(x=education, y=hours_per_week)) +
          geom_point(colour="lightblue", alpha=0.1, position="jitter") +
          geom_boxplot(outlier.size=0, alpha=0.2) + coord_flip()
</code></pre>
<p>The <code>outlier.size=0</code> argument to <code>geom_boxplot</code> turns off the outlier plotting, and <code>coord_flip</code> switches the coordinate axes (because there are a lot of education levels).</p>
<p>The resulting graph:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot.png" alt="Rplot" border="0"/></p>
<p>Recall that the box of a box-and-whisker plot covers the central 50% of the data distribution; the line in the center marks the median. In this case, the work-week length concentrates so strongly at 40 hours (except for PhDs and those with professional degrees; they are doomed to work longer hours, typically) that most of the boxes appear one-sided; it&#8217;s easier to see what is happening with both the scatterplot and box-and-whisker superimposed, than it might be with the box-and-whisker alone. We can also see the relative concentration of the subjects along each educational level.</p>
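<p>To make the &#8220;one-sided box&#8221; concrete, here is a small sketch with made-up hours (the vector <code>hours</code> is hypothetical, not taken from <code>AdultUCI</code>). When most values pile up at 40, the first quartile and the median coincide, so the box extends to only one side of the median line:</p>
<pre><code>
# hypothetical work-week lengths, concentrated at 40 hours
hours = c(20, 40, 40, 40, 40, 40, 45, 60)
# the box spans the 25% and 75% quantiles; the center line is the median (50%)
q = quantile(hours, probs=c(0.25, 0.5, 0.75))
print(q)
</code></pre>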
<p>I&#8217;ve found that this superimposed graph is fairly easy to explain in a presentation (easier than a plain box-and-whisker, actually). The primary disadvantage is that the scatterplot can become illegible for high-volume datasets (this set has about 49 thousand rows). In that case, we have to return to the box-and-whisker plot alone.
</p>
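<p>A note on hiding the outliers: the <code>outlier.size=0</code> setting above reflects the version of <code>ggplot2</code> that was current when this was written. In later releases, as far as I know, the documented way to suppress the outlier points is <code>outlier.shape=NA</code>. A minimal sketch, using a small made-up data frame (<code>toy</code> and its values are hypothetical stand-ins for <code>dframe</code>):</p>
<pre><code>
library(ggplot2)
set.seed(42)
# hypothetical data: three education levels, 50 simulated subjects each
toy = data.frame(
  education = rep(c("HS-grad", "Bachelors", "Doctorate"), each=50),
  hours_per_week = c(rnorm(50, 40, 5), rnorm(50, 42, 8), rnorm(50, 47, 10)))
# same layering as above; outlier.shape=NA keeps the boxplot from
# drawing outliers a second time on top of the jittered points
p = ggplot(toy, aes(x=education, y=hours_per_week)) +
      geom_point(colour="lightblue", alpha=0.5, position="jitter") +
      geom_boxplot(outlier.shape=NA, alpha=0.2) + coord_flip()
print(p)
</code></pre>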
<p>Beyond exploratory analysis, we also want plots to evaluate the models that we fit. Win-Vector&#8217;s bread-and-butter recently has been logistic regression, so we will start with some visualizations for evaluating binary logistic regression models. We&#8217;ll use the heart disease dataset that Hastie et al. used in <em>The Elements of Statistical Learning</em>.</p>
<pre><code>
path = "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"
saheart = read.table(path, sep=",", header=TRUE, row.names=1)
fmla = "chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity"
model = glm(fmla, data=saheart, family=binomial(link="logit"),
             na.action=na.exclude)
</code></pre>
<p>We will make a data frame of <em>chd</em> (the true response, coronary heart disease), and the score from the model.</p>
<pre><code>
dframe = data.frame(chd=as.factor(saheart$chd),
                    prediction=predict(model, type="response"))
</code></pre>
<p>The standard diagnostic plot for logistic models is the ROC curve, which is fine, but personally, I don&#8217;t get a visceral feel for the model from looking at the ROC. Also, if you are interested in setting a score threshold on the model for classification purposes, the ROC adds an additional level of indirection, since it essentially integrates the score away. I used to plot the distribution of score (prediction) versus true response, like so:</p>
<pre><code>
ggplot(dframe, aes(x=prediction, colour=chd)) + geom_density()
</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot01.png" alt="Rplot01" border="0"/></p>
<p>This visualization tells me whether or not the model scores actually separate the response &#8212; in this case, the model identifies negative cases (no coronary heart disease) better than positive cases. The graph is hard to explain to a non-technical audience, and it has the disadvantage that both distributions are separately normalized to have unit area, so you get no sense of the relative proportion of positive and negative cases (in this case, about 35% of the population have coronary heart disease). </p>
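<p>The class balance that the separately normalized densities hide is easy to recover as a table. A minimal sketch with made-up labels (the vector <code>chd_toy</code> is hypothetical; on the real data one would pass <code>dframe$chd</code> instead):</p>
<pre><code>
# hypothetical 0/1 labels standing in for dframe$chd
chd_toy = factor(c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0))
# fraction of negative (0) and positive (1) cases
balance = prop.table(table(chd_toy))
print(balance)
</code></pre>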
<p>Here&#8217;s an alternate graph:</p>
<pre><code>
ggplot(dframe, aes(x=prediction, fill=chd)) +
               geom_histogram(position="identity", binwidth=0.05, alpha=0.5)
</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot02.png" alt="Rplot02" border="0" /></p>
<p>This is two semi-transparent histograms; the blue histogram for <code>chd=1</code> is &#8220;in front&#8221; of the red histogram. Because they are histograms, rather than density plots, we can more clearly see the relative distribution of positive to negative cases, and we have a better sense of how well (or not) the model separates the positive cases from the negative ones. Clearly, for most score thresholds, the model will have a fairly high false positive rate. I use this visualization all the time, but it is also fairly hard to explain, the transparency in particular.</p>
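<p>The false positive rate at any one threshold can be read off a small confusion table. The sketch below uses made-up labels and scores (<code>truth</code> and <code>score</code> are hypothetical) and an arbitrary threshold of 0.5; <code>cut()</code> does the thresholding:</p>
<pre><code>
# hypothetical true labels and model scores
truth = c(0, 0, 0, 0, 1, 1, 0, 1, 0, 1)
score = c(0.10, 0.20, 0.60, 0.30, 0.80, 0.40, 0.70, 0.90, 0.20, 0.55)
# scores above 0.5 are called positive ("1"), the rest negative ("0")
predicted = cut(score, breaks=c(-Inf, 0.5, Inf), labels=c("0", "1"))
tab = table(truth=truth, predicted=predicted)
# false positive rate: share of true negatives that were called positive
fpr = tab["0", "1"] / sum(tab["0", ])
print(tab)
print(fpr)
</code></pre>
<p>Repeating this over a grid of thresholds gives the same information the overlaid histograms show at a glance.</p>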
<p>We can also use our friend the box-and-whisker scatterplot.</p>
<pre><code>
ggplot(dframe, aes(x=chd, y=prediction)) +
               geom_point(position="jitter", alpha=0.2) +
               geom_boxplot(outlier.size=0, alpha=0.5)

</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot03.png" alt="Rplot03" border="0" /></p>
<p>The median score for the coronary heart disease cases is pulled away from the median score of the healthy subjects, but the central 50% of the two distributions still overlap. </p>
<p>Finally, let&#8217;s look at visualizations for linear regression. We&#8217;ll use the <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data">prostate cancer data</a> from <em>Elements of Statistical Learning</em>.</p>
<pre><code>
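# assumption: the file linked above is whitespace-delimited with a header row
prostate.data = read.table("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data", header=TRUE)
fmla = "lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45"
model = lm(fmla, data=prostate.data)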
</code></pre>
<p>We can just <code>plot(model)</code> for some diagnostic graphs:</p>
<pre><code>
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(model)
</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot04.png" alt="Rplot04" border="0" /></p>
<p>These diagnostics are useful to determine whether or not a linear model is suitable, and to identify outliers; but again, I personally don't get a visceral feel for the model. I prefer to directly plot prediction against true response:</p>
<pre><code>
dframe = data.frame(lpsa=prostate.data$lpsa, prediction=predict(model))

title = sprintf("Prostate Cancer model\n R-squared = %1.3f",
                summary(model)$r.squared)
ggplot(dframe, aes(x=lpsa, y=prediction)) +
               geom_point(alpha=0.2) +
               geom_line(aes(y=lpsa), colour="blue") +
               ggtitle(title)   # replaces opts(title=title), which later ggplot2 releases removed
</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot05.png" alt="Rplot05" border="0" /></p>
<p>This graph conveys much of the same information as the Residuals vs. Fitted plot and the Q-Q plot; in particular, it shows whether there is systematic over- or under-prediction in specific ranges of the data. It will expose outliers, and it is intuitive to explain when presenting your results. Furthermore, it can be used to evaluate other models that predict a continuous response, such as regression trees or polynomial fits.</p>
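<p>One rough way to put numbers on systematic over- or under-prediction is to average prediction minus truth within ranges of the true response. The sketch below runs on simulated data (all names and values are hypothetical; the curved relationship is deliberately one that a straight-line fit cannot capture):</p>
<pre><code>
set.seed(1)
# hypothetical data: the response is quadratic in x, but we fit a line
x = runif(200, 0, 4)
toy = data.frame(x=x, y=x^2 + rnorm(200, sd=0.5))
fit = lm(y ~ x, data=toy)
toy$prediction = predict(fit)
# mean of (prediction - truth) within three equal-width ranges of the truth
bias = tapply(toy$prediction - toy$y, cut(toy$y, breaks=3), mean)
print(bias)
</code></pre>
<p>Ranges where the mean is far from zero are the ranges where the points in the prediction-versus-truth plot sit consistently on one side of the line.</p>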
<p>Which graphs do you find especially useful for your day-to-day work?</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/12/my-favorite-graphs/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Win-Vector starts submitting content to r-bloggers.com</title>
		<link>http://www.win-vector.com/blog/2011/08/win-vector-starts-submitting-content-to-r-bloggers-com/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=win-vector-starts-submitting-content-to-r-bloggers-com</link>
		<comments>http://www.win-vector.com/blog/2011/08/win-vector-starts-submitting-content-to-r-bloggers-com/#comments</comments>
		<pubDate>Mon, 08 Aug 2011 15:13:00 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Administrativia]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1713</guid>
		<description><![CDATA[We have been consistently impressed by, and have enjoyed, the wealth of R wisdom available on the R-bloggers aggregation site. Therefore Win-Vector LLC is granting the right to reformat and redistribute (with attribution and link) our blog&#8217;s R content in the R-bloggers site and feeds. We hope to see our R content shared through this network. [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We have been consistently impressed by, and have enjoyed, the wealth of <a target="_blank" href="http://cran.r-project.org/">R</a> wisdom available on the <a target="_blank" href="http://www.r-bloggers.com/">R-bloggers</a> aggregation site.</p>
<p>Therefore Win-Vector LLC is granting the right to reformat and redistribute (with attribution and link) our <a target="_blank" href="http://www.win-vector.com/blog/">blog</a>&#8217;s <a target="_blank" href="http://www.win-vector.com/blog/tag/r/">R content</a> in the <a target="_blank" href="http://www.r-bloggers.com/">R-bloggers</a> site and feeds.</p>
<p>We hope to see our R content shared through this network.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/should-your-mom-use-google-search/' rel='bookmark' title='Should your mom use Google search?'>Should your mom use Google search?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/08/win-vector-starts-submitting-content-to-r-bloggers-com/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Programmers Should Know R</title>
		<link>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=programmers-should-know-r</link>
		<comments>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/#comments</comments>
		<pubDate>Sat, 06 Aug 2011 15:29:22 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[diagnosis]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1711</guid>
		<description><![CDATA[Programmers should definitely know how to use R. I don&#8217;t mean they should switch from their current language to R, but they should think of R as a handy tool during development. Again and again I find myself working with Java code like the following. public class SomeBigProject1 { public static double logStirlingApproximation(final int n) { [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Survive R'>Survive R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Programmers should definitely know how to use <a target="_blank" href="http://cran.r-project.org/">R</a>.  I don&#8217;t mean they should switch from their current language to R, but they should think of R as a handy tool during development.<span id="more-1711"></span> Again and again I find myself working with Java code like the following.</p>
<style type="text/css">
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
pre { line-height: 125%; }
body .hll { background-color: #ffffcc }
body  { background: #f8f8f8; }
body .c { color: #408080; font-style: italic } /* Comment */
body .err { border: 1px solid #FF0000 } /* Error */
body .k { color: #008000; font-weight: bold } /* Keyword */
body .o { color: #666666 } /* Operator */
body .cm { color: #408080; font-style: italic } /* Comment.Multiline */
body .cp { color: #BC7A00 } /* Comment.Preproc */
body .c1 { color: #408080; font-style: italic } /* Comment.Single */
body .cs { color: #408080; font-style: italic } /* Comment.Special */
body .gd { color: #A00000 } /* Generic.Deleted */
body .ge { font-style: italic } /* Generic.Emph */
body .gr { color: #FF0000 } /* Generic.Error */
body .gh { color: #000080; font-weight: bold } /* Generic.Heading */
body .gi { color: #00A000 } /* Generic.Inserted */
body .go { color: #808080 } /* Generic.Output */
body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
body .gs { font-weight: bold } /* Generic.Strong */
body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
body .gt { color: #0040D0 } /* Generic.Traceback */
body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
body .kp { color: #008000 } /* Keyword.Pseudo */
body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
body .kt { color: #B00040 } /* Keyword.Type */
body .m { color: #666666 } /* Literal.Number */
body .s { color: #BA2121 } /* Literal.String */
body .na { color: #7D9029 } /* Name.Attribute */
body .nb { color: #008000 } /* Name.Builtin */
body .nc { color: #0000FF; font-weight: bold } /* Name.Class */
body .no { color: #880000 } /* Name.Constant */
body .nd { color: #AA22FF } /* Name.Decorator */
body .ni { color: #999999; font-weight: bold } /* Name.Entity */
body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
body .nf { color: #0000FF } /* Name.Function */
body .nl { color: #A0A000 } /* Name.Label */
body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
body .nt { color: #008000; font-weight: bold } /* Name.Tag */
body .nv { color: #19177C } /* Name.Variable */
body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
body .w { color: #bbbbbb } /* Text.Whitespace */
body .mf { color: #666666 } /* Literal.Number.Float */
body .mh { color: #666666 } /* Literal.Number.Hex */
body .mi { color: #666666 } /* Literal.Number.Integer */
body .mo { color: #666666 } /* Literal.Number.Oct */
body .sb { color: #BA2121 } /* Literal.String.Backtick */
body .sc { color: #BA2121 } /* Literal.String.Char */
body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
body .s2 { color: #BA2121 } /* Literal.String.Double */
body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
body .sh { color: #BA2121 } /* Literal.String.Heredoc */
body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
body .sx { color: #008000 } /* Literal.String.Other */
body .sr { color: #BB6688 } /* Literal.String.Regex */
body .s1 { color: #BA2121 } /* Literal.String.Single */
body .ss { color: #19177C } /* Literal.String.Symbol */
body .bp { color: #008000 } /* Name.Builtin.Pseudo */
body .vc { color: #19177C } /* Name.Variable.Class */
body .vg { color: #19177C } /* Name.Variable.Global */
body .vi { color: #19177C } /* Name.Variable.Instance */
body .il { color: #666666 } /* Literal.Number.Integer.Long */
</style>
<div class="highlight">
<pre><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeBigProject1</span> <span class="o">{</span>
	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">double</span> <span class="nf">logStirlingApproximation</span><span class="o">(</span><span class="kd">final</span> <span class="kt">int</span> <span class="n">n</span><span class="o">)</span> <span class="o">{</span>
		<span class="k">return</span> <span class="n">n</span><span class="o">*(</span><span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="n">n</span><span class="o">)-</span><span class="mi">1</span><span class="o">)</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="mi">2</span><span class="o">*</span><span class="n">Math</span><span class="o">.</span><span class="na">PI</span><span class="o">*</span><span class="n">n</span><span class="o">);</span>
	<span class="o">}</span>

	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">double</span> <span class="nf">logFactorial</span><span class="o">(</span><span class="kd">final</span> <span class="kt">int</span> <span class="n">n</span><span class="o">)</span> <span class="o">{</span>
		<span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="n">n</span><span class="o">;</span><span class="n">i</span><span class="o">&gt;</span><span class="mi">1</span><span class="o">;--</span><span class="n">i</span><span class="o">)</span> <span class="o">{</span>
			<span class="n">r</span> <span class="o">+=</span> <span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="n">i</span><span class="o">);</span>
		<span class="o">}</span>
		<span class="k">return</span> <span class="n">r</span><span class="o">;</span>
	<span class="o">}</span>

	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
		<span class="kt">int</span> <span class="n">nbad</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">n</span><span class="o">=</span><span class="mi">1000</span><span class="o">;</span><span class="n">n</span><span class="o">&lt;</span><span class="mi">10000</span><span class="o">;++</span><span class="n">n</span><span class="o">)</span> <span class="o">{</span>
			<span class="k">if</span><span class="o">(</span><span class="n">Math</span><span class="o">.</span><span class="na">abs</span><span class="o">(</span><span class="n">logFactorial</span><span class="o">(</span><span class="n">n</span><span class="o">)-</span><span class="n">logStirlingApproximation</span><span class="o">(</span><span class="n">n</span><span class="o">))&gt;=</span><span class="mf">1.0</span><span class="n">e</span><span class="o">-</span><span class="mi">5</span><span class="o">)</span> <span class="o">{</span>
				<span class="o">++</span><span class="n">nbad</span><span class="o">;</span>
			<span class="o">}</span>
		<span class="o">}</span>
		<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;nbad: &quot;</span> <span class="o">+</span> <span class="n">nbad</span><span class="o">);</span>
	<span class="o">}</span>
<span class="o">}</span>
</pre>
</div>
<p>Imagine that this is some humongous project to use <a target="_blank" href="http://en.wikipedia.org/wiki/Stirling's_approximation">Stirling&#8217;s Approximation</a> as a replacement for factorial.  All the code up until main is great.  But the unfortunate developer has hard-coded an acceptance test into <code>main()</code>.  If they run their big project all they get out is:</p>
<pre>
nbad: 7334
</pre>
<p>The developer needs to re-code and re-build to diagnose the failure, tweak their acceptance criteria or add more measurements.</p>
<p>I strongly recommend a different work pattern.  Instead of bringing criteria into the code, bring the data out:</p>
<div class="highlight">
<pre><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeBigProject2</span> <span class="o">{</span>
	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
		<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;n&quot;</span>
				<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="s">&quot;logFactorial&quot;</span>
				<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="s">&quot;logStirlingApproximation&quot;</span><span class="o">);</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">n</span><span class="o">=</span><span class="mi">1000</span><span class="o">;</span><span class="n">n</span><span class="o">&lt;</span><span class="mi">10000</span><span class="o">;++</span><span class="n">n</span><span class="o">)</span> <span class="o">{</span>
			<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">String</span><span class="o">.</span><span class="na">valueOf</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
					<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="n">SomeBigProject1</span><span class="o">.</span><span class="na">logFactorial</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
					<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="n">SomeBigProject1</span><span class="o">.</span><span class="na">logStirlingApproximation</span><span class="o">(</span><span class="n">n</span><span class="o">));</span>
		<span class="o">}</span>
	<span class="o">}</span>
<span class="o">}</span>
</pre>
</div>
<p>Capture this output in a file named &#8220;data.tsv&#8221; and both Microsoft Excel and R can open it.  Naturally I prefer to use R (so that is what I will demonstrate).  To read the results into R you start up R and type in a command like the following:</p>
<pre>
 &gt; d &lt;- read.table('data.tsv',
        header=T,sep='\t',quote='',as.is=T,
        stringsAsFactors=F,comment.char='',allowEscapes=F)
</pre>
<p>Most of the arguments control the style of file R is to expect (what the field separator is, whether to expect escapes and quotes, and so on).  The settings I suggest here are the &#8220;ultra hardened&#8221; settings.  If you make sure none of your fields have a tab or line-break in them when you print, then R is guaranteed to be able to read the data (no matter what wacky symbols are in it).  On the Java side that usually means making sure any varying text fields are run through <code>.replaceAll("\\s+"," ")</code> &#8220;just in case.&#8221; At this point you can already look at your data with the <code>summary()</code> command:</p>
<pre>
 &gt; summary(d)
</pre>
<pre>
       n         logFactorial   logStirlingApproximation
 Min.   :1000   Min.   : 5912   Min.   : 5912
 1st Qu.:3250   1st Qu.:23034   1st Qu.:23034
 Median :5500   Median :41870   Median :41870
 Mean   :5500   Mean   :42536   Mean   :42536
 3rd Qu.:7749   3rd Qu.:61653   3rd Qu.:61653
 Max.   :9999   Max.   :82100   Max.   :82100
</pre>
<p>This immediately hints that you should have been thinking in terms of relative error instead of absolute error (since insisting on high absolute accuracy on large results does not always make sense).</p>
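<p>A sketch of what such a relative-error check might look like. To keep it self-contained, the columns are recomputed in R rather than read from <code>data.tsv</code>; using <code>lfactorial()</code> as a stand-in for the Java <code>logFactorial</code> output is an assumption of this sketch, as is the column name <code>relErr</code>:</p>
<pre>
n = 1000:9999
d = data.frame(n=n,
               logFactorial=lfactorial(n),
               logStirlingApproximation=n*(log(n)-1) + 0.5*log(2*pi*n))
# relative error: absolute difference scaled by the size of the true value
d$relErr = abs(d$logFactorial - d$logStirlingApproximation)/abs(d$logFactorial)
summary(d$relErr)
</pre>
<p>With the real dump you would skip the first three statements and compute <code>relErr</code> directly on the data frame returned by <code>read.table()</code>.</p>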
<p>You also have access to standard statistical measures of agreement like correlation: </p>
<pre>
 &gt; with(d,cor(logFactorial,logStirlingApproximation))
</pre>
<pre>
result: 1
</pre>
<p>You can see where your failures were:</p>
<pre>
 &gt; library(ggplot2)
 &gt; d$bad &lt;- with(d,abs(logFactorial-logStirlingApproximation)&gt;=1.0e-5)
 &gt; ggplot(d) + geom_point(aes(x=n,y=bad))
</pre>
<p>This yields the graph:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/bad.png" alt="bad.png" border="0" width="525" height="525" /><br />
</center></p>
<p>You can see all your failures are in the initial interval.  You can then drill in:</p>
<pre>
 &gt; ggplot(d) + geom_point(aes(x=n,y=logFactorial-logStirlingApproximation)) +
                scale_y_log10()
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/diff.png" alt="diff.png" border="0" width="525" height="525" /><br />
</center></p>
<p>And here we see some things (that are in general true for Stirling&#8217;s approximation):</p>
<ol>
<li>It is very accurate.</li>
<li>It is always an underestimate.</li>
<li>It gets better as n gets larger.</li>
</ol>
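<p>These observations match the next term of the Stirling series, which is 1/(12n): positive (so the approximation falls short), small, and shrinking as n grows. A quick numerical check, recomputing the values in R (with <code>lfactorial()</code> standing in for the Java output, which is an assumption of this sketch):</p>
<pre>
n = c(1000, 5000, 9999)
# gap between the log-factorial and the approximation used in the Java code
gap = lfactorial(n) - (n*(log(n)-1) + 0.5*log(2*pi*n))
# compare with the leading correction term of the Stirling series
print(cbind(n, gap, leadingTerm=1/(12*n)))
</pre>
<p>Since 1/(12n) stays at or above 1.0e-5 up to n = 8333, this also accounts for the failure count: 8333 - 1000 + 1 = 7334.</p>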
<p>Essentially, by poking around with graphs in R you can figure out the nature of your errors (telling you what to fix) and generate findings that tell you how to fix your criteria (perhaps your code is working, but your test wasn&#8217;t sensible).  The &#8220;dump everything and then use R&#8221; technique is also particularly good for generating reports on code timings using either <code>geom_histogram</code> or <code>geom_density</code>.</p>
<p>For example, if we had data with a field <code>runTimeMS</code> then it is a simple one-liner to get plot like the following:</p>
<pre>
 &gt; ggplot(t) + geom_density(aes(x=runTimeMS))
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/timing.png" alt="timing.png" border="0" width="525" height="525" /><br />
</center></p>
<p>From this graph we can immediately see:</p>
<ol>
<li>Most of our run-times are very fast.</li>
<li>We have a heavy right-tail (evidence of &#8220;contagion&#8221; or one slow-down causing others, like CPU or IO contention).</li>
<li>Data is truncated at 100MS (could be something &#8220;censoring&#8221; the measurement, an exception being thrown or an abort).</li>
<li>There is a spike at 30MS (something is true and slow for some subset of the data that isn&#8217;t present in the majority).</li>
</ol>
<p>This is a lot more than would be seen in a mean-only or mean and standard deviation summary.  We may even be seeing signs of two different bugs (the truncation and the spike).</p>
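<p>To make the contrast concrete, here is a small sketch on simulated data (an assumption: these are made-up run-times built to have a fast bulk, a spike and a cap, not the measurements behind the graph above).  The data frame name <code>timings</code> is hypothetical.</p>
<pre>
# Simulated stand-in for a timing dump (assumption, not measured data).
set.seed(2011)
timings &lt;- data.frame(runTimeMS = pmin(100,
   c(rexp(900, rate=1/5), rep(30, 50), rep(150, 50))))
# The usual two-number summary hides the shape.
c(mean = mean(timings$runTimeMS), sd = sd(timings$runTimeMS))
# Quantiles and exact-value counts expose the tail, the spike and the cap.
quantile(timings$runTimeMS, c(0.5, 0.9, 0.99, 1))
sapply(c(30, 100), function(v) { sum(timings$runTimeMS == v) })
</pre>
<p>Many repeats of one exact value, or a pile-up at the maximum, are the kind of thing a density plot shows at a glance and a mean never will.</p>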
<p>In all cases the key is to dump a lot of data in machine readable form and then come back to analyze it.  This is far more flexible than hoping to code in the right summaries and then further hoping the summaries don&#8217;t miss something important (or that you at least get a chance to notice if they do miss something).  Being able to do exploratory statistics on dumps from your code (both results and timing) gives you incredible measurement, tuning and debugging powers.   The scriptability of R means any later analysis is as easy as cut and paste.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Your Data is Never the Right Shape</title>
		<link>http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=your-data-is-never-the-right-shape</link>
		<comments>http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/#comments</comments>
		<pubDate>Sun, 31 Jul 2011 20:27:00 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[plyr]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[reshape]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1687</guid>
		<description><![CDATA[One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your [...]
]]></description>
			<content:encoded><![CDATA[<p>One of the recurring frustrations in data analytics is that your data is never in the right shape.  Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want.  Best case: you notice this and have the tools to reshape your data.  </p>
<p>There is no final &#8220;right shape.&#8221;  In fact even your data is never right. You will always be called to re-do your analysis (new variables, new data, corrections) so you should always understand you are on your &#8220;penultimate analysis&#8221; (always one more to come).  This is why we insist on using general methods and scripted techniques, as these methods are much easier to reliably reapply on new data than GUI/WYSIWYG techniques.</p>
<p>In this article we will work a small example and call out some <a target="_blank" href="http://cran.r-project.org/">R</a> tools that make reshaping your data much easier.  The idea is to think in terms of &#8220;relational algebra&#8221; (like SQL) and transform your data towards your tools (and not to attempt to adapt your tools towards the data in an ad-hoc manner).<span id="more-1687"></span>Take a simple example where you are designing a new score called &#8220;<code>score2</code>&#8221; to predict or track an already known value called &#8220;<code>score1</code>.&#8221;  The typical situation is <code>score1</code> is a future outcome (such as the number of dollars profit on a transaction) and <code>score2</code> is a prediction (such as the estimated profit before the transaction is attempted).  Training data is usually assembled by performing a large number of transactions, recording what was known before the transaction and then aligning or joining this data with measured results when they become available.  For this example we are not interested in the inputs driving the model (a rare situation, but we are trying to make our example as simple as possible) but only examining the quality of <code>score2</code> (which is defined as how well it tracks <code>score1</code>).</p>
<p>All of this example will be in R, but the principles are chosen to apply more generally.  First let us enter some example data:</p>
<p><code><br />
<br/> &gt; d &lt;- data.frame(id=c(1,2,3,1,2,3),score1=c(17,5,6,10,13,7),score2=c(13,10,5,13,10,5))<br />
<br/> &gt; d<br />
</code></p>
<p>This gives us our example data.  Each row is numbered (1 through 6) and has an <code>id</code> and both our scores:</p>
<pre>
  id score1 score2
1  1     17     13
2  2      5     10
3  3      6      5
4  1     10     13
5  2     13     10
6  3      7      5
</pre>
<p>We said our only task was to characterize how well <code>score2</code> works at predicting <code>score1</code> (or how good a substitute <code>score2</code> is for <code>score1</code>).  We could compute correlation, RMS error, info-gain or some such.  But instead let&#8217;s look at this graphically.  We will prepare a graph showing how well <code>score1</code> is represented by <code>score2</code>.  For this we choose to place <code>score1</code> on the y-axis (as it is the outcome) and <code>score2</code> on the x-axis (as it is the driver).</p>
<p><code><br />
<br/> &gt; library(ggplot2)<br />
<br/> &gt; ggplot(d) + geom_point(aes(x=score2,y=score1))<br />
</code></p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/plot1.png" alt="plot1.png" border="0" width="525" height="525" /></p>
<p>Figure 1: <code>score1</code> as a function of  <code>score2</code>.</p>
<p></center></p>
<p>This does not look good.  We would have liked to see all of the dots falling on the line &#8220;y=x.&#8221;  This plot shows <code>score2</code> is not predicting <code>score1</code> very well.  Part of this is that we missed an important feature of the data (and because we missed it the feature becomes a problem): the <code>id</code>s repeat.  First we re-order by <code>id</code> to make this more obvious.</p>
<p><code><br />
<br/> &gt; dsort &lt;- d[order(d$id),]<br />
<br/> &gt; dsort<br />
</code></p>
<pre>
  id score1 score2
1  1     17     13
4  1     10     13
2  2      5     10
5  2     13     10
3  3      6      5
6  3      7      5
</pre>
<p>This is a very common situation.  The original score is not completely a function of the known inputs.  We are using &#8220;<code>id</code>&#8221; to abstractly represent all of the inputs: two rows in our example have the same <code>id</code> if and only if all known inputs are exactly the same.  The repeating <code>id</code>s are the same experiment run at different times (a good idea) and the variation in <code>score1</code> could be the effect of an un-modeled input that changed value or something simple like a &#8220;noise term&#8221; (a random un-modeled effect).   Notice that <code>score2</code> is behaving as a function of <code>id</code>: all rows with the same <code>id</code> have the same value for <code>score2</code>.  If <code>score2</code> is a model then it has to be a function of the inputs (or more precisely if it is not a function of the inputs you have done something wrong).  So any variation of <code>score1</code> between rows with identical <code>id</code> is &#8220;unexplainable variation&#8221; (unexplainable from the point of view of currently tracked inputs).  You should know about, characterize and report this variation (which is why it is good to have some repeated experiments).  But this variation is not the model&#8217;s fault.  If we want to know how good a job we did constructing the model (which we now see can be a slightly different question than how well the model works at prediction) we need to see how much of the explainable variation the model accounts for.</p>
<p>If we assume (as is traditional) the unexplained variation is from an &#8220;unbiased noise source&#8221; then we can lessen the impact of the noise source by replacing <code>score1</code> with a value averaged over rows with the same <code>id</code>.  This assumption is traditional because an unbiased noise source is present in many problems and assuming anything more requires more research into the problem domain.   You would eventually fold such research into your model, so your goal is always to have all effects or biases in your model and hope what is left over is unbiased.  This is usually not strictly true, but not accounting for the unexplained variation at all is in many cases even worse than modeling the unexplained variation as being bias-free.</p>
<p>And now we find our data is the &#8220;wrong shape.&#8221;  To replace <code>score1</code> with the appropriate averages we need to do some significant data manipulation.  We need to group sets of rows and add new columns. We could do this imperatively (write some loops and design some variables to track and manipulate state) or declaratively (find a path of operations from what you have to what you need through R&#8217;s data manipulation algebra).  Even though the declarative method is more work the first time (you could often write the code in less time than it takes to read the manuals) it is the right way to go (as it is more versatile and powerful in the end).</p>
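<p>For reference, the averaging step can be sketched in raw R with <code>aggregate()</code>.  This is our own side illustration (the name <code>dmeanBase</code> is hypothetical), useful mostly as a cross-check on whatever tool you end up using:</p>
<pre>
d &lt;- data.frame(id=c(1,2,3,1,2,3),score1=c(17,5,6,10,13,7),score2=c(13,10,5,13,10,5))
# Group rows by id and replace each score column by its within-group mean.
dmeanBase &lt;- aggregate(cbind(score1,score2) ~ id, data=d, FUN=mean)
dmeanBase
</pre>
<p>This produces one row per <code>id</code>, with <code>score1</code> averaged and <code>score2</code> unchanged (it was already constant within each <code>id</code>).</p>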
<p>Luckily we don&#8217;t have to use raw R.  There are a number of remarkable packages (all by <a target="_blank" href="http://had.co.nz/">Hadley Wickham</a> who is also the author of the <a target="_blank" href="http://had.co.nz/ggplot2/">ggplot2</a> package we use to prepare our figures) that really improve R&#8217;s ability to coherently manage data.  The easiest (on us) way to fix up our data is to make the computer work hard and use the powerful melt/cast technique.  These functions are found in the libraries <a target="_blank" href="http://www.jstatsoft.org/v21/i12/paper">reshape</a> and <a target="_blank" href="http://www.jstatsoft.org/v40/i01/paper">plyr</a> (which were automatically loaded when we loaded the ggplot2 library).</p>
<p>melt is a bit abstract.  What it does is convert your data into a &#8220;narrow&#8221; format where rows are split into many rows each carrying just one result column of the original row.  For example we can melt our data by <code>id</code> as follows:</p>
<p><code><br />
<br/> &gt; dmelt &lt;- melt(dsort,id.vars=c('id'))<br />
<br/> &gt; dmelt<br />
</code></p>
<p>Which yields the following:</p>
<pre>
   id variable value
1   1   score1    17
2   1   score1    10
3   2   score1     5
4   2   score1    13
5   3   score1     6
6   3   score1     7
7   1   score2    13
8   1   score2    13
9   2   score2    10
10  2   score2    10
11  3   score2     5
12  3   score2     5
</pre>
<p>Each of the two facts (<code>score1</code>, <code>score2</code>) from our original row is split into its own row.  The <code>id</code> column plus the new variable column are now considered to be keys.  This format is seldom used directly; it is useful because it is easy to express important data transformations in terms of it.  For instance we wanted our table to have duplicate rows collected and <code>score1</code> replaced by its average (to attempt to remove the unexplainable variation).  That is now easy:</p>
<p><code><br />
<br/> &gt; dmean &lt;- cast(dmelt,fun.aggregate=mean)<br />
<br/> &gt; dmean<br />
</code></p>
<pre>
  id score1 score2
1  1   13.5     13
2  2    9.0     10
3  3    6.5      5
</pre>
<p>We used <code>cast()</code> in its default mode, where it assumes all columns not equal to &#8220;value&#8221; are the keyset.  It then collects all rows with identical keying and combines them back into wide rows using mean or average as the function to deal with duplicates.  Notice <code>score1</code> is now the desired average, and <code>score2</code> is as before (as it was a function of the keys or inputs, so it is not affected by averaging).  With this new smaller data set we can re-try our original graph:</p>
<p><code><br />
<br/> &gt; ggplot(dmean) + geom_point(aes(x=score2,y=score1))<br />
</code></p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/plot2.png" alt="plot2.png" border="0" width="525" height="525" /></p>
<p>Figure 2: <code>mean(score1)</code> as a function of  <code>score2</code>.</p>
<p></center></p>
<p>This doesn&#8217;t look so bad.  A lot of the error or variation in the first plot was unexplainable variation.  <code>score2</code> isn&#8217;t bad given its inputs.  If you wanted to do better than <code>score2</code> you would be advised to find more modeling inputs (versus trying more exotic modeling techniques).</p>
<p>Of course a client or user is not interested if <code>score2</code> is &#8220;best possible.&#8221;  They want to know if it is any good.  To do this we should show them (either by graph or by quantitative summary statistics like we mentioned earlier) at least 3 things:</p>
<ol>
<li>How well the model predicts overall (the very first graph we presented).</li>
<li>How much of the explainable variation the model predicts (the second graph).</li>
<li>The nature of the unexplained variation (which we will explore next).</li>
</ol>
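<p>If you prefer numbers to graphs, the first two items can be sketched as summary statistics.  This is a minimal illustration using base R only; <code>rmse()</code> is a hypothetical helper we define here and the averaged table is rebuilt with <code>aggregate()</code> so the snippet runs on its own.</p>
<pre>
d &lt;- data.frame(id=c(1,2,3,1,2,3),score1=c(17,5,6,10,13,7),score2=c(13,10,5,13,10,5))
dmean &lt;- aggregate(cbind(score1,score2) ~ id, data=d, FUN=mean)
# Hypothetical helper: root mean square error of a prediction.
rmse &lt;- function(truth,pred) { sqrt(mean((truth-pred)^2)) }
# 1: how well score2 predicts overall (unexplainable variation included).
c(cor=cor(d$score1,d$score2), rmse=rmse(d$score1,d$score2))
# 2: how well score2 tracks the explainable part (score1 averaged per id).
c(cor=cor(dmean$score1,dmean$score2), rmse=rmse(dmean$score1,dmean$score2))
</pre>
<p>On this toy data the RMS error against the averaged <code>score1</code> is much smaller than against the raw <code>score1</code>, which is the same story the two graphs tell.</p>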
<p>We said earlier we are hoping the unexplained variation is noise (or if it is not noise it would be nice if it is a clue to new important modeling features).  So the unexplained variation must not go unexamined.  We will finish by showing how to characterize the unexplained variation.  As before we will just make a graph, but the data preparation steps would be exactly the same if we were using a quantitative summary (like correlation, or any other).  And, of course, our data is still not the right shape for this step.  Luckily there is another tool ready to fix this: <code>join()</code>.</p>
<p><code><br />
<br/> &gt; djoin &lt;- join(dsort,dsort,'id')<br />
<br/> &gt; fixnames &lt;- function(cn) {<br />
     n &lt;- length(cn);<br />
     for(i in 2:((n+1)/2)) { cn[i] &lt;- paste('a',cn[i],sep='') };<br />
     for(i in ((n+3)/2):n) { cn[i] &lt;- paste('b',cn[i],sep='') };<br />
     cn<br />
  }<br />
<br/> &gt; colnames(djoin) &lt;- fixnames(colnames(djoin))<br />
<br/> &gt; djoin<br />
</code></p>
<p>which produces:</p>
<pre>
   id ascore1 ascore2 bscore1 bscore2
1   1      17      13      17      13
2   1      17      13      10      13
3   1      10      13      17      13
4   1      10      13      10      13
5   2       5      10       5      10
6   2       5      10      13      10
7   2      13      10       5      10
8   2      13      10      13      10
9   3       6       5       6       5
10  3       6       5       7       5
11  3       7       5       6       5
12  3       7       5       7       5
</pre>
<p>All of the work was done by the single line &#8220;<code>djoin &lt;- join(dsort,dsort,'id')</code>&#8221;; the rest was just fixing the column names (as self-join is not the central use case of join).  What we have now is a table that is exactly right for studying unexplained variation.  For each <code>id</code> we have each row with the same <code>id</code> matched.  This blows up every <code>id</code> from having 2 rows in <code>dsort</code> to 4 rows in <code>djoin</code>.  Notice this gives us every pair of <code>score1</code> values seen for the same <code>id</code> (which will let us examine unexplained variation) and <code>score2</code> is still constant over all rows with the same <code>id</code> (as it has always been throughout our analysis).  With this table we can now plot how <code>score1</code> varies for rows with the same <code>id</code>:</p>
<p><code><br />
<br/> &gt; ggplot(djoin) + geom_point(aes(x=ascore1,y=bscore1))<br />
</code></p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/unex.png" alt="unex.png" border="0" width="525" height="525" /></p>
<p>Figure 3: <code>score1</code> as a function of  <code>score1</code>.</p>
<p></center></p>
<p>And we can see, as we expected, the unexplained variation in <code>score1</code> is about as large as the mismatch between <code>score1</code> and <code>score2</code> in our original plot.  The important thing is this is all about <code>score1</code> (<code>score2</code> is now literally out of the picture).  The analyst&#8217;s job would now be to try and tie bits of the unexplained variation to new inputs (that can be folded into a new <code>score2</code>) and/or characterize the noise term (so the customer knows how close they should expect repeated experiments to be).</p>
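<p>One simple way to start characterizing the noise term is the spread of <code>score1</code> among repeated runs of the same <code>id</code>.  A minimal sketch in base R (the names <code>dev</code> and <code>pooledSd</code> are ours, and the pooled estimate assumes the noise has the same size for every <code>id</code>):</p>
<pre>
d &lt;- data.frame(id=c(1,2,3,1,2,3),score1=c(17,5,6,10,13,7),score2=c(13,10,5,13,10,5))
# Standard deviation of score1 within each id.
tapply(d$score1, d$id, sd)
# Deviation of each score1 from the mean of its own id group.
dev &lt;- d$score1 - ave(d$score1, d$id)
# Pooled within-id standard deviation: one noise estimate for all ids.
pooledSd &lt;- sqrt(sum(dev^2)/(nrow(d) - length(unique(d$id))))
pooledSd
</pre>
<p>A number like this tells the customer roughly how far apart two runs of the same experiment can be expected to land.</p>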
<p>What we are trying to encourage with the use of &#8220;big hammer tools&#8221; is an ability and willingness to look at and transform your data in meaningful steps.  It often seems easier and more efficient to build one more piece of data tubing, but a lot of data tubes become an unmanageable collection of spaghetti code.  The analyst should, in some sense, always be looking at data and not looking at coding details.  For these sorts of analyses we encourage analysts to think in terms of &#8220;data shape&#8221; and transforms.  This discipline leaves more of the analyst&#8217;s energy and attention to think productively about the data and actual problem domain.</p>
<hr />
<p>Note:</p>
<p>For the third plot showing the variation of <code>score1</code> across different rows (but same <code>id</code>s) it may be appropriate to use a slightly more complicated <code>join()</code> procedure than we showed.  The join shown produced rows of artificial agreement where both values of <code>score1</code> came from the same row (thus had no chance of being different, so in some sense deserve no credit).  This is also the only way any non-duplicated evaluations could make it to the plot.  To eliminate these uninteresting agreements from the plot do the following:</p>
<p><code><br />
<br/> &gt; d$rowNumber &lt;- 1:(dim(d)[1])<br />
<br/> &gt; djoin &lt;- join(d,d,'id')<br />
<br/> &gt; colnames(djoin) &lt;- fixnames(colnames(djoin))<br />
<br/> &gt; djoin &lt;- djoin[djoin$arowNumber!=djoin$browNumber,]<br />
<br/> &gt; djoin<br />
</code></p>
<p>This gives us a table that shows only values of <code>score1</code> from different rows:</p>
<pre>
   id ascore1 ascore2 arowNumber bscore1 bscore2 browNumber
2   1      17      13          1      10      13          4
4   2       5      10          2      13      10          5
6   3       6       5          3       7       5          6
7   1      10      13          4      17      13          1
9   2      13      10          5       5      10          2
11  3       7       5          6       6       5          3
</pre>
<p>And only plots points on the diagonal if &#8220;you have really earned them&#8221;:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/fig4.png" alt="fig4.png" border="0" width="525" height="525" /><br />
</center></p>
<p>So while the direct <code>join()</code> may not be the immediate perfect answer it is still a good intermediate to form, as what you want is only a simple data transformation away from it.</p>
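<p>The same self-join can also be sketched with base R&#8217;s <code>merge()</code>, whose <code>suffixes</code> argument names the two copies of each column and so avoids the column-renaming helper.  This is an alternative we offer for comparison (the name <code>dpairs</code> is hypothetical), not the method used above:</p>
<pre>
d &lt;- data.frame(id=c(1,2,3,1,2,3),score1=c(17,5,6,10,13,7),score2=c(13,10,5,13,10,5))
d$rowNumber &lt;- seq_len(nrow(d))
# Self-join on id; non-key columns get the suffixes .a and .b.
dpairs &lt;- merge(d, d, by='id', suffixes=c('.a','.b'))
# Keep only pairs built from two different original rows.
dpairs &lt;- dpairs[dpairs$rowNumber.a != dpairs$rowNumber.b, ]
dpairs[, c('id','score1.a','score1.b')]
</pre>
<p>Row order may differ from the <code>join()</code> result, but the set of pairs is the same.</p>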
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The cranky guide to trying R packages</title>
		<link>http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-cranky-guide-to-trying-r-packages</link>
		<comments>http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/#comments</comments>
		<pubDate>Sun, 13 Feb 2011 16:45:00 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Cranky Guide]]></category>
		<category><![CDATA[GAM]]></category>
		<category><![CDATA[general additive models]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regression]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1642</guid>
		<description><![CDATA[This is a tutorial on how to try out a new package in R. The summary is: expect errors, search out errors and don&#8217;t start with the built in examples or real data. Suppose you want to try out a novel statistical technique? A good fraction of the time R is your best bet for [...]
]]></description>
			<content:encoded><![CDATA[<p>This is a tutorial on how to try out a new package in <a target="_blank" href="http://www.r-project.org/">R</a>.  The summary is: expect errors, search out errors and don&#8217;t start with the built in examples or real data.</p>
<p>Suppose you want to try out a novel statistical technique?  A good fraction of the time R is your best bet for a first trial.  Take as an example generalized additive models (&#8220;Generalized Additive Models,&#8221; Trevor J Hastie, Robert Tibshirani, Statistical Science (1986) vol. 1 (3) pp. 297-318); R has a package named &#8220;gam&#8221; written by Trevor Hastie himself.  But, like most R packages, trying the package from the supplied documentation brings in unfamiliar data and concerns.  It is best to start small and quickly test if the package itself is suitable to your needs.  We give a quick outline of how to learn such a package and quickly find out if the package is for you.</p>
<p><span id="more-1642"></span><br />
To start, install and activate the package in R:</p>
<pre>
install.packages('gam')
library(gam)
help(gam)
</pre>
<p>From the help we see gam fits in much the same way lm() and glm() do.  So we need some data to try it out.  I suggest not using the package example data or a real problem- use deliberately trivial data so you are diagnosing the new package (not diagnosing something else).  I like to create a quick data frame as follows:</p>
<pre>
d &lt;- data.frame(
   x1=rnorm(100),
   x2=sample(100),
   x3=0*(1:100),
   x4=sample(c('a','b','c'),size=100,replace=TRUE),
   x5=as.factor(sample(c('d','e','f'),size=100,replace=TRUE)),
   x6=sample(c(FALSE,TRUE),size=100,replace=TRUE),
   x7=NA + 1:100,
   stringsAsFactors=FALSE)
</pre>
<p>Right now d is a data frame with a lot of variable types we are likely to see in practice (we have left out ordered factors):</p>
<ul>
<li>Nicely distributed continuous values (x1)</li>
<li>Integer values (x2)</li>
<li>Stuck or constant variable (x3)</li>
<li>String values (x4)</li>
<li>Factor values (x5)</li>
<li>Logical values (x6)</li>
<li>A heap of missing values (x7)</li>
</ul>
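<p>It is worth confirming that the frame really contains what we think it does before blaming a package for anything.  A quick sketch (the data frame is rebuilt here so the snippet runs on its own, and the seed is an arbitrary choice that only makes the sketch repeatable):</p>
<pre>
set.seed(2011)
d &lt;- data.frame(
   x1=rnorm(100), x2=sample(100), x3=0*(1:100),
   x4=sample(c('a','b','c'),size=100,replace=TRUE),
   x5=as.factor(sample(c('d','e','f'),size=100,replace=TRUE)),
   x6=sample(c(FALSE,TRUE),size=100,replace=TRUE),
   x7=NA + 1:100,
   stringsAsFactors=FALSE)
# Storage class of each column.
sapply(d, class)
# Number of distinct values per column: 1 flags a stuck column.
sapply(d, function(col) { length(unique(col)) })
# Number of missing values per column.
colSums(is.na(d))
</pre>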
<p>We now augment our data frame with one more input (a duplicated variable) and an output (to be predicted):</p>
<pre>
d$x8 = d$x1
# the trailing '+' keeps the noise and the signal in one expression
d$y = rnorm(100) +
   with(d,20*exp(x1) + x2
   + 7*as.integer(as.factor(x4))
   + 9*as.integer(x5) + 10*as.integer(x6))
</pre>
<p>This is our simple test data set.  Our standard method of fitting a linear model for y (before trying the gam package) is either to (falsely) assume x1&#8217;s contribution is linear or to (amazingly) guess the exact transformation required is exp(x1).</p>
<p>In the first case the model looks like this (using the ggplot2 package):</p>
<p>(As a side note: statisticians usually cluck if you ask for a graph showing &#8220;truth&#8221; as a function of &#8220;prediction&#8221; (or y ~ f(x)).  They say you should be looking at residuals (or (y-f(x)) ~ f(x)) instead (and usually only supply functions that produce residual plots).  But in their own publications, such as the paper we started with, when they want to make a point they actually plot truth as a function of prediction.)</p>
<pre>
library(ggplot2)
m1 &lt;- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x8, data=d)
# trailing '+' continues the plot; theme() replaces the older opts()
ggplot(d,aes(predict(m1),y)) + geom_point(shape=1) +
   geom_abline(slope=1) + theme(aspect.ratio=1)
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/02/m1.png" alt="m1.png" border="0" width="525" height="525" /><br />
</center></p>
<p>And in the second case the model looks like this:</p>
<pre>
m2 &lt;- lm(y ~ exp(x1) + x2 + x3 + x4 + x5 + x6 + x8, data=d)
ggplot(d,aes(predict(m2),y)) + geom_point(shape=1) +
   geom_abline(slope=1) + theme(aspect.ratio=1)
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/02/m2.png" alt="m2.png" border="0" width="525" height="525" /><br />
</center></p>
<p>(In both cases we had to leave out x7, which was all NAs.) Notice how knowing the functional form of x1 moves us from a good fit to an extraordinary fit.  We would like to automatically learn the form of x1&#8217;s contribution from data and get this better fit automatically.  This is in fact the point of the gam package.  To perform the gam fit we add the smoothing symbol s() to each variable whose possibly non-linear contribution we want to try to learn.</p>
<pre>
mG &lt;- gam(y ~ s(x1) + s(x2) + x3 + x4 + x5 + x6 + s(x8), data=d)
ggplot(d,aes(predict(mG),y)) + geom_point(shape=1) +
   geom_abline(slope=1) + theme(aspect.ratio=1)
</pre>
<p>We could not add the s() symbol to any of the un-ordered factors, strings, logicals, the NAs or constant columns.  Except for that the gam package is performing as robustly as the built in lm() function and produces a fit essentially as good as knowing the shape of x1&#8217;s contribution ahead of time:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/02/mG.png" alt="mG.png" border="0" width="525" height="525" /><br />
</center></p>
<p>That is the good part.  But it just wouldn&#8217;t be an R-package without some bad parts.  </p>
<p>The first problem is: at first glance the gam package plot() appears as if it is broken (or at least not a compatible extension of plot.lm or plot.glm, even though gam claims to extend both of those classes).  Traditionally when you call plot() on a model it steps you through a bunch of arcane graphs that statisticians swear are more important than examining the fit directly.  But plot(mG) seems to step through all of a different family of graphs without waiting for user input, and we are left only with the graph that presumably shows the shape adjustment of variable x8.  Only if you anticipate plot() is different for gam than for other models (or dump the function code for plot.gam) do you learn you need to add the argument ask=T.   plot(mG,ask=T)  enters an interactive mode where you can see the inferred shape of each variable.  That is, there is an argument that defaults to a value you don&#8217;t want (or as we say in industry: &#8220;you forgot to set the don&#8217;t lose flag.&#8221;).  gam is actually a very high quality package, but these sorts of &#8220;poison defaults&#8221; are something you have to be in the habit of looking out for in R.</p>
<p>The second problem is intrinsic to the method: we are not guaranteed that either s(x1) or s(x8) looks much like exp(x1).  It is only guaranteed that some linear combination of them does (as that was how they were used in the model).  We can get direct access to the learned reshapings by calling predict() to ask for term contributions and see how the model is linear in the transformed coordinates (essentially with all coefficients 1).</p>
<pre>
pG = predict(mG,type='terms')
summary(lm(d$y ~ pG))

Call:
lm(formula = d$y ~ pG)
Residuals:
    Min      1Q  Median      3Q     Max
-3.5690 -0.7121 -0.0288  0.6924  5.3883
Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(&gt;|t|)
(Intercept) 1.125e+02  1.363e-01  824.92   &lt;2e-16 ***
pGs(x1)     9.994e-01  6.486e-03  154.09   &lt;2e-16 ***
pGs(x2)     1.002e+00  4.905e-03  204.19   &lt;2e-16 ***
pGx3               NA         NA      NA       NA
pGx4        9.973e-01  2.548e-02   39.14   &lt;2e-16 ***
pGx5        1.004e+00  1.926e-02   52.10   &lt;2e-16 ***
pGx6        9.967e-01  2.796e-02   35.64   &lt;2e-16 ***
pGs(x8)     1.057e+00  2.520e-02   41.96   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</pre>
<p>We can examine the transforms (like s(x8) = f(x8)) by plotting:</p>
<pre>
ggplot(d) + geom_point(aes(x8,pG[,'s(x8)']))
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/02/s8.png" alt="s8.png" border="0" width="525" height="525" /><br />
</center></p>
<p>Even though s(x8) has a funny shape it plus s(x1) are an excellent approximation of exp(x1) (with commensurate magnitudes):</p>
<pre>
&gt; lm &lt;- lm(exp(d$x1)~pG[,'s(x1)']+pG[,'s(x8)'])
&gt; summary(lm)

Call:
lm(formula = exp(d$x1) ~ pG[, "s(x1)"] + pG[, "s(x8)"])
Residuals:
      Min        1Q    Median        3Q       Max
-0.118086 -0.014125  0.002899  0.023389  0.217152
Coefficients:
               Estimate Std. Error t value Pr(&gt;|t|)
(Intercept)   1.2536112  0.0046410  270.12   &lt;2e-16 ***
pG[, "s(x1)"] 0.0500079  0.0002126  235.25   &lt;2e-16 ***
pG[, "s(x8)"] 0.0534411  0.0008349   64.01   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04641 on 97 degrees of freedom
Multiple R-squared: 0.9987,	Adjusted R-squared: 0.9986
F-statistic: 3.588e+04 on 2 and 97 DF,  p-value: &lt; 2.2e-16
</pre>
<p>The functions do have different variances (s(x1) is doing most of the work):</p>
<pre>
&gt; var(pG[,'s(x1)'])
[1] 514.8578
&gt; var(pG[,'s(x8)'])
[1] 33.37209
</pre>
<p>Yet the coefficients of mG seem to be gibberish (notice the 0 on s(x8)):</p>
<pre>
&gt; mG$coefficients
(Intercept)       s(x1)       s(x2)          x3         x4b         x4c         x5e
  45.580747   22.749160    1.007585    0.000000    7.048177   13.346365    8.740460
        x5f      x6TRUE       s(x8)
   18.003981   10.102217    0.000000
</pre>
<p>So by poking around we have learned <i>not</i> to look at this slot of the returned model (and it is much cheaper to learn this through this cranky poking around on a trivial example than to learn it while trying to analyze real data or blundering through R&#8217;s overly operational documentation).</p>
<p>The third (and last) problem is one of attitude (and one of the barriers to learning statistics).   There is not a lot of support for exporting the derived gam smoothers (the transforms on the input variables) from R.  The original paper suggests that you should think of the non-parametric smoothers as graphs and use linear interpolation between your data points.  You can get the points to interpolate by calling &#8220;predict(mG,type='terms')&#8221; as we did above.  Or you can try to switch to parametric splines and then run into the same problem that the splines package is not really export friendly.  Or you can ask around.  The R community is generally quite bright and friendly- but every once in a while you get a whiff of statistics territorialism (or perhaps a defensiveness, where if you are correct but not fully general you fear you will look shallow).  Sensible requirements, like wanting to export usable model parameters to another system, are considered naive.  A favorite example of mine: in this <a target="_blank" href="http://r.789695.n4.nabble.com/which-coefficients-for-a-gam-mgcv-model-equation-td1578925.html">help thread</a> somebody who is asking how to configure and export explicit spline transforms to meet an external requirement to get their paper published is advised: &#8220;I think that the referee is being unreasonable here. There are many perfectly respectable ways of estimating GAMs for which no explicit expression for the estimated smooth terms is available (See Hastie and Tibshirani&#8217;s GAM book).&#8221;  And then offered additional references to help educate the referee (instead of recognizing that an explicit sharable solution in hand could, in some cases, be more useful than a maximally general solution that can&#8217;t be communicated succinctly).  It is like lecturing a drowning man on how important water is to fish.</p>
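<p>The linear interpolation idea can be sketched in a few lines of base R.  This is only a sketch under stated assumptions: the vectors x and termContribution are made-up stand-ins for d$x8 and the matching column of the predict() term contributions, and the names smootherLookup and exportTable are ours, not part of the gam package.</p>

```r
# Stand-in data (an assumption): in the post these would be d$x8 and pG[,'s(x8)'].
set.seed(2011)
x <- runif(100, min = 0, max = 10)
termContribution <- sin(x) - mean(sin(x))

# Build a lookup function by linear interpolation between the observed points.
# rule = 2 holds the end values constant outside the observed range;
# ties = mean averages the values of any duplicated x.
smootherLookup <- approxfun(x, termContribution, method = "linear", rule = 2, ties = mean)

# The exportable form is just a sorted table of points that any other
# system can interpolate between.
exportTable <- data.frame(x = sort(x), value = smootherLookup(sort(x)))

# Evaluate the transform at new points (including points outside the data range).
print(smootherLookup(c(-1, 2.5, 11)))
```

<p>The design choice here is to give up any explicit formula and ship the graph itself; whether holding the end values constant outside the training range is appropriate depends on the application.</p>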
<p>In summary- never first try a new R package on real data.  R packages are often realizations of very deep concepts from the literature that bring in their own terminology, trade-offs and attitudes.  You need time to absorb these things in isolation.  Expect and forgive non-essential errors (many important and valuable packages have them).  Approach new packages with a cranky inquisitiveness about the package, otherwise you may actually fall into a non-productive state of frustration.</p>
<p>Also, gam methods are amazing.</p>
<p>(note: we also recommend trying the mgcv package for gam modeling.  It represents a different set of tradeoffs.)</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/12/my-favorite-graphs/' rel='bookmark' title='My Favorite Graphs'>My Favorite Graphs</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/' rel='bookmark' title='Your Data is Never the Right Shape'>Your Data is Never the Right Shape</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Learn Logistic Regression (and beyond)</title>
		<link>http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond</link>
		<comments>http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/#comments</comments>
		<pubDate>Tue, 23 Nov 2010 06:18:33 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Learn a Powerful Machine Learning Tool]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Max-Ent]]></category>
		<category><![CDATA[Maximum Entropy]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regularization]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1578</guid>
		<description><![CDATA[One of the current best tools in the machine learning toolbox is the 1930s statistical technique called logistic regression. We explain how to add professional quality logistic regression to your analytic repertoire and describe a bit beyond that. A statistical analyst working on data tends to deliberately start simple and move cautiously to more complicated methods. When [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>One of the current best tools in the machine learning toolbox is the 1930s statistical technique called logistic regression.  We explain how to add professional quality logistic regression to your analytic repertoire and describe a bit beyond that.<span id="more-1578"></span>A statistical analyst working on data tends to deliberately start simple and move cautiously to more complicated methods.  When first encountering data it is best to think in terms of visualization and exploratory data analysis (in the sense of Tukey).  But in the end the analyst is expected to deliver actionable decision procedures- not just pretty charts.  To get actionable advice the analyst will move up to more complicated tools like pivot tables and Naive Bayes.  Once the problem requires control of interactions and dependencies the analyst must move away from these tools and towards the more complicated statistical tools like standard regression, decision trees and logistic regression.  Beyond that we have machine learning tools such as: kernel methods, boosting, bagging, decision trees and support vector machines.  Which tool is best depends a lot on the situation- and the prepared analyst can quickly try many techniques.  Logistic regression is often a winning method and we will use this article to discuss logistic regression a bit deeper.  By the end of this writeup you should be able to use standard tools to perform a logistic regression and know some of the limitations you will want to work beyond.</p>
<p>Logistic regression was invented in the late 1930s by statisticians Ronald Fisher and Frank Yates.  The definitive resource on this (and other generalized linear models) is Alan Agresti &#8220;Categorical Data Analysis&#8221; 2002, New York, Wiley-Interscience.   Logistic regression is a &#8220;categorical&#8221; tool in that it is used to predict categories (fraud/not-fraud, good/bad &#8230;) instead of numeric scores (like standard regression).  For example: consider the &#8220;car&#8221; data set from the <a target="_blank" href="http://archive.ics.uci.edu/ml/">UCI machine learning database</a> ( <a  target="_blank" href="http://archive.ics.uci.edu/ml/machine-learning-databases/car/">data</a>, <a target="_blank" href="http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names">description</a> ).  The data is taken from a consumer review of cars.  Each car is summarized by 6 attributes ( price, maintenance costs, doors, storage size, seating and safety ); there is also a conclusion column that contains the final overall recommendation (unacceptable, acceptable,  good, very good).  The machine learning problem is to infer the reviewer&#8217;s relative importance or weight of each feature.  This could be used to influence a cost constrained design of a future car.  This dataset was originally used to demonstrate hierarchical and decision tree based expert systems.  But logistic regression can quickly derive interesting results.</p>
<p>Let us perform a complete analysis together (at least in our imaginations if not with our actual computers).  First download and install the excellent free analyst&#8217;s workbench called <a target="_blan" href="http://www.R-project.org/">&#8220;R&#8221;</a>.  This software package is an implementation of John Chambers&#8217; S language (a statistical language designed to allow for self-service statistics to relieve some of Chambers&#8217; consulting responsibilities) and a near relative of the SPlus system.  This system is powerful and has a number of very good references (our current favorite being <a href="http://oreilly.com/catalog/9780596801717">Joseph Adler &#8220;R in a nutshell&#8221; 2009, O&#8217;Reilly Media</a>).  Using R we can build an interesting model in 2 lines of code.</p>
<p>After installing R start the program and copy the following line into the R command shell.</p>
<pre>
CarData <- read.table(url('http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'),
    sep=',',col.names=c('buying','maintenance','doors','persons','lug_boot','safety','rating'))
</pre>
<p>This has downloaded the car data directly from the UCI database and added a header line so we can refer to variables by name.  To see roll-ups or important summaries of our data we could just type in "summary(CarData)."  But we will move on with the promised modeling.  Now type in the following line:</p>
<pre>
logisticModel <- glm(rating!='unacc' ~ buying + maintenance + doors + persons + lug_boot +safety,
    family=binomial(link = "logit"),data=CarData)
</pre>
<p>We now have a complete logistic model.  To examine this model we type "summary(logisticModel)".  And see the following (rather intimidating) summary:</p>
<pre>
Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -28.4255  1257.5255  -0.023    0.982
buyinglow           5.0481     0.5670   8.904  < 2e-16 ***
buyingmed           3.9218     0.4842   8.100 5.49e-16 ***
buyingvhigh        -2.0662     0.3747  -5.515 3.49e-08 ***
maintenancelow      3.4064     0.4692   7.261 3.86e-13 ***
maintenancemed      3.4064     0.4692   7.261 3.86e-13 ***
maintenancevhigh   -2.8254     0.4145  -6.816 9.36e-12 ***
doors3              1.8556     0.4042   4.591 4.41e-06 ***
doors4              2.4816     0.4278   5.800 6.62e-09 ***
doors5more          2.4816     0.4278   5.800 6.62e-09 ***
persons4           29.9652  1257.5256   0.024    0.981
personsmore        29.5843  1257.5255   0.024    0.981
lug_bootmed        -1.5172     0.3758  -4.037 5.40e-05 ***
lug_bootsmall      -4.4476     0.4750  -9.363  < 2e-16 ***
safetylow         -30.5045  1300.3428  -0.023    0.981
safetymed          -3.0044     0.3577  -8.400  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</pre>
<p>We also saw the cryptic warning message "glm.fit: fitted probabilities numerically 0 or 1 occurred" which we will discuss later.  Returning to our result we see a left column that is formed by concatenating variable names and variable values (values are called "levels" when they are strings).  For example the label "buyinglow" is a combination of "buying" and "low" meaning a low purchase price.  The next column (and the last one we will dig into) is the score associated with this combination of variable and level.  The interpretation is that a car that has "buyinglow" is given a 5.0481 point score bonus.  Whereas a car with "safetylow" is given a -30.5045 scoring penalty.  In fact the complete prediction procedure for a new car is to look up the levels specified for all 6 variables and add up the matching scores (plus the "(Intercept)" score of -28.4255 which is used as an initial score).  Any value not found is assumed to be zero.  This summed-up score is called the "link" and is essentially the model prediction.  Positive link values are associated with acceptable cars and negative link values are associated with unacceptable cars.  For example the first car in our data set is:</p>
<pre>
  buying maintenance doors persons lug_boot safety rating
   vhigh       vhigh     2       2    small    low  unacc
</pre>
<p>Adding up the matching coefficients shown above, our scoring procedure assigns a very bad score of -68 to this car, correctly predicting the "unacc" rating.  We can examine the error rates of our model with the single line:</p>
<pre>
table(CarData$rating,predict(logisticModel,type='response')>=0.5)
</pre>
<p>Which yields the result:</p>
<pre>
        FALSE TRUE
  acc      32  352
  good      0   69
  unacc  1166   44
  vgood     0   65
</pre>
<p>This diagram is called a "contingency table" and is a very powerful summary.  The rows are labeled with the ratings assigned at training time (unacceptable, acceptable, good and very good).  The columns FALSE and TRUE denote whether the model predicted the car was unacceptable (FALSE) or at least acceptable (TRUE).  From the row "unacc" we see that 1166 of the 1166+44 unacceptable cars were correctly predicted FALSE (or not at least acceptable).  Also notice the only false negatives are the 32 FALSEs in the "acc" row- none of the good or very good cars were predicted to be unacceptable.  We can also look at this graphically:</p>
<pre>
library(lattice)
# density of link scores, grouped by whether the car was rated at least acceptable
densityplot(predict(logisticModel,type='link'),groups=CarData$rating!='unacc',auto.key=TRUE)
</pre>
<p>Which yields the following graph:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/11/density1.png" alt="density1.png" border="0" width="525" height="525" /><br />
</center></p>
<p>This is an area density chart.  Each car that was defined as being unacceptable adds a single blue circle to the bottom of the chart.  Each car that was defined as being acceptable adds a single magenta circle to the bottom of the chart.  The left-right position of each circle is the link score the model assigned to that car.  There are so many circles that they start to overlap into solid smudges.  To help with this the charting software adds the density curves above the circles.  Density curves are a lot like histograms- the height of the curve indicates what fraction of the circles of the same color are under that region of the curve.  So we can see most of the blue circles are in 3 clusters centered at -55, -30 and -5 while the magenta circles are largely clustered around +5.  From a chart like this you can see that a decision procedure of saying a link score above zero is good and below zero is bad would be pretty accurate (most of the blue is to the left and most of the magenta is to the right).  In fact this rule would be over 95% accurate (though accuracy is not a be-all measure, see: <a target="_none" href="http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/">Accuracy Measures</a>).</p>
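<p>The accuracy figure can be checked by hand from the contingency table shown earlier.  The sketch below simply re-types those counts (the variable names counts, correct and accuracy are ours) and treats "acc", "good" and "vgood" as the positive class, matching the rating!='unacc' target used to fit the model.</p>

```r
# Counts copied from the contingency table: rows are true ratings,
# columns are the model prediction (TRUE = predicted at least acceptable).
counts <- matrix(c(  32, 352,
                      0,  69,
                   1166,  44,
                      0,  65),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(rating = c("acc", "good", "unacc", "vgood"),
                                 predicted = c("FALSE", "TRUE")))

# Correct predictions: unacceptable cars predicted FALSE plus
# at-least-acceptable cars predicted TRUE.
correct <- counts["unacc", "FALSE"] + sum(counts[c("acc", "good", "vgood"), "TRUE"])
accuracy <- correct / sum(counts)
print(accuracy)   # 1652 correct out of 1728 cars

# The two kinds of error, counted separately.
falseNegatives <- sum(counts[c("acc", "good", "vgood"), "FALSE"])
falsePositives <- counts["unacc", "TRUE"]
```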
<p>So far so good.  We have built a model that accurately predicts conclusions based on inputs (so it could be used to generate new ratings) and furthermore our model has visible "Estimate" coefficients that report the relative strength of each feature (which we could use prescriptively in valuing trade-offs in designing a new car).  Except we need to go back to the warning message: "glm.fit: fitted probabilities numerically 0 or 1 occurred."  For a logistic regression the only way the model fitter can encounter "probabilities numerically 0 or 1" is if the link score is far outside a reasonable range around zero (say +-20).  Remember we saw link scores as low as -68.  With a link score of -68 the probability of the car in question being acceptable is around 2*10^-30.  This is from a training set that only included 1728 items, so we really can not expect it to tell us about events much rarer than one in a thousand.  (We are deliberately confusing two different types of probabilities here, but it is a good way to think.)</p>
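<p>To make the arithmetic concrete, here is a small sketch that rebuilds the link score of the first car from the coefficient table and converts it to a probability.  The coefficients are re-typed by hand from the summary output (so they carry its rounding), and the names coefs, linkScore and prob are ours.</p>

```r
# Coefficients that apply to the first car (vhigh, vhigh, 2, 2, small, low).
# doors=2 and persons=2 are reference levels and contribute 0.
coefs <- c("(Intercept)"    = -28.4255,
           buyingvhigh      =  -2.0662,
           maintenancevhigh =  -2.8254,
           lug_bootsmall    =  -4.4476,
           safetylow        = -30.5045)

linkScore <- sum(coefs)
print(linkScore)          # about -68

# plogis() is the sigmoid: it maps a link score to a probability.
prob <- plogis(linkScore)
print(prob)               # far below double precision epsilon (about 2.2e-16)

# For comparison, link scores of +-20 are already very close to 0 and 1.
print(plogis(c(-20, 20)))
```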
<p>What is the cause of these extreme link scores?  Extreme coefficient estimates.  The +29.96 preference for "persons4" (cars that seat 4) is a huge vote that swamps out effects from purchase price and maintenance cost.  The model has overfit and made some dangerous extreme assumptions.  What causes this are variable and level combinations that have no falsification in the data set.  For example: suppose only one car had the variable level "persons4" and that car happened to be acceptable.  The logistic modeling package could always raise the link score of that single car by putting a bit more weight on the "persons4" estimate.  Since this variable level shows up only in positive examples (and in this case only one example) there is absolutely no penalty for increasing the coefficient.  Logistic regression models are built by an optimizer.  And when an optimizer finds a situation with no penalty- it abuses the situation to no end.  This is what the warning was reporting.  All link numbers map to probabilities between zero and one; only ridiculously large link values map to probabilities near one (and only ridiculously small values map to probabilities near zero).  The optimizer was caught trying to make some of the coefficients "run away" or "run to infinity."  There are some additional diagnostic signs (such as the large coefficients, large standard errors and low significance levels), but there is no advice offered by the system in how to deal with this.  The standard methods are to suppress problem variables and levels (or suppress data with the problem variables and levels present) from the model.  But this is inefficient in that the only way we have of preventing a variable from appearing to be too useful is not to use it.  These are exactly the variables we do not want to eliminate from the model, but they are unsafe to keep in the model (their presence can cause unreliable predictions on new data not seen during training).</p>
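<p>The runaway effect is easy to reproduce on a toy example.  The data below is made up for illustration (it is not the car data): level "b" of a single categorical variable occurs only in positive examples, so nothing in the data penalizes a larger coefficient for it.  The names level, acceptable and runaway are ours.</p>

```r
# Toy data (an assumption): level "a" is half positive, level "b" is always positive.
level <- factor(c(rep("a", 20), rep("b", 5)))
acceptable <- c(rep(c(TRUE, FALSE), 10), rep(TRUE, 5))

# Standard logistic regression; any fitting warnings are suppressed here
# so the sketch runs quietly.
runaway <- suppressWarnings(
  glm(acceptable ~ level, family = binomial(link = "logit")))

# Look for a large estimate with a much larger standard error on "levelb".
print(summary(runaway)$coefficients)
```

<p>The estimate for "levelb" stops growing only because the fitter gives up iterating, not because the data supports any particular value.</p>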
<p>What can we do to fix this?  We need to ensure that running a coefficient to infinity is not without cost.  One way to achieve this would be something like Laplace smoothing where we enter two made-up data items: one that has every level set on and is acceptable and one that has every level set on and is unacceptable.  Unfortunately there is no easy way to do this from the user-layer in R.  For example each datum can only have one value set for each categorical variable- so we can't define a datum that has all features on.  Another way to fix this would be to directly penalize large coefficients (like Tychonoff regularization in linear algebra).  Explicit regularization is a good idea and very much in the current style.  Again, unfortunately, the R user layer does not expose a regularization control to the user.  But one of the advantages of logistic regression is that it is relatively easy to implement (harder than Naive Bayes or standard regression in that it needs an optimizer, but easier than SVM in that the optimizer is fairly trivial).</p>
<p>The logistic regression optimizer works to find a model of probabilities p() that maximizes the sum:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/11/cost1.png" alt="cost1.png" border="0" width="644" height="56" /><br />
</center></p>
<p>Or in English: assign large probabilities to examples known to be positive and small probabilities to examples known to be negative.  Now the logistic model assigns probabilities using a function of the form:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/11/p.png" alt="p.png" border="0" width="244" height="83" /><br />
</center></p>
<p>The beta is the vector of model parameters and the x is the data associated with a given example.  The dot product of beta and x is the link score we saw earlier.  The rest of the function is called the sigmoid (also used by neural nets) and its inverse is called the "logit" which is where logistic regression gets its name.  Now this particular function (and others that have the so-called "canonical link") has the property that the gradient (the vector of derivative directions indicating the direction of most rapid score improvement) is particularly beautiful.  The gradient vector is:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/11/grad1.png" alt="grad1.png" border="0" width="410" height="56" /><br />
</center></p>
<p>This quantity is a vector because it is a weighted sum over the data_i (which are all vectors of feature values and value/level indicators).  As we expect the gradient to be zero at an optimal point we now have a set of equations we expect to be simultaneously satisfied at the optimal model parameters.  In fact these equations are enough to essentially determine the model- find parameter values that satisfy all of this vector equation and you have found the model (this is usually done by the Newton–Raphson method or by Fisher Scoring).  As an aside this is also the set of equations that the maximum entropy method must specify; which is why for probability problems maximum entropy and logistic regression models are essentially identical.  </p>
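<p>We can check the zero-gradient condition numerically on a standard glm() fit.  The sketch below uses made-up synthetic data (not the car data) and assumes the gradient has the usual form for the canonical link: the sum over examples of (y_i - p_i) times the feature vector x_i.  All variable names are ours.</p>

```r
# Synthetic data (an assumption): two features and a noisy binary outcome.
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.3)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.6 * x2))

model <- glm(y ~ x1 + x2, family = binomial(link = "logit"))

# Design matrix (with intercept column) and fitted probabilities.
X <- model.matrix(model)
p <- predict(model, type = "response")

# Gradient of the log likelihood at the fitted parameters: sum_i (y_i - p_i) x_i.
gradient <- as.vector(t(X) %*% (y - p))
names(gradient) <- colnames(X)
print(signif(gradient, 3))   # each entry should be numerically near zero

# Equivalent reading: observed feature totals on the positive examples
# match the totals expected under the modeled probabilities.
observedTotals <- colSums(X * y)
modeledTotals <- colSums(X * p)
print(rbind(observedTotals, modeledTotals))
```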
<p>If we directly add a penalty for large coefficients to our original scoring function as below:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/11/cost2.png" alt="cost2.png" border="0" width="731" height="56" /><br />
</center></p>
<p>Beta is (again) our model weights (laid out in the same order as the per-datum variables) and epsilon (>=0) is the new user control for regularization.  A small positive epsilon will cause regularization without great damage to model performance (for our example we used epsilon = 0.1). Now our optimal gradient equations (or set of conditions we must meet) become:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/11/grad2b.png" alt="grad2b.png" border="0" width="485" height="56" /><br />
</center></p>
<p>Or- instead of the model having to reproduce all known feature summaries (if a feature is on in 30% of the positive training data then it must be on for 30% of the modeled positive probabilities) we now have a slop term of 2 epsilon beta.  To the extent a coefficient is large its matching summary is allowed slack (making it harder for the summary to drive the coefficient to infinity).  This system of equations is as easy to solve as the original system (a slightly different update is used in the Newton-Raphson method) and we get a solution as below:</p>
<pre>
variable	kind		level	value
		Intercept		-2.071592578277024
buying		Categorical	high	-1.8456895489650629
buying		Categorical	low	2.024816396087424
buying		Categorical	med	1.257553912038549
buying		Categorical	vhigh	-3.508273337437926
doors		Categorical	2	-1.8414721595612646
doors		Categorical	3	-0.391932359146582
doors		Categorical	4	0.08090597021546056
doors		Categorical	5more	0.08090597021546052
lug_boot	Categorical	big	0.8368739895290729
lug_boot	Categorical	med	-0.2858686820997001
lug_boot	Categorical	small	-2.62259788570632
maintenance	Categorical	high	-1.274387434600252
maintenance	Categorical	low	1.3677275941857292
maintenance	Categorical	med	1.3677275941857292
maintenance	Categorical	vhigh	-3.5326603320481342
persons		Categorical	2	-8.360699258777752
persons		Categorical	4	3.2937880724973065
persons		Categorical	more	2.9953186080035197
safety		Categorical	high	4.160122946359636
safety		Categorical	low	-8.028169713081502
safety		Categorical	med	1.7964541884449603
</pre>
<p>This model has essentially identical performance and much smaller coefficients.  From a performance point of view this is essentially the same model.  What has changed is the model no longer is able to pick up a small bonus by "piling on" a coefficient.  For example moving a link value to infinity moves the probabilities from 0.999 to 1.0 which in turn moves the data penalty (assuming the data point is positive) from log(0.999) to log(1.0) or from -0.001 to 0.0.  The approximate 1/1000 score improvement is offset by a penalty proportional to the size of the coefficient- making the useless "adding of nines" no longer worth it.  Or, as has become famous with large margin classifiers, it is important to improve what the model does on probability estimates near 0.5 not estimates already near 0 or 1.</p>
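<p>For readers who want to experiment, here is a minimal sketch of this kind of regularized fit.  It is not the custom tool that produced the table above: it uses R's general purpose optim() with BFGS instead of a modified Newton-Raphson update, it runs on a made-up toy data set in which level "b" appears only in positive examples, and it assumes the penalty is epsilon times the sum of squared coefficients (intercept included), which is what gives the 2 epsilon beta term in the gradient.  The function names are ours.</p>

```r
# Toy data (an assumption): level "b" is never seen in a negative example.
f <- factor(c(rep("a", 20), rep("b", 5)))
y <- c(rep(c(1, 0), 10), rep(1, 5))
X <- model.matrix(~ f)   # columns: intercept, indicator for level "b"

# Penalized log likelihood:
#   sum_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ] - epsilon * sum(beta^2)
penalizedLogLik <- function(beta, X, y, epsilon) {
  link <- as.vector(X %*% beta)
  # log(p) and log(1-p) computed stably from the link score
  sum(y * plogis(link, log.p = TRUE) + (1 - y) * plogis(-link, log.p = TRUE)) -
    epsilon * sum(beta^2)
}

# Its gradient: sum_i (y_i - p_i) x_i - 2 * epsilon * beta
penalizedGrad <- function(beta, X, y, epsilon) {
  p <- plogis(as.vector(X %*% beta))
  as.vector(t(X) %*% (y - p)) - 2 * epsilon * beta
}

fitRegularized <- function(X, y, epsilon) {
  # fnscale = -1 asks optim() to maximize instead of minimize
  optim(rep(0, ncol(X)), penalizedLogLik, gr = penalizedGrad,
        X = X, y = y, epsilon = epsilon,
        method = "BFGS", control = list(fnscale = -1))$par
}

betaReg <- fitRegularized(X, y, epsilon = 0.1)
names(betaReg) <- colnames(X)
print(betaReg)   # the coefficient on level "b" stays modest instead of running away
```

<p>Both fits classify the "b" examples as positive; the difference is that the penalized coefficient settles at a finite value where the small gain in log likelihood no longer pays for the growth in the penalty.</p>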
<p>As expected: the coefficients are significantly different than those of the standard logistic regression.  For example the original model has 3 variables with extreme levels (the intercept, number of passengers and safety) while the new model sees extreme values (but much smaller ones) only for number of persons and safety (which are likely correlated).  Also, the difference between the buying levels low and very high is about 7 in the original model (5 - -2) and about 5.5 in the new model (2 - -3.5); these differ by 1.5, or around 3 of the reported standard errors (indicating the significance summaries are not enough to certify the location of model parameters).  It is not just that all of the coefficients have shifted, many of the differences are smaller (and others are not- changing the emphasis of the model).  We do not want to overstate differences- we are not so much looking for something better than standard logistic regression as adding an automatic safety that saves us both the effort and loss of predictive power found in fixing models by suppressing unusually distributed (but useful) variables and levels.</p>
<p>An analyst is well served to have logistic regression (and the density plots plus contingency table summaries) as ready tools.  These methods will take you quite far.  And if you start hitting the limits of these tools you can, as we do, bring in custom tools that allow for explicit regularization yielding effective and reliable results.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Must Have Software</title>
		<link>http://www.win-vector.com/blog/2010/05/must-have-software/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=must-have-software</link>
		<comments>http://www.win-vector.com/blog/2010/05/must-have-software/#comments</comments>
		<pubDate>Fri, 28 May 2010 17:26:07 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computers]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Excel]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[GnuPG]]></category>
		<category><![CDATA[Keynote]]></category>
		<category><![CDATA[Latex]]></category>
		<category><![CDATA[Must Have Software]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[TrueCrypt]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1461</guid>
		<description><![CDATA[Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX), Windows (NT4, 2000, XP, Vista and 7) for quite a while I have seen a lot of different software tools. I would like to quickly exhibit my &#8220;must have&#8221; list. These are the packages that I find to be the single &#8220;must have offerings&#8221; in [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/' rel='bookmark' title='&#8220;The Mythical Man Month&#8221; is still a good read'>&#8220;The Mythical Man Month&#8221; is still a good read</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/microsoft-store-again/' rel='bookmark' title='Microsoft Store Again'>Microsoft Store Again</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX) and Windows (NT4, 2000, XP, Vista and 7) for quite a while I have seen a lot of different software tools.  I would like to quickly exhibit my &#8220;must have&#8221; list.  These are the packages that I find to be the single &#8220;must have offerings&#8221; in a number of categories.  I have avoided some categories (such as editors, email programs, programming languages, IDEs, photo editors, backup solutions, databases, database tools and web tools) where I have no feeling of having seen a single absolute best offering.</p>
<p>The spirit of the list is to pick items such that: if you disagree with an item in this list then either you are wrong or you know something I would really like to hear about.</p>
<p><span id="more-1461"></span></p>
<dl>
<dt><strong>Encryption, disk images: <a href="http://www.truecrypt.org/" target="ext">TrueCrypt</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>TrueCrypt can create portable encrypted virtual disks (files that can be mounted as a disk on any operating system).</dd>
<dd></dd>
<dt><strong>Encryption, files: <a href="http://www.gnupg.org/" target="ext">GnuPG</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>GnuPG is the tool to use to encrypt files for email.</dd>
<dd></dd>
<dt><strong>Presentation: <a href="http://www.apple.com/iwork/keynote/" target="ext">Apple Keynote</a> (commercial: OSX)</strong></dt>
<dd>Keynote is not quite as friendly as Microsoft PowerPoint, but it quickly produces beautiful presentations.</dd>
<dt><strong>Reference Library: <a href="http://mekentosj.com/papers/" target="ext">Papers</a> (commercial: OSX)</strong></dt>
<dd>&#8220;iTunes for PDF.&#8221;  Manage thousands of PDFs and references, annotate with meta-data, place papers into multiple project folders.  An interesting runner-up is <a href="http://bibdesk.sourceforge.net/" target="ext">BibDesk</a> (open source: OSX).</dd>
<dt><strong>Spreadsheet: <a href="http://office.microsoft.com/en-gb/excel/default.aspx" target="ext">Microsoft Excel</a> (commercial: Windows, OSX)</strong></dt>
<dd>Open Office and Google Docs are getting better every day, but neither comes close to Microsoft Excel in functionality and versatility of user interface.  If you are on a platform that supports Excel, work regularly with spreadsheets and use something other than Excel, it really means that you do not value your time.</dd>
<dt><strong>Statistics Software: <a href="http://www.r-project.org/" target="ext">R</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>R is rapidly becoming the platform of choice for statisticians and is (with the addition of lattice and ggplot2) the best way to produce graphs.  R has a fairly nasty programming language, but has so many statistical operations available that it cannot be avoided.</dd>
<dt><strong>Technical Documentation: <a href="http://www.tug.org/" target="ext">LaTeX</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>It may seem antiquated but TeX/LaTeX is still far more powerful than the &#8220;WYSIWYG&#8221; pretenders.  The separation of presentation from specification, automatic management of references and tables of contents, and the ability to include PDFs from external files (which get refreshed when you re-build the document) are all lifesavers.</dd>
<dt><strong>Version Control: <a href="http://git-scm.com/" target="ext">git</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>Just about the only version control system that: doesn&#8217;t damage the data you are trying to manage by adding dot-files into all of the directories, can routinely handle large files and can work productively without a network connection.  <a href="http://www.perforce.com/" target="ext">Perforce</a> is a powerful central-server commercial option (with the ability to have central policies, control and review).
</dd>
</dl>
<p></p>
<p>I look forward to learning which of my choices are considered poor and what your must-haves are.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/' rel='bookmark' title='&#8220;The Mythical Man Month&#8221; is still a good read'>&#8220;The Mythical Man Month&#8221; is still a good read</a></li>
<li><a href='http://www.win-vector.com/blog/2009/07/microsoft-store-again/' rel='bookmark' title='Microsoft Store Again'>Microsoft Store Again</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/05/must-have-software/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>R annoyances</title>
		<link>http://www.win-vector.com/blog/2010/03/r-annoyances/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=r-annoyances</link>
		<comments>http://www.win-vector.com/blog/2010/03/r-annoyances/#comments</comments>
		<pubDate>Sat, 20 Mar 2010 18:49:42 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Principle of Least Astonishment]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[R is not your friend]]></category>
		<category><![CDATA[R programming annoyances]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1407</guid>
		<description><![CDATA[Readers returning to our blog will know that Win-Vector LLC is fairly &#8220;pro-R.&#8221; You can take that to mean &#8220;in favor of R&#8221; or &#8220;professionally using R&#8221; (both statements are true). Some days we really don&#8217;t feel that way. Consider the following snippet of R code where we create a list with a single element [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/' rel='bookmark' title='CRU graph yet again (with R)'>CRU graph yet again (with R)</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Survive R'>Survive R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Readers returning to our blog will know that Win-Vector LLC is fairly &#8220;pro-<a href="http://www.win-vector.com/blog/tag/r/" target="x">R</a>.&#8221;  You can take that to mean &#8220;in favor of R&#8221; or &#8220;professionally using R&#8221; (both statements are true).  Some days we really don&#8217;t feel that way.  <span id="more-1407"></span><br />
Consider the following snippet of R code where we create a list with a single element named &#8220;x&#8221; that refers to a numeric vector.  We start with a demonstration of the hard-coded method of pulling the x-value back out using the &#8220;$&#8221; operator.</p>
<pre>
&gt; l &lt;- list(x=c(1,2,3))
&gt; l$x
[1] 1 2 3
</pre>
<p>But suppose we wanted to automate this; that is pass in the name of the value we want in a variable.  We are after all using a computer, so automating a step seems like a reasonable desire.  R supplies a notation for this using the &#8220;[]&#8221; operator.  But something slightly different comes out under the &#8220;[]&#8221; operator than under the &#8220;$&#8221; operator:</p>
<pre>
&gt; varName &lt;- 'x'
&gt; l[varName]
$x
[1] 1 2 3
</pre>
<p>Notice that the printed outputs are slightly different (one echoes "$x" and one does not).  Let's use the "class()" method to see what is actually being returned in each case.</p>
<pre>
&gt; class(l$x)
[1] "numeric"
&gt; class(l['x'])
[1] "list"
</pre>
<p>The two operators return completely different types (in one case a numeric vector, in the other a general list; these are not interchangeable types).</p>
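<p>For completeness: R does have a third extraction operator, "[[]]", that accepts a name held in a variable and hands back the element itself rather than a one-element list.  A minimal sketch, re-using the list and variable name from above:</p>
<pre>
&gt; l &lt;- list(x=c(1,2,3))
&gt; varName &lt;- 'x'
&gt; l[[varName]]
[1] 1 2 3
&gt; class(l[[varName]])
[1] "numeric"
&gt; identical(l[[varName]], l$x)
[1] TRUE
</pre>
<p>So the automated form exists, but you have to already know that "[]" and "[[]]" mean different things to find it.</p>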
<p>At this point you may think it is time to turn in our "pro" label and call ourselves "newb" (Internet slang for "newbie" or "idiot").  But let's slow down for a bit.   When two views of the same situation disagree (such as the difference in opinion between the authors of R and myself on whether the "[]" and "$" operators should return the same type) the most you know is that at least one of those views is wrong.  You don't know whether either view is right or, if one is, which one.  I can, however, bring in an additional argument to try to show that the design of R is in fact wrong.  The additional argument is <a href="http://en.wikipedia.org/wiki/Principle_of_least_astonishment" target="o">"The Principle of Least Astonishment."</a>  This principle roughly says that it is a mistake to introduce unnecessary differences in outcomes (which to the unprepared user are unpleasant surprises).  There may be some deep (yet obscure) reasons the two operators return different results.  But the fact that you would have to find a way to document and explain these differences should make one think that this situation is a mis-design and that the "explanation" is an attempt at a workaround.  Or to put it more rudely: there may be an explanation, but there is no excuse.</p>
<p>For another example consider creating a 3 by 3 matrix:</p>
<pre>
&gt; m &lt;- matrix(c(1,2,3,1,1,1,0,0,1),nrow=3,ncol=3)
&gt; m
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    0
[3,]    3    1    1
</pre>
<p>Now select the last two rows of the matrix.</p>
<pre>
&gt; m[c(FALSE,TRUE,TRUE),]
     [,1] [,2] [,3]
[1,]    2    1    0
[2,]    3    1    1
&gt;
</pre>
<p>Now (for the punchline) try to select just the middle row of the matrix.</p>
<pre>
&gt; m[c(FALSE,TRUE,FALSE),]
[1] 2 1 0
</pre>
<p>Notice that once again (and without warning) the result is subtly different.  I admit that it seems paranoid to worry about such small differences, but when you are debugging a system that should work these are exactly the killing mistakes you are looking for.  In this case the problem is pretty bad.  See what happens if you ask for the dimensions of each of these differing returns:</p>
<pre>
&gt; dim(m[c(FALSE,TRUE,TRUE),])
[1] 2 3
&gt; dim(m[c(FALSE,TRUE,FALSE),])
NULL
</pre>
<p>The first case works fine (reports 2 rows and 3 columns).  The second case returns "NULL" (instead of 1 row and 3 columns).   In R NULL is sometimes used as an error-value (instead of throwing an exception) and this value will poison any further conditions or calculations it is involved in.  The main way to deal with the arbitrary introduction of such NULLs is the incredibly tedious and uncertain defensive coding that we argue against in <a href="http://www.win-vector.com/blog/2010/02/postels-law-not-sure-who-to-be-angry-with/">Postel’s Law: Not Sure Who To Be Angry With</a>.  Such code weakens both programs and programmers.</p>
<p>But what is going on in this example?  Once again we use the "class()" method to inspect the subtly different results.</p>
<pre>
&gt; class(m[c(FALSE,TRUE,TRUE),])
[1] "matrix"
&gt; class(m[c(FALSE,TRUE,FALSE),])
[1] "numeric"
</pre>
<p>The result is disappointing.  For a two-row select R returns a matrix (what we would expect).  For a single-row select R does us the "favor" of converting the result into a vector.  This is a disaster.  A single row matrix is similar to a vector, but even R itself does not support the same set of operations and outcomes on vectors as it does on matrices (for example the failure of the "dim()" method).  It is not safe to further calculate with these results (without by-hand converting the result back to a single row matrix which R can in fact represent).  In my case this created crashing bugs deep in a long running analysis (and was hard to diagnose as the bug was in an "innocent operation" not in a "risky calculation").</p>
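<p>One way to avoid the by-hand conversion is the <tt>drop=FALSE</tt> argument of the "[]" operator, which asks R to keep the result as a matrix even when a single row is selected.  A short sketch using the matrix from above (the name <tt>r</tt> is just for illustration):</p>
<pre>
&gt; m &lt;- matrix(c(1,2,3,1,1,1,0,0,1),nrow=3,ncol=3)
&gt; r &lt;- m[c(FALSE,TRUE,FALSE),,drop=FALSE]
&gt; r
     [,1] [,2] [,3]
[1,]    2    1    0
&gt; dim(r)
[1] 1 3
</pre>
<p>The catch is that every matrix subscript in the analysis has to remember to ask for this; the default behavior is still the surprising one.</p>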
<p>All of this has to violate John Chambers' "Prime Directive" for data: "an obligation on all creators of software to program in such a way that the computations can be understood and trusted."  Chambers' opinion is relevant, as he is the author of the S language (of which R is an open source re-implementation).  We continue to recommend R, but we also recommend being exceptionally careful when using it (which unfortunately adds time to projects).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/' rel='bookmark' title='CRU graph yet again (with R)'>CRU graph yet again (with R)</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Survive R'>Survive R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/03/r-annoyances/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>CRU graph yet again (with R)</title>
		<link>http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=cru-graph-yet-again-with-r</link>
		<comments>http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/#comments</comments>
		<pubDate>Sun, 13 Dec 2009 19:25:00 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Rants]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Climate]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regression]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1195</guid>
		<description><![CDATA[IowaHawk has an excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produce similarly bad results using R. If the re-constructed technique is close to what was originally done then so many bad moves were taken that you can&#8217;t learn much [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='R annoyances'>R annoyances</a></li>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Survive R'>Survive R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>IowaHawk has an excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: <a href="http://iowahawk.typepad.com/iowahawk/2009/12/fables-of-the-reconstruction.html">Fables of the Reconstruction</a>.   We thought we would show how to produce similarly bad results using R.<br />
<span id="more-1195"></span></p>
<p>If the re-constructed technique is close to what was originally done then so many bad moves were made that you can&#8217;t learn much of anything from the original &#8220;result.&#8221;   This points out some of the pitfalls of not performing hold-out tests, not examining the modeling diagnostics and not remembering that linear regression models fail to low-variance models (i.e. when they fail they do a good job predicting the mean and vastly under-estimate variance).</p>
<p>Our article is not an article on global warming, but an article on analysis technique.  Human driven global warming is either happening or not happening independent of any bad analysis.  Finding the physical truth is a bigger, harder job than eliminating some bad reports (the opposite of a bad report is not necessarily the truth). Bad analyses can have many different sources (mistakes, trying to jump ahead of your colleagues on something you believe is true, trying to fake something you believe is false, or being figments of overly harsh critics) and we have not heard enough to make any accusations.</p>
<p>First: load the data (I re-formatted it a bit so <a href="http://cran.r-project.org/">R</a> can read it: <a href="http://www.win-vector.com/blog/wp-content/uploads/2009/12/jonesmannrogfig2c.txt">jonesmannrogfig2c.txt</a>, <a href="http://www.win-vector.com/blog/wp-content/uploads/2009/12/data1400.dat_.txt">data1400.dat_.txt</a>), perform the principal components reduction and fit a first model.</p>
<pre>
&gt; library(lattice)
&gt; d1400 &lt;- read.table('data1400.dat.txt',sep='\t',header=FALSE)
&gt; d1400r &lt;- as.matrix(d1400[,2:23])
&gt; pcomp &lt;- prcomp(na.omit(d1400r))
&gt; plot(pcomp)
&gt; vars &lt;- data.frame(cbind(Year=d1400[,1],d1400r %*% pcomp$rotation),row.names=d1400[,1])
&gt; jones &lt;- read.table('jonesmannrogfig2c.txt',sep='\t',header=TRUE)
&gt; datUnion &lt;- merge(vars,jones,all=TRUE)
&gt; datUnion$avgTemp &lt;- with(datUnion,(NH+CET+Central.Europe+Fennoscandia)/4.0)
&gt; model &lt;- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 ,dat=datUnion)
&gt; summary(model)

Call:
lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5, data = datUnion)

Residuals:
       Min         1Q     Median         3Q        Max
-0.8811679 -0.2658117  0.0008174  0.2933058  1.0450044 

Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)
(Intercept)  0.0065252  0.5750696   0.011   0.9910
PC1         -0.0001683  0.0003912  -0.430   0.6679
PC2         -0.0003678  0.0010114  -0.364   0.7168
PC3          0.0003177  0.0014821   0.214   0.8307
PC4          0.0044084  0.0019351   2.278   0.0246 *
PC5          0.0188520  0.0205137   0.919   0.3601
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.4505 on 113 degrees of freedom
  (484 observations deleted due to missingness)
Multiple R-squared: 0.05223,	Adjusted R-squared: 0.01029
F-statistic: 1.245 on 5 and 113 DF,  p-value: 0.2927
</pre>
<p>We used only 5 principal components as modeling variables because, as is typical of principal component analysis, beyond the first few components the components become vanishingly small and unsuitable to use in modeling (see graph pcomp below).</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/pcomp.png" width="400"></p>
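<p>The same check can be done numerically instead of by eye.  The sketch below uses made-up data (22 columns driven by two hidden factors plus a little noise), not the proxy data from this article; the names <tt>f</tt>, <tt>X</tt>, <tt>p</tt> and <tt>varFrac</tt> are purely illustrative.  It computes the fraction of total variance carried by each component from the <tt>sdev</tt> slot that <tt>prcomp()</tt> returns:</p>
<pre>
&gt; set.seed(2009)
&gt; f &lt;- matrix(rnorm(200*2),ncol=2)        # two hidden factors (synthetic)
&gt; X &lt;- f %*% matrix(rnorm(2*22),nrow=2) + 0.1*matrix(rnorm(200*22),ncol=22)
&gt; p &lt;- prcomp(X)
&gt; varFrac &lt;- p$sdev^2/sum(p$sdev^2)       # share of variance per component
&gt; round(cumsum(varFrac)[1:5],3)           # cumulative share for the first 5
</pre>
<p>When the cumulative share is already close to 1 after a few components, the remaining components are mostly noise and make poor modeling variables.</p>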
<p>However, this gave a model with far smaller R-squared than people are reporting, so let&#8217;s add in a lot of components like everybody else does (bad!).</p>
<pre>
&gt; model &lt;- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +PC10 +PC11 +PC12 + PC13 ,dat=datUnion)
&gt; summary(model)

Call:
lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 +
    PC8 + PC9 + PC10 + PC11 + PC12 + PC13, data = datUnion)

Residuals:
     Min       1Q   Median       3Q      Max
-0.87249 -0.25951  0.03996  0.25055  0.99039 

Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)
(Intercept)  7.431e-01  1.424e+00   0.522   0.6028
PC1         -1.796e-04  3.665e-04  -0.490   0.6253
PC2         -4.179e-04  9.759e-04  -0.428   0.6694
PC3          3.306e-05  1.430e-03   0.023   0.9816
PC4          3.416e-03  1.803e-03   1.894   0.0609 .
PC5          4.032e-02  1.978e-02   2.039   0.0440 *
PC6         -3.260e-03  2.660e-02  -0.123   0.9027
PC7         -7.134e-02  3.620e-02  -1.971   0.0514 .
PC8         -1.339e-01  7.895e-02  -1.696   0.0928 .
PC9          7.577e-02  5.734e-02   1.321   0.1892
PC10         2.700e-01  5.878e-02   4.594 1.22e-05 ***
PC11         8.562e-02  6.741e-02   1.270   0.2068
PC12        -8.057e-02  1.053e-01  -0.765   0.4461
PC13        -4.099e-02  1.064e-01  -0.385   0.7008
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.4141 on 105 degrees of freedom
  (484 observations deleted due to missingness)
Multiple R-squared: 0.2558,	Adjusted R-squared: 0.1637
F-statistic: 2.777 on 13 and 105 DF,  p-value: 0.001961
</pre>
<p>This is a degenerate model that essentially didn&#8217;t fit (though the significance on the PC10 component fools the fitter, PC10 can&#8217;t be usable: it is essentially noise).  Graphically we can see the fit is not very useful (despite having a little bit of R-squared) by looking at the graph of the fit plotted in the region of fitting.  Notice how the fit variance is much smaller than the true data variance even in the region of training data; this is typical of bad regression fits.</p>
<pre>
&gt; datUnion$prediction &lt;- predict(model,newdata=datUnion)
&gt; dRange &lt;- datUnion[datUnion$Year&gt;=1856 &#038; datUnion$Year&lt;=1980,]
&gt; xyplot(avgTemp + prediction ~Year,dat=dRange,type='l',auto.key=TRUE)
</pre>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/pFitRegion.png" width="400"></p>
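<p>A hold-out test makes the same point without a graph.  The following is only a sketch on synthetic data (13 pure-noise predictors and an unrelated outcome), not the proxy data; the names <tt>d</tt>, <tt>train</tt>, <tt>test</tt>, <tt>fit</tt> and <tt>rsq</tt> are made up for the illustration.  We fit on part of the data and compute R-squared both on the fitting rows and on the rows held out:</p>
<pre>
&gt; set.seed(1856)
&gt; n &lt;- 120
&gt; d &lt;- data.frame(matrix(rnorm(n*13),ncol=13))   # noise predictors X1..X13
&gt; d$y &lt;- rnorm(n)                                # outcome unrelated to predictors
&gt; train &lt;- d[1:80,]
&gt; test &lt;- d[81:n,]
&gt; fit &lt;- lm(y ~ .,data=train)
&gt; rsq &lt;- function(y,pred) { 1 - sum((y-pred)^2)/sum((y-mean(y))^2) }
&gt; rsq(train$y,predict(fit,newdata=train))        # in-sample R-squared
&gt; rsq(test$y,predict(fit,newdata=test))          # hold-out R-squared
&gt; var(predict(fit,newdata=train))/var(train$y)   # fit variance over data variance
</pre>
<p>For a least squares fit with an intercept the last ratio equals the in-sample R-squared, which is why a low R-squared model necessarily shows much less variance than the data it was fit to.  The in-sample number cannot be negative even for pure noise; the hold-out number has no such guarantee.</p>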
<p>Now the statement they wanted to make is that the present looks nothing like the past.  The past is only available through the fit model so what you would hope is that the model looks like the present and then the model itself separates the past and present.  Instead as you see in the graphs above and below this fails two ways: the model looks nothing like the present and the model&#8217;s past looks a lot like the model&#8217;s present.</p>
<pre>
&gt; datUnion$prediction &lt;- predict(model,newdata=datUnion)
&gt; xyplot(avgTemp + prediction ~Year,dat=datUnion,type=c('p','smooth'),auto.key=TRUE)
</pre>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/pBoth.png" width="400"></p>
<p>What we could do to falsely drive the conclusion (which itself may or may not be true, it just is not supported by this technique, model or data) is create the infamous graph where we switch from modeled data in the past to actual data in the present and then act surprised that the two did not line up (which they did at no step during the fitting).  I don&#8217;t have the heart to unify the colors or remove the legend, but here is the graph below:</p>
<pre>
&gt; datUnion$dinked &lt;- datUnion$prediction
&gt; datUnion$dinked[!is.na(datUnion$avgTemp)] &lt;- NA
&gt; xyplot(avgTemp + dinked ~Year,dat=datUnion,type=c('p','smooth'),auto.key=TRUE)
</pre>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/pJacked.png" width="400"></p>
<p>The reason the blue points look different than the others is that they came from the average temperature data instead of the model (where everything else came from).  Switching the series is essentially assuming the conclusion that the recent past looks very different than the far past.</p>
<p>Essentially this methodology was so poor it could not have illustrated or contradicted recent global warming.  There are plenty of warning signs that the model fitting is problematic and the conclusion illustrated in the last graph cannot actually be proved or disproved from this data (the proxy variables are too weak to be useful, that is not to say there are not other better proxy variables or modeling techniques).  The problems of the presentation are, of course, not essential problems in detecting global warming (which likely is occurring and likely will be a drain on future quality of life) but problems found in a single bad analysis.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='R annoyances'>R annoyances</a></li>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Survive R'>Survive R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>R examine objects tutorial</title>
		<link>http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=r-examine-objects-tutorial</link>
		<comments>http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/#comments</comments>
		<pubDate>Sat, 21 Nov 2009 15:39:21 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Tutorial]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1134</guid>
		<description><![CDATA[This article is a quick concrete example of how to use the techniques from Survive R to lower the steepness of The R Project for Statistical Computing&#8216;s learning curve (so an apology to all readers who are not interested in R). What follows is for people who already use R and want to achieve more control [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/12/my-favorite-graphs/' rel='bookmark' title='My Favorite Graphs'>My Favorite Graphs</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>This article is a quick concrete example of how to use the techniques from <a href="http://www.win-vector.com/blog/2009/09/survive-r/">Survive R</a> to lower the steepness of <a href="http://www.r-project.org/">The R Project for Statistical Computing</a>&#8216;s learning curve (so an apology to all readers who are not interested in R).  What follows is for people who already use R and want to achieve more control of the software.<span id="more-1134"></span><br />
I am a fan of <a href="http://www.r-project.org/">R</a>.  The R software does a number of incredible things and is the result of a number of good design choices.  However, you can&#8217;t fully benefit from R if you are not already familiar with the internal workings of R.  You can quickly become familiar with the internal workings of R if you learn how to inspect the objects of R (as an addition to using the built in help system).  Here I give a concrete example of how to use the R system itself to find answers, with or without the help system.  R documentation has the difficult dual responsibility of attempting to explain both how to use the R software and the nature of the underlying statistics; so the documentation is not always the quickest thing to browse.</p>
<p>First let&#8217;s give R the commands to build a fake data set that has a variable y that turns out to be 3 times x (another variable) plus some noise:</p>
<pre>
&gt; n &lt;- 100
&gt; x &lt;- rnorm(n)
&gt; y &lt;- 3*x + 0.2*rnorm(n)
&gt; d &lt;- data.frame(x,y)
</pre>
<p>This data set has (by design) a nearly linear relation between x and y.  We can plot the data as follows:</p>
<pre>
&gt; library(ggplot2)
&gt; ggplot(data=d) + geom_point(aes(x=x,y=y))
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/dat1.png" alt="dat1.png" border="0" width="500" height="500" /><br />
</center></p>
<p>With data like this the most obvious statistical analysis is a <a href="http://en.wikipedia.org/wiki/Linear_regression">linear regression</a>.  R can very quickly perform the linear regression and report the results.</p>
<pre>
&gt; model &lt;- lm(y~x,data=d)
&gt; summary(model)

Call:
lm(formula = y ~ x, data = d)

Residuals:
     Min       1Q   Median       3Q      Max
-0.41071 -0.12762 -0.00651  0.10240  0.62772 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)
(Intercept) -0.02609    0.02102  -1.241    0.217
x            2.99150    0.02202 135.858   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2102 on 98 degrees of freedom
Multiple R-squared: 0.9947,	Adjusted R-squared: 0.9947
F-statistic: 1.846e+04 on 1 and 98 DF,  p-value: &lt; 2.2e-16
</pre>
<p>We can read the report and see that the estimated fit formula is: y =  2.99150*x &#8211; 0.02609 (which is very close to the true formula y = 3*x).  At this point the analysis is done (if the goal of the analysis is to just print the results).  However, if we want to use the results in a calculation we need to get at the numbers shown in the above printout.  This printout contains a lot of information (such as the estimated fit coefficients, the standard errors, the t-values and the significances) that a statistician would want to see and want to use in further calculations.  But it is unclear how to get at these numbers.  For example: how do you get the &#8220;standard errors&#8221; (the numbers in the &#8220;Std. Error&#8221; column) from the returned model?  Are we forced to cut and paste them from the printed report?   What can you do?</p>
<p>The documentation nearly tells us what we need to know.  <tt>help(lm)</tt> yields:</p>
<blockquote><p><tt><br />
The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.<br />
</tt></p></blockquote>
<p>To a newer R user this may not be clear (as there are technical issues from both R and statistics quickly being run through).   However, the experienced R user would immediately recognize from this help that what is returned from <tt>summary(model)</tt> is an object (not just a blob of text) and that looking at the class of the returned object (which turns out to be summary.lm) might tell them what they would need to know.</p>
<p>Typing:</p>
<pre>
&gt; class(summary(model))
[1] "summary.lm"
&gt; help(summary.lm)
</pre>
<p>Yields:</p>
<blockquote><p><tt><br />
coefficients: a p x 4 matrix with columns for the estimated coefficient, its standard error, t-statistic and corresponding (two-sided) p-value. Aliased coefficients are omitted.<br />
</tt></p></blockquote>
<p>But if you are not very familiar with R you might miss that the summary function returns a useful object (instead of a blob of text).  Also you might only know to look at <tt>help(summary)</tt>, which does not describe the location of the desired standard errors (but does have a reference to summary.lm, so if you are patient you might find it).  We describe how to find the information you need by using R&#8217;s object inspection facilities.  This is a &#8220;doing it the hard way&#8221; technique for when you do not understand the help system or you are using a package with less complete help documentation.</p>
<p>First  (using the techniques described in the slides:  <a href="http://www.win-vector.com/blog/2009/09/survive-r/">Survive R</a>) examine the model to see if the standard errors are there:</p>
<pre>
&gt; names(model)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"    

&gt; model$coefficients
(Intercept)           x
-0.02609243  2.99150259
</pre>
<p>We found the coefficients, but did not find the standard errors.  Now we know the standard errors are reported by <tt>summary(model)</tt>, so they must be somewhere.  Instead of performing a wild goose chase to find the standard errors let&#8217;s instead trace how the summary method works to find where it gets them.  If we type print(summary) we don&#8217;t get any really useful information.  This is because summary is a generic method and we need to know the type-qualified name of the summary method for a linear model.</p>
<pre>
&gt; class(model)
[1] "lm"
</pre>
<p>So we see our model is of type lm so the <tt>summary(model)</tt> call would use a summary method called summary.lm (which as we saw is also the returned class of the <tt>summary(model)</tt> object).  As we mentioned the solution is in <tt>help(summary.lm)</tt>, but if the solution had not been there we could still make progress:  we could dump the source of the summary.lm method:</p>
<pre>
&gt; print(summary.lm)
function (object, correlation = FALSE, symbolic.cor = FALSE,
    ...)
{
    ....
    class(ans) &lt;- "summary.lm"
    ans
}
</pre>
<p>We actually deleted the bulk of the print(summary.lm) result because the important thing to notice is that the method is huge and that it returns an object instead of a blob of text.  The fact that the method summary.lm was huge means that it is likely calculating the things it reports (confirming that the standard errors are not part of the model object).  The fact that an object is returned means that what we are looking for may be sitting somewhere in the summary waiting for us.  To find what we are looking for we convert the summary into a list (using the unclass() method) and look for something with the name or value we are looking for:</p>
<pre>
&gt; unclass(summary(model))
$call
lm(formula = y ~ x, data = d)
...
$coefficients
               Estimate Std. Error    t value      Pr(&gt;|t|)
(Intercept) -0.02609243 0.02102062  -1.241278  2.174662e-01
x            2.99150259 0.02201930 135.858209 2.095643e-113
...
</pre>
<p>And we have found it.  The named slot <tt>summary(model)$coefficients</tt> is in fact a table that has what we are looking for in its second column (<tt>Std. Error</tt>).  We can create a new list that will let us look up the standard errors by name (for the variable x and for the intercept):</p>
<pre>
&gt; stdErrors &lt;- as.list(summary(model)$coefficients[,2])
</pre>
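<p>Now that we have the standard errors in list form we can look up the numbers we wanted by name.  Single brackets return a one-element list, which is why each result below is printed under its name.</p>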
<pre>
&gt; stdErrors['x']
$x
[1] 0.0220193

&gt; stdErrors['(Intercept)']
$`(Intercept)`
[1] 0.02102062
</pre>
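<p>A slightly more defensive variant of the same extraction is sketched below.  It is an illustration rather than the original session: the data frame <tt>d</tt> is a hypothetical simulated stand-in and <tt>coefTable</tt> is a name introduced here.  It selects the column by name instead of by position and uses double brackets to get the bare numbers:</p>
<pre>
# Hypothetical stand-in data; the post's own d is not reproduced here.
set.seed(1)
d &lt;- data.frame(x = rnorm(100))
d$y &lt;- 3 * d$x + rnorm(100, sd = 0.2)
model &lt;- lm(y ~ x, data = d)

# names() lists the components of the summary without printing them all.
print(names(summary(model)))

# coef() applied to a summary returns the same table as $coefficients.
coefTable &lt;- coef(summary(model))

# Select the column by name rather than by position.
stdErrors &lt;- as.list(coefTable[, "Std. Error"])

# Double brackets return the bare number instead of a one-element list.
print(stdErrors[["x"]])
print(stdErrors[["(Intercept)"]])
</pre>
<p>Selecting by column name keeps working even if the layout of the coefficient table differs from what we expect.</p>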
<p>And we finally have the standard errors.  But why did we want the standard errors?  In this case I wanted them so I could plot the fit model and show the uncertainty of the model.  As is often the case, R already has a function that does all of this.  Also (as is often the case) the R function asks the right statistical question (instead of the obvious question) and can draw error bars that display the uncertainty of future predictions.  The uncertainty of a future prediction is in fact different from the uncertainty of the estimate (which was the most obvious thing to calculate from the standard errors): it also includes the noise in each new observation, and (after some reflection) it is what I really wanted.  Having this sort of distinction already thought out is why we use a statistics package like R instead of just coding everything up.  These calculations are all trivial to implement, but remembering to perform the calculations that answer the right statistical questions can be difficult.  The built-in R solution for plotting the fit model (black line) and the region of expected prediction uncertainty (blue lines) is as follows:</p>
<pre>
&gt; pred &lt;- predict(model,interval='prediction')
&gt; dfit &lt;- data.frame(x,y,fit=pred[,1],lwr=pred[,2],upr=pred[,3])
&gt; ggplot(data=dfit) + geom_point(aes(x=x,y=y)) +
    geom_line(aes(x=x,y=fit)) +
    geom_line(aes(x=x,y=lwr),color='blue') + geom_line(aes(x=x,y=upr),color='blue')
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/fit1.png" alt="fit1.png" border="0" width="500" height="500" /><br />
</center></p>
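<p>To see the difference between the two kinds of uncertainty, the following minimal sketch compares the width of the confidence interval (uncertainty of the estimate) with the width of the prediction interval at a few points.  It is not part of the original session: the data frame <tt>d</tt> is a hypothetical simulated stand-in, and <tt>newd</tt>, <tt>confInt</tt>, <tt>predInt</tt> and <tt>widths</tt> are names introduced here:</p>
<pre>
# Hypothetical stand-in data; the post's own d is not reproduced here.
set.seed(1)
d &lt;- data.frame(x = rnorm(100))
d$y &lt;- 3 * d$x + rnorm(100, sd = 0.2)
model &lt;- lm(y ~ x, data = d)

# Evaluate both kinds of interval at a few new x values.
newd &lt;- data.frame(x = c(-1, 0, 1))
confInt &lt;- predict(model, newdata = newd, interval = "confidence")
predInt &lt;- predict(model, newdata = newd, interval = "prediction")

# Both share the same fitted values; only the interval widths differ.
widths &lt;- data.frame(x = newd$x,
                     confidence = confInt[, "upr"] - confInt[, "lwr"],
                     prediction = predInt[, "upr"] - predInt[, "lwr"])
print(widths)
</pre>
<p>The prediction interval is wider at every point because it adds the residual noise of a new observation to the uncertainty of the fitted line.</p>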
<p>And we are done.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

