<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; R</title>
	<atom:link href="http://www.win-vector.com/blog/tag/r/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Thu, 29 Jul 2010 17:09:49 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Must Have Software</title>
		<link>http://www.win-vector.com/blog/2010/05/must-have-software/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=must-have-software</link>
		<comments>http://www.win-vector.com/blog/2010/05/must-have-software/#comments</comments>
		<pubDate>Fri, 28 May 2010 17:26:07 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computers]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Excel]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[GnuPG]]></category>
		<category><![CDATA[Keynote]]></category>
		<category><![CDATA[Latex]]></category>
		<category><![CDATA[Must Have Software]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[TrueCrypt]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1461</guid>
		<description><![CDATA[Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX) and Windows (NT4, 2000, XP, Vista and 7) for quite a while, I have seen a lot of different software tools. I would like to quickly exhibit my &#8220;must have&#8221; list. These are the packages that I find to be the single &#8220;must have offerings&#8221; in [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/07/microsoft-store-again/' rel='bookmark' title='Permanent Link: Microsoft Store Again'>Microsoft Store Again</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Permanent Link: Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2009/06/public-service-article-jstor-and-other-useful-research-archives/' rel='bookmark' title='Permanent Link: Public Service Article: JSTOR and other Useful Research Archives'>Public Service Article: JSTOR and other Useful Research Archives</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX) and Windows (NT4, 2000, XP, Vista and 7) for quite a while, I have seen a lot of different software tools.  I would like to quickly exhibit my &#8220;must have&#8221; list.  These are the packages that I find to be the single &#8220;must have offerings&#8221; in a number of categories.  I have avoided some categories (such as editors, email programs, programming languages, IDEs, photo editors, backup solutions, databases, database tools and web tools) where I have no feeling of having seen a single absolute best offering.</p>
<p>The spirit of the list is to pick items such that: if you disagree with an item in this list then either you are wrong or you know something I would really like to hear about.</p>
<p><span id="more-1461"></span></p>
<dl>
<dt><strong>Encryption, disk images: <a href="http://www.truecrypt.org/" target="ext">TrueCrypt</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>TrueCrypt can create portable encrypted virtual disks (files that can be mounted as a disk on any operating system).</dd>
<dd></dd>
<dt><strong>Encryption, files: <a href="http://www.gnupg.org/" target="ext">GnuPG</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>GnuPG is the tool to use to encrypt files for email.</dd>
<dd></dd>
<dt><strong>Presentation: <a href="http://www.apple.com/iwork/keynote/" target="ext">Apple Keynote</a> (commercial: OSX)</strong></dt>
<dd>Keynote is not quite as friendly as Microsoft PowerPoint, but it quickly produces beautiful presentations.</dd>
<dt><strong>Reference Library: <a href="http://mekentosj.com/papers/" target="ext">Papers</a> (commercial: OSX)</strong></dt>
<dd>&#8220;iTunes for PDF.&#8221;  Manage thousands of PDFs and references, annotate with meta-data, place papers into multiple project folders.  An interesting runner-up is <a href="http://bibdesk.sourceforge.net/" target="ext">BibDesk</a> (open source: OSX).</dd>
<dt><strong>Spreadsheet: <a href="http://office.microsoft.com/en-gb/excel/default.aspx" target="ext">Microsoft Excel</a> (commercial: Windows, OSX)</strong></dt>
<dd>Open Office and Google Docs are getting better every day, but neither comes close to Microsoft Excel in functionality and versatility of user interface.  If you are on a platform that supports Excel, work regularly with spreadsheets and use something other than Excel, it really means that you do not value your time.</dd>
<dt><strong>Statistics Software: <a href="http://www.r-project.org/" target="ext">R</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>R is rapidly becoming the platform of choice for statisticians and is (with the addition of lattice and ggplot2) the best way to produce graphs.  R has a fairly nasty programming language, but has so many statistical operations available that it cannot be avoided.</dd>
<dt><strong>Technical Documentation: <a href="http://www.tug.org/" target="ext">LaTeX</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>It may seem antiquated but TeX/LaTeX is still far more powerful than the &#8220;WYSIWYG&#8221; pretenders.  The separation of presentation from specification, automatic management of references and tables of contents, and being able to include PDFs from external files (which get refreshed when you re-build the document) are all lifesavers.</dd>
<dt><strong>Version Control: <a href="http://git-scm.com/" target="ext">git</a> (open source: Linux, Windows, OSX)</strong></dt>
<dd>Just about the only version control system that: doesn&#8217;t damage the data you are trying to manage by adding dot-files into all of the directories, can routinely handle large files and can work productively without a network connection.  <a href="http://www.perforce.com/" target="ext">Perforce</a> is a powerful central-server commercial option (with the ability to have central policies, control and review).
</dd>
</dl>
<p></p>
<p>I look forward to learning which of my choices are considered poor and what your must-haves are.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/07/microsoft-store-again/' rel='bookmark' title='Permanent Link: Microsoft Store Again'>Microsoft Store Again</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Permanent Link: Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2009/06/public-service-article-jstor-and-other-useful-research-archives/' rel='bookmark' title='Permanent Link: Public Service Article: JSTOR and other Useful Research Archives'>Public Service Article: JSTOR and other Useful Research Archives</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/05/must-have-software/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>R annoyances</title>
		<link>http://www.win-vector.com/blog/2010/03/r-annoyances/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=r-annoyances</link>
		<comments>http://www.win-vector.com/blog/2010/03/r-annoyances/#comments</comments>
		<pubDate>Sat, 20 Mar 2010 18:49:42 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Principle of Least Astonishment]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[R is not your friend]]></category>
		<category><![CDATA[R programming annoyances]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1407</guid>
		<description><![CDATA[Readers returning to our blog will know that Win-Vector LLC is fairly &#8220;pro-R.&#8221; You can take that to mean &#8220;in favor of R&#8221; or &#8220;professionally using R&#8221; (both statements are true). Some days we really don&#8217;t feel that way. Consider the following snippet of R code where we create a list with a single element [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2010/01/relative-returns-a-banker-versus-trader-paradox/' rel='bookmark' title='Permanent Link: Relative returns: a banker versus trader paradox'>Relative returns: a banker versus trader paradox</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Readers returning to our blog will know that Win-Vector LLC is fairly &#8220;pro-<a href="http://www.win-vector.com/blog/tag/r/" target="x">R</a>.&#8221;  You can take that to mean &#8220;in favor of R&#8221; or &#8220;professionally using R&#8221; (both statements are true).  Some days we really don&#8217;t feel that way.  <span id="more-1407"></span><br />
Consider the following snippet of R code where we create a list with a single element named &#8220;x&#8221; that refers to a numeric vector.  We start with a demonstration of the hard-coded method of pulling the x-value back out using the &#8220;$&#8221; operator.</p>
<pre>
&gt; l &lt;- list(x=c(1,2,3))
&gt; l$x
[1] 1 2 3
</pre>
<p>But suppose we wanted to automate this; that is, pass in the name of the value we want in a variable.  We are, after all, using a computer, so automating a step seems like a reasonable desire.  R supplies a notation for this using the &#8220;[]&#8221; operator.  But something slightly different comes out under the &#8220;[]&#8221; operator than under the &#8220;$&#8221; operator:</p>
<pre>
&gt; varName &lt;- 'x'
&gt; l[varName]
$x
[1] 1 2 3
</pre>
<p>Notice that the printed outputs are slightly different (one echoes "$x" and one does not).  Let's use the "class()" method to see what is actually being returned in each case.</p>
<pre>
&gt; class(l$x)
[1] "numeric"
&gt; class(l['x'])
[1] "list"
</pre>
<p>Completely different types are returned (in one case a numeric vector, in the other a general list; these are not interchangeable types).</p>
<p>At this point you may think it is time to turn in our "pro" label and call ourselves "newb" (Internet slang for "newbie" or "idiot").  But let's slow down for a bit.   When two views of the same situation disagree (such as the difference in opinion between the authors of R and myself over whether the "[]" and "$" operators should return the same type) you at most know that at least one of those views is wrong.  You don't really know if either view is right or, if one is, which one it is.  I can, however, bring in an additional argument to try to show that the design of R is in fact wrong.  The additional argument is <a href="http://en.wikipedia.org/wiki/Principle_of_least_astonishment" target="o">"The Principle of Least Astonishment."</a>  This principle roughly says that it is a mistake to introduce unnecessary differences in outcomes (which to the unprepared user are unpleasant surprises).  There may be some deep (yet obscure) reasons the two operators return different results.  But the fact that you would have to find a way to document and explain these differences really should make one think that this situation is a mis-design and the "explanation" is an attempt at a workaround.  Or to put it more rudely: there may be an explanation, but there is no excuse.</p>
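<p>For comparison, R also has a "[[]]" operator, which extracts a single element rather than a one-element sub-list.  The following is a minimal sketch (re-creating the l and varName values from above) suggesting that "[[]]" is the variable-driven counterpart of "$"; that there are two different bracket operators to choose between is itself something a new user has to discover.</p>
<pre>
&gt; l &lt;- list(x=c(1,2,3))
&gt; varName &lt;- 'x'
&gt; # double brackets pull out the element itself, not a sub-list
&gt; l[[varName]]
[1] 1 2 3
&gt; class(l[[varName]])
[1] "numeric"
&gt; identical(l[[varName]],l$x)
[1] TRUE
</pre>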
<p>For another example consider creating a 3 by 3 matrix:</p>
<pre>
&gt; m &lt;- matrix(c(1,2,3,1,1,1,0,0,1),nrow=3,ncol=3)
&gt; m
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    0
[3,]    3    1    1
</pre>
<p>Now select the last two rows of the matrix.</p>
<pre>
&gt; m[c(FALSE,TRUE,TRUE),]
     [,1] [,2] [,3]
[1,]    2    1    0
[2,]    3    1    1
&gt;
</pre>
<p>Now (for the punchline) try to select just the middle row of the matrix.</p>
<pre>
&gt; m[c(FALSE,TRUE,FALSE),]
[1] 2 1 0
</pre>
<p>Notice that once again (and without warning) the result is subtly different.  I admit that it seems paranoid to worry about such small differences, but when you are debugging a system that should work these are exactly the killing mistakes you are looking for.  In this case the problem is pretty bad.  See what happens if you ask for the dimension of each of these differing returns:</p>
<pre>
&gt; dim(m[c(FALSE,TRUE,TRUE),])
[1] 2 3
&gt; dim(m[c(FALSE,TRUE,FALSE),])
NULL
</pre>
<p>The first case works fine (reports 2 rows and 3 columns).  The second case returns "NULL" (instead of 1 row and 3 columns).   In R, NULL is sometimes used as an error-value (instead of throwing an exception) and this value will poison any further conditions or calculations it is involved in.  The main way to deal with the arbitrary introduction of such NULLs is the incredibly tedious and uncertain style of defensive coding that we argue against in <a href="http://www.win-vector.com/blog/2010/02/postels-law-not-sure-who-to-be-angry-with/">Postel’s Law: Not Sure Who To Be Angry With</a>.  Such code weakens both programs and programmers.</p>
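<p>As a small sketch of how such a NULL propagates (the vector v here is just a stand-in for the single-row result above): indexing and comparing the NULL dimension quietly yields a zero-length logical rather than TRUE or FALSE, and an "if" statement handed that zero-length value stops with an error instead of taking either branch.</p>
<pre>
&gt; v &lt;- c(2,1,0)
&gt; dim(v)
NULL
&gt; # the comparison does not fail; it just returns nothing
&gt; dim(v)[1] &gt; 0
logical(0)
&gt; # the explicit (and tedious) defensive check
&gt; is.null(dim(v))
[1] TRUE
</pre>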
<p>But what is going on in this example?  Once again we use the "class()" method to inspect the subtly different results.</p>
<pre>
&gt; class(m[c(FALSE,TRUE,TRUE),])
[1] "matrix"
&gt; class(m[c(FALSE,TRUE,FALSE),])
[1] "numeric"
</pre>
<p>The result is disappointing.  For a two-row select R returns a matrix (what we would expect).  For a single-row select R does us the "favor" of converting the result into a vector.  This is a disaster.  A single row matrix is similar to a vector, but even R itself does not support the same set of operations and outcomes on vectors as it does on matrices (for example the failure of the "dim()" method).  It is not safe to further calculate with these results (without by-hand converting the result back to a single row matrix which R can in fact represent).  In my case this created crashing bugs deep in a long running analysis (and was hard to diagnose as the bug was in an "innocent operation" not in a "risky calculation").</p>
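<p>One way to keep the matrix structure, sketched here under the assumption that a one-row matrix is what the downstream code wants, is the drop=FALSE argument of the "[]" operator.  It has to be remembered at every indexing site, so it is a defensive measure rather than a fix for the default behavior.</p>
<pre>
&gt; m &lt;- matrix(c(1,2,3,1,1,1,0,0,1),nrow=3,ncol=3)
&gt; # drop=FALSE asks R not to discard the length-1 dimension
&gt; r &lt;- m[c(FALSE,TRUE,FALSE),,drop=FALSE]
&gt; r
     [,1] [,2] [,3]
[1,]    2    1    0
&gt; dim(r)
[1] 1 3
&gt; is.matrix(r)
[1] TRUE
</pre>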
<p>All of this has to violate John Chambers' "Prime Directive" for data: "an obligation on all creators of software to program in such a way that the computations can be understood and trusted."  Chambers' opinion is relevant, as he is the author of the S language (of which R is an open source re-implementation).  We continue to recommend R, but we also recommend being exceptionally careful when using it (which unfortunately adds time to projects).</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2010/01/relative-returns-a-banker-versus-trader-paradox/' rel='bookmark' title='Permanent Link: Relative returns: a banker versus trader paradox'>Relative returns: a banker versus trader paradox</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Permanent Link: Sorting Used in Anger'>Sorting Used in Anger</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/03/r-annoyances/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>CRU graph yet again (with R)</title>
		<link>http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=cru-graph-yet-again-with-r</link>
		<comments>http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/#comments</comments>
		<pubDate>Sun, 13 Dec 2009 19:25:00 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Rants]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Climate]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regression]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1195</guid>
		<description><![CDATA[IowaHawk has an excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produce similarly bad results using R. If the re-constructed technique is close to what was originally done then so many bad moves were taken that you can&#8217;t learn much [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
<li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>IowaHawk has an excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: <a href="http://iowahawk.typepad.com/iowahawk/2009/12/fables-of-the-reconstruction.html">Fables of the Reconstruction</a>.   We thought we would show how to produce similarly bad results using R.<br />
<span id="more-1195"></span></p>
<p>If the re-constructed technique is close to what was originally done then so many bad moves were taken that you can&#8217;t learn much of anything from the original &#8220;result.&#8221;   This points out some of the pitfalls of not performing hold-out tests, not examining the modeling diagnostics and not remembering that linear regression models fail to low-variance models (i.e. when they fail they do a good job predicting the mean and vastly under-estimate variance).</p>
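<p>To make the hold-out idea concrete, here is a minimal sketch on simulated data (not the CRU data): y is pure noise, unrelated to the 13 candidate variables, and the names sim, isTrain, noiseModel and rsq are our own, invented for this illustration.  Comparing the R-squared on the rows used for fitting with the R-squared on rows held out of the fit is the kind of check that tends to expose a fit that is mostly noise.</p>
<pre>
&gt; set.seed(2009)
&gt; n &lt;- 200
&gt; # 14 columns of independent noise; call the first one y
&gt; sim &lt;- data.frame(matrix(rnorm(n*14),nrow=n))
&gt; names(sim)[1] &lt;- 'y'
&gt; # fit on the first half only, hold out the second half
&gt; isTrain &lt;- seq_len(n) &lt;= n/2
&gt; noiseModel &lt;- lm(y ~ .,data=sim[isTrain,])
&gt; rsq &lt;- function(y,pred) { 1 - sum((y-pred)^2)/sum((y-mean(y))^2) }
&gt; # R-squared on the fitting rows, then on the held-out rows
&gt; rsq(sim$y[isTrain],predict(noiseModel,newdata=sim[isTrain,]))
&gt; rsq(sim$y[!isTrain],predict(noiseModel,newdata=sim[!isTrain,]))
</pre>
<p>With 13 noise variables the first number will typically be noticeably above zero even though there is nothing to find; the second is the honest estimate.</p>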
<p>Our article is not an article on global warming, but an article on analysis technique.  Human-driven global warming is either happening or not happening independent of any bad analysis.  Finding the physical truth is a bigger, harder job than eliminating some bad reports (the opposite of a bad report is not necessarily the truth). Bad analyses can have many different sources (mistakes, trying to jump ahead of your colleagues on something you believe is true, trying to fake something you believe is false, or being figments of overly harsh critics) and we have not heard enough to make any accusations.</p>
<p>First: load the data (I re-formatted it a bit so <a href="http://cran.r-project.org/">R</a> can read it: <a href="http://www.win-vector.com/blog/wp-content/uploads/2009/12/jonesmannrogfig2c.txt">jonesmannrogfig2c.txt</a>, <a href="http://www.win-vector.com/blog/wp-content/uploads/2009/12/data1400.dat_.txt">data1400.dat_.txt</a>), perform the principal components reduction and fit a first model.</p>
<pre>
&gt; library(lattice)
&gt; d1400 &lt;- read.table('data1400.dat.txt',sep='\t',header=FALSE)
&gt; d1400r &lt;- as.matrix(d1400[,2:23])
&gt; pcomp &lt;- prcomp(na.omit(d1400r))
&gt; plot(pcomp)
&gt; vars &lt;- data.frame(cbind(Year=d1400[,1],d1400r %*% pcomp$rotation),row.names=d1400[,1])
&gt; jones &lt;- read.table('jonesmannrogfig2c.txt',sep='\t',header=TRUE)
&gt; datUnion &lt;- merge(vars,jones,all=TRUE)
&gt; datUnion$avgTemp &lt;- with(datUnion,(NH+CET+Central.Europe+Fennoscandia)/4.0)
&gt; model &lt;- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 ,data=datUnion)
&gt; summary(model)

Call:
lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5, data = datUnion)

Residuals:
       Min         1Q     Median         3Q        Max
-0.8811679 -0.2658117  0.0008174  0.2933058  1.0450044 

Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)
(Intercept)  0.0065252  0.5750696   0.011   0.9910
PC1         -0.0001683  0.0003912  -0.430   0.6679
PC2         -0.0003678  0.0010114  -0.364   0.7168
PC3          0.0003177  0.0014821   0.214   0.8307
PC4          0.0044084  0.0019351   2.278   0.0246 *
PC5          0.0188520  0.0205137   0.919   0.3601
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.4505 on 113 degrees of freedom
  (484 observations deleted due to missingness)
Multiple R-squared: 0.05223,	Adjusted R-squared: 0.01029
F-statistic: 1.245 on 5 and 113 DF,  p-value: 0.2927
</pre>
<p>We used only 5 principal components as modeling variables because, as is typical of principal component analysis, beyond the first few components the components become vanishingly small and unsuitable to use in modeling (see graph pcomp below).</p>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/pcomp.png" width="400"></p>
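<p>The fall-off in component size can also be read numerically.  The sketch below uses simulated data (six columns with deliberately shrinking scales, an assumption made only for illustration) rather than the proxy data; the names z, p and varShare are ours.  The same calculation applied to pcomp$sdev would give the share of variance carried by each of the real components.</p>
<pre>
&gt; set.seed(1400)
&gt; # six noise columns, each scaled down relative to the one before
&gt; z &lt;- matrix(rnorm(100*6),ncol=6) %*% diag(c(10,5,2,0.5,0.1,0.01))
&gt; p &lt;- prcomp(z)
&gt; # fraction of total variance carried by each component
&gt; varShare &lt;- p$sdev^2/sum(p$sdev^2)
&gt; round(varShare,4)
&gt; # running total: how much the first k components explain
&gt; round(cumsum(varShare),4)
</pre>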
<p>However, this gave a model with a far smaller R-squared than people are reporting, so let's add in a lot of components like everybody else does (bad!).</p>
<pre>
&gt; model &lt;- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +PC10 +PC11 +PC12 + PC13 ,data=datUnion)
&gt; summary(model)

Call:
lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 +
    PC8 + PC9 + PC10 + PC11 + PC12 + PC13, data = datUnion)

Residuals:
     Min       1Q   Median       3Q      Max
-0.87249 -0.25951  0.03996  0.25055  0.99039 

Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)
(Intercept)  7.431e-01  1.424e+00   0.522   0.6028
PC1         -1.796e-04  3.665e-04  -0.490   0.6253
PC2         -4.179e-04  9.759e-04  -0.428   0.6694
PC3          3.306e-05  1.430e-03   0.023   0.9816
PC4          3.416e-03  1.803e-03   1.894   0.0609 .
PC5          4.032e-02  1.978e-02   2.039   0.0440 *
PC6         -3.260e-03  2.660e-02  -0.123   0.9027
PC7         -7.134e-02  3.620e-02  -1.971   0.0514 .
PC8         -1.339e-01  7.895e-02  -1.696   0.0928 .
PC9          7.577e-02  5.734e-02   1.321   0.1892
PC10         2.700e-01  5.878e-02   4.594 1.22e-05 ***
PC11         8.562e-02  6.741e-02   1.270   0.2068
PC12        -8.057e-02  1.053e-01  -0.765   0.4461
PC13        -4.099e-02  1.064e-01  -0.385   0.7008
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.4141 on 105 degrees of freedom
  (484 observations deleted due to missingness)
Multiple R-squared: 0.2558,	Adjusted R-squared: 0.1637
F-statistic: 2.777 on 13 and 105 DF,  p-value: 0.001961
</pre>
<p>This is a degenerate model that essentially didn&#8217;t fit (though the significance on the PC10 component fools the fitter, PC10 can&#8217;t be usable: it is essentially noise).  Graphically we can see the fit is not very useful (despite having a little bit of R-squared) by looking at the graph of the fit plotted in the region of fitting.  Notice how the fit variance is much smaller than the true data variance even in the region of training data; this is typical of bad regression fits.</p>
<pre>
&gt; datUnion$prediction &lt;- predict(model,newdata=datUnion)
&gt; dRange &lt;- datUnion[datUnion$Year&gt;=1856 &#038; datUnion$Year&lt;=1980,]
&gt; xyplot(avgTemp + prediction ~Year,data=dRange,type='l',auto.key=TRUE)
</pre>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/pFitRegion.png" width="400"></p>
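<p>The shrunken variance of the fit is not an accident of this data set.  For a least-squares fit with an intercept, the variance of the fitted values divided by the variance of the outcome is exactly the R-squared, so a low R-squared fit necessarily produces a flattened curve.  A minimal sketch on simulated data (the names x, y and fit are ours, and the weak 0.3 slope is an arbitrary choice for illustration):</p>
<pre>
&gt; set.seed(1856)
&gt; x &lt;- rnorm(200)
&gt; # weak signal buried in unit-variance noise
&gt; y &lt;- 0.3*x + rnorm(200)
&gt; fit &lt;- lm(y ~ x)
&gt; # the variance ratio of fit to data matches the reported R-squared
&gt; all.equal(var(fitted(fit))/var(y),summary(fit)$r.squared)
[1] TRUE
</pre>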
<p>Now the statement they wanted to make is that the present looks nothing like the past.  The past is only available through the fit model so what you would hope is that the model looks like the present and then the model itself separates the past and present.  Instead as you see in the graphs above and below this fails two ways: the model looks nothing like the present and the model&#8217;s past looks a lot like the model&#8217;s present.</p>
<pre>
&gt; datUnion$prediction &lt;- predict(model,newdata=datUnion)
&gt; xyplot(avgTemp + prediction ~Year,data=datUnion,type=c('p','smooth'),auto.key=TRUE)
</pre>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/pBoth.png" width="400"></p>
<p>What we could do to falsely drive the conclusion (which itself may or may not be true, it just is not supported by this technique, model or data) is create the infamous graph where we switch from modeled data in the past to actual data in the present and then act surprised that the two did not line up (which they did at no step during the fitting).  I don&#8217;t have the heart to unify the colors or remove the legend, but here is the graph below:</p>
<pre>
&gt; datUnion$dinked &lt;- datUnion$prediction
&gt; datUnion$dinked[!is.na(datUnion$avgTemp)] &lt;- NA
&gt; xyplot(avgTemp + dinked ~Year,data=datUnion,type=c('p','smooth'),auto.key=TRUE)
</pre>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/12/pJacked.png" width="400"></p>
<p>The reason the blue points look different from the others is that they came from the average temperature data instead of the model (where everything else came from).  Switching the series essentially assumes the conclusion that the recent past looks very different from the far past.</p>
<p>Essentially this methodology was so poor it could not have illustrated or contradicted recent global warming.  There are plenty of warning signs that the model fit is problematic, and the conclusion illustrated in the last graph cannot actually be proved or disproved from this data (the proxy variables are too weak to be useful; that is not to say there are not other, better proxy variables or modeling techniques).  The problems of the presentation are, of course, not essential problems in detecting global warming (which likely is occurring and likely will be a drain on future quality of life) but problems found in a single bad analysis.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
<li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>R examine objects tutorial</title>
		<link>http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=r-examine-objects-tutorial</link>
		<comments>http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/#comments</comments>
		<pubDate>Sat, 21 Nov 2009 15:39:21 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Tutorial]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1134</guid>
		<description><![CDATA[This article is a quick concrete example of how to use the techniques from Survive R to lower the steepness of The R Project for Statistical Computing&#8217;s learning curve (so an apology to all readers who are not interested in R). What follows is for people who already use R and want to achieve more control [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/' rel='bookmark' title='Permanent Link: CRU graph yet again (with R)'>CRU graph yet again (with R)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>This article is a quick concrete example of how to use the techniques from <a href="http://www.win-vector.com/blog/2009/09/survive-r/">Survive R</a> to lower the steepness of <a href="http://www.r-project.org/">The R Project for Statistical Computing</a>&#8217;s learning curve (so an apology to all readers who are not interested in R).  What follows is for people who already use R and want to achieve more control of the software.<span id="more-1134"></span><br />
I am a fan of <a href="http://www.r-project.org/">R</a>.  The R software does a number of incredible things and is the result of a number of good design choices.  However, you can&#8217;t fully benefit from R if you are not already familiar with the internal workings of R.  You can quickly become familiar with the internal workings of R if you learn how to inspect the objects of R (as an addition to using the built-in help system).  Here I give a concrete example of how to use the R system itself to find answers, with or without the help system.  R documentation has the difficult dual responsibility of attempting to explain both how to use the R software and the nature of the underlying statistics; so the documentation is not always the quickest thing to browse.</p>
<p>First let&#8217;s give R the commands to build a fake data set that has a variable y that turns out to be 3 times x (another variable) plus some noise:</p>
<pre>
&gt; n &lt;- 100
&gt; x &lt;- rnorm(n)
&gt; y &lt;- 3*x + 0.2*rnorm(n)
&gt; d &lt;- data.frame(x,y)
</pre>
<p>This data set (by design) has a nearly linear relation between x and y.  We can plot the data as follows:</p>
<pre>
&gt; library(ggplot2)
&gt; ggplot(data=d) + geom_point(aes(x=x,y=y))
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/dat1.png" alt="dat1.png" border="0" width="500" height="500" /><br />
</center></p>
<p>With data like this the most obvious statistical analysis is a <a href="http://en.wikipedia.org/wiki/Linear_regression">linear regression</a>.  R can very quickly perform the linear regression and report the results.</p>
<pre>
&gt; model &lt;- lm(y~x,data=d)
&gt; summary(model)

Call:
lm(formula = y ~ x, data = d)

Residuals:
     Min       1Q   Median       3Q      Max
-0.41071 -0.12762 -0.00651  0.10240  0.62772 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)
(Intercept) -0.02609    0.02102  -1.241    0.217
x            2.99150    0.02202 135.858   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2102 on 98 degrees of freedom
Multiple R-squared: 0.9947,	Adjusted R-squared: 0.9947
F-statistic: 1.846e+04 on 1 and 98 DF,  p-value: &lt; 2.2e-16
</pre>
<p>We can read the report and see that the estimated fit formula is: y =  2.99150*x &#8211; 0.02609 (which is very close to the true formula y = 3*x).  At this point the analysis is done (if the goal of the analysis is to just print the results).  However, if we want to use the results in a calculation we need to get at the numbers shown in the above printout.  This printout contains a lot of information (such as the estimated fit coefficients, the standard errors, the t-values and the significances) that a statistician would want to see and want to use in further calculations.  But it is unclear how to get at these numbers.  For example: how do you get the &#8220;standard errors&#8221; (the numbers in the &#8220;Std. Error&#8221; column) from the returned model?  Are we forced to cut and paste them from the printed report?   What can you do?</p>
<p>The documentation nearly tells us what we need to know.  <tt>help(lm)</tt> yields:</p>
<blockquote><p><tt><br />
The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.<br />
</tt></p></blockquote>
<p>To a newer R user this may not be clear (as technical issues from both R and statistics are being run through quickly).   However, the experienced R user would immediately recognize from this help that what is returned from <tt>summary(model)</tt> is an object (not just a blob of text) and that looking at the class of the returned object (which turns out to be summary.lm) might tell them what they would need to know.</p>
<p>Typing:</p>
<pre>
&gt; class(summary(model))
[1] "summary.lm"
&gt; help(summary.lm)
</pre>
<p>Yields:</p>
<blockquote><p><tt><br />
coefficients: a p x 4 matrix with columns for the estimated coefficient, its standard error, t-statistic and corresponding (two-sided) p-value. Aliased coefficients are omitted.<br />
</tt></p></blockquote>
<p>But if you are not very familiar with R you might miss that the summary function returns a useful object (instead of a blob of text).  Also you might only know to look at <tt>help(summary)</tt>, which does not describe the location of the desired standard errors (but does have a reference to summary.lm, so if you are patient you might find it).  We describe how to find the information you need by using R&#8217;s object inspection facilities.  This is a &#8220;doing it the hard way&#8221; technique for when you do not understand the help system or you are using a package with less complete help documentation.</p>
<p>First  (using the techniques described in the slides:  <a href="http://www.win-vector.com/blog/2009/09/survive-r/">Survive R</a>) examine the model to see if the standard errors are there:</p>
<pre>
&gt; names(model)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"    

&gt; model$coefficients
(Intercept)           x
-0.02609243  2.99150259
</pre>
<p>We found the coefficients, but did not find the standard errors.  Now we know the standard errors are reported by <tt>summary(model)</tt>, so they must be somewhere.  Instead of going on a wild goose chase to find the standard errors let&#8217;s trace how the summary method works to find where it gets them.  If we type print(summary) we don&#8217;t get any really useful information.  This is because summary is a generic function that only dispatches to a class-specific method, so we need to know the type-qualified name of the method that summarizes a linear model.</p>
<pre>
&gt; class(model)
[1] "lm"
</pre>
<p>So we see our model is of class lm, which means the <tt>summary(model)</tt> call would use a summary method called summary.lm (which as we saw is also the class of the object returned by <tt>summary(model)</tt>).  As we mentioned the solution is in <tt>help(summary.lm)</tt>, but if the solution had not been there we could still make progress:  we could dump the source of the summary.lm method:</p>
<pre>
&gt; print(summary.lm)
function (object, correlation = FALSE, symbolic.cor = FALSE,
    ...)
{
    ....
    class(ans) &lt;- "summary.lm"
    ans
}
</pre>
<p>We actually deleted the bulk of the print(summary.lm) result because the important thing to notice is that the method is huge and that it returns an object instead of a blob of text.  The fact that the method summary.lm is huge means that it is likely calculating the things it reports (confirming that the standard errors are not part of the model object).  The fact that an object is returned means that what we are looking for may be sitting somewhere in the summary waiting for us.  To find what we are looking for we convert the summary into a plain list (using the unclass() function) and look for something with the name or value we are looking for:</p>
<pre>
&gt; unclass(summary(model))
$call
lm(formula = y ~ x, data = d)
...
$coefficients
               Estimate Std. Error    t value      Pr(&gt;|t|)
(Intercept) -0.02609243 0.02102062  -1.241278  2.174662e-01
x            2.99150259 0.02201930 135.858209 2.095643e-113
...
</pre>
<p>And we have found it.  The named element <tt>summary(model)$coefficients</tt> is in fact a table that has what we are looking for in the second column.  We can create a new list that will let us look up the standard errors by name (for the variable x and for the intercept):</p>
<pre>
&gt; stdErrors &lt;- as.list(summary(model)$coefficients[,2])
</pre>
<p>Now that we have the stdErrors in list form we can look up the numbers we wanted by name.</p>
<pre>
&gt; stdErrors['x']
$x
[1] 0.0220193

&gt; stdErrors['(Intercept)']
$`(Intercept)`
[1] 0.02102062
</pre>
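<p>As a side note, the same numbers can be selected by column name rather than by column position, which reads more clearly and survives a reordering of the table.  The sketch below is not part of the original walk-through: it relies on the fact that the coefficient table is an ordinary named matrix and that <tt>coef()</tt> applied to the summary returns it.  The simulated data (<tt>dSim</tt>, <tt>mSim</tt>) are an assumption, standing in for the data set used in this article.</p>
<pre>
# stand-in data (an assumption): y is roughly 3*x plus noise
&gt; set.seed(1)
&gt; xs &lt;- rnorm(100)
&gt; dSim &lt;- data.frame(x=xs, y=3*xs + rnorm(100, sd=0.2))
&gt; mSim &lt;- lm(y~x, data=dSim)
# the coefficient table is a matrix with named rows and columns
&gt; ctab &lt;- coef(summary(mSim))
&gt; colnames(ctab)
[1] "Estimate"   "Std. Error" "t value"    "Pr(&gt;|t|)"
# pick the standard errors by column name, then one value by row name
&gt; se &lt;- ctab[, 'Std. Error']
&gt; se[['x']]
</pre>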
<p>And we finally have the standard errors.  But why did we want the standard errors?  In this case I wanted the standard errors so I could plot the fit model and show the uncertainty of the model.  As is often the case, R already has a function that does all of this.  Also (as is often the case) the R function that does this asks the right statistical question (instead of the obvious question) and can draw error bars that display the uncertainty of future predictions.  The uncertainty in future predictions is in fact different from the uncertainty of the estimate (what was most obvious to calculate from the standard errors) and (after some reflection) is what I really wanted.  Having these sorts of distinctions already thought out is why we are using a statistics package like R instead of just coding everything up.  These calculations are all trivial to implement, but remembering to perform the calculations that answer the right statistical questions can be difficult.  The built-in R solution of plotting the fit model (black line) and the region of expected prediction uncertainty (blue lines) is as follows:</p>
<pre>
&gt; pred &lt;- predict(model,interval='prediction')
&gt; dfit &lt;- data.frame(x,y,fit=pred[,1],lwr=pred[,2],upr=pred[,3])
&gt; ggplot(data=dfit) + geom_point(aes(x=x,y=y)) +
    geom_line(aes(x,fit)) +
    geom_line(aes(x=x,y=lwr),color='blue') + geom_line(aes(x=x,y=upr),color='blue')
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/11/fit1.png" alt="fit1.png" border="0" width="500" height="500" /><br />
</center></p>
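<p>To make the difference between the two kinds of uncertainty concrete, here is a small sketch that asks <tt>predict()</tt> for both a confidence interval (uncertainty of the fitted line) and a prediction interval (uncertainty of a future observation) at the same x values.  The simulated data and the names <tt>dSim</tt>, <tt>mSim</tt> and <tt>newd</tt> are assumptions made for illustration and are not the data plotted above.</p>
<pre>
# stand-in data (an assumption): y is roughly 3*x plus noise
&gt; set.seed(1)
&gt; xs &lt;- rnorm(100)
&gt; dSim &lt;- data.frame(x=xs, y=3*xs + rnorm(100, sd=0.2))
&gt; mSim &lt;- lm(y~x, data=dSim)
&gt; newd &lt;- data.frame(x=c(-1, 0, 1))
# uncertainty of the fitted line itself
&gt; confBand &lt;- predict(mSim, newdata=newd, interval='confidence')
# uncertainty of a future observation at the same x values
&gt; predBand &lt;- predict(mSim, newdata=newd, interval='prediction')
# the prediction band also carries the residual noise, so it is wider
&gt; all((predBand[,'upr'] - predBand[,'lwr']) &gt; (confBand[,'upr'] - confBand[,'lwr']))
[1] TRUE
</pre>
<p>Both calls return a matrix with columns fit, lwr and upr, so either band can be dropped into the plotting code above.</p>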
<p>And we are done.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/' rel='bookmark' title='Permanent Link: CRU graph yet again (with R)'>CRU graph yet again (with R)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Survive R</title>
		<link>http://www.win-vector.com/blog/2009/09/survive-r/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=survive-r</link>
		<comments>http://www.win-vector.com/blog/2009/09/survive-r/#comments</comments>
		<pubDate>Tue, 29 Sep 2009 06:11:02 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=862</guid>
		<description><![CDATA[New PDF slides version (presented at the Bay Area R Users Meetup October 13, 2009). We at Win-Vector LLC appear to like R a bit more than some of our, perhaps wiser, colleagues ( see: Choose your weapon: Matlab, R or something else? and R and data ). While we do like R (see: Exciting [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Permanent Link: Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>New <a href="http://www.win-vector.com/dfiles/SurviveR.pdf">PDF slides version</a> (presented at the <a href="http://www.meetup.com/R-Users/calendar/11202051/">Bay Area R Users Meetup October 13, 2009</a>).</p>
<p>We at Win-Vector LLC appear to like <a href="http://www.r-project.org/">R</a> a bit more than some of our, perhaps wiser, colleagues ( see: <a href="http://scottlocklin.wordpress.com/2009/05/08/choose-your-weapon-matlab-r-or-something-else/">Choose your weapon: Matlab, R or something else?</a> and <a href="http://erehweb.wordpress.com/2009/05/26/r-and-data/">R and data</a> ).  While we do like R (see: <a href="http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/">Exciting Technique #1: The “R” language</a> ) we also understand the need to defend oneself against the abuse regularly dished out by R.  Here we will quickly share a few fighting techniques.<br />
<span id="more-862"></span></p>
<p>If you are not already using R the following will not mean much.  If you are using R this may scratch a few itches.</p>
<ul>
<li>
<p>First: Write down everything and keep notes in a separate file.</p>
<p>When you do figure out how to do something in R it will be concise, powerful, completely un-mnemonic and impossible to find again through the help system.</p>
</li>
<li>
<p>Second: Find some way to search for R answers.</p>
<p><a href="http://stackoverflow.com/questions/102056/how-to-search-for-r-materials">http://stackoverflow.com/questions/102056/how-to-search-for-r-materials</a></p>
</li>
<li>
<p>Third: Learn unclass().</p>
<pre>
# Here is an example of fitting a generalized linear model (from the help(glm) documentation)
## Dobson (1990) Page 93: Randomized Controlled Trial :
&gt; counts &lt;- c(18,17,15,20,10,20,25,13,12)
&gt; outcome &lt;- gl(3,1,9)
&gt; treatment &lt;- gl(3,3)
&gt; glm.D93 &lt;- glm(counts ~ outcome + treatment, family=poisson())
</pre>
<p>Want to get the model coefficients and don't feel like suffering through the documentation/help system?  Printing the glm.D93 object will not show you its member data, because its class overrides the print() and summary() methods to hide the details.  No problem, type this:</p>
<pre>
&gt; model &lt;- unclass(glm.D93)
</pre>
<p>The model is now a harmless list without a bunch of pesky methods hiding the information.</p>
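<p>If stripping the class feels too drastic, a gentler sketch of the same idea is to ask the object for its member names or for a one-level structure summary.  names() and str() are standard R functions; the model is the same help(glm) example, repeated here so the snippet runs on its own.</p>
<pre>
&gt; counts &lt;- c(18,17,15,20,10,20,25,13,12)
&gt; outcome &lt;- gl(3,1,9)
&gt; treatment &lt;- gl(3,3)
&gt; glm.D93 &lt;- glm(counts ~ outcome + treatment, family=poisson())
# list the member names without stripping the class
&gt; names(glm.D93)
# one-level overview of what each member holds
&gt; str(glm.D93, max.level=1)
# the coefficients are an ordinary member of the underlying list
&gt; glm.D93$coefficients
# unclass() changes how the object prints, not what it contains
&gt; identical(names(unclass(glm.D93)), names(glm.D93))
[1] TRUE
</pre>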
</li>
<li>
<p>Fourth: Learn how to list classes and methods.</p>
<p>Often one of methods(), showMethods() or getS3method() can show you what methods are on a class or object.  Be prepared to try them all as they apply in different contexts.</p>
<pre>
# let's make a tricky function
&gt; fe &lt;- function(x) UseMethod("fe")
&gt; fe.formula &lt;- function(x) { print('formula')}
&gt; fe.numeric &lt;- function(x) { print('numeric')}
</pre>
<p>How will anyone figure out what we have done?</p>
<pre>
&gt; class(fe)
[1] "function"

&gt; methods(fe)
# [1] fe.formula fe.numeric

&gt; getS3method('fe','numeric')
# fe.numeric &lt;- function(x) { print('numeric')}
</pre>
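<p>A short sketch of why this matters: the generic picks its method from the class of the argument, so the same call can run different code.  The generic kind() and its two methods are hypothetical names invented for this illustration; methods(class=...) and getAnywhere() are standard R functions for looking in the other direction (from a class, or from a method name that is not exported).</p>
<pre>
# a hypothetical generic with two methods that return a label
&gt; kind &lt;- function(x) UseMethod("kind")
&gt; kind.formula &lt;- function(x) { 'formula' }
&gt; kind.numeric &lt;- function(x) { 'numeric' }
# dispatch follows the class of the argument
&gt; kind(1.5)
[1] "numeric"
&gt; kind(y ~ x)
[1] "formula"
# everything that has a method for the class "lm"
&gt; methods(class='lm')
# find a method even when its package does not export it
&gt; getAnywhere('print.summary.lm')
</pre>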
</li>
<li>
<p>Fifth: Learn to stomp out attributes.</p>
<p>Ever have this crud follow you around?</p>
<pre>
&gt; m &lt;- summary(c(1,2))[4]
&gt; m
Mean
 1.5
</pre>
<p>Ah that&#8217;s cute: a little &#8220;Mean&#8221; tag is following the data around.  But what if we try to use this value:</p>
<pre>
&gt; m*m
Mean
2.25
</pre>
<p>Okay, now the &#8220;Mean&#8221; tag has outstayed its welcome.  The fix:</p>
<pre>
&gt; attributes(m) &lt;- NULL
&gt; m*m
[1] 2.25
</pre>
<p>MUCH better.</p>
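<p>Clearing every attribute is the blunt instrument.  When the only unwanted attribute is the name tag, a few narrower alternatives are sketched below on a plain named vector (the variable v is a made-up stand-in for illustration, not the summary value above).</p>
<pre>
# a plain named vector carries the same kind of tag
&gt; v &lt;- c(Mean=1.5)
# unname() drops the names and keeps the value
&gt; unname(v*v)
[1] 2.25
# double-bracket indexing returns the bare element
&gt; v[['Mean']]
[1] 1.5
# as.numeric() strips attributes, including names
&gt; as.numeric(v)
[1] 1.5
</pre>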
</li>
<li>
<p>Sixth: Swallow your pride.</p>
<p>My example: does R have map structures?  I have no idea and I am too ashamed to ask.  However I know I can fake it with environments (which may be &#8220;the R way to do this&#8221; or may be &#8220;a horrible abuse of the language&#8221;; I have no idea which).</p>
<pre>
&gt; map &lt;- new.env(hash=TRUE)
&gt; assign('dog',7,envir=map)
&gt; ls(map)
[1] "dog"
&gt; get('dog',envir=map)
[1] 7
</pre>
<p>That (nearly) gives you maps with string keys.  For maps with numeric keys we can fake something else up with findInterval().  For maps keyed by generic comparable objects, I have no idea how you would trick R into helping.  This is one reason we like to separate out all data-preparation into a pre-processing step implemented in Java or SQL.</p>
<p>Note important correction from Eward Ratzer: use <tt>map &lt;- new.env(hash=TRUE,parent=emptyenv())</tt>, see comments.</p>
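<p>A sketch of why the parent matters: with the default parent, a lookup that misses in the map falls through to the enclosing environments, so names such as the base function c look like keys.  With an empty parent only the stored keys are visible.  The second half fakes a numeric-key lookup with findInterval(); the names leaky, keys, vals and lookup are made up for this illustration.</p>
<pre>
# default parent: lookups fall through to enclosing environments
&gt; leaky &lt;- new.env(hash=TRUE)
&gt; exists('c', envir=leaky)
[1] TRUE
# empty parent: only the keys we stored are visible
&gt; map &lt;- new.env(hash=TRUE, parent=emptyenv())
&gt; assign('dog', 7, envir=map)
&gt; exists('c', envir=map)
[1] FALSE
&gt; exists('dog', envir=map)
[1] TRUE
# numeric keys: keep the keys sorted and look up the position
&gt; keys &lt;- c(10, 20, 30)
&gt; vals &lt;- c('ten', 'twenty', 'thirty')
&gt; lookup &lt;- function(k) { i &lt;- findInterval(k, keys); if (i &gt; 0 &amp;&amp; keys[i] == k) vals[i] else NA }
&gt; lookup(20)
[1] "twenty"
&gt; lookup(25)
[1] NA
</pre>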
</li>
<li>
<p>Seventh: Find and rely on &#8220;the one-liners.&#8221;</p>
<p>Reading in an entire comma separated file in a single line ( read.table() or its wrapper read.csv() ), re-aggregating data ( table() or doBy&#8217;s summaryBy() command ) or building an empirical cumulative distribution ( ecdf() ) in a single line of code is an experience not to be missed.</p>
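<p>A tiny sketch of three such one-liners using only base R.  The inline text stands in for a file on disk, and the data frame dd with columns g and v is invented for the example.</p>
<pre>
# read.csv() is read.table() with comma-separated defaults; text= stands in for a file name
&gt; dd &lt;- read.csv(text='g,v\na,1\nb,2\na,3\nb,4\na,5')
# count the rows in each group, then pick out one count
&gt; table(dd$g)[['a']]
[1] 3
# empirical cumulative distribution: fraction of v at or below 3
&gt; ecdf(dd$v)(3)
[1] 0.6
</pre>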
</li>
</ul>
<p>The overall point is that while R has some (unnecessarily) sharp edges and pain-points it is a powerful tool worth using.  I would much rather struggle through a minor R-language issue when trying to prepare my data than do without the many special functions, distributions, fitters and plotters built into the R system.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/' rel='bookmark' title='Permanent Link: Exciting Technique #1: The &#8220;R&#8221; language.'>Exciting Technique #1: The &#8220;R&#8221; language.</a></li>
<li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/09/survive-r/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
		<item>
		<title>Good Graphs: Graphical Perception and Data Visualization</title>
		<link>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=good-graphs-graphical-perception-and-data-visualization</link>
		<comments>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 15:40:41 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Cleveland]]></category>
		<category><![CDATA[data exploration]]></category>
		<category><![CDATA[graphical perception]]></category>
		<category><![CDATA[Lattice]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=296</guid>
		<description><![CDATA[What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Permanent Link: Living in A Lognormal World'>Living in A Lognormal World</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='Permanent Link: The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective visualization that neither obscures important details nor drowns us in confusing clutter? In 1985, William Cleveland published a text called <a href="http://www.stat.purdue.edu/~wsc/elements.html"><em>The Elements of Graphing Data,</em></a> inspired by Strunk and White&#8217;s classic writing handbook <a href="http://www.amazon.com/Elements-Style-50th-Anniversary/dp/0205632645"><em>The Elements of Style</em></a>. <em>The Elements of Graphing Data</em> puts forward Cleveland&#8217;s philosophy about how to produce good, clear graphs, not only for presenting one&#8217;s experimental results to peers, but also for the purposes of data analysis and exploration. Cleveland&#8217;s approach is based on a theory of graphical perception: how well the human perceptual system accomplishes certain tasks involved in reading a graph. For a given data analysis task, the goal is to align the information being presented with the perceptual tasks the viewer accomplishes the best. <span id="more-296"></span></p>
<blockquote><p>When a graph is made, quantitative and categorical information is encoded by a display method. Then the information is visually decoded. This visual perception is a vital link. No matter how clever the choice of the information, and no matter how technologically impressive the encoding, a visualization fails if the decoding fails. Some display methods lead to efficient, accurate decoding, and others lead to inefficient, inaccurate decoding. It is only through scientific study of visual perception that informed judgments can be made about display methods. The display methods of <em>Elements</em> rest on a foundation of scientific enquiry.</p></blockquote>
<p>— from the preface of <em>The Elements of Graphing Data</em></p>
<p>A revised edition of <em>The Elements of Graphing Data</em> was published in 1994, along with a companion volume, <a href="http://www.stat.purdue.edu/~wsc/visualizing.html"><em>Visualizing Data,</em></a> which is oriented towards the implementation and technical details of different graphing techniques. I highly recommend <em>The Elements of Graphing Data</em> as a guidebook for creating graphs, as well as for its excellent survey of several useful techniques. Cleveland, along with other colleagues at Bell Labs, developed the <a href="http://stat.bell-labs.com/project/trellis/s.html">Trellis display system,</a> a framework for the visualization of multivariable databases, using the ideas developed in his texts. Trellis, in turn, influenced Deepayan Sarkar&#8217;s Lattice graphics system for R. Lattice implements many of Cleveland&#8217;s ideas, and I also recommend Sarkar&#8217;s <a href="http://lmdvr.r-forge.r-project.org/figures/figures.html">Lattice manual</a> if you do data visualization in R.</p>
<p>It&#8217;s important to note here that Cleveland writes for researchers and decision-makers who use graphs to analyze data, or to convey scientific results to colleagues in an (ideally) objective manner. This distinguishes him from Darrell Huff, whose 1954 <a href="http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728"><em>How to Lie with Statistics</em></a> considered the use of graphs (and statistics in general) as rhetorical devices for convincing others of one&#8217;s point of view. Hence, some of Cleveland&#8217;s recommendations and guidelines actually contradict Huff&#8217;s. <a id="refHuff" href="#Huff"><sup>1</sup></a></p>
<p>Edward Tufte also explored the idea that the choice of graphical display should be influenced by the viewer&#8217;s cognitive processes, in his 1990 book <a href="http://www.edwardtufte.com/tufte/books_ei"><em>Envisioning Information</em></a>. Tufte tends to be more broadly concerned with the gestalt of a graph, beyond its use as an analysis tool; he is also more concerned than Cleveland is with aesthetic considerations.</p>
<p>Cleveland&#8217;s philosophy might be summarized as: <em>minimize the mental gymnastics that the viewer must go through to understand the graph</em>. This leads to some obvious advice: avoid clutter and occlusion, make graphing symbols or color-coding unambiguous, use scale-lines on all four sides of the graph, and so on. It also leads to advice that perhaps should be as obvious, but isn&#8217;t: <em>make the aspect of the data that you want to analyze as clear as possible</em>. But what does this mean in practice?</p>
<p><strong>Make important differences large enough to perceive</strong></p>
<p>Weber&#8217;s Law is a well known observation from the psychophysics literature, which states that the &#8220;just noticeable&#8221; change in a stimulus is a constant ratio of the original stimulus. Put another way, people are only capable of detecting a change in a stimulus that is greater than a certain percentage <em>k</em> of the original stimulus. Here, &#8220;stimulus&#8221; can refer to any perceivable physical quantity: weight, intensity, length, orientation. The percentage <em>k</em> will vary with stimulus, and with observer.</p>
<table border="0" align="center">
<caption>Figure 1: From Cleveland, <em>The Elements of Graphing Data</em></caption>
<tbody>
<tr>
<td>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/weberslaw.jpg" border="0" alt="weberslaw.jpg" width="488" height="233" /></div>
</td>
</tr>
</tbody>
</table>
<p>Figure 1 shows the application of Weber&#8217;s law to lengths. The bars A and B are of different lengths, but the difference is such a small fraction of the &#8220;base&#8221; length (say, A&#8217;s length, to be specific) that it is difficult to tell whether or not they are different, or which is longer. On the right, the bars have been embedded in frames of identical length, and now it is easy to see that B is longer. Why? Because the difference in lengths of the <em>white</em> intervals is a much larger percentage of the white &#8220;base&#8221; length (say the white A interval). It is easy to see that the white B interval is shorter than the white A interval, and therefore, the black B interval is longer than the black A interval.</p>
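<p>The arithmetic behind the framing trick can be sketched with made-up numbers.  The lengths below are hypothetical (they are not measured from Figure 1); the point is only that the same absolute difference becomes a much larger fraction of a shorter base.</p>
<pre>
# hypothetical bar lengths, in arbitrary units
&gt; blackA &lt;- 100; blackB &lt;- 104; frame &lt;- 120
# relative difference of the black bars
&gt; (blackB - blackA)/blackA
[1] 0.04
# the frame turns the same 4 units into a difference of white lengths
&gt; whiteA &lt;- frame - blackA; whiteB &lt;- frame - blackB
&gt; (whiteA - whiteB)/whiteA
[1] 0.2
</pre>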
<p>The moral is that you always want the viewer to be estimating changes or differences with respect to a short base length. You can do this with reference grids, as demonstrated below.</p>
<table border="0" align="center">
<caption>From Cleveland, <em>The Elements of Graphing Data</em></caption>
<tbody>
<tr>
<td><!-- original 319 by 601 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/noreferencegrids.jpg" border="0" alt="noreferencegrids.jpg" width="200" height="400" align="left" /></td>
<td><!-- original 319 by 601 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/referencegrids1.jpg" border="0" alt="referencegrids.jpg" width="200" height="400" align="right" /></td>
</tr>
<tr>
<td align="center">Figure 2</td>
<td align="center">Figure 3</td>
</tr>
</tbody>
</table>
<p>Figure 2 shows eight curves. Which one dips to the lowest minimum? Are the high curves approaching the same value, and which one is rising the fastest? Are the low curves dipping to the same minimum? Are they going to the same steady state? Figure 3 shows the same curves, graphed with identical reference grids. The grids shorten the base lengths that are being compared, and it is now much easier to compare highs, lows, and steady state behavior.</p>
<p>But wouldn&#8217;t it be better to compare the graphs by superposing them? For two or three curves, perhaps. But in this case, eight curves can clutter the graph, and use up the symbol or color space, making it difficult to distinguish the different datasets &#8212; increasing the mental gymnastics.</p>
<p>Reference grids are useful even for a single curve, especially one with slowly varying segments, such as these graphs have. The reference grid makes it easier to answer questions like: does the process return to the initial state, or to a different steady state? Has the process reached steady state, or is it still growing?</p>
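<p>For readers working in R, a layout in the spirit of Figure 3 can be sketched with the lattice package, where the panel type 'g' draws the same reference grid in every panel.  The eight curves below are synthetic stand-ins (an assumption for illustration), not Cleveland&#8217;s data.</p>
<pre>
&gt; library(lattice)
# synthetic curves (an assumption): eight damped oscillations with different decay rates
&gt; tt &lt;- seq(0, 10, length.out=200)
&gt; curves &lt;- do.call(rbind, lapply(1:8, function(k) data.frame(t=tt, y=exp(-tt/k)*cos(tt), panel=factor(k))))
# type 'l' draws the curve; type 'g' adds an identical reference grid to each panel
&gt; p &lt;- xyplot(y ~ t | panel, data=curves, type=c('g','l'), layout=c(2,4))
&gt; print(p)
</pre>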
<p><strong>Make important shape changes large enough to perceive: Banking to 45 degrees.</strong></p>
<p>The aspect ratio of a graph is important when trying to understand shape. Rate of change information is encoded in the slope of the curve, which the viewer estimates by changes in the orientation of the local tangents at each point of the graph. Weber&#8217;s Law tells us that very small changes in this orientation will be difficult to detect. For a given (physical) curve, the local orientation changes will be dependent on the aspect ratio of its graphical presentation, as shown (to an exaggerated degree) in Figure 4. Here, the same curve (two line segments) is plotted at three different aspect ratios, one that centers the graph at 45 degrees, one that forces the curve to be nearly vertical, and another that forces it to be nearly horizontal. In the last two cases, the change in orientation of the two line segments is so small as to be nearly undetectable.</p>
<table border="0" align="center">
<caption>Figure 4: From Cleveland</caption>
<tbody>
<tr>
<td><!-- original 670 by 630 -->
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/angles.jpg" border="0" alt="angles.jpg" width="446" height="420" align="left" /></div>
</td>
</tr>
</tbody>
</table>
<p>For two line segments with positive, unequal slopes, a simple geometric argument shows that their absolute difference in orientation is maximized by the aspect ratio that sets their average orientation to 45 degrees (the first graph in Figure 4). Empirical studies by Cleveland and others have indeed verified that a viewer&#8217;s ability to judge the relative slopes of line segments on a graph is maximized when the absolute values of the orientations of the segments are centered on 45 degrees.</p>
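<p>The geometric argument can be checked numerically.  Changing the aspect ratio multiplies both slopes by a common factor a, so the orientation difference is |atan(a*s1) - atan(a*s2)|.  Setting its derivative to zero gives a = 1/sqrt(s1*s2), at which point the two scaled slopes are reciprocals of each other and their orientations sum to 90 degrees.  The slopes below are hypothetical values chosen for illustration, and optimize() is used only as an independent check of the closed form.</p>
<pre>
# two hypothetical positive slopes in data units
&gt; s1 &lt;- 0.5; s2 &lt;- 4
# orientation difference (radians) after scaling both slopes by aspect factor a
&gt; gap &lt;- function(a) abs(atan(a*s1) - atan(a*s2))
# numeric maximizer and the closed form should agree closely
&gt; best &lt;- optimize(gap, interval=c(0.01, 10), maximum=TRUE)$maximum
&gt; c(best, 1/sqrt(s1*s2))
# average orientation at the best aspect factor, in degrees (should be close to 45)
&gt; mean(atan(best*c(s1, s2)))*180/pi
</pre>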
<p>This result leads to a technique called <em>Banking to 45</em>, whereby the aspect ratio of the graph is chosen so that the average slope of the entire graph is 45 degrees. The details are discussed in Cleveland, and many of the plots in R&#8217;s Lattice package also have an option to bank the graph to 45 degrees.</p>
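<p>In lattice the banking option is the aspect argument: aspect='xy' asks the package to choose the aspect ratio by the 45 degree banking rule.  A minimal sketch, using the yearly sunspot series that ships with R (the choice of data set is ours, not taken from Cleveland&#8217;s figures):</p>
<pre>
&gt; library(lattice)
# yearly sunspot numbers from the datasets package
&gt; yr &lt;- as.numeric(time(sunspot.year))
&gt; spots &lt;- as.numeric(sunspot.year)
# default aspect ratio
&gt; p1 &lt;- xyplot(spots ~ yr, type='l')
# aspect ratio chosen by banking to 45 degrees
&gt; p2 &lt;- xyplot(spots ~ yr, type='l', aspect='xy')
&gt; print(p1); print(p2)
</pre>
<p>The banked version comes out short and wide, which makes the asymmetry between the rises and the falls of the cycles easier to judge.</p>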
<p>This deliberate exaggeration of slope is something that Darrell Huff deplores. In <em>How to Lie with Statistics</em>, Huff refers to these graphs as &#8220;gee-whiz&#8221; graphs — and in the context of his discussion of statistics as rhetoric, they are:</p>
<table border="0" align="center">
<caption>Figure 5: From Huff, <em>How to Lie With Statistics</em></caption>
<tbody>
<tr>
<td><!-- original 461 by 351 -->
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/geewhiz.jpg" border="0" alt="geewhiz.jpg" width="461" height="351" /></div>
</td>
</tr>
</tbody>
</table>
<p>To insist that a graph should always include a zero line and that units be in proportion may be good advice from a rhetorical perspective; but it is poor advice if the purpose of the graph is data analysis. As Figure 6 below demonstrates, we can lose resolution if we always insist on including the zero. Does the trend line in the left graph increase linearly, superlinearly, or sublinearly? The convexity of the curve is more apparent when it is banked to 45, as on the right. Assuming that the scientist reads the axis and is cognizant of the actual magnitude changes involved, the graph on the right conveys more information.</p>
<table border="0" align="center">
<caption>Figure 6: From Cleveland</caption>
<tbody>
<tr>
<td><img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/bank451.jpg" border="0" alt="bank45.jpg" width="500"  /></td>
</tr>
</tbody>
</table>
<p><strong>Make sure all the data is equally well resolved.</strong></p>
<p>It is quite common for positive data (word frequencies, populations, price distributions, just to name a few examples) to be skewed: most of the data is bunched towards low values, while the rest is spread out along a very long tail. This long tail squashes the majority of the data into a tiny interval of a very narrow dynamic range, as in Figure 7, making it difficult to evaluate the data.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original size: 990 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/skewed1.gif" border="0" alt="skewed.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 7: Long-tailed distribution of purchase sizes</caption>
</table>
</td>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original = 499 by 675 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/logskewed1.gif" border="0" alt="logskewed.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 8: Distribution of log(purchase size)</caption>
</table>
</td>
</tr>
</tbody>
</table>
<p>Imagine that Figure 7 represents the distribution of average purchase size across an online merchant&#8217;s customers: average purchase size is plotted on the x-axis, and the y-axis represents the fraction of the total customer population whose average purchase size is a given value (the area under the graph integrates to one). According to this graph, most customers make fairly small purchases on average, but there is a long tail of big spenders trailing out into the range of several thousand dollars. Obviously, one would like a little more resolution on the big spike of customers near zero. One could simply &#8220;zoom in&#8221; on this range by chopping off a long chunk of the tail, but doing so risks losing sight of some global patterns in the data.</p>
<p>Graphing the distribution of log(purchase size) enables you to increase the resolution near zero, while preserving the global view. Figure 8 shows the distribution of log(purchase size), revealing two spending populations: a population of high spenders who tend to make purchases in the $3000 range (in log space), and another population whose purchases are centered (in log space) around $60. The existence of these two distinct populations is not apparent in the original graph.</p>
<p>Notice that Figure 8 has two x-axis scales: the top axis is marked in log units, while the bottom axis is marked in absolute dollars, spaced on a log scale. This accords with the principle of minimizing mental gymnastics, since the viewer of the graph will typically be concerned about prices in dollars, not log dollars. In fact, it would have been better yet to have plotted the distribution of log<sub>2</sub> or log<sub>10</sub> of the data; the former would allow us to see at a glance the doubling of price ranges, the latter to see price changes in factors of ten.</p>
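<p>A rough sketch of this kind of plot in R with ggplot2: scale_x_log10() estimates the density on log<sub>10</sub> of the data while the tick labels remain in the original units.  The purchase data below are simulated (an assumption for illustration: a mix of two lognormal populations centered near $60 and $3000), not the merchant data behind Figures 7 and 8.</p>
<pre>
&gt; library(ggplot2)
# simulated purchases (an assumption): two lognormal populations
&gt; set.seed(2)
&gt; purchases &lt;- data.frame(size=c(rlnorm(900, meanlog=log(60), sdlog=0.5), rlnorm(100, meanlog=log(3000), sdlog=0.3)))
# linear axis: the long tail squashes the bulk of the data near zero
&gt; ggplot(data=purchases) + geom_density(aes(x=size))
# log10 axis: both populations are resolved, ticks are still in dollars
&gt; ggplot(data=purchases) + geom_density(aes(x=size)) +
    scale_x_log10(breaks=c(10,100,1000,10000))
</pre>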
<table border="0" align="center">
<caption>Figure 9: The 14 most abundant elements in meteorites. From Cleveland</caption>
<tbody>
<tr>
<td><!-- original = 543 by 522 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/metals.jpg" border="0" alt="metals.jpg" width="250" /></td>
<td><!-- original = 550 by 600 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/logmetals.jpg" border="0" alt="logmetals.jpg" width="250" /></td>
</tr>
</tbody>
</table>
<p>Figure 9 shows another example: the fourteen most abundant elements in meteorites, specifically the average percent of each of the elements. If we graph the percentages directly, as on the left, we cannot easily distinguish the differences in the elements from aluminum on down. Graphing log<sub>2</sub> of the percentages, as on the right, improves the resolution. Again, we have two x-axes on the graph of the log data.</p>
<p><strong>If you want to analyze the difference between two processes, then graph the difference, not the processes (or graph both).</strong></p>
<p>Suppose that we are comparing the two processes f1 and f2 that are shown in Figure 10. As x increases, the two processes appear to be approaching each other; that is, the difference between the two seems to be decreasing. In reality, the difference between the two is constant: f2 = f1 + 1.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original size: 990 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/difference1.gif" border="0" alt="difference.gif" width="250" /></td>
</tr>
</tbody>
<caption>Figure 10: The illusion of convergence</caption>
</table>
</td>
<td>
<table border="0">
<tbody>
<tr>
<td><!-- original = 499 by 675 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/imports.jpg" border="0" alt="imports.jpg" width="250" /></td>
</tr>
</tbody>
<caption>Figure 11: British Imports and Exports. From Cleveland</caption>
</table>
</td>
</tr>
</tbody>
</table>
<p>It turns out that people are good at perceiving the perpendicular difference between two curves, but not the differences in height, which is what we are actually interested in here. When we try to infer the differences from the process graph, we may not only miss key information, we may actually draw incorrect conclusions.</p>
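<p>A small numerical sketch makes the illusion concrete. The curve <code>f1(x) = x**2</code> is a stand-in chosen for this example (the post does not say which function Figure 10 plots). The perpendicular gap is approximated by treating each curve as locally straight: two parallel lines one unit apart vertically, with slope m, are 1/sqrt(1 + m<sup>2</sup>) apart perpendicularly.</p>

```python
import math

def f1(x):
    # Hypothetical steep curve; not the function plotted in Figure 10.
    return x ** 2

def f2(x):
    # Constant vertical offset of 1, as in the text: f2 = f1 + 1.
    return f1(x) + 1

def slope(x):
    # Derivative of x**2.
    return 2 * x

xs = [0, 1, 2, 4, 8]

# The quantity we care about: the difference in height at each x.
vertical_gaps = [f2(x) - f1(x) for x in xs]

# What the eye tends to judge: the perpendicular distance between curves,
# using the locally-straight approximation 1 / sqrt(1 + slope^2).
perceived_gaps = [1 / math.sqrt(1 + slope(x) ** 2) for x in xs]

for x, v, p in zip(xs, vertical_gaps, perceived_gaps):
    print(x, v, round(p, 3))
```

<p>The vertical gap stays at 1 for every x, while the approximate perpendicular gap shrinks as the curves get steeper. Plotting <code>vertical_gaps</code> directly removes the ambiguity.</p>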
<p>A less toy example is given in Figure 11. Here the imports to and exports from England are graphed over the first 80 years of the 18th century. In the difference graph on the bottom, we can see a local peak in (imports-exports) just after 1760; this is not obvious from simply comparing the two processes (top graph).</p>
<p><strong>If you are interested in rate of change, then graph rate of change.</strong></p>
<p>In Figure 12, we see the population figures for a given community from 1990 to 2009. Obviously, the population is steadily increasing, but how quickly? Is the rate of population growth increasing over time, or is it decreasing? If we are interested in these questions, then simply graphing the population over time is not enough. We need to look at the rate of change directly.</p>
<table border="0" align="center">
<tbody>
<tr>
<td>
<table border="0">
<caption>Figure 12</caption>
<tbody>
<tr>
<td><!-- original 998 by 860 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/rateofchange1.gif" border="0" alt="rateofchange.gif" width="250" /></td>
</tr>
</tbody>
</table>
</td>
<td>
<table border="0">
<caption>Figure 13</caption>
<tbody>
<tr>
<td><!-- original 720 by 720 --><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lograteofchange2.gif" border="0" alt="lograteofchange.gif" width="250" /></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The classic way to do this is by graphing the logarithm of the data. In Figure 13, we have graphed log<sub>2</sub> of the population over time, with the log scale printed on the right-hand y-axis, and the actual population numbers printed at a log scale on the left-hand axis. Now we can see that the population increased at a constant rate from 1990 to 2000, quadrupling approximately every four years, and then slowed down (to a lower constant rate) after 2000.</p>
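<p>The reason the logarithm works is that a constant growth rate becomes a straight line in log space, and the slope of that line is the rate. The sketch below uses made-up population counts (not the data behind Figures 12 and 13) to show the slope of log<sub>2</sub>(population) measured in doublings per year.</p>

```python
import math

# Hypothetical counts, chosen so the arithmetic is easy to follow.
years = [1990, 1992, 1994, 1996, 2000]
population = [1000, 2000, 4000, 8000, 16000]

log2_pop = [math.log2(p) for p in population]

# Slope of the log2 curve between consecutive observations:
# one unit of log2 is one doubling, so this is doublings per year.
doublings_per_year = [
    (log2_pop[i + 1] - log2_pop[i]) / (years[i + 1] - years[i])
    for i in range(len(years) - 1)
]

# Inverting the slope gives the doubling time in years.
doubling_time_years = [1 / rate for rate in doublings_per_year]

for start, rate, time in zip(years, doublings_per_year, doubling_time_years):
    print(start, round(rate, 3), round(time, 3))
```

<p>In this toy series the slope is the same for the first three intervals and halves for the last one, which is the kind of slowdown that is hard to judge from the raw counts but shows up as a change of slope on a log plot.</p>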
<p><strong>Graphs as a research tool</strong></p>
<p>Throughout this discussion, we have considered graphs as a tool for data exploration and initial understanding. It is an iterative process &#8212; as questions arise, the data will be reprocessed and re-plotted to highlight the new issues to be examined. A good research graph must display this information directly, with a minimum of mental gymnastics, but &#8212; as with any research tool &#8212; there can be a learning curve. For example, density plots (such as those shown in Figures 7 and 8) are in my opinion more useful than histograms for understanding how numerical data is distributed &#8212; and I am constantly surprised at the amount of explanation that they require when I show them to people who are unfamiliar with them. A number of very useful graphs that are discussed in Cleveland&#8217;s texts meet with the same reaction from people who encounter that style of graph for the first time. This is a disadvantage, relative to using a more fashionable graph, when attempting to communicate results. But the insight into the data that these graphs provide often makes it worth spending the time to educate clients or peers on how to read the graph.</p>
<p>Even so, a good graph still may not be a quick read. As Cleveland writes:</p>
<blockquote><p>While there is a place for rapidly-understood graphs, it is too limiting to make speed a requirement in science and technology, where the use of graphs ranges from detailed in-depth data analysis to quick presentation.<br />
&#8230;</p>
<p>The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.</p></blockquote>
<p>- <em>The Elements of Graphing Data</em>, Chapter 2</p>
<hr /><p><a id="Huff" href="#refHuff">[Back]</a><sup>1</sup> <em>How to Lie with Statistics</em> is an entertaining (if a little dated) discussion of how to read statistical and quantitative claims critically, and is definitely worth a read.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='Permanent Link: A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/' rel='bookmark' title='Permanent Link: Living in A Lognormal World'>Living in A Lognormal World</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='Permanent Link: The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Exciting Technique #1: The &#8220;R&#8221; language.</title>
		<link>http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=exciting-technique-1-the-r-language</link>
		<comments>http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/#comments</comments>
		<pubDate>Thu, 22 Jan 2009 19:59:01 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=26</guid>
		<description><![CDATA[Our first &#8220;exciting technique&#8221; article is about a statistical language called &#8220;R.&#8221; R is a language for statistical analysis available from http://cran.r-project.org/ . The things you can immediately do with it are incredible. You can import a spreadsheet and immediately spot relationships, trends and anomalies. R gives you instant access to top-notch visualization methods [...]


Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Our first &#8220;exciting technique&#8221; article is about a statistical language called &#8220;R.&#8221;</p>
<p>R is a language for statistical analysis available from <a href="http://cran.r-project.org/">http://cran.r-project.org/</a> .  The things you can immediately do with it are incredible.  You can import a spreadsheet and immediately spot relationships, trends and anomalies.  R gives you instant access to top-notch visualization methods and sophisticated statistical methods.</p>
<p><span id="more-26"></span></p>
<p>R is so hot (a strange thing to say about a statistics package) that it was the subject of a recent New York Times article: <a href="http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html">http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html</a> .  If you read between the lines, some of the interviewees come off as being slightly threatened by R (there is a slight hint of &#8220;R is very good for others&#8221;).  In fact R is simply very good.  A good statistician with R can do things that a great statistician without R cannot.  Like all tools, R is dangerous: ask for the wrong analysis and you will draw wrong and misleading conclusions.  Ask for the right analysis and R will correctly perform it while tracking critical implementation details that would take you hundreds of hours to master on your own.</p>
<p>Want to produce graphs using the theories of perception and analysis of W. S. Cleveland?  Simple: use Deepayan Sarkar&#8217;s &#8220;Lattice&#8221; package, which even has a wonderful book.</p>
<p>Want to find subtle relationships in your data using logistic regression (one of the more complicated cousins of linear regression)?  That is built into the base R system.</p>
<p>Need to re-run all of your analyses because the data has changed?  R is script-based and stores your command history.  A single paste can re-run a 20-step analysis and re-build a 10-slide presentation.</p>
<p>Impressed by a particular type of analysis? Take, for example, Roger Koenker&#8217;s &#8220;Quantile Regression&#8221; (which is a brilliant idea backed by a masterpiece of a book).  Guess what: the original author has supplied a free R package that implements the ideas.</p>
<p>Want to give a client working software?  Easy, R is open source and comes with very good automated installers for OSX, Linux and Windows.</p>
<p>Want to train somebody to use R?  Easy: R has an extensive library of excellent books, and there is even an exciting book series titled &#8220;Use R!&#8221;</p>
<p>Want to learn the internals of R from John M. Chambers (one of the inventors of the &#8220;S&#8221; language that R is an implementation of)?  You are in luck: the latest book by Chambers is &#8220;Software for Data Analysis: Programming with R.&#8221;  R is so popular that it has managed to pull one of the creators of the S language and the proprietary S+ implementation into its world.</p>
<p>It is almost getting to the point where you need to justify not using R.</p>


<p>Related posts:<ol><li><a href='http://www.win-vector.com/blog/2010/03/r-annoyances/' rel='bookmark' title='Permanent Link: R annoyances'>R annoyances</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='Permanent Link: R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Permanent Link: Survive R'>Survive R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
