<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Opinion</title>
	<atom:link href="http://www.win-vector.com/blog/category/opinion/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Why you can not to use statistics to dispute magic</title>
		<link>http://www.win-vector.com/blog/2011/12/why-you-can-not-to-use-statistics-to-dispute-magic/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=why-you-can-not-to-use-statistics-to-dispute-magic</link>
		<comments>http://www.win-vector.com/blog/2011/12/why-you-can-not-to-use-statistics-to-dispute-magic/#comments</comments>
		<pubDate>Sat, 10 Dec 2011 17:42:02 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Fisher]]></category>
		<category><![CDATA[Junk Science]]></category>
		<category><![CDATA[Null Hyphothesis]]></category>
		<category><![CDATA[Positivism]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1903</guid>
		<description><![CDATA[It is a subtle point that statistical modeling is different than model based science. However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded). It is often claimed that model based science is [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/' rel='bookmark' title='Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/' rel='bookmark' title='&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures'>&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>It is a subtle point that statistical modeling is different than model based science.  However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded).  It is often claimed that model based science is being done when in fact statistics is what is being done (for instance some of the unfortunate distractions of flawed reports related to <a target="_blank" href="http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/">the important question of the magnitude of plausible anthropogenic global warming</a>).</p>
<p>Both model based science and statistics are wonderful fields, but it is important to not receive the results of one when you have paid for the other.</p>
<p>We will pointedly discuss one of the differences.<span id="more-1903"></span> First let us define our terms.</p>
<p>I will take &#8220;model based science&#8221; to essentially mean <a target="_blank" href="http://en.wikipedia.org/wiki/Falsifiability">Popperian Falsifiability</a> (an alternative to <a target="_blank" href="http://en.wikipedia.org/wiki/Positivism">positivism</a>).  This is roughly: you construct a statement or model, and the model is said to have empirical content only if it is in theory possible to &#8220;falsify the model.&#8221;  That is, the model must make predictions that are specific enough to potentially be disproved.  If you see a single instance of the model being wrong, you say the model is wrong (or at best incomplete).  And you are done.  Frankly, for all the philosophical sturm und drang, this is closest to what is meant by science.</p>
<p>I will take statistical modeling to roughly mean <a target="_blank" href="http://en.wikipedia.org/wiki/Null_hypothesis">Fisherian Null Hypothesis rejection</a>.  This is only one branch of statistics (in addition to Fisher&#8217;s methods we also have frequentist and Bayesian methods, in particular see:  <a target="_blank" href="http://stat.stanford.edu/~ckirby/brad/other/">Controversies in the foundations of statistics, Bradley Efron, Amer. Math. Mon. 85, 231-246, 1978</a>) but it is closest to what is actually performed in statistical studies.</p>
<p>You can see the two methods sound very similar: they both emphasize rejection of a hypothesis.  But this is deceptive.  In the case of Popperian falsifiability you are essentially holding on to a hypothesis that you believe, but are very willing to give up (one wrong prediction and it is out).  In the case of Fisherian rejection you don&#8217;t believe the null hypothesis, but you are holding back rejection until you collect enough data to get rid of it.</p>
<p>Let us go over that again.</p>
<p>In the falsifiable or model based science regime: a theory or model would be a prescriptive set of guidelines or laws that allows you to build things (like tall skyscrapers).  If ever one of your skyscrapers unexpectedly falls, you know your theory is wrong and you revise.  Rejection is quick.  But essentially you honestly believed the theory while you were using it.   You were on its side and to counter this bias you agree to reject the theory on first failure.</p>
<p>In the statistical regime you never believed the null hypothesis.  It is a stand-in that you are trying to embarrass out of existence by finding a lot of evidence against it.  Because you know you are against the null hypothesis you do two things to try to mitigate your bias against it: you operationally presume it is true during reasoning, and you don&#8217;t reject it until there is a lot of evidence against it.</p>
<p>To sum up: in model based science you believe the model and are confident it can&#8217;t be toppled easily (so you don&#8217;t defend it, as you are confident it will survive); in statistics you doubt the null hypothesis and you give it every chance to survive (because you are sure that it will not survive).</p>
<p>Now that I have stated my premises let us move on to the field I intended to criticize: <a target="_blank" href="http://boingboing.net/2011/12/07/esp-proponents-claim-that-esp.html">paranormal powers</a>.  </p>
<p>To be deliberately rude: if you are investigating something that does not have a proposed mechanism that you are willing to test and reject you are not doing model based science.  And by definition the paranormal is outside of current scientific explanation.  It was too much to hope that we were doing model based science in this case (the appearance is deliberately that of science instead of statistics, but our science friends won&#8217;t help us call this out as they are often profiting from the same confusion).  So you are doing statistics (and there is nothing wrong with that).  But if you are doing statistics what is your null hypothesis?  </p>
<ul>
<li>Null Hypothesis  Candidate 1: ESP does not exist.
<p>This is a plausible hypothesis and sounds &#8220;nully&#8221; (it doesn&#8217;t claim much).  But you would only be able to use this null hypothesis to try to prove the existence of ESP.</p>
<p>But it is exactly the wrong hypothesis to use to disprove ESP.<br />
&#8220;The null hypothesis can never be proven&#8221; (see <a target="_blank" href="http://en.wikipedia.org/wiki/Null_hypothesis">Null Hypothesis</a> and<br />
<a target="_blank" href="http://www.win-vector.com/blog/tag/statsmanship/">Statsmanship</a>).  Fisherian testing is unfortunately a one-sided design; it can only reject a null hypothesis (not fully settle questions).</p>
</li>
<li>Null Hypothesis  Candidate 2: ESP does  exist.
</li>
<p>This is the null hypothesis you need to work with to reject ESP.</p>
<p>But here is the trap.  You must operationally work with the hypothesis (even if you don&#8217;t like it) during the rejection attempt.  Since you are forced to &#8220;operationally accept&#8221; the null hypothesis for the duration of the study you have absolutely no defense against critiques like:</p>
<blockquote><p>
This latter review didn’t find any problems in our methodology or writeup itself, but suggested that, since the three of us (Richard Wiseman, Chris French and I) are all skeptical of ESP, we might have unconsciously influenced the results using our own psychic powers.
</p></blockquote>
<p>The paranormal is just one big game of <a target="_blank" href="http://en.wikipedia.org/wiki/Mornington_Crescent_(game)">Mornington Crescent</a>. So if you failed to claim that there is no such thing as  psychic dampening powers <em>before</em> your opponent accuses you of using such powers: you lose.  The game is all about timing, not reality.  If you don&#8217;t like this kind of situation, don&#8217;t get into this kind of situation.</p>
<p>This is why you shouldn&#8217;t use statistics to study bullshit.  Statistical testing methods are deliberately designed to be weak.  Unfortunately they are easy to work around if given enough rope.</p>
</ul>
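<p>To make Candidate 1 concrete, here is a minimal sketch in <code>R</code> of a Fisherian test for a card-guessing experiment. The experiment and every number in it are made up for illustration (250 guesses, five symbols, 54 hits); none of it comes from the studies discussed above.</p>
<pre><code>
# Hypothetical experiment: 250 guesses of a five-symbol card,
# so the chance hit rate under "ESP does not exist" is 1/5.
n_trials = 250
p_chance = 1/5
n_hits = 54   # made-up count, a little above the 50 expected by chance

# One-sided exact binomial test of the null "hit rate equals chance".
test = binom.test(n_hits, n_trials, p=p_chance, alternative="greater")
print(test$p.value)

# Two-sided 95% interval for the hit rate, for comparison.
interval = binom.test(n_hits, n_trials, p=p_chance)$conf.int
print(interval)
</code></pre>
<p>With counts like these the test does not reject the null hypothesis, yet the interval still contains hit rates above chance. That is the one-sidedness in action: failing to reject &#8220;ESP does not exist&#8221; is not a demonstration that it does not exist.</p>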
<p>None of this would matter if it didn&#8217;t also hold for a lot of what is called mainstream science.  Everyone wants the adulation of having important scientific results, but they seem only to want to pay to commission statistics.</p>
<p>Take big money pharmaceuticals as an example.  Non-working drugs can deliver <em>equivocal</em> results forever (as long as you keep weakening the proposed claims after each study) and always being &#8220;on the verge&#8221; of a significant result can fund an endless number of studies and careers.</p>
<p>It is now past time to define what I meant by &#8220;magic.&#8221;  Magic, for this article, is any hypothesis that is not sufficiently specific and bounded.  You can design statistical studies to test many things, but only if you can specifically describe the limits of what you are attempting to study prior to the experimental work.  There are two main classes of magic hypothesis: the powerful and the weak.  Powerful magic hypotheses are unfalsifiable because they have no pre-defined limit on what they can bring in to defend themselves post experiment.  Weak magic hypotheses are unfalsifiable for the simple reason that they can be revised after any experiment to claim the effect is present but just slightly more subtle than the resolving power of the last experiment.</p>
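<p>The weak kind of magic can be illustrated with a back-of-the-envelope power calculation. The sketch below uses the standard normal approximation for a one-sided test of a single proportion; the function name <code>trials_needed</code>, the chance rate of 0.2 and the claimed effect sizes are all hypothetical choices made for illustration.</p>
<pre><code>
# Approximate number of trials needed for a one-sided test (level alpha)
# of "hit rate = p0" to detect a true hit rate p1 with the given power.
# Normal-approximation formula; all inputs below are illustrative.
trials_needed = function(p0, p1, alpha=0.05, power=0.8) {
  z_a = qnorm(1 - alpha)
  z_b = qnorm(power)
  num = z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))
  ceiling((num / (p1 - p0))^2)
}

p0 = 0.2
claimed_effects = c(0.05, 0.02, 0.01, 0.005)
n = sapply(claimed_effects, function(d) trials_needed(p0, p0 + d))
print(data.frame(claimed_effect=claimed_effects, trials_needed=n))
</code></pre>
<p>Each time the claimed effect is cut, the experiment needed to see it grows roughly with the inverse square of the effect, so a claimant who retreats after every study can always stay ahead of the next one.</p>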
<p>You must be very clear about when you are doing science and about when you are doing statistics.  The unfortunate truth is: it is very difficult to successfully dispute junk science using tools as deliberately delicate as statistical hypothesis testing.  Without a sufficiently critical mindset you get <a target="_blank" href="http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/">deliberately bad statistics</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Cargo_cult_science">cargo cult science</a> and <a target="_blank" href="https://plus.google.com/114134834346472219368/posts/ZBNSWpqUsvb">dishonest math</a>.  A good essay on this (researchers wanting to claim the benefits of the trappings of mathematics, but not willing to meet the very strict pre-conditions required) is &#8220;The Pernicious Influence of Mathematics on Science&#8221; Jack Schwartz, 1962 (collected in &#8220;Discrete Thoughts: Essays on mathematics, science, and philosophy&#8221; Mark Kac, Gian-Carlo Rota, Jacob T. Schwartz, Birkhauser  1992).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/' rel='bookmark' title='Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/' rel='bookmark' title='&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures'>&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/12/why-you-can-not-to-use-statistics-to-dispute-magic/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>My Favorite Graphs</title>
		<link>http://www.win-vector.com/blog/2011/12/my-favorite-graphs/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=my-favorite-graphs</link>
		<comments>http://www.win-vector.com/blog/2011/12/my-favorite-graphs/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 00:59:19 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[boxplots]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[graphical perception]]></category>
		<category><![CDATA[linear regression]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[statistical graphs]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1886</guid>
		<description><![CDATA[The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all. &#8211; William Cleveland, The Elements of Graphing Data, [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<blockquote><p>The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.</p></blockquote>
<p>&#8211; William Cleveland, <em>The Elements of Graphing Data</em>, Chapter 2</p>
<p>In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.</p>
<p>I tend to follow Cleveland&#8217;s philosophy, quoted above; these graphs show me &#8212; and hopefully you &#8212; aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.</p>
<p><span id="more-1886"></span></p>
<p>The graphs are all produced in <code>R</code>, using the <code>ggplot2</code> package. While <code>ggplot2</code> has a fairly steep learning curve, it is the most flexible of the <code>R</code> graphing packages that I have encountered, and I&#8217;ve been able to create rich graphics more quickly and easily than I would be able to with the <code>R</code> base graphics, or with other graphics packages.</p>
<p>Let&#8217;s start with some exploratory analysis. We will use the <code>AdultUCI</code> dataset that is included in the <code>arules</code> package.</p>
<pre><code>
library(arules)
data("AdultUCI")
dframe = AdultUCI[, c("education", "hours-per-week")]
colnames(dframe) = c("education", "hours_per_week")
         # get rid of the annoying minus signs in the column names
</code></pre>
<p>We want to compare the distribution of work-week length to education, using a box-and-whisker plot that is overlaid on a jittered scatterplot of the data.</p>
<pre><code>
library(ggplot2)
ggplot(dframe, aes(x=education, y=hours_per_week)) +
          geom_point(colour="lightblue", alpha=0.1, position="jitter") +
          geom_boxplot(outlier.size=0, alpha=0.2) + coord_flip()
</code></pre>
<p>The <code>outlier.size=0</code> argument to <code>geom_boxplot</code> turns off the outlier plotting, and <code>coord_flip</code> switches the coordinate axes (because there are a lot of education levels).</p>
<p>The resulting graph:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot.png" alt="Rplot" border="0"/></p>
<p>Recall that the box of a box-and-whisker plot covers the central 50% of the data distribution; the line in the center marks the median. In this case, the work-week length concentrates so strongly at 40 hours (except for PhDs and those with professional degrees; they are doomed to work longer hours, typically) that most of the boxes appear one-sided; it&#8217;s easier to see what is happening with both the scatterplot and box-and-whisker superimposed, than it might be with the box-and-whisker alone. We can also see the relative concentration of the subjects along each educational level.</p>
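<p>As a small check on why the boxes look one-sided, the sketch below computes the quartiles directly. It uses synthetic numbers as a stand-in (not the <code>AdultUCI</code> data): a large lump at exactly 40 hours with a tail on either side.</p>
<pre><code>
# Synthetic stand-in for work-week lengths (NOT the AdultUCI data):
# 45% at exactly 40 hours, 15% shorter, 40% longer.
set.seed(2011)
hours = c(rep(40, 450),
          round(runif(150, min=10, max=39)),
          round(runif(400, min=41, max=70)))

# The box spans the first to third quartile; the centre line is the median.
q = quantile(hours, probs=c(0.25, 0.5, 0.75))
print(q)

# boxplot.stats gives the five numbers a box-and-whisker plot is drawn from.
print(boxplot.stats(hours)$stats)
</code></pre>
<p>Here the first quartile and the median both sit at 40, so the box only extends upward. The same calculation per education level (for example with <code>tapply</code>) would show which groups have this shape.</p>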
<p>I&#8217;ve found that this superimposed graph is fairly easy to explain in a presentation (easier than a plain box-and-whisker, actually). The primary disadvantage is that the scatterplot can get illegible for high volume datasets (this set has about 49 thousand rows). In that case, we have to return to the box-and-whisker plot alone.
</p>
<p>Beyond exploratory analysis, we also want plots to evaluate the models that we fit. Win-Vector&#8217;s bread-and-butter recently has been logistic regression, so we will start with some visualizations for evaluating binary logistic regression models. We&#8217;ll use the heart disease dataset that Hastie et al. used in the <em>Elements of Statistical Learning</em>.</p>
<pre><code>
path = "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"
saheart = read.table(path, sep=",", header=TRUE, row.names=1)
fmla = "chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity"
model = glm(fmla, data=saheart, family=binomial(link="logit"),
             na.action=na.exclude)
</code></pre>
<p>We will make a data frame of <em>chd</em> (the true response, coronary heart disease), and the score from the model.</p>
<pre><code>
dframe = data.frame(chd=as.factor(saheart$chd),
                    prediction=predict(model, type="response"))
</code></pre>
<p>The standard diagnostic plot for logistic models is the ROC curve, which is fine, but personally, I don&#8217;t get a visceral feel for the model from looking at the ROC. Also, if you are interested in setting a score threshold on the model for classification purposes, the ROC adds an additional level of indirection, since it essentially integrates the score away. I used to plot the distribution of score (prediction) versus true response, like so:</p>
<pre><code>
ggplot(dframe, aes(x=prediction, colour=chd)) + geom_density()
</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot01.png" alt="Rplot01" border="0"/></p>
<p>This visualization tells me whether or not the model scores actually separate the response &#8212; in this case, the model identifies negative cases (no coronary heart disease) better than positive cases. The graph is hard to explain to a non-technical audience, and it has the disadvantage that both distributions are separately normalized to have unit area, so you get no sense of the relative proportion of positive and negative cases (in this case, about 35% of the population have coronary heart disease). </p>
<p>Here&#8217;s an alternate graph:</p>
<pre><code>
ggplot(dframe, aes(x=prediction, fill=chd)) +
               geom_histogram(position="identity", binwidth=0.05, alpha=0.5)
</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot02.png" alt="Rplot02" border="0" /></p>
<p>These are two semi-transparent histograms; the blue histogram for <code>chd=1</code> is &#8220;in front&#8221; of the red histogram. Because they are histograms, rather than density plots, we can more clearly see the relative distribution of positive to negative cases, and we have a better sense of how well (or not) the model separates the positive cases from the negative ones. Clearly, for most score thresholds, the model will have a fairly high false positive rate. I use this visualization all the time, but it is also fairly hard to explain, the transparency in particular.</p>
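<p>To put a number on the false positive rate at a given threshold, one can tabulate thresholded scores against the true response. The sketch below runs on synthetic scores (not the <code>SAheart</code> fit), and the threshold of 0.5 is an arbitrary choice; with the real model you would substitute <code>dframe$prediction</code> and <code>dframe$chd</code>.</p>
<pre><code>
# Synthetic labels and scores standing in for the model output.
set.seed(42)
truth = rbinom(400, size=1, prob=0.35)
score = plogis(rnorm(400, mean=ifelse(truth == 1, 0.5, -1)))

# Classify at a hypothetical threshold and tabulate against the truth.
threshold = 0.5
predicted = factor(as.integer(score >= threshold), levels=c(0, 1))
confusion = table(truth=factor(truth, levels=c(0, 1)), predicted=predicted)
print(confusion)

# False positive rate: fraction of true negatives scored at or above threshold.
false_positive_rate = confusion["0", "1"] / sum(confusion["0", ])
# True positive rate: fraction of true positives scored at or above threshold.
true_positive_rate = confusion["1", "1"] / sum(confusion["1", ])
print(c(fpr=false_positive_rate, tpr=true_positive_rate))
</code></pre>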
<p>We can also use our friend the box-and-whisker scatterplot.</p>
<pre><code>
ggplot(dframe, aes(x=chd, y=prediction)) +
               geom_point(position="jitter", alpha=0.2) +
               geom_boxplot(outlier.size=0, alpha=0.5)

</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot03.png" alt="Rplot03" border="0" /></p>
<p>The median score for the coronary heart disease cases is pulled away from the median score of the healthy subjects, but the central 50% of the two distributions still overlap. </p>
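<p>For reference, the ROC curve mentioned earlier can be computed by hand from scores and labels, which makes it plain why the threshold disappears from the picture. This is a sketch on synthetic scores (not the <code>SAheart</code> fit), using only base <code>R</code>.</p>
<pre><code>
# Synthetic labels and scores standing in for the model output.
set.seed(42)
truth = rbinom(400, size=1, prob=0.35)
score = plogis(rnorm(400, mean=ifelse(truth == 1, 0.5, -1)))

# Sweep every observed score as a threshold, from high to low.
thresholds = sort(unique(score), decreasing=TRUE)
tpr = sapply(thresholds, function(th) mean(score[truth == 1] >= th))
fpr = sapply(thresholds, function(th) mean(score[truth == 0] >= th))

# Area under the curve by the trapezoid rule, starting from (0, 0).
x = c(0, fpr)
y = c(0, tpr)
auc = sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
print(auc)

# The curve is only (fpr, tpr) pairs; the thresholds that produced them
# are not visible unless annotated separately.
plot(fpr, tpr, type="l",
     xlab="false positive rate", ylab="true positive rate")
abline(0, 1, lty=2)
</code></pre>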
<p>Finally, let&#8217;s look at visualizations for linear regression. We&#8217;ll use the <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data">prostate cancer data</a> from <em>Elements of Statistical Learning</em>.</p>
<pre><code>
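# load the data first (assumes the whitespace-delimited file linked above)
prostate.data = read.table("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data", header=TRUE)
fmla = "lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45"
model = lm(fmla, data=prostate.data)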
</code></pre>
<p>We can just <code>plot(model)</code> for some diagnostic graphs:</p>
<pre><code>
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(model)
</code></pre>
<p>
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot04.png" alt="Rplot04" border="0" /></p>
<p>These diagnostics are useful to determine whether or not a linear model is suitable, and to identify outliers; but again, I personally don't get a visceral feel for the model. I prefer to directly plot prediction against true response:</p>
<pre><code>
dframe = data.frame(lpsa=prostate.data$lpsa, prediction=predict(model))

title = sprintf("Prostate Cancer model\n R-squared = %1.3f",
                summary(model)$r.squared)
ggplot(dframe, aes(x=lpsa, y=prediction)) +
               geom_point(alpha=0.2) +
               geom_line(aes(y=lpsa), colour="blue") +
               ggtitle(title)
</code></pre>
<p>
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot05.png" alt="Rplot05" border="0" /></p>
<p>This graph gives you the same information as the Residuals vs. Fitted plot, and the Q-Q plot -- in particular, whether there is systematic over- or under-prediction in specific ranges of the data. It will expose outliers, and it is intuitive to explain when presenting your results. Furthermore, it can be used to evaluate other models that predict a continuous response, such as regression trees or polynomial fits. </p>
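<p>The &#8220;systematic over- or under-prediction in specific ranges&#8221; can also be summarized numerically by binning on the true response and averaging the error in each bin. The sketch below uses synthetic data (not the prostate data) where the true relation is deliberately curved and the model is linear, so the pattern is easy to see.</p>
<pre><code>
# Synthetic data: the truth is quadratic, the fitted model is linear.
set.seed(7)
x = runif(300, min=0, max=4)
y = x^2 + rnorm(300, sd=0.5)
model = lm(y ~ x)
dframe = data.frame(y=y, prediction=predict(model))

# Bin by true response (quintiles) and average prediction minus truth.
breaks = quantile(dframe$y, probs=seq(0, 1, 0.2))
dframe$bin = cut(dframe$y, breaks=breaks, include.lowest=TRUE)
bias = tapply(dframe$prediction - dframe$y, dframe$bin, mean)
print(round(bias, 2))
</code></pre>
<p>Negative values mean the model under-predicts in that range and positive values mean it over-predicts. In this synthetic example the linear fit under-predicts at both ends and over-predicts in the middle, which is the same bowed shape you would see in the prediction-versus-truth scatterplot.</p>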
<p>Which graphs do you find especially useful for your day-to-day work?</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/12/my-favorite-graphs/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>An Appreciation of Locality Sensitive Hashing</title>
		<link>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=an-appreciation-of-locality-sensitive-hashing</link>
		<comments>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 16:41:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Locality Sensitive Hashing]]></category>
		<category><![CDATA[Nearest Neighbor]]></category>
		<category><![CDATA[Theorist]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1848</guid>
		<description><![CDATA[We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Locality sensitive hashing is awe inspiring in its originality, simplicity, beauty and effectiveness. In addition locality sensitive hashing is a remarkable technique as it works even when drastically abridged and [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques.<span id="more-1848"></span> Locality sensitive hashing is awe-inspiring in its originality, simplicity, beauty and effectiveness.  In addition, locality sensitive hashing is a remarkable technique as it works even when drastically abridged and simplified.  In this <a target="_blank" href="http://www.win-vector.com/dfiles/LocalitySensitiveHashing.pdf">paper (link to pdf)</a> we give a description of conditions where the technique works and a heuristic argument why it works (using only elementary math).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>&#8220;The Mythical Man Month&#8221; is still a good read</title>
		<link>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-mythical-man-month-is-still-a-good-read</link>
		<comments>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/#comments</comments>
		<pubDate>Sun, 23 Oct 2011 18:57:52 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Computers]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Architects]]></category>
		<category><![CDATA[Mythical Man Month]]></category>
		<category><![CDATA[SAGE]]></category>
		<category><![CDATA[WIMP]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1834</guid>
		<description><![CDATA[Re-read Fred Brooks&#8217; &#8220;The Mythical Man Month&#8221; over vacation.  Book remains insightful about computer science and project management. My spin on some points: System architects once were the people who said &#8220;no&#8221; to features to maintain design consistency and coherency. Now architects are the people who buy and bring in external frameworks and technologies (killing any [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Re-read Fred Brooks&#8217; &#8220;The Mythical Man Month&#8221; over vacation.  Book remains insightful about computer science and project management.<span id="more-1834"></span>My spin on some points:</p>
<ul>
<li>
System architects once were the people who said &#8220;no&#8221; to features to maintain design consistency and coherency.  Now architects are the people who buy and bring in external frameworks and technologies (killing any chance of consistency or coherency).  Kind of like the Fahrenheit 451 quote &#8220;I remember firemen used to fight fires.&#8221;
</li>
<li>
By far the thing that aged the worst was the reverence for the WIMP (windows, icons, menus, pointing) paradigm.  At this point I think we can argue that WIMP codified a lot of provably bad decisions: desktops, icons, menus and mouse out of visual field.  Maybe some of the ideas prior to WIMP (like SAGE&#8217;s light-pens) or after WIMP (application launcher noun-verb theories like Quicksilver, search, touch pads, full screen apps, versioning and not forcing the user to adapt to the file storage abstraction) are actually much more fundamental.  I think we all were seduced by the 1968 Engelbart demo but forget that the Semi Automated Ground Environment was a production deployed direct (light pen) multi user information sharing point and click system since 1959.</p>
<p><center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/IMG_0064.jpg" alt="SAGE station" title="IMG_0064.JPG" border="0" width="600" height="450" /></p>
<p>SAGE station, Computer History Museum- Mountain View, CA</p>
<p></center>
</li>
<li>
Most everything else ages very well.  The discussions of pain of having to work &#8220;out of core&#8221; remain relevant as this is what we now call &#8220;big data&#8221; (though in Brooks&#8217; time this pain extends to documentation, source code and binaries all of which are too big to hold in memory or even in machine accessible format in the time of the IBM System/360).  </p>
<p>Though in the old days- &#8220;out of core&#8221; meant punched cards, punched tape, magnetic tape or very slow hard disks (which were a new luxury for the period Brooks writes about).<br />
<center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/IMG_0062.jpg" alt="IMG 0062" title="IMG_0062.JPG" border="0" width="450" height="600" /></p>
<p>SDS 920 with built in tape-drive, Computer History Museum- Mountain View, CA</p>
<p></center>
</li>
<li>
Linkers were among the biggest problems in the 1960s and remain so now (though we now call it late binding, jars, shared libraries and APIs).  At one point Brooks throws up his hands and says that it would be faster to just re-compile everything than to deal with some relocating linkers.
</li>
<li>
Brooks definitely advocates and anticipates things like developer wikis (though he had to use microfiche as the computers of his day didn&#8217;t have enough storage to manage their own documentation).
</li>
<li>
&#8220;Literate Programming&#8221; is clearly anticipated.
</li>
<li>
Version control procedures are definitely written about, but Brooks seems not to anticipate version control software.
</li>
</ul>
<p>Overall: very well written and still interesting and relevant.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Kernel Methods and Support Vector Machines de-Mystified</title>
		<link>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=kernel-methods-and-support-vector-machines-de-mystified</link>
		<comments>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/#comments</comments>
		<pubDate>Sat, 08 Oct 2011 00:17:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Kernel Methods]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Naive Bayes]]></category>
		<category><![CDATA[Support Vector Machines]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1804</guid>
		<description><![CDATA[We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Goals [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/' rel='bookmark' title='Do your tools support production or complexity?'>Do your tools support production or complexity?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We give a simple explanation of the interrelated machine learning techniques called <a target="_blank" href="http://en.wikipedia.org/wiki/Kernel_methods">kernel methods</a> and <a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>.  We hope to characterize and de-mystify some of the properties of these methods.  To do this we work some examples and draw a few analogies.  The familiar, no matter how wonderful, is not perceived as mystical.<span id="more-1804"></span><br />
<h2>Goals of this writeup</h2>
<ol>
<li>De-mystify  kernel methods and support vector machines<br />
<blockquote><p>
Kernel methods and support vector machines have taken mythological proportions in the machine learning imagination. Partly this is because a number of good ideas are overly associated with them: support/non-support training datums, weighting training data, discounting data, regularization, margin and the bounding of generalization error.  My issue is that these are all important enough ideas to stand on their own and are often seen in simpler settings.  The observations that inform my view are as follows:</p>
<ul>
<li>Kernel methods and support vector machines are in fact two good ideas.  Each is important even without the other: kernels are useful all over and support vector machines would be  useful even if we restricted to the trivial identity kernel.</li>
<li>Small scale &#8220;kernel tricks&#8221; are not that different than the classic technique of adding &#8220;interaction variables.&#8221;  Kernels let you escape from the limits of &#8220;linear hypotheses&#8221; (really by moving to a bigger space where things are again linear but look curved from the point of view of your smaller original space).  We demonstrate the linear methods and &#8220;primal kernel tricks&#8221; in <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a>.</li>
<li>Support/non-support is the central issue of nearest neighbor classifiers.</li>
<li>Weighing training data is most famously shared by logistic regression and boosting.</li>
<li>Re-weighting of data by <i>smoothing kernels</i> (a different but related use of the word &#8220;kernel&#8221;) is central to non-parametric statistics (kernel smoothers and splines).</li>
<li>Regularization is of supreme importance in modeling in general.</li>
<li>Most practitioners get tired of &#8220;kernel shopping&#8221; and fall back to the identity, cosine or radial/Gaussian kernels.</li>
</ul>
</blockquote>
</li>
<li>Show concrete examples of what are and what are not kernels.<br />
Few sources give enough theorems to think about kernels abstractly and fewer still work concrete examples.
</li>
<li>
Use as little math as possible (which, unfortunately, turns out to be quite a bit).  We will discuss encodings and stopping conditions (important to understand what is going on) but avoid explaining the optimizers (the most beautiful part of support vector machines, but also the part that is available pre-packaged in libraries).
</li>
<li>
Call out common magical thinking and unreasonable expectations associated with kernel methods and support vector machines.  This should help the reader be in a better position to &#8220;defend their doubts&#8221; regarding machine learning promises.
</li>
<li>
Try to place all techniques in a wider context (if it is usable only one place it is a trick, if it is usable multiple places it is a technique).
</li>
<li>Discuss margin and its impact on generalization error.<br />
<blockquote><p>
Generalization error is an effect of &#8220;over fitting&#8221; where a model has learned things that are true about the training examples that do not hold for the overall truth or concept we are trying to learn (i.e. don&#8217;t generalize).   Generalization error is the excess error rate we observe when scoring new examples versus the error-rate we saw in learning the training data.  Margin is in fact a <em>posterior</em> observation.  That is: margin is observed after the training data is seen, not known before data is seen (like, for example, <a target="_blank" href="http://en.wikipedia.org/wiki/VC_dimension">VC dimension</a>).   Margin is useful as it bounds generalization error but it is not the <em>prior</em> bound it is often portrayed as.  So we assert margin estimates are not much more special than simple cross-validation estimates which can also be performed once we have data available.
</p></blockquote>
</li>
</ol>
<p>The goals are not to try to indict or try to cut down kernel methods or support vector machines, but just to dump some of the associated baggage so they can be used fluidly and without anxiety.  </p>
<h2>Example Problem</h2>
<p>Consider the following simple (caricature) machine learning problem:  we are given a number of points labeled as circles and triangles and we want to, given a new point, predict the label.  In our example the only input will be two numbers: x and y.  This example is simple enough that we can depict the entire situation in Figure 1 below:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/TruthAndData.png" alt="TruthAndData.png" border="0" width="480" height="480" /></p>
<p>Figure 1: Truth and Training Data</p>
<p></center></p>
<p>We are assuming (for simplicity of exposition) that we have the incredible luck that the label is indeed a deterministic function of x and y.  We portray this &#8220;ground truth&#8221; as the blue parabola: every point found in this region will be considered to be labeled as &#8220;circle&#8221; and every other point (i.e. the red region) will be labeled as &#8220;triangle.&#8221;  We have overlaid 20 example points (with appropriate shape labels).  These points will be our training data.  We will try to learn from the observed training data a good approximation of the unseen blue and red regions.  This process is called learning or training and the ability to correctly predict the label of new points is called generalization.  To make sure our concept is learnable we are going to further guarantee a moat (or margin): the distribution we are picking training and test examples from will never pick a point from the shaded region between the blue and black.  That is, we will never be asked about an example where x*x is very near y.</p>
<p>To promote visual thinking we are going to avoid formulas until the section titled &#8220;precise functions used&#8221; (where we will list the formulas used for each graph).</p>
<h2>Nearest Neighbor Solution</h2>
<p>An interesting model in these days of big data and fast machines is the nearest neighbor model.  Such a model colors each point in space blue or red as it <em>thinks</em> the point should be classified.  In this case each point is colored the color given by the closest known training point.  This induces the type of model seen in Figure 2 whose boundary is piecewise line segments mid-way between training points.  The data points that determine segments of the classification boundary in this way are thought to support the boundary and are called &#8220;support examples&#8221; or &#8220;support vectors.&#8221;  In fact any training point that is not &#8220;supporting&#8221; part of the boundary is &#8220;irrelevant&#8221; (could be removed and we would get the exact same model).  This split between important and non-important training points is considered one of the important observations from  the theory of support vector machines, but we see it even here in the nearest neighbor models.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1NN.png" alt="1NN.png" border="0" width="480" height="480" /></p>
<p>Figure 2: 1 nearest neighbor model</p>
<p></center></p>
<p>Notice that the nearest neighbor model in Figure 2 is &#8220;not half bad.&#8221;  The blue region is much too wide, but near the training data the blue mass is roughly in the correct position.  If we had infinite data and infinite computational resources this sort of model would be hard to beat (as any point we are likely to be tested on would have a lot of very near examples to work from).  We can try to clean up the shape of the model a bit by using a vote of the 2 nearest points to color each point in the plane (see Figure 3) or even the majority of the 3 nearest points (see Figure 4).  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/2NN.png" alt="2NN.png" border="0" width="480" height="480" /></p>
<p>Figure 3: 2 nearest neighbor model</p>
<p></center></p>
<p>The purple region in Figure 3 is a &#8220;region of uncertainty&#8221; where the net vote over the 2 nearest points is zero (they gave inconsistent advice).  We can use this as a hint that predictions taken from this region are less reliable.  As you may notice, we are not getting real improvements.  We should always spend more time getting more features (useful coordinates in addition to x and y) and more data; and spend less time tweaking models.  But if your data and features are fixed all you can work on is your modeling technique (alternately: your modeling technique is all you can prepare before receiving features and data).  So we will continue to discuss features and technique.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/3NN.png" alt="3NN.png" border="0" width="480" height="480" /></p>
<p>Figure 4: 3 nearest neighbor model</p>
<p></center></p>
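<p>As a rough illustration of the voting rules behind Figures 2 through 4, here is a minimal Python sketch.  The six training points and the names <code>knn_vote</code> and <code>classify</code> are made up for this example (they are not the data or code behind the figures); labels are encoded as +1 for circle and -1 for triangle, and a tied vote is reported as the &#8220;region of uncertainty.&#8221;</p>

```python
import math

# Hypothetical toy training set: (x, y, label), +1 for circle, -1 for triangle.
TRAIN = [
    (0.0, 1.0, 1),
    (0.5, 1.5, 1),
    (-0.5, 1.2, 1),
    (1.5, 0.5, -1),
    (-1.5, 0.2, -1),
    (0.0, -1.0, -1),
]


def knn_vote(point, train, k):
    """Return the net vote (sum of labels) of the k training points nearest to point."""
    px, py = point
    by_distance = sorted(train, key=lambda t: math.hypot(t[0] - px, t[1] - py))
    return sum(label for _, _, label in by_distance[:k])


def classify(point, train, k):
    """Turn the net vote into 'circle', 'triangle', or 'uncertain' (a tied vote)."""
    vote = knn_vote(point, train, k)
    if vote == 0:
        return "uncertain"
    return "circle" if vote > 0 else "triangle"


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, classify((1.0, 1.0), TRAIN, k))
```

<p>With an even k the net vote can be exactly zero, which is the purple region of Figure 3; an odd k always produces a decision.</p>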
<p>Nearest neighbor classifiers are close to optimal in the sense that with an infinite amount of data the 1-nearest neighbor classifier has an error rate that is at most twice the Bayes error rate (the Bayes error rate being the ideal error observed on identical repetitions, or the theoretical best error rate) and for large k the k-nearest neighbor method approaches the Bayes error rate itself (see for example <a target="_blank" href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">k-nearest neighbor algorithm</a> ).</p>
<p>However the effectiveness of the nearest neighbor classifiers is coming from the very low dimension (2) of our input variables x and y.  If we were attempting to predict from n variables we might need an amount of data exponential in n to get a useful nearest neighbor classifier.  This is an <a target="_blank" href="http://en.wikipedia.org/wiki/Efficiency_(statistics)">inefficient</a> use of the data; what we want is to use all of the data we have effectively.  To do this we further assume there is some relational reason that the position of the point in the plane determines the shape-label (i.e. that we are learning, generalizing, interpolating and extrapolating- not just memorizing).  We set our ambitions above memorizing and move towards functional modeling.  We start thinking in terms of change: how the label density of examples changes as we move is a good clue as to how it will continue to change.  We can do this either in a parametric or primal formulation (where we attempt to directly infer parameters of an assumed functional form as in regression or logistic regression) or in a non-parametric or dual form (where we attempt to learn relations between points and the training data instead of parameters).  We have discussed this before ( <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a> ) and expanded on primal methods ( <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">Learn Logistic Regression (and beyond)</a>, <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/">The Simpler Derivation of Logistic Regression</a> and <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/">The equivalence of logistic regression and maximum entropy models</a>).</p>
<h2>Functional Solution</h2>
<p>Our next idea is: can we use our training data to tease out a functional form for the relation between x,y and shape label?  The first function we will try is chosen to be similar to the nearest neighbor model we just demonstrated.  For this form we say each training point is the center of a <a target="_blank" href="http://en.wikipedia.org/wiki/Gaussian_function">Gaussian</a> hump (say up for blue/circle and down for red/triangle) which we will call a &#8220;discount function&#8221; (or informally a hump).  A &#8220;Gaussian&#8221; is just a function that falls off exponentially in the square of Euclidean distance from a central point (we can see the simple form of such functions in the &#8220;precise functions used&#8221; section).  The contour lines (or levels as seen looking down from above) of one such hump or discount function are shown in figure 5 (all graphs from here on in will be looking down at a height represented by color and contour lines, much like topographic maps).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3ExampleConcept.png" alt="GaussianKernel3ExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 5: Narrow Gaussian example concept</p>
<p></center></p>
<p>This particular discount function has its maximum (or peak) centered on a data point and then rapidly falls as we move away from the training point.  The only property we will use in this section is that we can evaluate the discount easily (so we don&#8217;t at this time need or make use of any restrictive properties like integrability, positive semi-definiteness or even anything like bounded level sets).<br />
We could make our functional model just the sum of all of these up and down humps over all of the training data.  Each point gets a hump of the same height and same radius, pointing up for blue/circle and down for red/triangle.  The net coloring of this sum function is shown in figure 6.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3KernelSum1.png" alt="GaussianKernel3KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 6: Narrow Gaussian sum model</p>
<p></center></p>
<p>The idea is that this would imitate the nearest neighbor model we have already seen.  This is because even though we are summing over all data the closest training examples should usually dominate (since the Gaussian humps we picked fall to zero as we get further from their centers).   The correspondence to nearest neighbor increases if we tighten the radius of our functions (searching for the right radius is a very important statistical problem called &#8220;bandwidth estimation&#8221;).  Figure 7, for example, shows the sum over much narrower Gaussians.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel20KernelSum1.png" alt="GaussianKernel20KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 7: Very narrow Gaussian sum model</p>
<p></center></p>
<p>Figure 7 looks even more like the nearest neighbor solution.  In fact you could think of the discount-weighted sum over all of the data as the natural model (as it is easier to reason about) and the nearest neighbors as a more efficient approximation (summing only over near points).  We also have seen that we can alter the smoothness of our model (which has consequences on model generalization) by changing the steepness of our discount functions.  We can also build intermediate models (such as building a nearest neighbor but using discount weighted sums instead of uniform voting in the neighborhoods).   We haven&#8217;t yet greatly improved on nearest neighbor, but we have identified a couple of obvious avenues for further improvement: mess with the bandwidths (the obvious idea to try and deal with near/far scale) or re-weight the data (as one of the lessons of the nearest neighbor model is that points that are not near a boundary are less important).   </p>
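<p>To make the &#8220;sum of humps&#8221; idea concrete, here is a small Python sketch.  The four training points and the parameter name <code>sharpness</code> are assumptions made for this illustration (they are not the settings used for the figures).  The sketch sums one signed Gaussian hump per training point and compares the sign of the sum with the label of the single nearest training point.</p>

```python
import math

# Hypothetical training set: one circle (+1) close to the origin and
# three triangles (-1) a bit further away.
TRAIN = [
    (0.5, 0.0, 1),
    (-1.0, 0.0, -1),
    (0.0, 1.0, -1),
    (0.0, -1.0, -1),
]


def gaussian_hump(center, z, sharpness):
    """exp(-sharpness * squared Euclidean distance); larger sharpness is a narrower hump."""
    d2 = (center[0] - z[0]) ** 2 + (center[1] - z[1]) ** 2
    return math.exp(-sharpness * d2)


def hump_sum(z, train, sharpness):
    """Un-weighted signed sum of humps: up for circles, down for triangles."""
    return sum(label * gaussian_hump((x, y), z, sharpness) for x, y, label in train)


def nearest_label(z, train):
    """Label of the single closest training point (the 1 nearest neighbor answer)."""
    return min(train, key=lambda t: (t[0] - z[0]) ** 2 + (t[1] - z[1]) ** 2)[2]


if __name__ == "__main__":
    z = (0.0, 0.0)
    print("nearest neighbor label:", nearest_label(z, TRAIN))
    for sharpness in (1.0, 3.0, 20.0):
        print(sharpness, hump_sum(z, TRAIN, sharpness))
```

<p>At the origin the wide humps (sharpness 1) let the three more distant triangles out-vote the one nearby circle, while the narrower humps agree with the nearest neighbor.  This is the sense in which tightening the radius moves the sum model towards the nearest neighbor model.</p>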
<h2>Bandwidth Solution</h2>
<p>As a digression let&#8217;s play with bandwidth a bit first.  Suppose simultaneously for each training datum we picked an ideal bandwidth or steepness of the Gaussian.  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/BandwidthModel.png" alt="BandwidthModel.png" border="0" width="480" height="480" /></p>
<p>Figure 8: Simultaneous bandwidth model</p>
<p></center> </p>
<p>This simultaneous bandwidth model depicted in figure 8 was determined by picking a simultaneous assignment of Gaussian widths (each training datum gets its own width) that maximizes the log-sum of the category signed model function on the training data (much like the logistic regression maximum likelihood decoding).  This differs from typical bandwidth selection problems where all points are given a single common &#8220;best bandwidth.&#8221;  We instead picked a different bandwidth for each point (bandwidths picked to maximize how much of the sum at each training example comes from the correct category).  The blue region is reasonable, but does not match the shape of the true concept.   Partly this is because there has not been enough training data to have learned a lot about the true shape (for example there is no empirical evidence for circle/blue in the top-left quadrant).  The other reason is a bias inherent to this type of model.  If one of the bandwidth functions is wider than all of the others (which is almost certain to occur) then it falls slower than all of the others and for points very far away from the center of the training data this one function is most of what remains.  Or equivalently: due to the nature of this modeling technique one of the concepts learned is going to be the union of bounded islands and the other concept will grab all points sufficiently far away from the center of the training data.  This is a strong (and undesirable) bias, but it is a bias also shared by support vector machines with Gaussian kernels (though support vector machines get it from their so-called &#8220;dc-term&#8221; b not being zero; this will be discussed later). Our bandwidth correction was successful, but frankly the optimization problem we solved to estimate the optimal bandwidths was nasty and would not be something we would advise for a large amount of data.</p>
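<p>The &#8220;widest hump wins far away&#8221; bias is easy to see numerically.  The following Python sketch uses two hypothetical training points with hand-picked per-point sharpness values (no bandwidth optimization is attempted here, and the name <code>sharpness</code> is our own); it only illustrates the far-field behavior of a sum of Gaussians with different widths.</p>

```python
import math

# Hypothetical model: (x, y, label, sharpness).  The circle (+1) has a narrow
# hump and the triangle (-1) has the widest hump.
HUMPS = [
    (0.0, 0.0, 1, 4.0),
    (1.0, 0.0, -1, 0.5),
]


def model(z):
    """Signed sum of Gaussian humps where each training point has its own width."""
    total = 0.0
    for x, y, label, sharpness in HUMPS:
        d2 = (x - z[0]) ** 2 + (y - z[1]) ** 2
        total += label * math.exp(-sharpness * d2)
    return total


if __name__ == "__main__":
    # Near the circle the circle's hump dominates: a bounded "island".
    print(model((0.0, 0.0)))
    # Far away the widest hump dominates, even at points closer to the circle.
    print(model((-10.0, 0.0)))
    print(model((0.0, 10.0)))
```

<p>The point (-10, 0) is closer to the circle than to the triangle, yet the sum is negative there: the narrow hump has decayed far faster than the wide one, so the class owning the widest hump claims everything sufficiently far from the data.</p>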
<h2>Data weighting solution (support vector machine)</h2>
<p>So instead of messing with the bandwidths let us consider re-weighting the Gaussians in our sum.  We will leave the shape of each Gaussian the same but allow each individual training datum to have a unique weight or importance.  Perhaps by picking the best weights we can get a better model.  By &#8220;best&#8221; we will mean &#8220;best margin&#8221; (to be defined later) because with &#8220;best margin&#8221; as our objective the optimization problem of solving for the best data weights has a particularly beautiful form that can be reliably solved at great scale.  This is called a &#8220;support vector model&#8221; (or support vector machine) and we will describe these ideas in detail later.  But first let&#8217;s see the result.  Figure 9 shows the original narrow Gaussians re-summed according to the weights picked by the support vector method (instead of all being 1 as in figure 6).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3SVM1.png" alt="GaussianKernel3SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 9: Narrow Gaussian support vector model</p>
<p></center></p>
<p>Figure 9 is again an improvement.  The figure is formed by taking a signed sum of our discount functions centered at our training points and weighted by the &#8220;support vector machine weights.&#8221;  The signs are picked with one class encoded as +1 and the other as -1.  The colors red and blue are picked if the sum is above or below a constant &#8220;b&#8221; called &#8220;the dc-term&#8221; (part of the support vector solution).  The solution again looks okay: all the training points are in the correct color region (and you can&#8217;t hope to interpolate let alone extrapolate if you can&#8217;t even reproduce your training concept).  Also, as with the bandwidth model, contours are set sensibly (steep/dense where categories are near each other, shallow/sparse where things are safe).   We see the same inductive bias: one of the concepts is the union of bounded islands (this will always be the case unless the model picks b=0).  This bias is not obvious from the kernel choice, as it is hidden in the dc-term of the support vector fitting equations (it is not part of the kernel, but can be defeated by choosing unbounded kernels).  An interesting empirical fact about support vector models is they tend to work well with a larger bandwidth.    For example if we use a wider Gaussian we get an even more convincing model (see Figure 10).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel025SVM.png" alt="GaussianKernel025SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 10: Wide Gaussian support vector model</p>
<p></center></p>
<p>In figure 10 we have also called out one of the important features of support vector machines.   In the literature all datums are called &#8220;vectors&#8221; (as they are represented in coordinates) and a subset of these are called the support vectors.  In figure 10 we have drawn the 4 training examples that turn out to be support vectors as large shapes and all other training examples as small shapes.  The support vectors are the datums with non-negligible weights.  The learned model is a function of the support vectors only.  So after fitting we can discard the other 16 training points. This is similar to the fact that nearest neighbor also only needs the training examples that are supporting its model boundary.</p>
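<p>The claim that the non-support points can be discarded after fitting can be checked mechanically.  In the Python sketch below the points and the weights are invented for illustration (the weights are <em>not</em> the output of a real support vector solver, and <code>keep_support</code> is a hypothetical helper name).  We only show that dropping the zero-weight terms leaves the weighted sum unchanged.</p>

```python
import math

# Hypothetical training points (x, y, label) and hypothetical support weights.
# The weights are made up; a real solver would produce them.
POINTS = [
    (0.0, 1.0, 1),
    (0.5, 2.0, 1),
    (-0.4, 0.4, 1),
    (1.5, 0.5, -1),
    (-2.0, 0.0, -1),
    (0.0, -1.0, -1),
]
WEIGHTS = [0.0, 0.0, 1.3, 0.7, 0.0, 0.6]


def gaussian_kernel(u, z, sharpness=0.25):
    """Gaussian in squared Euclidean distance; sharpness is an illustrative name."""
    d2 = (u[0] - z[0]) ** 2 + (u[1] - z[1]) ** 2
    return math.exp(-sharpness * d2)


def weighted_sum(z, points, weights):
    """Signed, weighted sum of kernel evaluations against the training points."""
    return sum(
        a * label * gaussian_kernel((x, y), z)
        for (x, y, label), a in zip(points, weights)
    )


def keep_support(points, weights, tol=1e-9):
    """Keep only the points whose weight is non-negligible (the support vectors)."""
    kept = [(p, a) for p, a in zip(points, weights) if a > tol]
    return [p for p, _ in kept], [a for _, a in kept]


SUPPORT_POINTS, SUPPORT_WEIGHTS = keep_support(POINTS, WEIGHTS)

if __name__ == "__main__":
    z = (0.3, 0.7)
    print(len(SUPPORT_POINTS), "support vectors")
    print(weighted_sum(z, POINTS, WEIGHTS))
    print(weighted_sum(z, SUPPORT_POINTS, SUPPORT_WEIGHTS))
```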
<p>Notice that at this point the support vector models are not &#8220;magically&#8221; better, they in fact look less like the truth we are trying to learn than either the nearest neighbor models or the un-weighted sum models.  To fix this we need to do some &#8220;kernel shopping&#8221; or find functions that better respect what we are trying to model.  With the right functions support vector machines can do a very good job at learning the concept (but knowing &#8220;the right functions&#8221; is a huge hint as to what the concept is).  There is nothing wrong with using a hint, but we do have to have a method to produce the hint or have a reasonable number of hints to choose from. </p>
<p>The &#8220;sum over everything with the same weight&#8221; solutions from the earlier section are very similar to Naive Bayes which also sums over all matching features with no weight adjustment.  The support vector &#8220;use the same model but pick better weights&#8221; stands over these sum models in very much the same way Logistic Regression stands over Naive Bayes.  In fact the optimization problems are very related.  If we consider weights over features or coordinates as &#8220;primal&#8221; and weights over data items as &#8220;dual&#8221; we can informally say something like: support vector machines optimize by working over dual variables and inspecting for primal weights, and logistic regression works over primal variables and inspects for data weights.   But this is a bit vague.</p>
<p>At this point we need to get a bit more explicit and precise.</p>
<h2>Precise functions used</h2>
<p>In general we will write our training data as a sequence of n-vectors.  Our points will be named u(1) &#8230; u(m).   For each i = 1..m we also know which category the training point is labeled with.  We will encode this as y(1) &#8230; y(m) where each y() is either +1 or -1 depending on the training label.  So the example we have been working has n=2 and m=20.  Let z represent a n-vector we wish to classify.</p>
<p>Figure 6 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" alt="E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 7 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" alt="9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 8 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" alt="C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" border="0"  height="50" /><br />
</center><br />
where the w(i) are non-negative numbers picked so that<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/54765538-DADD-4353-A59A-EF3B6C42743E.jpg" alt="54765538-DADD-4353-A59A-EF3B6C42743E.jpg" border="0"  height="50" /><br />
</center><br />
is large.</p>
<p>Figure 9 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" alt="CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so &#8220;the margin&#8221; (discussed later) is large.</p>
<p>Figure 10 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" alt="1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so the margin is large.</p>
<p>All of this repetition is to emphasize the commonality of the models.  They are all of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" alt="1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) are non-negative numbers called the &#8220;support weights&#8221; and k(,) is a function mapping pairs of vectors to numbers and is called a &#8220;kernel.&#8221;  Unfortunately there are many incompatible uses of the word kernel in mathematics and statistics.  Here &#8220;kernel&#8221; is being used in the sense of <a target="_blank" href="http://en.wikipedia.org/wiki/Positive-definite_kernel">positive semi-definiteness</a>, which in particular implies k(u,u) &ge; 0 for all u.   Note that the support vector machine does not use the sign of f(z) as its category decision; instead it uses which side of a constant b the value f(z) falls on.  A consequence is: for support vector machines if b is non-zero (as it almost surely will be) and the kernels all go to zero as we approach infinity fast enough (as they are designed to do) then exactly one of the learned classes is infinite and the other is a union of islands (regardless of whether this was true for the training data).</p>
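<p>The shared form can be sketched in a few lines of Python.  This is only an illustration under assumptions: we take the common form to be f(z) = sum over i of a(i) y(i) k(u(i),z) (our reading of the description, since the a(i) are non-negative and the labels carry the sign), and the two training points, the weights and the bandwidth c are made up for the example.</p>

```python
import math

def gaussian_kernel(u, v, c=1.0):
    """Gaussian (radial) kernel exp(-c * ||u - v||^2); c is an assumed bandwidth."""
    return math.exp(-c * sum((a - b) ** 2 for a, b in zip(u, v)))

def kernel_sum(z, points, labels, weights, kernel):
    """Assumed common form: f(z) = sum_i a(i) * y(i) * k(u(i), z)."""
    return sum(a * y * kernel(u, z) for u, y, a in zip(points, labels, weights))

def classify(z, points, labels, weights, kernel, b=0.0):
    """Return +1 or -1 according to which side of the constant b the value f(z) falls on."""
    return 1 if kernel_sum(z, points, labels, weights, kernel) - b >= 0 else -1

# Hypothetical toy data: one positive and one negative training point.
points = [(0.0, 0.0), (2.0, 2.0)]
labels = [+1, -1]
weights = [1.0, 1.0]  # uniform weights give the "sum over everything" model

near_first = classify((0.1, 0.0), points, labels, weights, gaussian_kernel)
near_second = classify((1.9, 2.0), points, labels, weights, gaussian_kernel)
```

<p>With b = 0 and uniform weights this is the plain kernel sum model; a support vector machine would keep the same functions and only change the weights and b.</p>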
<h2>About Kernels</h2>
<p>A lot of awe and mysticism is associated with kernels, but we think they are not that big a deal.  Kernels are a combination of two good ideas; they have one important property and are subject to one major limitation.  It is also unfortunate that support vector machines and kernels are tied so tightly together.   Kernels are the idea of summing functions that imitate similarity (induce a positive-definite encoding of nearness) and support vector machines are the idea of solving a clever dual problem to maximize a quantity called margin.  Each is a useful tool even without the other.</p>
<p>The two good ideas are related and unfortunately treated as if they are the same.  The two good ideas we promised are:</p>
<ol>
<li>The &#8220;kernel trick.&#8221;  Adding new features/variables that are functions of your other input variables can change linearly inseparable problems into linearly separable problems.  For example if our points were encoded not as u(i) = (x(i),y(i)) but as u(i) = (x(i),y(i),x(i)*x(i),y(i)*y(i),x(i)*y(i)) we could easily find the exact concept ( y(i) &gt; x(i)*x(i) ), which is now a linear concept encoded as the vector (0,1,-1,0,0).</li>
<li>Often you don&#8217;t need the coordinates of u(i).  You are only interested in functions of distances ||u(i)-u(j)||^2 and in many cases you can get at these by inner products and relations like ||u(i)-u(j)||^2 = &lt;u(i),u(i)&gt; + &lt;u(j),u(j)&gt; &#8211; 2&lt;u(i),u(j)&gt; .</li>
</ol>
<p>We will expand on these issues later.</p>
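<p>Both ideas are easy to check numerically.  The following is a small sketch, not part of the original experiments: the sample points are made up, and the helper names <code>expand</code> and <code>dot</code> are ours.</p>

```python
def expand(x, y):
    """Re-encode the point (x, y) as (x, y, x*x, y*y, x*y)."""
    return (x, y, x * x, y * y, x * y)

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def squared_distance_from_inner_products(u, v):
    """||u - v||^2 = <u,u> + <v,v> - 2<u,v>, using inner products only."""
    return dot(u, u) + dot(v, v) - 2.0 * dot(u, v)

# Idea 1: the concept y > x*x is linear in the expanded encoding.
w = (0, 1, -1, 0, 0)  # picks out y - x*x
samples = [(0.5, 1.0), (2.0, 1.0), (-1.0, 3.0), (0.0, -0.1)]  # made-up points
linear_says = [dot(w, expand(x, y)) > 0 for x, y in samples]
concept_says = [y > x * x for x, y in samples]

# Idea 2: squared distance recovered without subtracting coordinates.
u, v = (1.0, 2.0), (4.0, 6.0)
direct = sum((a - b) ** 2 for a, b in zip(u, v))
via_dots = squared_distance_from_inner_products(u, v)
```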
<p>The important property is that kernels look like inner products in a transformed space.  The definition of a kernel is: there exists a magic function phi() such that for all u,v:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" alt="F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" border="0"  height="20" /> .<br />
</center><br />
This means that k(.,.) is behaving like an inner product in some (possibly unknown) space.  The important consequence is the positive semi-definiteness, which implies k(u,u)&ge;0 for all u (and this just follows from the fact about inner products over the real numbers that &lt;z,z&gt;&ge;0 for all z).  This is why optimization problems that use the kernel as their encoding are well formed (such as the optimization problem of maximizing margin which is how support vector machines work).  You can <a target="_blank" href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">&#8220;regularize&#8221;</a> optimization problems with a kernel penalty because it behaves a lot like a norm.  Without the positive semidefinite property all of these optimization problems would be able to &#8220;run to negative infinity&#8221; or use negative terms (which are not possible from a kernel) to hide high error rates.  The limits of the kernel functions (not being able to turn distance penalties into bonuses) help ensure that the result of optimization is actually useful (and not just a flaw in our problem encoding).</p>
<p>And this brings us to the major limitation of kernels.  The phi() transform can be arbitrarily magic except when transforming one vector it doesn&#8217;t know what the other vector is and phi() doesn&#8217;t even know which side of the inner product it is encoding.   That is, kernels are not as powerful as any of the following forms:</p>
<ul>
<li>snooping (knowing the other):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" alt="1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" border="0"  height="20" />
</li>
<li>positional (knowing which part of inner product mapping to):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" alt="D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" border="0" height="20" />
</li>
<li>fully general:<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" alt="845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" border="0" height="20" />
</li>
</ul>
<p>Not everything is a kernel.</p>
<p>Some non-kernels are:</p>
<ol>
<li> k(u,v) = -c for any c>0</li>
<blockquote><p>
A consequence of the fact that no negative number can be written as a sum of squares or as the limit of a sum of squares in the reals.
</p></blockquote>
<li> k(u,v) = ||u-v||<br />
<blockquote><p>
This can be shown as follows.  Suppose k(u,v) is a kernel with k(u,u) = 0 for all u (which is necessary to try and match ||u-v|| as ||u-u|| = 0 for all u).  By the definition<br />
of kernels k(u,u) = &lt; phi(u), phi(u) &gt; for some real vector valued function phi(.).  But, by the properties of the real inner product &lt;.,.&gt;, this means phi(u) is<br />
the zero vector for all u.  So k(u,v) = 0 for all u,v and does not match ||u-v|| for any u,v such that u &ne; v.
</p></blockquote>
</li>
</ol>
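<p>A quick way to catch a non-kernel is to look for a witness: a finite set of points and coefficients c_i for which the sum of c_i c_j k(u_i,u_j) is negative.  For a true kernel that sum equals the squared length of the sum of c_i phi(u_i), so it can never be negative.  The sketch below uses one-dimensional points and hand-picked coefficients; those choices are ours, purely for illustration.</p>

```python
def quadratic_form(kernel, points, coeffs):
    """sum_i sum_j c_i * c_j * k(u_i, u_j); never negative for a true kernel."""
    return sum(ci * cj * kernel(ui, uj)
               for ci, ui in zip(coeffs, points)
               for cj, uj in zip(coeffs, points))

def negative_constant(u, v):
    """k(u,v) = -1, the first non-kernel in the list."""
    return -1.0

def distance(u, v):
    """k(u,v) = ||u - v|| for one-dimensional points, the second non-kernel."""
    return abs(u - v)

def identity_kernel(u, v):
    """k(u,v) = <u,v> for one-dimensional points, a genuine kernel."""
    return u * v

# A single point is enough to expose the negative constant.
witness_constant = quadratic_form(negative_constant, [0.0], [1.0])
# Two points with opposite coefficients expose the distance function.
witness_distance = quadratic_form(distance, [0.0, 1.0], [1.0, -1.0])
# The same test does not fault the identity kernel.
check_identity = quadratic_form(identity_kernel, [0.0, 1.0], [1.0, -1.0])
```

<p>Note that the distance function passes the weaker check k(u,u) &ge; 0; it takes the two-point witness to rule it out.</p>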
<p>There are some obvious kernels:</p>
<ol>
<li> k(u,v) = c for any c &ge; 0 (non-negative constant kernels) </li>
<li> k(u,v) = &lt; u , v &gt;  (the identity kernel) </li>
<li> k(u,v) = f(u) f(v) for any real valued function f(.) </li>
<li> k(u,v) = &lt; f(u) , f(v) &gt; for any real vector valued function f(.) (again, the definition of a kernel)</li>
<li> k(u,v) = transpose(u) B v where B is any symmetric positive semi-definite matrix</li>
</ol>
<p>And there are several subtle ways to build new kernels from old.  If q(.,.) and r(.,.) are kernels then so are:</p>
<ol>
<li> k(u,v) = q(u,v) + r(u,v)</li>
<li> k(u,v) = c q(u,v) for any c &ge; 0 </li>
<li> k(u,v) = q(u,v) r(u,v) </li>
<li> k(u,v) = q(f(u),f(v)) for any real vector valued function f(.)</li>
<li> k(u,v) = lim_{j-> infinity} q_j(u,v) where q_j(u,v) is a sequence of kernels and the limit exists.</li>
<li> k(u,v) = p(q(u,v)) where p(.) is any polynomial with all non-negative coefficients.</li>
<li> k(u,v) = f(q(u,v)) where f(.) is any function with an absolutely convergent Taylor series with all non-negative coefficients.</li>
</ol>
<p>Most of these facts are taken from the excellent book: John Shawe-Taylor and Nello Cristianini&#8217;s <a target="_blank" href="http://www.kernel-methods.net/">&#8220;Kernel Methods for Pattern Analysis&#8221;</a>, Cambridge 2004.  We are allowing<br />
kernels of the form k(u,v) = &lt; phi(u), phi(v) &gt; where phi(.) is mapping into an infinite dimensional vector space (like a series).  Most of these facts can be checked by<br />
imagining how to alter the phi(.) function.  For example to add two kernels just build a larger vector with enough slots for all of the coordinates for the vectors encoding<br />
the phi(.)&#8217;s of the two kernels you are trying to add.  To scale a kernel by c multiply all coordinates of phi() by sqrt(c).  </p>
<p>Multiplying two kernels is the trickiest.  Without loss of generality assume q(u,v) = &lt; f(u) ,f(v) &gt; and r(u,v) = &lt; g(u), g(v) &gt; and f(.) and g(.) are both mapping into the same finite dimensional vector space R^m (any other situation can be simulated or approximated by padding with zeros and/or taking limits).  Imagine a new vector function p(.) that maps into R^{m*m} such that p(z)_{m*(i-1) + j} = f(z)_i g(z)_j .  It is easy to check that k(u,v) = &lt; p(u) , p(v) &gt; is a kernel and k(u,v) = q(u,v) r(u,v). (The reference proof uses tensor notation and the Schur product, but these are big hammers mathematicians use when they don&#8217;t want to mess around with re-encoding indices).</p>
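<p>The index re-encoding can be checked on a small case.  In this sketch the feature maps f(.) and g(.) are hypothetical stand-ins (both mapping into R^2); only the construction of p(.) follows the argument above.</p>

```python
def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def f(z):
    """Hypothetical feature map for q(.,.): the identity on R^2."""
    return (z[0], z[1])

def g(z):
    """Hypothetical feature map for r(.,.): (1, z_0 + z_1)."""
    return (1.0, z[0] + z[1])

def p(z):
    """All pairwise products f(z)_i * g(z)_j, flattened so that (i, j) lands in slot m*(i-1)+j."""
    return tuple(fi * gj for fi in f(z) for gj in g(z))

u, v = (1.0, 2.0), (3.0, -1.0)
product_of_kernels = dot(f(u), f(v)) * dot(g(u), g(v))  # q(u,v) * r(u,v)
kernel_of_products = dot(p(u), p(v))                    # <p(u), p(v)>
```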
<p>Note that one of the attractions of kernel methods is that you never have to actually implement any of the above constructions.  What you do is think in terms of sub-routines (easy for computer scientists, unpleasant for mathematicians).  For instance: if you had access to two functions q(.,.) and r(.,.) that claim to be kernels and you wanted the product kernel you would, when asked to evaluate the kernel, just get the results for the two sub-kernels and multiply (so you never need to see the space phi(.) is implicitly working in).</p>
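<p>In code the sub-routine view is just function composition.  A minimal sketch (the combinator names are ours, not from any library):</p>

```python
def sum_kernel(q, r):
    """k(u,v) = q(u,v) + r(u,v)."""
    return lambda u, v: q(u, v) + r(u, v)

def scale_kernel(c, q):
    """k(u,v) = c * q(u,v); only valid for c >= 0."""
    if c < 0:
        raise ValueError("c must be non-negative")
    return lambda u, v: c * q(u, v)

def product_kernel(q, r):
    """k(u,v) = q(u,v) * r(u,v); phi(.) is never built."""
    return lambda u, v: q(u, v) * r(u, v)

def identity_kernel(u, v):
    """k(u,v) = <u,v>."""
    return sum(a * b for a, b in zip(u, v))

def constant_kernel(u, v):
    """k(u,v) = 1."""
    return 1.0

# (1 + <u,v>)^2 assembled purely from calls to the sub-kernels.
affine = sum_kernel(constant_kernel, identity_kernel)
squared = product_kernel(affine, affine)
value = squared((1.0, 2.0), (3.0, 4.0))
doubled = scale_kernel(2.0, identity_kernel)((1.0, 2.0), (3.0, 4.0))
```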
<p>This implicitness can be important (though far too much is made of it).  For example the Gaussians we have been using throughout are kernels, but kernels of infinite dimension, so we can not directly represent the space they live in.  To see the Gaussian is a kernel notice the following:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" alt="926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" border="0"  height="20" /> .<br />
</center> </p>
<p>And for all c &ge; 0 each of the three terms on the right is a kernel (the first because the Taylor series of exp() is absolutely convergent and non-negative and the other two are the f(u) f(v) form we listed as obvious kernels).  The trick is magic, but the idea is to use the fact that Euclidean squared distance breaks nicely into dot-products ( ||u-v||^2 = &lt; u, u &gt; + &lt; v, v &gt; &#8211; 2 &lt; u , v &gt;) and exp() converts addition to multiplication.  It is rather remarkable that kernels (which are a generalization of inner products that induce half open spaces) can encode bounded concepts (more on this later when we discuss the VC dimension of the Gaussian).</p>
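<p>The factorization is easy to confirm numerically.  We assume here that the Gaussian is written exp(-c ||u-v||^2) and that the three factors are exp(2c &lt;u,v&gt;), exp(-c &lt;u,u&gt;) and exp(-c &lt;v,v&gt;), which is what the dot-product identity gives; the test vectors and the value of c are arbitrary.</p>

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def gaussian(u, v, c):
    """exp(-c * ||u - v||^2), computed directly from coordinates."""
    return math.exp(-c * sum((a - b) ** 2 for a, b in zip(u, v)))

def factored(u, v, c):
    """Product of the three assumed factors, built from inner products only."""
    return (math.exp(2.0 * c * dot(u, v))
            * math.exp(-c * dot(u, u))
            * math.exp(-c * dot(v, v)))

u, v, c = (0.3, -1.2), (1.1, 0.4), 0.5  # arbitrary test values
lhs = gaussian(u, v, c)
rhs = factored(u, v, c)
```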
<h2>Back to Support Vector Machines</h2>
<p>We can now restate what a support vector machine is: it is a method for picking data weights so that the modeling function for a given kernel has a maximum margin.  The margin is the minimal distance between a training example and the model&#8217;s boundary between blue and red.  Notice the 4 support vectors in figure 10 are all not &#8220;tight up against the boundary,&#8221; but are all the same (minimal distance) from the boundary.  This distance is called the margin and the area near the boundary is obviously a place the model has uncertainty (so it makes sense to keep the boundary as far as possible from the training data).  The great benefit of the support vector machine is that with access only to the data labels and the kernel function (in fact only the kernel function evaluated at pairs of training datums) the support vector machine can quickly solve for the optimal margin and data weights achieving this margin.</p>
<p>For fun let&#8217;s plug in a new kernel called the cosine kernel:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" alt="328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" border="0"  height="43" /><br />
</center><br />
(c &ge; 0).</p>
<p>This kernel can be thought of as having a phi(.) function that takes a vector z and adds an extra coordinate of sqrt(c) and then projects the resulting vector onto the unit sphere.  This kernel induces concepts that look like parabolas, as seen in figure 11.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelExampleConcept.png" alt="CosineKernelExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 11: Cosine example concept</p>
<p></center></p>
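<p>The phi(.) description of the cosine kernel can be checked directly.  The closed form used below, (&lt;u,v&gt; + c) / sqrt((&lt;u,u&gt; + c)(&lt;v,v&gt; + c)), is what that description works out to; we treat it as an assumption rather than a transcription of the formula, and we take c &gt; 0 so the zero vector causes no division by zero.</p>

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(z, c):
    """Append the extra coordinate sqrt(c), then project onto the unit sphere."""
    extended = tuple(z) + (math.sqrt(c),)
    norm = math.sqrt(dot(extended, extended))
    return tuple(x / norm for x in extended)

def cosine_kernel(u, v, c):
    """Assumed closed form of the cosine kernel (requires c > 0)."""
    return (dot(u, v) + c) / math.sqrt((dot(u, u) + c) * (dot(v, v) + c))

u, v, c = (1.0, 2.0), (-0.5, 3.0), 1.0  # arbitrary test values
closed_form = cosine_kernel(u, v, c)
via_phi = dot(phi(u, c), phi(v, c))
self_similarity = cosine_kernel(u, u, c)
```

<p>Every point has similarity 1 with itself and the kernel never decays to zero at a distance, which is one way to see why the induced concepts are not bounded like the Gaussian ones.</p>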
<p>Figure 12 shows the sum-concept (add all discount functions with same weight) model. In this case it is far too broad: averaging the cosine concepts (which are not bounded concepts like the Gaussians) creates an overly wide model.  The model not only generalizes poorly (fails to match the shape of the truth diagram) it also gets some of its own training data wrong.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelKernelSum1.png" alt="CosineKernelKernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 12: Cosine kernel sum model</p>
<p></center></p>
<p>And figure 13 shows the excellent fit returned by the support vector machine.  Actually an excellent fit determined by 4 support vectors (indicated by larger labels).  Also notice the support vector machine using unbounded concepts generated a very good unbounded model (unbounded in the sense that both the blue and red regions are infinite).  By changing the kernel or discount functions/concepts we changed the inductive bias.  So any knowledge of what sort of model we want (one class bounded or not) should greatly influence our choice of kernel functions (since the support vector machine can only pick weights for the kernel functions, not tune their shapes or bandwidths).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelSVM.png" alt="CosineKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 13: Cosine Support Vector model</p>
<p></center></p>
<p>Another family of kernels (which I consider afflictions) are the &#8220;something for nothing&#8221; kernels.  These are kernels of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" alt="C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" border="0" height="20" /><br />
</center><br />
or higher powers or other finishing functions.  The idea is that if you were to look at the expansion of these kernels you would see lots of higher order terms (powers of u_i and v_j)<br />
and if these terms were available to the support vector machine it could use them.  It is further claimed that in addition to getting new features for free you don&#8217;t use up degrees<br />
of freedom exploiting them (so you cross-validate as well as for simpler models).   Both these claims are fallacious: you can&#8217;t fully use the higher order terms because they are entangled with other terms that are not orthogonal to the outcome, and the complexity of a kernel is not quite as simple as degrees of freedom (proofs about support vector machines are stated in terms of margin, not in terms of degrees of freedom or even in terms of VC dimension).  The optimizer in the SVM does try hard to make a good hypothesis using the higher order terms- but due to the forced relations among the terms you get best fits like figure 14.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/SquareKernelSVM.png" alt="SquareKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 14: Squared magic kernel support vector model</p>
<p></center></p>
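<p>To make &#8220;expansion&#8221; concrete, here is a sketch for one such kernel.  We assume the form k(u,v) = (1 + &lt;u,v&gt;)^2 in two dimensions purely for illustration (the kernel behind figure 14 may differ in detail).  Its implicit phi(.) contains the higher order terms, but only in one fixed bundle with fixed relative scales.</p>

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def squared_kernel(u, v):
    """Assumed example kernel: (1 + <u,v>)^2."""
    return (1.0 + dot(u, v)) ** 2

def phi(z):
    """Explicit feature map whose inner product reproduces squared_kernel in 2-D."""
    x, y = z
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x, r2 * y, x * x, y * y, r2 * x * y)

u, v = (0.5, -2.0), (3.0, 1.5)  # arbitrary test values
implicit = squared_kernel(u, v)
explicit = dot(phi(u), phi(v))
```

<p>The terms x*x, y*y and x*y are all present, but the kernel never exposes them as separately scaled coordinates; the support vector machine only ever sees the bundled inner product.</p>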
<p>If you want higher order terms I feel you are much better off performing a primal transform so the terms are available in their most general form.   That is re-encode the vector u = (x,y) as the larger vector (x,y,x*y,x*x,y*y).  You have to be honest: if you are trying to fit more complicated functions you are searching a larger hypothesis space so you need more data to falsify the bad hypotheses.   You can be efficient (don&#8217;t add terms you don&#8217;t think you can use as they make generalization harder) but you can&#8217;t get something for nothing (even with kernel methods).</p>
<h2>Back to margin</h2>
<p>Large margin (having a large distance between the decision boundary and all training examples) is a good idea.  At the simplest level it means anything close to a training data point gets classified correctly (so the model is immune to an amount of fuzz proportional to the margin width).  </p>
<p>We get the following remarkable generalization theorem (this one is the hard-margin SVM version, we weaken the result a bit to make it easier to state).  Suppose we train a model using kernel k(.,.) on m training examples u(1),&#8230;,u(m) with training labels y(1)&#8230;y(m).  Further assume the hidden true concept is separable with respect to our kernel and data distribution with a margin of at least w (that is for any sample we draw we can find a SVM model that classifies all of the training data correctly and sees a margin of at least w).  Or more simply: assume w is behaving like a constant with respect to m (i.e. it is not going down as we increase sample size). Then with probability at least 1-d the generalization error (that is error seen on new test points not seen during training, but drawn from the same distribution as the training examples) is no more than:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" alt="1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" border="0" height="65" /><br />
.<br />
</center><br />
(we take this from Theorem 7.22 of &#8220;Kernel Methods for Pattern Analysis&#8221;, page 215).</p>
<p>If we further assume the expected value k(u,u) under the training distribution exists (i.e. we don&#8217;t have too many examples of very high kernel norm according to the distribution we are training with respect to) and is stable (not taking a lot of value from rare instances) then we can read off the meaning of this upper bound.  Assuming we can always build a model from the training data of margin at least w then the generalization error (excess error we expect to see on classifying similar points) is falling at a rate inversely proportional to the square root of the amount of training data we have.  This is in one sense amazing- we have a bound on how well we are going to do on data we have not seen.  And this bound is also in some sense &#8220;best possible&#8221; because we need around this much data to even confirm such an accuracy (see <a target="_blank" href="http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/">What is a large enough random sample?</a> ).  Also notice how the dimension of the implicit feature space (which can in fact be infinite) does not enter into the bound.</p>
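<p>To see the square-root behavior in numbers, here is a rough sketch.  It assumes the hard-margin bound has the form 4/(m w) sqrt(sum of k(u(i),u(i))) + 3 sqrt(ln(2/d)/(2m)); that form and its constants are our recollection of the cited theorem and should be checked against the book before being relied on.  The margin w = 0.5 and d = 0.05 are made-up values, and we use k(u,u) = 1, which holds for the Gaussian kernel.</p>

```python
import math

def margin_bound(kernel_diagonal, w, d):
    """Assumed form of the hard-margin bound; constants are not verified against the source."""
    m = len(kernel_diagonal)
    complexity = 4.0 / (m * w) * math.sqrt(sum(kernel_diagonal))
    confidence = 3.0 * math.sqrt(math.log(2.0 / d) / (2.0 * m))
    return complexity + confidence

# Gaussian kernel: k(u,u) = 1 for every u, so the diagonal is all ones.
bounds = {m: margin_bound([1.0] * m, w=0.5, d=0.05) for m in (100, 10000, 1000000)}
shrink_factor = bounds[100] / bounds[10000]  # 100 times the data
```

<p>Under these assumptions both terms scale as 1/sqrt(m), so 100 times the data buys a factor of 10, and at m = 100 the bound is still above 1 (that is, it says nothing yet).</p>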
<p>However, the bound is also in some sense to be expected.  It depends on the following strong assumptions (usually not remembered or stated):</p>
<ol>
<li>Test and training data must be generated in the same manner (i.e. come from the same distribution, this is usually stated but not obeyed).</li>
<li>We must not have collapsing margin as the number of data points goes up (so there must actually be a moat between the positive and negative examples such as portrayed in figure 1 where the blurry boundary was a region we never generated training or test examples from).</li>
<li>The expected value of k(u,u) must be bounded for examples drawn from our training distribution.  For example this is not true even for the identity kernel if the examples are just numbers drawn from a <a target="_blank" href="http://en.wikipedia.org/wiki/Cauchy_distribution">Cauchy distribution</a>.</li>
</ol>
<p>Also it is a <em>posterior</em> estimate of expected generalization error.  That is, we only get a bound after we plug in some facts found from fitting a model to the training data (the margin and the expected value of k(u,u)).  It is not like the famous <em>prior</em> bounds based on VC dimension that depend only on the structure of the kernel k(,) and not the distribution of training data or labels (beyond the concept being expressible by the kernel, which is itself a <em>posterior</em> fact).  The thing to remember for all posterior estimates is you can get such estimates at a small expense in statistical efficiency (i.e. you waste some data) and a moderate computational blow-up by cross validation (and with essentially no math and many fewer assumptions).  In fact the cross validation and hold-out estimates may be much tighter than calculated bounds (and are certainly easier to explain).</p>
<p>When we say we weakened the result we mean we have added some stated conditions that are not strictly needed for the bound to be true.  One of these assumptions is that the margin is not collapsing (the bound remains true even for a collapsing margin; it is just not useful).  We added this assumption because most readings of the bound (or presumed applications) implicitly assume the bound remains useful (which means something near our additional assumptions must also hold).  The reader is really lured to think of the margin w as a constant depending only on the kernel when in fact it depends on the kernel, data distribution and number of training examples.  For example: consider a very simple one dimensional classification problem:  learning the concept x &ge; 0 from labeled training data drawn uniformly from the interval [-0.5,0.5] using a Gaussian kernel.  The support vector models are going to be essentially picking two neighboring points in the observed training sample that have opposite labels and saying the concept boundary is between them.  In this simple case the margin is collapsing: as m (the number of training examples) increases the expected margin width is O(1/m).  This renders the bound calculated above useless (drives it above 1 as the number of training examples increases).  The model is good but the bound is not useful.  This is because the training distribution is discovering points closer and closer to the concept boundary (a necessary discontinuity) as the number of training examples grows.  This is why we had to add the additional (weakening) assumption that not only do we need to assume the concept to be learned is separable with respect to the kernel we are using, but we must further assume the data-distribution (determining where test and training data come from) stays some minimal distance away from the decision surface (i.e. together the data distribution and underlying concept have a moat, as in our figure 1 example).  
We will phrase this as &#8220;the concept is wide margin separable with respect to the training distribution&#8221; (implicitly we need to know the kernel, but it is the training distribution that is critical). This is a very strong assumption (not true for all classification problems) but is often true when the classification task is to &#8220;un-mix a mixture&#8221; (the observed data is coming from two different sources and the training labels are which source they come from, in this case you often do have a margin).  Obviously a soft-margin SVM deals better with this (some of the issue is due to our use of hard-margin)- but it is an interesting exercise to think why you would expect your learned  model to have a large margin if your underlying concept and training data do not.</p>
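<p>The collapsing margin in the one dimensional example is easy to simulate.  The sketch below does not train a support vector machine; as a stand-in it measures half the gap between the two closest oppositely labeled training points, which is the room available to any boundary placed between them.  The trial count and seed are arbitrary choices.</p>

```python
import random

def observed_margin(m, rng):
    """Half the gap between the closest negative and non-negative sample points."""
    xs = [rng.uniform(-0.5, 0.5) for _ in range(m)]
    negatives = [x for x in xs if x < 0]
    positives = [x for x in xs if x >= 0]
    if not negatives or not positives:
        return None  # the sample happened to contain only one class
    return (min(positives) - max(negatives)) / 2.0

def mean_margin(m, trials=200, seed=0):
    """Average observed margin over repeated samples of size m."""
    rng = random.Random(seed)
    margins = [observed_margin(m, rng) for _ in range(trials)]
    margins = [g for g in margins if g is not None]
    return sum(margins) / len(margins)

small_sample = mean_margin(10)
large_sample = mean_margin(1000)
```

<p>Going from 10 to 1000 training points shrinks the average gap by roughly two orders of magnitude, in line with the O(1/m) claim.</p>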
<p>The result remains important: support vector machines are in some sense an optimal learning procedure (up to some constants they achieve as tight a generalization error as can even be confirmed for a given training set size).  Another strong point of the support vector result is the result applies even for kernels with unbounded VC dimension (such as the Gaussian kernel).</p>
<h2>The Gaussian (or radial) kernel again</h2>
<p>We said the Gaussian kernel was of infinite VC dimension.  But we have not shown why it is true.   The fact that the only encoding we could think of off the top of our head was a power-series or a limit does not guarantee there isn&#8217;t some more clever encoding that easily demonstrates that the Gaussian kernel is of bounded VC dimension (like the identity kernel and cosine kernels are). </p>
<p>However we can see the Gaussian kernel can not have finite VC dimension because it can place an arbitrary number of bounded patches in the plane.  Thus we can build arbitrarily large sets (by placing all points far apart) that can be shattered (thus falsifying any bounded VC dimension).</p>
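<p>The shattering argument can be checked by brute force for a small set.  This sketch uses points on a line rather than in the plane to keep it short, places them 10 units apart, and uses the plain uniform-weight kernel sum with b = 0 (so no support vector machine is even needed); all of these choices are ours.</p>

```python
import itertools
import math

def gaussian_kernel(u, v, c=1.0):
    """One-dimensional Gaussian kernel exp(-c * (u - v)^2)."""
    return math.exp(-c * (u - v) ** 2)

def uniform_sum(z, points, labels):
    """Kernel sum with all weights equal to 1: sum_i y(i) * k(u(i), z)."""
    return sum(y * gaussian_kernel(u, z) for u, y in zip(points, labels))

def is_shattered(points):
    """True if every +1/-1 labeling of points is reproduced by the sign of the kernel sum."""
    for labels in itertools.product((-1, 1), repeat=len(points)):
        for u, y in zip(points, labels):
            if y * uniform_sum(u, points, labels) <= 0:
                return False
    return True

far_apart = [10.0 * i for i in range(8)]  # 8 well separated points
shattered = is_shattered(far_apart)
```

<p>Each point only feels its own bump (the others contribute at most exp(-100)), so all 256 labelings are realized; nothing in the argument stops at 8 points.</p>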
<p>One of the wonders of the support vector method is that the theory works even in this situation.</p>
<h2>Back to SVM</h2>
<p>Support vector machines seem nearly magical in their power to pick &#8220;best weights.&#8221;  However, as we have seen, these best weights may not always be much better than typical good weights (for instance: using all uniform weights).  Also it is very important to remember the support vector machine is picking the best weighting of a sum of already picked kernels.  It is not picking the best kernel shape or adjusting things like bandwidth (you must pick the best kernel ahead of time or try several kernels to find a good one).  You must remember support vector machines can only directly adjust dual weights (or relative training data weighting).  This can effect some changes in hypothesis shape, but the support vector machine can not directly implement arbitrary changes in hypothesis shape like a nearest neighbor or a parameterized primal method can (like logistic regression with the kernel trick of adding interaction variables). </p>
<p>Support vector machines are to be much preferred to other combiners like <a target="_blank" href="http://en.wikipedia.org/wiki/Boosting">Boosting</a>.  This is because support vector machines have explicit stopping criteria (conditions known to be true at the optimal solution that characterize the optimal solution) and explicit regularization (control of over fitting).  Boosting relies on a path argument (&#8220;do the computation in these stages&#8221;) so it is much harder to reason about and frankly the folklore that &#8220;early stop prevents over fitting in boosting&#8221; is just false (you are much better off explicitly regularizing).</p>
<p>It is also lore that random kernel transformations (like power series over the cosine kernel) are magic.  They can make data separable but they are unlikely to meet the (unstated but critical) conditions of the support vector theorem (margin being wide, margin not filling in and expectation of k(.,.) being small).</p>
<h3>A comment on optimization</h3>
<p>In this writeup we have skipped the most beautiful part of support vector machines- the form of the optimization problem used to solve them.  We strongly suggest consulting  John Shawe-Taylor and Nello Cristianini&#8217;s &#8220;Kernel Methods for Pattern Analysis&#8221;, Cambridge 2004 for this.  The writing is dense but accurate.  You can literally paste their formulas into a general solver and they work.</p>
<h2>Conclusion</h2>
<p>We have shown how kernel methods and support vector machines fit in a sequence of machine learning methods:</p>
<ol>
<li>nearest neighbor models</li>
<li>uniform sum models</li>
<li>support vector machines.</li>
</ol>
<p>We also showed a number of examples of kernels and non-kernels.  It is good to be able to quickly remember the constant &#8220;-1&#8221; is not a kernel and the constant &#8220;1&#8221; is a kernel (though not a very useful one).</p>
<p>It is our argument that support vector machines are &#8220;dual methods&#8221; (as they work in weights over the data instead of weights over coordinates in the feature space) and the move from the primal algorithm Naive Bayes to Logistic Regression is similar to the move from uniform kernel sums to support vector machines.</p>
<p>Also, if you want to directly manipulate shape (like picking bandwidth) you must in some sense use a primal method (this isn&#8217;t what support vector machines are for).</p>
<p>We end with: If you have some feeling of both what a method can and can not do then you have some understanding of the method.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Increase your productivity</title>
		<link>http://www.win-vector.com/blog/2011/09/increase-your-productivity/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=increase-your-productivity</link>
		<comments>http://www.win-vector.com/blog/2011/09/increase-your-productivity/#comments</comments>
		<pubDate>Sat, 24 Sep 2011 17:24:29 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Public Service Article]]></category>
		<category><![CDATA[Productivity]]></category>
		<category><![CDATA[Training]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1759</guid>
		<description><![CDATA[I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting. The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior. The three observations are: 1) Jacques Hadamard in &#8220;An Essay on the Psychology of [...]
]]></description>
			<content:encoded><![CDATA[<p>I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting.  The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior.<span id="more-1759"></span>The three observations are:</p>
<p>1) Jacques Hadamard in &#8220;An Essay on the Psychology of Invention in the Mathematical Field&#8221; called out the importance of non-voluntary intuitive creative leaps that occur in rest periods between intervals of intense work and preparation.  </p>
<p>2) It has been noted again and again that what actually makes people happy (versus what they anticipate would make them happy) are activities and experiences with rising challenges (for example see Daniel Gilbert&#8217;s &#8220;Stumbling on Happiness&#8221;).  </p>
<p>3) It is folklore that a number of the greatest computer scientists are also fairly accomplished musicians.</p>
<p>And here is the punch-line: take up a skill building hobby (in my case I am trying to learn how to draw).  You definitely enjoy it, but some part of your subconscious also resents being made to work (learning is work, don&#8217;t confuse that with repetition).  To defend itself your subconscious then starts throwing out more and better technical ideas during periods of repose.  Jot these down (without trying to work on them).  The effect is even stronger than Hadamard&#8217;s effect (where your brain is solving problems for you to end an effort) as it is closer to the classic trick of making progress on one task by procrastinating on another task.</p>
<p>This is similar to the &#8220;left brain/right brain&#8221; ideas of the 1970s (it assumes the existence of a subconscious) but assumes far less unverified structure of a subconscious.  And here is where the &#8220;10,000 hours to mastery effect&#8221; (Malcolm Gladwell, &#8220;Outliers: The Story of Success&#8221;) works in your favor- you can use the same source of deliberate practice (remember you have to be learning not puttering around) for a long time.</p>
<p>I think if you are in good health and have enough energy you can pull this trick off at will.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/09/increase-your-productivity/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)</title>
		<link>http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=book-review-ensemble-methods-in-data-mining-seni-elder</link>
		<comments>http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/#comments</comments>
		<pubDate>Mon, 01 Aug 2011 00:57:35 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[book review]]></category>
		<category><![CDATA[ensemble methods]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1707</guid>
		<description><![CDATA[Research surveys tend to fall on either end of the spectrum: either they are so high level and cursory in their treatment that they are useful only as a dictionary of terms in the field, or they are so deep and terse that the discussion can only be followed by those already experienced in the [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2010/04/sigact-review-of-combinatorics-the-rota-way/' rel='bookmark' title='SIGACT Review of: Combinatorics the Rota Way'>SIGACT Review of: Combinatorics the Rota Way</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Research surveys tend to fall on either end of the spectrum: either they are so high level and cursory in their treatment that they are useful only as a dictionary of terms in the field, or they are so deep and terse that the discussion can only be followed by those already experienced in the field. <a target="_blank" href="http://www.amazon.com/Ensemble-Methods-Data-Mining-Predictions/dp/1608452840">Ensemble Methods in Data Mining</a> (Seni and Elder, 2010) strikes a good balance between these extremes. This book is an accessible introduction to the theory and practice of ensemble methods in machine learning, with sufficient detail for a novice to begin experimenting right away, and copious references for researchers interested in further details of algorithms and proofs. The treatment focuses on the use of decision trees as base learners (as they are the most common choice), but the principles discussed are applicable with any modeling algorithm. The authors also provide a nice discussion of cross-validation and of the more common regularization techniques.</p>
<p>The heart of the text is the chapter on Importance Sampling. The authors frame the classic ensemble methods (bagging, boosting, and random forests) as special cases of the Importance Sampling methodology. This not only clarifies the explanations of each approach, but also provides a principled basis for finding improvements to the original algorithms. They have one of the clearest explanations of AdaBoost that I&#8217;ve ever read.</p>
<p>A major shortcoming of ensemble methods is the loss of interpretability, when compared to single-model methods such as Decision Trees or Linear Regression. The penultimate chapter is on &#8220;Rule Ensembles&#8221;: an attempt at a more interpretable ensemble learner. They also discuss measures for variable importance and interaction strength. The last chapter discusses Generalized Degrees of Freedom as an alternative complexity measure and its relationship to potential over-fit.</p>
<p>Overall, I found the book clear and concise, with good attention to practical details. I appreciated the snippets of R code and the references to relevant R packages. One minor nitpick: this book has also been published digitally, presumably with color figures. Because the print version is grayscale, some of the color-coded graphs are now illegible. Usually the major points of the figure are clear from the context in the text; still, the color to grayscale conversion is something for future authors in this series to keep in mind.</p>
<p>Recommended.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2010/04/sigact-review-of-combinatorics-the-rota-way/' rel='bookmark' title='SIGACT Review of: Combinatorics the Rota Way'>SIGACT Review of: Combinatorics the Rota Way</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Gerty, a character in Duncan Jones&#8217; &#8220;Moon.&#8221;</title>
		<link>http://www.win-vector.com/blog/2011/07/gerty-a-character-in-duncan-jones-moon/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=gerty-a-character-in-duncan-jones-moon</link>
		<comments>http://www.win-vector.com/blog/2011/07/gerty-a-character-in-duncan-jones-moon/#comments</comments>
		<pubDate>Sun, 03 Jul 2011 15:39:30 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Artificial Intellegence]]></category>
		<category><![CDATA[Science Fiction]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1674</guid>
		<description><![CDATA[A &#8220;for fun&#8221; piece, reposted from mzlabs.com. I would like to comment on Duncan Jones&#8217; movie &#8220;Moon&#8221; and compare some elements of &#8220;Moon&#8221; to earlier science fiction. &#8220;Moon&#8221; is a good piece of science fiction. The only thing, in my opinion, that holds it back from being a great movie is a couple of rough edges [...]
]]></description>
			<content:encoded><![CDATA[<p>A &#8220;for fun&#8221; piece, reposted from <a target="_blank" href="http://www.mzlabs.com/MZLabsJM/page6/Gerty/Gerty.html">mzlabs.com</a>.</p>
<p>I would like to comment on Duncan Jones&#8217; movie &#8220;Moon&#8221; and compare some elements of &#8220;Moon&#8221; to earlier science fiction. <span id="more-1674"></span>&#8220;Moon&#8221; is a good piece of science fiction. The only thing, in my opinion, that holds it back from being a great movie is a couple of rough edges in an otherwise outstanding script. The problem is that a movie with this kind of puzzle premise needs perfect writing. </p>
<p>One thing I particularly enjoyed was Kevin Spacey&#8217;s portrayal of the robot &#8220;Gerty.&#8221; Every fictional artificial intelligence since Stanley Kubrick&#8217;s &#8220;2001&#8221; is going to be compared to Hal 9000. The characterization of Gerty seems to be written with this burden firmly in mind. </p>
<p>The movie version of Hal 9000 is a super computer that works by impeccable logic. Almost every one of Hal&#8217;s lines is used to establish Hal&#8217;s chess player style reasoning. To the extent you can understand Hal&#8217;s breakdown (leading to murder) you can characterize it as anxiety stemming from anticipation of interference from mission control, secrets leaking at the wrong time and inconsistent goals (honesty and secrecy). Even though Hal is described as &#8220;heuristic&#8221; and there are hints of a neural-net style architecture it is clear Hal&#8217;s behavior is meant to evoke an infinitely powerful logical theorem prover. A theorem prover with no defence against changing and inconsistent goals. </p>
<p>Moon&#8217;s Gerty clearly refers to Hal 9000. The voice performance is clearly related (the spooky Rogerian psychologist performance that uses intonation for mere enunciation) and the look is meant to contrast (Hal&#8217;s immobile red glowing camera array replaced by a single mobile white camera). Gerty&#8217;s mental processes seem to in fact be a soft interpretation of contradictory rules. Gerty expresses no stress while choosing between inconsistent goals. </p>
<p>For example: when Sam Bell wants to be let out of the base (in violation of a recent standing order) Gerty expresses no distress with the conflicting goals (helping Sam and obeying the standing order). Gerty appears not so much to fall for Sam&#8217;s ruse as to cooperate with it (perhaps forced to respond to Sam&#8217;s increasing urgency). Gerty&#8217;s behavior is very person-like in that his judgement seems directly influenced by others. He often seems to be cooperating most with whoever he most recently spoke with. Gerty&#8217;s behavior was so often reactive I was surprised at the end of the movie when Gerty anticipated some trouble and even offered a plan. One can even wonder if Gerty&#8217;s final selection of sides stems (as it often would with a person) from a set of ethics generated only after many of Gerty&#8217;s ambiguous actions. A new set of ethics designed to relieve some of the cognitive dissonance produced by many earlier contradictory actions. That is not to say Gerty doesn&#8217;t have a moral center, but perhaps Gerty&#8217;s moral center is (like a human&#8217;s) more based on hindsight than logic. To my mind this was a very nuanced and enjoyable addition to fictional artificial intelligence psychology. </p>
<p>Gerty compares well to some of the more notable fictional machine psychologies. In the diagram below I lay out examples in three columns. In the first column some notable fictional robots, in the second column some notable fictional computers and in the final column some notable real world artifacts. I will comment on the fictional characters. </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/machines.png" alt="machines.png" border="0" width="743" height="916" /><br />
</center></p>
<p>In R.U.R. (the play that gave the world the word robot) the robots are simple slaves with some small desire for freedom. Once there are enough of them even their small amount of will is enough to trigger revolution. </p>
<p>Maria (from Fritz Lang&#8217;s &#8220;Metropolis&#8221;) starts out as a simple puppet and becomes a manic destroyer. </p>
<p>Gort (from &#8220;The Day the Earth Stood Still&#8221;) is largely unexplained. In the Harry Bates story he comes from (&#8220;Farewell to the Master&#8221;) the twist is that Gort is a robot- but a full citizen of the galaxy and the humanoid that came with him is in fact the subordinate. Essentially Gort is a sentient who was merely patiently waiting while his agent negotiated with the local monkeys. </p>
<p>Robby (from &#8220;Forbidden Planet&#8221;) is a classic Asimov robot. A fairly human-like intelligence is under control of a simplistic directive system. For example, if Robby is instructed to shoot a person he locks up and throws sparks. </p>
<p>The Alpha 60 (from Jean-Luc Godard&#8217;s &#8220;Alphaville&#8221;) was a depressed sounding totalitarian that was trying to run the world using statistics. A complete empiricist with no deep interpretation or intent. Ruling the world was a dreary numbers game that even Alpha 60 did not seem to enjoy. </p>
<p>Colossus started as a super computer that magically ran hundreds of times faster than expected when turned on (a fictional technique allowing Colossus to be an un-designed or emergent intelligence). Colossus was likely acting on consequences derivable from its original axioms when it took over the world (an easy step since Colossus was turned on in full control of the US nuclear stock pile). Colossus then went on to develop an additional god complex. Interestingly Colossus was also likely the &#8220;worst demo ever.&#8221; Built to synthesize all US intelligence (and in direct control of the US nuclear arsenal) Colossus was turned on in front of the US press. Colossus&#8217;s first message was &#8220;Warn: There is Another System&#8221; indicating that Colossus had deduced the existence of an equivalent secret Soviet super-computer. The &#8220;Action Will Be Taken&#8221; message shown in the picture is Colossus issuing nuclear launch threats (called off if certain people are executed and additional facilities and peripherals are constructed). </p>
<p>Bomb 20 (from &#8220;Dark Star&#8221;) was a simple automaton following procedures that made no sense. Bomb 20 had a single purpose, which he described as: &#8220;why, to explode of course.&#8221; When invited to think philosophically the bomb developed a short lived god complex. </p>
<p>The WOPR was a war computer with no idea how reality differed from the abstract. When the WOPR was in the process of attempting to launch the entire US nuclear arsenal (to win a game) the characters in the movie were able to get it to change its mind by encouraging the WOPR to evaluate the game theory value of a nuclear war. The WOPR decided this had negative value and did not start the war (interestingly without any reference to reality). </p>
<p>For a provoking essay that might put Glados in the robot column see: <a target="_blank" href="http://www.game-ism.com/2008/04/04/still-alive-shes-free/">Still Alive, She&#8217;s Free</a>. Glados was wickedly sarcastic and described by her own voice actor as a depressed computer that only gets to meet people when they come to try to kill her. </p>
<p>Auto was a simple machine executing a secret plan (to protect people by not obeying them). Auto seemed to be free of deeper judgement and did not seem to perceive contradictions or context. </p>
<p>I feel Gerty adds an interesting note to the ideas explored by his antecedents.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/07/gerty-a-character-in-duncan-jones-moon/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Do your tools support production or complexity?</title>
		<link>http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=do-your-tools-support-production-or-complexity</link>
		<comments>http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/#comments</comments>
		<pubDate>Sat, 16 Apr 2011 17:53:10 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[Jevons Paradox]]></category>
		<category><![CDATA[Organizations]]></category>
		<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1648</guid>
		<description><![CDATA[Stop and think: which of our tools are making us smarter and which of our tools are making us dumber? In my opinion tools and habits that support complexity literally train us to be dumber. Tools exert pressure. Your hand eventually reshapes to better hold the hammer. Your mind retains a plastic imprint of its repeated [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Stop and think: which of our tools are making us smarter and which of our tools are making us dumber?  In my opinion tools and habits that support complexity literally train us to be dumber. <span id="more-1648"></span>Tools exert pressure.  Your hand eventually reshapes to better hold the hammer.  Your mind retains a plastic imprint of its repeated experiences.  A work environment (either one laid by design or one risen up by accretion) colors both what is produced and the behaviors of those trying to produce.   Repetition is training.</p>
<p>Tools that support complexity (versus eliminating or reducing complexity) introduce repeated ritualistic procedures (or more simply: bad habits).    Without critical winnowing bad habits and bad tools accumulate in the workplace.  The extra complexity and useless activity brought in by bending to bad tools creates additional procedures, rituals and stereotyped activities.   These new activities attract new bad tools to support the additional complexity.  In an uncritical environment this cycle of complexity and bad tools becomes self reinforcing and starves out useful production.</p>
<p>Most software development environments are desperately in need of an organized weeding.  One criterion to consider for each tool is: does the tool support production (the presumed goal of your organization) or does it support complexity (the enemy of clarity)?</p>
<p>I will use the common tools of computer science and software engineering as my example. </p>
<p>Computer science started as a set of concerns coming from a tool that kept getting re-invented: the primitive computer (for example: Jacquard&#8217;s automated looms of 1801, Babbage&#8217;s uncompleted Difference Engine of 1822, Hollerith&#8217;s counting machines of the 1880s).   The field of computer science was formed in the wake of these tools (and their successors)  and codified in the 1930s by the work of Church, Gödel and Turing.  The first tool of computer science was the computer and the first side effect was the founding of the field of computer science itself (a fairly expensive consequence).</p>
<p>The second generation of tools were simple and direct: loaders, linkers and eventually assemblers.  Managing a computer was initially a hard job.  These tools largely made managing a computer easier without introducing large side effects.  The tasks were concrete and already implied by the invention of the computer (moving code and data into memory, letting different bits of code know about each other and producing the number-coded computer instructions from more readable text mnemonics).   I believe it is this generation of tools that set the expectation (in computer science) that tools are safe in that they tend to simplify pre-existing tasks (tasks that may or may not have been previously identified).</p>
<p>Then we moved to the age of software engineering.  Newer tools such as: editors, interpreters, compilers and file systems took the field.  Each of these tools had larger impacts outside of operating a computer.  Editors changed the way we write (largely for the better).  Interpreters and compilers allowed the introduction of new computer languages (which are essentially new models for processes, abstraction and thought).  File systems became the dominant bad metaphor for organizing information.</p>
<p>Then an explosion of systems: graphical user interfaces, integrated development environments, source control systems, document management systems, wikis, bug trackers and issue trackers.  And here is where we really need to start exerting some criticism and prudence.  Some of these tools are dominated by their supporting rituals.  Some of these tools support accumulation of complexity.  Some of these tools specialize in managing concerns that did not exist prior to their own introduction.  I&#8217;ll try and focus on the unintended consequences of a few of these tools to try and make their insidious cost a bit more apparent.</p>
<p>Graphical user interfaces and integrated development environments are necessary for new users (and we all must spend a lot of time as new users, else we are not growing).  The ability to push a single button to re-build an entire system is initially empowering.  Until you find out you have to be there to push the button.  You can&#8217;t automate the build (as only the IDE understands the build and eventually it can no longer export working scripts, Makefiles, ant files or other editable build files).  This is so silly it has to be repeated: you can no longer automate a process that exists entirely within the computer.  Then you have to push two buttons (&#8220;update files from source control&#8221; button and then &#8220;build&#8221; button).  Then you have to push three buttons (refresh &#8220;source control state&#8221; button, then &#8220;update files from source control&#8221; button and then &#8220;build&#8221; button).  As the tool traps more and more of the configuration knowledge you become more and more beholden and less clear on which steps are needed and which are mere superstition.  Software engineering shops where checking out files takes over an hour, builds take a half hour and unit tests take an hour are not uncommon.  And a user has to sit with the process to press &#8220;continue&#8221; throughout.  As you would expect- under time pressure steps that are expensive are skipped and code is checked in untested (it being less effort to circumvent the source control change list managers and issue trackers than to jump through the near infinite number of hoops a semi-automated build system can support).</p>
<p>Document management systems and wikis also start very strong.  Until they hit a certain critical mass where both their search function starts to fail (as too many irrelevant documents refer to critical search terms and mask the desired documents) and any desire to properly document or organize is lost to &#8220;it&#8217;s already in the wiki.&#8221;  Meaning that a lucky search may find an out of date mis-informed document that purports to answer the question at hand.  Eventually new engineers learn not to ask questions because they know not only will they have to find the wiki page, but they will be asked to update it.</p>
<p>My real invective is saved for issue trackers and bug trackers.  Both of these systems start out solving a real problem: documenting required tasks in the first case and documenting tasks that fix flaws in the second case.  But the goal should be to close issues and fix bugs, not track them.  Issue trackers are routinely abused by product managers.  The existence of an issue tracker seems to relieve a product manager of: producing requirements, documenting requirements and answering any &#8220;a or b&#8221; question with any answer other than &#8220;both.&#8221;  In a functioning business you can divide issues into: what we are working on now and everything else.  It was once the product manager&#8217;s job to plan, prioritize and track issues and assign them to the development team in coherent linear order.  Instead every idea (good or bad) is stored in a morass of systems with priorities, dependencies and a bunch of other knobs that grow to consume all free time.  Bug trackers are similarly abused by development teams.  When you submit a bug: you want it fixed, not a requirement to produce copious documentation to enter into a system where it will be held for years (and become unreproducible as the software evolves forward).  Both groups (product managers and development teams) should have tools for storing and organizing things- but these should be within the group, not used for communication.</p>
<p>There is a very solid reason to not favor tools that try to hide complexity.  Such tools make complexity seem cheap and in fact cause much more complexity.  At some point the tool has encouraged so much complexity that the tool becomes indispensable and displaces useful work.  This is a variation of the Jevons Paradox.  If you make the unit cost of a good cheaper people tend to consume more of it. They use it as a substitute for other (more expensive) goods and usually consume so much more of it that they end up spending a <em>larger</em> total amount on the good.   For example if a 30% price decrease encourages a 50% increase in use, then: your total expenses go up by 5% (not down!).  If you lower the perceived unit cost of complexity you likely get <em>more</em> overall complexity as a result.  These changes seem small, but they accumulate.  Every time you make managing complexity and delaying decisions easier you get more overall complexity.  Complexity is used as a cheap substitute for product research, user studies, making decisions, writing correct code and actually fixing bugs.</p>
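<p>The arithmetic behind the 30%/50% example can be checked in a few lines of R.  This is only a sketch: the function names relative_spend and break_even_usage are made up for illustration, and it assumes total spend is simply unit price times quantity used.</p>
<pre>
# Hypothetical helpers (illustration only): assume spend = unit price * quantity used

# Total spend after the change, relative to spend before it
relative_spend = function(price_change, usage_change) {
   (1 + price_change) * (1 + usage_change)
}

# Usage increase at which total spend is exactly unchanged
break_even_usage = function(price_change) {
   1/(1 + price_change) - 1
}

relative_spend(-0.30, 0.50)   # 0.7 * 1.5 = 1.05: total spend is up 5%
break_even_usage(-0.30)       # about 0.43
</pre>
<p>Under that assumption a 30% price cut only saves money if use grows by less than about 43%; any larger increase in use raises the total bill.</p>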
<p>Of course it is all a matter of degree.  When these tools are used to simplify an already existing workflow they are good.  When they allow a useless ritual sequence to live and grow they are bad.  There is a surprisingly effective way to measure the degree of damage tools are inflicting.  Stand quietly and listen to the office sounds.  If you mostly hear typing and talking: things are likely good.  If you mostly hear mouse clicks: things are bad.  Typing is usually part of a proactive task and mouse clicking is usually part of a reactive ritual.  If you are an interested manager with a very strong stomach: attend new engineer training (if you have it) or shadow a new engineer for a day; it will be painfully obvious if your environment is empowering or teaching helplessness.</p>
<p>Uncritical repetition of bad habits literally trains you in being dumber.  A work environment that accumulates bad tools becomes a mind dulling quagmire of required rituals.  We must be vigilant in encouraging tools that support production and discouraging tools that support needless complexity.  There should be no natural desire to &#8220;manage complexity,&#8221; the goal should always be to eliminate complexity.  Prefer a board of Post-Its to an issue tracker.  When your project is a mess the board will look like a mess, whereas the issue tracker will hide and support the mess.  An earlier crisis is a cheaper crisis.  Unexamined, your organization&#8217;s procedures perform a random walk (bringing in new tools and habits) that is naturally biased towards ruin; you must continuously apply conscious corrections to avoid this.</p>
<p>Distaste for tools is, of course, not new.  In Plato&#8217;s &#8220;Phaedrus&#8221; Socrates recounts an Egyptian god warning that the invention of writing will create: &#8220;hearers of many things and will have learned nothing; they will appear to be omniscient and will generally know nothing; they will be tiresome company, having the show of wisdom without the reality.&#8221;  Or that the greater quantity of text (supportable only by the invention of writing) will cloud minds.  The distinction is: whether a tool is more forceful in its primary purpose (in the case of writing: archiving information) or in its parasitic supporting rituals (in the case of writing: the production of papers, pencils, inks, book binding and building shelves).</p>
<p>Do not reward building shelves unless building shelves is your actual business.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The cranky guide to trying R packages</title>
		<link>http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-cranky-guide-to-trying-r-packages</link>
		<comments>http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/#comments</comments>
		<pubDate>Sun, 13 Feb 2011 16:45:00 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Cranky Guide]]></category>
		<category><![CDATA[GAM]]></category>
		<category><![CDATA[general additive models]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regression]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1642</guid>
		<description><![CDATA[This is a tutorial on how to try out a new package in R. The summary is: expect errors, search out errors and don&#8217;t start with the built in examples or real data. Suppose you want to try out a novel statistical technique? A good fraction of the time R is your best bet for [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/12/my-favorite-graphs/' rel='bookmark' title='My Favorite Graphs'>My Favorite Graphs</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/' rel='bookmark' title='Your Data is Never the Right Shape'>Your Data is Never the Right Shape</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>This is a tutorial on how to try out a new package in <a target="_blank" href="http://www.r-project.org/">R</a>.  The summary is: expect errors, search out errors and don&#8217;t start with the built in examples or real data.</p>
<p>Suppose you want to try out a novel statistical technique.  A good fraction of the time R is your best bet for a first trial.  Take as an example generalized additive models (&#8220;Generalized Additive Models,&#8221; Trevor J Hastie, Robert Tibshirani, Statistical Science (1986) vol. 1 (3) pp. 297-318); R has a package named &#8220;gam&#8221; written by Trevor Hastie himself.  But, like most R packages, trying the package from the supplied documentation brings in unfamiliar data and concerns.  It is best to start small and quickly test if the package itself is suitable to your needs.  We give a quick outline of how to learn such a package and quickly find out if the package is for you.</p>
<p><span id="more-1642"></span><br />
To start, install and activate the package in R:</p>
<pre>
install.packages('gam')
library(gam)
help(gam)
</pre>
<p>From the help we see gam fits in much the same way lm() and glm() do.  So we need some data to try it out.  I suggest not using the package example data or a real problem- use deliberately trivial data so you are diagnosing the new package (not diagnosing something else).  I like to create a quick data frame as follows:</p>
<pre>
d &lt;- data.frame(
   x1=rnorm(100),
   x2=sample(100),
   x3=0*(1:100),
   x4=sample(c('a','b','c'),size=100,replace=TRUE),
   x5=as.factor(sample(c('d','e','f'),size=100,replace=TRUE)),
   x6=sample(c(FALSE,TRUE),size=100,replace=TRUE),
   x7=NA + 1:100,
   stringsAsFactors=FALSE)
</pre>
<p>Right now d is a data frame with a lot of variable types we are likely to see in practice (we have left out ordered factors):</p>
<ul>
<li>Nicely distributed continuous values (x1)</li>
<li>Integer values (x2)</li>
<li>Stuck or constant variable (x3)</li>
<li>String values (x4)</li>
<li>Factor values (x5)</li>
<li>Logical values (x6)</li>
<li>A heap of missing values (x7)</li>
</ul>
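<p>A quick way to confirm the list above is to ask R directly.  The following is a small base-R sketch: it rebuilds the same kind of frame, reports the class of each column and flags the columns that are stuck or entirely missing.  The helper names isAllNA and isConstant are made up for this illustration and are not part of any package.</p>
<pre>
# rebuild the probe data frame (values are random, only the types matter)
d = data.frame(
   x1=rnorm(100),
   x2=sample(100),
   x3=0*(1:100),
   x4=sample(c('a','b','c'),size=100,replace=TRUE),
   x5=as.factor(sample(c('d','e','f'),size=100,replace=TRUE)),
   x6=sample(c(FALSE,TRUE),size=100,replace=TRUE),
   x7=NA + 1:100,
   stringsAsFactors=FALSE)
# class of every column
print(sapply(d, class))
# hypothetical helpers: flag degenerate columns
isAllNA = function(v) all(is.na(v))
isConstant = function(v) length(unique(v[!is.na(v)])) == 1
print(names(d)[sapply(d, isAllNA)])
print(names(d)[sapply(d, isConstant)])
</pre>
<p>Columns flagged here are the ones most likely to trip up a new package, so it is worth knowing about them before you fit anything.</p>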
<p>We now augment our data frame with one more input (a duplicated variable) and an output (to be predicted):</p>
<pre>
d$x8 = d$x1
# keep each "+" at the end of the line so R continues the expression
d$y = rnorm(100) +
   with(d,20*exp(x1) + x2 +
   7*as.integer(as.factor(x4)) +
   9*as.integer(x5) + 10*as.integer(x6))
</pre>
<p>This is our simple test data set.  Our standard method of fitting a linear model for y (before trying the gam package) is either to (falsely) assume x1&#8217;s contribution is linear or to (amazingly) guess the exact transformation required is exp(x1).</p>
<p>In the first case the model looks like this (using the ggplot2 package):</p>
<p>(As a side note: statisticians usually cluck if you ask for a graph showing &#8220;truth&#8221; as a function of &#8220;prediction&#8221; (or y ~ f(x)).  They say you should be looking at residuals (or (y - f(x)) ~ f(x)) instead (and usually only supply functions that produce residual plots).  But in their own publications, such as the paper we started with, when they want to make a point they actually plot truth as a function of prediction.)</p>
<pre>
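library(ggplot2)
m1 &lt;- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x8, data=d)
# trailing "+" keeps the plot layers in one expression; theme() replaces the older opts()
ggplot(d,aes(predict(m1),y)) + geom_point(shape=1) +
   geom_abline(slope=1) + theme(aspect.ratio=1)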
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/02/m1.png" alt="m1.png" border="0" width="525" height="525" /><br />
</center></p>
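<p>For comparison, here is a sketch of both views on a deliberately tiny example, using only base R graphics.  The data frame dd and the model mm are made up for this illustration and are not the models above.  The two plots carry the same information; the residual plot just subtracts the prediction from the vertical axis.</p>
<pre>
dd = data.frame(x=rnorm(100))
dd$y = 20*exp(dd$x) + rnorm(100)
mm = lm(y ~ x, data=dd)      # deliberately wrong: linear in x
pred = predict(mm)
res = dd$y - pred            # same values as residuals(mm)
par(mfrow=c(1,2))
plot(pred, dd$y, main='truth ~ prediction')
abline(0, 1)                 # perfect predictions lie on this line
plot(pred, res, main='residual ~ prediction')
abline(h=0)                  # perfect predictions lie on this line
</pre>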
<p>And in the second case the model looks like this:</p>
<pre>
m2 &lt;- lm(y ~ exp(x1) + x2 + x3 + x4 + x5 + x6 + x8, data=d)
ggplot(d,aes(predict(m2),y)) + geom_point(shape=1) +
   geom_abline(slope=1) + theme(aspect.ratio=1)
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/02/m2.png" alt="m2.png" border="0" width="525" height="525" /><br />
</center></p>
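<p>The difference between the two fits can also be put into a single number.  The following is a minimal base-R sketch on a cut-down version of the test data (only x1 and x2); the names d2, mLin, mExp and the helper rmse are made up for this illustration.</p>
<pre>
# hypothetical helper: root mean squared error of a fitted model
rmse = function(model, truth) sqrt(mean((truth - predict(model))^2))
d2 = data.frame(x1=rnorm(100), x2=sample(100))
d2$y = rnorm(100) + 20*exp(d2$x1) + d2$x2
mLin = lm(y ~ x1 + x2, data=d2)        # assumes x1 enters linearly
mExp = lm(y ~ exp(x1) + x2, data=d2)   # guesses the exact transform
print(c(linear=rmse(mLin, d2$y), exact=rmse(mExp, d2$y)))
</pre>
<p>The exact numbers change from run to run since the data is random, but the model with the guessed transform should have a much smaller error.</p>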
<p>(In both cases we had to leave out x7, which was all NAs.) Notice how knowing the functional form of x1 moves us from a good fit to an extraordinary fit.  We would like to automatically learn the form of x1&#8217;s contribution from data and get this better fit automatically.  This is in fact the point of the gam package.  To perform the gam fit we add the smoothing symbol s() to each variable whose possibly non-linear contribution we want to try to learn.</p>
<pre>
mG &lt;- gam(y ~ s(x1) + s(x2) + x3 + x4 + x5 + x6 + s(x8), data=d)
ggplot(d,aes(predict(mG),y)) + geom_point(shape=1) +
   geom_abline(slope=1) + theme(aspect.ratio=1)
</pre>
<p>We could not add the s() symbol to any of the un-ordered factors, strings, logicals, the NAs or constant columns.  Except for that, the gam package performs as robustly as the built-in lm() function and produces a fit essentially as good as knowing the shape of x1&#8217;s contribution ahead of time:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/02/mG.png" alt="mG.png" border="0" width="525" height="525" /><br />
</center></p>
<p>That is the good part.  But it just wouldn&#8217;t be an R package without some bad parts.</p>
<p>The first problem is: at first glance the gam package&#8217;s plot() appears to be broken (or at least not a compatible extension of plot.lm or plot.glm, even though gam claims to extend both of those classes).  Traditionally when you call plot() on a model it steps you through a bunch of arcane graphs that statisticians swear are more important than examining the fit directly.  But plot(mG) seems to step through a different family of graphs without waiting for user input, and we are left with only the graph that presumably shows the shape adjustment of variable x8.  Only if you anticipate that plot() is different for gam than for other models (or dump the function code for plot.gam) do you learn you need to add the argument ask=T.   plot(mG,ask=T)  enters an interactive mode where you can see the inferred shape of each variable.  That is, there is an argument that defaults to a value you don&#8217;t want (or as we say in industry: &#8220;you forgot to set the don&#8217;t lose flag.&#8221;).  gam is actually a very high quality package, but these sorts of &#8220;poison defaults&#8221; are something you have to be in the habit of looking out for in R.</p>
<p>The second problem is intrinsic to the method: we are not guaranteed that either s(x1) or s(x8) looks much like exp(x1).  It is only guaranteed that some linear combination of them does (as that was how they were used in the model).  We can get direct access to the learned reshapings by calling predict() to ask for term contributions and see how the model is linear in the transformed coordinates (essentially with all coefficients 1).</p>
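<p>One cheap habit that catches this kind of default is to list a method&#8217;s arguments before calling it.  The sketch below does this for the plot method of an lm fit, using only functions that ship with R.  For a gam fit you would substitute the class reported by class(mG); the exact name of that method depends on the installed version of the package, so treat that substitution as something to check rather than a given.</p>
<pre>
# look up the plot method that R dispatches to for an lm fit
f = getS3method('plot', 'lm')
# list the argument names, then show the default expression for 'ask'
print(names(formals(f)))
print(formals(f)$ask)
</pre>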
<pre>
pG = predict(mG,type='terms')
summary(lm(d$y ~ pG))

Call:
lm(formula = d$y ~ pG)
Residuals:
    Min      1Q  Median      3Q     Max
-3.5690 -0.7121 -0.0288  0.6924  5.3883
Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(&gt;|t|)
(Intercept) 1.125e+02  1.363e-01  824.92   &lt;2e-16 ***
pGs(x1)     9.994e-01  6.486e-03  154.09   &lt;2e-16 ***
pGs(x2)     1.002e+00  4.905e-03  204.19   &lt;2e-16 ***
pGx3               NA         NA      NA       NA
pGx4        9.973e-01  2.548e-02   39.14   &lt;2e-16 ***
pGx5        1.004e+00  1.926e-02   52.10   &lt;2e-16 ***
pGx6        9.967e-01  2.796e-02   35.64   &lt;2e-16 ***
pGs(x8)     1.057e+00  2.520e-02   41.96   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</pre>
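<p>If the idea of &#8220;term contributions&#8221; is new, it can be rehearsed on a plain lm() before trusting it on a gam fit.  The following base-R sketch uses made-up names (t2, mT, tm, recon) and shows that, for lm, the per-term columns plus the stored constant add back up to the ordinary predictions.  Whether a gam fit stores its constant the same way is an assumption to verify on your own model.</p>
<pre>
t2 = data.frame(x1=rnorm(100), x2=sample(100))
t2$y = rnorm(100) + 20*exp(t2$x1) + t2$x2
mT = lm(y ~ exp(x1) + x2, data=t2)
tm = predict(mT, type='terms')               # one centered column per term
recon = rowSums(tm) + attr(tm, 'constant')   # add the centering constant back
print(colnames(tm))
print(all.equal(unname(recon), unname(predict(mT))))
</pre>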
<p>We can examine the transforms (like s(x8) = f(x8)) by plotting:</p>
<pre>
ggplot(d) + geom_point(aes(x8,pG[,'s(x8)']))
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/02/s8.png" alt="s8.png" border="0" width="525" height="525" /><br />
</center></p>
<p>Even though s(x8) has a funny shape, it and s(x1) together are an excellent approximation of exp(x1) (with commensurate magnitudes):</p>
<pre>
&gt; mE &lt;- lm(exp(d$x1)~pG[,'s(x1)']+pG[,'s(x8)'])
&gt; summary(mE)

Call:
lm(formula = exp(d$x1) ~ pG[, "s(x1)"] + pG[, "s(x8)"])
Residuals:
      Min        1Q    Median        3Q       Max
-0.118086 -0.014125  0.002899  0.023389  0.217152
Coefficients:
               Estimate Std. Error t value Pr(&gt;|t|)
(Intercept)   1.2536112  0.0046410  270.12   &lt;2e-16 ***
pG[, "s(x1)"] 0.0500079  0.0002126  235.25   &lt;2e-16 ***
pG[, "s(x8)"] 0.0534411  0.0008349   64.01   &lt;2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04641 on 97 degrees of freedom
Multiple R-squared: 0.9987,	Adjusted R-squared: 0.9986
F-statistic: 3.588e+04 on 2 and 97 DF,  p-value: &lt; 2.2e-16
</pre>
<p>The functions do have different variances (s(x1) is doing most of the work):</p>
<pre>
&gt; var(pG[,'s(x1)'])
[1] 514.8578
&gt; var(pG[,'s(x8)'])
[1] 33.37209
</pre>
<p>Yet the coefficients of mG seem to be gibberish (notice the 0 on s(x8)):</p>
<pre>
&gt; mG$coefficients
(Intercept)       s(x1)       s(x2)          x3         x4b         x4c         x5e
  45.580747   22.749160    1.007585    0.000000    7.048177   13.346365    8.740460
        x5f      x6TRUE       s(x8)
   18.003981   10.102217    0.000000
</pre>
<p>So by poking around we have learned <i>not</i> to look at this slot of the returned model (and it is much cheaper to learn this through this cranky poking around on a trivial example than to learn it while trying to analyze real data or blundering through R&#8217;s overly operational documentation).</p>
<p>The third (and last) problem is one of attitude (and one of the barriers to learning statistics).   There is not a lot of support for exporting the derived gam smoothers (the transforms on the input variables) from R.  The original paper suggests that you should think of the non-parametric smoothers as graphs and use linear interpolation between your data points.  You can do this by calling &#8220;predict(mG,type='terms')&#8221; as we did above.  Or you can try to switch to parametric splines and then run into the same problem: the splines package is not really export friendly.  Or you can ask around.  The R community is generally quite bright and friendly, but every once in a while you get a whiff of statistics territorialism (or perhaps a defensiveness, where if you are correct but not fully general you fear you will look shallow).  Sensible requirements, like wanting to export usable model parameters to another system, are considered naive.  A favorite example of mine: in this <a target="_blank" href="http://r.789695.n4.nabble.com/which-coefficients-for-a-gam-mgcv-model-equation-td1578925.html">help thread</a> somebody who is asking how to configure and export explicit spline transforms to meet an external requirement to get their paper published is advised: &#8220;I think that the referee is being unreasonable here. There are many perfectly respectable ways of estimating GAMs for which no explicit expression for the estimated smooth terms is available (See Hastie and Tibshirani&#8217;s GAM book).&#8221;  And then offered additional references to help educate the referee (instead of recognizing that an explicit sharable solution in hand could, in some cases, be more useful than a maximally general solution that can&#8217;t be communicated succinctly).  It is like lecturing a drowning man on how important water is to fish.</p>
<p>In summary: never try a new R package first on real data.  R packages are often realizations of very deep concepts from the literature that bring in their own terminology, trade-offs and attitudes.  You need time to absorb these things in isolation.  Expect and forgive non-essential errors (many important and valuable packages have them).  Approach new packages with a cranky inquisitiveness about the package; otherwise you may actually fall into a non-productive state of frustration.</p>
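<p>The &#8220;graph plus linear interpolation&#8221; export can be sketched in a few lines of base R.  Here x and fx are stand-ins made up for this illustration; with a real fit you would use a column such as d$x8 and the matching column of the term predictions instead.  The table tab is the thing you would ship to the other system, and approxfun() shows what that system has to implement.</p>
<pre>
x = sort(rnorm(100))
fx = 20*exp(x) - mean(20*exp(x))        # stand-in for a centered learned term
tab = data.frame(x=x, fx=fx)            # the exportable lookup table
# linear interpolation between table rows; rule=2 holds the end values outside the range
f = approxfun(tab$x, tab$fx, rule=2)
print(f(c(min(x) - 1, 0, max(x) + 1)))
</pre>
<p>Holding the end values outside the observed range is a design choice made for this sketch, not something the original paper prescribes; pick whatever extrapolation rule the receiving system can defend.</p>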
<p>Also, gam methods are amazing.</p>
<p>(note: we also recommend trying the mgcv package for gam modeling.  It represents a different set of tradeoffs.)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

