<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; John Mount</title>
	<atom:link href="http://www.win-vector.com/blog/author/john-mount/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Ergodic Theory for Interested Computer Scientists</title>
		<link>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ergodic-theory-for-interested-computer-scientists</link>
		<comments>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/#comments</comments>
		<pubDate>Sat, 04 Feb 2012 17:42:07 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Ergodic Theorem]]></category>
		<category><![CDATA[Gibbs Sampler]]></category>
		<category><![CDATA[Markov Chains]]></category>
		<category><![CDATA[Random Sampling]]></category>
		<category><![CDATA[Randomized Algorithms]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1933</guid>
		<description><![CDATA[We describe ergodic theory in modern notation accessible to interested computer scientists. The ergodic theorem (http://en.wikipedia.org/wiki/Ergodic_theory) is an important principle of recurrence and averaging in dynamical systems. However, there are some inconsistent uses of the term, much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/' rel='bookmark' title='Hello World: An Instance Of Rhetoric in Computer Science'>Hello World: An Instance Of Rhetoric in Computer Science</a></li>
<li><a href='http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/' rel='bookmark' title='Six Fundamental Methods to Generate a Random Variable'>Six Fundamental Methods to Generate a Random Variable</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We describe ergodic theory in modern notation accessible to interested computer scientists.</p>
<p>The <a target="_blank" href="http://en.wikipedia.org/wiki/Ergodic_theory">ergodic theorem</a> is an important principle of recurrence and averaging in dynamical systems. However, the term is used inconsistently, much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is often implied), and the conclusion of the theory is often mis-described as its premises.</p>
<p>By “interested computer scientists” we mean people who know math and work with probabilistic systems, but know not to accept mathematical definitions without some justification (actually a good attitude for mathematicians also).<span id="more-1933"></span>Please click through to read <a target="_blank" href="http://www.win-vector.com/dfiles/ErgodicTheory.pdf">Ergodic Theory for Interested Computer Scientists</a>.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/' rel='bookmark' title='Hello World: An Instance Of Rhetoric in Computer Science'>Hello World: An Instance Of Rhetoric in Computer Science</a></li>
<li><a href='http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/' rel='bookmark' title='Six Fundamental Methods to Generate a Random Variable'>Six Fundamental Methods to Generate a Random Variable</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Six Fundamental Methods to Generate a Random Variable</title>
		<link>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=six-fundamental-methods-to-generate-a-random-variable</link>
		<comments>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/#comments</comments>
		<pubDate>Fri, 20 Jan 2012 19:23:38 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Ergodic Theory]]></category>
		<category><![CDATA[Markov Chains]]></category>
		<category><![CDATA[Markov Monte Carlo]]></category>
		<category><![CDATA[Random Sampling]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1925</guid>
		<description><![CDATA[Introduction To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them? The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/' rel='bookmark' title='What is a large enough random sample?'>What is a large enough random sample?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<h2> Introduction</h2>
<p>To implement many numeric simulations you need a sophisticated source of instances of random variables.  The question is: how do you generate them?  </p>
<p>The literature is full of algorithms requiring random samples as inputs or drivers (<a target="_blank" href="http://en.wikipedia.org/wiki/Conditional_random_field">conditional random fields</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Bayesian_network">Bayesian network models</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Particle_filter">particle filters</a> and so on). The literature is also full of competing methods (<a target="_blank" href="http://en.wikipedia.org/wiki/Pseudorandom_number_generator">pseudorandom generators</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy sources</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs samplers</a>, <a target="blank" href="http://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis–Hastings algorithm</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov chain Monte Carlo methods</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Bootstrapping">bootstrap methods</a> and so on).  Our thesis is: this diversity is supported by only a few fundamental methods.  And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box &#8220;brand name&#8221; technique. </p>
<p>We will discuss the half dozen basic methods that all of these techniques are derived from.<span id="more-1925"></span>To our mind all of the famous random variate generation/sampling techniques are derived from combinations of the following six fundamental methods:</p>
<ol>
<li>Physical sources.</li>
<li>Empirical resampling.</li>
<li>Pseudo random generators.</li>
<li>Simulation/Game-play.</li>
<li>Rejection Sampling.</li>
<li>Transform methods.</li>
</ol>
<p>The technical fights (such as: &#8220;is Gibbs sampling superior to, or even distinguishable from, Markov chain Monte Carlo?&#8221;) are all in the details, history and citation conventions.   Each field and particular method accretes its own traditions.  We will quickly discuss the fundamental methods we listed.  As we will see: complexity goes up as we move through the list (so at some point things are no longer fundamental but instead derived, allowing us to end the list).</p>
<h2>The Methods</h2>
<h3>Physical sources</h3>
<p>This is the most basic way (though not as practical in the computer age) to generate random variables.  Observe the flip of a real coin, shuffle actual cards, mix numbered balls or count the number of ticks from an actual radioactive source.  In all of these the randomness comes from physical principles (such as <a target="_blank" href="http://en.wikipedia.org/wiki/Chaos_theory">chaotic dynamics</a> for coin flips or <a target="_blank" href="http://en.wikipedia.org/wiki/Quantum_mechanics">quantum mechanics</a> for radioactive decay).</p>
<p>These sources are &#8220;outside of computer science&#8221; so we will say the least about them.</p>
<h3>Empirical resampling</h3>
<p>This is what used to be called &#8220;tables&#8221; (which were themselves often generated from physical processes).  The observation is that sometimes, to run a simulation, you need access to instances of random variables that are distributed in a very precise way, but you don&#8217;t have a usable description of the desired distribution.  You would think that in this case you could do nothing.  But the principle of empirical resampling is that you can approximately generate new samples by drawing (with replacement, so repeats are allowed) from an old sample.  This is the cornerstone of bootstrap methods.</p>
<p>As an example: suppose we were given the sample of numbers 5, 5, 10, 5, 5, which has mean equal to 6.  Further suppose we have no description of how these numbers were generated, but we want to know if a mean greater than 8 is likely or unlikely for five more numbers drawn the same way.  We can approximate this by drawing many samples of size five from the original sample (allowing the same number to be in our new sample multiple times).  A resampled mean above 8 requires at least four of the five draws to be the 10, so the bootstrap estimate of the probability is 21/3125, or around 0.7% (if a mean of exactly 8 is also counted the estimate rises to 181/3125, about 5.8%).</p>
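<p>A minimal sketch of this bootstrap calculation follows.  The function names, the trial count and the seed are our own illustrative choices.  It estimates the chance that the resampled mean is strictly greater than 8 by simulation, and also enumerates all 5^5 = 3125 equally likely resamples to get the exact bootstrap value for comparison.</p>

```python
import itertools
import random


def bootstrap_tail_estimate(sample, threshold, trials=100_000, seed=0):
    """Monte Carlo estimate of P(mean of a resample > threshold)."""
    rng = random.Random(seed)
    n = len(sample)
    hits = 0
    for _ in range(trials):
        # Draw n values from the old sample, with replacement.
        resample = [rng.choice(sample) for _ in range(n)]
        if sum(resample) / n > threshold:
            hits += 1
    return hits / trials


def bootstrap_tail_exact(sample, threshold):
    """Exact bootstrap value: enumerate every equally likely resample."""
    n = len(sample)
    hits = 0
    total = 0
    for resample in itertools.product(sample, repeat=n):
        total += 1
        if sum(resample) / n > threshold:
            hits += 1
    return hits / total


sample = [5, 5, 10, 5, 5]
exact = bootstrap_tail_exact(sample, 8)        # 21/3125
estimate = bootstrap_tail_estimate(sample, 8)  # simulation, close to exact
print(exact, estimate)
```

<p>Enumeration is only feasible because the sample is tiny; for realistic sample sizes the simulated estimate is the practical option.</p>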
<p>This may seem trivial, but it is very important.</p>
<h3>Pseudo random generators</h3>
<p>In the computer age, to avoid the need for external tables or expensive and slow peripherals, we tend to use pseudo random generators.  That is, we treat the output of deterministic iterative procedures as equivalent to true random sources.  The science of pseudo randomness has evolved from cobbled together procedures passing ad-hoc tests (such as in Knuth Volume 2) to more formal pseudo randomness based on important properties (like provably being k-wise independent) or complexity (being computationally indistinguishable from a truly random source on a time or space bounded machine).  Behind the canned routines of all of the basic &#8220;random generators&#8221; commonly available is a pseudo random source.  </p>
<p>Good references for the modern theory include: 	</p>
<ul>
<li>
&#8220;Pseudorandomness and Cryptographic Applications&#8221; Michael Luby 1996.
</li>
<li>
&#8220;Modern Cryptography, Probabilistic Proofs and Pseudorandomness&#8221; Oded Goldreich, 1999.
</li>
</ul>
<p>The most basic form of a sequential pseudo random generator is a sequence of states s(1), s(2), s(3), &#8230; where s(i+1) = g(s(i)) and g() is our deterministic function that maps state to state.  The observed random variables are then h(s(i)) where h() is some deterministic function that maps state to observables.  For example for the <a target="_blank" href="http://en.wikipedia.org/wiki/Linear_congruential_generator">linear congruential generator</a> found in glibc we have g(x) = (1103515245*x + 12345) modulo 2^32 and h(x) = x modulo 2^30 (x an integer from 0 to 2^32 &#8211; 1).  An example application: the output of this generator, when divided by (2^30 &#8211; 1), might return numbers passably uniformly distributed in the interval [0,1].  Two such variates might be used as a uniform sample from the unit square.</p>
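<p>To make the state/observation split concrete, here is a small sketch of the generator using g() and h() exactly as written above.  The constants are the ones quoted in the text (we have not checked them against the glibc source), and the helper names uniform_variates and unit_square_points are our own.</p>

```python
MULTIPLIER = 1103515245
INCREMENT = 12345


def g(state):
    """State update: s(i+1) = g(s(i))."""
    return (MULTIPLIER * state + INCREMENT) % 2**32


def h(state):
    """Observation map: reduce the state modulo 2^30."""
    return state % 2**30


def uniform_variates(seed, count):
    """Return `count` floats in [0, 1] driven by the generator."""
    state = seed
    values = []
    for _ in range(count):
        state = g(state)
        values.append(h(state) / (2**30 - 1))
    return values


def unit_square_points(seed, count):
    """Pair up consecutive variates as points in the unit square."""
    flat = uniform_variates(seed, 2 * count)
    return list(zip(flat[0::2], flat[1::2]))


print(unit_square_points(seed=42, count=3))
```

<p>The same seed always reproduces the same stream, which is the sense in which the source is deterministic.</p>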
<p>That a simple iterated deterministic system (like the modulo arithmetic or even a physical system like coin flipping) would even superficially appear random (let alone be safe to use as a pseudo random source) turns out to be the main consequence of <a target="_blank" href="http://en.wikipedia.org/wiki/Ergodic_theory">Ergodic theory</a> (which we will touch on in a later article).  The point is: it should not be obvious (without bringing in some more theory) why you should trust pseudo-random sources.</p>
<h3>Simulation/Game-play</h3>
<p>Another fundamental method is direct simulation or game play.  Suppose we wanted a random variable that was 1 with probability equal to the odds of being dealt a full house from a standard shuffled deck of 52 cards (and 0 otherwise).  We can generate such a variable by simulating shuffling a deck, drawing a hand and returning 1 if the hand drawn is a full house (and returning 0 otherwise).  Notice in this case we are combining many random variables to get a single result.</p>
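<p>A sketch of the full house variable, under our own encoding of cards as (rank, suit) pairs.  Each call deals five cards from a freshly shuffled deck and reports 1 or 0; averaging many such indicators should land near the exact value 3744/2598960 (about 0.144%).</p>

```python
import random
from collections import Counter

# A card is a (rank, suit) pair: 13 ranks by 4 suits gives 52 cards.
DECK = [(rank, suit) for rank in range(13) for suit in range(4)]


def is_full_house(hand):
    """True when the hand is three cards of one rank plus two of another."""
    rank_counts = sorted(Counter(rank for rank, _ in hand).values())
    return rank_counts == [2, 3]


def full_house_indicator(rng):
    """Deal 5 cards without replacement; return 1 for a full house, else 0."""
    hand = rng.sample(DECK, 5)
    return 1 if is_full_house(hand) else 0


def estimate_full_house_probability(trials=200_000, seed=0):
    """Average the indicator over many simulated deals."""
    rng = random.Random(seed)
    return sum(full_house_indicator(rng) for _ in range(trials)) / trials
```

<p>Here rng.sample stands in for &#8220;shuffle, then take the top five cards&#8221;; both give every five card hand the same chance.</p>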
<p>One of the most important simulation techniques is the family of Markov chain Monte Carlo methods (related to Gibbs sampling, simulated annealing and many other variations).  These methods implement a complex procedure over a stream of random inputs to generate a more difficult to achieve sequence of random outputs.</p>
<p>For example:  Let T be the set of pairs of non-negative integers x, y such that x + y &le; 1000.   We could implement a Markov chain on this set from a source of coin flips.  Given a point (x,y) in T we take three coin flips and move to a new point (x&#8217;,y&#8217;) (also in T) using the following procedure:</p>
<ol>
<li>Let m = 1 if the first flip is heads and m=0 if the first flip is tails.</li>
<li>Let v = (1,0) if the second flip is heads and v=(0,1) if the second flip is tails.</li>
<li>Let d = +1 if the third flip is heads and d = -1 if the third flip is tails.</li>
<li>If (x,y) + m*d*v is in T let (x&#8217;,y&#8217;) = (x,y) + m*d*v, otherwise let (x&#8217;,y&#8217;) = (x,y) (stay put).</li>
</ol>
<p>Repeating this procedure a large number of times produces a sequence of points (x,y) whose long-run distribution is uniform on T (again this follows from ergodic principles).  This simulation, or game, of following a Markov chain is a very fundamental method for generating more complicated random variates; its correctness is something we will write more about in an article dealing with the ergodic principle (the relation of connectedness to showing averages over time equal averages over space).</p>
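<p>The three-flip procedure above can be sketched as follows.  The names in_T, step and walk are ours, and the limit parameter (1000 in the text) is exposed only so that the chain can be tried on a tiny triangle where the uniform behaviour shows up quickly.</p>

```python
import random
from collections import Counter


def in_T(point, limit=1000):
    """True when point is a pair of non-negative integers with x + y at most limit."""
    x, y = point
    return min(x, y) >= 0 and limit >= x + y


def step(point, rng, limit=1000):
    """One move of the chain, driven by three coin flips."""
    m = rng.getrandbits(1)                         # flip 1: move (1) or hold (0)
    v = (1, 0) if rng.getrandbits(1) else (0, 1)   # flip 2: which axis
    d = 1 if rng.getrandbits(1) else -1            # flip 3: which direction
    x, y = point
    candidate = (x + m * d * v[0], y + m * d * v[1])
    # Stay put when the proposed move would leave the triangle.
    return candidate if in_T(candidate, limit) else point


def walk(start, steps, seed=0, limit=1000):
    """Follow the chain for `steps` moves and return every visited point."""
    rng = random.Random(seed)
    point = start
    visited = []
    for _ in range(steps):
        point = step(point, rng, limit)
        visited.append(point)
    return visited


# On the 6-point triangle (limit=2) each point is visited about 1/6 of the time.
counts = Counter(walk((0, 0), 200_000, seed=1, limit=2))
print({p: round(c / 200_000, 3) for p, c in sorted(counts.items())})
```

<p>For the full triangle with limit 1000 the chain needs far more steps before its visits look uniform, since it only moves one unit at a time.</p>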
<p>For simple shapes (rectangles, triangles) there are more efficient ways to generate points uniformly at random.  For squares we exploit independence and just generate the coordinates independently.  For triangles we could rejection sample from a bounding rectangle.   Or we could use a transform method: write down a counting function that indexes all the points in the triangle and generate points by index (for example it is easy to work out there are 501501 points in our example T, so if we generate a random integer uniformly from 1 to 501501 we can just pick the point with the given index as our sample).</p>
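<p>One possible counting function for the triangle of pairs with x + y at most 1000 is sketched below.  The row-by-row ordering and the function names are our own choice (any one-to-one indexing works), and we index from 0 rather than 1 to match Python conventions.</p>

```python
import random


def triangle_size(limit):
    """Number of pairs (x, y) of non-negative integers with x + y at most limit."""
    return (limit + 1) * (limit + 2) // 2


def point_from_index(index, limit):
    """Map an index in 0 .. triangle_size(limit) - 1 to a distinct point."""
    x = 0
    row = limit + 1          # row x holds the points with y = 0 .. limit - x
    while index >= row:
        index -= row
        x += 1
        row -= 1
    return (x, index)


def uniform_triangle_point(rng, limit=1000):
    """Pick a uniformly random index, then look up its point."""
    return point_from_index(rng.randrange(triangle_size(limit)), limit)


print(triangle_size(1000))
print(uniform_triangle_point(random.Random(0)))
```

<p>Unlike the Markov chain, every call here returns an exactly uniform point and successive points are independent.</p>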
<p>For general convex shapes (in high dimensions) these methods become intractable and Markov chain methods are one of the few options remaining.</p>
<h3>Rejection Sampling</h3>
<p>Rejection sampling is another way to convert one sequence of random variables into another.  If we assume we can generate a random variable according to the distribution p(x) we can &#8220;rejection sample&#8221; to a new distribution using an &#8220;acceptance function&#8221; q(x) which returns a number in the interval [0,1].  Our procedure is to repeat the following: generate x with probability p(x) and generate a random variable y uniformly in the interval [0,1]; if y &le; q(x) accept x as our answer and quit (otherwise draw a new x and repeat).</p>
<p>The distribution that rejection sampling draws from is such that if x and y originally had relative odds of being drawn of p(x)/p(y), then under the rejection procedure they have relative odds of (p(x)q(x))/(p(y)q(y)).  An important special case is when q() is always 0 or 1; in this case we are drawing with relative odds proportional to p(x) from the subset of x with q(x)=1.</p>
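<p>A small sketch of the general procedure, with a toy example of our own: p() is uniform on {0, 1, 2} and the acceptance function q() takes the values 1, 0.5 and 0.  The relative odds p(x)q(x) then predict that 0 is returned about twice as often as 1, and 2 is essentially never returned.</p>

```python
import random
from collections import Counter


def rejection_sample(draw, accept, rng):
    """Repeat: draw x, draw uniform y in [0, 1], accept x when y is at most accept(x)."""
    while True:
        x = draw(rng)
        y = rng.random()
        if accept(x) >= y:
            return x


def draw_p(rng):
    """Illustrative source distribution p(): uniform on {0, 1, 2}."""
    return rng.randrange(3)


ACCEPTANCE = {0: 1.0, 1: 0.5, 2: 0.0}   # illustrative acceptance function q()

rng = random.Random(0)
counts = Counter(rejection_sample(draw_p, ACCEPTANCE.get, rng) for _ in range(30_000))
print({x: counts[x] / 30_000 for x in (0, 1, 2)})
```

<p>The sampler never needs the normalizing constant of the target distribution; only the ability to draw from p() and to evaluate q() is used.</p>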
<p>As an example: consider the problem of trying to draw a point (x,y) such that x^2 + y^2 &lt; 1 (the open unit disk) uniformly at random.  The rejection sampling solution is: repeat the following until you have a success: generate x and y independently uniformly in the interval [-1,1]; if x^2 + y^2 &lt; 1 then accept them as our sample (otherwise repeat).  This procedure is very fast as the unit disk that represents our acceptance region has area pi and the square we are generating trials from has area 4: so we have over a 78% chance of success on each trial, or expect to have to run fewer than 1.28 trials (on average) to get a sample.</p>
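<p>The unit disk example can be sketched directly.  The function names are ours; the sampler also reports how many trials it used, so that the average can be compared with the predicted 4/pi (about 1.27) trials per sample.</p>

```python
import math
import random


def sample_unit_disk(rng):
    """Return ((x, y), trials) with (x, y) uniform in the open unit disk."""
    trials = 0
    while True:
        trials += 1
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if 1.0 > x * x + y * y:      # accept only points inside the disk
            return (x, y), trials


def average_trials(samples=100_000, seed=0):
    """Average number of trials needed per accepted point."""
    rng = random.Random(seed)
    return sum(sample_unit_disk(rng)[1] for _ in range(samples)) / samples


print(average_trials(), 4 / math.pi)
```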
<h3>Transform methods</h3>
<p>A transform method is used when we have the ability to generate instances of a random variable according to one distribution and we would like instances according to another distribution.</p>
<p>One method is used when we have access to the inverse of the <a target="_blank" href="http://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution function</a> of the distribution we are trying to generate.  In this case we can use this function to convert uniform variates from the interval [0,1] into our target distribution.  The cumulative distribution function is the function cdf() where cdf(x) is the probability a random variate generated according to our distribution is less than or equal to x.  The inverse function is the function icdf() where icdf(y) is such that cdf(icdf(y)) = y.  For example the <a target="_blank" href="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a> has an inverse cumulative distribution function icdf(y) = -ln(1-y)/lambda.  So if y is generated uniformly in the interval [0,1] then icdf(y) is a random variable generated according to the exponential distribution with parameter lambda.</p>
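<p>A sketch of the inverse cumulative distribution method for the exponential distribution.  The function names are ours and the rate is called lam because lambda is a reserved word in Python.  A quick sanity check is that the sample mean should be close to 1/lam.</p>

```python
import math
import random


def exponential_cdf(x, lam):
    """cdf(x): probability that an exponential variate is at most x."""
    return 1.0 - math.exp(-lam * x)


def exponential_icdf(y, lam):
    """icdf(y) = -ln(1 - y) / lam, chosen so that cdf(icdf(y)) = y."""
    return -math.log(1.0 - y) / lam


def sample_exponential(rng, lam):
    """Push a uniform variate through the inverse cdf."""
    # rng.random() lies in [0, 1), so 1 - y is never zero.
    return exponential_icdf(rng.random(), lam)


rng = random.Random(0)
draws = [sample_exponential(rng, 2.0) for _ in range(100_000)]
print(sum(draws) / len(draws))   # should be near 1 / 2.0
```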
<p>A great example of transform methods is generating Gaussian random variables.  We could directly use the inverse cumulative distribution function method described above, but to do this we would require a special function library to perform the required calculation of the inverse cumulative distribution (or inverse of <a target="_blank" href="http://en.wikipedia.org/wiki/Error_function">erf()</a>).  Another way is the <a target="_blank" href="http://en.wikipedia.org/wiki/Marsaglia_polar_method">polar method</a>: generate x,y uniformly from the open unit disk (by, for example, rejection sampling as described earlier), set s = x^2 + y^2 and return  x*sqrt(-2 ln(s)/s),  y*sqrt(-2 ln(s)/s) as two independent Gaussian random variables.   The trick being: the distance r of such a Gaussian pair from the origin has a density of the form r*e^(-r*r/2), which leads to an elementary cumulative distribution function (unlike the original Gaussian density of the form e^(-r*r/2)) that is easy to invert.</p>
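<p>The polar method combines two of the fundamental methods (rejection sampling, then a transform), which a short sketch makes visible.  The function name is ours, and we additionally reject the single point s = 0 so that the logarithm and the division are always defined.</p>

```python
import math
import random


def gaussian_pair(rng):
    """Return two independent standard Gaussian variates via the polar method."""
    while True:
        # Rejection step: a uniform point in the open unit disk (origin excluded).
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        s = x * x + y * y
        if 1.0 > s > 0.0:
            break
    # Transform step: rescale the point so its radius is sqrt(-2 ln(s)).
    scale = math.sqrt(-2.0 * math.log(s) / s)
    return x * scale, y * scale


rng = random.Random(0)
values = [v for _ in range(50_000) for v in gaussian_pair(rng)]
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)
print(mean, variance)   # should be near 0 and 1
```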
<h2>Conclusion</h2>
<p>Our thesis is: all major methods to generate random variables use aspects of the six methods we have listed here as fundamental.  At the very least, you should have a fluid understanding of these methods.  You should be able to break down big &#8220;brand name&#8221; methods (like Gibbs sampling) roughly into their constituent parts (so you can reason about them).   One example: notice how ratios of probabilities enter into Markov chain Monte Carlo methods (they cause step rejections); from this you can reason that if your problem has bounded ratios it is a good candidate for direct application of the technique (and if it does not you need to add some more ideas, as was demonstrated in:  <a target="_blank" href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.9794">&#8220;Sampling from Log-Concave Distributions,&#8221; Alan Frieze, Ravi Kannan, Nick Polson, Ann. Appl. Prob, 1994</a>).</p>
<p>The first two methods we discuss (physical sources and empirical re-sampling) are of the class of solutions &#8220;already have the right answer.&#8221;  Pseudo random generators are the primary way to negate the need for physical sources and resampling techniques.  Simulation, rejection sampling and transform methods are the main tools for building new distributions out of old.</p>
<p>It is a matter of taste if a given trick fits into this ad-hoc taxonomy or not.   You can invent new and better generation methods, but these methods are easily derived using ideas from the fundamental methods we mentioned here.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/' rel='bookmark' title='What is a large enough random sample?'>What is a large enough random sample?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Importance Sampling</title>
		<link>http://www.win-vector.com/blog/2012/01/importance-sampling/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=importance-sampling</link>
		<comments>http://www.win-vector.com/blog/2012/01/importance-sampling/#comments</comments>
		<pubDate>Sun, 01 Jan 2012 17:31:59 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Change of Density]]></category>
		<category><![CDATA[Cross Entropy Method]]></category>
		<category><![CDATA[Entropy]]></category>
		<category><![CDATA[Importance Sampling]]></category>
		<category><![CDATA[Mortgage Default]]></category>
		<category><![CDATA[Numeric Methods]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1915</guid>
		<description><![CDATA[We describe briefly the powerful simulation technique known as &#8220;importance sampling.&#8221; Importance sampling is a technique that lets you use numerical simulation to explore events that, at first look, appear too rare to be reliably approximated numerically. The correctness of importance sampling follows almost immediately from the definition of a change of density. Like most [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We describe briefly the powerful simulation technique known as &#8220;importance sampling.&#8221;  Importance sampling is a technique that lets you use numerical simulation to explore events that, at first look, appear too rare to be reliably approximated numerically.  The correctness of importance sampling follows almost immediately from the definition of a change of density.  Like most mathematical techniques, importance sampling brings in its own concerns and controls that were not obvious in the original problem.  To deal with these concerns (like picking the re-weighting to use) we will largely appeal to the ideas from &#8220;A Tutorial on the Cross-Entropy Method&#8221; Pieter-Tjerk de Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein, Annals of Operations Research, 2005 vol. 134 (1) pp. 19-67.<span id="more-1915"></span>To make things concrete we describe the application of the method to a very simplified version of the problem of modeling mortgage defaults.  Our writeup re-derives most everything for clarity and can be found here: <a target="_blank" href="http://www.win-vector.com/dfiles/ImportanceSampling.pdf">http://www.win-vector.com/dfiles/ImportanceSampling.pdf</a></p>
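<p>As a toy illustration of the idea (not the mortgage example from the writeup), the sketch below estimates the chance that a standard Gaussian exceeds 4, an event with probability of roughly 3 in 100,000.  The choice of event, the shifted sampling density N(4,1) and all function names are our own assumptions.  We sample where the rare event is common and re-weight each hit by the ratio of the original density to the sampling density.</p>

```python
import math
import random


def tail_probability_naive(threshold, samples, rng):
    """Plain simulation: count how often a standard Gaussian exceeds threshold."""
    hits = sum(1 for _ in range(samples) if rng.gauss(0.0, 1.0) > threshold)
    return hits / samples


def tail_probability_importance(threshold, samples, rng):
    """Sample from N(threshold, 1) and re-weight by the change of density."""
    total = 0.0
    for _ in range(samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # Density of N(0,1) divided by density of N(threshold,1) at x.
            total += math.exp(-threshold * x + 0.5 * threshold * threshold)
    return total / samples


def tail_probability_exact(threshold):
    """Reference value from the complementary error function."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


rng = random.Random(0)
print(tail_probability_naive(4.0, 100_000, rng))
print(tail_probability_importance(4.0, 100_000, rng))
print(tail_probability_exact(4.0))
```

<p>With 100,000 draws the plain simulation sees only a handful of hits (possibly none), while the re-weighted estimate is typically within a few percent of the reference value.</p>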
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/01/importance-sampling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why you can not to use statistics to dispute magic</title>
		<link>http://www.win-vector.com/blog/2011/12/why-you-can-not-to-use-statistics-to-dispute-magic/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=why-you-can-not-to-use-statistics-to-dispute-magic</link>
		<comments>http://www.win-vector.com/blog/2011/12/why-you-can-not-to-use-statistics-to-dispute-magic/#comments</comments>
		<pubDate>Sat, 10 Dec 2011 17:42:02 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Fisher]]></category>
		<category><![CDATA[Junk Science]]></category>
		<category><![CDATA[Null Hyphothesis]]></category>
		<category><![CDATA[Positivism]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1903</guid>
		<description><![CDATA[It is a subtle point that statistical modeling is different than model based science. However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded). It is often claimed that model based science is [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/' rel='bookmark' title='Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/' rel='bookmark' title='&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures'>&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>It is a subtle point that statistical modeling is different than model based science.  However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded).  It is often claimed that model based science is being done when in fact statistics is what is being done (for instance some of the unfortunate distractions of flawed reports related to <a target="_blank" href="http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/">the important question of the magnitude of plausible anthropogenic global warming</a>).</p>
<p>Both model based science and statistics are wonderful fields, but it is important to not receive the results of one when you have paid for the other.</p>
<p>We will pointedly discuss one of the differences.<span id="more-1903"></span>First let us define our terms.  </p>
<p>I will take &#8220;model based science&#8221; to essentially mean <a target="_blank" href="http://en.wikipedia.org/wiki/Falsifiability">Popperian Falsifiability</a> (an alternative to <a target="_blank" href="http://en.wikipedia.org/wiki/Positivism">positivism</a>).  This is roughly: you construct a statement or model and the model is said to only have empirical content if it is in theory possible to &#8220;falsify the model.&#8221;  That is the model must form predictions that are specific enough to potentially be disproved.  If you see a single instance of the model being wrong, you say the model is wrong (or at best incomplete).  And you are done.  Frankly, for all the philosophical  sturm und drang this is closest to what is meant by science.</p>
<p>I will take statistical modeling to roughly mean <a target="_blank" href="http://en.wikipedia.org/wiki/Null_hypothesis">Fisherian Null Hypothesis rejection</a>.  This is only one branch of statistics (in addition to Fisher&#8217;s methods we also have frequentist and Bayesian methods, in particular see:  <a target="_blank" href="http://stat.stanford.edu/~ckirby/brad/other/">Controversies in the foundations of statistics, Bradley Efron, Amer. Math. Mon. 85, 231-246, 1978</a>) but it is closest to what is actually performed in statistical studies.</p>
<p>You can see the two methods sound very similar- they both emphasize rejection of a hypothesis.  But this is deceptive.  In the case of Popperian falsifiability you are essentially holding on to a hypothesis that you believe, but are very willing to give it up (one wrong prediction and it is out).  In the case of Fisherian rejection you don&#8217;t believe the null hypothesis, but you are holding back rejection until you collect enough data to get rid of it.</p>
<p>Let us go over that again.</p>
<p>In the falsifiable or model based science regime: a theory or model would be a prescriptive set of guidelines or laws that allows you to build things (like tall skyscrapers).  If ever one of your skyscrapers unexpectedly falls, you know your theory is wrong and you revise.  Rejection is quick.  But essentially you honestly believed the theory while you were using it.   You were on its side and to counter this bias you agree to reject the theory on first failure.</p>
<p>In the statistical regime you never believed the null hypothesis.  It is a stand-in you are trying to find a lot of evidence against to embarrass out of existence.  Because you know you are against the null hypothesis you do two things to try to mitigate your bias against it: you operationally presume it is true during reasoning and you don&#8217;t reject it until there is a lot of evidence against it.</p>
<p>To sum up: in model based science you believe the model and are confident it can&#8217;t be toppled easily (so you don&#8217;t defend it, as you are confident it will survive); in statistics you doubt the null hypothesis and you give it every chance to survive (because you are sure that it will not survive).</p>
<p>Now that I have stated my premises let us move on to the field I intended to criticize: <a target="_blank" href="http://boingboing.net/2011/12/07/esp-proponents-claim-that-esp.html">paranormal powers</a>.  </p>
<p>To be deliberately rude: if you are investigating something that does not have a proposed mechanism that you are willing to test and reject you are not doing model based science.  And by definition the paranormal is outside of current scientific explanation.  It was too much to hope that we were doing model based science in this case (the appearance is deliberately that of science instead of statistics, but our science friends won&#8217;t help us call this out as they are often profiting from the same confusion).  So you are doing statistics (and there is nothing wrong with that).  But if you are doing statistics what is your null hypothesis?  </p>
<ul>
<li>Null Hypothesis  Candidate 1: ESP does not exist.
<p>This is a plausible hypothesis and sounds &#8220;nully&#8221; (doesn&#8217;t claim much).  But you would only be able to use this null hypothesis to try to prove the existence of ESP.</p>
<p>But it is exactly the wrong hypothesis to disprove ESP.<br />
&#8220;The null hypothesis can never be proven&#8221; (see <a target="_blank" href="http://en.wikipedia.org/wiki/Null_hypothesis">Null Hypothesis</a> and<br />
<a target="_blank" href="http://www.win-vector.com/blog/tag/statsmanship/">Statsmanship</a>).  Fisherian testing is unfortunately a one-sided design; it can only reject a null hypothesis (not fully settle questions).</p>
</li>
<li>Null Hypothesis  Candidate 2: ESP does  exist.
</li>
<p>This is the null hypothesis you need to work with to reject ESP.</p>
<p>But here is the trap.  You must operationally work with the hypothesis (even if you don&#8217;t like it) during the rejection attempt.  Since you are forced to &#8220;operationally accept&#8221; the null hypothesis for the duration of the study you have absolutely no defense against critiques like:</p>
<blockquote><p>
This latter review didn’t find any problems in our methodology or writeup itself, but suggested that, since the three of us (Richard Wiseman, Chris French and I) are all skeptical of ESP, we might have unconsciously influenced the results using our own psychic powers.&#8217;
</p></blockquote>
<p>The paranormal is just one big game of <a target="_blank" href="http://en.wikipedia.org/wiki/Mornington_Crescent_(game)">Mornington Crescent</a>. So if you failed to claim that there is no such thing as  psychic dampening powers <em>before</em> your opponent accuses you of using such powers: you lose.  The game is all about timing, not reality.  If you don&#8217;t like this kind of situation, don&#8217;t get into this kind of situation.</p>
<p>This is why you shouldn&#8217;t use statistics to study bullshit.  Statistical testing methods are deliberately designed to be weak.  Unfortunately they are easy to work around if given enough rope.</p>
</ul>
<p>None of this would matter if it didn&#8217;t also hold for a lot of what is called mainstream science.  Everyone wants the adulation of having important scientific results; but they seem only to want to pay to commission statistics.</p>
<p>Take big money pharmaceuticals as an example.  Non-working drugs can deliver <em>equivocal</em> results forever (as long as you keep weakening the proposed claims after each study) and always being &#8220;on the verge&#8221; of a significant result can fund an endless number of studies and careers.</p>
<p>It is now past time to define what I meant by &#8220;magic.&#8221;  Magic, for this article, is any hypothesis that is not sufficiently specific and bounded.  You can design statistical studies to test many things, but only if you can specifically describe the limits of what you are attempting to study prior to the experimental work.  There are two main classes of magic hypothesis: the powerful and the weak.  Powerful magic hypotheses are unfalsifiable because they have no pre-defined limit on what they can bring in to defend themselves post experiment.  Weak magic hypotheses are unfalsifiable for the simple reason they can be revised after any experiment to claim the effect is present but just slightly more subtle than the resolving power of the last experiment.</p>
<p>You must be very clear about when you are doing science and about when you are doing statistics.  The unfortunate truth is: it is very difficult to successfully dispute junk science using tools as deliberately delicate as statistical hypothesis testing.  Without a sufficiently critical mindset you get <a target="_blank" href="http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/">deliberately bad statistics</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Cargo_cult_science">cargo cult science</a> and <a target="_blank" href="https://plus.google.com/114134834346472219368/posts/ZBNSWpqUsvb">dishonest math</a>.  A good essay on researchers wanting to claim the benefits of the trappings of mathematics (but not willing to meet the very strict pre-conditions required) is &#8220;The Pernicious Influence of Mathematics on Science&#8221; Jack Schwartz, 1962 (collected in &#8220;Discrete Thoughts: Essays on mathematics, science, and philosophy&#8221; Mark Kac, Gian-Carlo Rota, Jacob T. Schwartz, Birkhauser  1992).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/' rel='bookmark' title='Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’'>Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/' rel='bookmark' title='&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures'>&#8220;I don&#8217;t think that means what you think it means;&#8221; Statistics to English Translation, Part 1: Accuracy Measures</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/12/why-you-can-not-to-use-statistics-to-dispute-magic/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What to do when you run out of memory</title>
		<link>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-to-do-when-you-run-out-of-memory</link>
		<comments>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 12:25:59 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Additive Combinatorics]]></category>
		<category><![CDATA[GNU sort]]></category>
		<category><![CDATA[H2]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Map Reduce]]></category>
		<category><![CDATA[Out of core]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1892</guid>
		<description><![CDATA[A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working &#8220;out of core&#8221; or what you should do when you run out of memory. Early computers were most limited by their paltry memory sizes. von Neumann himself [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory.  We present here some general strategies for working &#8220;out of core&#8221; or what you should do when you run out of memory.</p>
<p>Early computers were most limited by their paltry memory sizes.  von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single typewritten page (much more memory than the few hundred words available to the <a target="_blank" href="http://en.wikipedia.org/wiki/ENIAC">ENIAC</a>).   The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.</p>
<p><center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/IMG_0062.jpg" alt="IMG 0062" title="IMG_0062.JPG" border="0" height="300" /></p>
<p/>
SDS 920 computer, Computer History Museum, Mountain View CA<br />
</center></p>
<p>Historically computer scientists have concentrated on streaming or online algorithms (that is, algorithms that work with the data in the order it is available and use limited memory).  For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort).  The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce.  So in one sense databases and Map Reduce are different APIs on top of very related technologies (journaling, splitting and merging).  Replicating data (or even delaying duplicate elimination) that is already &#8220;too large to handle&#8221; may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude, compare a 2 terabyte drive to an 8 gigabyte memory stick).<span id="more-1892"></span>In our web age, the typical big data problems are inverting indices (for fast search lookup) and computing term frequencies (for <a target="_blank" href="http://en.wikipedia.org/wiki/Okapi_BM25">TF/IDF scoring</a> or for things like <a target="_blank" href="http://en.wikipedia.org/wiki/Naive_bayes">Naive Bayes classifiers</a>).  Since these are over-worked examples we will use a mathematical problem from <a href="http://terrytao.wordpress.com/books/additive-combinatorics/">&#8220;Additive Combinatorics&#8221;, Terence Tao, Van Vu, (ISBN-13: 9780521853866; ISBN-10: 0521853869)</a>.</p>
<p>We take one problem from the field of additive combinatorics: sum sets.   For two sets of integers A = {a_1, &#8230;, a_s} and B = {b_1, &#8230;, b_t} the sum set is defined as the set (without repetition) A + B = { a_i + b_j | i = 1,&#8230;,s, j = 1,&#8230;,t }.   For sets of integers the size of A+B (denoted as |A+B|) can vary from |A| + |B| &#8211; 1 to |A| * |B| depending on the relations between the numbers in A and B (or the structure of A and B).  If instead of working with integers we work with integers <a target="_blank" href="http://en.wikipedia.org/wiki/Modular_arithmetic">modulo p</a> where p is a prime number (or equivalently we treat all numbers as remainders of division by p) then by the Cauchy-Davenport inequality we have |A + B| &ge; min(|A|+|B|-1,p) (so essentially the same result, except when we run out of possible integers modulo p).</p>
<p>For example we would say (working modulo 19) that [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18].   In fact there are 19 pairs of sets that add up to  [0, 1, 10, 11, 12, 14, 15, 16, 18] (for instance [5, 6, 9, 10] + [5, 6, 9, 10] is another such pair).  Just to move forward, assume we were interested in determining how many ways a set can be written as the sum of a pair of sets (each of size 4).  For a given sum result we might try search or <a target="_blank" href="http://en.wikipedia.org/wiki/Integer_programming">integer programming</a> to find all possible summands.  However, if we want the statistics on all sums simultaneously, we can work much quicker and without need for big gun mathematics.</p>
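<p>To make the definition concrete, here is a minimal sketch of the sum set operation modulo p.  It is not taken from the example repository; the class name <code>SumSetSketch</code> and the method <code>sumSet</code> are hypothetical names chosen for illustration.  It reproduces the worked example above and prints the Cauchy-Davenport lower bound next to the observed size.</p>

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only (hypothetical names): sum sets modulo p. */
final class SumSetSketch {
    /** Returns { (x + y) mod p : x in a, y in b } as a sorted set. */
    static TreeSet<Integer> sumSet(Set<Integer> a, Set<Integer> b, int p) {
        TreeSet<Integer> c = new TreeSet<>();
        for (int x : a) {
            for (int y : b) {
                // floorMod keeps the result in 0..p-1 even for negative inputs
                c.add(Math.floorMod(x + y, p));
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int p = 19;
        Set<Integer> a = new TreeSet<>(Arrays.asList(0, 1, 4, 5));
        Set<Integer> b = new TreeSet<>(Arrays.asList(10, 11, 14, 15));
        TreeSet<Integer> c = sumSet(a, b, p);
        System.out.println(a + " + " + b + " = " + c);
        // Cauchy-Davenport: |A + B| >= min(|A| + |B| - 1, p) when p is prime
        int bound = Math.min(a.size() + b.size() - 1, p);
        System.out.println("size " + c.size() + ", lower bound " + bound);
    }
}
```

<p>For this pair the sum set has 9 elements, comfortably above the bound min(4+4-1, 19) = 7.</p>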
<p>The straightforward solution in this case is a bit of code like:</p>
<p><code></p>
<pre>
for set A from all possible sets of 4 integers from 0 to 18
    for set B from all possible sets of 4 integers from 0 to 18
        let set C = A + B modulo 19
        use set C as a key and add the pair (A,B) to the list associated with C
for all key sets C tracked above
     compute the size of the list of summand pairs found for C
print how many result sets C have a given number of summand pairs
</pre>
<p></code></p>
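<p>The loop above can be tried at a scale small enough that memory is not a concern.  The following is a minimal in-memory sketch, assuming we only want the count of summand pairs per sum set (not the pairs themselves); the class <code>SummandCountSketch</code> and its methods are hypothetical names, not code from the example repository.  As a sanity check on the full-size problem: there are 3876 four-element subsets of 0 through 18, and 3876 * 3876 = 15023376 ordered pairs.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative sketch only (hypothetical names): count summand pairs per sum set. */
final class SummandCountSketch {
    /** All k-element subsets of {0, ..., p-1}, each as an increasing list. */
    static List<List<Integer>> subsets(int p, int k) {
        List<List<Integer>> out = new ArrayList<>();
        extend(p, k, 0, new ArrayList<>(), out);
        return out;
    }

    private static void extend(int p, int k, int next,
                               List<Integer> cur, List<List<Integer>> out) {
        if (cur.size() == k) {
            out.add(new ArrayList<>(cur));
            return;
        }
        for (int v = next; v < p; ++v) {
            cur.add(v);
            extend(p, k, v + 1, cur, out);
            cur.remove(cur.size() - 1); // remove by index: undo the last add
        }
    }

    /** Maps each sum set C (printed as text) to the number of ordered pairs (A,B) with A + B = C mod p. */
    static Map<String, Integer> summandCounts(int p, int k) {
        List<List<Integer>> sets = subsets(p, k);
        Map<String, Integer> counts = new TreeMap<>();
        for (List<Integer> a : sets) {
            for (List<Integer> b : sets) {
                TreeSet<Integer> c = new TreeSet<>();
                for (int x : a) {
                    for (int y : b) {
                        c.add((x + y) % p);
                    }
                }
                counts.merge(c.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Small instance: pairs of 2-element sets modulo 7.
        Map<String, Integer> counts = summandCounts(7, 2);
        // Histogram: number of summand pairs -> how many sum sets have that many.
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (int n : counts.values()) {
            histogram.merge(n, 1, Integer::sum);
        }
        for (Map.Entry<Integer, Integer> e : histogram.entrySet()) {
            System.out.println("found " + e.getValue() + " sums with "
                    + e.getKey() + " distinct summands");
        }
    }
}
```

<p>Keeping only counts is what keeps this sketch small; storing every pair (A,B) under its key C is the bookkeeping that grows with the square of the number of subsets.</p>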
<p>The summand pairs (A,B) that produce each sum C can be collected by any bit of Java code implementing the interface below (just call <code>insertReln(C,(A,B))</code> to store each relation and then <code>entries()</code> to get them back grouped by C).  The interface declares only the methods we need:</p>
<p><code></p>
<pre>
import java.io.IOException;
import java.util.Map;

// Collects (a,b) relations and hands them back grouped by a.
public interface RelnCollector&lt;A,B&gt; {
	void insertReln(A a, B b) throws IOException;
	Iterable&lt;Map.Entry&lt;A,Iterable&lt;B&gt;&gt;&gt; entries() throws IOException, InterruptedException;
	void close() throws IOException;
}
</pre>
<p></code></p>
<p>An in-memory relation collector is trivially implemented by a nested map adjusted to declare the above interface, as we see in the next code snippet:</p>
<pre>
public final class InMemoryRelnCollector&lt;A,B&gt;
	implements RelnCollector&lt;A,B&gt; {
	private final DataAdapter&lt;A&gt; adapterA;
	private final DataAdapter&lt;B&gt; adapterB;
	private Map&lt;A,Iterable&lt;B&gt;&gt; atoBs;

	public InMemoryRelnCollector(final DataAdapter&lt;A&gt; adapterA,
		final DataAdapter&lt;B&gt; adapterB) {
		this.adapterA = adapterA;
		this.adapterB = adapterB;
		atoBs = new TreeMap&lt;A,Iterable&lt;B&gt;&gt;(this.adapterA);
	}

	@Override
	public void insertReln(final A a, final B b) {
		Set&lt;B&gt; set = (Set&lt;B&gt;) atoBs.get(a);
		if(null==set) {
			set = new TreeSet&lt;B&gt;(adapterB);
			atoBs.put(a,set);
		}
		set.add(b); // a Set ignores duplicates, so no contains() check is needed
	}

	@Override
	public Iterable&lt;Map.Entry&lt;A,Iterable&lt;B&gt;&gt;&gt; entries() {
		return atoBs.entrySet();
	}

	@Override
	public void close() {
		atoBs = null;
	}
}
</pre>
<p>The great savings in time is that we work forward from summands to result sums (keeping the lists of summand pairs indexed by result set).  Thus we don&#8217;t have to figure out how to invert the sum operation (as we do our bookkeeping forward).  However, this very bookkeeping may overwhelm us.  As we can see below, a Java implementation of the above procedure runs out of memory when trying to characterize which sets of integers modulo 19 can be split into two sets of size four (and how many ways each such set can be split).  However, this was with the deliberately small default allocation of memory available to Java processes (so for this particular instance we could avoid trouble by allocating more memory; we ran out of allocation, not system memory).  What happens when we don&#8217;t manage memory is illustrated below:</p>
<pre>
Start	com.winvector.consolidate.impl.InMemoryRelnCollector
	Tue Dec 06 10:04:38 PST 2011
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.TreeMap.put(TreeMap.java:554)
	at java.util.TreeSet.add(TreeSet.java:238)
	at com.winvector.consolidate.example.AdditiveSets.sum(AdditiveSets.java:25)
	at com.winvector.consolidate.example.AdditiveSets.main(AdditiveSets.java:55)
</pre>
<p>An out of core solution can solve the entire problem without needing any additional system memory (just some disk space which is still of a much greater size than primary memory).  The complete calculated result is given below:</p>
<pre>
Examining sums of 4 integers chosen from 0 through 18 modulo 19.
Start	com.winvector.consolidate.impl.FileRelnCollector
	Tue Dec 06 09:54:20 PST 2011
	Inserted 15023376 relations.
 [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 1, 15, 16] + [0, 14, 15, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 3, 4, 18] + [11, 12, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 14, 15, 18] + [0, 1, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [1, 2, 5, 6] + [9, 10, 13, 14] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [1, 2, 16, 17] + [13, 14, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [2, 3, 6, 7] + [8, 9, 12, 13] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [2, 3, 17, 18] + [12, 13, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [3, 4, 7, 8] + [7, 8, 11, 12] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [4, 5, 8, 9] + [6, 7, 10, 11] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [5, 6, 9, 10] + [5, 6, 9, 10] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [6, 7, 10, 11] + [4, 5, 8, 9] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [7, 8, 11, 12] + [3, 4, 7, 8] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [8, 9, 12, 13] + [2, 3, 6, 7] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [9, 10, 13, 14] + [1, 2, 5, 6] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [10, 11, 14, 15] + [0, 1, 4, 5] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [11, 12, 15, 16] + [0, 3, 4, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [12, 13, 16, 17] + [2, 3, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [13, 14, 17, 18] + [1, 2, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
	Examined 128820 sums and 15023376 summands.
	found 3705 sums with 19 distinct summands
	found 39900 sums with 38 distinct summands
	found 26847 sums with 76 distinct summands
	found 22230 sums with 114 distinct summands
	found 10602 sums with 152 distinct summands
	found 8892 sums with 190 distinct summands
	found 2736 sums with 228 distinct summands
	found 5016 sums with 266 distinct summands
	found 2736 sums with 304 distinct summands
	found 1710 sums with 342 distinct summands
	found 171 sums with 361 distinct summands
	found 1710 sums with 380 distinct summands
	found 855 sums with 418 distinct summands
	found 342 sums with 456 distinct summands
	found 342 sums with 532 distinct summands
	found 342 sums with 570 distinct summands
	found 171 sums with 722 distinct summands
	found 171 sums with 760 distinct summands
	found 171 sums with 912 distinct summands
	found 171 sums with 1026 distinct summands
Done:	com.winvector.consolidate.impl.FileRelnCollector
   elapsed time: 618473MS
   Tue Dec 06 10:04:38 PST 2011
</pre>
<p>We performed the calculation by using a different implementation of <code>RelnCollector</code> called <code>FileRelnCollector</code>.  What this implementation does is write relations to a file as they are made available.  That is, <code>FileRelnCollector</code>&#8217;s implementation of <code>insertReln</code> is literally a <code>println()</code>.  Something not much more complicated than the following:</p>
<p><code></p>
<pre>
	@Override
	public void insertReln(final A a, final B b) {
		System.out.println("" + a + "\t" + b);
	}
</pre>
<p></code></p>
<p>The heavy lifting is done when <code>entries()</code> is called.  When the entries are wanted the <code>FileRelnCollector</code> calls <a target="_blank" href="http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html">GNU sort</a> on the saved file to get all the results ordered by result sum (instead of by summand).  GNU sort can sort files larger than memory by a split and merge strategy involving temporary files.  We provide such  <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/FileRelnCollector.java">a file plus GNU sort based implementation of RelnCollector</a>.  </p>
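<p>Why sorting is enough: once the file is ordered by key, all lines sharing a key are adjacent, so one streaming pass can process each group while holding only the current key in memory.  Here is a minimal sketch of that pass, assuming tab-separated "key TAB value" lines that have already been sorted by key (for example by GNU sort).  The class <code>SortedGroupSketch</code> and its method are hypothetical names and are not the code in <code>FileRelnCollector</code>.</p>

```java
import java.util.Arrays;
import java.util.function.BiConsumer;

/** Illustrative sketch only (hypothetical names): one pass over key-sorted lines. */
final class SortedGroupSketch {
    /**
     * Reads "key TAB value" lines that are already sorted by key and reports,
     * for each key, how many lines carried it. Only the current key and its
     * running count are held in memory.
     */
    static void forEachKeyCount(Iterable<String> sortedLines,
                                BiConsumer<String, Integer> sink) {
        String currentKey = null;
        int currentCount = 0;
        for (String line : sortedLines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // assumption: silently skip lines without a tab
            }
            String key = line.substring(0, tab);
            if (!key.equals(currentKey)) {
                if (currentKey != null) {
                    sink.accept(currentKey, currentCount); // previous group is complete
                }
                currentKey = key;
                currentCount = 0;
            }
            ++currentCount;
        }
        if (currentKey != null) {
            sink.accept(currentKey, currentCount); // flush the last group
        }
    }

    public static void main(String[] args) {
        // Stand-in for the sorted file: keys are sum sets, values are summand pairs.
        Iterable<String> sorted = Arrays.asList(
                "C1\t(A1,B1)",
                "C1\t(A2,B2)",
                "C2\t(A3,B3)");
        forEachKeyCount(sorted, (key, n) ->
                System.out.println(key + " has " + n + " summand pairs"));
    }
}
```

<p>In a real run the <code>Iterable</code> would be backed by a reader over the sorted file rather than an in-memory list.</p>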
<p>Note that this runtime can be deceptively low.  If running on a machine with a modern operating system and enough memory the file being used as "external storage" actually gets cached into memory (and gets near memory speed performance).  To get a reliable timing you need to test a problem of the size you are interested in on the size machine you are going to deploy on (not on a larger machine).</p>
<p>For better or worse this method should seem familiar as a lot of science has been done using the Unix text tools (sort, join and a few more).  This is also the basis of Map Reduce and we demonstrate a <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/MapReduceRelnCollector.java">Hadoop implementation of RelnCollector</a> as well.  Or we can link up with the other technology designed for beyond memory size data manipulation and get <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/DBRelnCollector.java">a database based implementation of RelnCollector</a>.  </p>
<p>In all cases the implementations we call depend on journaling (in the sense of keeping a sequential log of operations to be done instead of immediately performing the operations), scattering (splitting into multiple temp files and structures) and merging (combining data from multiple ordered files).  We could write our own code to perform all of these operations (obviating any need for GNU sort, Hadoop or a database), but it is much less code to do as we have here and write an adapter to use existing implementations.</p>
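<p>The merging step is the least obvious of the three, so here is a minimal sketch of a k-way merge of already-sorted runs using a priority queue.  This is a generic textbook technique, not the code inside GNU sort, Hadoop or H2; the class <code>MergeRunsSketch</code> is a hypothetical name, and in-memory lists stand in for the sorted temporary files.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative sketch only (hypothetical names): k-way merge of sorted runs. */
final class MergeRunsSketch {
    /** One sorted run together with its current smallest unread element. */
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            rest = it;
            head = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    /** Merges sorted runs into one sorted list; the heap holds one element per run. */
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();       // run with the smallest head
            merged.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();   // advance while the cursor is outside the heap
                heap.add(c);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("a", "d", "g"),
                Arrays.asList("b", "e"),
                Arrays.asList("c", "f"));
        System.out.println(merge(runs));
    }
}
```

<p>An out of core version would read each run from its own temporary file and write the merged output to disk, so the working set stays at one buffered element per run.</p>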
<p>The sum-set example is deliberately artificial.  More common examples are, as we mentioned, index inversion and term frequency calculation.  All of our example code is available here: <a target="_blank" href="https://github.com/WinVector/OutOfCore">https://github.com/WinVector/OutOfCore</a> including JUnit tests and an <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/example/AdditiveSets.java">example program</a>.  The code depends on libraries for <a target="_blank" href="http://www.junit.org/">JUnit 4.10</a>, <a target="_blank" href="http://www.h2database.com/html/main.html">h2 database</a>, <a target="_blank" href="http://hadoop.apache.org/mapreduce/releases.html">Hadoop 0.21.0</a> for the various implementations.</p>
<p>The main trick is basing your code on a very thin storage abstraction (like the <code>RelnCollector</code> interface, instead of explicitly known data structures) and then using this abstraction to hide all of the details away from the rest of your code (keeping complexity at a manageable level).  The two things to avoid are either infecting your code with too much knowledge of your storage plans (i.e. pushing implementation details into your important code to "speed things up") or being forced to re-design your entire project to fit within some framework (like re-writing all of your code as a database stored procedure or an explicit Hadoop map/reduce pair as this over-commits you to one technology).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An Appreciation of Locality Sensitive Hashing</title>
		<link>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=an-appreciation-of-locality-sensitive-hashing</link>
		<comments>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 16:41:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Locality Sensitive Hashing]]></category>
		<category><![CDATA[Nearest Neighbor]]></category>
		<category><![CDATA[Theorist]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1848</guid>
		<description><![CDATA[We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Locality sensitive hashing is awe inspiring in its originality, simplicity, beauty and effectiveness. In addition locality sensitive hashing is a remarkable technique as it works even when drastically abridged and [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques.<span id="more-1848"></span>Locality sensitive hashing is awe inspiring in its originality, simplicity, beauty and effectiveness.  In addition locality sensitive hashing is a remarkable technique as it works even when drastically abridged and simplified.  In this <a target="_blank" href="http://www.win-vector.com/dfiles/LocalitySensitiveHashing.pdf">paper (link to pdf)</a> we give a description of conditions where the technique works and a heuristic argument why it works (using only elementary math).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>&#8220;The Mythical Man Month&#8221; is still a good read</title>
		<link>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-mythical-man-month-is-still-a-good-read</link>
		<comments>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/#comments</comments>
		<pubDate>Sun, 23 Oct 2011 18:57:52 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Computers]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Architects]]></category>
		<category><![CDATA[Mythical Man Month]]></category>
		<category><![CDATA[SAGE]]></category>
		<category><![CDATA[WIMP]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1834</guid>
		<description><![CDATA[Re-read Fred Brooks &#8220;The Mythical Man Month&#8221; over vacation.  Book remains insightful about computer science and project management. My spin on some points: System architects once were the people who said &#8220;no&#8221; to features to maintain design consistency and coherency. Now architects are the people who buy and bring in external frameworks and technologies (killing any [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Re-read Fred Brooks &#8220;The Mythical Man Month&#8221; over vacation.  Book remains insightful about computer science and project management.<span id="more-1834"></span>My spin on some points:</p>
<ul>
<li>
System architects once were the people who said &#8220;no&#8221; to features to maintain design consistency and coherency.  Now architects are the people who buy and bring in external frameworks and technologies (killing any chance of consistency or coherency).  Kind of like the Fahrenheit 451 quote &#8220;I remember firemen used to fight fires.&#8221;
</li>
<li>
By far the thing that aged the worst was the reverence for the WIMP (windows, icons, menus, pointing) paradigm.  At this point I think we can argue that WIMP codified a lot of provably bad decisions: desktops, icons, menus and mouse out of visual field.  Maybe some of the ideas prior to WIMP (like SAGE&#8217;s light-pens) or after WIMP (application launcher noun-verb theories like Quicksilver, search, touch pads, full screen apps, versioning and not forcing the user to adapt to the file storage abstraction) are actually much more fundamental.  I think we all were seduced by the 1968 Engelbart demo but forget that the Semi-Automatic Ground Environment (SAGE) had been a production-deployed, direct (light pen), multi-user, information-sharing, point-and-click system since 1959.</p>
<p><center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/IMG_0064.jpg" alt="SAGE station" title="IMG_0064.JPG" border="0" width="600" height="450" /></p>
<p>SAGE station, Computer History Museum- Mountain View, CA</p>
<p></center>
</li>
<li>
Most everything else ages very well.  The discussions of the pain of having to work &#8220;out of core&#8221; remain relevant as this is what we now call &#8220;big data&#8221; (though in Brooks&#8217; time this pain extended to documentation, source code and binaries, all of which were too big to hold in memory or even in machine accessible format in the time of the IBM System/360).  </p>
<p>Though in the old days- &#8220;out of core&#8221; meant punched cards, punched tape, magnetic tape or very slow hard disks (which were a new luxury for the period Brooks writes about).<br />
<center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/IMG_0062.jpg" alt="IMG 0062" title="IMG_0062.JPG" border="0" width="450" height="600" /></p>
<p>SDS 920 with built in tape-drive, Computer History Museum- Mountain View, CA</p>
<p></center>
</li>
<li>
Linkers were among the biggest problems in the 1960s and remain so now (though we now call it late binding, jars, shared libraries and APIs).  At one point Brooks throws up his hands and says that it would be faster to just re-compile everything than to deal with some relocating linkers.
</li>
<li>
Brooks definitely advocates and anticipates things like developer wikis (though he had to use microfiche as the computers of his day didn&#8217;t have enough storage to manage their own documentation).
</li>
<li>
&#8220;Literate Programming&#8221; is clearly anticipated.
</li>
<li>
Version control procedures are definitely written about, but Brooks seems not to anticipate version control software.
</li>
</ul>
<p>Overall: very well written and still interesting and relevant.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Kernel Methods and Support Vector Machines de-Mystified</title>
		<link>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=kernel-methods-and-support-vector-machines-de-mystified</link>
		<comments>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/#comments</comments>
		<pubDate>Sat, 08 Oct 2011 00:17:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Kernel Methods]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Naive Bayes]]></category>
		<category><![CDATA[Support Vector Machines]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1804</guid>
		<description><![CDATA[We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Goals [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/' rel='bookmark' title='Do your tools support production or complexity?'>Do your tools support production or complexity?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We give a simple explanation of the interrelated machine learning techniques called <a target="_blank" href="http://en.wikipedia.org/wiki/Kernel_methods">kernel methods</a> and <a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>.  We hope to characterize and de-mystify some of the properties of these methods.  To do this we work some examples and draw a few analogies.  The familiar no matter how wonderful is not perceived as mystical.<span id="more-1804"></span><br />
<h2>Goals of this writeup</h2>
<ol>
<li>De-mystify  kernel methods and support vector machines<br />
<blockquote><p>
Kernel methods and support vector machines have taken mythological proportions in the machine learning imagination. Partly this is because a number of good ideas are overly associated with them: support/non-support training datums, weighting training data, discounting data, regularization, margin and the bounding of generalization error.  My issue is that these are all important enough ideas to stand on their own and are often seen in simpler settings.  The observations that inform my view are as follows:</p>
<ul>
<li>Kernel methods and support vector machines are in fact two good ideas.  Each is important even without the other: kernels are useful all over and support vector machines would be  useful even if we restricted to the trivial identity kernel.</li>
<li>Small scale &#8220;kernel tricks&#8221; are not that different than the classic technique of adding &#8220;interaction variables.&#8221;  Kernels let you escape from the limits of &#8220;linear hypotheses&#8221; (really by moving to a bigger space where things are again linear but look curved from the point of view of your smaller original space).  We demonstrate the linear methods and &#8220;primal kernel tricks&#8221; in <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a>.</li>
<li>Support/non-support is the central issue of nearest neighbor classifiers.</li>
<li>Weighting training data is most famously shared by logistic regression and boosting.</li>
<li>Re-weighting of data by <i>smoothing kernels</i> (different but related use of the word &#8220;kernel&#8221;) is central to non-parametric statistics (kernel smoothers and splines).</li>
<li>Regularization is of supreme importance in modeling in general.</li>
<li>Most practitioners get tired of &#8220;kernel shopping&#8221; and fall back to the identity, cosine or radial/Gaussian kernels.</li>
</ul>
</blockquote>
</li>
<li>Show concrete examples of what are and what are not kernels.<br />
Few sources give enough theorems to think about kernels abstractly and fewer still work concrete examples.
</li>
<li>
Use as little math as possible (which, unfortunately, turns out to be quite a bit).  We will discuss encodings and stopping conditions (important to understand what is going on) but avoid explaining the optimizers (the most beautiful part of support vector machines, but also the part that is available pre-packaged in libraries).
</li>
<li>
Call out common magical thinking and unreasonable expectations associated with kernel methods and support vector machines.  This should help the reader be in a better position to &#8220;defend their doubts&#8221; regarding machine learning promises.
</li>
<li>
Try to place all techniques in a wider context (if it is usable only one place it is a trick, if it is usable multiple places it is a technique).
</li>
<li>Discuss margin and its impact on generalization error.<br />
<blockquote><p>
Generalization error is an effect of &#8220;over fitting&#8221; where a model has learned things that are true about the training examples that do not hold for the overall truth or concept we are trying to learn (i.e. don&#8217;t generalize).   Generalization error is the excess error rate we observe when scoring new examples versus the error-rate we saw in learning the training data.  Margin is in fact a <em>posterior</em> observation.  That is: margin is observed after the training data is seen, not known before data is seen (as, for example, <a target="_blank" href="http://en.wikipedia.org/wiki/VC_dimension">VC dimension</a> is).   Margin is useful as it bounds generalization error but it is not the <em>prior</em> bound it is often portrayed as.  So we assert margin estimates are not much more special than simple cross-validation estimates which can also be performed once we have data available.
</p></blockquote>
</li>
</ol>
<p>The goals are not to try to indict or try to cut down kernel methods or support vector machines, but just to dump some of the associated baggage so they can be used fluidly and without anxiety.  </p>
<h2>Example Problem</h2>
<p>Consider the following simple (caricature) machine learning problem:  we are given a number of points labeled as circles and triangles and we want to, given a new point, predict the label.  In our example the only input will be two numbers: x and y.  This example is simple enough that we can depict the entire situation in Figure 1 below:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/TruthAndData.png" alt="TruthAndData.png" border="0" width="480" height="480" /></p>
<p>Figure 1: Truth and Training Data</p>
<p></center></p>
<p>We are assuming (for simplicity of exposition) that we have the incredible luck that the label is indeed a deterministic function of x and y.  We portray this &#8220;ground truth&#8221; as the blue parabola- every point found in this region will be considered to be labeled as &#8220;circle&#8221; and every other point (i.e. the red region) will be labeled as &#8220;triangle.&#8221;  We have overlaid 20 example points (with appropriate shape labels).  These points will be our training data.  We will try to learn from the observed training data a good approximation of the unseen blue and red regions.  This process is called learning or training and the ability to correctly predict the label of new points is called generalization.  To make sure our concept is learnable we are going to further guarantee a moat (or margin): the distribution we are picking training and test examples from will never pick a point from the shaded region between the blue and black.  That is, we will never be asked about an example where x*x is very near y.</p>
<p>To promote visual thinking we are going to avoid formulas until the section titled &#8220;precise functions used&#8221; (where we will list the formulas used for each graph).</p>
<h2>Nearest Neighbor Solution</h2>
<p>An interesting model in these days of big data and fast machines is the nearest neighbor model.  Such a model colors each point in space blue or red as it <em>thinks</em> the point should be classified.  In this case each point is colored the color given by the closest known training point.  This induces the type of model seen in Figure 2 whose boundary is piecewise line segments mid-way between training points.  The data points that determine segments of the classification boundary in this way are thought to support the boundary and are called &#8220;support examples&#8221; or &#8220;support vectors.&#8221;  In fact any training point that is not &#8220;supporting&#8221; part of the boundary is &#8220;irrelevant&#8221; (could be removed and we would get the exact same model).  This split between important and non-important training points is considered one of the important observations from  the theory of support vector machines, but we see it even here in the nearest neighbor models.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1NN.png" alt="1NN.png" border="0" width="480" height="480" /></p>
<p>Figure 2: 1 nearest neighbor model</p>
<p></center></p>
<p>Notice that the nearest neighbor model in Figure 2 is &#8220;not half bad.&#8221;  The blue region is much too wide, but near the training data the blue mass is roughly in the correct position.  If we had infinite data and infinite computational resources this sort of model would be hard to beat (as any point we are likely to be tested on would have a lot of very near examples to work from).  We can try to clean up the shape of the model a bit by using a vote of the 2 nearest points to color each point in the plane (see Figure 3) or even the majority of the 3 nearest points (see Figure 4).  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/2NN.png" alt="2NN.png" border="0" width="480" height="480" /></p>
<p>Figure 3: 2 nearest neighbor model</p>
<p></center></p>
<p>The purple region in Figure 3 is a &#8220;region of uncertainty&#8221; where the net-vote over the 2 nearest points is zero (they gave inconsistent advice).  We can use this as a hint that predictions taken from this region are less reliable.  As you may notice, we are not getting real improvements.  We should always spend more time getting more features (useful coordinates in addition to x and y) and more data; and spend less time tweaking models.  But if your data and features are fixed all you can work on is your modeling technique (alternately: your modeling technique is all you can prepare before receiving features and data).  So we will continue to discuss features and technique.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/3NN.png" alt="3NN.png" border="0" width="480" height="480" /></p>
<p>Figure 4: 3 nearest neighbor model</p>
<p></center></p>
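<p>The voting rule behind Figures 2 through 4 can be sketched in a few lines of Python.  This is only an illustration: the eight points in <code>TRAIN</code> are made up (they are not the 20 training points in the figures), the labels +1/-1 stand for circle/triangle, and the names <code>knn_vote</code> and <code>classify</code> are our own.</p>

```python
import math

# Hypothetical training data (NOT the post's 20 points): (x, y, label),
# with label +1 for "circle" (y above x*x) and -1 for "triangle".
TRAIN = [
    (0.0, 1.0, +1), (0.5, 1.5, +1), (-0.5, 1.2, +1), (1.0, 2.5, +1),
    (0.0, -1.0, -1), (1.5, 0.5, -1), (-1.5, 0.5, -1), (2.0, 1.0, -1),
]


def knn_vote(z, train, k):
    """Net vote (sum of +1/-1 labels) over the k training points nearest to z."""
    by_distance = sorted(train, key=lambda p: math.dist(z, (p[0], p[1])))
    return sum(label for _, _, label in by_distance[:k])


def classify(z, train, k):
    """Label z by the sign of the net vote; a zero vote is reported as uncertain."""
    vote = knn_vote(z, train, k)
    if vote == 0:
        return "uncertain"
    return "circle" if vote > 0 else "triangle"


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, classify((0.0, 0.0), TRAIN, k))
```

<p>With an even k the net vote can be zero, which is the &#8220;region of uncertainty&#8221; idea; with an odd k every point gets a definite label.</p>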
<p>Nearest neighbor classifiers are optimal in the sense that with an infinite amount of data the 1-nearest neighbor classifier has an error rate that approaches at most twice the Bayes error rate (the Bayes error rate being the ideal error observed on identical repetitions, or the theoretical best error rate) and for large k the k-nearest neighbor method approaches the Bayes error rate itself (see for example <a target="_blank" href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">k-nearest neighbor algorithm</a> ).</p>
<p>However the effectiveness of the nearest neighbor classifiers is coming from the very low dimension (2) of our input variables x and y.  If we were attempting to predict from n variables we might need an amount of data exponential in n to get a useful nearest neighbor classifier.  This is an <a target="_blank" href="http://en.wikipedia.org/wiki/Efficiency_(statistics)">inefficient</a> use of the data; what we want is to use all of the data we have effectively.  To do this we further assume there is some relational reason that the position of the point in the plane determines the shape-label (i.e. that we are learning, generalizing, interpolating and extrapolating- not just memorizing).  We set our ambitions above memorizing and move towards functional modeling.  We start thinking in terms of change: how the label density of examples changes as we move is a good clue as to how it will continue to change.  We can do this either in a parametric or primal formulation (where we attempt to directly infer parameters of an assumed functional form as in regression or logistic regression) or in a non-parametric or dual form (where we attempt to learn relations between points and the training data instead of parameters).  We have discussed this before ( <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a> ) and expanded on primal methods ( <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">Learn Logistic Regression (and beyond)</a>, <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/">The Simpler Derivation of Logistic Regression</a> and <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/">The equivalence of logistic regression and maximum entropy models</a>).</p>
<h2>Functional Solution</h2>
<p>Our next idea is: can we use our training data to tease out a functional form for the relation between x,y and shape label?  The first function we will try is chosen to be similar to the nearest neighbor model we just demonstrated.  For this form we say each training point is the center of a <a target="_blank" href="http://en.wikipedia.org/wiki/Gaussian_function">Gaussian</a> hump (say up for blue/circle and down for red/triangle) which we will call a &#8220;discount function&#8221; (or informally a hump).  A &#8220;Gaussian&#8221; is just a function that falls off exponentially in the square of Euclidean distance from a central point (we can see the simple form of such functions in the &#8220;precise functions used&#8221; section).  The contour lines (or levels as seen looking down from above) of one such hump or discount function are shown in figure 5 (all graphs from here on in will be looking down at a height represented by color and contour lines, much like topographic maps).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3ExampleConcept.png" alt="GaussianKernel3ExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 5: Narrow Gaussian example concept</p>
<p></center></p>
<p>This particular discount function has its maximum (or peak) centered on a data point and then rapidly falls as we move away from the training point.  The only property we will use in this section is that we can evaluate the discount easily (so we don&#8217;t at this time need or make use of any restrictive properties like integrability, positive semi-definiteness or even anything like bounded level sets).<br />
We could make our functional model just the sum of all of these up and down humps over all of the training data.  Each point gets a hump of the same height and same radius, pointing up for blue/circle and down for red/triangle.  The net coloring of this sum function is shown in figure 6.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3KernelSum1.png" alt="GaussianKernel3KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 6: Narrow Gaussian sum model</p>
<p></center></p>
<p>The idea is that this would imitate the nearest neighbor model we have already seen.  This is because even though we are summing over all data the closest training examples should usually dominate (since the Gaussian humps we picked fall to zero as we get further from their centers).   The correspondence to nearest neighbor increases if we tighten the radius of our functions (searching for the right radius is a very important statistical problem called &#8220;bandwidth estimation&#8221;).  Figure 7, for example, shows the sum over much narrower Gaussians.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel20KernelSum1.png" alt="GaussianKernel20KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 7: Very narrow Gaussian sum model</p>
<p></center></p>
<p>Figure 7 looks even more like the nearest neighbor solution.  In fact you could think of the discount-weighted sum over all of the data as the natural model (as it is easier to reason about) and the nearest neighbors as a more efficient approximation (summing only over near points).  We also have seen that we can alter the smoothness of our model (which has consequences on model generalization) by changing the steepness of our discount functions.  We can also build intermediate models (such as building a nearest neighbor but using discount weighted sums instead of uniform voting in the neighborhoods).   We haven&#8217;t yet greatly improved on nearest neighbor, but we have identified a couple of obvious avenues for further improvement: mess with the bandwidths (the obvious idea to try and deal with near/far scale) or re-weight the data (as one of the lessons of the nearest neighbor model is that points that are not near a boundary are less important).   </p>
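<p>As a rough sketch of the un-weighted sum model (again with made-up points rather than the data in the figures), each training point contributes one Gaussian hump, signed by its label, and we read off the sign of the total.  The names <code>hump</code> and <code>hump_sum</code> and the steepness values 3.0 and 20.0 are assumptions for illustration, not the exact settings used to draw Figures 6 and 7.</p>

```python
import math

# Hypothetical training data (NOT the post's 20 points): (x, y, label),
# with label +1 for "circle" and -1 for "triangle".
TRAIN = [
    (0.0, 1.0, +1), (0.5, 1.5, +1), (-0.5, 1.2, +1), (1.0, 2.5, +1),
    (0.0, -1.0, -1), (1.5, 0.5, -1), (-1.5, 0.5, -1), (2.0, 1.0, -1),
]


def hump(center, z, steepness):
    """Gaussian discount: 1 at the center, decaying in squared Euclidean distance."""
    d2 = (z[0] - center[0]) ** 2 + (z[1] - center[1]) ** 2
    return math.exp(-steepness * d2)


def hump_sum(z, train, steepness):
    """Sum of equal-height humps, pointing up for +1 labels and down for -1 labels."""
    return sum(label * hump((x, y), z, steepness) for x, y, label in train)


if __name__ == "__main__":
    z = (0.2, 0.8)
    for steepness in (3.0, 20.0):  # larger steepness means narrower humps
        total = hump_sum(z, TRAIN, steepness)
        print(steepness, total, "circle" if total > 0 else "triangle")
```

<p>Raising the steepness narrows every hump, so the nearest training point increasingly dominates the sum; that is the sense in which the sum model imitates nearest neighbor.</p>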
<h2>Bandwidth Solution</h2>
<p>As a digression let&#8217;s play with bandwidth a bit first.  Suppose simultaneously for each training datum we picked an ideal bandwidth or steepness of the Gaussian.  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/BandwidthModel.png" alt="BandwidthModel.png" border="0" width="480" height="480" /></p>
<p>Figure 8: Simultaneous bandwidth model</p>
<p></center> </p>
<p>This simultaneous bandwidth model depicted in figure 8 was determined by picking a simultaneous assignment of Gaussian widths (each training datum gets its own width) that maximizes the log-sum of the category signed model function on the training data (much like the logistic regression maximum likelihood decoding).  This differs from typical bandwidth selection problems where all points are given a single common &#8220;best bandwidth.&#8221;  We instead picked a different bandwidth for each point (bandwidths picked to maximize how much of the correct category portion of the sum was correct on each training example).  The blue region is reasonable, but does not match the shape of the true concept.   Partly this is because there has not been enough training data to have learned a lot about the true shape (for example there is no empirical evidence for circle/blue in the top-left quadrant).  The other reason is a bias inherent to this type of model.  If one of the bandwidth functions is wider than all of the others (which is almost certain to occur) then it falls slower than all of the others and for points very far away from the center of the training data this one function is most of what remains.  Or equivalently: due to the nature of this modeling technique one of the concepts learned is going to be the union of bounded islands and the other concept will grab all points sufficiently far away from the center of the training data.  This is a strong (and undesirable) bias, but it is a bias also shared by support vector machines with Gaussian kernels (though support vector machines get it from their so-called &#8220;dc-term&#8221; b not being zero; this will be discussed later). Our bandwidth correction was successful, but frankly the optimization problem we solved to estimate the optimal bandwidths was nasty and would not be something we would advise for a large amount of data.</p>
<h2>Data weighting solution (support vector machine)</h2>
<p>So instead of messing with the bandwidths let us consider re-weighting the Gaussians in our sum.  We will leave the shape of each Gaussian the same but allow each individual training datum to have a unique weight or importance.  Perhaps by picking the right weights we can get a better model.  By &#8220;best&#8221; we will mean &#8220;best margin&#8221; (to be defined later) because with &#8220;best margin&#8221; as our objective the optimization problem of solving for the best data weights has a particularly beautiful form that can be reliably solved at great scale.  This is called a &#8220;support vector model&#8221; (or support vector machine) and we will describe these ideas in detail later.  But first let&#8217;s see the result.  Figure 9 shows the original narrow Gaussians re-summed according to the weights picked by the support vector method (instead of all being 1 as in figure 6).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3SVM1.png" alt="GaussianKernel3SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 9: Narrow Gaussian support vector model</p>
<p></center></p>
<p>Figure 9 is again an improvement.  The figure is formed by taking a signed sum of our discount functions centered at our training points and weighted by the &#8220;support vector machine weights.&#8221;  The signs are picked with one class encoded as +1 and the other as -1.  The colors red and blue are picked if the sum is above or below a constant &#8220;b&#8221; called &#8220;the dc-term&#8221; (part of the support vector solution).  The solution again looks okay: all the training points are in the correct color region (and you can&#8217;t hope to interpolate let alone extrapolate if you can&#8217;t even reproduce your training concept).  Also, as with the bandwidth model, contours are set sensibly (steep/dense where categories are near each other, shallow/sparse where things are safe).   We see the same inductive bias: one of the concepts is the union of bounded islands (this is always going to be the case unless the model picks b=0).  This bias is not obvious from the kernel choice, as it is hidden in the dc-term of the support vector fitting equations (it is not part of the kernel, but can be defeated by choosing unbounded kernels).  An interesting empirical fact about support vector models is they tend to work well with a larger bandwidth.    For example if we use a wider Gaussian we get an even more convincing model (see Figure 10).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel025SVM.png" alt="GaussianKernel025SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 10: Wide Gaussian support vector model</p>
<p></center></p>
<p>In figure 10 we have also called out one of the important features of support vector machines.   In the literature all datums are called &#8220;vectors&#8221; (as they are represented in coordinates) and a subset of these are called the support vectors.  In figure 10 we have drawn the 4 training examples that turn out to be support vectors as large shapes and all other training examples as small shapes.  The support vectors are the datums with non-negligible weights.  The learned model is a function of the support vectors only.  So after fitting we can discard the other 16 training points. This is similar to the fact that nearest neighbor also only needs the training examples that are supporting its model boundary.</p>
<p>Notice that at this point the support vector models are not &#8220;magically&#8221; better, they in fact look less like the truth we are trying to learn than either the nearest neighbor models or the un-weighted sum models.  To fix this we need to do some &#8220;kernel shopping&#8221; or find functions that better respect what we are trying to model.  With the right functions support vector machines can do a very good job at learning the concept (but knowing &#8220;the right functions&#8221; is a huge hint as to what the concept is).  There is nothing wrong with using a hint but we do have to have a method to produce the hint or have a reasonable number of hints to choose from. </p>
<p>The &#8220;sum over everything with the same weight&#8221; solutions from the earlier section are very similar to Naive Bayes which also sums over all matching features with no weight adjustment.  The support vector &#8220;use the same model but pick better weights&#8221; stands over these sum models in very much the same way Logistic Regression stands over Naive Bayes.  In fact the optimization problems are very related.  If we consider weights over features or coordinates as &#8220;primal&#8221; and weights over data items as &#8220;dual&#8221; we can informally say something like: support vector machines optimize by working over dual variables and inspecting for primal weights, and logistic regression works over primal variables and inspects for data weights.   But this is a bit vague.</p>
<p>At this point we need to get a bit more explicit and precise.</p>
<h2>Precise functions used</h2>
<p>In general we will write our training data as a sequence of n-vectors.  Our points will be named u(1) &#8230; u(m).   For each i = 1..m we also know which category the training point is labeled with.  We will encode this as y(1) &#8230; y(m) where each y() is either +1 or -1 depending on the training label.  So the example we have been working has n=2 and m=20.  Let z represent an n-vector we wish to classify.</p>
<p>Figure 6 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" alt="E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 7 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" alt="9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 8 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" alt="C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" border="0"  height="50" /><br />
</center><br />
where the w(i) are non-negative numbers picked so that<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/54765538-DADD-4353-A59A-EF3B6C42743E.jpg" alt="54765538-DADD-4353-A59A-EF3B6C42743E.jpg" border="0"  height="50" /><br />
</center><br />
is large.</p>
<p>Figure 9 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" alt="CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so &#8220;the margin&#8221; (discussed later) is large.</p>
<p>Figure 10 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" alt="1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so the margin is large.</p>
<p>All of this repetition is to emphasize the commonality of the models.  They are all of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" alt="1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) are non-negative numbers called the &#8220;support weights&#8221; and k(,) is a function mapping pairs of vectors to numbers and is called &#8220;a kernel.&#8221;  Unfortunately there are many incompatible uses of the word kernel in mathematics and statistics.  Here &#8220;kernel&#8221; is being used in the sense of <a target="_blank" href="http://en.wikipedia.org/wiki/Positive-definite_kernel">positive semi-definiteness</a>, which in particular implies k(u,u) &ge; 0 for all u.   Note that the support vector machine does not use the sign of f(z) as its decision; instead it uses which side of a constant b the value f(z) falls on.  A consequence is: for support vector machines if b is non-zero (as it almost surely will be) and the kernels all go to zero fast enough as we approach infinity (as they are designed to do) then exactly one of the learned classes is infinite and the other is a union of islands (regardless of whether this was true for the training data).</p>
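<p>To make the common form concrete, here is a minimal Python sketch.  It assumes the shared model is f(z) = sum over i of a(i) y(i) k(u(i),z) and that the decision is which side of b the value f(z) falls on; this is our reading of the surrounding text, since the formulas themselves are images.  The function names and the toy data are illustrative only.</p>

```python
import math

def gaussian_kernel(u, v, c=1.0):
    """exp(-c * ||u - v||^2), a bounded 'bump' kernel."""
    return math.exp(-c * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def model_score(z, points, labels, weights, kernel):
    """f(z) = sum_i a(i) * y(i) * k(u(i), z)  (assumed common form)."""
    return sum(a * y * kernel(u, z) for u, y, a in zip(points, labels, weights))

def classify(z, points, labels, weights, kernel, b=0.0):
    """Return +1 or -1 according to which side of the constant b f(z) falls on."""
    return 1 if model_score(z, points, labels, weights, kernel) - b >= 0 else -1

# Hypothetical toy data: two positive points near the origin, two negative ones far away.
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.5, 3.0)]
labels = [+1, +1, -1, -1]
uniform = [1.0] * len(points)  # the uniform sum model: every a(i) equal

print(classify((0.2, 0.1), points, labels, uniform, gaussian_kernel))
print(classify((3.2, 3.1), points, labels, uniform, gaussian_kernel))
```

<p>Swapping <code>uniform</code> for weights chosen by a margin-maximizing solver (and a non-zero b) is, in this reading, the only thing that separates the uniform sum model from the support vector model.</p>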
<h2>About Kernels</h2>
<p>A lot of awe and mysticism is associated with kernels, but we think they are not that big a deal.  Kernels are a combination of two good ideas; they have one important property and are subject to one major limitation.  It is also unfortunate that support vector machines and kernels are tied so tightly together.   Kernels are the idea of summing functions that imitate similarity (induce a positive-definite encoding of nearness) and support vector machines are the idea of solving a clever dual problem to maximize a quantity called margin.  Each is a useful tool even without the other.</p>
<p>The two good ideas are related and unfortunately treated as if they are the same.  The two good ideas we promised are:</p>
<ol>
<li>The &#8220;kernel trick.&#8221;  Adding new features/variables that are functions of your other input variables can change linearly inseparable problems into linearly separable problems.  For example if our points were encoded not as u(i) = (x(i),y(i)) but as u(i) = (x(i),y(i),x(i)*x(i),y(i)*y(i),x(i)*y(i)) we could easily find the exact concept y(i) &gt; x(i)*x(i), which is now a linear concept encoded as the vector (0,1,-1,0,0).  (Here x(i) and y(i) are the two coordinates of the point, not the training label.)</li>
<li>Often you don&#8217;t need the coordinates of u(i).  You are only interested in functions of distances ||u(i)-u(j)||^2 and in many cases you can get at these by inner products and relations like ||u(i)-u(j)||^2 = &lt;u(i),u(i)&gt; + &lt;u(j),u(j)&gt; &#8211; 2&lt;u(i),u(j)&gt; .</li>
</ol>
<p>We will expand on these issues later.</p>
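<p>Both ideas are small enough to check directly.  The following Python sketch re-encodes a point as in the first item and confirms that the parabola concept becomes a linear test against the weight vector (0,1,-1,0,0), then recovers a squared distance using inner products only, as in the second item.  The helper names are ours.</p>

```python
def dot(a, b):
    """Ordinary inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def expand(x, y):
    """Re-encode (x, y) as (x, y, x*x, y*y, x*y)."""
    return (x, y, x * x, y * y, x * y)

# The concept "y is greater than x*x" is linear in the expanded space.
w = (0.0, 1.0, -1.0, 0.0, 0.0)

def above_parabola(x, y):
    return dot(w, expand(x, y)) > 0

def squared_distance_from_inner_products(u, v):
    """||u - v||^2 computed from dot(u,u), dot(v,v) and dot(u,v) only."""
    return dot(u, u) + dot(v, v) - 2.0 * dot(u, v)

print(above_parabola(1.0, 2.0))   # 2 is above 1*1
print(above_parabola(2.0, 3.0))   # 3 is not above 2*2
print(squared_distance_from_inner_products((1.0, 2.0), (4.0, 6.0)))
```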
<p>The important property is that kernels look like inner products in a transformed space.  The definition of a kernel is: there exists a magic function phi() such that for all u,v:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" alt="F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" border="0"  height="20" /> .<br />
</center><br />
This means that k(.,.) is behaving like an inner product in some (possibly unknown) space.  The important consequence is the positive semi-definiteness, which implies k(u,u)&ge;0 for all u (and this just follows from the fact about inner products over the real numbers that &lt;z,z&gt;&ge;0 for all z).  This is why optimization problems that use the kernel as their encoding are well formed (such as the optimization problem of maximizing margin which is how support vector machines work).  You can <a target="_blank" href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">&#8220;regularize&#8221;</a> optimization problems with a kernel penalty because it behaves a lot like a norm.  Without the positive semidefinite property all of these optimization problems would be able to &#8220;run to negative infinity&#8221; or use negative terms (which are not possible from a kernel) to hide high error rates.  The limits of the kernel functions (not being able to turn distance penalties into bonuses) help ensure that the result of optimization is actually useful (and not just a flaw in our problem encoding).</p>
<p>And this brings us to the major limitation of kernels.  The phi() transform can be arbitrarily magic except when transforming one vector it doesn&#8217;t know what the other vector is and phi() doesn&#8217;t even know which side of the inner product it is encoding.   That is, kernels are not as powerful as any of the following forms:</p>
<ul>
<li>snooping (knowing the other):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" alt="1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" border="0"  height="20" />
</li>
<li>positional (knowing which part of inner product mapping to):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" alt="D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" border="0" height="20" />
</li>
<li>fully general:<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" alt="845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" border="0" height="20" />
</li>
</ul>
<p>Not everything is a kernel.</p>
<p>Some non-kernels are:</p>
<ol>
<li> k(u,v) = -c for any c>0</li>
<blockquote><p>
A consequence of the fact that no negative number can be written as a sum of squares or as the limit of a sum of squares in the reals.
</p></blockquote>
<li> k(u,v) = ||u-v||<br />
<blockquote><p>
This can be shown as follows.  Suppose k(u,v) is a kernel with k(u,u) = 0 for all u (which is necessary to try and match ||u-v|| as ||u-u|| = 0 for all u).  By the definition<br />
of kernels k(u,u) = &lt; phi(u), phi(u) &gt; for some real vector valued function phi(.).  But, by the properties of the real inner product &lt;.,.&gt;, this means phi(u) is<br />
the zero vector for all u.  So k(u,v) = 0 for all u,v and does not match ||u-v|| for any u,v such that u &ne; v.
</p></blockquote>
</li>
</ol>
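<p>A quick way to expose a non-kernel is to find coefficients that make the quadratic form sum over i,j of a(i) a(j) k(u(i),u(j)) negative.  For a true kernel this sum equals the squared length of the vector sum of a(i) phi(u(i)), so it can never be negative.  The sketch below (plain Python, names ours) finds such witnesses for the two non-kernels above and shows the ordinary inner product passing the same test.</p>

```python
import math

def quadratic_form(kernel, points, coeffs):
    """sum_i sum_j a_i * a_j * k(u_i, u_j); non-negative for every true kernel."""
    return sum(ai * aj * kernel(ui, uj)
               for ui, ai in zip(points, coeffs)
               for uj, aj in zip(points, coeffs))

def negative_constant(u, v):
    return -1.0

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

points = [(0.0,), (1.0,)]
print(quadratic_form(negative_constant, points, [1.0, 0.0]))  # negative: not a kernel
print(quadratic_form(distance, points, [1.0, -1.0]))          # negative: not a kernel
print(quadratic_form(inner_product, points, [1.0, -1.0]))     # non-negative, as required
```

<p>A single negative value is a proof of failure; a non-negative value for one choice of points and coefficients proves nothing by itself.</p>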
<p>There are some obvious kernels:</p>
<ol>
<li> k(u,v) = c for any c &ge; 0 (non-negative constant kernels) </li>
<li> k(u,v) = &lt; u , v &gt;  (the identity kernel) </li>
<li> k(u,v) = f(u) f(v) for any real valued function f(.) </li>
<li> k(u,v) = &lt; f(u) , f(v) &gt; for any real vector valued function f(.) (again, the definition of a kernel)</li>
<li> k(u,v) = transpose(u) B v where B is any symmetric positive semi-definite matrix</li>
</ol>
<p>And there are several subtle ways to build new kernels from old.  If q(.,.) and r(.,.) are kernels then so are:</p>
<ol>
<li> k(u,v) = q(u,v) + r(u,v)</li>
<li> k(u,v) = c q(u,v) for any c &ge; 0 </li>
<li> k(u,v) = q(u,v) r(u,v) </li>
<li> k(u,v) = q(f(u),f(v)) for any real vector valued function f(.)</li>
<li> k(u,v) = lim_{j-> infinity} q_j(u,v) where q_j(u,v) is a sequence of kernels and the limit exists.</li>
<li> k(u,v) = p(q(u,v)) where p(.) is any polynomial with all non-negative coefficients.</li>
<li> k(u,v) = f(q(u,v)) where f(.) is any function with an absolutely convergent Taylor series with all non-negative coefficients.</li>
</ol>
<p>Most of these facts are taken from the excellent book: John Shawe-Taylor and Nello Cristianini&#8217;s <a target="_blank" href="http://www.kernel-methods.net/">&#8220;Kernel Methods for Pattern Analysis&#8221;</a>, Cambridge 2004.  We are allowing<br />
kernels of the form k(u,v) = &lt; phi(u), phi(v) &gt; where phi(.) is mapping into an infinite dimensional vector space (like a series).  Most of these facts can be checked by<br />
imagining how to alter the phi(.) function.  For example to add two kernels just build a larger vector with enough slots for all of the coordinates for the vectors encoding<br />
the phi(.)&#8217;s of the two kernels you are trying to add.  To scale a kernel by c multiply all coordinates of phi() by sqrt(c).  </p>
<p>Multiplying two kernels is the trickiest.  Without loss of generality assume q(u,v) = &lt; f(u) ,f(v) &gt; and r(u,v) = &lt; g(u), g(v) &gt; and f(.) and g(.) are both mapping into the same finite dimensional vector space R^m (any other situation can be simulated or approximated by padding with zeros and/or taking limits).  Imagine a new vector function p(.) that maps into R^{m*m} such that p(z)_{m*(i-1) + j} = f(z)_i g(z)_j .  It is easy to check that k(u,v) = &lt; p(u) , p(v) &gt; is a kernel and k(u,v) = q(u,v) r(u,v). (The reference proof uses tensor notation and the Schur product, but these are big hammers mathematicians use when they don&#8217;t want to mess around with re-encoding indices).</p>
<p>Note that one of the attractions of kernel methods is that you never have to actually implement any of the above constructions.  What you do is think in terms of sub-routines (easy for computer scientists, unpleasant for mathematicians).  For instance: if you had access to two functions q(.,.) and r(.,.) that claim to be kernels and you wanted the product kernel you would, when asked to evaluate the kernel, just get the results for the two sub-kernels and multiply (so you never need to see the space phi(.) is implicitly working in).</p>
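<p>The product construction and the sub-routine view can be compared side by side.  In this Python sketch f(.) and g(.) are small made-up feature maps (our assumption, chosen only for illustration); <code>product_kernel</code> multiplies the two kernel values, while p(.) builds the explicit vector of all pairwise products described above.</p>

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def f(z):
    """Hypothetical feature map for q: q(u, v) = dot(f(u), f(v))."""
    return (z[0], z[1])

def g(z):
    """Hypothetical feature map for r: r(u, v) = dot(g(u), g(v))."""
    return (1.0, z[0] * z[1])

def q(u, v):
    return dot(f(u), f(v))

def r(u, v):
    return dot(g(u), g(v))

def product_kernel(u, v):
    """The sub-routine view: evaluate both kernels and multiply."""
    return q(u, v) * r(u, v)

def p(z):
    """Explicit map into R^(m*m): entry m*(i-1)+j holds f(z)_i * g(z)_j."""
    return tuple(fi * gj for fi in f(z) for gj in g(z))

u, v = (1.0, 2.0), (3.0, -1.0)
print(product_kernel(u, v), dot(p(u), p(v)))  # the two values agree
```

<p>The explicit map grows to m*m coordinates while the sub-routine version stays two kernel calls and one multiplication, which is the practical appeal of never building phi(.).</p>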
<p>This implicitness can be important (though far too much is made of it).  For example the Gaussians we have been using throughout are kernels, but kernels of infinite dimension, so we can not directly represent the space they live in.  To see the Gaussian is a kernel notice the following:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" alt="926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" border="0"  height="20" /> .<br />
</center> </p>
<p>And for all c &ge; 0 the right-hand side is a product of kernels (the first factor because the Taylor series of exp() is absolutely convergent with non-negative coefficients, and the remaining two factors together are of the f(u) f(v) form we listed as an obvious kernel).  The trick looks like magic, but the idea is to use the fact that Euclidean squared distance breaks nicely into dot-products ( ||u-v||^2 = &lt; u, u &gt; + &lt; v, v &gt; &#8211; 2 &lt; u , v &gt;) and exp() converts addition to multiplication.  It is rather remarkable that kernels (which are a generalization of inner products that induce half open spaces) can encode bounded concepts (more on this later when we discuss the VC dimension of the Gaussian).</p>
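<p>The factorization is easy to confirm numerically.  The sketch below assumes the Gaussian is written exp(-c ||u-v||^2) and that the three factors are exp(2c &lt;u,v&gt;), exp(-c &lt;u,u&gt;) and exp(-c &lt;v,v&gt;); that is our reading of the formula image, which is not reproduced here.</p>

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def gaussian(u, v, c):
    """exp(-c * ||u - v||^2) computed directly from the difference vector."""
    diff = [ui - vi for ui, vi in zip(u, v)]
    return math.exp(-c * dot(diff, diff))

def gaussian_factored(u, v, c):
    """The same value as a product of three factors built from inner products."""
    cross = math.exp(2.0 * c * dot(u, v))   # kernel via the Taylor series of exp()
    left = math.exp(-c * dot(u, u))         # depends on u only
    right = math.exp(-c * dot(v, v))        # depends on v only
    return cross * left * right

u, v, c = (0.3, -1.2), (1.1, 0.4), 0.7
print(gaussian(u, v, c), gaussian_factored(u, v, c))
```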
<h2>Back to Support Vector Machines</h2>
<p>We can now restate what a support vector machine is: it is a method for picking data weights so that the modeling function for a given kernel has a maximum margin.  The margin is the minimal distance between a training example and the model&#8217;s boundary between blue and red.  Notice the 4 support vectors in figure 10 are not &#8220;tight up against the boundary,&#8221; but are all the same (minimal) distance from the boundary.  This distance is called the margin and the area near the boundary is obviously a place the model has uncertainty (so it makes sense to keep the boundary as far as possible from the training data).  The great benefit of the support vector machine is that with access only to the data labels and the kernel function (in fact only the kernel function evaluated at pairs of training data points) the support vector machine can quickly solve for the optimal margin and data weights achieving this margin.</p>
<p>For fun let&#8217;s plug in a new kernel called the cosine kernel:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" alt="328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" border="0"  height="43" /><br />
</center><br />
(c &ge; 0).</p>
<p>This kernel can be thought of as having a phi(.) function that takes a vector z and adds an extra coordinate of sqrt(c) and then projects the resulting vector onto the unit sphere.  This kernel induces concepts that look like parabolas, as seen in figure 11.</p>
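<p>The phi(.) description above can be checked against a closed form.  The sketch below assumes the cosine kernel is (&lt;u,v&gt; + c) / sqrt((&lt;u,u&gt; + c)(&lt;v,v&gt; + c)), which is what the described phi(.) implies; the formula image itself is not reproduced here.  We take c strictly positive so the zero vector does not cause a division by zero.</p>

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(z, c):
    """Append sqrt(c) as an extra coordinate, then scale to unit length."""
    extended = tuple(z) + (math.sqrt(c),)
    norm = math.sqrt(dot(extended, extended))
    return tuple(x / norm for x in extended)

def cosine_kernel(u, v, c):
    """Assumed closed form: (dot(u,v) + c) / sqrt((dot(u,u) + c) * (dot(v,v) + c))."""
    return (dot(u, v) + c) / math.sqrt((dot(u, u) + c) * (dot(v, v) + c))

u, v, c = (1.0, 2.0), (-0.5, 3.0), 2.0
print(cosine_kernel(u, v, c), dot(phi(u, c), phi(v, c)))  # the two values agree
print(cosine_kernel(u, u, c))  # every point has kernel norm 1
```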
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelExampleConcept.png" alt="CosineKernelExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 11: Cosine example concept</p>
<p></center></p>
<p>Figure 12 shows the sum-concept model (add all discount functions with the same weight). In this case it is far too broad: averaging the cosine concepts (which are not bounded concepts like the Gaussians) creates an overly wide model.  The model not only generalizes poorly (fails to match the shape of the truth diagram), it also gets some of its own training data wrong.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelKernelSum1.png" alt="CosineKernelKernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 12: Cosine kernel sum model</p>
<p></center></p>
<p>And figure 13 shows the excellent fit returned by the support vector machine: a fit determined by only 4 support vectors (indicated by larger labels).  Also notice the support vector machine using unbounded concepts generated a very good unbounded model (unbounded in the sense that both the blue and red regions are infinite).  By changing the kernel or discount functions/concepts we changed the inductive bias.  So any knowledge of what sort of model we want (one class bounded or not) should greatly influence our choice of kernel functions (since the support vector machine can only pick weights for the kernel functions, not tune their shapes or bandwidths).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelSVM.png" alt="CosineKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 13: Cosine Support Vector model</p>
<p></center></p>
<p>Another family of kernels (which I consider afflictions) are the &#8220;something for nothing&#8221; kernels.  These are kernels of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" alt="C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" border="0" height="20" /><br />
</center><br />
or higher powers or other finishing functions.  The idea is that if you were to look at the expansion of these kernels you would see lots of higher order terms (powers of u_i and v_j)<br />
and if these terms were available to the support vector machine it could use them.  It is further claimed that in addition to getting new features for free you don&#8217;t use up degrees<br />
of freedom exploiting them (so you cross-validate as well as for simpler models).   Both these claims are fallacious- you can&#8217;t fully use the higher order terms because they are entangled with other terms that are not orthogonal to the outcome, and the complexity of a kernel is not quite as simple as degrees of freedom (proofs about support vector machines are stated in terms of margin, not in terms of degrees of freedom or even in terms of VC dimension).  The optimizer in the SVM does try hard to make a good hypothesis using the higher order terms- but due to the forced relations among the terms you get best fits like figure 14.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/SquareKernelSVM.png" alt="SquareKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 14: Squared magic kernel support vector model</p>
<p></center></p>
<p>If you want higher order terms I feel you are much better off performing a primal transform so the terms are available in their most general form.   That is, re-encode the vector u = (x,y) as the larger vector (x,y,x*y,x*x,y*y).  You have to be honest: if you are trying to fit more complicated functions you are searching a larger hypothesis space so you need more data to falsify the bad hypotheses.   You can be efficient (don&#8217;t add terms you don&#8217;t think you can use as they make generalization harder) but you can&#8217;t get something for nothing (even with kernel methods).</p>
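<p>To see what &#8220;forced relations among the terms&#8221; can look like, here is a sketch using (1 + &lt;u,v&gt;)^2 as a stand-in for this family; that choice is our assumption, and the exact kernel behind figure 14 may differ.  Expanding it in two dimensions gives a fixed feature vector whose relative scalings (the sqrt(2) factors) are set by the kernel, whereas the primal re-encoding leaves every term with its own free weight.</p>

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def squared_kernel(u, v):
    """Stand-in 'something for nothing' kernel: (1 + dot(u, v)) ** 2."""
    return (1.0 + dot(u, v)) ** 2

def implied_features(z):
    """Features implied by the kernel; the sqrt(2) scalings are not ours to choose."""
    x, y = z
    s = math.sqrt(2.0)
    return (1.0, s * x, s * y, x * x, y * y, s * x * y)

def primal_features(z):
    """The explicit re-encoding from the text: each term gets its own weight."""
    x, y = z
    return (x, y, x * y, x * x, y * y)

u, v = (0.5, -2.0), (1.5, 0.25)
print(squared_kernel(u, v), dot(implied_features(u), implied_features(v)))
print(primal_features(u))
```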
<h2>Back to margin</h2>
<p>Large margin (having a large distance between the decision boundary and all training examples) is a good idea.  At the simplest level it means anything close to a training data point gets classified correctly (so the model is immune to an amount of fuzz proportional to the margin width).  </p>
<p>We get the following remarkable generalization theorem (this one is the hard-margin SVM version; we weaken the result a bit to make it easier to state).  Suppose we train a model using kernel k(.,.) on m training examples u(1),&#8230;,u(m) with training labels y(1)&#8230;y(m).  Further assume the hidden true concept is separable with respect to our kernel and data distribution with a margin of at least w (that is for any sample we draw we can find a SVM model that classifies all of the training data correctly and sees a margin of at least w).  Or more simply: assume w is behaving like a constant with respect to m (i.e. it is not going down as we increase sample size). Then with probability at least 1-d the generalization error (that is error seen on new test points not seen during training, but drawn from the same distribution as the training examples) is no more than:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" alt="1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" border="0" height="65" /><br />
.<br />
</center><br />
(we take this from Theorem 7.22 of &#8220;Kernel Methods for Pattern Analysis&#8221;, page 215).</p>
<p>If we further assume the expected value of k(u,u) under the training distribution exists (i.e. we don&#8217;t have too many examples of very high kernel norm according to the distribution we are training with respect to) and is stable (not taking a lot of value from rare instances) then we can read off the meaning of this upper bound.  Assuming we can always build a model from the training data of margin at least w then the generalization error (excess error we expect to see on classifying similar points) is falling at a rate inversely proportional to the square root of the amount of training data we have.  This is in one sense amazing- we have a bound on how well we are going to do on data we have not seen.  And this bound is also in some sense &#8220;best possible&#8221; because we need around this much data to even confirm such an accuracy (see <a target="_blank" href="http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/">What is a large enough random sample?</a> ).  Also notice how the dimension of the implicit feature space (which can in fact be infinite) does not enter into the bound.</p>
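<p>A small calculation shows the scaling.  The sketch below assumes the bound has the form (4/(m w)) sqrt(sum over i of k(u(i),u(i))) + 3 sqrt(ln(2/d)/(2m)), which is the cited theorem as we recall it; the formula image is not reproduced here, so check the book before relying on the constants.  We use a kernel with k(u,u) = 1 for every u (true of the Gaussian) and hold the margin w fixed.</p>

```python
import math

def margin_bound(kernel_diagonal, margin, d):
    """Assumed form: 4/(m*w) * sqrt(sum_i k(u_i, u_i)) + 3 * sqrt(ln(2/d) / (2*m))."""
    m = len(kernel_diagonal)
    margin_term = (4.0 / (m * margin)) * math.sqrt(sum(kernel_diagonal))
    confidence_term = 3.0 * math.sqrt(math.log(2.0 / d) / (2.0 * m))
    return margin_term + confidence_term

# k(u,u) = 1 for every training point, margin held constant at 0.5 (hypothetical values).
for m in (1000, 10000, 100000):
    print(m, margin_bound([1.0] * m, margin=0.5, d=0.05))
```

<p>Under these assumptions each hundred-fold increase in m shrinks the bound ten-fold, and the bound only becomes small once m is large compared with 1/w^2.</p>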
<p>However, the bound is also in some sense to be expected.  It depends on the following strong assumptions (usually not remembered or stated):</p>
<ol>
<li>Test and training data must be generated in the same manner (i.e. come from the same distribution, this is usually stated but not obeyed).</li>
<li>We must not have collapsing margin as the number of data points goes up (so there must actually be a moat between the positive and negative examples such as portrayed in figure 1 where the blurry boundary was a region we never generated training or test examples from).</li>
<li>The expected value of k(u,u) must be bounded for examples drawn from our training distribution.  For example this is not true even for the identity kernel if the u are just numbers drawn from a <a target="_blan" href="http://en.wikipedia.org/wiki/Cauchy_distribution">Cauchy distribution</a>.</li>
</ol>
<p>Also it is a <em>posterior</em> estimate of expected generalization error.  That is, we only get a bound after we plug in some facts found from fitting a model to the training data (the margin and the expected value of k(u,u)).  It is not like the famous <em>prior</em> bounds based on VC dimension that depend only on the structure of the kernel k(,) and not the distribution of training data or labels (beyond the concept being expressible by the kernel, which is itself a <em>posterior</em> fact).  The thing to remember for all posterior estimates is you can get such estimates at a small expense in statistical efficiency (i.e. you waste some data) and a moderate computational blow-up by cross validation (and with essentially no math and many fewer assumptions).  In fact the cross validation and hold-out estimates may be much tighter than calculated bounds (and are certainly easier to explain).</p>
<p>When we say we weakened the result we mean we have added some stated conditions that are not strictly needed for the bound to be true.  One of these assumptions is that the margin is not collapsing (the bound remains true even for a collapsing margin; it is just not useful).  We added this assumption because most readings of the bound (or presumed applications) implicitly assume the bound remains useful (which means something near our additional assumptions must also hold).  The reader is really lured to think of the margin w as a constant depending only on the kernel when in fact it depends on the kernel, data distribution and number of training examples.  For example: consider a very simple one dimensional classification problem:  learning the concept x &ge; 0 from labeled training data drawn uniformly from the interval [-0.5,0.5] using a Gaussian kernel.  The support vector models are going to be essentially picking two neighboring points in the observed training sample that have opposite labels and saying the concept boundary is between them.  In this simple case the margin is collapsing: as m (the number of training examples) increases the expected margin width is O(1/m).  This renders the bound calculated above useless (drives it above 1 as the number of training examples increases).  The model is good but the bound is not useful.  This is because the training distribution is discovering points closer and closer to the concept boundary (a necessary discontinuity) as the number of training examples grows.  This is why we had to add the additional (weakening) assumption that not only do we need to assume the concept to be learned is separable with respect to the kernel we are using, but we must further assume the data-distribution (determining where test and training data come from) stays some minimal distance away from the decision surface (i.e. together the data distribution and underlying concept have a moat, as in our figure 1 example).  
We will phrase this as &#8220;the concept is wide margin separable with respect to the training distribution&#8221; (implicitly we need to know the kernel, but it is the training distribution that is critical). This is a very strong assumption (not true for all classification problems) but is often true when the classification task is to &#8220;un-mix a mixture&#8221; (the observed data is coming from two different sources and the training labels are which source they come from, in this case you often do have a margin).  Obviously a soft-margin SVM deals better with this (some of the issue is due to our use of hard-margin)- but it is an interesting exercise to think why you would expect your learned  model to have a large margin if your underlying concept and training data do not.</p>
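<p>The collapsing margin in the one dimensional example is easy to simulate.  The Python sketch below draws m points uniformly from [-0.5,0.5], labels them by the concept x &ge; 0, and measures half the gap between the closest negative and closest positive points.  This input-space gap is only a proxy for the feature-space margin (for nearby points the two are roughly proportional under a Gaussian kernel); the trial count and seed are arbitrary choices of ours.</p>

```python
import random

def observed_margin(m, rng):
    """Half the gap between the closest negative and the closest positive sample."""
    xs = [rng.uniform(-0.5, 0.5) for _ in range(m)]
    negatives = [x for x in xs if x < 0]
    positives = [x for x in xs if x >= 0]
    if not negatives or not positives:
        return None  # one class is missing, so there is no boundary to place
    return (min(positives) - max(negatives)) / 2.0

def mean_margin(m, trials=200, seed=0):
    """Average observed margin over repeated samples of size m."""
    rng = random.Random(seed)
    margins = [observed_margin(m, rng) for _ in range(trials)]
    margins = [w for w in margins if w is not None]
    return sum(margins) / len(margins)

results = {m: mean_margin(m) for m in (100, 1000, 10000)}
for m, w in results.items():
    print(m, w, m * w)  # m * w staying near a constant means w shrinks like 1/m
```

<p>Plugging a margin that shrinks like 1/m into a bound whose leading term behaves like 1/(w sqrt(m)) makes that term grow like sqrt(m), which is why the bound stops being useful here even though the model itself is fine.</p>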
<p>The result remains important: support vector machines are in some sense an optimal learning procedure (up to some constants they achieve as tight a generalization error as can even be confirmed for a given training set size).  Another strong point of the support vector result is the result applies even for kernels with unbounded VC dimension (such as the Gaussian kernel).</p>
<h2>The Gaussian (or radial) kernel again</h2>
<p>We said the Gaussian kernel was of infinite VC dimension.  But we have not shown why it is true.   The fact that the only encoding we could think of off the top of our head was a power-series or a limit does not guarantee there isn&#8217;t some more clever encoding that easily demonstrates that the Gaussian kernel is of bounded VC dimension (like the identity kernel and cosine kernels are). </p>
<p>However we can see the Gaussian kernel can not have finite VC dimension because it can place an arbitrary number of bounded patches in the plane.  Thus we can build arbitrarily large sets (by placing all points far apart) that can be shattered (thus falsifying any bounded VC dimension).</p>
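<p>The shattering argument can be checked by brute force for a small set.  In this sketch (names and spacing are our choices) eight points are placed 10 units apart, and for every one of the 256 possible labelings we confirm that the plain sum of labeled Gaussian bumps reproduces the labeling.  Uniform weights and b = 0 already suffice because each bump is negligible at every other point.</p>

```python
import itertools
import math

def gaussian(u, v, c=1.0):
    return math.exp(-c * sum((a - b) ** 2 for a, b in zip(u, v)))

def sign_of_sum(z, points, labels):
    """Sign of the uniform-weight model: sum_i y(i) * k(u(i), z)."""
    total = sum(y * gaussian(u, z) for u, y in zip(points, labels))
    return 1 if total > 0 else -1

def is_shattered(points):
    """True if every +1/-1 labeling of the points is reproduced exactly."""
    for labels in itertools.product((-1, 1), repeat=len(points)):
        if any(sign_of_sum(u, points, labels) != y for u, y in zip(points, labels)):
            return False
    return True

# Points spaced 10 apart on a line: each bump contributes about exp(-100) elsewhere.
far_apart = [(10.0 * i, 0.0) for i in range(8)]
print(is_shattered(far_apart))
```

<p>Nothing in the construction depends on the number 8; adding more well separated points gives a larger shattered set, which is the sense in which no finite VC dimension can hold.</p>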
<p>One of the wonders of the support vector method is that the theory works even in this situation.</p>
<h2>Back to SVM</h2>
<p>Support vector machines seem nearly magical in their power to pick &#8220;best weights.&#8221;  However, as we have seen, these best weights may not always be much better than typical good weights (for instance: using all uniform weights).  Also it is very important to remember the support vector machine is picking the best weighting of a sum of already picked kernels.  It is not picking the best kernel shape or adjusting things like bandwidth (you must pick the best kernel ahead of time or try several kernels to find a good one).  You must remember support vector machines can only directly adjust dual weights (or relative training data weighting); this can effect some changes in hypothesis shape but the support vector machine can not directly implement arbitrary changes in hypothesis shape like a nearest neighbor or a parameterized primal method can (like logistic regression with the kernel trick of adding interaction variables). </p>
<p>Support vector machines are to be much preferred to other combiners like <a target="_blan" href="http://en.wikipedia.org/wiki/Boosting">Boosting</a>.  This is because support vector machines have explicit stopping criteria (conditions known to be true at the optimal solution that characterize the optimal solution) and explicit regularization (control of over fitting).  Boosting relies on a path argument (&#8220;do the computation in these stages&#8221;) so it is much harder to reason about and frankly the folklore that &#8220;early stop prevents over fitting in boosting&#8221; is just false (you are much better off explicitly regularizing).</p>
<p>It is also lore that random kernel transformations (like power series over the cosine kernel) are magic.  They can make data separable but they are unlikely to meet the (unstated but critical) conditions of the support vector theorem (margin being wide, margin not filling in and expectation of k(.,.) being small).</p>
<h3>A comment on optimization</h3>
<p>In this writeup we have skipped the most beautiful part of support vector machines- the form of the optimization problem used to solve them.  We strongly suggest consulting  John Shawe-Taylor and Nello Cristianini&#8217;s &#8220;Kernel Methods for Pattern Analysis&#8221;, Cambridge 2004 for this.  The writing is dense but accurate.  You can literally paste their formulas into a general solver and they work.</p>
<h2>Conclusion</h2>
<p>We have shown how kernel methods and support vector machines fit in a sequence of machine learning methods:</p>
<ol>
<li>nearest neighbor models</li>
<li>uniform sum models</li>
<li>support vector machines.</li>
</ol>
<p>We also showed a number of examples of kernels and non-kernels.  It is good to be able to quickly remember the constant &#8220;-1&#8221; is not a kernel and the constant &#8220;1&#8221; is a kernel (though not a very useful one).</p>
<p>It is our argument that support vector machines are &#8220;dual methods&#8221; (as they work in weights over the data instead of weights over coordinates in the feature space) and the move from the primal algorithm Naive Bayes to Logistic regression is similar to the move from uniform kernel sums to support vector machines.</p>
<p>Also, if you want to directly manipulate shape (like picking bandwidth) you must in some sense use a primal method (this isn&#8217;t what support vector machines are for).</p>
<p>We end with: If you have some feeling of both what a method can and can not do then you have some understanding of the method.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/' rel='bookmark' title='Do your tools support production or complexity?'>Do your tools support production or complexity?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Increase your productivity</title>
		<link>http://www.win-vector.com/blog/2011/09/increase-your-productivity/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=increase-your-productivity</link>
		<comments>http://www.win-vector.com/blog/2011/09/increase-your-productivity/#comments</comments>
		<pubDate>Sat, 24 Sep 2011 17:24:29 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Public Service Article]]></category>
		<category><![CDATA[Productivity]]></category>
		<category><![CDATA[Training]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1759</guid>
		<description><![CDATA[I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting. The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior.The three observations are: 1) Jacques Hadamard in &#8220;An Essay on the Psychology of [...]
]]></description>
			<content:encoded><![CDATA[<p>I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting.  The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior.<span id="more-1759"></span>  The three observations are:</p>
<p>1) Jacques Hadamard in &#8220;An Essay on the Psychology of Invention in the Mathematical Field&#8221; called out the importance of non-voluntary intuitive creative leaps that occur in rest periods between intervals of intense work and preparation.  </p>
<p>2) It has been noted again and again that what actually makes people happy (versus what they anticipate would make them happy) are activities and experiences with rising challenges (for example see Daniel Gilbert&#8217;s &#8220;Stumbling on Happiness&#8221;).  </p>
<p>3) It is folklore that a number of the greatest computer scientists are also fairly accomplished musicians.</p>
<p>And here is the punch-line: take up a skill-building hobby (in my case I am trying to learn how to draw).  You definitely enjoy it, but some part of your subconscious also resents being made to work (learning is work; don&#8217;t confuse it with repetition).  To defend itself, your subconscious then starts throwing out more and better technical ideas during periods of repose.  Jot these down (without trying to work on them).  The effect is even stronger than Hadamard&#8217;s effect (where your brain is solving problems for you to end an effort), as it is closer to the classic trick of making progress on one task by procrastinating on another task.</p>
<p>This is similar to the &#8220;left brain/right brain&#8221; ideas of the 1970s (it assumes the existence of a subconscious) but assumes far less unverified structure of a subconscious.  And here is where the &#8220;10,000 hours to mastery&#8221; effect (Malcolm Gladwell, &#8220;Outliers: The Story of Success&#8221;) works in your favor: you can use the same source of deliberate practice (remember, you have to be learning, not puttering around) for a long time.</p>
<p>I think if you are in good health and have enough energy, you can pull this trick off at will.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/09/increase-your-productivity/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The equivalence of logistic regression and maximum entropy models</title>
		<link>http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-equivalence-of-logistic-regression-and-maximum-entropy-models</link>
		<comments>http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/#comments</comments>
		<pubDate>Fri, 23 Sep 2011 16:21:09 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Calculus of Variations]]></category>
		<category><![CDATA[log-likelihood]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Max-Ent]]></category>
		<category><![CDATA[Maximum Entropy]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1753</guid>
		<description><![CDATA[Nina Zumel recently gave a very clear explanation of logistic regression (The Simpler Derivation of Logistic Regression). In particular she called out the central role of log-odds ratios and demonstrated how the &#8220;deviance&#8221; (that mysterious quantity reported by fitting packages) is both a term in &#8220;the pseudo-R^2&#8221; (so directly measures goodness of fit) [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Nina Zumel recently gave a very clear explanation of logistic regression (<a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/">The Simpler Derivation of Logistic Regression</a>).  In particular, she called out the central role of log-odds ratios and demonstrated how the &#8220;deviance&#8221; (that mysterious quantity reported by fitting packages) is both a term in &#8220;the pseudo-R^2&#8221; (so it directly measures goodness of fit) and the quantity that is actually optimized during the fitting procedure.  One great point of the writeup was how simple everything is once you start thinking in terms of derivatives (and that it isn&#8217;t so much the functional form of the sigmoid that is special as its relation to its own derivative).</p>
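<p>To make the derivative point concrete, here is a minimal Python sketch (not taken from either writeup) that numerically checks the standard identity ds/dz = s(z)(1 - s(z)) for the sigmoid s(z) = 1/(1 + exp(-z)).  The function names, the step size <code>h</code>, and the sample points are illustrative assumptions.</p>

```python
import math


def sigmoid(z):
    """Logistic sigmoid s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope_numeric(z, h=1e-6):
    """Central finite-difference estimate of ds/dz at z (h is an assumed step size)."""
    return (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)


def sigmoid_slope_identity(z):
    """Closed form s(z) * (1 - s(z)): the derivative written in terms of s itself."""
    s = sigmoid(z)
    return s * (1.0 - s)


def max_identity_gap(points):
    """Largest absolute difference between the two slope estimates over the points."""
    return max(
        abs(sigmoid_slope_numeric(z) - sigmoid_slope_identity(z)) for z in points
    )
```

<p>Calling <code>max_identity_gap</code> on a handful of points should return a value near zero, limited only by finite-difference error.  This identity is what lets the gradient of the logistic log-likelihood reduce to a sum of (observed minus predicted) times input terms.</p>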
<p>We adapt these presentation ideas to make explicit the well-known equivalence of logistic regression and maximum entropy models.<span id="more-1753"></span>In our new writeup, <a target="_blank" href="http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf">The equivalence of logistic regression and maximum entropy models</a>, we move to multi-category modeling and demonstrate how one invents something as remarkable as logistic regression.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

