<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Tutorials</title>
	<atom:link href="http://www.win-vector.com/blog/category/tutorials/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Six Fundamental Methods to Generate a Random Variable</title>
		<link>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=six-fundamental-methods-to-generate-a-random-variable</link>
		<comments>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/#comments</comments>
		<pubDate>Fri, 20 Jan 2012 19:23:38 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Ergodic Theory]]></category>
		<category><![CDATA[Markov Chains]]></category>
		<category><![CDATA[Markov Monte Carlo]]></category>
		<category><![CDATA[Random Sampling]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1925</guid>
		<description><![CDATA[Introduction To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them? The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/' rel='bookmark' title='What is a large enough random sample?'>What is a large enough random sample?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<h2> Introduction</h2>
<p>To implement many numeric simulations you need a sophisticated source of instances of random variables.  The question is: how do you generate them?  </p>
<p>The literature is full of algorithms requiring random samples as inputs or drivers (<a target="_blank" href="http://en.wikipedia.org/wiki/Conditional_random_field">conditional random fields</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Bayesian_network">Bayesian network models</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Particle_filter">particle filters</a> and so on). The literature is also full of competing methods (<a target="_blank" href="http://en.wikipedia.org/wiki/Pseudorandom_number_generator">pseudorandom generators</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy sources</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs samplers</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">the Metropolis–Hastings algorithm</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov chain Monte Carlo methods</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Bootstrapping">bootstrap methods</a> and so on).  Our thesis is: this diversity is supported by only a few fundamental methods.  And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box &#8220;brand name&#8221; technique. </p>
<p>We will discuss the half dozen basic methods that all of these techniques are derived from.<span id="more-1925"></span>To our mind all of the famous random variate generation/sampling techniques are derived from combinations of the following six fundamental methods:</p>
<ol>
<li>Physical sources.</li>
<li>Empirical resampling.</li>
<li>Pseudo random generators.</li>
<li>Simulation/Game-play.</li>
<li>Rejection Sampling.</li>
<li>Transform methods.</li>
</ol>
<p>The technical fights (such as: &#8220;is Gibbs sampling superior to, or even distinguishable from, Markov chain Monte Carlo?&#8221;) are all in the details, history and citation conventions.   Each field and particular method accretes its own traditions.  We will quickly discuss the fundamental methods we listed.  As we will see: complexity goes up as we move through the list (so at some point things are no longer fundamental but instead derived, allowing us to end the list).</p>
<h2>The Methods</h2>
<h3>Physical sources</h3>
<p>This is the most basic way (though not as practical in the computer age) to generate random variables.  Observe the flip of a real coin, shuffle actual cards, mix numbered balls or count the number of ticks from an actual radioactive source.  In all of these the randomness comes from physical principles (such as <a target="_blank" href="http://en.wikipedia.org/wiki/Chaos_theory">chaotic dynamics</a> for coin flips or <a target="_blank" href="http://en.wikipedia.org/wiki/Quantum_mechanics">quantum mechanics</a> for radioactive decay).</p>
<p>These sources are &#8220;outside of computer science&#8221; so we will say the least about them.</p>
<h3>Empirical resampling</h3>
<p>This is what used to be called &#8220;tables&#8221; (which were themselves often generated from physical processes).   The observation is that sometimes, to run a simulation, you need access to instances of random variables that are distributed in a very precise way, but you don&#8217;t have a usable description of the desired distribution.  You would think that in this case you could do nothing.  But the principle of empirical resampling is that you can approximately generate new samples by taking samples (with repetition or replacement) from an old sample.  This is the cornerstone of Bootstrap methods.</p>
<p>As an example:  suppose we were given the sample of numbers 5, 5, 10, 5, 5, which has mean equal to 6.  Further suppose we have no description of how these numbers were generated but we want to know if a mean of at least 8 is likely or unlikely for five more numbers drawn the same way.  We can approximate this by drawing many samples of size five from this original sample (allowing the same number to be in our new sample multiple times).  A resampled mean of at least 8 requires at least three of the five draws to be the 10, and each draw is the 10 with probability 1/5, so the bootstrap estimate of the probability of seeing a mean of at least 8 is around 5.8%.</p>
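<p>The resampling procedure just described can be sketched in a few lines of Java.  This is only an illustration: the class name <code>BootstrapMean</code>, the seed and the trial count are our own choices and not part of any library.  For this particular sample the estimate can be checked against an exact binomial calculation, which gives about 0.058.</p>
<pre>
```java
import java.util.Random;

final class BootstrapMean {
    /**
     * Estimates the probability that a resample (drawn with replacement,
     * same size as the original sample) has mean at least threshold.
     */
    static double estimate(double[] sample, double threshold, int trials, long seed) {
        Random rng = new Random(seed);
        int hits = 0;
        for (int t = 0; t < trials; t++) {
            double sum = 0.0;
            for (int i = 0; i < sample.length; i++) {
                // pick one of the original observations uniformly at random
                sum += sample[rng.nextInt(sample.length)];
            }
            if (sum / sample.length >= threshold) {
                hits++;
            }
        }
        return (double) hits / trials;
    }

    public static void main(String[] args) {
        double[] sample = {5, 5, 10, 5, 5};
        System.out.println(estimate(sample, 8.0, 1000000, 42L));
    }
}
```
</pre>
<p>The same routine works unchanged for samples where no closed form is available, which is the usual reason to reach for the bootstrap.</p>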
<p>This may seem trivial, but it is very important.</p>
<h3>Pseudo random generators</h3>
<p>In the computer age, to avoid the need for external tables or expensive and slow peripherals we tend to use pseudo random generators.  That is, we treat the output of deterministic iterative procedures as equivalent to true random sources.  The science of pseudo randomness has evolved from cobbled together procedures passing ad-hoc tests (such as in Knuth Volume 2) to more formal pseudo randomness based on important properties (like provably being k-wise independent) or complexity (being computationally indistinguishable from a truly random source on a time or space bounded machine).  Behind the canned routines of all of the basic &#8220;random generators&#8221; commonly available is a pseudo random source.  </p>
<p>Good references for the modern theory include: 	</p>
<ul>
<li>
&#8220;Pseudorandomness and Cryptographic Applications&#8221; Michael Luby 1996.
</li>
<li>
&#8220;Modern Cryptography, Probabilistic Proofs and Pseudorandomness&#8221; Oded Goldreich, 1999.
</li>
</ul>
<p>The most basic form of a sequential pseudo random generator is a sequence of states s(1), s(2), s(3), &#8230; where s(i+1) = g(s(i)) and g() is our deterministic function that maps state to state.  The observed random variables are then h(s(i)) where h() is some deterministic function that maps state to observables.  For example for the <a target="_blank" href="http://en.wikipedia.org/wiki/Linear_congruential_generator">linear congruential generator</a>  found in glibc we have g(x) = (1103515245*x + 12345) modulo 2^32 and h(x) = x modulo 2^30 (x an integer from 0 to 2^32 &#8211; 1).  An example application: the output of this generator, when divided by (2^30 &#8211; 1), might return numbers passably uniformly distributed in the interval [0,1].  Two such variates might be used as a uniform sample from the unit square.</p>
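<p>As a minimal sketch of the state/observable split, here is the generator with the constants quoted above.  The class and method names (<code>SimpleLcg</code>, <code>nextState</code>, <code>observe</code>) are ours, and we take the constants from the text rather than from any particular glibc release.</p>
<pre>
```java
final class SimpleLcg {
    private static final long MODULUS = 4294967296L;     // 2^32
    private static final long OBSERVABLE = 1073741824L;  // 2^30
    private long state;

    SimpleLcg(long seed) {
        state = Math.floorMod(seed, MODULUS);
    }

    /** g(): maps the current state to the next state. */
    long nextState() {
        // the product stays below 2^63, so long arithmetic does not overflow
        state = (1103515245L * state + 12345L) % MODULUS;
        return state;
    }

    /** h(): maps a state to the observed value. */
    static long observe(long s) {
        return s % OBSERVABLE;
    }

    /** A value in [0,1], obtained by dividing the observable by 2^30 - 1. */
    double nextUniform() {
        return observe(nextState()) / (double) (OBSERVABLE - 1);
    }

    public static void main(String[] args) {
        SimpleLcg lcg = new SimpleLcg(1L);
        // two consecutive variates used as a point in the unit square
        System.out.println(lcg.nextUniform() + ", " + lcg.nextUniform());
    }
}
```
</pre>
<p>Starting from state 1, the first state is 1103515245 + 12345 = 1103527590 and the first observable is 1103527590 &#8211; 2^30 = 29785766.</p>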
<p>That a simple iterated deterministic system (like the modulo arithmetic or even a physical system like coin flipping) would even superficially appear random (let alone be safe to use as a pseudo random source) turns out to be the main consequence of <a target="_blank" href="http://en.wikipedia.org/wiki/Ergodic_theory">Ergodic theory</a> (which we will touch on in a later article).  The point is: it should not be obvious (without bringing in some more theory) why you should trust pseudo-random sources.</p>
<h3>Simulation/Game-play</h3>
<p>Another fundamental method is direct simulation or game play.  Suppose we wanted a random variable that was 1 with probability equal to the odds of being dealt a full house from a standard shuffled deck of 52 cards (and zero otherwise).  We can generate such a variable by simulating shuffling a deck, drawing a hand and returning 1 if the hand drawn is a full house (and returning 0 otherwise).  Notice in this case we are combining many random variables to get a single result.</p>
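<p>A rough Java sketch of this card game follows.  The encoding (cards numbered 0 to 51, rank equal to the card number modulo 13) and the names <code>FullHouseSim</code>, <code>isFullHouse</code> and <code>draw</code> are assumptions made for the illustration.  Only the top five positions of the deck are shuffled, since those are the only cards dealt.</p>
<pre>
```java
import java.util.Random;

final class FullHouseSim {
    /** Cards are numbered 0..51; the rank of card c is c % 13. */
    static boolean isFullHouse(int[] hand) {
        int[] rankCounts = new int[13];
        for (int card : hand) {
            rankCounts[card % 13]++;
        }
        boolean three = false;
        boolean two = false;
        for (int count : rankCounts) {
            if (count == 3) three = true;
            if (count == 2) two = true;
        }
        return three && two;
    }

    /** Deals five cards from a freshly shuffled deck; returns 1 for a full house, else 0. */
    static int draw(Random rng) {
        int[] deck = new int[52];
        for (int i = 0; i < 52; i++) {
            deck[i] = i;
        }
        // partial Fisher-Yates shuffle: randomize only the five dealt positions
        for (int i = 0; i < 5; i++) {
            int j = i + rng.nextInt(52 - i);
            int tmp = deck[i];
            deck[i] = deck[j];
            deck[j] = tmp;
        }
        int[] hand = {deck[0], deck[1], deck[2], deck[3], deck[4]};
        return isFullHouse(hand) ? 1 : 0;
    }

    public static void main(String[] args) {
        Random rng = new Random(7L);
        int trials = 1000000;
        long total = 0;
        for (int t = 0; t < trials; t++) {
            total += draw(rng);
        }
        System.out.println((double) total / trials);
    }
}
```
</pre>
<p>The printed average should be near 3744/2598960 (about 0.0014), the fraction of five card hands that are full houses.</p>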
<p>One of the most important simulation techniques is Markov chain Monte Carlo methods (related to Gibbs sampling, simulated annealing and many other variations).  These methods implement a complex procedure over a stream of random inputs to generate a more difficult to achieve sequence of random outputs.</p>
<p>For example:  Let T be the set of pairs of non-negative integers x, y such that x + y &le; 1000.   We could implement a Markov chain on this set from a source of coin flips.  Given a point (x,y) in T we take three coin flips and move to new point (x&#8217;,y&#8217;) (also in T) using the following procedure:</p>
<ol>
<li>Let m = 1 if the first flip is heads and m=0 if the first flip is tails.</li>
<li>Let v = (1,0) if the second flip is heads and v=(0,1) if the second flip is tails.</li>
<li>Let d = +1 if the third flip is heads and d = -1 if the third flip is tails.</li>
<li>If (x,y) + m*d*v is in T let (x&#8217;,y&#8217;) = (x,y) + m*d*v, otherwise let (x&#8217;,y&#8217;) = (x,y) (stay put).</li>
</ol>
<p>Repeating this procedure a large number of times produces a sequence of points (x,y) such that (x,y) is distributed uniformly on T (again this follows from ergodic principles).  Following a Markov chain in this way (as a simulation or game) is a very fundamental method for generating more complicated random variates, and its correctness is something we will write more about in an article dealing with the ergodic principle (the relation of connectedness to showing averages over time equal averages over space).</p>
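<p>The four step move rule above can be sketched as a single function of the current point and three coin flips.  The names <code>TriangleWalk</code>, <code>inT</code> and <code>step</code> are ours, and <code>true</code> stands for heads.  The number of steps used in <code>main</code> is an arbitrary choice, not a claim about how long the chain needs to run.</p>
<pre>
```java
import java.util.Random;

final class TriangleWalk {
    static final int LIMIT = 1000;

    /** Membership test for T: non-negative integers with x + y at most 1000. */
    static boolean inT(int x, int y) {
        return x >= 0 && y >= 0 && x + y <= LIMIT;
    }

    /** One move of the chain from (x,y); each flip is true for heads. */
    static int[] step(int x, int y, boolean flip1, boolean flip2, boolean flip3) {
        int m = flip1 ? 1 : 0;        // move at all?
        int vx = flip2 ? 1 : 0;       // v = (1,0) on heads
        int vy = flip2 ? 0 : 1;       // v = (0,1) on tails
        int d = flip3 ? 1 : -1;       // direction along v
        int nx = x + m * d * vx;
        int ny = y + m * d * vy;
        // stay put if the proposed point leaves T
        return inT(nx, ny) ? new int[] {nx, ny} : new int[] {x, y};
    }

    public static void main(String[] args) {
        Random coin = new Random(1L);
        int[] p = {0, 0};
        for (int i = 0; i < 10000000; i++) {
            p = step(p[0], p[1], coin.nextBoolean(), coin.nextBoolean(), coin.nextBoolean());
        }
        System.out.println(p[0] + ", " + p[1]);
    }
}
```
</pre>
<p>Keeping the flips as explicit arguments makes the rule easy to check by hand, one case at a time.</p>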
<p>For simple shapes (rectangles, triangles) there are more efficient ways to generate points uniformly at random.  For squares we exploit independence and just generate the coordinates independently.  For triangles we could rejection sample from a bounding rectangle.   Or we could use a transform method: write down a counting function that indexes all the points in the triangle and generate points by index (for example it is easy to work out there are 501501 points in our example T, so if we generate a random integer uniformly from 1 to 501501 we can just pick the point with the given index as our sample).</p>
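<p>One possible indexing of the triangle is sketched below.  It lists the points row by row (all points with x = 0 first, then x = 1, and so on); this ordering and the name <code>TriangleIndex.pointAt</code> are our own choices, and any other fixed enumeration of the 501501 points would serve equally well.</p>
<pre>
```java
import java.util.Random;

final class TriangleIndex {
    static final int LIMIT = 1000;
    static final int COUNT = (LIMIT + 1) * (LIMIT + 2) / 2;  // 501501 points

    /** Returns the point with the given index, for index from 1 to COUNT. */
    static int[] pointAt(int index) {
        if (index < 1 || index > COUNT) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        int remaining = index - 1;  // zero-based offset into the enumeration
        for (int x = 0; x <= LIMIT; x++) {
            int rowSize = LIMIT - x + 1;  // y runs from 0 to LIMIT - x
            if (remaining < rowSize) {
                return new int[] {x, remaining};
            }
            remaining -= rowSize;
        }
        throw new IllegalStateException("unreachable for a valid index");
    }

    /** Draws a point uniformly by drawing its index uniformly. */
    static int[] sample(Random rng) {
        return pointAt(1 + rng.nextInt(COUNT));
    }

    public static void main(String[] args) {
        int[] p = sample(new Random(5L));
        System.out.println(p[0] + ", " + p[1]);
    }
}
```
</pre>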
<p>For general convex shapes (in high dimensions) these methods become intractable and Markov chain methods are one of the few options remaining.</p>
<h3>Rejection Sampling</h3>
<p>Rejection sampling is another way to convert one sequence of random variables into another.  If we assume we can generate a random variable according to the distribution p(x) we can &#8220;rejection sample&#8221; to a new distribution using an &#8220;acceptance function&#8221; q(x) which returns a number in the interval [0,1].  Our procedure is to repeat the following: generate x with probability p(x) and generate a random variable y uniformly in the interval [0,1]; if y &le; q(x) accept x as our answer and quit (otherwise draw a new x and repeat).</p>
<p>The distribution that rejection sampling draws from is such that if two values x and x&#8217; had relative odds of being drawn of p(x)/p(x&#8217;), then under the rejection procedure they have relative odds of (p(x)q(x))/(p(x&#8217;)q(x&#8217;)).  An important special case is when q() is always 0 or 1; in this case we are drawing with relative odds proportional to p(x) from the subset of x with q(x)=1.</p>
<p>As an example: consider the problem of trying to draw a point (x,y) such that x^2 + y^2 &lt; 1 (the open unit disk) uniformly at random.  The rejection sampling solution is: repeat the following until you have a success: generate x and y independently uniformly in the interval [-1,1]; if x^2 + y^2 &lt; 1 then accept them as our sample (otherwise repeat).  This procedure is very fast as the unit disk that represents our acceptance region has area pi and the square we are generating trials from has area 4: so we have over a 78% chance of success on each trial and expect to have to run fewer than 1.28 trials (on average) to get a sample.</p>
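<p>The unit disk example translates almost directly into code.  The sketch below assumes <code>java.util.Random</code> as the uniform source; the class name <code>DiskSampler</code> is ours.  Here the acceptance function is the 0/1 special case: a trial point is accepted exactly when it lies inside the disk.</p>
<pre>
```java
import java.util.Random;

final class DiskSampler {
    /** Draws a point uniformly from the open unit disk by rejection from the square. */
    static double[] sample(Random rng) {
        while (true) {
            // trial point from the bounding square
            double x = 2.0 * rng.nextDouble() - 1.0;
            double y = 2.0 * rng.nextDouble() - 1.0;
            if (x * x + y * y < 1.0) {
                return new double[] {x, y};  // accepted
            }
            // rejected: fall through and draw again
        }
    }

    public static void main(String[] args) {
        double[] p = sample(new Random(11L));
        System.out.println(p[0] + ", " + p[1]);
    }
}
```
</pre>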
<h3>Transform methods</h3>
<p>A transform method is used when we have the ability to generate instances of a random variable according to one distribution and we would like instances according to another distribution.</p>
<p>One method is used when we have access to the inverse of the <a target="_blank" href="http://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution function</a> of the distribution we are trying to generate.  In this case  we can use this function to convert uniform variates from the interval [0,1] into our target distribution.  The cumulative distribution function is the function cdf() where cdf(x) is the probability a random variate generated according to our distribution is less than or equal to x.  The inverse function is the function icdf() where icdf(y) is such that cdf(icdf(y)) = y.  For example the <a target="_blank" href="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a>  has an inverse cumulative distribution function icdf(y) = -ln(1-y)/lambda.  So if y is generated uniformly in the interval [0,1] then icdf(y) is a random variable generated according to the exponential distribution with parameter lambda.</p>
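<p>For the exponential distribution the inverse cumulative distribution function is a one line formula, so the whole method fits in a short sketch.  The class name <code>ExponentialSampler</code> is ours; <code>Math.log1p(-y)</code> computes ln(1-y) and is used only because it is a little more accurate for small y.</p>
<pre>
```java
import java.util.Random;

final class ExponentialSampler {
    /** icdf(y) = -ln(1 - y) / lambda, for y in [0,1). */
    static double icdf(double y, double lambda) {
        return -Math.log1p(-y) / lambda;
    }

    /** Transforms one uniform variate into one exponential variate. */
    static double sample(Random rng, double lambda) {
        return icdf(rng.nextDouble(), lambda);
    }

    public static void main(String[] args) {
        Random rng = new Random(3L);
        double lambda = 2.0;
        int n = 1000000;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += sample(rng, lambda);
        }
        // the sample mean should be close to 1/lambda
        System.out.println(sum / n);
    }
}
```
</pre>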
<p>A great example of transform methods is generating Gaussian random variables.  We could directly use the inverse cumulative distribution function method described above, but to do this we would require a special function library to perform the required calculation of the inverse cumulative distribution (or inverse of <a target="_blank" href="http://en.wikipedia.org/wiki/Error_function">erf()</a>).  Another way is the <a target="_blank" href="http://en.wikipedia.org/wiki/Marsaglia_polar_method">polar method</a>: generate x,y uniformly from the open unit disk (by, for example, rejection sampling as described earlier), set s = x^2 + y^2 and return  x*sqrt(-2 ln(s)/s),  y*sqrt(-2 ln(s)/s) as two independent Gaussian random variables.   The trick being: the distance r from the origin of a pair of independent Gaussians has a density of the form r*e^(-r*r/2), which (unlike the original Gaussian density of the form e^(-r*r/2)) leads to an elementary cumulative distribution function that is easy to invert.</p>
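<p>A sketch of the polar method, combining rejection sampling (to get a point in the disk) with the transform step, might look as follows.  The class name <code>PolarGaussian</code> is ours.  One detail not spelled out in the text: the point with s = 0 must also be rejected, since ln(0) is undefined.</p>
<pre>
```java
import java.util.Random;

final class PolarGaussian {
    /** Returns two independent standard Gaussian variates. */
    static double[] samplePair(Random rng) {
        while (true) {
            double x = 2.0 * rng.nextDouble() - 1.0;
            double y = 2.0 * rng.nextDouble() - 1.0;
            double s = x * x + y * y;
            // accept only points strictly inside the disk, excluding the origin
            if (s > 0.0 && s < 1.0) {
                double scale = Math.sqrt(-2.0 * Math.log(s) / s);
                return new double[] {x * scale, y * scale};
            }
        }
    }

    public static void main(String[] args) {
        double[] g = samplePair(new Random(123L));
        System.out.println(g[0] + ", " + g[1]);
    }
}
```
</pre>
<p>No special function library is needed: only a square root and a logarithm.</p>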
<h2>Conclusion</h2>
<p>Our thesis is: all major methods to generate random variables use aspects of the six methods we have listed here as fundamental.  At the very least you should have a fluid understanding of these methods.  You should be able to break down big &#8220;brand name&#8221; methods (like Gibbs sampling) roughly into their constituent parts (so you can reason about them).   One example: notice how ratios of probabilities enter into Markov chain Monte Carlo methods (they cause step rejections); from this you can reason if your problem has bounded ratios it is a good candidate for direct application of the technique (and if it does not you need to add some more ideas, as was demonstrated in:  <a target="_blank" href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.9794">&#8220;Sampling from Log-Concave Distributions,&#8221; Alan Frieze, Ravi Kannan, Nick Polson, Ann. Appl. Prob, 1994</a>).</p>
<p>The first two methods we discuss (physical sources and empirical re-sampling) are of the class of solutions &#8220;already have the right answer.&#8221;  Pseudo random generators are the primary way to negate the need for physical sources and resampling techniques.  Simulation, rejection sampling and transform methods are the main tools for building new distributions out of old.</p>
<p>It is a matter of taste if a given trick fits into this ad-hoc taxonomy or not.   You can invent new and better generation methods, but these methods are easily derived using ideas from the fundamental methods we mentioned here.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/' rel='bookmark' title='What is a large enough random sample?'>What is a large enough random sample?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What to do when you run out of memory</title>
		<link>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-to-do-when-you-run-out-of-memory</link>
		<comments>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 12:25:59 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Additive Combinatorics]]></category>
		<category><![CDATA[GNU sort]]></category>
		<category><![CDATA[H2]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Map Reduce]]></category>
		<category><![CDATA[Out of core]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1892</guid>
		<description><![CDATA[A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working &#8220;out of core&#8221; or what you should do when you run out of memory. Early computers were most limited by their paltry memory sizes. von Neumann himself [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory.  We present here some general strategies for working &#8220;out of core&#8221; or what you should do when you run out of memory.</p>
<p>Early computers were most limited by their paltry memory sizes.  von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single typewritten page (much more memory than the few hundred words available to the <a target="_blank" href="http://en.wikipedia.org/wiki/ENIAC">ENIAC</a>).   The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.</p>
<p><center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/IMG_0062.jpg" alt="IMG 0062" title="IMG_0062.JPG" border="0" height="300" /></p>
<p/>
SDC 920 computer, Computer History Museum, Mountain View CA<br />
</center></p>
<p>Historically computer scientists have concentrated on streaming or online algorithms (that is, algorithms that work with the data in the order it is available and use limited memory).  For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort).  The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce.  So in one sense databases and Map Reduce are different APIs on top of very related technologies (journaling, splitting and merging).  Replicating data (or even delaying duplicate elimination) that is already &#8220;too large to handle&#8221; may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude; compare a 2 terabyte drive to an 8 gigabyte memory stick).<span id="more-1892"></span>In our web age, the typical big data problems are inverting indices (for fast search lookup) and computing term frequencies (for <a target="_blank" href="http://en.wikipedia.org/wiki/Okapi_BM25">TF/IDF scoring</a> or for things like <a target="_blank" href="http://en.wikipedia.org/wiki/Naive_bayes">Naive Bayes classifiers</a>).  Since these are over-worked examples we will use a mathematical problem from <a href="http://terrytao.wordpress.com/books/additive-combinatorics/">&#8220;Additive Combinatorics&#8221;, Terence Tao, Van Vu, (ISBN-13: 9780521853866; ISBN-10: 0521853869)</a>.</p>
<p>We take one problem from the field of additive combinatorics: sum sets.   For two sets of integers A = {a_1, &#8230;, a_s} and B = {b_1, &#8230;, b_t} the sum set is defined as the set (without repetition) A + B = { a_i + b_j | i = 1,&#8230;,s, j = 1,&#8230;,t }.   For sets of integers the size of A+B (denoted as |A+B|) can vary from |A| + |B| &#8211; 1 to |A| * |B| depending on the relations between the numbers in A and B (or the structure of A and B).  If instead of working with integers we work with integers <a target="_blank" href="http://en.wikipedia.org/wiki/Modular_arithmetic">modulo p</a> where p is a prime number (or equivalently we treat all numbers as remainders of division by p) then by the Cauchy-Davenport inequality we have |A + B| &ge; min(|A|+|B|-1,p) (so essentially the same result, except when we run out of possible integers modulo p).</p>
<p>For example we would say (working modulo 19) that [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18].   In fact there are 19 pairs of sets that add up to  [0, 1, 10, 11, 12, 14, 15, 16, 18] ( for instance [5, 6, 9, 10] + [5, 6, 9, 10] is another such pair).  Just to move forward assume we were interested in determining how many ways a set can be written as the sum of a pair of sets (each of size 4).  For a given sum result we might try search or <a target="_blank" href="http://en.wikipedia.org/wiki/Integer_programming">integer programming</a> to find all possible summands.  However, if we want the statistics on all sums simultaneously, we can work much quicker and without need for big gun mathematics.</p>
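<p>To make the definition concrete, here is a small sketch that computes a sum set modulo p and reproduces the example above.  The class and method names (<code>SumSet.sumSet</code>) are ours and are not taken from the code accompanying this article.</p>
<pre>
```java
import java.util.Arrays;

final class SumSet {
    /** Returns A + B modulo p as a sorted array without repetition. */
    static int[] sumSet(int[] a, int[] b, int p) {
        boolean[] present = new boolean[p];
        int size = 0;
        for (int ai : a) {
            for (int bj : b) {
                int s = Math.floorMod(ai + bj, p);
                if (!present[s]) {
                    present[s] = true;
                    size++;
                }
            }
        }
        int[] result = new int[size];
        int k = 0;
        for (int s = 0; s < p; s++) {
            if (present[s]) {
                result[k++] = s;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] c = sumSet(new int[] {0, 1, 4, 5}, new int[] {10, 11, 14, 15}, 19);
        System.out.println(Arrays.toString(c));
    }
}
```
</pre>
<p>The result has 9 elements, which is consistent with the Cauchy-Davenport bound min(4 + 4 &#8211; 1, 19) = 7.</p>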
<p>The straightforward solution in this case is a bit of code like:</p>
<p><code></p>
<pre>
for set A from all possible sets of 4 integers from 0 to 18
    for set B from all possible sets of 4 integers from 0 to 18
        let set C = A + B modulo 19
        use set C as a key and add the pair (A,B) to the list associated with C
for all key sets C tracked above
     compute the size of the list of summand pairs found for C
print how many result sets C have a given number of summand pairs
</pre>
<p></code></p>
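<p>As a hedged aside, the counting half of the pseudocode can be sketched without any collector at all if each set is encoded as a 19 bit mask and we keep only a count per result set (not the list of summand pairs).  This is not the implementation discussed in this article: discarding the pairs sidesteps the storage problem, so it illustrates the loop structure only.  All names here (<code>SumSetCensus</code>, <code>maskOf</code>, <code>shift</code>, <code>census</code>) are ours.</p>
<pre>
```java
final class SumSetCensus {
    static final int P = 19;
    static final int FULL = (1 << P) - 1;  // the low 19 bits

    /** Encodes a set of residues modulo 19 as a bitmask (bit k set means k is in the set). */
    static int maskOf(int... elements) {
        int mask = 0;
        for (int e : elements) {
            mask |= 1 << Math.floorMod(e, P);
        }
        return mask;
    }

    /** Adds k (0 to 18) to every element of the set: a 19 bit rotation. */
    static int shift(int mask, int k) {
        // bits pushed past position 18 wrap around; anything higher is masked off
        return ((mask << k) | (mask >>> (P - k))) & FULL;
    }

    /** The sum set A + B modulo 19 for two encoded sets. */
    static int sumSet(int a, int b) {
        int c = 0;
        for (int k = 0; k < P; k++) {
            if ((a & (1 << k)) != 0) {
                c |= shift(b, k);
            }
        }
        return c;
    }

    /** All masks with exactly four bits set: the C(19,4) = 3876 sets of size four. */
    static int[] fourSubsets() {
        int[] out = new int[3876];
        int n = 0;
        for (int m = 0; m <= FULL; m++) {
            if (Integer.bitCount(m) == 4) {
                out[n++] = m;
            }
        }
        return out;
    }

    /** counts[c] is the number of ordered pairs (A,B) of four element sets with A + B = c. */
    static int[] census() {
        int[] sets = fourSubsets();
        int[] counts = new int[FULL + 1];
        for (int a : sets) {
            for (int b : sets) {
                counts[sumSet(a, b)]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int distinct = 0;
        long pairs = 0;
        for (int c : census()) {
            if (c > 0) distinct++;
            pairs += c;
        }
        System.out.println(distinct + " distinct sums from " + pairs + " ordered pairs");
    }
}
```
</pre>
<p>The loop visits 3876 * 3876 = 15023376 ordered pairs.  Once the pairs themselves must be stored per result set, memory becomes the limiting resource, which is the problem the collectors in this article address.</p>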
<p>The relations C which have a summand of form A can be collected by any bit of Java code implementing the interface below (just call <code>insertReln(C,(A,B))</code>  to store the relations and then <code>entries()</code> to get them back).  A small interface that declares the needed methods is given below:</p>
<p><code></p>
<pre>
import java.io.IOException;
import java.util.Map;

public interface RelnCollector&lt;A,B&gt; {
	void insertReln(A a, B b) throws IOException;
	Iterable&lt;Map.Entry&lt;A,Iterable&lt;B&gt;&gt;&gt; entries() throws IOException, InterruptedException;
	void close() throws IOException;
}
</pre>
<p></code></p>
<p>An in-memory relation collector is trivially implemented by a nested map adjusted to declare the above interface, as we see in the next code snippet:</p>
<pre>
public final class InMemoryRelnCollector&lt;A,B&gt;
	implements RelnCollector&lt;A,B&gt; {
	private final DataAdapter&lt;A&gt; adapterA;
	private final DataAdapter&lt;B&gt; adapterB;
	private Map&lt;A,Iterable&lt;B&gt;&gt; atoBs;

	public InMemoryRelnCollector(final DataAdapter&lt;A&gt; adapterA,
		final DataAdapter&lt;B&gt; adapterB) {
		this.adapterA = adapterA;
		this.adapterB = adapterB;
		atoBs = new TreeMap&lt;A,Iterable&lt;B&gt;&gt;(this.adapterA);
	}

	@Override
	public void insertReln(final A a, final B b) {
		Set&lt;B&gt; set = (Set&lt;B&gt;) atoBs.get(a);
		if(null==set) {
			set = new TreeSet&lt;B&gt;(adapterB);
			atoBs.put(a,set);
		}
		if(!set.contains(b)) {
			set.add(b);
		}
	}

	@Override
	public Iterable&lt;Map.Entry&lt;A,Iterable&lt;B&gt;&gt;&gt; entries() {
		return atoBs.entrySet();
	}

	@Override
	public void close() {
		atoBs = null;
	}
}
</pre>
<p>The great savings in time is that we work from summands to result sums (but keep many sets of results indexed by result sets).  Thus we don&#8217;t have to figure out how to invert the sum operation (as we do our bookkeeping forward).  However, this very bookkeeping may overwhelm us.  As we can see below, a Java implementation of the above procedure runs out of memory when trying to characterize which sets of integers modulo 19 can be split into two sets of size four (and how many ways each such set can be split).  However, this was with the deliberately small default allocation of memory available to Java processes (so for this particular instance we could avoid trouble by allocating more memory; we ran out of allocation, not system memory).  What happens when we don&#8217;t manage memory is illustrated below:</p>
<pre>
Start	com.winvector.consolidate.impl.InMemoryRelnCollector
	Tue Dec 06 10:04:38 PST 2011
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.TreeMap.put(TreeMap.java:554)
	at java.util.TreeSet.add(TreeSet.java:238)
	at com.winvector.consolidate.example.AdditiveSets.sum(AdditiveSets.java:25)
	at com.winvector.consolidate.example.AdditiveSets.main(AdditiveSets.java:55)
</pre>
<p>An out of core solution can solve the entire problem without needing any additional system memory (just some disk space which is still of a much greater size than primary memory).  The complete calculated result is given below:</p>
<pre>
Examining sums of 4 integers chosen from 0 through 18 modulo 19.
Start	com.winvector.consolidate.impl.FileRelnCollector
	Tue Dec 06 09:54:20 PST 2011
	Inserted 15023376 relations.
 [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 1, 15, 16] + [0, 14, 15, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 3, 4, 18] + [11, 12, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 14, 15, 18] + [0, 1, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [1, 2, 5, 6] + [9, 10, 13, 14] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [1, 2, 16, 17] + [13, 14, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [2, 3, 6, 7] + [8, 9, 12, 13] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [2, 3, 17, 18] + [12, 13, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [3, 4, 7, 8] + [7, 8, 11, 12] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [4, 5, 8, 9] + [6, 7, 10, 11] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [5, 6, 9, 10] + [5, 6, 9, 10] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [6, 7, 10, 11] + [4, 5, 8, 9] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [7, 8, 11, 12] + [3, 4, 7, 8] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [8, 9, 12, 13] + [2, 3, 6, 7] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [9, 10, 13, 14] + [1, 2, 5, 6] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [10, 11, 14, 15] + [0, 1, 4, 5] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [11, 12, 15, 16] + [0, 3, 4, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [12, 13, 16, 17] + [2, 3, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [13, 14, 17, 18] + [1, 2, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
	Examined 128820 sums and 15023376 summands.
	found 3705 sums with 19 distinct summands
	found 39900 sums with 38 distinct summands
	found 26847 sums with 76 distinct summands
	found 22230 sums with 114 distinct summands
	found 10602 sums with 152 distinct summands
	found 8892 sums with 190 distinct summands
	found 2736 sums with 228 distinct summands
	found 5016 sums with 266 distinct summands
	found 2736 sums with 304 distinct summands
	found 1710 sums with 342 distinct summands
	found 171 sums with 361 distinct summands
	found 1710 sums with 380 distinct summands
	found 855 sums with 418 distinct summands
	found 342 sums with 456 distinct summands
	found 342 sums with 532 distinct summands
	found 342 sums with 570 distinct summands
	found 171 sums with 722 distinct summands
	found 171 sums with 760 distinct summands
	found 171 sums with 912 distinct summands
	found 171 sums with 1026 distinct summands
Done:	com.winvector.consolidate.impl.FileRelnCollector
   elapsed time: 618473MS
   Tue Dec 06 10:04:38 PST 2011
</pre>
<p>We performed the calculation by using a different implementation of <code>RelnCollector</code> called <code>FileRelnCollector</code>.  What this implementation does is write relations to a file as they are made available.  That is, the <code>FileRelnCollector</code> implementation of <code>insertReln</code> is literally a <code>println()</code>.  Something not much more complicated than the following:</p>
<pre><code>
	@Override
	public void insertReln(final A a, final B b) {
		System.out.println("" + a + "\t" + b);
	}
</code></pre>
<p>The heavy lifting is done when <code>entries()</code> is called.  When the entries are wanted the <code>FileRelnCollector</code> calls <a target="_blank" href="http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html">GNU sort</a> on the saved file to get all the results ordered by result sum (instead of by summand).  GNU sort can sort files larger than memory by a split and merge strategy involving temporary files.  We provide such  <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/FileRelnCollector.java">a file plus GNU sort based implementation of RelnCollector</a>.  </p>
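<p>To make the sort-then-scan idea concrete, here is a minimal in-memory sketch (an illustration only, not the code from the repository).  The class name <code>SortedGroupSketch</code> and the method <code>distinctPerKey</code> are hypothetical, and <code>Arrays.sort</code> merely stands in for the external GNU sort call.  The point is that once the lines are ordered by key, a single sequential pass is enough to count the distinct values recorded for each key.</p>

```java
import java.util.Arrays;

public class SortedGroupSketch {
    // Each input line is: key, a TAB character, then a value (keys contain no TAB).
    // Returns one line per key: the key, a TAB, and the number of distinct values.
    public static String[] distinctPerKey(String[] lines) {
        String[] sorted = lines.clone();
        Arrays.sort(sorted); // stand-in for the external sort; lines sharing a key become adjacent
        String[] out = new String[sorted.length];
        int nOut = 0;
        String curKey = null;
        String prevLine = null;
        int count = 0;
        for (String line : sorted) {
            String key = line.substring(0, line.indexOf('\t'));
            if (!key.equals(curKey)) {
                if (curKey != null) {
                    out[nOut++] = curKey + "\t" + count; // finished the previous key
                }
                curKey = key;
                count = 0;
            }
            if (!line.equals(prevLine)) {
                count++; // a (key, value) pair not seen before
            }
            prevLine = line;
        }
        if (curKey != null) {
            out[nOut++] = curKey + "\t" + count;
        }
        return Arrays.copyOf(out, nOut);
    }
}
```

<p>Only the current key, the previous line and a counter are held during the scan, which is why the same pattern keeps working when the sorted file is far larger than memory.</p>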
<p>Note that this runtime can be deceptively low.  If running on a machine with a modern operating system and enough memory the file being used as "external storage" actually gets cached into memory (and gets near memory speed performance).  To get a reliable timing you need to test a problem of the size you are interested in on the size machine you are going to deploy on (not on a larger machine).</p>
<p>For better or worse this method should seem familiar as a lot of science has been done using the Unix text tools (sort, join and a few more).  This is also the basis of Map Reduce and we demonstrate a <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/MapReduceRelnCollector.java">Hadoop implementation of RelnCollector</a> as well.  Or we can link up with the other technology designed for beyond memory size data manipulation and get <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/DBRelnCollector.java">a database based implementation of RelnCollector</a>.  </p>
<p>In all cases the implementations we call depend on journaling (in the sense of keeping a sequential log of operations to be done instead of immediately performing the operations), scattering (splitting into multiple temp files and structures) and merging (combining data from multiple ordered files).  We could write our own code to perform all of these operations (obviating any need for GNU sort, Hadoop or a database), but it is much less code to do as we have here and write an adapter to use existing implementations.</p>
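<p>As a rough illustration of the merging step, the following sketch combines two already-ordered runs into one ordered run.  It works on in-memory arrays for brevity; the class <code>MergeSketch</code> is a hypothetical name and this is not how GNU sort, Hadoop or a database is actually implemented, only the basic idea they share.</p>

```java
public class MergeSketch {
    // Merge two ascending arrays into a single ascending array.
    public static long[] merge(long[] a, long[] b) {
        long[] out = new long[a.length + b.length];
        int i = 0, j = 0, k = 0;
        // Repeatedly take the smaller of the two front elements.
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        // One input is exhausted; copy whatever remains of the other.
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }
}
```

<p>Each input is read strictly front to back, so with files in place of arrays only one record per input needs to be in memory at a time.</p>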
<p>The sum-set example is deliberately artificial.  More common examples are, as we mentioned, index inversion and term frequency calculation.  All of our example code is available here: <a target="_blank" href="https://github.com/WinVector/OutOfCore">https://github.com/WinVector/OutOfCore</a> including JUnit tests and an <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/example/AdditiveSets.java">example program</a>.  The code depends on the <a target="_blank" href="http://www.junit.org/">JUnit 4.10</a>, <a target="_blank" href="http://www.h2database.com/html/main.html">h2 database</a> and <a target="_blank" href="http://hadoop.apache.org/mapreduce/releases.html">Hadoop 0.21.0</a> libraries for the various implementations.</p>
<p>The main trick is basing your code on a very thin storage abstraction (like the <code>RelnCollector</code> interface, instead of explicitly known data structures) and then using this abstraction to hide all of the details away from the rest of your code (keeping complexity at a manageable level).  The two things to avoid are either infecting your code with too much knowledge of your storage plans (i.e. pushing implementation details into your important code to "speed things up") or being forced to re-design your entire project to fit within some framework (like re-writing all of your code as a database stored procedure or an explicit Hadoop map/reduce pair as this over-commits you to one technology).</p>
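<p>A minimal sketch of what such a thin abstraction can look like follows.  The names <code>PairSink</code>, <code>InMemoryPairSink</code> and <code>countEntries</code> are hypothetical and are not the <code>RelnCollector</code> API from the repository; the sketch only shows client code that depends on a two-method interface, so an in-memory store could later be swapped for a file, database or Hadoop backed one without touching the caller.</p>

```java
import java.util.Arrays;

// Hypothetical thin storage abstraction (not the repository's RelnCollector).
interface PairSink {
    void insert(String a, String b);   // record one relation
    String[] sortedEntries();          // all recorded relations, in sorted order
}

// One interchangeable implementation: keep everything in memory.
public class InMemoryPairSink implements PairSink {
    private String[] lines = new String[16];
    private int n = 0;

    @Override
    public void insert(String a, String b) {
        if (n == lines.length) {
            lines = Arrays.copyOf(lines, 2 * n); // grow the buffer when full
        }
        lines[n++] = a + "\t" + b;
    }

    @Override
    public String[] sortedEntries() {
        String[] out = Arrays.copyOf(lines, n);
        Arrays.sort(out);
        return out;
    }

    // Client code sees only the interface, never the storage plan behind it.
    public static int countEntries(PairSink sink) {
        return sink.sortedEntries().length;
    }
}
```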
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>My Favorite Graphs</title>
		<link>http://www.win-vector.com/blog/2011/12/my-favorite-graphs/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=my-favorite-graphs</link>
		<comments>http://www.win-vector.com/blog/2011/12/my-favorite-graphs/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 00:59:19 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[boxplots]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[graphical perception]]></category>
		<category><![CDATA[linear regression]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[statistical graphs]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1886</guid>
		<description><![CDATA[The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all. &#8211; William Cleveland, The Elements of Graphing Data, [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<blockquote><p>The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.</p></blockquote>
<p>&#8211; William Cleveland, <em>The Elements of Graphing Data</em>, Chapter 2</p>
<p>In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.</p>
<p>I tend to follow Cleveland&#8217;s philosophy, quoted above; these graphs show me &#8212; and hopefully you &#8212; aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.</p>
<p><span id="more-1886"></span></p>
<p>The graphs are all produced in <code>R</code>, using the <code>ggplot2</code> package. While <code>ggplot2</code> has a fairly steep learning curve, it is the most flexible of the <code>R</code> graphing packages that I have encountered, and I&#8217;ve been able to create rich graphics more quickly and easily than I would be able to with the <code>R</code> base graphics, or with other graphics packages.</p>
<p>Let&#8217;s start with some exploratory analysis. We will use the <code>AdultUCI</code> dataset that is included in the <code>arules</code> package.</p>
<pre><code>
library(arules)
data("AdultUCI")
dframe = AdultUCI[, c("education", "hours-per-week")]
colnames(dframe) = c("education", "hours_per_week")
         # get rid of the annoying minus signs in the column names
</code></pre>
<p>We want to compare the distribution of work-week length to education, using a box-and-whisker plot that is overlaid on a jittered scatterplot of the data.</p>
<pre><code>
library(ggplot2)
ggplot(dframe, aes(x=education, y=hours_per_week)) +
          geom_point(colour="lightblue", alpha=0.1, position="jitter") +
          geom_boxplot(outlier.size=0, alpha=0.2) + coord_flip()
</code></pre>
<p>The <code>outlier.size=0</code> argument to <code>geom_boxplot</code> turns off the outlier plotting, and <code>coord_flip</code> switches the coordinate axes (because there are a lot of education levels).</p>
<p>The resulting graph:</p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot.png" alt="Rplot" border="0"/></p>
<p>Recall that the box of a box-and-whisker plot covers the central 50% of the data distribution; the line in the center marks the median. In this case, the work-week length concentrates so strongly at 40 hours (except for PhDs and those with professional degrees; they are doomed to work longer hours, typically) that most of the boxes appear one-sided; it&#8217;s easier to see what is happening with both the scatterplot and box-and-whisker superimposed, than it might be with the box-and-whisker alone. We can also see the relative concentration of the subjects along each educational level.</p>
<p>I&#8217;ve found that this superimposed graph is fairly easy to explain in a presentation (easier than a plain box-and-whisker, actually). The primary disadvantage is that the scatterplot can become illegible for high volume datasets (this set has about 49 thousand rows). In that case, we have to return to the box-and-whisker plot alone.
</p>
<p>Beyond exploratory analysis, we also want plots to evaluate the models that we fit. Win-Vector&#8217;s bread-and-butter recently has been logistic regression, so we will start with some visualizations for evaluating binary logistic regression models. We&#8217;ll use the heart disease dataset that Hastie et al. used in <em>The Elements of Statistical Learning</em>.</p>
<pre><code>
path = "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"
saheart = read.table(path, sep=",", header=TRUE, row.names=1)
fmla = "chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity"
model = glm(fmla, data=saheart, family=binomial(link="logit"),
             na.action=na.exclude)
</code></pre>
<p>We will make a data frame of <em>chd</em> (the true response, coronary heart disease), and the score from the model.</p>
<pre><code>
dframe = data.frame(chd=as.factor(saheart$chd),
                    prediction=predict(model, type="response"))
</code></pre>
<p>The standard diagnostic plot for logistic models is the ROC curve, which is fine, but personally, I don&#8217;t get a visceral feel for the model from looking at the ROC. Also, if you are interested in setting a score threshold on the model for classification purposes, the ROC adds an additional level of indirection, since it essentially integrates the score away. I used to plot the distribution of score (prediction) versus true response, like so:</p>
<pre><code>
ggplot(dframe, aes(x=prediction, colour=chd)) + geom_density()
</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot01.png" alt="Rplot01" border="0"/></p>
<p>This visualization tells me whether or not the model scores actually separate the response &#8212; in this case, the model identifies negative cases (no coronary heart disease) better than positive cases. The graph is hard to explain to a non-technical audience, and it has the disadvantage that both distributions are separately normalized to have unit area, so you get no sense of the relative proportion of positive and negative cases (in this case, about 35% of the population have coronary heart disease). </p>
<p>Here&#8217;s an alternate graph:</p>
<pre><code>
ggplot(dframe, aes(x=prediction, fill=chd)) +
               geom_histogram(position="identity", binwidth=0.05, alpha=0.5)
</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot02.png" alt="Rplot02" border="0" /></p>
<p>These are two semi-transparent histograms; the blue histogram for <code>chd=1</code> is &#8220;in front&#8221; of the red histogram. Because they are histograms, rather than density plots, we can more clearly see the relative distribution of positive to negative cases, and we have a better sense of how well (or not) the model separates the positive cases from the negative ones. Clearly, for most score thresholds, the model will have a fairly high false positive rate. I use this visualization all the time, but it is also fairly hard to explain, the transparency in particular.</p>
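<p>For readers who want the arithmetic behind &#8220;false positive rate at a score threshold&#8221; spelled out, here is a small sketch.  It is written in Java rather than R and is not part of the R session above; the class <code>ThresholdRates</code> and its method are hypothetical names, and the convention that a score at or above the threshold counts as a positive prediction is an assumption.</p>

```java
public class ThresholdRates {
    // Returns {falsePositiveRate, truePositiveRate} when every score at or above
    // the threshold is classified as positive. labels[i] is true for a positive case.
    // If there are no negative (or no positive) cases the matching rate is NaN.
    public static double[] rates(double[] scores, boolean[] labels, double threshold) {
        int tp = 0, fp = 0, pos = 0, neg = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean predicted = scores[i] >= threshold;
            if (labels[i]) {
                pos++;
                if (predicted) tp++; // positive case correctly flagged
            } else {
                neg++;
                if (predicted) fp++; // negative case wrongly flagged
            }
        }
        return new double[] { (double) fp / neg, (double) tp / pos };
    }
}
```

<p>Sweeping the threshold from 0 to 1 and recording these two rates at each step traces out the ROC curve; the histogram shows the same counts before they are reduced to rates.</p>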
<p>We can also use our friend the box-and-whisker scatterplot.</p>
<pre><code>
ggplot(dframe, aes(x=chd, y=prediction)) +
               geom_point(position="jitter", alpha=0.2) +
               geom_boxplot(outlier.size=0, alpha=0.5)

</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot03.png" alt="Rplot03" border="0" /></p>
<p>The median score for the coronary heart disease cases is pulled away from the median score of the healthy subjects, but the central 50% of the two distributions still overlap. </p>
<p>Finally, let&#8217;s look at visualizations for linear regression. We&#8217;ll use the <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data">prostate cancer data</a> from <em>Elements of Statistical Learning</em>.</p>
<pre><code>
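# load the data linked above (assumes read.table detects the file's header row)
prostate.data = read.table("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
fmla = "lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45"
model = lm(fmla, data=prostate.data)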
</code></pre>
<p>We can just <code>plot(model)</code> for some diagnostic graphs:</p>
<pre><code>
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(model)
</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot04.png" alt="Rplot04" border="0" /></p>
<p>These diagnostics are useful to determine whether or not a linear model is suitable, and to identify outliers; but again, I personally don't get a visceral feel for the model. I prefer to directly plot prediction against true response:</p>
<pre><code>
dframe = data.frame(lpsa=prostate.data$lpsa, prediction=predict(model))

title = sprintf("Prostate Cancer model\n R-squared = %1.3f",
                summary(model)$r.squared)
ggplot(dframe, aes(x=lpsa, y=prediction)) +
               geom_point(alpha=0.2) +
               geom_line(aes(y=lpsa), colour="blue") +
               opts(title=title)
</code></pre>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/Rplot05.png" alt="Rplot05" border="0" /></p>
<p>This graph gives you the same information as the Residuals vs. Fitted plot, and the Q-Q plot -- in particular, whether there is systematic over- or under-prediction in specific ranges of the data. It will expose outliers, and it is intuitive to explain when presenting your results. Furthermore, it can be used to evaluate other models that predict a continuous response, such as regression trees or polynomial fits. </p>
<p>Which graphs do you find especially useful for your day-to-day work?</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/12/my-favorite-graphs/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>An Appreciation of Locality Sensitive Hashing</title>
		<link>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=an-appreciation-of-locality-sensitive-hashing</link>
		<comments>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 16:41:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Locality Sensitive Hashing]]></category>
		<category><![CDATA[Nearest Neighbor]]></category>
		<category><![CDATA[Theorist]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1848</guid>
		<description><![CDATA[We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Locality sensitive hashing is awe inspiring in its originality, simplicity, beauty and effectiveness. In addition locality sensitive hashing is a remarkable technique as it works even when drastically abridged and [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques.<span id="more-1848"></span> Locality sensitive hashing is awe-inspiring in its originality, simplicity, beauty and effectiveness.  In addition, locality sensitive hashing is a remarkable technique as it works even when drastically abridged and simplified.  In this <a target="_blank" href="http://www.win-vector.com/dfiles/LocalitySensitiveHashing.pdf">paper (link to pdf)</a> we give a description of conditions where the technique works and a heuristic argument for why it works (using only elementary math).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Kernel Methods and Support Vector Machines de-Mystified</title>
		<link>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=kernel-methods-and-support-vector-machines-de-mystified</link>
		<comments>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/#comments</comments>
		<pubDate>Sat, 08 Oct 2011 00:17:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Kernel Methods]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Naive Bayes]]></category>
		<category><![CDATA[Support Vector Machines]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1804</guid>
		<description><![CDATA[We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Goals [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/' rel='bookmark' title='Do your tools support production or complexity?'>Do your tools support production or complexity?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We give a simple explanation of the interrelated machine learning techniques called <a target="_blank" href="http://en.wikipedia.org/wiki/Kernel_methods">kernel methods</a> and <a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>.  We hope to characterize and de-mystify some of the properties of these methods.  To do this we work some examples and draw a few analogies.  The familiar, no matter how wonderful, is not perceived as mystical.<span id="more-1804"></span><br />
<h2>Goals of this writeup</h2>
<ol>
<li>De-mystify  kernel methods and support vector machines<br />
<blockquote><p>
Kernel methods and support vector machines have taken on mythological proportions in the machine learning imagination. Partly this is because a number of good ideas are overly associated with them: support/non-support training datums, weighting training data, discounting data, regularization, margin and the bounding of generalization error.  My issue is that these are all important enough ideas to stand on their own and are often seen in simpler settings.  The observations that inform my view are as follows:</p>
<ul>
<li>Kernel methods and support vector machines are in fact two good ideas.  Each is important even without the other: kernels are useful all over and support vector machines would be useful even if we restricted ourselves to the trivial identity kernel.</li>
<li>Small scale &#8220;kernel tricks&#8221; are not that different from the classic technique of adding &#8220;interaction variables.&#8221;  Kernels let you escape from the limits of &#8220;linear hypotheses&#8221; (really by moving to a bigger space where things are again linear but look curved from the point of view of your smaller original space).  We demonstrate the linear methods and &#8220;primal kernel tricks&#8221; in <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a>.</li>
<li>Support/non-support is the central issue of nearest neighbor classifiers.</li>
<li>Weighting training data is most famously shared by logistic regression and boosting.</li>
<li>Re-weighting of data by <i>smoothing kernels</i> (a different but related use of the word &#8220;kernel&#8221;) is central to non-parametric statistics (kernel smoothers and splines).</li>
<li>Regularization is of supreme importance in modeling in general.</li>
<li>Most practitioners get tired of &#8220;kernel shopping&#8221; and fall back to the identity, cosine or radial/Gaussian kernels.</li>
</ul>
</blockquote>
</li>
<li>Show concrete examples of what are and what are not kernels.<br />
Few sources give enough theorems to think about kernels abstractly and fewer still work concrete examples.
</li>
<li>
Use as little math as possible (which, unfortunately, turns out to be quite a bit).  We will discuss encodings and stopping conditions (important to understand what is going on) but avoid explaining the optimizers (the most beautiful part of support vector machines, but also the part that is available pre-packaged in libraries).
</li>
<li>
Call out common magical thinking and unreasonable expectations associated with kernel methods and support vector machines.  This should help the reader be in a better position to &#8220;defend their doubts&#8221; regarding machine learning promises.
</li>
<li>
Try to place all techniques in a wider context (if it is usable in only one place it is a trick, if it is usable in multiple places it is a technique).
</li>
<li>Discuss margin and its impact on generalization error.<br />
<blockquote><p>
Generalization error is an effect of &#8220;over fitting&#8221; where a model has learned things that are true about the training examples that do not hold for the overall truth or concept we are trying to learn (i.e. don&#8217;t generalize).   Generalization error is the excess error rate we observe when scoring new examples versus the error-rate we saw in learning the training data.  Margin is in fact a <em>posterior</em> observation.  That is: margin is observed after the training data is seen; it is not known before the data is seen (unlike, for example, <a target="_blank" href="http://en.wikipedia.org/wiki/VC_dimension">VC dimension</a>, which is).   Margin is useful as it bounds generalization error but it is not the <em>prior</em> bound it is often portrayed as.  So we assert margin estimates are not much more special than simple cross-validation estimates, which can also be performed once we have data available.
</p></blockquote>
</li>
</ol>
<p>The goal is not to indict or cut down kernel methods or support vector machines, but just to dump some of the associated baggage so they can be used fluidly and without anxiety.  </p>
<h2>Example Problem</h2>
<p>Consider the following simple (caricature) machine learning problem:  we are given a number of points labeled as circles and squares and we want to, given a new point, predict the label.  In our example the only input will be two numbers: x and y.  This example is simple enough that we can depict the entire situation in Figure 1 below:</p>
<center>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/TruthAndData.png" alt="TruthAndData.png" border="0" width="480" height="480" /></p>
<p>Figure 1: Truth and Training Data</p>
</center>
<p>We are assuming (for simplicity of exposition) that we have the incredible luck that the label is indeed a deterministic function of x and y.  We portray this &#8220;ground truth&#8221; as the blue parabola: every point found in this region will be considered to be labeled as &#8220;circle&#8221; and every other point (i.e. the red region) will be labeled as &#8220;triangle.&#8221;  We have overlaid 20 example points (with appropriate shape labels).  These points will be our training data.  We will try to learn from the observed training data a good approximation of the unseen blue and red regions.  This process is called learning or training and the ability to correctly predict the label of new points is called generalization.  To make sure our concept is learnable we are going to further guarantee a moat (or margin): the distribution we are picking training and test examples from will never pick a point from the shaded region between the blue and black.  That is, we will never be asked about an example where x*x is very near y.</p>
<p>To promote visual thinking we are going to avoid formulas until the section titled &#8220;precise functions used&#8221; (where we will list the formulas used for each graph).</p>
<h2>Nearest Neighbor Solution</h2>
<p>An interesting model in these days of big data and fast machines is the nearest neighbor model.  Such a model colors each point in space blue or red as it <em>thinks</em> the point should be classified.  In this case each point is colored the color given by the closest known training point.  This induces the type of model seen in Figure 2 whose boundary is piecewise line segments mid-way between training points.  The data points that determine segments of the classification boundary in this way are thought to support the boundary and are called &#8220;support examples&#8221; or &#8220;support vectors.&#8221;  In fact any training point that is not &#8220;supporting&#8221; part of the boundary is &#8220;irrelevant&#8221; (could be removed and we would get the exact same model).  This split between important and non-important training points is considered one of the important observations from  the theory of support vector machines, but we see it even here in the nearest neighbor models.</p>
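<p>The rule itself is short enough to write down directly.  The following is a minimal sketch, not the code used to draw the figures: the class <code>NearestNeighborSketch</code>, the encoding of the two shape labels as +1 and -1, and the use of plain Euclidean distance are all assumptions made for illustration.</p>

```java
public class NearestNeighborSketch {
    // xs[i], ys[i] are the coordinates of training point i; labels[i] is +1 or -1
    // (for example +1 for one shape and -1 for the other).
    // Assumes at least one training point is supplied.
    public static int classify1NN(double[] xs, double[] ys, int[] labels, double x, double y) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < xs.length; i++) {
            double dx = xs[i] - x;
            double dy = ys[i] - y;
            double d = dx * dx + dy * dy; // squared Euclidean distance
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return labels[best]; // copy the label of the closest training point
    }
}
```

<p>Note that the whole training set is the model: there is no fitting step, and every prediction scans all of the stored points.</p>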
<center>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1NN.png" alt="1NN.png" border="0" width="480" height="480" /></p>
<p>Figure 2: 1 nearest neighbor model</p>
</center>
<p>Notice that the nearest neighbor model in Figure 2 is &#8220;not half bad.&#8221;  The blue region is much too wide, but near the training data the blue mass is roughly in the correct position.  If we had infinite data and infinite computational resources this sort of model would be hard to beat (as any point we are likely to be tested on would have a lot of very near examples to work from).  We can try to clean up the shape of the model a bit by using a vote of the 2 nearest points to color each point in the plane (see Figure 3) or even the majority of the 3 nearest points (see Figure 4).  </p>
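<p>Voting over the k nearest points can be sketched the same way.  Again this is only an illustration under assumptions: the class <code>KnnVoteSketch</code> is a hypothetical name, labels are encoded as +1 and -1, distance is Euclidean, and a result of 0 is used to report that the votes cancelled exactly (which can only happen for even k).</p>

```java
import java.util.Arrays;

public class KnnVoteSketch {
    // Returns the sign of the summed labels (+1 or -1 each) of the k training
    // points closest to (x, y): +1, -1, or 0 when the votes cancel.
    public static int netVote(double[] xs, double[] ys, int[] labels, int k, double x, double y) {
        int n = xs.length;
        // Pack each point as {squared distance, label} so sorting keeps them paired.
        double[][] byDist = new double[n][];
        for (int i = 0; i < n; i++) {
            double dx = xs[i] - x;
            double dy = ys[i] - y;
            byDist[i] = new double[] { dx * dx + dy * dy, labels[i] };
        }
        Arrays.sort(byDist, (p, q) -> Double.compare(p[0], q[0])); // nearest first
        int sum = 0;
        for (int i = 0; i < k && i < n; i++) {
            sum += (int) byDist[i][1];
        }
        return Integer.signum(sum);
    }
}
```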
<center>
<p><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/2NN.png" alt="2NN.png" border="0" width="480" height="480" /></p>
<p>Figure 3: 2 nearest neighbor model</p>
</center>
<p>The purple region in Figure 3 is a &#8220;region of uncertainty&#8221; where the net vote over the 2 nearest points is zero (they gave inconsistent advice).  We can use this as a hint that predictions taken from this region are less reliable.  As you may notice, we are not getting real improvements.  We should always spend more time getting more features (useful coordinates in addition to x and y) and more data; and spend less time tweaking models.  But if your data and features are fixed all you can work on is your modeling technique (alternately: your modeling technique is all you can prepare before receiving features and data).  So we will continue to discuss features and technique.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/3NN.png" alt="3NN.png" border="0" width="480" height="480" /></p>
<p>Figure 4: 3 nearest neighbor model</p>
<p></center></p>
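<p>The voting rules behind Figures 2 through 4 can be sketched in a few lines.  This is only an illustrative sketch: the function names and the four toy training points are assumptions made for the example (they are not the training data used in the figures), and ties in distance are not handled specially.</p>

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def knn_vote(train, z, k):
    """Sign of the net vote of the k training points nearest to z.

    train is a list of (point, label) pairs with label +1 (blue/circle)
    or -1 (red/triangle).  A result of 0 means the neighbors disagreed
    evenly, which is the purple "region of uncertainty".
    """
    nearest = sorted(train, key=lambda item: sq_dist(item[0], z))[:k]
    net = sum(label for _, label in nearest)
    return (net > 0) - (net < 0)

# Hypothetical toy data (NOT the article's training set).
train = [((0.0, 1.0), +1), ((0.2, 1.4), +1),
         ((1.0, 0.0), -1), ((1.5, 0.1), -1)]
z = (0.6, 0.55)

print(knn_vote(train, z, 1))  # -1: the single nearest point is red
print(knn_vote(train, z, 2))  # 0: one red and one blue neighbor
print(knn_vote(train, z, 3))  # 1: two of the three nearest are blue
```

<p>The same query point gets three different answers as k changes, which is the kind of instability the figures show near the boundary.</p>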
<p>Nearest neighbor classifiers are nearly optimal in the sense that with an infinite amount of data the 1-nearest neighbor classifier has an error rate that is no more than twice the Bayes error rate (the Bayes error rate being the ideal error observed on identical repetitions, or the theoretical best error rate) and for large k (with k growing slowly relative to the amount of data) the k-nearest neighbor method approaches the Bayes error rate itself (see for example <a target="_blank" href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">k-nearest neighbor algorithm</a> ).</p>
<p>However the effectiveness of the nearest neighbor classifiers is coming from the very low dimension (2) of our input variables x and y.  If we were attempting to predict from n variables we might need an amount of data exponential in n to get a useful nearest neighbor classifier.  This is an <a target="_blank" href="http://en.wikipedia.org/wiki/Efficiency_(statistics)">inefficient</a> use of the data; what we want is to use all of the data we have effectively.  To do this we further assume there is some relational reason that the position of the point in the plane determines the shape-label (i.e. that we are learning, generalizing, interpolating and extrapolating- not just memorizing).  We set our ambitions above memorizing and move towards functional modeling.  We start thinking in terms of change: how the label density of examples changes as we move is a good clue as to how it will continue to change.  We can do this either in a parametric or primal formulation (where we attempt to directly infer parameters of an assumed functional form as in regression or logistic regression) or in a non-parametric or dual form (where we attempt to learn relations between points and the training data instead of parameters).  We have discussed this before ( <a target="_blank" href="http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/">A Demonstration of Data Mining</a> ) and expanded on primal methods ( <a target="_blank" href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">Learn Logistic Regression (and beyond)</a>, <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/">The Simpler Derivation of Logistic Regression</a> and <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/">The equivalence of logistic regression and maximum entropy models</a>).</p>
<h2>Functional Solution</h2>
<p>Our next idea is: can we use our training data to tease out a functional form for the relation between x,y and shape label?  The first function we will try is chosen to be similar to the nearest neighbor model we just demonstrated.  For this form we say each training point is the center of a <a target="_blank" href="http://en.wikipedia.org/wiki/Gaussian_function">Gaussian</a> hump (say up for blue/circle and down for red/triangle) which we will call a &#8220;discount function&#8221; (or informally a hump).  A &#8220;Gaussian&#8221; is just a function that falls off exponentially in the square of Euclidean distance from a central point (we can see the simple form of such functions in the &#8220;precise functions used&#8221; section).  The contour lines (or levels as seen when looking down from above) of one such hump or discount function are shown in figure 5 (all graphs from here on in will be looking down at a height represented by color and contour lines, much like topographic maps).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3ExampleConcept.png" alt="GaussianKernel3ExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 5: Narrow Gaussian example concept</p>
<p></center></p>
<p>This particular discount function has its maximum (or peak) centered on a data point and then rapidly falls as we move away from the training point.  The only property we will use in this section is that we can evaluate the discount easily (so we don&#8217;t at this time need or make use of any restrictive properties like integrability, positive semi-definiteness or even anything like bounded level sets).<br />
We could make our functional model just the sum of all of these up and down humps over all of the training data.  Each point gets a hump of the same height and same radius, pointing up for blue/circle and down for red/triangle.  The net coloring of this sum function is shown in figure 6.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3KernelSum1.png" alt="GaussianKernel3KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 6: Narrow Gaussian sum model</p>
<p></center></p>
<p>The idea is that this would imitate the nearest neighbor model we have already seen.  This is because even though we are summing over all data the closest training examples should usually dominate (since the Gaussian humps we picked fall to zero as we get further from their centers).   The correspondence to nearest neighbor increases if we tighten the radius of our functions (searching for the right radius is a very important statistical problem called &#8220;bandwidth estimation&#8221;).  Figure 7, for example, shows the sum over much narrower Gaussians.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel20KernelSum1.png" alt="GaussianKernel20KernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 7: Very narrow Gaussian sum model</p>
<p></center></p>
<p>Figure 7 looks even more like the nearest neighbor solution.  In fact you could think of the discount-weighted sum over all of the data as the natural model (as it is easier to reason about) and the nearest neighbors a more efficient approximation (summing only over near points).  We also have seen that we can alter the smoothness of our model (which has consequences on model generalization) by changing the steepness of our discount functions.  We can also build intermediate models (such as building a nearest neighbor but using discount weighted sums instead of uniform voting in the neighborhoods).   We haven&#8217;t yet greatly improved on nearest neighbor, but we have identified a couple of obvious avenues for further improvement: mess with the bandwidths (the obvious idea to try and deal with near/far scale) or re-weight the data (as one of the lessons of the nearest neighbor model is that points that are not near a boundary are less important).   </p>
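<p>A small sketch may make the link between the hump sum and nearest neighbor concrete.  The steepness parameter <code>s</code>, the function names and the toy points below are assumptions made for illustration; the humps are taken to be exp(-s * squared distance), which matches the verbal description of a Gaussian above but is not claimed to be the exact scaling used in Figures 6 and 7.</p>

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def gaussian_sum(train, z, s):
    """Sum of equal-height humps: up for +1 labels, down for -1 labels.

    s is an assumed steepness parameter; larger s means narrower humps.
    """
    return sum(label * math.exp(-s * sq_dist(u, z)) for u, label in train)

def nearest_label(train, z):
    """Label of the single closest training point (the 1-NN answer)."""
    return min(train, key=lambda item: sq_dist(item[0], z))[1]

# Hypothetical toy data: three blue points outnumber one red point.
train = [((0.0, 0.0), +1), ((0.3, 0.0), +1),
         ((0.0, 0.3), +1), ((1.0, 0.0), -1)]
z = (0.7, 0.0)  # closest training point is the red one at (1, 0)

print(nearest_label(train, z))           # -1
print(gaussian_sum(train, z, 1.0) > 0)   # True: wide humps, blue majority wins
print(gaussian_sum(train, z, 20.0) < 0)  # True: narrow humps agree with 1-NN
```

<p>With wide humps the three blue points outvote the nearer red point; with narrow humps the sum sides with the nearest neighbor.</p>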
<h2>Bandwidth Solution</h2>
<p>As a digression let&#8217;s play with bandwidth a bit first.  Suppose simultaneously for each training datum we picked an ideal bandwidth or steepness of the Gaussian.  </p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/BandwidthModel.png" alt="BandwidthModel.png" border="0" width="480" height="480" /></p>
<p>Figure 8: Simultaneous bandwidth model</p>
<p></center> </p>
<p>This simultaneous bandwidth model depicted in figure 8 was determined by picking a simultaneous assignment of Gaussian widths (each training datum gets its own width) that maximizes the log-sum of the category signed model function on the training data (much like the logistic regression maximum likelihood decoding).  This differs from typical bandwidth selection problems where all points are given a single common &#8220;best bandwidth.&#8221;  We instead picked a different bandwidth for each point (bandwidths picked to maximize how much of the correct category portion of the sum was correct on each training example).  The blue region is reasonable, but does not match the shape of the true concept.   Partly this is because there has not been enough training data to have learned a lot about the true shape (for example there is no empirical evidence for circle/blue in the top-left quadrant).  The other reason is a bias inherent to this type of model.  If one of the bandwidth functions is wider than all of the others (which is almost certain to occur) then it falls slower than all of the others and for points very far away from the center of the training data this one function is most of what remains.  Or equivalently: due to the nature of this modeling technique one of the concepts learned is going to be the union of bounded islands and the other concept will grab all points sufficiently far away from the center of the training data.  This is a strong (and undesirable) bias, but it is a bias also shared by support vector machines with Gaussian kernels (though support vector machines get it from their so-called &#8220;dc-term&#8221; b not being zero; this will be discussed later). Our bandwidth correction was successful, but frankly the optimization problem we solved to estimate the optimal bandwidths was nasty and would not be something we would advise for a large amount of data.</p>
<h2>Data weighting solution (support vector machine)</h2>
<p>So instead of messing with the bandwidths let us consider re-weighting the Gaussians in our sum.  We will leave the shape of each Gaussian the same but allow each individual training datum to have a unique weight or importance.  Perhaps by picking the right weights we can get a better model.  By &#8220;best&#8221; we will mean &#8220;best margin&#8221; (to be defined later) because with &#8220;best margin&#8221; as our objective the optimization problem of solving for the best data weights has a particularly beautiful form that can be reliably solved at great scale.  This is called a &#8220;support vector model&#8221; (or support vector machine) and we will describe these ideas in detail later.  But first let&#8217;s see the result.  Figure 9 shows the original narrow Gaussians re-summed according to the weights picked by the support vector method (instead of all being 1 as in figure 6).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel3SVM1.png" alt="GaussianKernel3SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 9: Narrow Gaussian support vector model</p>
<p></center></p>
<p>Figure 9 is again an improvement.  The figure is formed by taking a signed sum of our discount functions centered at our training points and weighted by the &#8220;support vector machine weights.&#8221;  The signs are picked with one class encoded as +1 and the other as -1.  The colors red and blue are picked if the sum is above or below a constant &#8220;b&#8221; called &#8220;the dc-term&#8221; (part of the support vector solution).  The solution again looks okay: all the training points are in the correct color region (and you can&#8217;t hope to interpolate let alone extrapolate if you can&#8217;t even reproduce your training concept).  Also, as with the bandwidth model, contours are set sensibly (steep/dense where categories are near each other, shallow/sparse where things are safe).   We see the same inductive bias: one of the concepts is the union of bounded islands (this is always going to be the case unless the model picks b=0).  This bias is not obvious from the kernel choice, as it is hidden in the dc-term of the support vector fitting equations (it is not part of the kernel, but can be defeated by choosing unbounded kernels).  An interesting empirical fact about support vector models is they tend to work well with a larger bandwidth.    For example if we use a wider Gaussian we get an even more convincing model (see Figure 10).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/GaussianKernel025SVM.png" alt="GaussianKernel025SVM.png" border="0" width="480" height="480" /></p>
<p>Figure 10: Wide Gaussian support vector model</p>
<p></center></p>
<p>In figure 10 we have also called out one of the important features of support vector machines.   In the literature all datums are called &#8220;vectors&#8221; (as they are represented in coordinates) and a subset of these are called the support vectors.  In figure 10 we have drawn the 4 training examples that turn out to be support vectors as large shapes and all other training examples as small shapes.  The support vectors are the datums with non-negligible weights.  The learned model is a function of the support vectors only.  So after fitting we can discard the other 16 training points. This is similar to the fact that nearest neighbor also only needs the training examples that are supporting its model boundary.</p>
<p>Notice that at this point the support vector models are not &#8220;magically&#8221; better, they in fact look less like the truth we are trying to learn than either the nearest neighbor models or the un-weighted sum models.  To fix this we need to do some &#8220;kernel shopping&#8221; or find functions that better respect what we are trying to model.  With the right functions support vector machines can do a very good job at learning the concept (but knowing &#8220;the right functions&#8221; is a huge hint as to what the concept is).  There is nothing wrong with using a hint but we do have to have a method to produce the hint or have a reasonable number of hints to choose from. </p>
<p>The &#8220;sum over everything with the same weight&#8221; solutions from the earlier section are very similar to Naive Bayes which also sums over all matching features with no weight adjustment.  The support vector &#8220;use the same model but pick better weights&#8221; stands over these sum models in very much the same way Logistic Regression stands over Naive Bayes.  In fact the optimization problems are very related.  If we consider weights over features or coordinates as &#8220;primal&#8221; and weights over data items as &#8220;dual&#8221; we can informally say something like: support vector machines optimize by working over dual variables and inspecting for primal weights, and logistic regression works over primal variables and inspects for data weights.   But this is a bit vague.</p>
<p>At this point we need to get a bit more explicit and precise.</p>
<h2>Precise functions used</h2>
<p>In general we will write our training data as a sequence of n-vectors.  Our points will be named u(1) &#8230; u(m).   For each i = 1..m we also know which category the training point is labeled with.  We will encode this as y(1) &#8230; y(m) where each y() is either +1 or -1 depending on the training label.  So the example we have been working with has n=2 and m=20.  Let z represent an n-vector we wish to classify.</p>
<p>Figure 6 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" alt="E4BCA580-314E-4BD8-AD83-3CF55C7AE24E.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 7 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" alt="9F34681D-45FC-4E0B-8466-943CF2881AAC.jpg" border="0" height="50" /><br />
</center></p>
<p>Figure 8 was a picture of the sign of the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" alt="C70C6F0B-7EB9-44C5-9505-013604F46D9C.jpg" border="0"  height="50" /><br />
</center><br />
where the w(i) are non-negative numbers picked so that<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/54765538-DADD-4353-A59A-EF3B6C42743E.jpg" alt="54765538-DADD-4353-A59A-EF3B6C42743E.jpg" border="0"  height="50" /><br />
</center><br />
is large.</p>
<p>Figure 9 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" alt="CD6E8576-D6D3-4D61-B285-46DB169BC885.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so &#8220;the margin&#8221; (discussed later) is large.</p>
<p>Figure 10 was a picture of the sign of -b plus the function:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" alt="1932DC0A-E051-41E7-8D90-F0868247A69C.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) and b are picked so the margin is large.</p>
<p>All of this repetition is to emphasize the commonality of the models.  They are all of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" alt="1FC41F1F-16D8-40A3-BA4E-0EF38E455F00.jpg" border="0" height="50" /><br />
</center><br />
where the a(i) are non-negative numbers called the &#8220;support weights&#8221; and k(,) is a function mapping pairs of vectors to numbers and is called &#8220;a kernel.&#8221;  Unfortunately there are many incompatible uses of the word kernel in mathematics and statistics.  Here &#8220;kernel&#8221; is being used in the sense of <a target="_blank" href="http://en.wikipedia.org/wiki/Positive-definite_kernel">positive semi-definiteness</a>, which in particular implies k(u,u) &ge; 0 for all u.   Note that the support vector machine, instead of using the sign of f(z) as its decision, uses which side of a constant b the value f(z) falls on as its category decision.  A consequence is: for support vector machines if b is non-zero (as it almost surely will be) and the kernels all go to zero as we approach infinity fast enough (as they are designed to do) then exactly one of the learned classes is infinite and the other is a union of islands (regardless of whether this was true for the training data).</p>
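<p>The common form and the &#8220;islands&#8221; consequence can be sketched as follows.  The sketch assumes the shared form is f(z) = sum over i of a(i) y(i) k(u(i), z), with the decision taken from the sign of f(z) - b as described for Figures 9 and 10.  The weights, the value of b, the Gaussian steepness <code>s</code> and the two training points are all made up for illustration; nothing here was produced by a support vector solver.</p>

```python
import math

def gaussian_kernel(u, v, s=3.0):
    """Gaussian kernel exp(-s * ||u - v||^2); s is an assumed steepness."""
    return math.exp(-s * sum((a - b) ** 2 for a, b in zip(u, v)))

def model(z, points, labels, weights, kernel):
    """f(z) = sum_i a(i) * y(i) * k(u(i), z)."""
    return sum(a * y * kernel(u, z)
               for u, y, a in zip(points, labels, weights))

def classify(z, points, labels, weights, kernel, b=0.0):
    """Category decision: which side of the constant b the value f(z) is on."""
    return 1 if model(z, points, labels, weights, kernel) - b > 0 else -1

# Hypothetical two-point problem with hand-picked weights and dc-term.
points = [(0.0, 0.0), (1.0, 0.0)]
labels = [+1, -1]
weights = [1.0, 1.0]
b = -0.1

print(classify((0.0, 0.0), points, labels, weights, gaussian_kernel, b))      # 1
print(classify((1.0, 0.0), points, labels, weights, gaussian_kernel, b))      # -1
# Far from all training data f(z) is essentially 0, so the sign of -b decides:
print(classify((100.0, 100.0), points, labels, weights, gaussian_kernel, b))  # 1
print(classify((-100.0, 50.0), points, labels, weights, gaussian_kernel, b))  # 1
```

<p>With this b every sufficiently distant point is classified +1, so the -1 class can only ever be a bounded island around its training points.</p>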
<h2>About Kernels</h2>
<p>A lot of awe and mysticism is associated with kernels, but we think they are not that big a deal.  Kernels are a combination of two good ideas; they have one important property and are subject to one major limitation.  It is also unfortunate that support vector machines and kernels are tied so tightly together.   Kernels are the idea of summing functions that imitate similarity (induce a positive-definite encoding of nearness) and support vector machines are the idea of solving a clever dual problem to maximize a quantity called margin.  Each is a useful tool even without the other.</p>
<p>The two good ideas are related and unfortunately treated as if they are the same.  The two good ideas we promised are:</p>
<ol>
<li>The &#8220;kernel trick.&#8221;  Adding new features/variables that are functions of your other input variables can change linearly inseparable problems into linearly separable problems.  For example if our points were encoded not as u(i) = (x(i),y(i)) but as u(i) = (x(i),y(i),x(i)*x(i),y(i)*y(i),x(i)*y(i))</li>
<p>  we could easily find the exact concept ( y(i) > x(i)*x(i) ) which is now a linear concept encoded as the vector (0,1,-1,0,0).</p>
<li>Often you don&#8217;t need the coordinates of u(i).  You are only interested in functions of distances ||u(i)-u(j)||^2 and in many cases you can get at these by inner products and relations like ||u(i)-u(j)||^2 = &lt;u(i),u(i)&gt; + &lt;u(j),u(j)&gt; &#8211; 2&lt;u(i),u(j)&gt; .</li>
</ol>
<p>We will expand on these issues later.</p>
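<p>Both ideas are easy to try out.  The sketch below uses the expanded encoding (x, y, x*x, y*y, x*y) and the weight vector (0,1,-1,0,0) from the first item, and the inner product identity from the second; the function names and test points are chosen only for illustration.</p>

```python
def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def expand(p):
    """Re-encode (x, y) as (x, y, x*x, y*y, x*y)."""
    x, y = p
    return (x, y, x * x, y * y, x * y)

# In the expanded space the concept y > x*x is linear: <w, expand(p)> > 0.
w = (0, 1, -1, 0, 0)

def above_parabola(p):
    return dot(w, expand(p)) > 0

def sq_dist_from_inner(u, v):
    """||u - v||^2 computed only from inner products."""
    return dot(u, u) + dot(v, v) - 2 * dot(u, v)

print(above_parabola((1, 2)))              # True:  2 > 1*1
print(above_parabola((2, 3)))              # False: 3 < 2*2
print(sq_dist_from_inner((1, 2), (4, 6)))  # 25, i.e. 3*3 + 4*4
```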
<p>The important property is that kernels look like inner products in a transformed space.  The definition of a kernel is: there exists a magic function phi() such that for all u,v:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" alt="F7B3D1C2-2702-48E0-8259-C48BBDACFE50.jpg" border="0"  height="20" /> .<br />
</center><br />
This means that k(.,.) is behaving like an inner product in some (possibly unknown) space.  The important consequence is the positive semi-definiteness, which implies k(u,u)&ge;0 for all u (and this just follows from the fact about inner products over the real numbers that &lt;z,z&gt;&ge;0 for all z).  This is why optimization problems that use the kernel as their encoding are well formed (such as the optimization problem of maximizing margin which is how support vector machines work).  You can <a target="_blank" href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">&#8220;regularize&#8221;</a> optimization problems with a kernel penalty because it behaves a lot like a norm.  Without the positive semidefinite property all of these optimization problems would be able to &#8220;run to negative infinity&#8221; or use negative terms (which are not possible from a kernel) to hide high error rates.  The limits of the kernel functions (not being able to turn distance penalties into bonuses) help ensure that the result of optimization is actually useful (and not just a flaw in our problem encoding).</p>
<p>And this brings us to the major limitation of kernels.  The phi() transform can be arbitrarily magic except when transforming one vector it doesn&#8217;t know what the other vector is and phi() doesn&#8217;t even know which side of the inner product it is encoding.   That is, kernels are not as powerful as any of the following forms:</p>
<ul>
<li>snooping (knowing the other):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" alt="1BA6AACD-96DE-4B2A-A3FA-1236AE9D2957.jpg" border="0"  height="20" />
</li>
<li>positional (knowing which part of inner product mapping to):<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" alt="D51DD0F0-DF01-4E37-B5C4-E5ECF3179165.jpg" border="0" height="20" />
</li>
<li>fully general:<br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" alt="845BD9A6-97DC-4F2C-AD4E-633FC9AFADA7.jpg" border="0" height="20" />
</li>
</ul>
<p>Not everything is a kernel.</p>
<p>Some non-kernels are:</p>
<ol>
<li> k(u,v) = -c for any c>0</li>
<blockquote><p>
A consequence of the fact that no negative number can be written as a sum of squares or as the limit of a sum of squares in the reals.
</p></blockquote>
<li> k(u,v) = ||u-v||<br />
<blockquote><p>
This can be shown as follows.  Suppose k(u,v) is a kernel with k(u,u) = 0 for all u (which is necessary to try and match ||u-v|| as ||u-u|| = 0 for all u).  By the definition<br />
of kernels k(u,u) = &lt; phi(u), phi(u) &gt; for some real vector valued function phi(.).  But, by the properties of the real inner product &lt;.,.&gt;, this means phi(u) is<br />
the zero vector for all u.  So k(u,v) = 0 for all u,v and does not match ||u-v|| for any u,v such that u &ne; v.
</p></blockquote>
</li>
</ol>
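<p>One way to see such failures numerically: for any kernel and any real coefficients c(i), the sum over i,j of c(i) c(j) k(u(i),u(j)) equals the squared length of the vector sum of c(i) phi(u(i)), so it can never be negative.  The sketch below (function names and points are illustrative assumptions) evaluates that sum and finds negative values for both non-kernels.</p>

```python
import math

def quadratic_form(kernel, points, coeffs):
    """sum_{i,j} c(i) c(j) k(u(i), u(j)); non-negative for every true kernel."""
    return sum(ci * cj * kernel(ui, uj)
               for ci, ui in zip(coeffs, points)
               for cj, uj in zip(coeffs, points))

def distance(u, v):
    """Candidate k(u,v) = ||u - v|| (not a kernel)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def inner(u, v):
    """The ordinary inner product (a kernel)."""
    return sum(a * b for a, b in zip(u, v))

def negative_constant(u, v):
    """Candidate k(u,v) = -1 (not a kernel)."""
    return -1.0

points = [(0.0, 0.0), (3.0, 4.0)]

print(quadratic_form(distance, points, [1.0, -1.0]))          # -10.0
print(quadratic_form(negative_constant, points, [1.0, 1.0]))  # -4.0
print(quadratic_form(inner, points, [1.0, -1.0]))             # 25.0
```

<p>A single negative value is enough to rule a function out; non-negative values on a few examples do not by themselves prove a function is a kernel.</p>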
<p>There are some obvious kernels:</p>
<ol>
<li> k(u,v) = c for any c &ge; 0 (non-negative constant kernels) </li>
<li> k(u,v) = &lt; u , v &gt;  (the identity kernel) </li>
<li> k(u,v) = f(u) f(v) for any real valued function f(.) </li>
<li> k(u,v) = &lt; f(u) , f(v) &gt; for any real vector valued function f(.) (again, the definition of a kernel)</li>
<li> k(u,v) = transpose(u) B v where B is any symmetric positive semi-definite matrix</li>
</ol>
<p>And there are several subtle ways to build new kernels from old.  If q(.,.) and r(.,.) are kernels then so are:</p>
<ol>
<li> k(u,v) = q(u,v) + r(u,v)</li>
<li> k(u,v) = c q(u,v) for any c &ge; 0 </li>
<li> k(u,v) = q(u,v) r(u,v) </li>
<li> k(u,v) = q(f(u),f(v)) for any real vector valued function f(.)</li>
<li> k(u,v) = lim_{k-> infinity} q_k(u,v) where q_k(u,v) is a sequence of kernels and the limit exists.</li>
<li> k(u,v) = p(q(u,v)) where p(.) is any polynomial with all non-negative coefficients.</li>
<li> k(u,v) = f(q(u,v)) where f(.) is any function with an absolutely convergent Taylor series with all non-negative coefficients.</li>
</ol>
<p>Most of these facts are taken from the excellent book: John Shawe-Taylor and Nello Cristianini&#8217;s <a target="_blank" href="http://www.kernel-methods.net/">&#8220;Kernel Methods for Pattern Analysis&#8221;</a>, Cambridge 2004.  We are allowing<br />
kernels of the form k(u,v) = &lt; phi(u), phi(v) &gt; where phi(.) is mapping into an infinite dimensional vector space (like a series).  Most of these facts can be checked by<br />
imagining how to alter the phi(.) function.  For example to add two kernels just build a larger vector with enough slots for all of the coordinates for the vectors encoding<br />
the phi(.)&#8217;s of the two kernels you are trying to add.  To scale a kernel by c multiply all coordinates of phi() by sqrt(c).  </p>
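<p>The sum and scaling constructions can be written out directly on small explicit feature maps.  The two maps below are hypothetical stand-ins for phi(.) (one is the identity, one is the f(u) f(v) form with f the coordinate sum); they are chosen only so the arithmetic is easy to follow.</p>

```python
import math

def inner(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def phi_q(u):
    """Hypothetical feature map for kernel q: the identity, so q(u,v) = <u,v>."""
    return tuple(u)

def phi_r(u):
    """Hypothetical feature map for kernel r: one slot holding the coordinate sum."""
    return (sum(u),)

def phi_sum(u):
    """Feature map for q + r: a larger vector with slots for both encodings."""
    return phi_q(u) + phi_r(u)  # tuple concatenation

def phi_scaled(u, c):
    """Feature map for c * q: multiply every coordinate by sqrt(c)."""
    return tuple(math.sqrt(c) * t for t in phi_q(u))

u, v = (1.0, 2.0), (3.0, -1.0)
q_uv = inner(phi_q(u), phi_q(v))  # 1.0
r_uv = inner(phi_r(u), phi_r(v))  # 6.0

print(inner(phi_sum(u), phi_sum(v)), q_uv + r_uv)
print(inner(phi_scaled(u, 4.0), phi_scaled(v, 4.0)), 4.0 * q_uv)
```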
<p>Multiplying two kernels is the trickiest.  Without loss of generality assume q(u,v) = &lt; f(u) ,f(v) &gt; and r(u,v) = &lt; g(u), g(v) &gt; and f(.) and g(.) are both mapping into the same finite dimensional vector space R^m (any other situation can be simulated or approximated by padding with zeros and/or taking limits).  Imagine a new vector function p(.) that maps into R^{m*m} such that p(z)_{m*(i-1) + j} = f(z)_i g(z)_j .  It is easy to check that k(u,v) = &lt; p(u) , p(v) &gt; is a kernel and k(u,v) = q(u,v) r(u,v). (The reference proof uses tensor notation and the Schur product, but these are big hammers mathematicians use when they don&#8217;t want to mess around with re-encoding indices).</p>
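<p>The index re-encoding for the product can be checked on a small case.  In the sketch below f(.) and g(.) are hypothetical maps into R^2 picked for illustration, and the list comprehension enumerates the pairs (i,j) in the same order as the index m*(i-1) + j (shifted to Python&#8217;s 0-based indexing).</p>

```python
def inner(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def f(z):
    """Hypothetical feature map for q: the identity on R^2."""
    return (z[0], z[1])

def g(z):
    """Hypothetical feature map for r: sum and difference of coordinates."""
    return (z[0] + z[1], z[0] - z[1])

def p(z):
    """Product feature map into R^(m*m): all products f(z)_i * g(z)_j."""
    return tuple(fi * gj for fi in f(z) for gj in g(z))

u, v = (1.0, 2.0), (3.0, -1.0)
q_uv = inner(f(u), f(v))  # 1.0
r_uv = inner(g(u), g(v))  # 2.0

print(p(u))                             # (3.0, -1.0, 6.0, -2.0)
print(inner(p(u), p(v)), q_uv * r_uv)   # 2.0 2.0
```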
<p>Note that one of the attractions of kernel methods is that you never have to actually implement any of the above constructions.  What you do is think in terms of sub-routines (easy for computer scientists, unpleasant for mathematicians).  For instance: if you had access to two functions q(.,.) and r(.,.) that claim to be kernels and you wanted the product kernel you would, when asked to evaluate the kernel, just get the results for the two sub-kernels and multiply (so you never need to see the space phi(.) is implicitly working in).</p>
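<p>In code the sub-routine view amounts to a few higher-order functions.  The combinator names below are invented for this sketch; each returns a new function that evaluates its sub-kernels and combines the results without ever constructing phi(.).</p>

```python
def linear(u, v):
    """The ordinary inner product kernel."""
    return sum(a * b for a, b in zip(u, v))

def constant_kernel(c):
    """k(u,v) = c, assumed to be called with c >= 0."""
    return lambda u, v: c

def sum_kernel(q, r):
    """k(u,v) = q(u,v) + r(u,v)."""
    return lambda u, v: q(u, v) + r(u, v)

def scale_kernel(c, q):
    """k(u,v) = c * q(u,v), assumed to be called with c >= 0."""
    return lambda u, v: c * q(u, v)

def product_kernel(q, r):
    """k(u,v) = q(u,v) * r(u,v): evaluate both sub-kernels and multiply."""
    return lambda u, v: q(u, v) * r(u, v)

# Build k(u,v) = (<u,v> + 1)^2 purely by composing sub-routines.
shifted = sum_kernel(linear, constant_kernel(1.0))
squared = product_kernel(shifted, shifted)

u, v = (1.0, 2.0), (3.0, -1.0)
print(squared(u, v))                   # 4.0, since <u,v> = 1
print(scale_kernel(3.0, linear)(u, v)) # 3.0
```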
<p>This implicitness can be important (though far too much is made of it).  For example the Gaussians we have been using throughout are kernels, but kernels of infinite dimension, so we can not directly represent the space they live in.  To see the Gaussian is a kernel notice the following:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" alt="926E0E3C-883C-4879-A49C-4B91DEB3E449.jpg" border="0"  height="20" /> .<br />
</center> </p>
<p>And for all c &ge; 0 each of the three terms on the right is a kernel (the first because the Taylor series of exp() is absolutely convergent and non-negative and the second two are the f(u) f(v) form we listed as obvious kernels).  The trick is magic, but the idea is to use the fact that Euclidean squared distance breaks nicely into dot-products ( ||u-v||^2 = &lt; u, u &gt; + &lt; v, v &gt; &#8211; 2 &lt; u , v &gt;) and exp() converts addition to multiplication.  It is rather remarkable that kernels (which are a generalization of inner products that induce half open spaces) can encode bounded concepts (more on this later when we discuss the VC dimension of the Gaussian).</p>
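<p>The factoring is easy to confirm numerically.  The sketch below assumes the Gaussian is written as exp(-c ||u-v||^2) and that the three factors are exp(2c &lt;u,v&gt;), exp(-c &lt;u,u&gt;) and exp(-c &lt;v,v&gt;), which is what the dot-product expansion gives; the function names and test values are illustrative.</p>

```python
import math

def inner(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def gaussian(u, v, c):
    """exp(-c * ||u - v||^2) computed directly from coordinates."""
    return math.exp(-c * sum((a - b) ** 2 for a, b in zip(u, v)))

def gaussian_factored(u, v, c):
    """The same value as a product of three factors built from inner products."""
    cross = math.exp(2.0 * c * inner(u, v))  # exp() applied to the linear kernel
    f_u = math.exp(-c * inner(u, u))         # depends on u only
    f_v = math.exp(-c * inner(v, v))         # depends on v only
    return cross * f_u * f_v

u, v, c = (1.0, 2.0), (3.0, -1.0), 0.5
print(gaussian(u, v, c))
print(gaussian_factored(u, v, c))
print(math.isclose(gaussian(u, v, c), gaussian_factored(u, v, c)))  # True
```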
<h2>Back to Support Vector Machines</h2>
<p>We can now restate what a support vector machine is: it is a method for picking data weights so that the modeling function for a given kernel has a maximum margin.  The margin is the minimal distance between a training example and the model&#8217;s boundary between blue and red.  Notice the 4 support vectors in figure 10 are all not &#8220;tight up against the boundary,&#8221; but are all the same (minimal) distance from the boundary.  This distance is called the margin and the area near the boundary is obviously a place the model has uncertainty (so it makes sense to keep the boundary as far as possible from the training data).  The great benefit of the support vector machine is that with access only to the data labels and the kernel function (in fact only the kernel function evaluated at pairs of training datums) the support vector machine can quickly solve for the optimal margin and data weights achieving this margin.</p>
<p>For fun let&#8217;s plug in a new kernel called the cosine kernel:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" alt="328D7D80-4133-4E42-A63F-F7E8F3ADC90C.jpg" border="0"  height="43" /><br />
</center><br />
(c &ge; 0).</p>
<p>This kernel can be thought of as having a phi(.) function that takes a vector z and adds an extra coordinate of sqrt(c) and then projects the resulting vector onto the unit sphere.  This kernel induces concepts that look like parabolas, as seen in figure 11.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelExampleConcept.png" alt="CosineKernelExampleConcept.png" border="0" width="480" height="480" /></p>
<p>Figure 11: Cosine example concept</p>
<p></center></p>
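<p>The description of the cosine kernel&#8217;s phi(.) translates directly into code.  The closed form used below, (&lt;u,v&gt; + c) / sqrt((&lt;u,u&gt; + c)(&lt;v,v&gt; + c)), is reconstructed from that verbal description (append sqrt(c), then project onto the unit sphere) and should be read as an assumption about the exact formula; the sketch also assumes c &gt; 0 so the zero vector does not cause a division by zero.</p>

```python
import math

def inner(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(z, c):
    """Append an extra coordinate sqrt(c), then project onto the unit sphere."""
    extended = tuple(z) + (math.sqrt(c),)
    length = math.sqrt(inner(extended, extended))
    return tuple(t / length for t in extended)

def cosine_kernel(u, v, c):
    """Assumed closed form of <phi(u), phi(v)>; requires c > 0."""
    return (inner(u, v) + c) / math.sqrt((inner(u, u) + c) * (inner(v, v) + c))

u, v, c = (1.0, 2.0), (3.0, -1.0), 1.0
print(cosine_kernel(u, v, c))
print(inner(phi(u, c), phi(v, c)))  # agrees with the closed form up to rounding
print(cosine_kernel(u, u, c))       # every point has similarity 1 with itself
```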
<p>Figure 12 shows the sum-concept (add all discount functions with same weight) model. In this case it is far too broad: averaging the cosine concepts (which are not bounded concepts like the Gaussians) creates an overly wide model.  The model not only generalizes poorly (fails to match the shape of the truth diagram), it also gets some of its own training data wrong.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelKernelSum1.png" alt="CosineKernelKernelSum.png" border="0" width="480" height="480" /></p>
<p>Figure 12: Cosine kernel sum model</p>
<p></center></p>
<p>And figure 13 shows the excellent fit returned by the support vector machine.  Actually an excellent fit determined by 4 support vectors (indicated by larger labels).  Also notice the support vector machine using unbounded concepts generated a very good unbounded model (unbounded in the sense that both the blue and red regions are infinite).  By changing the kernel or discount functions/concepts we changed the inductive bias.  So any knowledge of what sort of model we want (one class bounded or not) should greatly influence our choice of kernel functions (since the support vector machine can only pick weights for the kernel functions, not tune their shapes or bandwidths).</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/CosineKernelSVM.png" alt="CosineKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 13: Cosine Support Vector model</p>
<p></center></p>
<p>Another family of kernels (which I consider afflictions) are the &#8220;something for nothing&#8221; kernels.  These are kernels of the form:<br />
<center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" alt="C2307338-CB70-4EE0-B83D-0B4AE988F951.jpg" border="0" height="20" /><br />
</center><br />
or higher powers or other finishing functions.  The idea is that if you were to look at the expansion of these kernels you would see lots of higher order terms (powers of u_i and v_j)<br />
and if these terms were available to the support vector machine it could use them.  It is further claimed that in addition to getting new features for free you don&#8217;t use up degrees<br />
of freedom exploiting them (so you cross-validate as well as for simpler models).   Both these claims are fallacious- you can&#8217;t fully use the higher order terms because they are entangled with other terms that are not orthogonal to the outcome and the complexity of a kernel is not quite as simple as degrees of freedom (proofs about support vector machines are stated in terms of margin, not in terms of degrees of freedom or even in terms of VC dimension).  The optimizer in the SVM does try hard to make a good hypothesis using the higher order terms- but due to the forced relations among the terms you get best fits like figure 14.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/SquareKernelSVM.png" alt="SquareKernelSVM.png" border="0" width="480" height="480" /></p>
<p>Figure 14: Squared magic kernel support vector model</p>
<p></center></p>
<p>If you want higher order terms I feel you are much better off performing a primal transform so the terms are available in their most general form.   That is re-encode the vector u = (x,y) as the larger vector (x,y,x*y,x*x,y*y).  You have to be honest: if you are trying to fit more complicated functions you are searching a larger hypothesis space so you need more data to falsify the bad hypotheses.   You can be efficient (don&#8217;t add terms you don&#8217;t think you can use as they make generalization harder) but you can&#8217;t get something for nothing (even with kernel methods).</p>
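<p>As a minimal sketch of the primal transform described above (the function names and the weights are hypothetical, chosen only for illustration), the re-encoding of u = (x,y) as (x,y,x*y,x*x,y*y) lets an ordinary linear model weight each higher order term independently:</p>

```python
def quadratic_features(x, y):
    """Re-encode the point (x, y) as the explicit vector (x, y, x*y, x*x, y*y)."""
    return (x, y, x * y, x * x, y * y)


def linear_score(weights, bias, features):
    """Score of an ordinary linear (primal) model over the expanded features."""
    return bias + sum(w * f for w, f in zip(weights, features))


# Hypothetical weights: they encode the circle x*x + y*y = 1, a boundary
# that no linear model over the raw (x, y) coordinates can express.
weights = (0.0, 0.0, 0.0, -1.0, -1.0)
bias = 1.0

inside = linear_score(weights, bias, quadratic_features(0.1, 0.2))
outside = linear_score(weights, bias, quadratic_features(2.0, 0.0))
print(inside > 0, outside > 0)  # prints: True False
```

<p>Each of the five coordinates gets its own free weight here, which is exactly the freedom (and the extra hypothesis space that must be paid for with data) that the squared kernel does not honestly provide.</p>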
<h2>Back to margin</h2>
<p>Large margin (having a large distance between the decision boundary and all training examples) is a good idea.  At the simplest level it means anything close to a training data point gets classified correctly (so the model is immune to an amount of fuzz proportional to the margin width).  </p>
<p>We get the following remarkable generalization theorem (this one is the hard SVM version, we weaken the result a bit to make it easier to state).  Suppose we train a model using kernel k(.,.) on m training examples u(1),&#8230;,u(m) with training labels y(1),&#8230;,y(m).  Further assume the hidden true concept is separable with respect to our kernel and data distribution with a margin of at least w (that is for any sample we draw we can find an SVM model that classifies all of the training data correctly and sees a margin of at least w).  Or more simply: assume w is behaving like a constant with respect to m (i.e. it is not going down as we increase sample size). Then with probability at least 1-d the generalization error (that is error seen on new test points not seen during training, but drawn from the same distribution as the training examples) is no more than:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" alt="1F1B1C26-C676-46EA-9522-5DBA181CC8C6.jpg" border="0" height="65" /><br />
.<br />
</center><br />
(we take this from Theorem 7.22 of &#8220;Kernel Methods for Pattern Analysis&#8221;, page 215).</p>
<p>If we further assume the expected value k(u,u) under the training distribution exists (i.e. we don&#8217;t have too many examples of very high kernel norm according to the distribution we are training with respect to) and is stable (not taking a lot of value from rare instances) then we can read off the meaning of this upper bound.  Assuming we can always build a model from the training data of margin at least w then the generalization error (excess error we expect to see on classifying similar points) is falling at a rate inversely proportional to the square root of the amount of training data we have.  This is in one sense amazing: we have a bound on how well we are going to do on data we have not seen.  And this bound is also in some sense &#8220;best possible&#8221; because we need around this much data to even confirm such an accuracy (see <a target="_blank" href="http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/">What is a large enough random sample?</a> ).  Also notice how the dimension of the implicit features space (which can in fact be infinite) does not enter into the bound.</p>
<p>However, the bound is also in some sense to be expected.  It depends on the following strong assumptions (usually not remembered or stated):</p>
<ol>
<li>Test and training data must be generated in the same manner (i.e. come from the same distribution, this is usually stated but not obeyed).</li>
<li>We must not have collapsing margin as the number of data points goes up (so there must actually be a moat between the positive and negative examples such as portrayed in figure 1 where the blurry boundary was a region we never generated training or test examples from).</li>
<li>The expected value of k(u,u) must be bounded for examples drawn from our training distribution.  For example this is not true even for the identity kernel if x is just numbers drawn from a <a target="_blank" href="http://en.wikipedia.org/wiki/Cauchy_distribution">Cauchy distribution</a>.</li>
</ol>
<p>Also it is a <em>posterior</em> estimate of expected generalization error.  That is, we only get a bound after we plug in some facts found from fitting a model to the training data (the margin and the expected value of k(u,u)).  It is not like the famous <em>prior</em> bounds based on VC dimension that depend only on the structure of the kernel k(,) and not the distribution of training data or labels (beyond the concept being expressible by the kernel, which is itself a <em>posterior</em> fact).  The thing to remember for all posterior estimates is you can get such estimates at a small expense in statistical efficiency (i.e. you waste some data) and a moderate computational blow-up by cross validation (and with essentially no math and many fewer assumptions).  In fact the cross validation and hold-out estimates may be much tighter than calculated bounds (and are certainly easier to explain).</p>
<p>When we say we weakened the result we mean we have added some stated conditions that are not strictly needed for the bound to be true.  One of these assumptions is that the margin is not collapsing (the bound remains true even for a collapsing margin; it is just not useful).  We added this assumption because most readings of the bound (or presumed applications) implicitly assume the bound remains useful (which means something near our additional assumptions must also hold).  The reader is really lured to think of the margin w as a constant depending only on the kernel when in fact it depends on the kernel, data distribution and number of training examples.  For example: consider a very simple one dimensional classification problem:  learning the concept x &ge; 0 from labeled training data drawn uniformly from the interval [-0.5,0.5] using a Gaussian kernel.  The support vector models are going to be essentially picking two neighboring points in the observed training sample that have opposite labels and saying the concept boundary is between them.  In this simple case the margin is collapsing: as m (the number of training examples) increases the expected margin width is O(1/m).  This renders the bound calculated above useless (drives it above 1 as the number of training examples increases).  The model is good but the bound is not useful.  This is because the training distribution is discovering points closer and closer to the concept boundary (a necessary discontinuity) as the number of training examples grows.  This is why we had to add the additional (weakening) assumption that not only do we need to assume the concept to be learned is separable with respect to the kernel we are using, but we must further assume the data-distribution (determining where test and training data come from) stays some minimal distance away from the decision surface (i.e. together the data distribution and underlying concept have a moat, as in our figure 1 example).  
We will phrase this as &#8220;the concept is wide margin separable with respect to the training distribution&#8221; (implicitly we need to know the kernel, but it is the training distribution that is critical). This is a very strong assumption (not true for all classification problems) but is often true when the classification task is to &#8220;un-mix a mixture&#8221; (the observed data is coming from two different sources and the training labels are which source they come from, in this case you often do have a margin).  Obviously a soft-margin SVM deals better with this (some of the issue is due to our use of hard-margin)- but it is an interesting exercise to think why you would expect your learned  model to have a large margin if your underlying concept and training data do not.</p>
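<p>The collapsing margin in the one dimensional example can be seen in a small simulation.  This is only a sketch under stated assumptions: it does not fit an SVM, it uses the input-space gap between the closest oppositely labeled points as a stand-in for the margin, and the function names, trial count and seed are our own choices for illustration.</p>

```python
import random


def hard_margin_gap(m, rng):
    """Draw m points uniformly from [-0.5, 0.5], label them by x >= 0,
    and return the gap between the closest positive and negative examples."""
    xs = [rng.uniform(-0.5, 0.5) for _ in range(m)]
    positives = [x for x in xs if x >= 0]
    negatives = [x for x in xs if 0 > x]
    if not positives or not negatives:
        return None  # only one label observed, so no boundary to place
    return min(positives) - max(negatives)


def mean_gap(m, trials=1000, seed=0):
    """Average gap over repeated samples of size m (seeded for repeatability)."""
    rng = random.Random(seed)
    gaps = [hard_margin_gap(m, rng) for _ in range(trials)]
    gaps = [g for g in gaps if g is not None]
    return sum(gaps) / len(gaps)


for m in (10, 100, 1000):
    g = mean_gap(m)
    # If the gap is O(1/m) then m * gap should stay roughly constant.
    print(m, g, m * g)
```

<p>The product of m and the average gap should stay roughly constant as m grows, which is the O(1/m) behavior claimed above: the model keeps improving while the margin (and so the usefulness of the bound) keeps shrinking.</p>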
<p>The result remains important: support vector machines are in some sense an optimal learning procedure (up to some constants they achieve as tight a generalization error as can even be confirmed for a given training set size).  Another strong point of the support vector result is the result applies even for kernels with unbounded VC dimension (such as the Gaussian kernel).</p>
<h2>The Gaussian (or radial) kernel again</h2>
<p>We said the Gaussian kernel was of infinite VC dimension.  But we have not shown why it is true.   The fact that the only encoding we could think of off the top of our head was a power-series or a limit does not guarantee there isn&#8217;t  some more clever encoding that easily demonstrates that the Gaussian kernel is of bounded VC dimension (like the identity kernel and cosine kernels are). </p>
<p>However we can see the Gaussian kernel can not have finite VC dimension because it can place an arbitrary number of bounded patches in the plane.  Thus we can build arbitrarily large sets (by placing all points far apart) that can be shattered (thus falsifying any bounded VC dimension).</p>
<p>One of the wonders of the support vector method is that the theory works even in this situation.</p>
<h2>Back to SVM</h2>
<p>Support vector machines seem nearly magical in their power to pick &#8220;best weights.&#8221;  However, as we have seen, these best weights may not always be much better than typical good weights (for instance: using all uniform weights).  Also it is very important to remember the support vector machine is picking the best weighting of a sum of already picked kernels.  It is not picking the best kernel shape or adjusting things like bandwidth (you must pick the best kernel ahead of time or try several kernels to find a good one).  You must remember support vector machines can only directly adjust dual weights (or relative training data weighting); this can effect some changes in hypothesis shape, but the support vector machine can not directly implement arbitrary changes in hypothesis shape like a nearest neighbor or a parameterized primal method can (like logistic regression with the kernel trick of adding interaction variables). </p>
<p>Support vector machines are to be much preferred to other combiners like <a target="_blank" href="http://en.wikipedia.org/wiki/Boosting">Boosting</a>.  This is because support vector machines have explicit stopping criteria (conditions known to be true at the optimal solution that characterize the optimal solution) and explicit regularization (control of over fitting).  Boosting relies on a path argument (&#8220;do the computation in these stages&#8221;) so it is much harder to reason about and frankly the folklore that &#8220;early stop prevents over fitting in boosting&#8221; is just false (you are much better off explicitly regularizing).</p>
<p>It is also lore that random kernel transformations (like power series over the cosine kernel) are magic.  They can make data separable but they are unlikely to meet the (unstated but critical) conditions of the support vector theorem (margin being wide, margin not filling in and expectation of k(.,.) being small).</p>
<h3>A comment on optimization</h3>
<p>In this writeup we have skipped the most beautiful part of support vector machines- the form of the optimization problem used to solve them.  We strongly suggest consulting  John Shawe-Taylor and Nello Cristianini&#8217;s &#8220;Kernel Methods for Pattern Analysis&#8221;, Cambridge 2004 for this.  The writing is dense but accurate.  You can literally paste their formulas into a general solver and they work.</p>
<h2>Conclusion</h2>
<p>We have shown how kernel methods and support vector machines fit in a sequence of machine learning methods:</p>
<ol>
<li>nearest neighbor models</li>
<li>uniform sum models</li>
<li>support vector machines.</li>
</ol>
<p>We also showed a number of examples of kernels and non-kernels.  It is good to be able to quickly remember the constant &#8220;-1&#8221; is not a kernel and the constant &#8220;1&#8221; is a kernel (though not a very useful one).</p>
<p>It is our argument that support vector machines are &#8220;dual methods&#8221; (as they work in weights over the data instead of weights over coordinates in the features space) and the move from the primal algorithm Naive Bayes to Logistic regression is similar to the move from uniform kernel sums to support vector machines.</p>
<p>Also, if you want to directly manipulate shape (like picking bandwidth) you must in some sense use a primal method (this isn&#8217;t what support vector machines are for).</p>
<p>We end with: If you have some feeling of both what a method can and can not do then you have some understanding of the method.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/04/do-your-tools-support-production-or-complexity/' rel='bookmark' title='Do your tools support production or complexity?'>Do your tools support production or complexity?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The equivalence of logistic regression and maximum entropy models</title>
		<link>http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-equivalence-of-logistic-regression-and-maximum-entropy-models</link>
		<comments>http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/#comments</comments>
		<pubDate>Fri, 23 Sep 2011 16:21:09 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Calculus of Variations]]></category>
		<category><![CDATA[log-likelihood]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Max-Ent]]></category>
		<category><![CDATA[Maximum Entropy]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1753</guid>
		<description><![CDATA[Nina Zumel recently gave a very clear explanation of logistic regression ( The Simpler Derivation of Logistic Regression ). In particular she called out the central role of log-odds ratios and demonstrated how the &#8220;deviance&#8221; (that mysterious quantity reported by fitting packages) is both a term in &#8220;the pseudo-R^2&#8221; (so directly measures goodness of fit) [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Nina Zumel recently gave a very clear explanation of logistic regression ( <a target="_blank" href="http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/">The Simpler Derivation of Logistic Regression</a> ).  In particular she called out the central role of log-odds ratios and demonstrated how the &#8220;deviance&#8221; (that mysterious<br />
quantity reported by fitting packages) is both a term in &#8220;the pseudo-R^2&#8221; (so directly measures goodness of fit) and is the quantity that is actually optimized during the fitting procedure.  One great point of the writeup was how simple everything is once you start thinking in terms of derivatives (and that it isn&#8217;t so much the functional form of the sigmoid that is special but its relation to its own derivative that is special).</p>
<p>We adapt these presentation ideas to make explicit the well known equivalence of logistic regression and maximum entropy models.<span id="more-1753"></span>In our new writeup: <a target="_blank" href="http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf">The equivalence of logistic regression and maximum entropy models</a>  we move to multi-category modeling and demonstrate how one invents something as remarkable as logistic regression.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/' rel='bookmark' title='The Simpler Derivation of Logistic Regression'>The Simpler Derivation of Logistic Regression</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Simpler Derivation of Logistic Regression</title>
		<link>http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-simpler-derivation-of-logistic-regression</link>
		<comments>http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/#comments</comments>
		<pubDate>Wed, 14 Sep 2011 15:36:37 +0000</pubDate>
		<dc:creator>Nina Zumel</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[likelihood]]></category>
		<category><![CDATA[log-likelihood]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[newton's method]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1740</guid>
		<description><![CDATA[Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/' rel='bookmark' title='The equivalence of logistic regression and maximum entropy models'>The equivalence of logistic regression and maximum entropy models</a></li>
<li><a href='http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/' rel='bookmark' title='Learn Logistic Regression (and beyond)'>Learn Logistic Regression (and beyond)</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) interval); furthermore, those probabilities are well-calibrated when compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression preserves the marginal probabilities of the training data. The coefficients of the model also provide some hint of the relative importance of each input variable.</p>
<p> While you don&#8217;t have to know how to derive logistic regression or how to implement it in order to use it, the details of its derivation give important insights into interpreting and troubleshooting the resulting models. Unfortunately, most derivations (like the ones in [Agresti, 1990] or [Hastie et al., 2009]) are too terse for easy comprehension. Here, we give a derivation that is less terse (and less general than Agresti&#8217;s), and we&#8217;ll take the time to point out some details and useful facts that sometimes get lost in the discussion.<span id="more-1740"></span><br/><br/>To make the discussion easier, we will focus on the binary response case. We assume that the case of interest (or &#8220;true&#8221;) is coded to 1, and the alternative case (or &#8220;false&#8221;) is coded to 0.</p>
<p>The logistic regression model assumes that the log-odds of an observation <em>y</em> can be expressed as a linear function of the K input variables <strong>x</strong>:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/B437C723-50ED-4B7B-97B6-56D36C66CD3C.jpg" alt="B437C723-50ED-4B7B-97B6-56D36C66CD3C.jpg" border="0" width="189" /></div>
<p>Here, we add the constant term <em>b<sub>0</sub></em>, by setting <em>x<sub>0</sub></em> = 1. This gives us  K+1 parameters. The left hand side of the above equation is called the <em>logit</em> of P (hence, the name logistic regression). </p>
<p>Let&#8217;s take the exponent of both sides of the logit equation.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/B4F430F4-5B17-4151-B7B9-AD0282034582.jpg" alt="B4F430F4-5B17-4151-B7B9-AD0282034582.jpg" border="0" width="200" /></div>
<p>This immediately tells us that logistic models are multiplicative in their inputs (rather than additive, like a linear model), and it gives us a way to interpret the coefficients. The value exp(<em>b<sub>j</sub></em>) tells us how the odds of the response being &#8220;true&#8221; increase (or decrease) as <em>x<sub>j</sub> </em>increases by one unit, all other things being equal. For example, suppose <em>b<sub>j</sub></em> = 0.693. Then exp(<em>b<sub>j</sub></em>) = 2. If <em>x<sub>j</sub></em> is a numerical variable (say, age in years), then every year&#8217;s increase in age doubles the odds of the response being true, all other things being equal. If <em>x<sub>j</sub> </em>is a binary variable (say, sex, with female coded as 1 and male as 0), then the odds of the response being true are twice as high for a female subject as for a male subject, all other things being equal.</p>
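<p>A small numerical illustration may help here. The sketch below (helper names are ours, and the starting probabilities are made up) applies the odds ratio exp(0.693) to a few baseline probabilities. Note that the coefficient multiplies the odds, so the change in probability depends on where you start:</p>

```python
import math


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def prob_from_odds(o):
    """Convert odds back into a probability."""
    return o / (1.0 + o)


b_j = 0.693
odds_ratio = math.exp(b_j)  # approximately 2

# Made-up baseline probabilities, before x_j increases by one unit.
for p in (0.10, 0.50, 0.90):
    new_p = prob_from_odds(odds(p) * odds_ratio)
    print(p, round(odds(p), 3), round(new_p, 3))
```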
<p>We can also invert the logit equation to get a new expression for P(x):</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/9512CE26-F521-4F55-A9C7-2F951C8CD791.jpg" alt="9512CE26-F521-4F55-A9C7-2F951C8CD791.jpg" border="0" width="141" /></div>
<p><br/>The right hand side of the top equation is the sigmoid of <em>z</em>, which maps the real line to the interval (0, 1), and is approximately linear near the origin. A useful fact about P(<em>z</em>) is that the derivative P&#8217;(<em>z</em>) = P(<em>z</em>) (1 &#8211; P(<em>z</em>)). Here&#8217;s the derivation:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/B2F9E40A-5775-47FE-B582-28B62F19BE99.jpg" alt="B2F9E40A-5775-47FE-B582-28B62F19BE99.jpg" border="0" width="600" /></div>
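<p>The identity P&#8217;(<em>z</em>) = P(<em>z</em>) (1 &#8211; P(<em>z</em>)) is easy to sanity-check numerically. The following is a minimal sketch (the function names and the step size are our own choices) comparing the closed form against a central finite difference:</p>

```python
import math


def sigmoid(z):
    """P(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Closed form from the derivation: P'(z) = P(z) (1 - P(z))."""
    p = sigmoid(z)
    return p * (1.0 - p)


def central_difference(f, z, h=1e-6):
    """Numerical estimate of f'(z), used only as an independent check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-3.0, 0.0, 1.5):
    print(z, sigmoid_derivative(z), central_difference(sigmoid, z))
```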
<p>Later, we will want to take the gradient of P with respect to the set of coefficients <strong>b</strong>, rather than <em>z</em>. In that case, P&#8217;(<em>z</em>) = P(<em>z</em>) (1 &#8211; P(<em>z</em>))<em>z</em>&#8216;, where &#8216; is the gradient taken with respect to <strong>b</strong>.</p>
<p>The solution to a Logistic Regression problem is the set of parameters <strong>b</strong> that maximizes the likelihood of the data, which is expressed as the product of the predicted probabilities of the N individual observations.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/2FC8A9C0-D435-459A-8D58-5B183EF952F5.jpg" alt="2FC8A9C0-D435-459A-8D58-5B183EF952F5.jpg" border="0" width="340" /></div>
<p>(<em>X, y</em>) is the set of observations; <em>X</em> is a K+1 by N matrix of inputs, where each column corresponds to an observation, and the first row is <strong>1</strong>; <em>y</em> is an N-dimensional vector of responses; and (<strong>x</strong><sub>i</sub>, <em>y<sub>i</sub></em>) are the individual observations.</p>
<p>It&#8217;s generally easier to work with the log of this expression, known (of course) as the log-likelihood.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/B4A3F94A-D4B5-4391-ABFD-C872839F3CFC.jpg" alt="B4A3F94A-D4B5-4391-ABFD-C872839F3CFC.jpg" border="0" width="412" /></div>
<p>Maximizing the log-likelihood will maximize the likelihood. As a side note, the quantity −2*log-likelihood is called the <em>deviance</em> of the model. It is analogous to the residual sum of squares (RSS) of a linear model. Ordinary least squares minimizes RSS; logistic regression minimizes deviance. A useful goodness-of-fit heuristic for a logistic regression model is to compare the deviance of the model with the so-called <em>null deviance</em>: the deviance of the constant model that returns only the global response probability for every data point. One minus the ratio of deviance to null deviance is sometimes called <em>pseudo-R<sup>2</sup></em>, and is used the way one would use R<sup>2</sup> to evaluate a linear model.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/E5A7630A-2EA5-47A3-9A8A-10BF46B64F6A.jpg" alt="E5A7630A-2EA5-47A3-9A8A-10BF46B64F6A.jpg" border="0" width="241" /></div>
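<p>To make the deviance and pseudo-R<sup>2</sup> definitions concrete, here is a minimal sketch. The responses and predicted probabilities are made up for illustration, and the function names are ours rather than those of any fitting package:</p>

```python
import math


def deviance(y, p):
    """-2 * log-likelihood of 0/1 responses y under predicted probabilities p."""
    return -2.0 * sum(
        math.log(pi) if yi == 1 else math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )


def pseudo_r_squared(y, p):
    """One minus the ratio of model deviance to null deviance."""
    base_rate = sum(y) / len(y)  # global response probability
    null_deviance = deviance(y, [base_rate] * len(y))
    return 1.0 - deviance(y, p) / null_deviance


y = [1, 0, 1, 1, 0, 0]
p = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]  # made-up model predictions
print(deviance(y, p), pseudo_r_squared(y, p))
```

<p>A model that only ever predicts the global response probability scores a pseudo-R<sup>2</sup> of zero by construction, and a model that puts probability near 1 on every observed outcome approaches one.</p>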
<p><br/>Traditional derivations of Logistic Regression tend to start by substituting the logit function directly into the log-likelihood equations, and expanding from there. The derivation is much simpler if we don&#8217;t plug the logit function in immediately. To maximize the log-likelihood, we take its gradient with respect to <strong>b</strong>:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/B2673255-4B13-4FF1-88F8-ACF004F28518.jpg" alt="B2673255-4B13-4FF1-88F8-ACF004F28518.jpg" border="0" width="260" /></div>
<p>where P<sub>i</sub> is shorthand for P(<strong>x</strong><sub>i</sub>). The maximum occurs where the gradient is zero.</p>
<p>We can expand this equation further, when we remember that P&#8217; = P(1-P):</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/C6B59E85-25AB-4537-A1C1-3664A98EEE00.jpg" alt="C6B59E85-25AB-4537-A1C1-3664A98EEE00.jpg" border="0" width="352" /></div>
<p>The last line merges the two cases (<em>y<sub>i</sub></em> = 1 and <em>y<sub>i</sub></em> = 0) into a single sum. We can now cancel terms and set the gradient to zero. This gives us the set of simultaneous equations that are true at the optimum:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/9D60BD95-E4E0-40AF-9350-01DA19EFE845.jpg" alt="9D60BD95-E4E0-40AF-9350-01DA19EFE845.jpg" border="0" width="150" /></div>
<p>Notice that the equations to be solved are in terms of the probabilities P (which are a function of <strong>b</strong>), not directly in terms of the coefficients <strong>b</strong> themselves. This means that logistic models are coordinate-free: for a given set of input variables, the probabilities returned by the model will be the same even if the variables are shifted, combined, or rescaled. Only the values of the coefficients will change.</p>
<p>The other thing to notice from the above equations is that the sum of probability mass across each coordinate of the <strong>x</strong><sub>i</sub> vectors is equal to the count of observations with that coordinate value for which the response was true. For example, suppose the jth input variable is 1 if the subject is female, 0 if the subject is male. Then </p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/2357AF36-1D0E-4EF1-88A9-8694220B62E4.jpg" alt="2357AF36-1D0E-4EF1-88A9-8694220B62E4.jpg" border="0" width="211" /></div>
<p>In other words, the summed probability mass for the female subjects equals the count of female subjects with the response &#8220;true&#8221;. It is also true that the sum of all the probability mass over the entire training set will equal the number of &#8220;true&#8221; responses in the training set. This is what we mean when we say that logistic regression preserves the marginal probabilities of the training data.</p>
<p><strong>Solving for the Coefficients</strong></p>
<p>The most straightforward way to solve for the coefficients <strong>b</strong> is Newton&#8217;s method. The Fisher scoring method that is used in most off-the-shelf implementations is a more general variation of Newton&#8217;s method; it works on the same principles. We will describe solving for the coefficients using Newton&#8217;s method.</p>
<p>Suppose you have a vector valued function <strong>f</strong>: <strong> y = f(b)</strong>.  You want to find the value <strong>b</strong><sub>opt</sub> such that  <strong>f</strong>(<strong>b</strong><sub>opt</sub>) = <strong>0</strong>. Assuming that we start with an initial guess <strong>b</strong><sub>0</sub>, we can take the Taylor expansion of <strong>f</strong> around <strong>b</strong><sub>0</sub>:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/DCAFCF7F-1991-4005-8CAD-39F5002B9C6C.jpg" alt="DCAFCF7F-1991-4005-8CAD-39F5002B9C6C.jpg" border="0" width="237" /></div>
<p>Here, <strong>f</strong>&#8216;  is a matrix; it is the Jacobian of first derivatives of <strong>f</strong> with respect to <strong>b</strong>. Setting the left hand side to zero, we can solve for &#916; as </p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/7447A740-4D4A-4BC5-AC66-45D9CF2665DD.jpg" alt="7447A740-4D4A-4BC5-AC66-45D9CF2665DD.jpg" border="0" width="176" /></div>
<p>We then update our estimate for <strong>b</strong>:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/6D54A73B-69F1-4B5B-9103-80AC40A0D08D.jpg" alt="6D54A73B-69F1-4B5B-9103-80AC40A0D08D.jpg" border="0" width="108" /></div>
<p>and iterate until convergence.</p>
<p>In our case, <strong>f</strong> is the gradient of the log-likelihood, and its Jacobian is the Hessian (the matrix of second derivatives) of the log-likelihood function.</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/F52762C7-618B-45AC-8396-01A228D0A54E.jpg" alt="F52762C7-618B-45AC-8396-01A228D0A54E.jpg" border="0" width="203" /></div>
<p>where <strong>W</strong> is a diagonal matrix of the derivatives P&#8217;<sub>i</sub>, and the ith column of <strong>X</strong> corresponds to the ith observation. So we can solve for &#916; at each iteration as</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/16E1E7D7-1EEF-43A6-BA5C-B77CBA2E2EED.jpg" alt="16E1E7D7-1EEF-43A6-BA5C-B77CBA2E2EED.jpg" border="0" width="239" /></div>
<p>where <strong>W</strong> is the current matrix of derivatives, <strong>y</strong> is the vector of observed responses, and <strong>P</strong><sub>k</sub> is the vector of probabilities as calculated by the current estimate of <strong>b</strong>.</p>
<p>Compare this to the solution of a linear regression:</p>
<div style="text-align:center;"><img src="http://www.win-vector.com/blog/wp-content/uploads/2011/09/2B73A3FE-3667-4BCF-BC67-B5A701014785.jpg" alt="2B73A3FE-3667-4BCF-BC67-B5A701014785.jpg" border="0" width="152"/></div>
<p>Comparing the two, we can see that at each iteration, &#916; is the solution of a weighted least squares problem, where the &#8220;response&#8221; is the difference between the observed response and its current estimated probability of being true. This is why the technique for solving logistic regression problems is sometimes referred to as <em>iteratively re-weighted least squares</em>. Generally, the method does not take long to converge (about 6 or so iterations).</p>
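<p>The iteration described above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not a production fitter: the function name <code>fit_logistic_newton</code> and the simulated data are made up for the example, and the code stores observations as rows of <code>X</code> (the transpose of the convention used in the text), so the weighted cross-product appears as <code>t(X) %*% (X * w)</code>. R&#8217;s built-in <code>glm()</code> is used only as a cross-check.</p>

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Newton / iteratively re-weighted least squares for logistic regression.
# Assumption: rows of X are observations and X includes an intercept column.
fit_logistic_newton <- function(X, y, max_iter = 25, tol = 1e-8) {
  b <- rep(0, ncol(X))                    # initial guess b_0
  for (k in seq_len(max_iter)) {
    p <- as.vector(sigmoid(X %*% b))      # current probabilities P_k
    w <- p * (1 - p)                      # diagonal of W
    grad <- t(X) %*% (y - p)              # gradient of the log-likelihood
    negH <- t(X) %*% (X * w)              # negated Hessian: weighted cross-product
    delta <- solve(negH, grad)            # the weighted least squares step
    b <- b + as.vector(delta)
    if (max(abs(delta)) < tol) break
  }
  b
}

# Simulated data (hypothetical coefficients 0.5, 1.0, -2.0).
set.seed(1)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
X <- cbind(1, x1, x2)
y <- rbinom(n, 1, sigmoid(0.5 + 1.0 * x1 - 2.0 * x2))

b_newton <- fit_logistic_newton(X, y)
b_glm <- unname(coef(glm(y ~ x1 + x2, family = binomial)))
print(cbind(b_newton, b_glm))   # the two columns should agree closely
```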
<p>Thinking of logistic regression as a weighted least squares problem immediately tells you a few things that can go wrong, and how. For example, if some of the input variables are correlated, then the Hessian <strong>H</strong> will be ill-conditioned, or even singular. This will result in large error bars (or &#8220;loss of significance&#8221;) around the estimates of certain coefficients. It can also result in coefficients with excessively large magnitudes, and often the wrong sign. If an input perfectly predicts the response for some subset of the data (but not all), then the term P<sub>i</sub> (1 &#8211; P<sub>i</sub>) will be driven to zero for that subset, which will drive the coefficient for that input to infinity (if the input perfectly predicted all the data, then the residual (<strong>y</strong> &#8211; <strong>P</strong><sub>k</sub>) has already gone to zero, which means that you are already at the optimum).</p>
<p>On the other hand, the least squares analogy also gives us the solution to these problems: <em>regularized regression</em>, such as lasso or ridge. Regularized regression penalizes excessively large coefficients, and keeps them bounded. If you are implementing your own logistic regression procedure, rather than using a package, then it is straightforward to implement a regularized least squares for the iteration step (<a href="http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/">as Win-Vector has done</a>). But even if you are using an off-the-shelf implementation, the above discussion will help give you a sense of how to interpret the coefficients of your model, and how to recognize and troubleshoot some issues that might arise.</p>
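<p>As a rough sketch of what a regularized iteration step can look like, the R code below adds a ridge (L2) penalty to the Newton update and runs it on perfectly separated toy data, where the unpenalized coefficients would grow without bound. The function name <code>fit_logistic_ridge</code>, the penalty strength <code>lambda = 1</code>, the choice to penalize the intercept as well, and the data are all assumptions made for illustration; this is not the Win-Vector implementation linked above.</p>

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Newton iteration for logistic regression with penalty lambda * sum(b^2) / 2.
# Assumption: rows of X are observations; the intercept is penalized too.
fit_logistic_ridge <- function(X, y, lambda, max_iter = 50, tol = 1e-8) {
  b <- rep(0, ncol(X))
  for (k in seq_len(max_iter)) {
    p <- as.vector(sigmoid(X %*% b))
    grad <- t(X) %*% (y - p) - lambda * b                            # penalized gradient
    negH <- t(X) %*% (X * (p * (1 - p))) + lambda * diag(ncol(X))    # penalized negated Hessian
    delta <- solve(negH, grad)
    b <- b + as.vector(delta)
    if (max(abs(delta)) < tol) break
  }
  b
}

# Perfectly separated toy data: y is 1 exactly when x is positive.
x <- c(-3, -2, -1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)
X <- cbind(1, x)
b_ridge <- fit_logistic_ridge(X, y, lambda = 1)
print(b_ridge)   # finite coefficients despite the perfect separation
```

<p>The penalty term keeps the matrix being solved well away from singular even when the weights P<sub>i</sub> (1 &#8211; P<sub>i</sub>) shrink toward zero, which is the mechanism the least squares analogy suggests.</p>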
<p><strong>Conclusion</strong></p>
<p>Here is what you should now know from going through the derivation of logistic regression step by step: </p>
<p>- Logistic regression models are multiplicative in their inputs. </p>
<p>- The exponential of each coefficient tells you how a unit change in that input variable affects the odds of the response being true. </p>
<p>- Logistic regression is coordinate-free: translations, rotations, and rescaling of the input variables will not affect the resulting probabilities. </p>
<p>- Logistic regression preserves the marginal probabilities of the training data. </p>
<p>- Overly large coefficient magnitudes, overly large error bars on the coefficient estimates, and the wrong sign on a coefficient could be indications of correlated inputs. </p>
<p>- Coefficients that tend to infinity could be a sign that an input is perfectly correlated with a subset of your responses. Or put another way, it could be a sign that this input is only really useful on a subset of your data, so perhaps it is time to segment the data. </p>
<p>- Pseudo-R<sup>2</sup> is a useful goodness-of-fit heuristic. </p>
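<p>The point about coefficients and odds can be checked numerically. The short R sketch below uses hypothetical coefficient values (<code>b0</code> and <code>b1</code> are invented for the example) and shows that a unit change in the input multiplies the odds by the exponential of the coefficient, regardless of where the change starts.</p>

```r
# Hypothetical fitted coefficients for a one-input logistic model.
b0 <- -1.0
b1 <- 0.7

# Odds of the response being true at input value x: p / (1 - p).
odds <- function(x) {
  p <- 1 / (1 + exp(-(b0 + b1 * x)))
  p / (1 - p)
}

# A unit change in x multiplies the odds by exp(b1), whatever the starting x.
print(odds(3) / odds(2))
print(odds(-5) / odds(-6))
print(exp(b1))
```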
<p><strong>References</strong></p>
<p>[Agresti, 1990] Agresti, A. (1990). Categorical Data Analysis.</p>
<p>[Hastie, et.al, 2009] Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning, 2nd Edition.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Programmers Should Know R</title>
		<link>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=programmers-should-know-r</link>
		<comments>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/#comments</comments>
		<pubDate>Sat, 06 Aug 2011 15:29:22 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[diagnosis]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1711</guid>
		<description><![CDATA[Programmers should definitely know how to use R. I don&#8217;t mean they should switch from their current language to R, but they should think of R as a handy tool during development.Again and again I find myself working with Java code like the following. public class SomeBigProject1 { public static double logStirlingApproximation(final int n) { [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Survive R'>Survive R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Programmers should definitely know how to use <a target="_blank" href="http://cran.r-project.org/">R</a>.  I don&#8217;t mean they should switch from their current language to R, but they should think of R as a handy tool during development.<span id="more-1711"></span> Again and again I find myself working with Java code like the following.</p>
<style type="text/css">
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
pre { line-height: 125%; }
body .hll { background-color: #ffffcc }
body  { background: #f8f8f8; }
body .c { color: #408080; font-style: italic } /* Comment */
body .err { border: 1px solid #FF0000 } /* Error */
body .k { color: #008000; font-weight: bold } /* Keyword */
body .o { color: #666666 } /* Operator */
body .cm { color: #408080; font-style: italic } /* Comment.Multiline */
body .cp { color: #BC7A00 } /* Comment.Preproc */
body .c1 { color: #408080; font-style: italic } /* Comment.Single */
body .cs { color: #408080; font-style: italic } /* Comment.Special */
body .gd { color: #A00000 } /* Generic.Deleted */
body .ge { font-style: italic } /* Generic.Emph */
body .gr { color: #FF0000 } /* Generic.Error */
body .gh { color: #000080; font-weight: bold } /* Generic.Heading */
body .gi { color: #00A000 } /* Generic.Inserted */
body .go { color: #808080 } /* Generic.Output */
body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
body .gs { font-weight: bold } /* Generic.Strong */
body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
body .gt { color: #0040D0 } /* Generic.Traceback */
body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
body .kp { color: #008000 } /* Keyword.Pseudo */
body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
body .kt { color: #B00040 } /* Keyword.Type */
body .m { color: #666666 } /* Literal.Number */
body .s { color: #BA2121 } /* Literal.String */
body .na { color: #7D9029 } /* Name.Attribute */
body .nb { color: #008000 } /* Name.Builtin */
body .nc { color: #0000FF; font-weight: bold } /* Name.Class */
body .no { color: #880000 } /* Name.Constant */
body .nd { color: #AA22FF } /* Name.Decorator */
body .ni { color: #999999; font-weight: bold } /* Name.Entity */
body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
body .nf { color: #0000FF } /* Name.Function */
body .nl { color: #A0A000 } /* Name.Label */
body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
body .nt { color: #008000; font-weight: bold } /* Name.Tag */
body .nv { color: #19177C } /* Name.Variable */
body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
body .w { color: #bbbbbb } /* Text.Whitespace */
body .mf { color: #666666 } /* Literal.Number.Float */
body .mh { color: #666666 } /* Literal.Number.Hex */
body .mi { color: #666666 } /* Literal.Number.Integer */
body .mo { color: #666666 } /* Literal.Number.Oct */
body .sb { color: #BA2121 } /* Literal.String.Backtick */
body .sc { color: #BA2121 } /* Literal.String.Char */
body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
body .s2 { color: #BA2121 } /* Literal.String.Double */
body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
body .sh { color: #BA2121 } /* Literal.String.Heredoc */
body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
body .sx { color: #008000 } /* Literal.String.Other */
body .sr { color: #BB6688 } /* Literal.String.Regex */
body .s1 { color: #BA2121 } /* Literal.String.Single */
body .ss { color: #19177C } /* Literal.String.Symbol */
body .bp { color: #008000 } /* Name.Builtin.Pseudo */
body .vc { color: #19177C } /* Name.Variable.Class */
body .vg { color: #19177C } /* Name.Variable.Global */
body .vi { color: #19177C } /* Name.Variable.Instance */
body .il { color: #666666 } /* Literal.Number.Integer.Long */
</style>
<div class="highlight">
<pre><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeBigProject1</span> <span class="o">{</span>
	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">double</span> <span class="nf">logStirlingApproximation</span><span class="o">(</span><span class="kd">final</span> <span class="kt">int</span> <span class="n">n</span><span class="o">)</span> <span class="o">{</span>
		<span class="k">return</span> <span class="n">n</span><span class="o">*(</span><span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="n">n</span><span class="o">)-</span><span class="mi">1</span><span class="o">)</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="mi">2</span><span class="o">*</span><span class="n">Math</span><span class="o">.</span><span class="na">PI</span><span class="o">*</span><span class="n">n</span><span class="o">);</span>
	<span class="o">}</span>

	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">double</span> <span class="nf">logFactorial</span><span class="o">(</span><span class="kd">final</span> <span class="kt">int</span> <span class="n">n</span><span class="o">)</span> <span class="o">{</span>
		<span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="n">n</span><span class="o">;</span><span class="n">i</span><span class="o">&gt;</span><span class="mi">1</span><span class="o">;--</span><span class="n">i</span><span class="o">)</span> <span class="o">{</span>
			<span class="n">r</span> <span class="o">+=</span> <span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="n">i</span><span class="o">);</span>
		<span class="o">}</span>
		<span class="k">return</span> <span class="n">r</span><span class="o">;</span>
	<span class="o">}</span>

	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
		<span class="kt">int</span> <span class="n">nbad</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">n</span><span class="o">=</span><span class="mi">1000</span><span class="o">;</span><span class="n">n</span><span class="o">&lt;</span><span class="mi">10000</span><span class="o">;++</span><span class="n">n</span><span class="o">)</span> <span class="o">{</span>
			<span class="k">if</span><span class="o">(</span><span class="n">Math</span><span class="o">.</span><span class="na">abs</span><span class="o">(</span><span class="n">logFactorial</span><span class="o">(</span><span class="n">n</span><span class="o">)-</span><span class="n">logStirlingApproximation</span><span class="o">(</span><span class="n">n</span><span class="o">))&gt;=</span><span class="mf">1.0</span><span class="n">e</span><span class="o">-</span><span class="mi">5</span><span class="o">)</span> <span class="o">{</span>
				<span class="o">++</span><span class="n">nbad</span><span class="o">;</span>
			<span class="o">}</span>
		<span class="o">}</span>
		<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;nbad: &quot;</span> <span class="o">+</span> <span class="n">nbad</span><span class="o">);</span>
	<span class="o">}</span>
<span class="o">}</span>
</pre>
</div>
<p>Imagine that this is some humongous project to use <a target="_blank" href="http://en.wikipedia.org/wiki/Stirling's_approximation">Stirling&#8217;s Approximation</a> as a replacement for factorial.  All the code up until main is great.  But the unfortunate developer has hard-coded an acceptance test into <code>main()</code>.  If they run their big project all they get out is:</p>
<pre>
nbad: 7334
</pre>
<p>The developer needs to re-code and re-build to diagnose the failure, tweak their acceptance criteria or add more measurements.</p>
<p>I strongly recommend a different work pattern.  Instead of bringing criteria into the code, bring the data out:</p>
<div class="highlight">
<pre><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeBigProject2</span> <span class="o">{</span>
	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
		<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;n&quot;</span>
				<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="s">&quot;logFactorial&quot;</span>
				<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="s">&quot;logStirlingApproximation&quot;</span><span class="o">);</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">n</span><span class="o">=</span><span class="mi">1000</span><span class="o">;</span><span class="n">n</span><span class="o">&lt;</span><span class="mi">10000</span><span class="o">;++</span><span class="n">n</span><span class="o">)</span> <span class="o">{</span>
			<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">String</span><span class="o">.</span><span class="na">valueOf</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
					<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="n">SomeBigProject1</span><span class="o">.</span><span class="na">logFactorial</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
					<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="n">SomeBigProject1</span><span class="o">.</span><span class="na">logStirlingApproximation</span><span class="o">(</span><span class="n">n</span><span class="o">));</span>
		<span class="o">}</span>
	<span class="o">}</span>
<span class="o">}</span>
</pre>
</div>
<p>Capture this output in a file named &#8220;data.tsv&#8221; and both Microsoft Excel and R can open it.  Naturally I prefer to use R (so that is what I will demonstrate).  To read the results into R you start R and type in a command like the following:</p>
<pre>
 &gt; d &lt;- read.table('data.tsv',
        header=T,sep='\t',quote='',as.is=T,
        stringsAsFactors=F,comment.char='',allowEscapes=F)
</pre>
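<p>To see how forgiving these settings are, here is a small self-contained round trip using the same <code>read.table</code> arguments as the call above. The file contents are invented for the example (a temporary file stands in for &#8220;data.tsv&#8221;), and the fields deliberately contain an apostrophe, a <code>#</code>, double quotes and a literal backslash sequence.</p>

```r
# Write a tiny tab-separated file with awkward characters in the text field.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("n\tnote",
             "1\tit's # not a comment",
             "2\tsay \"hi\" \\n literally"), tmp)

# Read it back with quoting, comments and escape processing all switched off.
d <- read.table(tmp, header = TRUE, sep = "\t", quote = "", as.is = TRUE,
                stringsAsFactors = FALSE, comment.char = "", allowEscapes = FALSE)
print(d)
print(sapply(d, class))   # column types chosen by read.table
```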
<p>Most of the arguments control the style of file R is to expect (what the field separator is, whether to expect escapes and quotes, and so on).  The settings I suggest here are the &#8220;ultra hardened&#8221; settings.  If you make sure none of your fields have a tab or line-break in them when you print, then it is guaranteed R can read the data (no matter what wacky symbols are in it).  On the Java side that usually means making sure any varying text fields are run through <code>.replaceAll("\\s+"," ")</code> &#8220;just in case.&#8221; At this point you can already look at your data with the <code>summary()</code> command:</p>
<pre>
 &gt; summary(d)
</pre>
<pre>
       n         logFactorial   logStirlingApproximation
 Min.   :1000   Min.   : 5912   Min.   : 5912
 1st Qu.:3250   1st Qu.:23034   1st Qu.:23034
 Median :5500   Median :41870   Median :41870
 Mean   :5500   Mean   :42536   Mean   :42536
 3rd Qu.:7749   3rd Qu.:61653   3rd Qu.:61653
 Max.   :9999   Max.   :82100   Max.   :82100
</pre>
<p>This immediately hints that you should have been thinking in terms of relative error instead of absolute error (since insisting on high absolute accuracy on large results does not always make sense).</p>
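<p>A minimal sketch of that comparison follows. To keep it self-contained it does not read &#8220;data.tsv&#8221;; instead it rebuilds the same three columns in R, using <code>lgamma(n + 1)</code> as a stand-in for the Java <code>logFactorial</code> loop (an assumption of this example). The column names <code>absErr</code> and <code>relErr</code> are also made up here.</p>

```r
# Rebuild the table over the same range of n as the Java program.
n <- 1000:9999
d <- data.frame(n = n,
                logFactorial = lgamma(n + 1),   # log(n!) via the log-gamma function
                logStirlingApproximation = n * (log(n) - 1) + 0.5 * log(2 * pi * n))

d$absErr <- with(d, abs(logFactorial - logStirlingApproximation))
d$relErr <- with(d, absErr / abs(logFactorial))

print(sum(d$absErr >= 1.0e-5))   # rows failing the original absolute-error test
print(max(d$relErr))             # worst relative error over the same rows
```

<p>Looking at both error columns side by side makes it easier to decide which criterion the acceptance test should really have used.</p>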
<p>You also have access to standard statistical measures of agreement like correlation: </p>
<pre>
 &gt; with(d,cor(logFactorial,logStirlingApproximation))
</pre>
<pre>
result: 1
</pre>
<p>You can see where your failures were:</p>
<pre>
 &gt; library(ggplot2)
 &gt; d$bad &lt;- with(d,abs(logFactorial-logStirlingApproximation)&gt;=1.0e-5)
 &gt; ggplot(d) + geom_point(aes(x=n,y=bad))
</pre>
<p>Yields the graph:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/bad.png" alt="bad.png" border="0" width="525" height="525" /><br />
</center></p>
<p>You can see all your failures are in the initial interval.  You can then drill in:</p>
<pre>
 &gt; ggplot(d) + geom_point(aes(x=n,y=logFactorial-logStirlingApproximation)) +
                 scale_y_log10()
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/diff.png" alt="diff.png" border="0" width="525" height="525" /><br />
</center></p>
<p>And here we see some things (that are in general true for Stirling&#8217;s approximation):</p>
<ol>
<li>It is very accurate.</li>
<li>It is always an underestimate.</li>
<li>It gets better as n gets larger.</li>
</ol>
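<p>These three observations can also be checked without a plot. The sketch below is an illustration under assumptions: it uses <code>lgamma(n + 1)</code> for the log-factorial, and a smaller range of <code>n</code> than the Java program so that floating-point noise stays well below the quantities being compared. It also compares the gap with 1/(12 n), the next term of the Stirling series, which is not mentioned in the text above.</p>

```r
n <- 10:1000
logFactorial <- lgamma(n + 1)                              # log(n!)
logStirling <- n * (log(n) - 1) + 0.5 * log(2 * pi * n)    # same formula as the Java code
gap <- logFactorial - logStirling

print(all(gap > 0))          # TRUE means the approximation never overshoots on this range
print(all(diff(gap) < 0))    # TRUE means the gap shrinks as n grows
print(range(gap * 12 * n))   # close to 1 if the gap behaves like 1/(12 n)
```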
<p>Essentially by poking around with graphs in R you can figure out the nature of your errors (telling you what to fix) and generate findings that tell you how to fix your criteria (perhaps your code is working- but your test wasn&#8217;t sensible).  The &#8220;dump everything and then use R&#8221; technique is also particularly good for generating reports on code timings using either <code>geom_histogram</code> or <code>geom_density</code>. </p>
<p>For example, if we had data with a field <code>runTimeMS</code> then it is a simple one-liner to get a plot like the following:</p>
<pre>
 &gt; ggplot(t) + geom_density(aes(x=runTimeMS))
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/timing.png" alt="timing.png" border="0" width="525" height="525" /><br />
</center></p>
<p>From this graph we can immediately see:</p>
<ol>
<li>Most of our run-times are very fast.</li>
<li>We have a heavy right-tail (evidence of &#8220;contagion&#8221; or one slow-down causing others, like CPU or IO contention).</li>
<li>Data is truncated at 100MS (could be something &#8220;censoring&#8221; the measurement, an exception being thrown or an abort).</li>
<li>There is a spike at 30MS (something is true and slow for some subset of the data that isn&#8217;t present in the majority).</li>
</ol>
<p>This is a lot more than would be seen in a mean-only or mean and standard deviation summary.  We may even be seeing signs of two different bugs (the truncation and the spike).</p>
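<p>To illustrate how little a two-number summary reveals, the following R sketch builds synthetic timings with the same kinds of features (a fast bulk, a cluster near 30 ms and censoring at 100 ms). Every number in it is invented for the example, the data frame is called <code>timings</code> rather than <code>t</code>, and base R binning is used as a text stand-in for the density plot.</p>

```r
set.seed(42)
fast <- rexp(900, rate = 1 / 5)           # mostly fast runs with a long right tail
slow <- rnorm(80, mean = 30, sd = 0.5)    # a slow subset clustered near 30 ms
stuck <- rep(250, 20)                     # runs that would exceed the cutoff
timings <- data.frame(runTimeMS = pmin(c(fast, slow, stuck), 100))   # censor at 100 ms

# The two-number summary gives no hint of the cluster or the censoring.
print(c(mean = mean(timings$runTimeMS), sd = sd(timings$runTimeMS)))

# Binned counts show both features.
bins <- cut(timings$runTimeMS, breaks = seq(0, 100, by = 10), include.lowest = TRUE)
print(table(bins))
```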
<p>In all cases the key is to dump a lot of data in machine readable form and then come back to analyze it.  This is far more flexible than hoping to code in the right summaries and then further hoping the summaries don&#8217;t miss something important (or that you at least get a chance to notice if they do miss something).  Being able to do exploratory statistics on dumps from your code (both results and timing) gives you incredible measurement, tuning and debugging powers.   The scriptability of R means any later analysis is as easy as cut and paste.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Your Data is Never the Right Shape</title>
		<link>http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=your-data-is-never-the-right-shape</link>
		<comments>http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/#comments</comments>
		<pubDate>Sun, 31 Jul 2011 20:27:00 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[plyr]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[reshape]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1687</guid>
		<description><![CDATA[One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>One of the recurring frustrations in data analytics is that your data is never in the right shape.  Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want.  Best case: you notice this and have the tools to reshape your data.  </p>
<p>There is no final &#8220;right shape.&#8221;  In fact even your data is never right. You will always be called to re-do your analysis (new variables, new data, corrections) so you should always understand you are on your &#8220;penultimate analysis&#8221; (always one more to come).  This is why we insist on using general methods and scripted techniques, as these methods are much much easier to reliably reapply on new data than GUI/WYSIWYG techniques.</p>
<p>In this article we will work a small example and call out some <a target="_blank" href="http://cran.r-project.org/">R</a> tools that make reshaping your data much easier.  The idea is to think in terms of &#8220;relational algebra&#8221; (like SQL) and transform your data towards your tools (and not to attempt to adapt your tools towards the data in an ad-hoc manner).<span id="more-1687"></span> Take a simple example where you are designing a new score called &#8220;<code>score2</code>&#8221; to predict or track an already known value called &#8220;<code>score1</code>.&#8221;  The typical situation is <code>score1</code> is a future outcome (such as the number of dollars profit on a transaction) and <code>score2</code> is a prediction (such as the estimated profit before the transaction is attempted).  Training data is usually assembled by performing a large number of transactions, recording what was known before the transaction and then aligning or joining this data with measured results when they become available.  For this example we are not interested in the inputs driving the model (a rare situation, but we are trying to make our example as simple as possible) but only examining the quality of <code>score2</code> (which is defined as how well it tracks <code>score1</code>).</p>
<p>All of this example will be in R, but the principles are chosen to apply more generally.  First let us enter some example data:</p>
<p><code><br />
<br/> &gt; d &lt;- data.frame(id=c(1,2,3,1,2,3),score1=c(17,5,6,10,13,7),score2=c(13,10,5,13,10,5))<br />
<br/> &gt; d<br />
</code></p>
<p>This gives us our example data.  Each row is numbered (1 through 6) and has an <code>id</code> and both our scores:</p>
<pre>
  id score1 score2
1  1     17     13
2  2      5     10
3  3      6      5
4  1     10     13
5  2     13     10
6  3      7      5
</pre>
<p>We said our only task was to characterize how well <code>score2</code> works at predicting <code>score1</code> (or how good a substitute <code>score2</code> is for <code>score1</code>).  We could compute correlation, RMS error, info-gain or some such.  But instead let&#8217;s look at this graphically.  We will prepare a graph showing how well <code>score1</code> is represented by <code>score2</code>.  For this we choose to place <code>score1</code> on the y-axis (as it is the outcome) and <code>score2</code> on the x-axis (as it is the driver).</p>
<p><code><br />
<br/> &gt; library(ggplot2)<br />
<br/> &gt; ggplot(d) + geom_point(aes(x=score2,y=score1))<br />
</code></p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/plot1.png" alt="plot1.png" border="0" width="525" height="525" /></p>
<p>Figure 1: <code>score1</code> as a function of  <code>score2</code>.</p>
<p></center></p>
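<p>For comparison, two of the numeric summaries mentioned earlier (correlation and RMS error) take only a couple of lines of base R on the same six rows. The variable names <code>corr</code> and <code>rmse</code> are made up for this sketch.</p>

```r
d <- data.frame(id = c(1, 2, 3, 1, 2, 3),
                score1 = c(17, 5, 6, 10, 13, 7),
                score2 = c(13, 10, 5, 13, 10, 5))

corr <- with(d, cor(score1, score2))               # Pearson correlation
rmse <- with(d, sqrt(mean((score1 - score2)^2)))   # root mean square error
print(c(correlation = corr, rmse = rmse))
```

<p>Neither number says where the disagreement comes from, which is one reason to look at the plot in Figure 1 as well.</p>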
<p>This does not look good.  We would have liked to see all of the dots falling on the line &#8220;y=x.&#8221;  This plot shows <code>score2</code> is not predicting <code>score1</code> very well.  Part of this is that we missed an important feature of the data (and because we missed it the feature becomes a problem): the <code>id</code>s repeat.  First we re-order by <code>id</code> to make this more obvious.</p>
<p><code><br />
<br/> &gt; dsort &lt;- d[order(d$id),]<br />
<br/> &gt; dsort<br />
</code></p>
<pre>
  id score1 score2
1  1     17     13
4  1     10     13
2  2      5     10
5  2     13     10
3  3      6      5
6  3      7      5
</pre>
<p>This is a very common situation.  The original score is not completely a function of the known inputs.  We are using &#8220;<code>id</code>&#8221; to abstractly represent all of the inputs: two rows in our example have the same <code>id</code> if and only if all known inputs are exactly the same.  The repeating <code>id</code>s are the same experiment run at different times (a good idea) and the variation in <code>score1</code> could be the effect of an un-modeled input that changed value or something simple like a &#8220;noise term&#8221; (a random un-modeled effect).   Notice that <code>score2</code> is behaving as a function of <code>id</code>: all rows with the same <code>id</code> have the same value for <code>score2</code>.  If <code>score2</code> is a model then it has to be a function of the inputs (or more precisely if it is not a function of the inputs you have done something wrong).  So any variation of <code>score1</code> between rows with identical <code>id</code> is &#8220;unexplainable variation&#8221; (unexplainable from the point of view of currently tracked inputs).  You should know about, characterize and report this variation (which is why it is good to have some repeated experiments).  But this variation is not the model&#8217;s fault; if we want to know how good a job we did constructing the model (which we now see can be a slightly different question than how well the model works at prediction) we need to see how much of the explainable variation the model accounts for.</p>
<p>If we assume (as is traditional) the unexplained variation is from an &#8220;unbiased noise source&#8221; then we can lessen the impact of the noise source by replacing <code>score1</code> with a value averaged over rows with the same <code>id</code>.  This assumption is traditional because an unbiased noise source is present in many problems and assuming anything more requires more research into the problem domain.   You would eventually fold such research into your model, so your goal is always to have all effects or biases in your model and hope what is left over is unbiased.  This is usually not strictly true, but not accounting for the unexplained variation at all is in many cases even worse than modeling the unexplained variation as being bias-free.</p>
<p>And now we find our data is the &#8220;wrong shape.&#8221;  To replace <code>score1</code> with the appropriate averages we need to do some significant data manipulation.  We need to group sets of rows and add new columns. We could do this imperatively (write some loops and design some variables to track and manipulate state) or declaratively (find a path of operations from what you have to what you need through R&#8217;s data manipulation algebra).  Even though the declarative method is more work the first time (you could often write the code in less time than it takes to read the manuals) it is the right way to go (as it is more versatile and powerful in the end).</p>
<p>Luckily we don&#8217;t have to use raw R.  There are a number of remarkable packages (all by <a target="_blank" href="http://had.co.nz/">Hadley Wickham</a> who is also the author of the <a target="_blank" href="http://had.co.nz/ggplot2/">ggplot2</a> package we use to prepare our figures) that really improve R&#8217;s ability to coherently manage data.  The easiest (on us) way to fix up our data is to make the computer work hard and use the powerful melt/cast technique.  These functions are found in the libraries <a target="_blank" href="http://www.jstatsoft.org/v21/i12/paper">reshape</a> and <a target="_blank" href="http://www.jstatsoft.org/v40/i01/paper">plyr</a> (which were automatically loaded when we loaded the ggplot2 library).</p>
<p>melt is a bit abstract.  What it does is convert your data into a &#8220;narrow&#8221; format where rows are split into many rows each carrying just one result column of the original row.  For example we can melt our data by <code>id</code> as follows:</p>
<p><code><br />
<br/> &gt; dmelt &lt;- melt(d,id.vars=c('id'))<br />
<br/> &gt; dmelt<br />
</code></p>
<p>Which yields the following:</p>
<pre>
   id variable value
1   1   score1    17
2   1   score1    10
3   2   score1     5
4   2   score1    13
5   3   score1     6
6   3   score1     7
7   1   score2    13
8   1   score2    13
9   2   score2    10
10  2   score2    10
11  3   score2     5
12  3   score2     5
</pre>
<p>Each of the two facts (<code>score1</code>, <code>score2</code>) from our original row is split into its own row.  The <code>id</code> column plus the new variable column are now considered to be keys.  This format is rarely used directly; it is useful because it is easy to express important data transformations in terms of it.  For instance we wanted our table to have duplicate rows collected and <code>score1</code> replaced by its average (to attempt to remove the unexplainable variation).  That is now easy:</p>
<p><code><br />
<br/> &gt; dmean &lt;- cast(dmelt,fun.aggregate=mean)<br />
<br/> &gt; dmean<br />
</code></p>
<pre>
  id score1 score2
1  1   13.5     13
2  2    9.0     10
3  3    6.5      5
</pre>
<p>We used <code>cast()</code> in its default mode, where it assumes all columns not equal to &#8220;value&#8221; are the keyset.  It then collects all rows with identical keying and combines them back into wide rows using mean or average as the function to deal with duplicates.  Notice <code>score1</code> is now the desired average, and <code>score2</code> is as before (as it was a function of the keys or inputs, so it is not affected by averaging).  With this new smaller data set we can re-try our original graph:</p>
<p><code><br />
<br/> &gt; ggplot(dmean) + geom_point(aes(x=score2,y=score1))<br />
</code></p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/plot2.png" alt="plot2.png" border="0" width="525" height="525" /></p>
<p>Figure 2: <code>mean(score1)</code> as a function of  <code>score2</code>.</p>
<p></center></p>
<p>This doesn&#8217;t look so bad.  A lot of the error or variation in the first plot was unexplainable variation.  <code>score2</code> isn&#8217;t bad given its inputs.  If you wanted to do better than <code>score2</code> you would be advised to find more modeling inputs (versus trying more exotic modeling techniques).</p>
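<p>The averaging step can also be cross-checked without the reshape package.  What follows is a minimal sketch in base R, not the method used in this article: the data frame <code>d</code> is re-typed from the printed tables (its exact row order is an assumption), and <code>aggregate()</code> stands in for <code>cast()</code>.</p>

```r
# Data re-typed from the tables printed in this article (row order assumed).
d <- data.frame(
  id     = c(1, 2, 3, 1, 2, 3),
  score1 = c(17, 5, 6, 10, 13, 7),
  score2 = c(13, 10, 5, 13, 10, 5)
)

# Average every score column within each id (base R stand-in for cast()).
dmean <- aggregate(cbind(score1, score2) ~ id, data = d, FUN = mean)
print(dmean)

# score2 is a function of id, so its spread within each id must be zero
# and averaging leaves it unchanged.
spread2 <- tapply(d$score2, d$id, function(v) max(v) - min(v))
print(spread2)
```

<p>If the two approaches ever disagree on the averaged <code>score1</code> column, that is a sign the keys were not what you thought they were.</p>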
<p>Of course a client or user is not interested if <code>score2</code> is &#8220;best possible.&#8221;  They want to know if it is any good.  To do this we should show them (either by graph or by quantitative summary statistics like we mentioned earlier) at least 3 things:</p>
<ol>
<li>How well the model predicts overall (the very first graph we presented).</li>
<li>How much of the explainable variation the model predicts (the second graph).</li>
<li>The nature of the unexplained variation (which we will explore next).</li>
</ol>
<p>We said earlier we are hoping the unexplained variation is noise (or if it is not noise it would be nice if it is a clue to new important modeling features).  So the unexplained variation must not go unexamined.  We will finish by showing how to characterize the unexplained variation.  As before we will just make a graph, but the data preparation steps would be exactly the same if we were using a quantitative summary (like correlation, or any other).  And, of course, our data is still not the right shape for this step.  Luckily there is another tool ready to fix this: <code>join()</code>.</p>
<p><code><br />
<br/> &gt; djoin &lt;- join(dsort,dsort,'id')<br />
<br/> &gt; fixnames &lt;- function(cn) {<br />
     n &lt;- length(cn);<br />
     for(i in 2:((n+1)/2)) { cn[i] &lt;- paste('a',cn[i],sep='') };<br />
     for(i in ((n+3)/2):n) { cn[i] &lt;- paste('b',cn[i],sep='') };<br />
     cn<br />
  }<br />
<br/> &gt; colnames(djoin) &lt;- fixnames(colnames(djoin))<br />
<br/> &gt; djoin<br />
</code></p>
<p>which produces:</p>
<pre>
   id ascore1 ascore2 bscore1 bscore2
1   1      17      13      17      13
2   1      17      13      10      13
3   1      10      13      17      13
4   1      10      13      10      13
5   2       5      10       5      10
6   2       5      10      13      10
7   2      13      10       5      10
8   2      13      10      13      10
9   3       6       5       6       5
10  3       6       5       7       5
11  3       7       5       6       5
12  3       7       5       7       5
</pre>
<p>All of the work was done by the single line &#8220;<code>djoin &lt;- join(dsort,dsort,'id')</code>&#8221;; the rest was just fixing the column names (as self-join is not the central use case of join).  What we have now is a table that is exactly right for studying unexplained variation.  Each row is matched with every row that has the same <code>id</code> (including itself).  This blows every <code>id</code> up from having 2 rows in <code>dsort</code> to 4 rows in <code>djoin</code>.  Notice this gives us every pair of <code>score1</code> values seen for the same <code>id</code> (which will let us examine unexplained variation) and <code>score2</code> is still constant over all rows with the same <code>id</code> (as it has always been throughout our analysis).  With this table we can now plot how <code>score1</code> varies for rows with the same <code>id</code>:</p>
<p><code><br />
<br/> &gt; ggplot(djoin) + geom_point(aes(x=ascore1,y=bscore1))<br />
</code></p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/unex.png" alt="unex.png" border="0" width="525" height="525" /></p>
<p>Figure 3: <code>score1</code> as a function of  <code>score1</code>.</p>
<p></center></p>
<p>And we can see, as we expected, the unexplained variation in <code>score1</code> is about as large as the mismatch between <code>score1</code> and <code>score2</code> in our original plot.  The important thing is this is all about <code>score1</code> (<code>score2</code> is now literally out of the picture).  The analyst&#8217;s job would now be to try and tie bits of the unexplained variation to new inputs (that can be folded into a new <code>score2</code>) and/or characterize the noise term (so the customer knows how close they should expect repeated experiments to be).</p>
<p>What we are trying to encourage with the use of &#8220;big hammer tools&#8221; is an ability and willingness to look at and transform your data in meaningful steps.  It often seems easier and more efficient to build one more piece of data tubing, but a lot of data tubes become an unmanageable collection of spaghetti code.  The analyst should, in some sense, always be looking at data and not looking at coding details.  For this sort of analysis we encourage analysts to think in terms of &#8220;data shape&#8221; and transforms.  This discipline leaves more of the analyst&#8217;s energy and attention to think productively about the data and actual problem domain.</p>
<hr />
Note:</p>
<p>For the third plot showing the variation of <code>score1</code> across different rows (but same <code>id</code>s) it may be appropriate to use a slightly more complicated <code>join()</code> procedure than we showed.  The join shown produced rows of artificial agreement where both values of <code>score1</code> came from the same row (thus had no chance of being different, so in some sense deserve no credit).  This is also the only way any non-duplicated evaluations could make it to the plot.  To eliminate these uninteresting agreements from the plot do the following:</p>
<p><code><br />
<br/> &gt; d$rowNumber &lt;- 1:(dim(d)[1])<br />
<br/> &gt; djoin &lt;- join(d,d,'id')<br />
<br/> &gt; colnames(djoin) &lt;- fixnames(colnames(djoin))<br />
<br/> &gt; djoin &lt;- djoin[djoin$arowNumber!=djoin$browNumber,]<br />
<br/> &gt; djoin<br />
</code></p>
<p>This gives us a table that shows only values of <code>score1</code> from different rows:</p>
<pre>
   id ascore1 ascore2 arowNumber bscore1 bscore2 browNumber
2   1      17      13          1      10      13          4
4   2       5      10          2      13      10          5
6   3       6       5          3       7       5          6
7   1      10      13          4      17      13          1
9   2      13      10          5       5      10          2
11  3       7       5          6       6       5          3
</pre>
<p>The resulting plot only shows points on the diagonal if &#8220;you have really earned them&#8221;:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/fig4.png" alt="fig4.png" border="0" width="525" height="525" /><br />
</center></p>
<p>So while the direct <code>join()</code> may not be the immediate perfect answer, it is still a good intermediate to form, as what you want is only a simple data transformation away from it.</p>
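<p>The same self-join-then-filter idea can be sketched in base R.  This is an illustration under assumptions, not the code used for the figures: <code>d</code> is re-typed from the printed tables, <code>merge()</code> stands in for <code>join()</code>, and the <code>.a</code>/<code>.b</code> suffixes and the name <code>dpairs</code> are choices made here.  It also adds one possible quantitative summary of the unexplained variation.</p>

```r
# Data re-typed from the tables printed in this article (row order assumed).
d <- data.frame(
  id     = c(1, 2, 3, 1, 2, 3),
  score1 = c(17, 5, 6, 10, 13, 7),
  score2 = c(13, 10, 5, 13, 10, 5)
)
d$rowNumber <- seq_len(nrow(d))

# Self-join on id; merge() renames the clashing columns using the suffixes.
dpairs <- merge(d, d, by = "id", suffixes = c(".a", ".b"))

# Keep only pairs built from two different original rows.
dpairs <- dpairs[dpairs$rowNumber.a != dpairs$rowNumber.b, ]
print(dpairs)

# One possible summary of the unexplained variation: the mean absolute
# difference between repeated score1 measurements of the same id.
repeat_gap <- mean(abs(dpairs$score1.a - dpairs$score1.b))
print(repeat_gap)
```

<p>Because <code>merge()</code> handles the column renaming itself, no helper like <code>fixnames()</code> is needed in this variant.</p>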
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What is a large enough random sample?</title>
		<link>http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-is-a-large-enough-random-sample</link>
		<comments>http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/#comments</comments>
		<pubDate>Sun, 26 Jun 2011 18:54:14 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Statistics To English Translation]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Estimation]]></category>
		<category><![CDATA[Hoeffding's Inequality]]></category>
		<category><![CDATA[Pseudorandomness]]></category>
		<category><![CDATA[Random Sampling]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1667</guid>
		<description><![CDATA[With the well deserved popularity of A/B testing computer scientists are finally becoming practicing statisticians. One part of experiment design that has always been particularly hard to teach is how to pick the size of your sample. The two points that are hard to communicate are that: The required sample size is essentially independent of [...]
]]></description>
			<content:encoded><![CDATA[<p>With the well deserved popularity of A/B testing computer scientists are finally becoming practicing statisticians.  One part of experiment design that has always been particularly hard to teach is how to pick  the size of your sample.  The two points that are hard to communicate are that:</p>
<ul>
<li>The required sample size is essentially <i>independent</i> of the total population size.</li>
<li>The required sample size <i>depends</i> strongly on the strength of the effect you are trying to measure.</li>
</ul>
<p>These things are only hard to explain because the literature is overly technical (too many buzzwords and too many irrelevant concerns) and these misapprehensions can&#8217;t be relieved unless you spend some time addressing the legitimate underlying concerns they are standing in for.  As usual explanation requires common ground (moving to shared assumptions) not mere technical bullying.</p>
<p>We will try to work through these assumptions and then discuss proper sample size.<span id="more-1667"></span><br />
<h2>The problem of population size.</h2>
<p>Many technical people (including some good physical scientists) have a hard time understanding that to test the effect of a treatment on a population the sample size you need does not depend on the size of the overall population.  You see vestiges of this discomfort when, for a treatment or an opinion poll, you see something like the percent of the population polled listed as an important feature.  The math says this is wrong, but there is absolutely no point in trotting out the math until we move to common ground (or a shared set of assumptions).  After all, math is just the pushing of axioms to their consequences, so if you don&#8217;t share the same axioms the math is irrelevant.</p>
<p>The statistical definition of a good sample from a population usually insists on independence or exchangeability.  That is: if we draw a sample of 100 people from a population of 1,000,000 then each person in the population had an equal chance of being in the sample independent of who else was in the sample or who else was left out of the sample.  Or (in a slightly different form): if we had a good sample and then picked one of the sample members uniformly at random and exchanged them with one of the non sample members (also picked uniformly at random) we end up with a different but equally good sample.</p>
<p>This ideal notion of a sample is what is at odds with experience.  People know that their sampling procedure is often unlikely to rise to the standard demanded by statistics.  A doctor knows that the patients who walk into his or her office are not an exchangeable sample of the total population (the patients tend to live nearby, may be grouped in age and share many other common traits).  Political polls (which are typically done on land line telephones) have been famously bad samples, from the &#8220;Dewey Defeats Truman&#8221; mis-prediction (when telephones were more concentrated among the rich) to the current blindness of land-line telephone polls to the younger cell phone only sub-population.  Another example is small molecule drug discovery where experiments are inevitably run where each molecule is related to previous experiments and earlier experiments have a strong bias towards cheaper reagent cost and low molecular weight.</p>
<p>These real world situations are non-uniform, non-stationary, not independent and non-exchangeable (to use the statistical terms).  Since the usual statistical requirements are not met it should not be surprising the desired statistical theories do not apply.  This isn&#8217;t a paradox; it is just the real world failing to meet the desired assumptions.  The intuition that if your sample is not drawn in a trustworthy way then you must increase sample size to cover the entire population is essentially justified.  But that misses the often cheaper alternative of fixing your sampling procedure so you don&#8217;t need to push to large sample sizes.</p>
<p>This is why statisticians are so concerned with experimental design and experimental procedure.  If you can design your experiment so the standard statistical assumptions are met: you get the enormous benefits of the theory.</p>
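<p>When the sampling assumptions do hold, the claim that the needed sample size is essentially independent of population size can be checked by simulation.  The following is a minimal sketch in R; the population sizes, the 30% true rate, the sample size of 500 and the helper name <code>mean_abs_error</code> are all illustrative assumptions.  It draws samples without replacement (via the hypergeometric distribution) and compares the typical estimation error for a small and a large population.</p>

```r
set.seed(2011)  # fixed seed so the sketch is repeatable

# Mean absolute error of the sample proportion when n items are drawn
# without replacement from a population of N items, a fraction p of
# which have the property of interest.
mean_abs_error <- function(N, n, p, reps = 5000) {
  hits <- rhyper(reps, m = round(N * p), n = N - round(N * p), k = n)
  mean(abs(hits / n - p))
}

# Same sample size, populations differing by a factor of 100.
err_small <- mean_abs_error(N = 1e4, n = 500, p = 0.3)
err_large <- mean_abs_error(N = 1e6, n = 500, p = 0.3)
print(c(small_population = err_small, large_population = err_large))
```

<p>The two errors should come out nearly equal: a sample of 500 is about as informative for a population of a million as for a population of ten thousand.</p>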
<h2>The problem of effect strength.</h2>
<p>The basic problem of effect strength is simple, so in this case part of the confusion is actually from not having thought hard about it.  The (incorrect) common intuition is that the size of an experimental sample needed is independent of the type of effect you are trying to measure.  You see this implicitly when studies are commissioned that have no hope of success.  Typically this would be A/B testing a minor change on a web page on a small population (perhaps too small to even see the change) or even running an expensive clinical trial with too few patients.  These recurring errors come from the mistaken intuition that the effect strength isn&#8217;t important in designing your experiment.</p>
<p>To try and set your intuition consider the extreme cases.  Suppose you are trying to test if an insecticide is &#8220;very very very deadly&#8221; to fleas.  If the insecticide is strong then the only confounding effect would be natural death of fleas.  You need an experiment size big enough that it is unlikely most of your fleas died of natural causes.  So maybe use 5 fleas instead of 1.  The measurement is: dose the fleas,  if they all die you can conclude the insecticide is very deadly.  If only half die: the experiment fails (the insecticide is deadly, but not very very deadly).  The key to good statistical experiment design is committing to what you are trying to test before starting the experiment.  This is the important thing.  After that you can improve your technique by adding such methods as &#8220;control groups&#8221; (groups of untreated fleas to try to get an estimate of the base mortality rate) and repetitions (trying to control for systematic error, all the fleas in one cage dying due to some other cause, like spilled cleaning solution).</p>
<p>Now consider the same experiment and suppose you were trying to establish the same insecticide &#8220;kills all fleas and their eggs.&#8221;  The word &#8220;all&#8221; makes this too strong a statement to be proven by empirical experiment.  The only way to prove a categorical absolute is through logic (not empiricism or statistics).  This sort of proof looks something like:</p>
<ul>
<li>Major Premise: Sulfuric acid completely dissolves organic matter.</li>
<li>Minor Premise: Fleas are organic matter.</li>
<li>Conclusion: Sulfuric acid completely dissolves fleas.</li>
</ul>
<p>That is, you don&#8217;t run an experiment to prove an absolute- you combine other things you already accept as absolutes to infer additional absolutes. (We dodge the question how you acquire these initial absolutes, we just point out you can usually only create new absolutes by combining other absolutes).</p>
<p>One last variation: suppose we are asked to prove the insecticide &#8220;kills at least 99.9% of all fleas.&#8221;  This can be done by sampling- but it requires a big sample as we are trying to prove that very very few fleas survive (or in technical terms- trying to measure a rare event).</p>
<p>This is deliberately silly, but the point is: easy effects can use small samples, hard effects require big samples and absolutes can&#8217;t be measured by sampling. </p>
<p>There is a second, very subtle problem regarding effect strengths: they cost more than you would reasonably suppose.  Roughly this is because when you try to measure weaker (more subtle) effects you also increase your need for precision.  That is, it might make sense to try and measure a disease that affects 10% of the population to +-5% absolute error (that is, accept any number from 5% through 15% as being a good enough estimate).  But when you try to measure a disease that affects 1% of the population this wide interval (-4% to 6%) no longer makes sense.  You would likely be required to use a +-0.5% interval.  You would need 10 times as much data to see the same number of affected patients (1% versus 10% incidence rate) and on top of that you are now requiring a higher precision (+-0.5% instead of +-5%).   You end up needing 100 times the data to get a similar confidence on the smaller measurement range (+-0.5%).  In my opinion this is always going to feel wrong: you expect to need 10 times the data but you actually need 100 times the data (due to your change in measurement range).</p>
<h2>The rule of thumb.</h2>
<p>The typical requirements of a random experiment are: you are given a very small number d>0 and a small number e>0 and you want a sample such that if p is the (unknown) true proportion of the world that has the property you are trying to measure then with probability no more than d your measured proportion q is bad (i.e. |p-q| > e).  If somebody asks for d or e to be zero then you are pretty much forced to test the whole population (sampling won&#8217;t work).  The larger d and e get the easier sampling becomes.</p>
<p>The rules to know in designing experiment sizes are:</p>
<ul>
<li>d is easy to lower (it doesn&#8217;t take much work to increase confidence).</li>
<li>e is hard to lower (it takes a lot of work to increase precision).</li>
</ul>
<p>A good formula is: if your sample is picked uniformly at random and is of size at least:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/06/est.png" alt="est.png" border="0" width="163" height="77" /><br />
</center></p>
<p>then with probability at least 1-d your sample measured proportion q is no further away than e from the unknown true population proportion p (what you are trying to estimate).  This formula is a technical consequence of <a target="_blank" href="http://en.wikipedia.org/wiki/Hoeffding's_inequality">Hoeffding&#8217;s</a> inequality (it is a convenient form; you can get a slightly better bound by directly using the Chernoff bounds).  The point is: you can plug this into a calculator.  Need to estimate with 99% certainty the unknown true population rate to +-10% precision?  That is d=0.01, e=0.1 and a sample of size 265 should be enough.  A stricter precision of +-0.05 causes the estimated sample size to increase to 1060, but increasing the certainty requirement to 99.5% only increases the estimated sample size to 300.  This is what we mean by confidence is cheap and precision is expensive.</p>
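<p>The calculator step can be sketched in R.  The formula itself is only shown as an image above, so the form used here, the standard two-sided Hoeffding bound n &gt;= ln(2/d)/(2 e^2), is an assumption; it does reproduce the three sample sizes quoted in the text.  The function name <code>sample_size</code> is hypothetical.</p>

```r
# Sample size from the two-sided Hoeffding bound (assumed form of the
# formula; it reproduces the numbers quoted in the text).
#   d: allowed probability of a bad estimate
#   e: allowed absolute error of the estimated proportion
sample_size <- function(d, e) {
  ceiling(log(2 / d) / (2 * e^2))
}

print(sample_size(d = 0.01,  e = 0.1))   # 99% certainty, +-10% precision
print(sample_size(d = 0.01,  e = 0.05))  # halve e: about four times the data
print(sample_size(d = 0.005, e = 0.1))   # halve d: only a modest increase
```

<p>Under this form, tightening e by a factor of 10 multiplies the required sample size by 100, while tightening d by a factor of 10 only adds ln(10)/(2 e^2) more observations.</p>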
<p>Notice the effect strength doesn&#8217;t directly enter into the formula: it only comes in through the increase in precision needed to measure weak effects.  This is why you want to avoid attempting to empirically measure weak effects (see <a target="_blank" href="http://www.win-vector.com/blog/2010/08/statsmanship-failure-through-analytics-sabotage/">Analytics Sabotage</a>).  </p>
<p>Also notice the faster than linear dependence on 1/e (a square in this case).  This is very expensive (linear would be better) and unintuitive.  The undesirable super-linear rate follows from the fact that if you flip a fair coin (50/50 odds of heads/tails and independent outcomes on each flip) n times the observed number of heads is typically on the order of sqrt(n) away from the expected value of n/2 (the standard deviation is sqrt(n)/2).  To get a linear rate it would have to be usually within a constant from n/2 (independent of n!) which is too perfect and way too much to hope for (also relevant is <a target="_blank" href="http://en.wikipedia.org/wiki/Law_of_the_iterated_logarithm">the law of the iterated logarithm</a>).  This is &#8220;hand wavy&#8221; but it is intended to help you reset your intuition (intuition being a less precise, but less brittle way of reasoning than doing all the math).</p>
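<p>The coin-flip claim is easy to check by simulation.  The following is a small sketch in R (the function name <code>scaled_miss</code>, the seed, and the choice of n values are illustrative assumptions): it measures the average distance of the number of heads from n/2 and divides by sqrt(n).</p>

```r
set.seed(42)  # fixed seed so the sketch is repeatable

# For n fair coin flips, average |heads - n/2| over many repetitions and
# divide by sqrt(n). A roughly constant ratio means the typical miss
# grows like sqrt(n), not like a constant.
scaled_miss <- function(n, reps = 2000) {
  heads <- rbinom(reps, size = n, prob = 0.5)
  mean(abs(heads - n / 2)) / sqrt(n)
}

ratios <- sapply(c(100, 10000, 1000000), scaled_miss)
print(ratios)  # each entry should be near sqrt(1 / (2 * pi)), about 0.4
```

<p>Since the miss in the count grows like sqrt(n), the miss in the observed <i>proportion</i> shrinks only like 1/sqrt(n), which is exactly why halving e costs four times the data.</p>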
<h2>A <em>very</em> technical paper on sampling from <em>very</em> large populations.</h2>
<p>A long time ago we took some of the ideas here to the extreme.  We considered the situation of &#8220;stream processing&#8221;- that is when you have so much data flowing by you that you do not even have enough storage to remember a significant fraction of it.  This might arise in a networking application, a scientific instrument or a map-reduce node.  The question is: can you still estimate rates in this situation?  The answer is you can.  The trick is even though you don&#8217;t have enough memory to store a significant fraction of what you saw (or even enough memory to store your sample design!) you can store the specification of a pseudorandom sample.  A pseudorandom sample is like a sample except it has a very succinct description (a description small enough to store).  The idea is to store only the description of the pseudorandom sample (not the sample itself) and the intersection of this sample and the stream (or mark &#8220;failure&#8221; if this intersection gets too large).</p>
<p>There are some additional technical details in that you have to simultaneously work with many guesses of the unknown true population rate (because each sample design only works properly for a narrow range of this unknown rate as you either miss too much or run out of memory when you use the wrong design).  But overall the ideas in this paper are just the type of bounds we discussed here (but in a very compressed form for publication).</p>
<p>This sort of paper is what we theorists call &#8220;fun&#8221; (or really just sharpening our knives between real world problems):</p>
<p><a target="_blank" href="http://www.mzlabs.com/JMPubs/Estimating%20the%20range%20of%20a-Mount.pdf">Estimating the range of a function in an online setting</a>, J. A. Mount, Information Processing Letters 72, 31&#8211;35 (1999).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

