<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Computer Science</title>
	<atom:link href="http://www.win-vector.com/blog/category/computer-science/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Ergodic Theory for Interested Computer Scientists</title>
		<link>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ergodic-theory-for-interested-computer-scientists</link>
		<comments>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/#comments</comments>
		<pubDate>Sat, 04 Feb 2012 17:42:07 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Ergodic Theorem]]></category>
		<category><![CDATA[Gibbs Sampler]]></category>
		<category><![CDATA[Markov Chains]]></category>
		<category><![CDATA[Random Sampling]]></category>
		<category><![CDATA[Randomized Algorithms]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1933</guid>
		<description><![CDATA[We describe ergodic theory in modern notation accessible to interested computer scientists. The ergodic theorem (http://en.wikipedia.org/wiki/Ergodic_theory) is an important principle of recurrence and averaging in dynamical systems. However, there are some inconsistent uses of the term; much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/' rel='bookmark' title='Hello World: An Instance Of Rhetoric in Computer Science'>Hello World: An Instance Of Rhetoric in Computer Science</a></li>
<li><a href='http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/' rel='bookmark' title='Six Fundamental Methods to Generate a Random Variable'>Six Fundamental Methods to Generate a Random Variable</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We describe ergodic theory in modern notation accessible to interested computer scientists.</p>
<p>The ergodic theorem (<a target="_blank" href="http://en.wikipedia.org/wiki/Ergodic_theory">link</a>) is an important principle of recurrence and averaging in dynamical systems. However, there are some inconsistent uses of the term, much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is often implied), and the conclusion of the theory is often misdescribed as its premises.</p>
<p>By “interested computer scientists” we mean people who know math and work with probabilistic systems, but who know not to accept mathematical definitions without some justification (actually a good attitude for mathematicians also).<span id="more-1933"></span> Please click through to read <a target="_blank" href="http://www.win-vector.com/dfiles/ErgodicTheory.pdf">Ergodic Theory for Interested Computer Scientists</a>.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/' rel='bookmark' title='Hello World: An Instance Of Rhetoric in Computer Science'>Hello World: An Instance Of Rhetoric in Computer Science</a></li>
<li><a href='http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/' rel='bookmark' title='Six Fundamental Methods to Generate a Random Variable'>Six Fundamental Methods to Generate a Random Variable</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Six Fundamental Methods to Generate a Random Variable</title>
		<link>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=six-fundamental-methods-to-generate-a-random-variable</link>
		<comments>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/#comments</comments>
		<pubDate>Fri, 20 Jan 2012 19:23:38 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Ergodic Theory]]></category>
		<category><![CDATA[Markov Chains]]></category>
		<category><![CDATA[Markov Monte Carlo]]></category>
		<category><![CDATA[Random Sampling]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1925</guid>
		<description><![CDATA[Introduction To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them? The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/' rel='bookmark' title='What is a large enough random sample?'>What is a large enough random sample?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>To implement many numeric simulations you need a sophisticated source of instances of random variables.  The question is: how do you generate them?  </p>
<p>The literature is full of algorithms requiring random samples as inputs or drivers (<a target="_blank" href="http://en.wikipedia.org/wiki/Conditional_random_field">conditional random fields</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Bayesian_network">Bayesian network models</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Particle_filter">particle filters</a> and so on). The literature is also full of competing methods (<a target="_blank" href="http://en.wikipedia.org/wiki/Pseudorandom_number_generator">pseudorandom generators</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy sources</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs samplers</a>, <a target="blank" href="http://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis–Hastings algorithm</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov chain Monte Carlo methods</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Bootstrapping">bootstrap methods</a> and so on).  Our thesis is: this diversity is supported by only a few fundamental methods.  And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box &#8220;brand name&#8221; technique. </p>
<p>We will discuss the half dozen basic methods that all of these techniques are derived from.<span id="more-1925"></span> To our mind all of the famous random variate generation/sampling techniques are derived from combinations of the following six fundamental methods:</p>
<ol>
<li>Physical sources.</li>
<li>Empirical resampling.</li>
<li>Pseudo random generators.</li>
<li>Simulation/Game-play.</li>
<li>Rejection Sampling.</li>
<li>Transform methods.</li>
</ol>
<p>The technical fights (such as: &#8220;is Gibbs sampling superior to, or even distinguishable from, Markov chain Monte Carlo?&#8221;) are all in the details, history and citation conventions.   Each field and particular method accretes its own traditions.  We will quickly discuss the fundamental methods we listed.  As we will see: complexity goes up as we move through the list (so at some point things are no longer fundamental but instead derived, allowing us to end the list).</p>
<h2>The Methods</h2>
<h3>Physical sources</h3>
<p>This is the most basic way (though not as practical in the computer age) to generate random variables.  Observe the flip of a real coin, shuffle actual cards, mix numbered balls or count the number of ticks from an actual radioactive source.  In all of these the randomness comes from physical principles (such as <a target="_blank" href="http://en.wikipedia.org/wiki/Chaos_theory">chaotic dynamics</a> for coin flips or <a target="_blank" href="http://en.wikipedia.org/wiki/Quantum_mechanics">quantum mechanics</a> for radioactive decay).</p>
<p>These sources are &#8220;outside of computer science&#8221; so we will say the least about them.</p>
<h3>Empirical resampling</h3>
<p>This is what used to be called &#8220;tables&#8221; (which were themselves often generated from physical processes).   The observation is that sometimes, to run a simulation, you need access to instances of random variables that are distributed in a very precise way, but you don&#8217;t have a usable description of the desired distribution.  You would think that in this case you could do nothing.  But the principle of empirical resampling is that you can approximately generate new samples by taking samples (with repetition or replacement) from an old sample.  This is the cornerstone of bootstrap methods.</p>
<p>As an example: suppose we were given the sample of numbers 5, 5, 10, 5, 5, which has mean equal to 6.  Further suppose we have no description of how these numbers were generated, but we want to know if a mean of at least 8 is likely or unlikely for five more numbers drawn the same way.  We can approximate this by drawing many samples of size five from this original sample (allowing the same number to be in our new sample multiple times).  A resample has mean at least 8 exactly when it contains three or more 10s, and each draw is a 10 with probability 1/5, so the bootstrap estimate of the probability of seeing a mean of at least 8 is around 5.8%.</p>
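<p>The resampling calculation can be sketched in a few lines of Java.  This is only an illustration under our own naming (the class <code>BootstrapMean</code> and its methods are not from any library): <code>estimate()</code> draws resamples with replacement, while <code>exact()</code> enumerates all 5^5 = 3125 equally likely resamples of the example, which is only feasible because the sample is tiny.</p>

```java
import java.util.Random;

// Illustrative sketch (class and method names are our own, not from a library).
final class BootstrapMean {
    // Monte Carlo bootstrap: fraction of resamples with mean at least threshold.
    static double estimate(double[] sample, double threshold, int trials, Random rng) {
        int hits = 0;
        for (int t = 0; t < trials; t++) {
            double sum = 0.0;
            for (int i = 0; i < sample.length; i++) {
                sum += sample[rng.nextInt(sample.length)]; // draw with replacement
            }
            if (sum / sample.length >= threshold) {
                hits++;
            }
        }
        return hits / (double) trials;
    }

    // Exact value by enumerating all n^n equally likely resamples (small n only).
    static double exact(double[] sample, double threshold) {
        int n = sample.length;
        long total = 1;
        for (int i = 0; i < n; i++) {
            total *= n;
        }
        long hits = 0;
        for (long code = 0; code < total; code++) {
            long c = code; // the base-n digits of code pick the resampled positions
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += sample[(int) (c % n)];
                c /= n;
            }
            if (sum / n >= threshold) {
                hits++;
            }
        }
        return hits / (double) total;
    }

    public static void main(String[] args) {
        double[] sample = {5, 5, 10, 5, 5};
        System.out.println("exact:    " + exact(sample, 8.0));
        System.out.println("estimate: " + estimate(sample, 8.0, 100000, new Random(1)));
    }
}
```

<p>The exact count is 181 qualifying resamples out of 3125; the Monte Carlo estimate should land close to that ratio, and it is the only option once the sample is too large to enumerate.</p>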
<p>This may seem trivial- but it is very important.</p>
<h3>Pseudo random generators</h3>
<p>In the computer age, to avoid the need for external tables or expensive and slow peripherals, we tend to use pseudo random generators.  That is, we treat the output of deterministic iterative procedures as equivalent to true random sources.  The science of pseudo randomness has evolved from cobbled together procedures passing ad-hoc tests (such as in Knuth Volume 2) to more formal pseudo randomness based on important properties (like provably being k-wise independent) or complexity (being computationally indistinguishable from a truly random source on a time or space bounded machine).  Behind the canned routines of all of the basic &#8220;random generators&#8221; commonly available is a pseudo random source.</p>
<p>Good references for the modern theory include: 	</p>
<ul>
<li>
&#8220;Pseudorandomness and Cryptographic Applications&#8221; Michael Luby 1996.
</li>
<li>
&#8220;Modern Cryptography, Probabilistic Proofs and Pseudorandomness&#8221; Oded Goldreich, 1999.
</li>
</ul>
<p>The most basic form of a sequential pseudo random generator is a sequence of states s(1), s(2), s(3) &#8230;, where s(i+1) = g(s(i)) and g() is our deterministic function that maps state to state.  The observed random variables are then h(s(i)), where h() is some deterministic function that maps states to observables.  For example for the <a target="_blank" href="http://en.wikipedia.org/wiki/Linear_congruential_generator">linear congruential generator</a> found in glibc we have g(x) = (1103515245*x + 12345) modulo 2^32 and h(x) = x modulo 2^30 (x an integer from 0 to 2^32 &#8211; 1).  An example application: the output of this generator divided by (2^30 &#8211; 1) might give numbers passably uniformly distributed in the interval [0,1].  Two such variates might be used as a uniform sample from the unit square.</p>
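<p>A minimal Java sketch of the state/observable split follows.  It uses the constants exactly as quoted in the text, so it should be read as an illustration of g() and h() rather than as a reproduction of any particular C library&#8217;s <code>rand()</code>; the class name <code>Lcg</code> and its methods are our own.</p>

```java
// Illustrative sketch: g() advances the state, h() extracts the observable.
// Constants are the ones quoted in the text; Lcg is a hypothetical class name.
final class Lcg {
    private long state; // always kept in the range 0 .. 2^32 - 1

    Lcg(long seed) {
        state = seed & 0xFFFFFFFFL;
    }

    // Apply g(x) = (1103515245*x + 12345) modulo 2^32, then return h(x) = x modulo 2^30.
    int next() {
        state = (1103515245L * state + 12345L) & 0xFFFFFFFFL;
        return (int) (state & 0x3FFFFFFFL);
    }

    // Scale the observable into the interval [0,1] by dividing by 2^30 - 1.
    double nextUniform() {
        return next() / (double) ((1 << 30) - 1);
    }

    public static void main(String[] args) {
        Lcg gen = new Lcg(1);
        // two variates used as one point in the unit square
        System.out.println(gen.nextUniform() + ", " + gen.nextUniform());
    }
}
```

<p>Keeping the state in a <code>long</code> avoids overflow: the multiplier is below 2^31 and the state is below 2^32, so the product stays below 2^63.</p>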
<p>That a simple iterated deterministic system (like the modulo arithmetic or even a physical system like coin flipping) would even superficially appear random (let alone be safe to use as a pseudo random source) turns out to be the main consequence of <a target="_blank" href="http://en.wikipedia.org/wiki/Ergodic_theory">ergodic theory</a> (which we will touch on in a later article).  The point is: it should not be obvious (without bringing in some more theory) why you should trust pseudo-random sources.</p>
<h3>Simulation/Game-play</h3>
<p>Another fundamental method is direct simulation or game play.  Suppose we wanted a random variable that was 1 with probability equal to the odds of being dealt a full house from a standard shuffled deck of 52 cards (and zero otherwise).  We can generate such a variable by simulating shuffling a deck, drawing a hand and returning 1 if the hand drawn is a full house (and returning 0 otherwise).  Notice in this case we are combining many random variables to get a single result.</p>
<p>One of the most important families of simulation techniques is Markov chain Monte Carlo methods (related to Gibbs sampling, simulated annealing and many other variations).  These methods implement a complex procedure over a stream of random inputs to generate a more difficult to achieve sequence of random outputs.</p>
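<p>As a sketch of the full house example in Java: cards are coded as integers 0 through 51 with rank equal to the card number modulo 13 (an encoding we chose for illustration, as are the class and method names).  Only the first five positions of the deck need to be shuffled to deal a hand.</p>

```java
import java.util.Random;

// Illustrative sketch of simulation/game-play; names and card encoding are our own.
final class FullHouse {
    // True when the five cards (coded 0..51, rank = card % 13) form a full house.
    static boolean isFullHouse(int[] hand) {
        int[] rankCounts = new int[13];
        for (int card : hand) {
            rankCounts[card % 13]++;
        }
        boolean three = false;
        boolean two = false;
        for (int count : rankCounts) {
            if (count == 3) three = true;
            if (count == 2) two = true;
        }
        return three && two;
    }

    // One instance of the random variable: 1 if a dealt hand is a full house, else 0.
    static int draw(Random rng) {
        int[] deck = new int[52];
        for (int i = 0; i < 52; i++) {
            deck[i] = i;
        }
        int[] hand = new int[5];
        for (int i = 0; i < 5; i++) {
            // partial Fisher-Yates shuffle: pick a uniformly random remaining card
            int j = i + rng.nextInt(52 - i);
            int tmp = deck[i];
            deck[i] = deck[j];
            deck[j] = tmp;
            hand[i] = deck[i];
        }
        return isFullHouse(hand) ? 1 : 0;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        int trials = 1000000;
        int hits = 0;
        for (int t = 0; t < trials; t++) {
            hits += draw(rng);
        }
        System.out.println("observed frequency: " + hits / (double) trials);
    }
}
```

<p>The observed frequency can be compared against the exact count of 3744 full houses among the 2598960 possible five card hands (about 0.144%).</p>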
<p>For example:  Let T be the set of pairs of non-negative integers x, y such that x + y &le; 1000.   We could implement a Markov chain on this set from a source of coin flips.  Given a point (x,y) in T we take three coin flips and move to new point (x&#8217;,y&#8217;) (also in T) using the following procedure:</p>
<ol>
<li>Let m = 1 if the first flip is heads and m=0 if the first flip is tails.</li>
<li>Let v = (1,0) if the second flip is heads and v=(0,1) if the second flip is tails.</li>
<li>Let d = +1 if the third flip is heads and d = -1 if the third flip is tails.</li>
<li>If (x,y) + m*d*v is in T let (x&#8217;,y&#8217;) = (x,y) + m*d*v, otherwise let (x&#8217;,y&#8217;) = (x,y) (stay put).</li>
</ol>
<p>Repeating this procedure a large number of times produces a sequence of points (x,y) such that (x,y) is distributed uniformly on T (again this follows from ergodic principles).  Simulating (or playing the game of) following a Markov chain in this way is a very fundamental method for generating more complicated random variates, and its correctness is something we will write more about in an article dealing with the ergodic principle (the relation of connectedness to showing averages over time equal averages over space).</p>
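<p>The four step procedure above translates almost line for line into Java.  The sketch below is illustrative only: <code>TriangleWalk</code>, <code>inT()</code> and <code>step()</code> are names we made up, and <code>Random.nextBoolean()</code> stands in for the coin flips.</p>

```java
import java.util.Random;

// Illustrative sketch of the Markov chain on T; class and method names are our own.
final class TriangleWalk {
    static final int LIMIT = 1000;

    // Membership test for T: non-negative integers with x + y at most LIMIT.
    static boolean inT(int x, int y) {
        return x >= 0 && y >= 0 && x + y <= LIMIT;
    }

    // One step of the chain, driven by three coin flips.
    static int[] step(int x, int y, Random rng) {
        int m = rng.nextBoolean() ? 1 : 0;    // first flip: move or hold
        boolean alongX = rng.nextBoolean();   // second flip: v = (1,0) or (0,1)
        int d = rng.nextBoolean() ? 1 : -1;   // third flip: direction
        int nx = x + (alongX ? m * d : 0);
        int ny = y + (alongX ? 0 : m * d);
        // proposals that leave T are rejected (stay put)
        return inT(nx, ny) ? new int[] {nx, ny} : new int[] {x, y};
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        int[] p = {0, 0};
        for (int i = 0; i < 5000000; i++) {
            p = step(p[0], p[1], rng);
        }
        System.out.println("(" + p[0] + ", " + p[1] + ")");
    }
}
```

<p>The sketch says nothing about how many steps are needed before the point is close to uniformly distributed; the step count in <code>main</code> is an arbitrary choice, and bounding the mixing time is exactly the kind of question the theory has to answer.</p>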
<p>For simple shapes (rectangles, triangles) there are more efficient ways to generate points uniformly at random.  For squares we exploit independence and just generate the coordinates independently.  For triangles we could rejection sample from a bounding rectangle.   Or we could use a transform method: write down a counting function that indexes all the points in the triangle and generate points by index (for example it is easy to work out there are 501501 points in our example T, so if we generate a random integer uniformly from 1 to 501501 we can just pick the point with the given index as our sample).</p>
<p>For general convex shapes (in high dimensions) these methods become intractable and Markov chain methods are one of the few options remaining.</p>
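<p>One possible counting function, sketched in Java, orders the points of the triangle row by row (all points with x = 0 first, then x = 1, and so on).  The ordering, the zero-based index and the name <code>TriangleIndex</code> are our own choices; any fixed one-to-one indexing would serve.</p>

```java
import java.util.Random;

// Illustrative sketch of sampling by index; names and the row ordering are our own.
final class TriangleIndex {
    static final int LIMIT = 1000;
    static final int COUNT = (LIMIT + 1) * (LIMIT + 2) / 2; // 501501 points

    // Map a zero-based index in 0 .. COUNT-1 to the point (x,y) it names.
    static int[] pointAt(int index) {
        int x = 0;
        int rowSize = LIMIT + 1; // the row x = 0 holds y = 0 .. LIMIT
        while (index >= rowSize) {
            index -= rowSize;    // skip the whole row
            rowSize--;           // each later row is one point shorter
            x++;
        }
        return new int[] {x, index};
    }

    // Uniform sample: pick an index uniformly and look up its point.
    static int[] sample(Random rng) {
        return pointAt(rng.nextInt(COUNT));
    }

    public static void main(String[] args) {
        int[] p = sample(new Random(1));
        System.out.println("(" + p[0] + ", " + p[1] + ")");
    }
}
```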
<h3>Rejection Sampling</h3>
<p>Rejection sampling is another way to convert one sequence of random variables into another.  If we assume we can generate a random variable according to the distribution p(x) we can &#8220;rejection sample&#8221; to a new distribution using an &#8220;acceptance function&#8221; q(x) which returns a number in the interval [0,1].  Our procedure is to repeat the following: generate x with probability p(x) and generate a random variable y uniformly in the interval [0,1]; if y &le; q(x) accept x as our answer and quit (otherwise draw a new x and repeat).</p>
<p>The distribution that rejection sampling draws from is such that if x and y had relative odds of being drawn of p(x)/p(y) under the original distribution, then under the rejection procedure they have relative odds of (p(x)q(x))/(p(y)q(y)).  An important special case is when q() is always 0 or 1; in this case we are drawing with relative odds proportional to p(x) from the subset of x with q(x)=1.</p>
<p>As an example: consider the problem of trying to draw a point (x,y) such that x^2 + y^2 &lt; 1 (the open unit disk) uniformly at random.  The rejection sampling solution is: repeat the following until you have a success: generate x and y independently uniformly in the interval [-1,1]; if x^2 + y^2 &lt; 1 then accept them as our sample (otherwise repeat).  This procedure is very fast as the unit disk that represents our acceptance region has area pi and the square we are generating trials from has area 4: so we have over a 78% chance of success on each trial and expect to run fewer than 1.28 trials (on average) to get a sample.</p>
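<p>The unit disk example is short enough to write out.  The following Java sketch (with our own hypothetical names <code>DiskSampler</code>, <code>sample()</code> and <code>averageTrials()</code>) uses an acceptance function that is always 0 or 1, and also counts trials so the 4/pi figure can be checked empirically.</p>

```java
import java.util.Random;

// Illustrative sketch of rejection sampling from the open unit disk; names are our own.
final class DiskSampler {
    // Draw (x,y) uniformly from the square and retry until the point lies in the disk.
    static double[] sample(Random rng) {
        while (true) {
            double x = 2.0 * rng.nextDouble() - 1.0;
            double y = 2.0 * rng.nextDouble() - 1.0;
            if (x * x + y * y < 1.0) {
                return new double[] {x, y};
            }
        }
    }

    // Average number of trials used per accepted sample, measured over many samples.
    static double averageTrials(int samples, Random rng) {
        long trials = 0;
        for (int i = 0; i < samples; i++) {
            boolean accepted = false;
            while (!accepted) {
                double x = 2.0 * rng.nextDouble() - 1.0;
                double y = 2.0 * rng.nextDouble() - 1.0;
                trials++;
                accepted = x * x + y * y < 1.0;
            }
        }
        return trials / (double) samples;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        double[] p = sample(rng);
        System.out.println("point: (" + p[0] + ", " + p[1] + ")");
        System.out.println("average trials: " + averageTrials(100000, rng));
    }
}
```

<p>Note that <code>nextDouble()</code> returns values in [0,1), so the coordinates here lie in [-1,1) rather than the closed interval; for a continuous distribution this makes no practical difference.</p>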
<h3>Transform methods</h3>
<p>A transform method is used when we have the ability to generate instances of a random variable according to one distribution and we would like instances according to another distribution.</p>
<p>One method is used when we have access to the inverse of the <a target="_blank" href="http://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution function</a> of the distribution we are trying to generate.  In this case we can use this function to convert uniform variates from the interval [0,1] into our target distribution.  The cumulative distribution function is the function cdf() where cdf(x) is the probability a random variate generated according to our distribution is less than or equal to x.  The inverse function is the function icdf() where icdf(y) is such that cdf(icdf(y)) = y.  For example the <a target="_blank" href="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a> has an inverse cumulative distribution function icdf(y) = -ln(1-y)/lambda.  So if y is generated uniformly in the interval [0,1] then icdf(y) is a random variable generated according to the exponential distribution with parameter lambda.</p>
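<p>In Java the inverse cumulative distribution method for the exponential distribution is a one line transform.  The sketch below uses our own names (<code>ExponentialSampler</code>, <code>cdf()</code>, <code>icdf()</code>, <code>sample()</code>) and relies on <code>Random.nextDouble()</code> returning values in [0,1), so the argument of the logarithm is never zero.</p>

```java
import java.util.Random;

// Illustrative sketch of the inverse CDF transform; class and method names are our own.
final class ExponentialSampler {
    // cdf(x) = 1 - e^(-lambda*x) for x at least 0
    static double cdf(double x, double lambda) {
        return 1.0 - Math.exp(-lambda * x);
    }

    // icdf(y) = -ln(1-y)/lambda, so that cdf(icdf(y)) = y
    static double icdf(double y, double lambda) {
        return -Math.log(1.0 - y) / lambda;
    }

    // Transform a uniform variate into an exponential variate.
    static double sample(double lambda, Random rng) {
        return icdf(rng.nextDouble(), lambda);
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        double lambda = 2.0;
        double sum = 0.0;
        int n = 100000;
        for (int i = 0; i < n; i++) {
            sum += sample(lambda, rng);
        }
        // the exponential distribution has mean 1/lambda
        System.out.println("sample mean: " + sum / n + " (expected about " + 1.0 / lambda + ")");
    }
}
```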
<p>A great example of transform methods is generating Gaussian random variables.  We could directly use the inverse cumulative distribution function method described above, but to do this we would require a special function library to perform the required calculation of the inverse cumulative distribution (or inverse of <a target="_blank" href="http://en.wikipedia.org/wiki/Error_function">erf()</a>).  Another way is the <a target="_blank" href="http://en.wikipedia.org/wiki/Marsaglia_polar_method">polar method</a>: generate x,y uniformly from the open unit disk (by, for example, rejection sampling as described earlier), set s = x^2 + y^2 and return  x*sqrt(-2 ln(s)/s),  y*sqrt(-2 ln(s)/s) as two independent Gaussian random variables.   The trick being: the radius r of a pair of independent Gaussians has a density of the form r*e^(-r*r/2), which leads to an elementary cumulative distribution function (unlike the original Gaussian density of the form e^(-r*r/2)) that is easy to invert.</p>
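<p>The polar method combines two of the fundamental methods: rejection sampling to get a point in the disk, then a transform.  A minimal Java sketch (the class name <code>PolarGaussian</code> is ours, and the point s = 0 is redrawn to avoid dividing by zero) could look like this:</p>

```java
import java.util.Random;

// Illustrative sketch of the polar method; class and method names are our own.
final class PolarGaussian {
    // Returns two independent standard Gaussian variates.
    static double[] samplePair(Random rng) {
        double x;
        double y;
        double s;
        do {
            // rejection sample a point from the open unit disk (excluding the origin)
            x = 2.0 * rng.nextDouble() - 1.0;
            y = 2.0 * rng.nextDouble() - 1.0;
            s = x * x + y * y;
        } while (s >= 1.0 || s == 0.0);
        // transform: rescale the point so its radius becomes sqrt(-2 ln(s))
        double scale = Math.sqrt(-2.0 * Math.log(s) / s);
        return new double[] {x * scale, y * scale};
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        double sum = 0.0;
        double sumSq = 0.0;
        int pairs = 100000;
        for (int i = 0; i < pairs; i++) {
            double[] g = samplePair(rng);
            sum += g[0] + g[1];
            sumSq += g[0] * g[0] + g[1] * g[1];
        }
        double mean = sum / (2.0 * pairs);
        System.out.println("mean: " + mean + ", variance: " + (sumSq / (2.0 * pairs) - mean * mean));
    }
}
```

<p>The printed mean and variance should be close to 0 and 1; that is a sanity check on the sketch, not a test of Gaussianity.</p>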
<h2>Conclusion</h2>
<p>Our thesis is: all major methods to generate random variables use aspects of the six methods we have listed here as fundamental.  At the least, you should have a fluent understanding of these methods.  You should be able to break down big &#8220;brand name&#8221; methods (like Gibbs sampling) roughly into their constituent parts (so you can reason about them).   One example: notice how ratios of probabilities enter into Markov chain Monte Carlo methods (they cause step rejections); from this you can reason that if your problem has bounded ratios it is a good candidate for direct application of the technique (and if it does not you need to add some more ideas, as was demonstrated in:  <a target="_blank" href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.9794">&#8220;Sampling from Log-Concave Distributions,&#8221; Alan Frieze, Ravi Kannan, Nick Polson, Ann. Appl. Prob, 1994</a>).</p>
<p>The first two methods we discuss (physical sources and empirical re-sampling) are of the class of solutions &#8220;already have the right answer.&#8221;  Pseudo random generators are the primary way to negate the need for physical sources and resampling techniques.  Simulation, rejection sampling and transform methods are the main tools for building new distributions out of old.</p>
<p>It is a matter of taste if a given trick fits into this ad-hoc taxonomy or not.   You can invent new and better generation methods- but these methods are easily derived using ideas from the fundamental methods we mentioned here.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/' rel='bookmark' title='What is a large enough random sample?'>What is a large enough random sample?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What to do when you run out of memory</title>
		<link>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-to-do-when-you-run-out-of-memory</link>
		<comments>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 12:25:59 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Additive Combinatorics]]></category>
		<category><![CDATA[GNU sort]]></category>
		<category><![CDATA[H2]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Map Reduce]]></category>
		<category><![CDATA[Out of core]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1892</guid>
		<description><![CDATA[A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working &#8220;out of core&#8221; or what you should do when you run out of memory. Early computers were most limited by their paltry memory sizes. von Neumann himself [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory.  We present here some general strategies for working &#8220;out of core&#8221; or what you should do when you run out of memory.</p>
<p>Early computers were most limited by their paltry memory sizes.  von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single typewritten page (much more memory than the few hundred words available to the <a target="_blank" href="http://en.wikipedia.org/wiki/ENIAC">ENIAC</a>).   The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.</p>
<p><center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/12/IMG_0062.jpg" alt="IMG 0062" title="IMG_0062.JPG" border="0" height="300" /></p>
<p/>
SDC 920 computer, Computer History Museum, Mountain View CA<br />
</center></p>
<p>Historically computer scientists have concentrated on streaming or online algorithms (that is, algorithms that work with the data in the order it is available and use limited memory).  For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort).  The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce.  So in one sense databases and Map Reduce are different APIs on top of very related technologies (journaling, splitting and merging).  Replicating data (or even delaying duplicate elimination) that is already &#8220;too large to handle&#8221; may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude, compare a 2 terabyte drive to an 8 gigabyte memory stick).<span id="more-1892"></span> In our web age, the typical big data problems are inverting indices (for fast search lookup) and computing term frequencies (for <a target="_blank" href="http://en.wikipedia.org/wiki/Okapi_BM25">TF/IDF scoring</a> or for things like <a target="_blank" href="http://en.wikipedia.org/wiki/Naive_bayes">Naive Bayes classifiers</a>).  Since these are over-worked examples we will use a mathematical problem from <a href="http://terrytao.wordpress.com/books/additive-combinatorics/">&#8220;Additive Combinatorics&#8221;, Terence Tao, Van Vu, (ISBN-13: 9780521853866; ISBN-10: 0521853869)</a>.</p>
<p>We take one problem from the field of additive combinatorics: sum sets.   For two sets of integers A = {a_1, &#8230;, a_s} and B = {b_1, &#8230;, b_t} the sum set is defined as the set (without repetition) A + B = { a_i + b_j | i = 1,&#8230;,s, j = 1,&#8230;,t }.   For sets of integers the size of A+B (denoted as |A+B|) can vary from |A| + |B| &#8211; 1 to |A| * |B| depending on the relations between the numbers in A and B (or the structure of A and B).  If instead of working with integers we work with integers <a target="_blank" href="http://en.wikipedia.org/wiki/Modular_arithmetic">modulo p</a> where p is a prime number (or equivalently we treat all numbers as remainders of division by p) then by the Cauchy-Davenport inequality we have |A + B| &ge; min(|A|+|B|-1,p) (so essentially the same result, except when we run out of possible integers modulo p).</p>
<p>For example we would say (working modulo 19) that [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18].   In fact there are 19 pairs of sets that add up to  [0, 1, 10, 11, 12, 14, 15, 16, 18] (for instance [5, 6, 9, 10] + [5, 6, 9, 10] is another such pair).  Just to move forward assume we were interested in determining how many ways a set can be written as the sum of a pair of sets (each of size 4).  For a given sum result we might try search or <a target="_blank" href="http://en.wikipedia.org/wiki/Integer_programming">integer programming</a> to find all possible summands.  However, if we want the statistics on all sums simultaneously, we can work much quicker and without need for big gun mathematics.</p>
<p>The straightforward solution in this case is a bit of code like:</p>
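<p>The sum set operation itself is easy to state in code.  The Java sketch below is only an illustration (the class <code>SumSet</code> is our own name, and it assumes every element is already reduced to the range 0 through p-1); it reproduces the worked example modulo 19.</p>

```java
import java.util.Arrays;

// Illustrative sketch of a sum set modulo p; class and method names are our own.
final class SumSet {
    // Returns A + B modulo p as a sorted array without repetition.
    // Assumes every element of a and b lies in 0 .. p-1.
    static int[] sum(int[] a, int[] b, int p) {
        boolean[] present = new boolean[p];
        int count = 0;
        for (int ai : a) {
            for (int bj : b) {
                int r = (ai + bj) % p;
                if (!present[r]) {
                    present[r] = true; // record each residue only once
                    count++;
                }
            }
        }
        int[] result = new int[count];
        int k = 0;
        for (int r = 0; r < p; r++) {
            if (present[r]) {
                result[k++] = r;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] c = sum(new int[] {0, 1, 4, 5}, new int[] {10, 11, 14, 15}, 19);
        System.out.println(Arrays.toString(c));
    }
}
```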
<p><code></p>
<pre>
for set A from all possible sets of 4 integers from 0 to 18
    for set B from all possible sets of 4 integers from 0 to 18
        let set C = A + B modulo 19
        use set C as a key and add the pair (A,B) to the list associated with C
for all key sets C tracked above
     compute the size of the list of summand pairs found for C
print how many result sets C have a given number of summand pairs
</pre>
<p></code></p>
<p>The summand pairs (A,B) that produce each result set C can be collected by any bit of Java code implementing the interface below (just call <code>insertReln(C,(A,B))</code>  to store the relations and then <code>entries()</code> to get them back).  A small interface that declares the needed methods is given below:</p>
<p><code></p>
<pre>
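import java.io.IOException;
import java.util.Map;

public interface RelnCollector&lt;A,B&gt; {
	// record that value b is related to key a
	void insertReln(A a, B b) throws IOException;
	// every key seen so far, paired with the values inserted for it
	Iterable&lt;Map.Entry&lt;A,Iterable&lt;B&gt;&gt;&gt; entries() throws IOException, InterruptedException;
	void close() throws IOException;
}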
</pre>
<p></code></p>
<p>An in-memory relation collector is trivially implemented by a nested map adjusted to implement the above interface, as we see in the next code snippet:</p>
<pre>
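import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// DataAdapter is defined elsewhere; it is passed to TreeMap and TreeSet below,
// so it is presumably a Comparator for its type.
public final class InMemoryRelnCollector&lt;A,B&gt;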
	implements RelnCollector&lt;A,B&gt; {
	private final DataAdapter&lt;A&gt; adapterA;
	private final DataAdapter&lt;B&gt; adapterB;
	private Map&lt;A,Iterable&lt;B&gt;&gt; atoBs;

	public InMemoryRelnCollector(final DataAdapter&lt;A&gt; adapterA,
		final DataAdapter&lt;B&gt; adapterB) {
		this.adapterA = adapterA;
		this.adapterB = adapterB;
		atoBs = new TreeMap&lt;A,Iterable&lt;B&gt;&gt;(this.adapterA);
	}

	@Override
	public void insertReln(final A a, final B b) {
		// values are only ever stored as TreeSets, so this cast is safe
		Set&lt;B&gt; set = (Set&lt;B&gt;) atoBs.get(a);
		if(null==set) {
			set = new TreeSet&lt;B&gt;(adapterB);
			atoBs.put(a,set);
		}
		// a set ignores duplicates, so no contains() check is needed
		set.add(b);
	}

	@Override
	public Iterable&lt;Map.Entry&lt;A,Iterable&lt;B&gt;&gt;&gt; entries() {
		return atoBs.entrySet();
	}

	@Override
	public void close() {
		atoBs = null;
	}
}
</pre>
<p>The great savings in time is that we work from summands to result sums (keeping, for each result set, the list of summand pairs that produce it).  Thus we don&#8217;t have to figure out how to invert the sum operation (as we do our bookkeeping forward).  However, this very bookkeeping may overwhelm us.  As we can see below, a Java implementation of the above procedure runs out of memory when trying to characterize which sets of integers modulo 19 can be written as the sum of two sets of size four (and in how many ways each such set can be so written).  However, this was with the deliberately small default allocation of memory available to Java processes (so for this particular instance we could avoid trouble by allocating more memory; we ran out of allocation, not system memory).  What happens when we don&#8217;t manage memory is illustrated below:</p>
<pre>
Start	com.winvector.consolidate.impl.InMemoryRelnCollector
	Tue Dec 06 10:04:38 PST 2011
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.TreeMap.put(TreeMap.java:554)
	at java.util.TreeSet.add(TreeSet.java:238)
	at com.winvector.consolidate.example.AdditiveSets.sum(AdditiveSets.java:25)
	at com.winvector.consolidate.example.AdditiveSets.main(AdditiveSets.java:55)
</pre>
<p>An out of core solution can solve the entire problem without needing any additional system memory (just some disk space, which is typically far more plentiful than primary memory).  The complete calculated result is given below:</p>
<pre>
Examining sums of 4 integers chosen from 0 through 18 modulo 19.
Start	com.winvector.consolidate.impl.FileRelnCollector
	Tue Dec 06 09:54:20 PST 2011
	Inserted 15023376 relations.
 [0, 1, 4, 5] + [10, 11, 14, 15] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 1, 15, 16] + [0, 14, 15, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 3, 4, 18] + [11, 12, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [0, 14, 15, 18] + [0, 1, 15, 16] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [1, 2, 5, 6] + [9, 10, 13, 14] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [1, 2, 16, 17] + [13, 14, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [2, 3, 6, 7] + [8, 9, 12, 13] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [2, 3, 17, 18] + [12, 13, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [3, 4, 7, 8] + [7, 8, 11, 12] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [4, 5, 8, 9] + [6, 7, 10, 11] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [5, 6, 9, 10] + [5, 6, 9, 10] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [6, 7, 10, 11] + [4, 5, 8, 9] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [7, 8, 11, 12] + [3, 4, 7, 8] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [8, 9, 12, 13] + [2, 3, 6, 7] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [9, 10, 13, 14] + [1, 2, 5, 6] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [10, 11, 14, 15] + [0, 1, 4, 5] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [11, 12, 15, 16] + [0, 3, 4, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [12, 13, 16, 17] + [2, 3, 17, 18] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
 [13, 14, 17, 18] + [1, 2, 16, 17] = [0, 1, 10, 11, 12, 14, 15, 16, 18]
	Examined 128820 sums and 15023376 summands.
	found 3705 sums with 19 distinct summands
	found 39900 sums with 38 distinct summands
	found 26847 sums with 76 distinct summands
	found 22230 sums with 114 distinct summands
	found 10602 sums with 152 distinct summands
	found 8892 sums with 190 distinct summands
	found 2736 sums with 228 distinct summands
	found 5016 sums with 266 distinct summands
	found 2736 sums with 304 distinct summands
	found 1710 sums with 342 distinct summands
	found 171 sums with 361 distinct summands
	found 1710 sums with 380 distinct summands
	found 855 sums with 418 distinct summands
	found 342 sums with 456 distinct summands
	found 342 sums with 532 distinct summands
	found 342 sums with 570 distinct summands
	found 171 sums with 722 distinct summands
	found 171 sums with 760 distinct summands
	found 171 sums with 912 distinct summands
	found 171 sums with 1026 distinct summands
Done:	com.winvector.consolidate.impl.FileRelnCollector
   elapsed time: 618473MS
   Tue Dec 06 10:04:38 PST 2011
</pre>
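<p>To check a line of the output above by hand, here is a minimal sketch of the sum-set operation.  It assumes that <code>A + B</code> means the set of all pairwise sums reduced modulo 19, which is consistent with the printed rows.  The class and method names (<code>SumSetSketch</code>, <code>sumSet</code>) are illustrative only and are not taken from the example repository.</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class SumSetSketch {
	// All pairwise sums x+y (x from a, y from b), reduced modulo m.
	// Assumes the inputs are non-negative, so % never returns a negative value.
	static SortedSet<Integer> sumSet(final List<Integer> a, final List<Integer> b, final int m) {
		final SortedSet<Integer> sums = new TreeSet<Integer>();
		for (final int x : a) {
			for (final int y : b) {
				sums.add((x + y) % m);
			}
		}
		return sums;
	}

	public static void main(final String[] args) {
		// One of the rows from the listing: a set added to itself.
		final List<Integer> s = Arrays.asList(5, 6, 9, 10);
		System.out.println(s + " + " + s + " = " + sumSet(s, s, 19));
	}
}
```

<p>Running this on <code>[5, 6, 9, 10]</code> reproduces the corresponding row of the listing.  The hard part of the real computation is not this operation but remembering, for every result set, all of the pairs that produce it.</p>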
<p>We performed the calculation by using a different implementation of <code>RelnCollector</code> called <code>FileRelnCollector</code>.  What this implementation does is write relations to a file as they are made available.  That is, the <code>FileRelnCollector</code> implementation of <code>insertReln</code> is literally a <code>println()</code>.  Something not much more complicated than the following:</p>
<p><code></p>
<pre>
	@Override
	public void insertReln(final A a, final B b) {
		System.out.println("" + a + "\t" + b);
	}
</pre>
<p></code></p>
<p>The heavy lifting is done when <code>entries()</code> is called.  When the entries are wanted the <code>FileRelnCollector</code> calls <a target="_blank" href="http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html">GNU sort</a> on the saved file to get all the results ordered by result sum (instead of by summand).  GNU sort can sort files larger than memory by a split and merge strategy involving temporary files.  We provide such  <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/FileRelnCollector.java">a file plus GNU sort based implementation of RelnCollector</a>.  </p>
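<p>Once the file is sorted, recovering the groups is a single sequential pass that only ever holds one group in memory.  The following is a minimal sketch of that pass, assuming lines of the form key, tab, value that are already sorted so equal keys are adjacent.  It is not the code from <code>FileRelnCollector</code>; the names <code>SortedGroupSketch</code>, <code>forEachGroup</code> and <code>collect</code> are hypothetical.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

final class SortedGroupSketch {
	// Walk "key, tab, value" lines that are already sorted by key and hand
	// each key with its run of values to the consumer exactly once.
	static void forEachGroup(final Iterable<String> sortedLines,
			final BiConsumer<String, List<String>> consumer) {
		String currentKey = null;
		List<String> values = new ArrayList<String>();
		for (final String line : sortedLines) {
			final int tab = line.indexOf('\t');
			if (tab < 0) {
				continue; // this sketch simply skips malformed lines
			}
			final String key = line.substring(0, tab);
			final String value = line.substring(tab + 1);
			if (currentKey != null && !key.equals(currentKey)) {
				// The key changed, so the previous group is complete.
				consumer.accept(currentKey, values);
				values = new ArrayList<String>();
			}
			currentKey = key;
			values.add(value);
		}
		if (currentKey != null) {
			consumer.accept(currentKey, values); // flush the final group
		}
	}

	// Convenience wrapper for small inputs: collect all groups into a map.
	static Map<String, List<String>> collect(final Iterable<String> sortedLines) {
		final Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
		forEachGroup(sortedLines, groups::put);
		return groups;
	}

	public static void main(final String[] args) {
		final List<String> sorted = Arrays.asList("a\t1", "a\t2", "b\t3");
		forEachGroup(sorted, (k, vs) -> System.out.println(k + " -> " + vs));
	}
}
```

<p>In a real out of core setting the lines would come from a reader over the sorted file rather than from a list, and only <code>forEachGroup</code> (not <code>collect</code>) would be safe to use.</p>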
<p>Note that the elapsed time reported above (618473 ms, a bit over ten minutes) can be deceptively low.  On a machine with a modern operating system and enough memory, the file being used as "external storage" actually gets cached in memory (and gets near memory speed performance).  To get a reliable timing you need to test a problem of the size you are interested in on a machine of the size you are going to deploy on (not on a larger machine).</p>
<p>For better or worse this method should seem familiar as a lot of science has been done using the Unix text tools (sort, join and a few more).  This is also the basis of Map Reduce and we demonstrate a <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/MapReduceRelnCollector.java">Hadoop implementation of RelnCollector</a> as well.  Or we can link up with the other technology designed for beyond memory size data manipulation and get <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/impl/DBRelnCollector.java">a database based implementation of RelnCollector</a>.  </p>
<p>In all cases the implementations we call depend on journaling (in the sense of keeping a sequential log of operations to be done instead of immediately performing the operations), scattering (splitting into multiple temp files and structures) and merging (combining data from multiple ordered files).  We could write our own code to perform all of these operations (obviating any need for GNU sort, Hadoop or a database), but it is much less code to do as we have here and write an adapter to use existing implementations.</p>
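<p>Of the three operations, merging is the easiest to illustrate in a few lines.  Below is a minimal sketch of a k-way merge over already-sorted runs using a priority queue.  In-memory lists stand in for the sorted temporary files, and the names (<code>MergeRunsSketch</code>, <code>Head</code>, <code>merge</code>) are hypothetical; this is not how GNU sort, Hadoop or a database is implemented internally, only the general idea.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class MergeRunsSketch {
	// One heap entry per run: the run's current first line plus the rest of the run.
	private static final class Head implements Comparable<Head> {
		final String line;
		final Iterator<String> rest;

		Head(final String line, final Iterator<String> rest) {
			this.line = line;
			this.rest = rest;
		}

		@Override
		public int compareTo(final Head o) {
			return line.compareTo(o.line);
		}
	}

	// Merge runs that are each already sorted into one sorted list.
	static List<String> merge(final List<? extends Iterable<String>> runs) {
		final PriorityQueue<Head> heap = new PriorityQueue<Head>();
		for (final Iterable<String> run : runs) {
			final Iterator<String> it = run.iterator();
			if (it.hasNext()) {
				heap.add(new Head(it.next(), it));
			}
		}
		final List<String> merged = new ArrayList<String>();
		while (!heap.isEmpty()) {
			// The smallest head across all runs is the next output line.
			final Head smallest = heap.poll();
			merged.add(smallest.line);
			if (smallest.rest.hasNext()) {
				heap.add(new Head(smallest.rest.next(), smallest.rest));
			}
		}
		return merged;
	}

	public static void main(final String[] args) {
		System.out.println(merge(Arrays.asList(
				Arrays.asList("a", "d"),
				Arrays.asList("b", "c"),
				Arrays.asList("e"))));
	}
}
```

<p>Only one line per run is held in the heap at a time, which is why the same pattern works when each run is a file far larger than memory (the output would then be written to a file rather than collected in a list).</p>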
<p>The sum-set example is deliberately artificial.  More common examples are, as we mentioned, index inversion and term frequency calculation.  All of our example code is available here: <a target="_blank" href="https://github.com/WinVector/OutOfCore">https://github.com/WinVector/OutOfCore</a>, including JUnit tests and an <a target="_blank" href="https://github.com/WinVector/OutOfCore/blob/master/OutOfCore/src/com/winvector/consolidate/example/AdditiveSets.java">example program</a>.  For the various implementations the code depends on <a target="_blank" href="http://www.junit.org/">JUnit 4.10</a>, the <a target="_blank" href="http://www.h2database.com/html/main.html">h2 database</a> and <a target="_blank" href="http://hadoop.apache.org/mapreduce/releases.html">Hadoop 0.21.0</a>.</p>
<p>The main trick is basing your code on a very thin storage abstraction (like the <code>RelnCollector</code> interface, instead of explicitly known data structures) and then using this abstraction to hide all of the details away from the rest of your code (keeping complexity at a manageable level).  The two things to avoid are either infecting your code with too much knowledge of your storage plans (i.e. pushing implementation details into your important code to "speed things up") or being forced to re-design your entire project to fit within some framework (like re-writing all of your code as a database stored procedure or an explicit Hadoop map/reduce pair as this over-commits you to one technology).</p>
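<p>As an illustration of that separation, here is a minimal sketch of index inversion written against a thin storage interface.  The interface <code>PairStore</code> is a hypothetical, simplified stand-in for <code>RelnCollector</code> (string keys and values only), and <code>InMemoryPairStore</code> is just one possible implementation; the point is that <code>invert</code> never learns how the pairs are stored.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class IndexInversionSketch {
	// Hypothetical thin storage abstraction: insert pairs, then read them grouped by key.
	interface PairStore {
		void insert(String key, String value);
		Iterable<Map.Entry<String, SortedSet<String>>> entries();
	}

	// In-memory implementation; a file or database backed one could replace it.
	static final class InMemoryPairStore implements PairStore {
		private final SortedMap<String, SortedSet<String>> groups =
				new TreeMap<String, SortedSet<String>>();

		@Override
		public void insert(final String key, final String value) {
			SortedSet<String> set = groups.get(key);
			if (set == null) {
				set = new TreeSet<String>();
				groups.put(key, set);
			}
			set.add(value);
		}

		@Override
		public Iterable<Map.Entry<String, SortedSet<String>>> entries() {
			return groups.entrySet();
		}
	}

	// Client code: turns (document id, text) into (term, document id) pairs.
	// It depends only on the interface, not on any storage detail.
	static void invert(final Map<String, String> docs, final PairStore store) {
		for (final Map.Entry<String, String> doc : docs.entrySet()) {
			for (final String term : doc.getValue().toLowerCase().split("\\s+")) {
				if (!term.isEmpty()) {
					store.insert(term, doc.getKey());
				}
			}
		}
	}

	public static void main(final String[] args) {
		final Map<String, String> docs = new LinkedHashMap<String, String>();
		docs.put("d1", "out of core");
		docs.put("d2", "core memory");
		final PairStore store = new InMemoryPairStore();
		invert(docs, store);
		for (final Map.Entry<String, SortedSet<String>> e : store.entries()) {
			System.out.println(e.getKey() + "\t" + e.getValue());
		}
	}
}
```

<p>Swapping in a different <code>PairStore</code> changes where the data lives without touching <code>invert</code>, which is the property the paragraph above argues for.</p>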
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An Appreciation of Locality Sensitive Hashing</title>
		<link>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=an-appreciation-of-locality-sensitive-hashing</link>
		<comments>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 16:41:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Locality Sensitive Hashing]]></category>
		<category><![CDATA[Nearest Neighbor]]></category>
		<category><![CDATA[Theorist]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1848</guid>
		<description><![CDATA[We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Locality sensitive hashing is awe inspiring in its originality, simplicity, beauty and effectiveness. In addition locality sensitive hashing is a remarkable technique as it works even when drastically abridged and [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques.<span id="more-1848"></span>Locality sensitive hashing is awe inspiring in its originality, simplicity, beauty and effectiveness.  In addition locality sensitive hashing is a remarkable technique as it works even when drastically abridged and simplified.  In this <a target="_blank" href="http://www.win-vector.com/dfiles/LocalitySensitiveHashing.pdf">paper (link to pdf)</a> we give a description of conditions where the technique works and a heuristic argument why it works (using only elementary math).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>&#8220;The Mythical Man Month&#8221; is still a good read</title>
		<link>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-mythical-man-month-is-still-a-good-read</link>
		<comments>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/#comments</comments>
		<pubDate>Sun, 23 Oct 2011 18:57:52 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Computers]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Architects]]></category>
		<category><![CDATA[Mythical Man Month]]></category>
		<category><![CDATA[SAGE]]></category>
		<category><![CDATA[WIMP]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1834</guid>
		<description><![CDATA[Re-read Fred Brooks &#8220;The Mythical Man Month&#8221; over vacation.  Book remains insightful about computer science and project management. My spin on some points: System architects once were the people who said &#8220;no&#8221; to features to maintain design consistency and coherency. Now architects are the people who buy and bring in external frameworks and technologies (killing any [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Re-read Fred Brooks &#8220;The Mythical Man Month&#8221; over vacation.  Book remains insightful about computer science and project management.<span id="more-1834"></span>My spin on some points:</p>
<ul>
<li>
System architects once were the people who said &#8220;no&#8221; to features to maintain design consistency and coherency.  Now architects are the people who buy and bring in external frameworks and technologies (killing any chance of consistency or coherency).  Kind of like the Fahrenheit 451 quote &#8220;I remember firemen used to fight fires.&#8221;
</li>
<li>
By far the thing that aged the worst was the reverence for the WIMP (windows, icons, menus, pointing) paradigm.  At this point I think we can argue that WIMP codified a lot of provably bad decisions: desktops, icons, menus and mouse out of visual field.  Maybe some of the ideas prior to WIMP (like SAGE&#8217;s light-pens) or after WIMP (application launcher noun-verb theories like Quicksilver, search, touch pads, full screen apps, versioning and not forcing the user to adapt to the file storage abstraction) are actually much more fundamental.  I think we all were seduced by the 1968 Engelbart demo but forget that the Semi Automated Ground Environment was a production deployed direct (light pen) multi user information sharing point and click system since 1959.</p>
<p><center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/IMG_0064.jpg" alt="SAGE station" title="IMG_0064.JPG" border="0" width="600" height="450" /></p>
<p>SAGE station, Computer History Museum- Mountain View, CA</p>
<p></center>
</li>
<li>
Most everything else ages very well.  The discussions of pain of having to work &#8220;out of core&#8221; remain relevant as this is what we now call &#8220;big data&#8221; (though in Brooks&#8217; time this pain extends to documentation, source code and binaries all of which are too big to hold in memory or even in machine accessible format in the time of the IBM System/360).  </p>
<p>Though in the old days- &#8220;out of core&#8221; meant punched cards, punched tape, magnetic tape or very slow hard disks (which were a new luxury for the period Brooks writes about).<br />
<center><br />
<img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2011/10/IMG_0062.jpg" alt="IMG 0062" title="IMG_0062.JPG" border="0" width="450" height="600" /></p>
<p>SDS 920 with built in tape-drive, Computer History Museum- Mountain View, CA</p>
<p></center>
</li>
<li>
Linkers were among the biggest problems in the 1960s and remain so now (though we now call it late binding, jars, shared libraries and APIs).  At one point Brooks throws up his hands and says that it would be faster to just re-compile everything than to deal with some relocating linkers.
</li>
<li>
Brooks definitely advocates and anticipates things like developer wikis (though he had to use microfiche as the computers of his day didn&#8217;t have enough storage to manage their own documentation).
</li>
<li>
&#8220;Literate Programming&#8221; is clearly anticipated.
</li>
<li>
Version control procedures are definitely written about, but Brooks seems not to anticipate version control software.
</li>
</ul>
<p>Overall: very well written and still interesting and relevant.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/' rel='bookmark' title='Map Reduce: A Good Idea'>Map Reduce: A Good Idea</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/10/the-mythical-man-month-is-still-a-good-read/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Programmers Should Know R</title>
		<link>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=programmers-should-know-r</link>
		<comments>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/#comments</comments>
		<pubDate>Sat, 06 Aug 2011 15:29:22 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[diagnosis]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1711</guid>
		<description><![CDATA[Programmers should definitely know how to use R. I don&#8217;t mean they should switch from their current language to R, but they should think of R as a handy tool during development. Again and again I find myself working with Java code like the following. public class SomeBigProject1 { public static double logStirlingApproximation(final int n) { [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Survive R'>Survive R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Programmers should definitely know how to use <a target="_blank" href="http://cran.r-project.org/">R</a>.  I don&#8217;t mean they should switch from their current language to R, but they should think of R as a handy tool during development.<span id="more-1711"></span> Again and again I find myself working with Java code like the following.</p>
<style type="text/css">
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
pre { line-height: 125%; }
body .hll { background-color: #ffffcc }
body  { background: #f8f8f8; }
body .c { color: #408080; font-style: italic } /* Comment */
body .err { border: 1px solid #FF0000 } /* Error */
body .k { color: #008000; font-weight: bold } /* Keyword */
body .o { color: #666666 } /* Operator */
body .cm { color: #408080; font-style: italic } /* Comment.Multiline */
body .cp { color: #BC7A00 } /* Comment.Preproc */
body .c1 { color: #408080; font-style: italic } /* Comment.Single */
body .cs { color: #408080; font-style: italic } /* Comment.Special */
body .gd { color: #A00000 } /* Generic.Deleted */
body .ge { font-style: italic } /* Generic.Emph */
body .gr { color: #FF0000 } /* Generic.Error */
body .gh { color: #000080; font-weight: bold } /* Generic.Heading */
body .gi { color: #00A000 } /* Generic.Inserted */
body .go { color: #808080 } /* Generic.Output */
body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
body .gs { font-weight: bold } /* Generic.Strong */
body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
body .gt { color: #0040D0 } /* Generic.Traceback */
body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
body .kp { color: #008000 } /* Keyword.Pseudo */
body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
body .kt { color: #B00040 } /* Keyword.Type */
body .m { color: #666666 } /* Literal.Number */
body .s { color: #BA2121 } /* Literal.String */
body .na { color: #7D9029 } /* Name.Attribute */
body .nb { color: #008000 } /* Name.Builtin */
body .nc { color: #0000FF; font-weight: bold } /* Name.Class */
body .no { color: #880000 } /* Name.Constant */
body .nd { color: #AA22FF } /* Name.Decorator */
body .ni { color: #999999; font-weight: bold } /* Name.Entity */
body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
body .nf { color: #0000FF } /* Name.Function */
body .nl { color: #A0A000 } /* Name.Label */
body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
body .nt { color: #008000; font-weight: bold } /* Name.Tag */
body .nv { color: #19177C } /* Name.Variable */
body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
body .w { color: #bbbbbb } /* Text.Whitespace */
body .mf { color: #666666 } /* Literal.Number.Float */
body .mh { color: #666666 } /* Literal.Number.Hex */
body .mi { color: #666666 } /* Literal.Number.Integer */
body .mo { color: #666666 } /* Literal.Number.Oct */
body .sb { color: #BA2121 } /* Literal.String.Backtick */
body .sc { color: #BA2121 } /* Literal.String.Char */
body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
body .s2 { color: #BA2121 } /* Literal.String.Double */
body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
body .sh { color: #BA2121 } /* Literal.String.Heredoc */
body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
body .sx { color: #008000 } /* Literal.String.Other */
body .sr { color: #BB6688 } /* Literal.String.Regex */
body .s1 { color: #BA2121 } /* Literal.String.Single */
body .ss { color: #19177C } /* Literal.String.Symbol */
body .bp { color: #008000 } /* Name.Builtin.Pseudo */
body .vc { color: #19177C } /* Name.Variable.Class */
body .vg { color: #19177C } /* Name.Variable.Global */
body .vi { color: #19177C } /* Name.Variable.Instance */
body .il { color: #666666 } /* Literal.Number.Integer.Long */
</style>
<div class="highlight">
<pre><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeBigProject1</span> <span class="o">{</span>
	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">double</span> <span class="nf">logStirlingApproximation</span><span class="o">(</span><span class="kd">final</span> <span class="kt">int</span> <span class="n">n</span><span class="o">)</span> <span class="o">{</span>
		<span class="k">return</span> <span class="n">n</span><span class="o">*(</span><span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="n">n</span><span class="o">)-</span><span class="mi">1</span><span class="o">)</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="mi">2</span><span class="o">*</span><span class="n">Math</span><span class="o">.</span><span class="na">PI</span><span class="o">*</span><span class="n">n</span><span class="o">);</span>
	<span class="o">}</span>

	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">double</span> <span class="nf">logFactorial</span><span class="o">(</span><span class="kd">final</span> <span class="kt">int</span> <span class="n">n</span><span class="o">)</span> <span class="o">{</span>
		<span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="n">n</span><span class="o">;</span><span class="n">i</span><span class="o">&gt;</span><span class="mi">1</span><span class="o">;--</span><span class="n">i</span><span class="o">)</span> <span class="o">{</span>
			<span class="n">r</span> <span class="o">+=</span> <span class="n">Math</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="n">i</span><span class="o">);</span>
		<span class="o">}</span>
		<span class="k">return</span> <span class="n">r</span><span class="o">;</span>
	<span class="o">}</span>

	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
		<span class="kt">int</span> <span class="n">nbad</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">n</span><span class="o">=</span><span class="mi">1000</span><span class="o">;</span><span class="n">n</span><span class="o">&lt;</span><span class="mi">10000</span><span class="o">;++</span><span class="n">n</span><span class="o">)</span> <span class="o">{</span>
			<span class="k">if</span><span class="o">(</span><span class="n">Math</span><span class="o">.</span><span class="na">abs</span><span class="o">(</span><span class="n">logFactorial</span><span class="o">(</span><span class="n">n</span><span class="o">)-</span><span class="n">logStirlingApproximation</span><span class="o">(</span><span class="n">n</span><span class="o">))&gt;=</span><span class="mf">1.0</span><span class="n">e</span><span class="o">-</span><span class="mi">5</span><span class="o">)</span> <span class="o">{</span>
				<span class="o">++</span><span class="n">nbad</span><span class="o">;</span>
			<span class="o">}</span>
		<span class="o">}</span>
		<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;nbad: &quot;</span> <span class="o">+</span> <span class="n">nbad</span><span class="o">);</span>
	<span class="o">}</span>
<span class="o">}</span>
</pre>
</div>
<p>Imagine that this is some humongous project to use <a target="_blank" href="http://en.wikipedia.org/wiki/Stirling's_approximation">Stirling&#8217;s Approximation</a> as a replacement for factorial.  All the code up until main is great.  But the unfortunate developer has hard-coded an acceptance test into <code>main()</code>.  If they run their big project all they get out is:</p>
<pre>
nbad: 7334
</pre>
<p>The developer needs to re-code and re-build to diagnose the failure, tweak their acceptance criteria or add more measurements.</p>
<p>I strongly recommend a different work pattern.  Instead of bringing criteria into the code, bring the data out:</p>
<div class="highlight">
<pre><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeBigProject2</span> <span class="o">{</span>
	<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
		<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;n&quot;</span>
				<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="s">&quot;logFactorial&quot;</span>
				<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="s">&quot;logStirlingApproximation&quot;</span><span class="o">);</span>
		<span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">n</span><span class="o">=</span><span class="mi">1000</span><span class="o">;</span><span class="n">n</span><span class="o">&lt;</span><span class="mi">10000</span><span class="o">;++</span><span class="n">n</span><span class="o">)</span> <span class="o">{</span>
			<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">String</span><span class="o">.</span><span class="na">valueOf</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
					<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="n">SomeBigProject1</span><span class="o">.</span><span class="na">logFactorial</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
					<span class="o">+</span> <span class="s">&quot;\t&quot;</span> <span class="o">+</span> <span class="n">SomeBigProject1</span><span class="o">.</span><span class="na">logStirlingApproximation</span><span class="o">(</span><span class="n">n</span><span class="o">));</span>
		<span class="o">}</span>
	<span class="o">}</span>
<span class="o">}</span>
</pre>
</div>
<p>Capture this output in a file named &#8220;data.tsv&#8221; and both Microsoft Excel and R can open it.  Naturally I prefer to use R (so that is what I will demonstrate).  To read the results into R you start up R and type in a command like the following:</p>
<pre>
 &gt; d &lt;- read.table('data.tsv',
        header=T,sep='\t',quote='',as.is=T,
        stringsAsFactors=F,comment.char='',allowEscapes=F)
</pre>
<p>Most of the arguments control the style of file R is to expect (what the field separator is, whether to expect escapes and quotes, and so on).  The settings I suggest here are the &#8220;ultra hardened&#8221; settings.  If you make sure none of your fields have a tab or line-break in them when you print, then it is guaranteed R can read the data (no matter what wacky symbols are in it).  On the Java side that usually means making sure any varying text fields are run through <code>.replaceAll("\\s+"," ")</code> &#8220;just in case.&#8221; At this point you can already look at your data with the <code>summary()</code> command:</p>
<pre>
 &gt; summary(d)
</pre>
<pre>
       n         logFactorial   logStirlingApproximation
 Min.   :1000   Min.   : 5912   Min.   : 5912
 1st Qu.:3250   1st Qu.:23034   1st Qu.:23034
 Median :5500   Median :41870   Median :41870
 Mean   :5500   Mean   :42536   Mean   :42536
 3rd Qu.:7749   3rd Qu.:61653   3rd Qu.:61653
 Max.   :9999   Max.   :82100   Max.   :82100
</pre>
<p>This immediately hints that you should have been thinking in terms of relative error instead of absolute error (since insisting on high absolute accuracy on large results does not always make sense).</p>
<p>You also have access to standard statistical measures of agreement like correlation: </p>
<pre>
 &gt; with(d,cor(logFactorial,logStirlingApproximation))
</pre>
<pre>
result: 1
</pre>
<p>You can see where your failures were:</p>
<pre>
 &gt; library(ggplot2)
 &gt; d$bad &lt;- with(d,abs(logFactorial-logStirlingApproximation)&gt;=1.0e-5)
 &gt; ggplot(d) + geom_point(aes(x=n,y=bad))
</pre>
<p>Yields the graph:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/bad.png" alt="bad.png" border="0" width="525" height="525" /><br />
</center></p>
<p>You can see all your failures are in the initial interval.  You can then drill in:</p>
<pre>
 &gt; ggplot(d) + geom_point(aes(x=n,y=logFactorial-logStirlingApproximation)) +
        scale_y_log10()
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/diff.png" alt="diff.png" border="0" width="525" height="525" /><br />
</center></p>
<p>And here we see some things (that are in general true for Stirling&#8217;s approximation):</p>
<ol>
<li>It is very accurate.</li>
<li>It is always an underestimate.</li>
<li>It gets better as n gets larger.</li>
</ol>
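<p>The earlier hint about relative error can be acted on back in the test code.  A minimal Java sketch, assuming a hypothetical helper class <code>ErrorCheck</code> (not part of the original project) and a nonzero reference value:</p>

```java
class ErrorCheck {
    /** Relative error of an approximation, measured against a nonzero reference value. */
    static double relativeError(double reference, double approximation) {
        return Math.abs(reference - approximation) / Math.abs(reference);
    }

    public static void main(String[] args) {
        // Hypothetical values, chosen only to show how the error scales with the result.
        double reference = 82100.0;
        double approximation = reference - 1.0e-5;
        System.out.println(relativeError(reference, approximation));
    }
}
```

<p>A test could then require the relative error to stay under some tolerance instead of fixing an absolute bound; which tolerance is sensible depends on the application.</p>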
<p>Essentially by poking around with graphs in R you can figure out the nature of your errors (telling you what to fix) and generate findings that tell you how to fix your criteria (perhaps your code is working, but your test wasn&#8217;t sensible).  The &#8220;dump everything and then use R&#8221; technique is also particularly good for generating reports on code timings using either <code>geom_histogram</code> or <code>geom_density</code>. </p>
<p>For example, if we had data with a field <code>runTimeMS</code> then it is a simple one-liner to get a plot like the following:</p>
<pre>
 &gt; ggplot(t) + geom_density(aes(x=runTimeMS))
</pre>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/timing.png" alt="timing.png" border="0" width="525" height="525" /><br />
</center></p>
<p>From this graph we can immediately see:</p>
<ol>
<li>Most of our run-times are very fast.</li>
<li>We have a heavy right-tail (evidence of &#8220;contagion&#8221; or one slow-down causing others, like CPU or IO contention).</li>
<li>Data is truncated at 100MS (could be something &#8220;censoring&#8221; the measurement, an exception being thrown or an abort).</li>
<li>There is a spike at 30MS (some slow condition holds for a subset of the data that isn&#8217;t present in the majority).</li>
</ol>
<p>This is a lot more than would be seen in a mean-only or mean and standard deviation summary.  We may even be seeing signs of two different bugs (the truncation and the spike).</p>
<p>In all cases the key is to dump a lot of data in machine readable form and then come back to analyze it.  This is far more flexible than hoping to code in the right summaries and then further hoping the summaries don&#8217;t miss something important (or that you at least get a chance to notice if they do miss something).  Being able to do exploratory statistics on dumps from your code (both results and timing) gives you incredible measurement, tuning and debugging powers.   The scriptability of R means any later analysis is as easy as cut and paste.</p>
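<p>On the Java side, the tab-separated dump this workflow relies on can be kept safe for <code>read.table</code> by cleaning each text field before it is written.  A minimal sketch, using the <code>replaceAll("\\s+"," ")</code> call suggested above; the class and method names (<code>TsvRow</code>, <code>clean</code>, <code>row</code>) are illustrative assumptions, not code from the original project:</p>

```java
class TsvRow {
    /** Collapse every run of whitespace (tabs, line breaks, spaces) into a single space. */
    static String clean(String field) {
        return field.replaceAll("\\s+", " ");
    }

    /** Clean each field and join the results with tabs to form one output line. */
    static String row(String... fields) {
        String[] cleaned = new String[fields.length];
        for (int i = 0; i < fields.length; ++i) {
            cleaned[i] = clean(fields[i]);
        }
        return String.join("\t", cleaned);
    }

    public static void main(String[] args) {
        // Header line first, then one data line whose text field contains a tab and a line break.
        System.out.println(row("n", "note"));
        System.out.println(row("1000", "first line\nsecond\tline"));
    }
}
```

<p>Because no cleaned field can contain a tab or a line break, every printed line has exactly one tab between neighboring fields, which is what the hardened <code>read.table</code> settings expect.</p>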
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/' rel='bookmark' title='R examine objects tutorial'>R examine objects tutorial</a></li>
<li><a href='http://www.win-vector.com/blog/2009/09/survive-r/' rel='bookmark' title='Survive R'>Survive R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/08/programmers-should-know-r/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Automatic Detection of Potential Deadlock</title>
		<link>http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=automatic-detection-of-potential-deadlock</link>
		<comments>http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/#comments</comments>
		<pubDate>Sat, 04 Jun 2011 16:55:52 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[cycle detection]]></category>
		<category><![CDATA[deadlock]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1664</guid>
		<description><![CDATA[We would like to share a programming article we wrote on the automatic detection of potential deadlock. The article touches on some fun issues: multithreaded programming, graph algorithms. It was also back when I was considering the bipartite graph as a fundamental basis for data structures (instead of lists, arrays or maps). Related posts: Automatic Differentiation [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/' rel='bookmark' title='Programs reduced to statistics'>Programs reduced to statistics</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We would like to share a programming article we wrote on the <a target="_blank" href="http://www.mzlabs.com/JMPubs/Automatic%20Detection%20of%20Potential%20Deadlock-Mount.pdf">automatic detection of potential deadlock</a>.<span id="more-1664"></span> The article touches on some fun issues: multithreaded programming and graph algorithms.  It was also back when I was considering the bipartite graph as a fundamental basis for data structures (instead of lists, arrays or maps).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/' rel='bookmark' title='Programs reduced to statistics'>Programs reduced to statistics</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/i-know-i-am-the-one-being-a-jerk/' rel='bookmark' title='I know, I am the one being a jerk'>I know, I am the one being a jerk</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/06/automatic-detection-of-potential-deadlock/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Brevity is a Virtue</title>
		<link>http://www.win-vector.com/blog/2011/04/brevity-is-a-virtue/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=brevity-is-a-virtue</link>
		<comments>http://www.win-vector.com/blog/2011/04/brevity-is-a-virtue/#comments</comments>
		<pubDate>Wed, 27 Apr 2011 14:58:33 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1652</guid>
		<description><![CDATA[Our friends at Dataspora have a nice article on the more modern Map Reduce languages. A very good read and clearly a lot of thought went into preparing it. In passing we are rightfully taken to task for hiding a huge glob of code in a tar file that few people are likely to open. Using [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/' rel='bookmark' title='What to do when you run out of memory'>What to do when you run out of memory</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
<li><a href='http://www.win-vector.com/blog/2011/08/programmers-should-know-r/' rel='bookmark' title='Programmers Should Know R'>Programmers Should Know R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Our friends at <a target="_blank" href="http://www.dataspora.com/">Dataspora</a> have a nice <a target="_blank" href="http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/">article on the more modern Map Reduce languages</a>.  A very good read and clearly a lot of thought went into preparing it.<span id="more-1652"></span> In passing we are rightfully taken to task for hiding a huge glob of code in a tar file that few people are likely to open.   Using higher order tools could indeed make the code smaller.  Perhaps small enough that we could share it in a more readable format.  It is a good point and our only answer to it is that we at Win-Vector LLC see ourselves as tool builders delivering complete tools that perform well defined tasks (like a logistic regression) so that most people do not have to open the tar file (but they can if they need to).  That is: we believe in higher order language tools, and we supply some of them.  We also, however, like to minimize external dependencies so that our code can run on more systems.</p>
<p>Back to the tar file issue.  We had been meaning to get our code up on github or some other public source control system.  Instead we have <a target="_blank" href="http://www.win-vector.com/dfiles/WinVectorLogisticRegression/WinVectorLogisticHadoopHTML/list.html">HTMLified it</a> (with some cross-reference links; it still isn&#8217;t pretty).</p>
<p>And Antonio Piccolboni, thanks for the great article.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/12/what-to-do-when-you-run-out-of-memory/' rel='bookmark' title='What to do when you run out of memory'>What to do when you run out of memory</a></li>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
<li><a href='http://www.win-vector.com/blog/2011/08/programmers-should-know-r/' rel='bookmark' title='Programmers Should Know R'>Programmers Should Know R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/04/brevity-is-a-virtue/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A Personal Perspective on Machine Learning</title>
		<link>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=a-personal-perspective-on-machine-learning</link>
		<comments>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/#comments</comments>
		<pubDate>Sun, 31 Oct 2010 21:45:48 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[History]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1551</guid>
		<description><![CDATA[Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence.  I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature.<span id="more-1551"></span><br />
In the early days <a target="_blank" href="http://en.wikipedia.org/wiki/Machine_learning">machine learning</a> and artificial intelligence were famous for promising far too much and delivering far too little.  This has changed.  Artificial decision and reasoning systems are now everywhere.  One of the things masking the breadth and authority of artificial intelligence is the current prejudice: &#8220;if a system is well understood or works then it is no longer called artificial intelligence.&#8221;  A working system becomes a database, expert system, rules engine, machine learning platform, analytics dashboard, pattern recognition system or statistics warehouse.  We clearly have not reached anywhere near building a conversational intelligence (like Hal from 2001 or <a target="_blank" href="http://mzlabs.com/MZLabsJM/page6/Gerty/Gerty.html">Gerty</a> from Moon).  Yet every day machines decide if your credit card is accepted, advise on medical care, route goods, curate information and control vast industrial plants.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Hal-9000.jpg" alt="Hal-9000.jpg" border="0" width="150" height="150" /><br />
<br/>Hal 9000<br />
</center></p>
<p>There have been vast improvements in artificial intelligence.  Much of the improvement has been driven by the engineering effects of Moore&#8217;s Law (resulting in my mobile phone&#8217;s processor having 12 times the clock speed and over 32 times the memory of an $8 million <a target="_blank" href="http://en.wikipedia.org/wiki/Cray-1">Cray 1 supercomputer</a>) and significant machine learning research results.  These machine size changes happened during the productive careers of many researchers, so ideas are often evaluated at a series of radically different machine capabilities and data scales.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Cray-1-deutsches-museum.jpg" alt="Cray-1-deutsches-museum.jpg" border="0" width="487" height="536" /><br />
<br/>Cray 1<br />
</center></p>
<p>von Neumann himself commented that scale was a major limiting factor in early computers.  He asked how you could be expected to achieve anything significant even from a roomful of geniuses if (as with his early computers) all notes, communication and memory were limited to less than a single typed page.  von Neumann&#8217;s comment stands in contrast to science fiction scientists and early boosters of artificial intelligence who always seem to be in awe of their own creations.  Computers are certainly much larger, but we need to be humble and put off deciding if we are yet in the era of large computers (compared to human or animal brains).  Everything we are doing now may still just be artificial intelligence&#8217;s pre-history and prologue.  Feynman in his lectures on computation mentions that RNA transcription can be estimated to take around 100 kT of energy to transcribe a bit while a transistor may easily use 100,000,000 kT energy units to switch states.  This means for the amount of heat the human head dissipates (energy supply and heat dissipation are rapidly becoming the most relevant measures of computational power) you could do a million times more work using RNA techniques (if you knew how) than with transistors.  So computers may not yet be what we should call large (though they are likely getting there).  What we currently call <a target="_blank" href="http://www.win-vector.com/blog/2009/08/on-the-hysteria-over-the-cloud/">&#8220;datacenters&#8221;</a> are in fact block sized computers (consuming an enormous amount of energy and dissipating a huge amount of heat).</p>
<p><center><br />
<img  target="_blank" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/google_data_center_lenoir.jpg" alt="google_data_center_lenoir.jpg" border="0" width="512" height="355" /><br />
<br/>A datacenter (or a block sized computer)<br />
</center></p>
<p>Not all improvements in machine intelligence have come from (or are to come from) improvements in hardware.  Many of the improvements came from machine learning research results and these are what I will outline below.</p>
<p>Early machine learning algorithms were driven by analogy.  This led us to perceptrons (1957, fairly early in the history of computer science) and neural nets.  These methods have their successes but were largely overused and developed before researchers developed a good list of desirable properties of a machine learning method.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/220px-Neural_network_example.svg_.png" alt="220px-Neural_network_example.svg.png" border="0" width="220" height="293" /><br />
<br/>Neural Net diagram<br />
</center></p>
<p>These methods live on but are, in my opinion, not currently competitive.  Some of their important ideas and contributions have been revived from time to time, such as the online update rules becoming what we now call stochastic gradients.</p>
<p>A list of (often incompatible) desirable properties of a machine learning algorithm is the following:</p>
<ul>
<li>Able to represent complicated functions</li>
<li>Good generalization performance (quality predictions on data not seen during training)</li>
<li>Unique optimal model for a given set of data and feature definitions</li>
<li>Efficient and well characterized solution method</li>
<li>Consistent summary statistics</li>
<li>Preference for simple models</li>
</ul>
<p>We divert from this list for a bit of background and context.</p>
<p>The neural net was largely celebrated for its ability to represent complex functions and the perceived efficiency of its newer back-propagation based training method (related to the <a target="_blank" href="http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/">efficient calculation of gradients</a>).  The downsides were you never knew if your neural net was the right one (even assuming you had the right features, layout and training data) and could not be sure you were biasing towards simple models that might perform well on novel queries.  Great effort was expended in extending neural nets based on the supposition they should work as they were an analogy to how we imagined biological neurons might function.  An almost mystic hope was derived from the non-linear nature and special properties of the sigmoid curve (which was in fact a curve already known to statisticians).</p>
<p>Methods other than neural nets also had early success.  The field of information retrieval (which was not &#8220;sexy&#8221; prior to the Web) had huge success since the 1960s with <a target="_blank" href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Rocchio_Classification">Rocchio Classification</a>, and <a target="_blank" href="http://en.wikipedia.org/wiki/Tf–idf">TF/IDF</a> methods.  The early success of these methods may have in fact delayed research on current hot research areas such as segmentation and author topic models.</p>
<p>Theoretical computer science initially sought to characterize machine learning methods in non-statistical language.  In the 1980s a great amount of ink was spilled on &#8220;learning boolean functions.&#8221;  Papers proving nothing was learnable (by picking a function related to cryptography) alternated with papers proving everything was learnable (for example via amplification techniques like boosting).  Generalization of models to new data remained a theoretical problem that was dealt with by appeals to model complexity and <a target="_blank" href="http://en.wikipedia.org/wiki/Minimum_description_length">MDL</a> (minimum description length).  A major breakthrough in characterizing generalization performance was the <a target="_blank" href="http://en.wikipedia.org/wiki/Probably_approximately_correct_learning">PAC model</a> (probably approximately correct) framework which finally allowed direct treatment of generalization performance.</p>
<p>We now have enough context  to discuss some of the current best of breed machine learning techniques (that address many of the desired properties mentioned above):</p>
<ul>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine">Support Vector Machines</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Kernel_methods">Kernel Methods</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Logistic_regression">Logistic Regression</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Principle_of_maximum_entropy">Maximum Entropy Methods</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">Regularization</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Graphical_model">Graphical Models</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Conditional_random_field">Conditional Random Fields</a></li>
</ul>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/556px-Svm_max_sep_hyperplane_with_margin.png" alt="556px-Svm_max_sep_hyperplane_with_margin.png" border="0" width="278" /><br />
<br/><br />
Typical SVM maximum margin diagram<br />
</center></p>
<p>Not all of these methods are new (Logistic Regression for example dates from 1925 and is itself based on regression which goes back to Gauss).  But the concerns these methods address are all much more statistical than artificial intelligence in nature.  For example we don&#8217;t suppose that there is some cryptographically obscured combination of features that we need to find to make the best prediction.  We instead worry about detecting which features are useful and note that it is a significant (though solvable) problem to correctly use combinations of useful features (phrased as statistical concerns: feature to feature dependencies and higher order interactions).  Machine learning has always run where statisticians fear to tread.  But more and more often we are seeing that the methods and concerns of statisticians are what are needed to achieve many of the listed desired properties of machine learning models.</p>
<p>The methods I have singled out for praise are very effective and achieve a number of our listed desired properties.  For example:  both logistic regression and maximum entropy have a unique solution that is easy to find.  They are also both consistent with all summaries known during training.  That is: if 30% of the positive training data has a feature present then 30% of the data also has the feature present when weighted by the model&#8217;s score (so the model score shares a lot of properties with training truth).  Support Vector Machines also have well understood solutions and a theory (called maximum margin) that directly addresses generalization (good predictions on new data).  Kernel Methods (both as used in SVMs and elsewhere) allow controlled introduction of very complex functions.  Graphical Models and Conditional Random Fields also allow the controlled introduction of modeled dependencies in the data.</p>
<p>It is now common to call what was previously thought of as artificial intelligence or machine learning: &#8220;statistical machine learning.&#8221;  This reflects that the kind of prediction and characterization we expect from machine learning algorithms are in fact statistical concerns that we can deal with if we have enough data and enough computational resources. </p>
<p>The current important issues for statistical machine learning include:</p>
<ul>
<li>Dealing with very large datasets (driving the return of simpler methods like Naive Bayes)</li>
<li>Dealing with lack of training data (driving interest in clustering and manifold regularization methods)</li>
<li>Dealing with unstructured data and text mining (driving interest in information extraction and segmentation via generative models)</li>
</ul>
<p>Just as Wigner famously wrote about &#8220;The Unreasonable Effectiveness of Mathematics&#8221; in the 1960s, Halevy, Norvig and Pereira write about the &#8220;Unreasonable Effectiveness of Data.&#8221;   They argue that we are in the age of big data (or the age of analysts).   Or, as Varian observed: &#8220;it is a good time to supply a good complementary to data&#8221; (i.e. it is a good time to be an analyst).  I would temper this by noting that we are likely in the age of unmarked data and unstructured data.  Less often are we asked to automate a known prediction and more often we are asked to cluster, characterize and segment wild data. In my opinion the hard problem in machine learning has moved from prediction to characterization.  With enough marked training data (that is, data for which we know both the observables and desired outcome) it is now quite possible to use standard techniques and libraries to build a very good predictive model.  However, it is still hard to characterize, segment or extract useful information from the wealth of unstructured and unmarked data that is upon us.  And this is where a lot of the current research in statistical machine learning is directed.  </p>
<p>Of course characterization and clustering have their own infamous history.  Rota wrote: &#8220;&#8230; Or a subject is important, but nobody understands what is going on; such is the case with quantum field theory, the distribution of primes, pattern recognition and cluster analysis.&#8221;  Artificial intelligence may be moving from areas where computer scientists have over-promised to areas where statisticians have over-promised.  But this is not a disaster: the most valuable research tends to be done in hectic times in messy fields, not in calm times in neat fields.  And the already large scale adoption of statistical machine learning techniques means there is immediate great client value in even seemingly small improvements in understanding, explanation, documentation, training, tools, libraries and techniques.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Xbarst1.jpg" alt="Xbarst1.jpg" border="0" width="384" height="398" /><br />
<br/><br />
Classic attempt to add structure to text<br />
</center></p>
<p>(images from Wikipedia)</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>What Did Theorists Do Before The Age Of Big Data?</title>
		<link>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-did-theorists-do-before-the-age-of-big-data</link>
		<comments>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/#comments</comments>
		<pubDate>Mon, 02 Aug 2010 18:42:45 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Age of Big Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Mean]]></category>
		<category><![CDATA[Mean of Medians]]></category>
		<category><![CDATA[Median]]></category>
		<category><![CDATA[Median of Means]]></category>
		<category><![CDATA[Theorist]]></category>
		<category><![CDATA[Winsorized mean]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1514</guid>
		<description><![CDATA[We have been living in the age of &#8220;big data&#8221; for some time now. This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: &#8220;The Unreasonable Effectiveness of Data&#8221; Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)). But [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We have been living in the age of &#8220;big data&#8221; for some time now.  This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: &#8220;The Unreasonable Effectiveness of Data&#8221; Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)).  But I have gotten to thinking about the period before this.   The period before we had easy access to so much data, before most computation was aggregation and before we accepted numerical analysis style convergence as &#8220;efficient.&#8221;  A small problem I needed to solve (as part of a bigger project)  reminded me what theoretical computer scientists did then: we worried about provable worst case efficiency.</p>
<p><span id="more-1514"></span><br />
The problem that got me thinking is this: </p>
<p>Given a sequence of n integers x1 through xn and an integer k (1 &le; k &le; n), find the mean value of all of the medians of the k-sized selections from x1 through xn.  Or as a formula:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/EMedian.png" alt="EMedian.png" border="0" width="220" /><br />
</center></p>
<p>where x_s is defined as the sequence of integers whose indices are in the set s (not necessarily a contiguous sequence).   The median is the &#8220;value in the middle&#8221; (a value such that half of the selected data are above it and half are below) and &#8220;(n choose k)&#8221; is the number of ways to choose k items from a collection of n items (which is just the number: n!/((n-k)! k!)).  So our sum is adding up a number of terms and then dividing by the number of such terms to get the average or mean of the terms.  We will call this sum a &#8220;mean of medians&#8221;.</p>
<p>Some obvious special cases are: for k=1 the expression simplifies to the sum of the x_i divided by n (just the mean of all the x_i), and for k=n it simplifies to the median of all of the x_i.  For intermediate values of k it is not immediately obvious how to compute the value of the sum efficiently.  Directly adding all (n choose k) terms (as the sum is written) would be very slow for large n with even moderately sized k.  Instead we look for some method that produces the same total without directly computing all of the terms.</p>
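<p>As a reference point, the direct (slow) evaluation of the sum can be sketched in a few lines of Python.  This is only an illustrative sketch: the function name <code>mean_of_medians_brute_force</code> is made up for this example, and it visits all (n choose k) selections, so it is only usable for small n.</p>

```python
from itertools import combinations
from statistics import median


def mean_of_medians_brute_force(xs, k):
    """Average the median over every k-sized selection of positions in xs.

    Illustrative only: the loop runs (n choose k) times.
    """
    total = 0
    count = 0
    # combinations() picks by position, so repeated values are handled
    # as separate items, matching the "subset of indices" definition.
    for selection in combinations(xs, k):
        total += median(selection)
        count += 1
    return total / count


# The two special cases: k=1 gives the mean, k=n gives the median.
print(mean_of_medians_brute_force([3, 1, 4, 1, 5], 1))
print(mean_of_medians_brute_force([3, 1, 4, 1, 5], 5))
```

<p>A slow version like this is mostly useful as a check on any faster method.</p>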
<p>This gets us to the ad-hoc side of theoretical computer science.  We need a clever idea.  In this case the idea is simple.  To keep things simple: suppose all of the n integers in our sequence have different values and that k is odd (neither of these conditions is important; they just let us avoid some non-essential technicalities).  What values can median(x_s) take on? median(x_s) is always x_i for some i in the subset s.  In fact our sum is equivalent to:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/Sum2.png" alt="Sum2.png" border="0" width="330"  /><br />
</center></p>
<p>This new sum has a reasonable number of terms, so we could actually calculate it directly if we knew the values of the terms.  Without loss of generality assume the x_i are sorted in increasing order.  Then the number of times x_i is the median of some x_s is exactly:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/term.png" alt="term.png" border="0" width="191" /><br />
</center><br />
(and 0 for i &lt; 1+(k-1)/2 or i &gt; n - (k-1)/2).  This is just the number of ways to place x_i in the center of a subset: choose (k-1)/2 smaller neighbors and (k-1)/2 larger neighbors.  The count is given by multiplying the number of ways to choose the smaller neighbors by the number of ways to choose the larger neighbors.</p>
<p>The complete solution calculating the mean of medians for distinct sorted x_i is:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/fullsum1.png" alt="fullsum.png" border="0" width="333"  /><br />
</center></p>
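<p>The weighted sum described above can be sketched in Python as follows.  This is a minimal sketch under stated assumptions rather than a definitive implementation: the name <code>mean_of_medians</code> is chosen for this example, the inputs are assumed to be integers, k is assumed to be odd, and the result is returned as an exact <code>Fraction</code>.  It relies on <code>math.comb</code> (Python 3.8 or later) returning 0 when asked to choose more items than are available, which covers the zero-weight ranks automatically.</p>

```python
from fractions import Fraction
from math import comb


def mean_of_medians(xs, k):
    """Mean of the medians of all k-sized selections from xs (k odd).

    After sorting, the item of rank i is the median of exactly
    C(i-1, h) * C(n-i, h) selections, where h = (k-1)/2.
    """
    n = len(xs)
    if k % 2 == 0 or k not in range(1, n + 1):
        raise ValueError("this sketch assumes k is odd and between 1 and n")
    h = (k - 1) // 2
    total = sum(
        comb(i - 1, h) * comb(n - i, h) * x
        for i, x in enumerate(sorted(xs), start=1)
    )
    return Fraction(total, comb(n, k))


print(mean_of_medians([1, 2, 3, 4, 5], 3))
```

<p>The cost is dominated by the sort plus n binomial-coefficient evaluations, instead of the (n choose k) terms of the original sum.</p>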
<p>A statistician would recognize this expression as a kind of centrally weighted Winsorized mean.  The shape of the graph of weights (in this case for n=10, k=5) is suggestive of a bounded normal window (though i is a rank, not a free-ranging value):</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/10w5.png" alt="10w5.png" border="0" width="525" height="525" /><br />
</center></p>
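<p>The weights for the n=10, k=5 case can be recomputed directly from the counting argument.  The helper name <code>median_weights</code> below is made up for this illustration; it returns the unnormalized counts, which must add up to (n choose k) because every selection has exactly one median.</p>

```python
from math import comb


def median_weights(n, k):
    """Number of k-sized selections (k odd) whose median has rank i, for i = 1..n."""
    h = (k - 1) // 2
    return [comb(i - 1, h) * comb(n - i, h) for i in range(1, n + 1)]


weights = median_weights(10, 5)
print(weights)                      # [0, 0, 21, 45, 60, 60, 45, 21, 0, 0]
print(sum(weights) == comb(10, 5))  # True
```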
<p>Likely we have re-invented a data treatment known to statisticians.  But the above steps were really just combinatorics.  What a theorist does is abstract something down to this sort of problem and think of variations and solutions.   The formal side of theoretical computer science is attempting to organize all of the problems in the world into two classes: problems presumed easy and problems presumed hard.</p>
<p>For example: what if we had wanted to know the median of many means instead of the mean of many medians?<br />
It turns out a small variation of the median of means problem is already known to be difficult.  The hard version of the reversed problem is called &#8220;Kth largest subset&#8221; (this is a different K than the k we have been using up until now).  The Kth largest subset problem is: given a sequence of integers and constants K and B, do K or more distinct subsets have a sum of no more than B?  The Kth largest subset problem is known to be &#8220;NP-hard,&#8221; which means that a whole host of other problems thought to be hard could be solved if we had the ability to solve this problem (see &#8220;Computers and Intractability: A Guide to the Theory of NP-Completeness&#8221; Michael R. Garey and David S. Johnson, 1979).  The median of many means is not quite as expressive as the Kth largest subset problem (so we have <em>not</em> proven the median of many means is itself NP-hard) but the relation is strong: to even verify that we had the right median we would need to solve a Kth largest subset problem with K=(n choose k)/2 and B=k*the_claimed_median (assuming we modified the subset problem to restrict itself to size k sub-sequences).  In fact even checking if a given number is one of the possible means (let alone if it is the median of the means) is equivalent to the NP-complete knapsack problem.  This correspondence makes us suspect that something as simple as the trick we devised for the mean of medians problem will not solve the median of means problem.  One should not take this argument too seriously though, as the same argument would seem to apply to the pair of problems &#8220;min of means&#8221; and &#8220;mean of mins,&#8221; both of which are in fact easy.  We have not proven the median of means problem to be hard; we have just found relevant algorithms and related problems.</p>
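<p>To see why the &#8220;min of means&#8221; and &#8220;mean of mins&#8221; pair is easy, here is a small Python sketch.  The function names are made up for this example and integer inputs are assumed.  The smallest mean over all k-sized selections is simply the mean of the k smallest values, and the same counting trick used for medians gives the mean of the minimums: after sorting, the item of rank i is the minimum of exactly (n-i choose k-1) selections.</p>

```python
from fractions import Fraction
from math import comb


def min_of_means(xs, k):
    """Smallest mean over all k-sized selections: take the k smallest values."""
    smallest = sorted(xs)[:k]
    return Fraction(sum(smallest), k)


def mean_of_mins(xs, k):
    """Mean of the minimums of all k-sized selections.

    The item of rank i (1-based, after sorting) is the minimum whenever
    the other k-1 items are all chosen from the n-i larger ranks.
    """
    n = len(xs)
    total = sum(
        comb(n - i, k - 1) * x
        for i, x in enumerate(sorted(xs), start=1)
    )
    return Fraction(total, comb(n, k))


print(min_of_means([3, 1, 4, 1, 5], 2))
print(mean_of_mins([3, 1, 4, 1, 5], 2))
```

<p>Neither function enumerates the selections, which is what makes these two problems easy even though the hardness argument above appears to apply to them as well.</p>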
<p>What theorists do is find these analogs and then do the heavy lifting (continue the work even when the solution is not simple) to find a way to either efficiently solve the problem at hand or prove it is in fact as hard as other important problems.  This kind of thing is hard work for small gains, but the value accumulates because each of these gains is permanent.  Finally, additional variations of the problem are tried and characterized, to help check that we are not &#8220;leaving money on the table&#8221; (missing nearby improvements).  Some of this may seem like pointless work on toy problems, but every once in a while one of these solutions or characterizations is exactly what is needed to take a large system (like a search engine) to the next level of performance.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

