<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Theorist</title>
	<atom:link href="http://www.win-vector.com/blog/tag/theorist/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>An Appreciation of Locality Sensitive Hashing</title>
		<link>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=an-appreciation-of-locality-sensitive-hashing</link>
		<comments>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 16:41:46 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Exciting Techniques]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Locality Sensitive Hashing]]></category>
		<category><![CDATA[Nearest Neighbor]]></category>
		<category><![CDATA[Theorist]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1848</guid>
		<description><![CDATA[We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Locality sensitive hashing is awe inspiring in its originality, simplicity, beauty and effectiveness. In addition, locality sensitive hashing is a remarkable technique as it works even when drastically abridged and [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We share our admiration for a set of results called &#8220;locality sensitive hashing&#8221; by demonstrating a greatly simplified example that exhibits the spirit of the techniques. <span id="more-1848"></span>Locality sensitive hashing is awe inspiring in its originality, simplicity, beauty and effectiveness.  In addition, locality sensitive hashing is a remarkable technique as it works even when drastically abridged and simplified.  In this <a target="_blank" href="http://www.win-vector.com/dfiles/LocalitySensitiveHashing.pdf">paper (link to pdf)</a> we give a description of conditions where the technique works and a heuristic argument for why it works (using only elementary math).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/' rel='bookmark' title='A Quick Appreciation of the Sharpe Ratio'>A Quick Appreciation of the Sharpe Ratio</a></li>
<li><a href='http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/' rel='bookmark' title='The Local to Global Principle'>The Local to Global Principle</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/11/an-appreciation-of-locality-sensitive-hashing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What Did Theorists Do Before The Age Of Big Data?</title>
		<link>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-did-theorists-do-before-the-age-of-big-data</link>
		<comments>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/#comments</comments>
		<pubDate>Mon, 02 Aug 2010 18:42:45 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Age of Big Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Mathematical Bedside Reading]]></category>
		<category><![CDATA[Mean]]></category>
		<category><![CDATA[Mean of Medians]]></category>
		<category><![CDATA[Median]]></category>
		<category><![CDATA[Median of Means]]></category>
		<category><![CDATA[Theorist]]></category>
		<category><![CDATA[Winsorized mean]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1514</guid>
		<description><![CDATA[We have been living in the age of &#8220;big data&#8221; for some time now. This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: &#8220;The Unreasonable Effectiveness of Data&#8221; Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)). But [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We have been living in the age of &#8220;big data&#8221; for some time now.  This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: &#8220;The Unreasonable Effectiveness of Data&#8221; Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)).  But I have gotten to thinking about the period before this.   The period before we had easy access to so much data, before most computation was aggregation and before we accepted numerical analysis style convergence as &#8220;efficient.&#8221;  A small problem I needed to solve (as part of a bigger project)  reminded me what theoretical computer scientists did then: we worried about provable worst case efficiency.</p>
<p><span id="more-1514"></span><br />
The problem that got me thinking is this: </p>
<p>Given a sequence of n integers x1 through xn and an integer k (1 &le; k &le; n), find the mean value of all of the medians of the k-sized selections from x1 through xn.  Or as a formula:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/EMedian.png" alt="EMedian.png" border="0" width="220" /><br />
</center></p>
<p>where x_s is defined as the sequence of integers whose indices are in the set s (not necessarily a contiguous sequence).   The median is the &#8220;value in the middle&#8221; (a value such that half of the selected data are above it and half are below) and &#8220;(n choose k)&#8221; is the number of ways to choose k items from a collection of n items (which is just the number: n!/((n-k)! k!)).  So our sum is adding up a number of terms and then dividing by the number of such terms to get the average or mean of the terms.  We will call this sum a &#8220;mean of medians&#8221;.</p>
<p>Some obvious special cases are: for k=1 the expression simplifies to the sum of the x_i divided by n (or just the mean of all the x_i) and for k=n it simplifies to the median of all of the x_i.  For intermediate values of k it is not immediately obvious how to efficiently compute the value of the sum.  Directly adding all (n choose k) terms (as the sum is written) would be very slow for large n with even moderate sized k.  Instead we look for some method that yields the same total without directly computing all of the terms.</p>
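<p>As a reference point, the direct sum can be sketched in a few lines of Python. This is only an illustration of the definition (the function name and the sample data are made up for this example), and it visits all (n choose k) selections, so it is usable only for small n:</p>

```python
from fractions import Fraction
from itertools import combinations
from statistics import median


def mean_of_medians_brute_force(xs, k):
    """Average the median over every k-sized selection of xs (n choose k terms)."""
    total = Fraction(0)
    count = 0
    for selection in combinations(xs, k):
        total += Fraction(median(selection))
        count += 1
    return total / count


xs = [3, 1, 4, 15, 9, 2, 6]  # hypothetical sample data
print(mean_of_medians_brute_force(xs, 1))        # k=1: the plain mean of xs
print(mean_of_medians_brute_force(xs, len(xs)))  # k=n: the median of xs
print(mean_of_medians_brute_force(xs, 3))        # an intermediate k
```

<p>Exact fractions are used so that the result can be compared against any faster method without worrying about rounding.</p>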
<p>This gets us to the ad-hoc side of theoretical computer science.  We need a clever idea.  In this case the idea is simple.  To keep things simple: suppose all of the n integers in our sequence have different values and that k is odd (neither of these conditions is important; they just let us avoid some non-essential technicalities).  What values can median(x_s) take on? median(x_s) is always x_i for some i in the subset s.  In fact our sum is equivalent to:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/Sum2.png" alt="Sum2.png" border="0" width="330"  /><br />
</center></p>
<p>This new sum has a reasonable number of terms- so we could actually calculate it directly if we knew the values of the terms.  Without loss of generality assume the x_i are sorted in increasing order.  Then the number of times x_i is the median of some x_s is exactly:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/term.png" alt="term.png" border="0" width="191" /><br />
</center><br />
(and 0 for i &lt; 1+(k-1)/2 or i &gt; n &#8211; (k-1)/2).  This is just the number of ways to place x_i in the center of a subset- by choosing (k-1)/2 smaller neighbors and (k-1)/2 larger neighbors.   The count is given by multiplying the number of ways to choose smaller neighbors by the number of ways to choose the larger neighbors.</p>
<p>The complete solution calculating the mean of medians for distinct sorted x_i is:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/fullsum1.png" alt="fullsum.png" border="0" width="333"  /><br />
</center></p>
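<p>The weighted sum described above can be sketched in Python as follows. The function names are chosen for this example, and the sketch keeps the same simplifying assumptions as the text (distinct values, odd k): the weight of the value of rank i is the number of ways to choose (k-1)/2 smaller neighbors times the number of ways to choose (k-1)/2 larger neighbors.</p>

```python
from fractions import Fraction
from math import comb


def median_weights(n, k):
    """Number of k-sized selections whose median is the value of rank i, for i = 1..n."""
    h = (k - 1) // 2  # neighbors needed on each side of the median
    # comb returns 0 when fewer than h values are available on one side
    return [comb(i - 1, h) * comb(n - i, h) for i in range(1, n + 1)]


def mean_of_medians(xs, k):
    """Mean of medians via the weighted sum; assumes distinct values and odd k."""
    if k % 2 != 1:
        raise ValueError("this sketch only handles odd k")
    xs = sorted(xs)
    weights = median_weights(len(xs), k)
    total = sum(w * x for w, x in zip(weights, xs))
    return Fraction(total, comb(len(xs), k))


print(median_weights(10, 5))  # [0, 0, 21, 45, 60, 60, 45, 21, 0, 0]
print(mean_of_medians([3, 1, 4, 15, 9, 2, 6], 3))
```

<p>The work is one sort plus n terms, instead of (n choose k) terms, and the weights always add up to (n choose k).</p>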
<p>A statistician would recognize this expression as a kind of centrally weighted Winsorized mean.  The shape of the graph of weights (in this case for n=10, k=5) is suggestive of a bounded normal window (though i is a rank, not a free-ranging value):</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/07/10w5.png" alt="10w5.png" border="0" width="525" height="525" /><br />
</center></p>
<p>Likely we have re-invented a data treatment known to statisticians.  But the above steps were really just combinatorics.  What a theorist does is abstract something down to this sort of problem and think of variations and solutions.   The formal side of theoretical computer science is attempting to organize all of the problems in the world into two classes: problems presumed easy and problems presumed hard.</p>
<p>For example- what if we had wanted to know the median of many means instead of the mean of many medians?<br />
It turns out a small variation of the median of means problem is already known to be difficult.  The hard version of the reversed problem is called &#8220;Kth largest subset&#8221; (this is a different K than we have been using up until now).   The Kth largest subset problem is: given a sequence of integers and constants K and B do K or more distinct subsets have a sum of no more than B?  The Kth largest subset problem is known to be &#8220;NP hard&#8221; which means that a whole host of other problems thought to be hard could be solved if we had the ability to solve this problem (see &#8220;Computers and Intractability: A Guide to the Theory of NP-Completeness&#8221; Michael R. Garey and David S. Johnson, 1979).  The median of many means is not quite as expressive as the Kth largest subset problem (so we have <em>not</em> proven the median of many means is itself NP hard) but the relation is strong: to even verify that we had the right median we would need to solve a Kth largest subset problem with K=(n choose k)/2 and B = k*the_claimed_median (assuming we modified the subset problem to restrict itself to size k sub-sequences).   In fact, even checking if a given number is one of the possible means (let alone if it is the median of the means) is equivalent to the NP Complete knapsack problem.  This correspondence makes us suspect that something as simple as the trick we devised for the mean of medians problem will not solve the median of means problem.  One should not take this argument too seriously though, as the same argument would seem to apply to the pair of problems &#8220;min of means&#8221; and &#8220;mean of mins&#8221; both of which are in fact easy.  We have not proven the median of means problem to be hard; we have just found relevant algorithms and related problems.  </p>
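<p>To see why the &#8220;min of means&#8221; and &#8220;mean of mins&#8221; pair is easy, here is a small Python sketch (function names invented for this example). The smallest mean comes from the k smallest values, and the mean of mins yields to the same counting trick used for the mean of medians: the value of rank i is the minimum of a selection exactly when the other k-1 members are chosen from the n-i larger values.</p>

```python
from fractions import Fraction
from math import comb


def min_of_means(xs, k):
    """Smallest mean over all k-sized selections: take the k smallest values."""
    return Fraction(sum(sorted(xs)[:k]), k)


def mean_of_mins(xs, k):
    """Mean of the minimums of all k-sized selections, by counting instead of enumerating."""
    xs = sorted(xs)
    n = len(xs)
    # the value of rank i is the minimum of comb(n - i, k - 1) selections
    total = sum(comb(n - i, k - 1) * x for i, x in enumerate(xs, start=1))
    return Fraction(total, comb(n, k))


print(min_of_means([1, 2, 3, 4], 2))  # 3/2
print(mean_of_mins([1, 2, 3, 4], 2))  # 5/3
```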
<p>What theorists do is find these analogs and then do the heavy lifting (continue the work even when the solution is not simple) to find a way to either efficiently solve the problem at hand or prove it is in fact as hard as other important problems.  This kind of thing is hard work for small gains, but the value accumulates because each of these gains is permanent.  Finally additional variations of the problem are tried and characterized, to help check we are not &#8220;leaving money on the table&#8221; (missing nearby improvements).  Some of this may seem like pointless work on toy problems- but every once in a while one of these solutions or characterizations is exactly what is needed to take a large system (like a search engine) to the next level of performance.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Volunteers in Large Clubs: The Theorist&#8217;s View</title>
		<link>http://www.win-vector.com/blog/2009/02/volunteers-in-large-clubs-the-theorists-view/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=volunteers-in-large-clubs-the-theorists-view</link>
		<comments>http://www.win-vector.com/blog/2009/02/volunteers-in-large-clubs-the-theorists-view/#comments</comments>
		<pubDate>Thu, 26 Feb 2009 23:43:56 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Organization]]></category>
		<category><![CDATA[Theorist]]></category>
		<category><![CDATA[Volunteers]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=48</guid>
		<description><![CDATA[I have just posted a new write-up: Volunteers in Large Clubs: The Theorist&#8217;s View. This paper describes some interesting issues in organizing volunteers in a large club and tries to show (without math) how a theoretical computer scientist attacks such problems. Volunteers in Large Clubs: The Theorist&#8217;s View John Mount Date: February 26, 2009 Introduction [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
<li><a href='http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/' rel='bookmark' title='Large Data Logistic Regression (with example Hadoop code)'>Large Data Logistic Regression (with example Hadoop code)</a></li>
<li><a href='http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/' rel='bookmark' title='What is a large enough random sample?'>What is a large enough random sample?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I have just posted a new write-up: <a href="http://www.win-vector.com/papers/volunteer.pdf">Volunteers in Large Clubs: The Theorist&#8217;s View</a>.  This paper describes some interesting issues in organizing volunteers in a large club and tries to show (without math) how a theoretical computer scientist attacks such problems.<span id="more-48"></span></p>
<h1 align="center">Volunteers in Large Clubs: The Theorist&#8217;s View</h1>
<p align="center"><strong>John Mount<a name="tex2html2" href="#foot17"><sup>1</sup></a></strong></p>
<p></p>
<p align="center"><b>Date:</b> February 26, 2009</p>
<hr />
<h1><a name="SECTION00010000000000000000">Introduction</a></h1>
<p>A recurring problem in large clubs and institutions is: how to recruit volunteers for various club projects. As a club gets larger it becomes progressively more difficult for the right groups of people to find each other and coordinate volunteer activities. This loss of effectiveness is often (incorrectly) perceived as lack of interest or uncharitable attitudes among club members. However, the root cause is actually a mathematical problem brought about by the fact that: as a club grows the number of possible pairs of people grows much faster than the number of people.</p>
<p>This problem presents a good opportunity to demonstrate &#8220;how a theoretical computer scientist thinks&#8221; to a general audience. I will work through some interesting aspects of the problem and touch on some ideas that have been used to solve this problem. This write-up isn&#8217;t short but it is intended to be an easy to read walk through (with no math) of how a theoretical computer scientist thinks about this kind of problem.</p>
<h1><a name="SECTION00020000000000000000">Aside: What is a Theoretical Computer Scientist?</a></h1>
<p>Essentially a theoretical computer scientist is a type of mathematician. Theorists (as they are called) use mathematical techniques to study very simple procedures. The mathematics is often very difficult because even simple procedures can have incredibly complex consequences in the long run.</p>
<p>For example: one masterpiece of theoretical computer science is Donald Knuth and Arne Jonassen&#8217;s analysis of a procedure for maintaining a sorted list of just three items.[<a href="volunteer.html#Knuth:1978p1260">8</a>] Notice there is no mention of computers or even of mathematics in the problem: just keeping a list of three items in order. Anyone can keep a list sorted as we ask them to add and remove items- and for just 3 items the procedure is so simple as to be called trivial. However, Knuth and Jonassen required 21 pages of mathematics to precisely calculate how much work is needed to keep the list in order. In addition they were able to convincingly argue that no simpler analysis could find the right answer.</p>
<p>This incredible difference of scale in problem complexity and solution complexity is one of the hallmarks of theoretical computer science. It is also why deep down theorists value simplicity so highly: they know how quickly complexity drives up expense.</p>
<h1><a name="SECTION00030000000000000000">Back To The Problem</a></h1>
<p>Pretend you are running a growing club or volunteer organization. You wish to allow volunteers to form small groups and perform charitable works. You start with two methods to organize volunteers: announcements at the club and having organizers ask individuals to join their groups. For quite a while this works well.</p>
<p>As the club gets larger you expect the amount of service the club can provide to get larger (you have more people, so you have a lot more capacity to do good). This means more small groups. Soon allowing calls for volunteers at your meetings is eating up too much time and organizers have a harder and harder time finding volunteers.</p>
<p>What has happened? Has the club lost its spirit? Are the members becoming uncharitable? No. What has happened is that logistics and communication get progressively more difficult as clubs get larger because as the club grows the number of pairs of people grows much faster than the actual number of people. As the club forms more groups the groups tend to be specialized so each group/task organizer ends up asking more and more of the wrong people (people who would be better suited to another volunteer task) to find matches.</p>
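<p>For readers who do want to see numbers, the growth in pairs is easy to tabulate. A tiny Python sketch (the club sizes are arbitrary examples):</p>

```python
from math import comb


def possible_pairs(members):
    """Number of distinct pairs of people in a club with the given membership."""
    return comb(members, 2)


for members in (10, 100, 1000):
    print(members, possible_pairs(members))  # 10 45, then 100 4950, then 1000 499500
```

<p>A club that grows a hundredfold in members grows roughly ten-thousandfold in possible pairs.</p>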
<h1><a name="SECTION00040000000000000000">Various Solutions</a></h1>
<p>The solution is more organization. But what kind? It is often a surprise which type of solution is best so the theorist usually explores and rejects a large number of different solutions before settling on a method.</p>
<p>Since I am trying to show the theorist&#8217;s way of thinking about this problem I will list a few different methods of organization. The job of the theorist is to have a large ready set of analogous processes and to see if a given problem can be re-cast into one that has a known solution.</p>
<h2><a name="SECTION00041000000000000000">Hierarchy</a></h2>
<p>Companies typically organize in a hierarchy with employees sorted by type of activity. If a company wants to add a task they know fairly precisely which subset of their employees to assign it to.</p>
<p>For volunteer clubs there is usually some hierarchy- but too rigid a structure is undesirable.</p>
<h2><a name="SECTION00042000000000000000">Scrip</a></h2>
<p>Joan and Richard Sweeney described the dynamics of a baby-sitting coop that used scrip (any substitute for currency)[<a href="volunteer.html#Sweeney:1977p2266">11</a>] as its organizing tool. The idea was: by exchanging coupons (or scrip) families could recruit each other as volunteers to babysit for each other. The fascinating thing is that this economic style solution quickly developed all of the complexities of an actual economy (inflation, deflation, currency policy and business cycles). Similar problems could be expected with an auction (or more fashionably a &#8220;mechanism design&#8221;[<a href="volunteer.html#Nisan:2007p621">10</a>]) based approach.</p>
<p>Computer scientists have a catchy phrase for the expected outcome when you dogmatically apply a method to solve a problem: &#8220;now you have two problems&#8221;[<a href="volunteer.html#twoproblems">4</a>].</p>
<h2><a name="SECTION00043000000000000000">Matching Theory</a></h2>
<p>Matching Theory is the idea of organizing the information into two lists: one of potential volunteers and one of tasks. Typically one list is written in a column on the left and the other is a column on the right. From each potential volunteer we draw a line to each and every task they are willing to take on. For example we could have the following three volunteers and tasks:</p>
<div align="center"><img width="600"  align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/diag1.png" alt="Image diag1"/></div>
<p>A &#8220;matching&#8221; is when we remove many of the lines and leave only the assignments. For example the previous diagram of possibilities would allow us to make a complete assignment as given below:</p>
<div align="center"><img width="600"  align="bottom" border="0" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/diag2.png" alt="Image diag2"/></div>
<p>In matching theory any assignment or recruitment of volunteers is called &#8220;a matching.&#8221; There are various ways to attempt to build a matching.</p>
<h3><a name="SECTION00043100000000000000">Random Matching</a></h3>
<p>One method would be to randomly pick pairings that are allowed by the original diagram. The main benefit is that by centralizing the book keeping we have allowed people with more limited availability (in this case Herbert Taylor) to enter their preferences without a lot of communication. However these harder to match people (that are available for fewer tasks) can easily be missed in a random matching. For example Herbert Taylor can only be matched to his task as long as Homer Wood has not already taken it. A very good study of how effective random assignments are is found in [<a href="volunteer.html#Karp:1990p149">7</a>].</p>
<h3><a name="SECTION00043200000000000000">Greedy Matching</a></h3>
<p>Another method is the so-called &#8220;greedy method.&#8221; A greedy method scans the list and (with a very myopic view) picks the best match. For example we could design our greedy method to proceed across the names in order and match each person up with the task they point to that has the fewest remaining edges. We would first match Paul Harris to &#8220;Donate Books&#8221;, then Homer Wood to &#8220;Deliver Meals&#8221; (as this now has only one edge remaining after Paul Harris is removed) and finally Herbert Taylor to &#8220;Give Lecture&#8221;. The method is not always guaranteed to produce a best matching but a very good analysis of how well it does work is found in [<a href="volunteer.html#Dyer:1993p2145">3</a>].</p>
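<p>The greedy rule described above can be sketched in Python. The edge list below is an assumption: it is one set of lines consistent with the walk-through in the text, not a transcription of the diagram, and the function name is invented for this example.</p>

```python
# Hypothetical edge list: a guess consistent with the walk-through, not the original diagram.
WILLING = {
    "Paul Harris": ["Donate Books", "Deliver Meals"],
    "Homer Wood": ["Deliver Meals", "Give Lecture"],
    "Herbert Taylor": ["Give Lecture"],
}


def greedy_matching(willing):
    """Scan volunteers in order; give each the open task with the fewest remaining edges."""
    remaining = {name: list(tasks) for name, tasks in willing.items()}
    matching = {}
    for name in willing:
        taken = set(matching.values())
        open_tasks = [task for task in remaining[name] if task not in taken]
        if open_tasks:
            def edges_left(task):
                # count the lines still drawn to this task
                return sum(task in tasks for tasks in remaining.values())
            matching[name] = min(open_tasks, key=edges_left)
        remaining[name] = []  # this volunteer's lines are now removed
    return matching


print(greedy_matching(WILLING))
```

<p>On this assumed diagram the sketch reproduces the order of assignments given in the text. On other diagrams it can leave volunteers unmatched even when a complete matching exists.</p>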
<h3><a name="SECTION00043300000000000000">On-line Matching</a></h3>
<p>The Karp, Vazirani and Vazirani paper (&#8220;An Optimal Algorithm for On-line Bipartite Matching&#8221;, [<a href="volunteer.html#Karp:1990p149">7</a>]) actually proposes an algorithm that combines many of the ideas we have already mentioned. The authors noticed that the worst thing about the random matching is that it tends to match people who are easy to match too early (instead of holding onto them for later). So instead of matching at random the new algorithm first places all of the potential volunteers into a list in random order. It then inspects the tasks one at a time and always assigns the available volunteer that is highest on the volunteer list. What is going on is that the algorithm is building a preference for matching those that are hard to match early. When we examine a task and see that a volunteer very high on the list is available it means that they must not have been available for very many of the tasks we previously examined (else they would have been already matched). So it is a good idea to match them while we can.<a name="tex2html3" href="#foot43"><sup>2</sup></a></p>
<p>This idea is so powerful that it has been suggested as an improvement to the bidding model used to price Google AdWords[<a href="volunteer.html#Mehta:2007p51">9</a>].</p>
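<p>A minimal Python sketch of the on-line procedure as described above: fix one random order of the volunteers, then handle tasks one at a time, always taking the highest-listed volunteer who is still free and willing. The names, the edge list and the function signature are assumptions made for this example, not taken from the cited paper.</p>

```python
import random

# Hypothetical edge list: which tasks each volunteer is willing to take.
WILLING = {
    "Paul Harris": ["Donate Books", "Deliver Meals"],
    "Homer Wood": ["Deliver Meals", "Give Lecture"],
    "Herbert Taylor": ["Give Lecture"],
}


def online_matching(willing, tasks_in_arrival_order, seed=None):
    """Assign each arriving task to the highest-ranked free volunteer willing to do it."""
    ranked = list(willing)
    random.Random(seed).shuffle(ranked)  # one random order, fixed before any task is seen
    free = set(ranked)
    matching = {}
    for task in tasks_in_arrival_order:
        for name in ranked:  # highest on the list first
            if name in free and task in willing[name]:
                matching[task] = name
                free.remove(name)
                break
    return matching


print(online_matching(WILLING, ["Donate Books", "Deliver Meals", "Give Lecture"], seed=1))
```

<p>Each task is decided when it arrives and never revisited, which is what makes the procedure on-line. The random order only matters when several free volunteers are willing to take the same task.</p>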
<h3><a name="SECTION00043400000000000000">Stable Marriage</a></h3>
<p>Stable marriage theory[<a href="volunteer.html#Gale:1962p2267">5</a>] is an idea that looks at the entire diagram at once and allows us to assign preferences to each matching. In its most general form each potential volunteer submits a list of tasks they are willing to perform and orders them by their preference. Each task organizer also submits a list of volunteers they are willing to accept ordered by their own preferences. The stable marriage algorithm then finds a very special matching called a stable marriage. This matching tends to be very good. In fact this is the algorithm used to assign interns to hospitals.</p>
<p>A stable marriage is an assignment of pairs (a matching) where no assignment swaps are practical. That is if we pick a volunteer and a task the volunteer is not matched to then either the volunteer already has an assignment they like more than the task or the task already has an assignment they like more than the volunteer. Thus there is nobody who wants to trade tasks that can find a task willing to have them.</p>
<p>Gale and Shapley proved a stable marriage always exists and gave an effective algorithm for finding one. However this algorithm is not well suited for situations where there are many tasks volunteers are unwilling to do. Notice, however, what ideas stable marriage shares with the on-line matching procedure (such as use of pre-prepared sorted lists).</p>
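<p>For the curious, here is a compact Python sketch of the proposal procedure usually attributed to Gale and Shapley, with volunteers proposing to tasks. It assumes equal numbers of volunteers and tasks and complete preference lists on both sides. The preference lists are invented for illustration.</p>

```python
def stable_marriage(volunteer_prefs, task_prefs):
    """Volunteers propose in preference order; each task keeps the best proposer so far."""
    rank = {task: {name: r for r, name in enumerate(prefs)}
            for task, prefs in task_prefs.items()}
    next_choice = {name: 0 for name in volunteer_prefs}
    holder = {}  # task -> volunteer it currently holds
    free = list(volunteer_prefs)
    while free:
        name = free.pop()
        task = volunteer_prefs[name][next_choice[name]]
        next_choice[name] += 1
        current = holder.get(task)
        if current is None:
            holder[task] = name
        elif min(name, current, key=rank[task].get) == name:
            holder[task] = name   # the task prefers the newcomer
            free.append(current)  # the displaced volunteer proposes again later
        else:
            free.append(name)     # rejected: try the next task on the list
    return {name: task for task, name in holder.items()}


# Hypothetical preference lists, most preferred first.
VOLUNTEER_PREFS = {
    "Paul Harris": ["Donate Books", "Deliver Meals", "Give Lecture"],
    "Homer Wood": ["Donate Books", "Give Lecture", "Deliver Meals"],
    "Herbert Taylor": ["Give Lecture", "Donate Books", "Deliver Meals"],
}
TASK_PREFS = {
    "Donate Books": ["Homer Wood", "Paul Harris", "Herbert Taylor"],
    "Deliver Meals": ["Paul Harris", "Homer Wood", "Herbert Taylor"],
    "Give Lecture": ["Herbert Taylor", "Homer Wood", "Paul Harris"],
}

print(stable_marriage(VOLUNTEER_PREFS, TASK_PREFS))
```

<p>In the result Paul Harris would rather have &#8220;Donate Books&#8221;, but that task prefers Homer Wood, so no swap is practical. That is exactly the stability condition.</p>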
<h3><a name="SECTION00043500000000000000">Max-Flow</a></h3>
<p>Another &#8220;look at the whole diagram&#8221; idea is called &#8220;maximum-flow&#8221;[<a href="volunteer.html#CLRS00">2</a>]. In this case each line is considered as a directional pipe able to move 1 unit of fluid per hour from the left to the right and each volunteer is a source of 1 unit of fluid per hour and each task is a drain with a capacity of 1 unit of fluid per hour. The maximum flow algorithm can find a minimal subset of pipes that carry as much flow as possible. This minimal configuration is in fact a matching that assigns all tasks and volunteers (as in our second diagram, such a matching is called &#8220;maximal&#8221;).</p>
<p>The maximum flow algorithm can include preference weights (allow some respect of volunteer preferences in addition to allowing each volunteer to mark a subset of tasks they are willing to take). This is a case where the details of how maximum flows are computed are irrelevant; we just need to remember it is possible and see how to encode our problem as a flow.</p>
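<p>When every pipe, source and drain has capacity 1, a maximum flow can be found by repeatedly searching for an augmenting path. The Python sketch below uses that special case, not a general maximum-flow routine, and it ignores preference weights. The edge list is a guess consistent with the earlier examples, ordered so that an early assignment has to be rerouted.</p>

```python
# Hypothetical edge list, ordered so that the first choices collide.
WILLING = {
    "Paul Harris": ["Deliver Meals", "Donate Books"],
    "Homer Wood": ["Give Lecture", "Deliver Meals"],
    "Herbert Taylor": ["Give Lecture"],
}


def maximum_matching(willing):
    """Augmenting-path search: the unit-capacity special case of maximum flow."""
    task_holder = {}  # task -> volunteer currently assigned to it

    def try_assign(name, visited):
        for task in willing[name]:
            if task in visited:
                continue
            visited.add(task)
            # Use the task if it is free, or if its holder can be rerouted elsewhere.
            if task not in task_holder or try_assign(task_holder[task], visited):
                task_holder[task] = name
                return True
        return False

    for name in willing:
        try_assign(name, set())
    return {name: task for task, name in task_holder.items()}


print(maximum_matching(WILLING))
```

<p>Here Homer Wood first takes &#8220;Give Lecture&#8221;. When Herbert Taylor needs it, Homer Wood is rerouted to &#8220;Deliver Meals&#8221; and Paul Harris to &#8220;Donate Books&#8221;, so everyone ends up assigned.</p>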
<h2><a name="SECTION00044000000000000000">The Lieberman Queue</a></h2>
<p>The graduate students of Carnegie Mellon&#8217;s School of Computer Science have, for over 25 years, used a system called &#8220;the Lieberman Queue&#8221; (named after its inventor Bob Lieberman)[<a href="volunteer.html#Hancock:2009p2253">6</a>] to organize and encourage volunteerism.</p>
<p>The principles are: we take volunteerism as a responsibility and getting the task you want as a mere privilege. The queue is just a sorted list of the graduate students. A second list of tasks sorted by when they are needed is also maintained. Both lists are publicly available. Anybody can volunteer for any task at any time. When you complete a task your name is moved to the bottom of the queue (causing the people you pass to each move up one position in the queue). The final point is that if you reach the top of the queue (which happens if you volunteer for tasks at a significantly slower rate than your peers) the queue manager (originally Bob Lieberman) can forcibly select you for a task. So it is in your interest to periodically inspect the available task list and see if there is one you would like to do (to stay away from the top of the queue).</p>
<p>You can see this system cuts down immensely on the required amount of communication. Each member only needs to check the public task list every once in a while and the queue manager is always either accepting volunteers or forcibly assigning tasks (so there are no &#8220;declines&#8221;).</p>
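<p>The bookkeeping behind such a queue is small enough to sketch in Python. This is only an illustration of the rules as described here, not the actual system. The class name, method names and member names are invented, and it assumes a forcibly selected member also drops to the bottom once the task is done.</p>

```python
from collections import deque


class VolunteerQueue:
    """Members near the top (index 0) are the next to be drafted."""

    def __init__(self, members):
        self.queue = deque(members)

    def volunteer(self, name):
        """A member completes a task of their own choosing and drops to the bottom."""
        self.queue.remove(name)
        self.queue.append(name)

    def forcibly_assign(self):
        """The queue manager drafts whoever is at the top; they then drop to the bottom."""
        name = self.queue[0]
        self.queue.rotate(-1)
        return name


club = VolunteerQueue(["Ann", "Bob", "Cy"])  # hypothetical members
club.volunteer("Bob")                        # order is now Ann, Cy, Bob
print(club.forcibly_assign())                # Ann
print(list(club.queue))                      # ['Cy', 'Bob', 'Ann']
```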
<p>It must be admitted this system is fairly radical. First the system is very normative (imputes ethical judgements on actions) and is in fact a (benevolent) shame culture. This may not be appropriate for many organizations and may not be appropriate when the task frequency is high.</p>
<h1><a name="SECTION00050000000000000000">Conclusion</a></h1>
<p>There are a number of ways of improving the quality of volunteerism in a club. Most of the ones I explored involve some form of central tracking of volunteers and tasks and replace inefficient direct communication with sorting performed on the abstract records.</p>
<p>A number of important research fields can have their spirit summed up in a single sentence. Rudolf Beran said: &#8220;statistics is the study of algorithms for data analysis&#8221;[<a href="volunteer.html#Beran:2003p2262">1</a>]. Stanislaw Ulam said: &#8220;the best mathematicians see analogies between analogies.&#8221; To this I would like to add: &#8220;theorists build analogies between processes.&#8221;</p>
<h2><a name="SECTION00060000000000000000">Bibliography</a></h2>
<dl compact>
<dt><a name="Beran:2003p2262">1</a></dt>
<dd>B<small>ERAN,</small> R.<br />
The impact of the bootstrap on statistical algorithms and theory.<br />
<em>Statistical Science 18</em> (2003), 175-184.</dd>
<dt><a name="CLRS00">2</a></dt>
<dd>C<small>ORMEN,</small> T.&nbsp;H., L<small>EISERSON,</small> C.&nbsp;E., R<small>IVEST,</small> R.&nbsp;L., <small>AND</small> S<small>TEIN,</small> C.<br />
<em>Introduction to Algorithms, 2nd edition</em>.<br />
MIT Press, McGraw-Hill Book Company, 2000.</dd>
<dt><a name="Dyer:1993p2145">3</a></dt>
<dd>D<small>YER,</small> M., F<small>RIEZE,</small> A.&nbsp;M., <small>AND</small> P<small>ITTEL,</small> B.&nbsp;G.<br />
The average performance of the greedy matching algorithm.<br />
<em>The Annals of Applied Probability 3</em>, 2 (1993), 526-552.</dd>
<dt><a name="twoproblems">4</a></dt>
<dd>F<small>RIEDL,</small> J.<br />
Source of the famous &quot;now you have two problems&quot; quote.<br />
<tt><a name="tex2html4" href="http://regex.info/blog/2006-09-15/247#comment-3085">http://regex.info/blog/2006-09-15/247#comment-3085</a></tt>.</dd>
<dt><a name="Gale:1962p2267">5</a></dt>
<dd>G<small>ALE,</small> D., <small>AND</small> S<small>HAPLEY,</small> L.&nbsp;S.<br />
College admissions and the stability of marriage.<br />
<em>The American Mathematical Monthly 69</em>, 1 (Jan 1962), 9-15.</dd>
<dt><a name="Hancock:2009p2253">6</a></dt>
<dd>H<small>ANCOCK,</small> J.<br />
Vasc queue description.<br />
<tt><a name="tex2html5" href="http://vasc.ri.cmu.edu/old_help/Admin/Queue/descrip.html">http://vasc.ri.cmu.edu/old_help/Admin/Queue/descrip.html</a></tt>.</dd>
<dt><a name="Karp:1990p149">7</a></dt>
<dd>K<small>ARP,</small> R.&nbsp;M., V<small>AZIRANI,</small> U.&nbsp;V., <small>AND</small> V<small>AZIRANI,</small> V.&nbsp;V.<br />
An optimal algorithm for on-line bipartite matching.<br />
<em>STOC 22</em> (1990), 352-358.</dd>
<dt><a name="Knuth:1978p1260">8</a></dt>
<dd>K<small>NUTH,</small> D.&nbsp;E., <small>AND</small> J<small>ONASSEN,</small> A.&nbsp;T.<br />
A trivial algorithm whose analysis isn&#8217;t.<br />
<em>Journal of Computer and System Sciences 16</em> (1978), 301-322.</dd>
<dt><a name="Mehta:2007p51">9</a></dt>
<dd>M<small>EHTA,</small> A., S<small>ABERI,</small> A., V<small>AZIRANI,</small> U.&nbsp;V., <small>AND</small> V<small>AZIRANI,</small> V.&nbsp;V.<br />
Adwords and generalized on-line matching.<br />
<em>Journal of the ACM 54</em>, 5 (2007).</dd>
<dt><a name="Nisan:2007p621">10</a></dt>
<dd>N<small>ISAN,</small> N.<br />
Introduction to mechanism design (for computer scientists).<br />
<em>(Book) Algorithmic Game Theory</em> (2007).</dd>
<dt><a name="Sweeney:1977p2266">11</a></dt>
<dd>S<small>WEENEY,</small> J., <small>AND</small> S<small>WEENEY,</small> R.&nbsp;J.<br />
Monetary theory and the great capitol hill baby sitting co-op crisis.<br />
<em>Journal of Money, Credit and Banking 9</em>, 1 (Feb 1977), 86-89.</dd>
</dl>
<hr />
<h4>Footnotes</h4>
<dl>
<dt><a name="foot17">&#8230; Mount</a><a href="volunteer.html#tex2html2"><sup>1</sup></a></dt>
<dd>http://www.mzlabs.com/</dd>
<dt><a name="foot43">&#8230; can.</a><a href="volunteer.html#tex2html3"><sup>2</sup></a></dt>
<dd>That is the intuition. As is often the case in theoretical computer science the intuition is in fact too hard to work with and the proof that the technique is good has to proceed on a longer and more difficult path. Also it is typical of theoretical computer science that the algorithm being analysed is much simpler than a number of obviously better heuristics. The heuristics may be better, but any improvement that yields too much complexity interferes with analysis.</dd>
</dl>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/02/volunteers-in-large-clubs-the-theorists-view/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

