<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog &#187; Data Mining</title>
	<atom:link href="http://www.win-vector.com/blog/tag/data-mining/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Sat, 04 Feb 2012 17:42:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Your Data is Never the Right Shape</title>
		<link>http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=your-data-is-never-the-right-shape</link>
		<comments>http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/#comments</comments>
		<pubDate>Sun, 31 Jul 2011 20:27:00 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[plyr]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[reshape]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1687</guid>
		<description><![CDATA[One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>One of the recurring frustrations in data analytics is that your data is never in the right shape.  Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want.  Best case: you notice this and have the tools to reshape your data.  </p>
<p>There is no final &#8220;right shape.&#8221;  In fact, even your data is never right.  You will always be called on to redo your analysis (new variables, new data, corrections), so you should always understand that you are on your &#8220;penultimate analysis&#8221; (there is always one more to come).  This is why we insist on using general methods and scripted techniques: they are much easier to reapply reliably to new data than GUI/WYSIWYG techniques.</p>
<p>In this article we will work a small example and call out some <a target="_blank" href="http://cran.r-project.org/">R</a> tools that make reshaping your data much easier.  The idea is to think in terms of &#8220;relational algebra&#8221; (like SQL) and transform your data towards your tools (and not to attempt to adapt your tools towards the data in an ad-hoc manner).<span id="more-1687"></span>Take a simple example where you are designing a new score called &#8220;<code>score2</code>&#8221; to predict or track an already known value called &#8220;<code>score1</code>.&#8221;  The typical situation is <code>score1</code> is a future outcome (such as the number of dollars profit on a transaction) and <code>score2</code> is a prediction (such as the estimated profit before the transaction is attempted).  Training data is usually assembled by performing a large number of transactions, recording what was known before the transaction and then aligning or joining this data with measured results when they become available.  For this example we are not interested in the inputs driving the model (a rare situation, but we are trying to make our example as simple as possible) but only examining the quality of <code>score2</code> (which is defined as how well it tracks <code>score1</code>).</p>
<p>All of this example will be in R, but the principles apply more generally.  First let us enter some example data:</p>
<p><code><br />
<br/> &gt; d &lt;- data.frame(id=c(1,2,3,1,2,3),score1=c(17,5,6,10,13,7),score2=c(13,10,5,13,10,5))<br />
<br/> &gt; d<br />
</code></p>
<p>This gives us our example data.  Each row is numbered (1 through 6) and has an <code>id</code> and both of our scores:</p>
<pre>
  id score1 score2
1  1     17     13
2  2      5     10
3  3      6      5
4  1     10     13
5  2     13     10
6  3      7      5
</pre>
<p>We said our only task was to characterize how well <code>score2</code> works at predicting <code>score1</code> (or how good a substitute <code>score2</code> is for <code>score1</code>).  We could compute correlation, RMS error, info-gain or some such.  But instead let&#8217;s look at this graphically.  We will prepare a graph showing how well <code>score1</code> is represented by <code>score2</code>.  For this we choose to place <code>score1</code> on the y-axis (as it is the outcome) and <code>score2</code> on the x-axis (as it is the driver).</p>
<p><code><br />
<br/> &gt; library(ggplot2)<br />
<br/> &gt; ggplot(d) + geom_point(aes(x=score2,y=score1))<br />
</code></p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/plot1.png" alt="plot1.png" border="0" width="525" height="525" /></p>
<p>Figure 1: <code>score1</code> as a function of  <code>score2</code>.</p>
<p></center></p>
<p>This does not look good.  We would have liked to see all of the dots falling on the line &#8220;y=x.&#8221;  This plot shows <code>score2</code> is not predicting <code>score1</code> very well.  Part of this is that we missed an important feature of the data (and because we missed it the feature becomes a problem): the <code>id</code>s repeat.  First we re-order by <code>id</code> to make this more obvious.</p>
<p><code><br />
<br/> &gt; dsort &lt;- d[order(d$id),]<br />
<br/> &gt; dsort<br />
</code></p>
<pre>
  id score1 score2
1  1     17     13
4  1     10     13
2  2      5     10
5  2     13     10
3  3      6      5
6  3      7      5
</pre>
<p>This is a very common situation.  The original score is not completely a function of the known inputs.  We are using &#8220;<code>id</code>&#8221; to abstractly represent all of the inputs: two rows in our example have the same <code>id</code> if and only if all known inputs are exactly the same.  The repeating <code>id</code>s are the same experiment run at different times (a good idea), and the variation in <code>score1</code> could be the effect of an un-modeled input that changed value or something simple like a &#8220;noise term&#8221; (a random un-modeled effect).  Notice that <code>score2</code> is behaving as a function of <code>id</code>: all rows with the same <code>id</code> have the same value for <code>score2</code>.  If <code>score2</code> is a model then it has to be a function of the inputs (or, more precisely, if it is not a function of the inputs you have done something wrong).  So any variation of <code>score1</code> between rows with identical <code>id</code> is &#8220;unexplainable variation&#8221; (unexplainable from the point of view of currently tracked inputs).  You should know about, characterize and report this variation (which is why it is good to have some repeated experiments).  But this variation is not the model&#8217;s fault.  If we want to know how good a job we did constructing the model (which we now see can be a slightly different question than how well the model works at prediction) we need to see how much of the explainable variation the model accounts for.</p>
<p>If we assume (as is traditional) the unexplained variation is from an &#8220;unbiased noise source&#8221; then we can lessen the impact of the noise source by replacing <code>score1</code> with a value averaged over rows with the same <code>id</code>.  This assumption is traditional because an unbiased noise source is present in many problems and assuming anything more requires more research into the problem domain.  You would eventually fold such research into your model, so your goal is always to have all effects or biases in your model and hope that what is left over is unbiased.  This is usually not strictly true, but not accounting for the unexplained variation at all is in many cases even worse than modeling the unexplained variation as being bias-free.</p>
<p>And now we find our data is the &#8220;wrong shape.&#8221;  To replace <code>score1</code> with the appropriate averages we need to do some significant data manipulation.  We need to group sets of rows and add new columns. We could do this imperatively (write some loops and design some variables to track and manipulate state) or declaratively (find a path of operations from what you have to what you need through R&#8217;s data manipulation algebra).  Even though the declarative method is more work the first time (you could often write the code in less time than it takes to read the manuals) it is the right way to go (as it is more versatile and powerful in the end).</p>
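<p>As a rough illustration of the two styles (a sketch only, using base R and the example data from above; the names <code>means_loop</code> and <code>dmean_base</code> are ours and not part of any package), compare an explicit loop with a single grouped aggregation:</p>
<pre>
# Example data, repeated so this sketch runs on its own
d &lt;- data.frame(id=c(1,2,3,1,2,3),
                score1=c(17,5,6,10,13,7),
                score2=c(13,10,5,13,10,5))

# Imperative: loop over the ids and track the results by hand
ids &lt;- sort(unique(d$id))
means_loop &lt;- numeric(length(ids))
for(i in seq_along(ids)) {
  means_loop[i] &lt;- mean(d$score1[d$id == ids[i]])
}

# Declarative: one grouped aggregation over all score columns
dmean_base &lt;- aggregate(cbind(score1, score2) ~ id, data=d, FUN=mean)
dmean_base
</pre>
<p>Both routes arrive at the same per-<code>id</code> averages of <code>score1</code>.  The loop has to be rewritten for every new column or grouping, while the aggregation step only needs a different formula.</p>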
<p>Luckily we don&#8217;t have to use raw R.  There are a number of remarkable packages (all by <a target="_blank" href="http://had.co.nz/">Hadley Wickham</a> who is also the author of the <a target="_blank" href="http://had.co.nz/ggplot2/">ggplot2</a> package we use to prepare our figures) that really improve R&#8217;s ability to coherently manage data.  The easiest (on us) way to fix up our data is to make the computer work hard and use the powerful melt/cast technique.  These functions are found in the libraries <a target="_blank" href="http://www.jstatsoft.org/v21/i12/paper">reshape</a> and <a target="_blank" href="http://www.jstatsoft.org/v40/i01/paper">plyr</a> (which were automatically loaded when we loaded the ggplot2 library).</p>
<p>melt is a bit abstract.  What it does is convert your data into a &#8220;narrow&#8221; format where each row is split into many rows, each carrying just one result column of the original row.  For example we can melt our data by <code>id</code> as follows:</p>
<p><code><br />
<br/> &gt; dmelt &lt;- melt(dsort,id.vars=c('id'))<br />
<br/> &gt; dmelt<br />
</code></p>
<p>Which yields the following:</p>
<pre>
   id variable value
1   1   score1    17
2   1   score1    10
3   2   score1     5
4   2   score1    13
5   3   score1     6
6   3   score1     7
7   1   score2    13
8   1   score2    13
9   2   score2    10
10  2   score2    10
11  3   score2     5
12  3   score2     5
</pre>
<p>Each of the two facts (<code>score1</code>, <code>score2</code>) from our original row is split into its own row.  The <code>id</code> column plus the new variable column are now considered to be keys.  This format is rarely used directly; it is useful because it is easy to express important data transformations in terms of it.  For instance we wanted our table to have duplicate rows collected and <code>score1</code> replaced by its average (to attempt to remove the unexplainable variation).  That is now easy:</p>
<p><code><br />
<br/> &gt; dmean &lt;- cast(dmelt,fun.aggregate=mean)<br />
<br/> &gt; dmean<br />
</code></p>
<pre>
  id score1 score2
1  1   13.5     13
2  2    9.0     10
3  3    6.5      5
</pre>
<p>We used <code>cast()</code> in its default mode, where it assumes all columns not equal to &#8220;value&#8221; are the keyset.  It then collects all rows with identical keying and combines them back into wide rows using mean or average as the function to deal with duplicates.  Notice <code>score1</code> is now the desired average, and <code>score2</code> is as before (as it was a function of the keys or inputs, so it is not affected by averaging).  With this new smaller data set we can re-try our original graph:</p>
<p><code><br />
<br/> &gt; ggplot(dmean) + geom_point(aes(x=score2,y=score1))<br />
</code></p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/plot2.png" alt="plot2.png" border="0" width="525" height="525" /></p>
<p>Figure 2: <code>mean(score1)</code> as a function of  <code>score2</code>.</p>
<p></center></p>
<p>This doesn&#8217;t look so bad.  A lot of the error or variation in the first plot was unexplainable variation.  <code>score2</code> isn&#8217;t bad given its inputs.  If you wanted to do better than <code>score2</code> you would be advised to find more modeling inputs (versus trying more exotic modeling techniques).</p>
<p>Of course a client or user is not interested if <code>score2</code> is &#8220;best possible.&#8221;  They want to know if it is any good.  To do this we should show them (either by graph or by quantitative summary statistics like we mentioned earlier) at least 3 things:</p>
<ol>
<li>How well the model predicts overall (the very first graph we presented).</li>
<li>How much of the explainable variation the model predicts (the second graph).</li>
<li>The nature of the unexplained variation (which we will explore next).</li>
</ol>
<p>We said earlier we are hoping the unexplained variation is noise (or if it is not noise it would be nice if it is a clue to new important modeling features).  So the unexplained variation must not go unexamined.  We will finish by showing how to characterize the unexplained variation.  As before we will just make a graph, but the data preparation steps would be exactly the same if we were using a quantitative summary (like correlation, or any other).  And, of course, our data is still not the right shape for this step.  Luckily there is another tool ready to fix this: <code>join()</code>.</p>
<p><code><br />
<br/> &gt; djoin &lt;- join(dsort,dsort,'id')<br />
<br/> &gt; fixnames &lt;- function(cn) {<br />
     n &lt;- length(cn);<br />
     for(i in 2:((n+1)/2)) { cn[i] &lt;- paste('a',cn[i],sep='') };<br />
     for(i in ((n+3)/2):n) { cn[i] &lt;- paste('b',cn[i],sep='') };<br />
     cn<br />
  }<br />
<br/> &gt; colnames(djoin) &lt;- fixnames(colnames(djoin))<br />
<br/> &gt; djoin<br />
</code></p>
<p>which produces:</p>
<pre>
   id ascore1 ascore2 bscore1 bscore2
1   1      17      13      17      13
2   1      17      13      10      13
3   1      10      13      17      13
4   1      10      13      10      13
5   2       5      10       5      10
6   2       5      10      13      10
7   2      13      10       5      10
8   2      13      10      13      10
9   3       6       5       6       5
10  3       6       5       7       5
11  3       7       5       6       5
12  3       7       5       7       5
</pre>
<p>All of the work was done by the single line &#8220;<code>djoin &lt;- join(dsort,dsort,'id')</code>&#8221;; the rest was just fixing the column names (as self-join is not the central use case of join).  What we have now is a table that is exactly right for studying unexplained variation.  For each <code>id</code> we have every row matched with every row that has the same <code>id</code>.  This expands every <code>id</code> from having 2 rows in <code>dsort</code> to 4 rows in <code>djoin</code>.  Notice this gives us every pair of <code>score1</code> values seen for the same <code>id</code> (which will let us examine unexplained variation) and <code>score2</code> is still constant over all rows with the same <code>id</code> (as it has been throughout our analysis).  With this table we can now plot how <code>score1</code> varies for rows with the same <code>id</code>:</p>
<p><code><br />
<br/> &gt; ggplot(djoin) + geom_point(aes(x=ascore1,y=bscore1))<br />
</code></p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/unex.png" alt="unex.png" border="0" width="525" height="525" /></p>
<p>Figure 3: <code>score1</code> as a function of  <code>score1</code>.</p>
<p></center></p>
<p>And we can see, as we expected, the unexplained variation in <code>score1</code> is about as large as the mismatch between <code>score1</code> and <code>score2</code> in our original plot.  The important thing is this is all about <code>score1</code> (<code>score2</code> is now literally out of the picture).  The analyst&#8217;s job would now be to try and tie bits of the unexplained variation to new inputs (that can be folded into a new <code>score2</code>) and/or characterize the noise term (so the customer knows how close they should expect repeated experiments to be).</p>
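<p>The same comparison can be put into numbers.  The following is a minimal sketch in base R, assuming RMS error is an acceptable summary for this data; the helper <code>rms()</code> and the variable names are ours, chosen only for illustration:</p>
<pre>
# Example data, repeated so this sketch runs on its own
d &lt;- data.frame(id=c(1,2,3,1,2,3),
                score1=c(17,5,6,10,13,7),
                score2=c(13,10,5,13,10,5))

rms &lt;- function(x) sqrt(mean(x^2))

# Mean of score1 within each id, aligned with the rows of d
group_mean &lt;- ave(d$score1, d$id)

total_err &lt;- rms(d$score1 - d$score2)    # overall mismatch, about 3.27
noise_err &lt;- rms(d$score1 - group_mean)  # within-id variation, about 3.08
model_err &lt;- rms(group_mean - d$score2)  # mismatch after averaging, about 1.08

c(total=total_err, noise=noise_err, model=model_err)
</pre>
<p>Because <code>score2</code> is constant within each <code>id</code>, the squared errors add up: <code>total_err^2</code> equals <code>noise_err^2 + model_err^2</code>.  On this toy data most of the overall mismatch is within-<code>id</code> variation that no function of the tracked inputs could remove.</p>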
<p>What we are trying to encourage with the use of &#8220;big hammer tools&#8221; is an ability and willingness to look at and transform your data in meaningful steps.  It often seems easier and more efficient to build one more piece of data tubing, but a lot of data tubes become an unmanageable collection of spaghetti code.  The analyst should, in some sense, always be looking at data and not looking at coding details.  For this sort of analysis we encourage analysts to think in terms of &#8220;data shape&#8221; and transforms.  This discipline leaves more of the analyst&#8217;s energy and attention to think productively about the data and actual problem domain.</p>
<hr />
<p>Note:</p>
<p>For the third plot showing the variation of <code>score1</code> across different rows (but same <code>id</code>s) it may be appropriate to use a slightly more complicated <code>join()</code> procedure than we showed.  The join shown produced rows of artificial agreement where both values of <code>score1</code> came from the same row (thus had no chance of being different, so in some sense deserve no credit).  This is also the only way any non-duplicated evaluations could make it to the plot.  To eliminate these uninteresting agreements from the plot do the following:</p>
<p><code><br />
<br/> &gt; d$rowNumber &lt;- 1:(dim(d)[1])<br />
<br/> &gt; djoin &lt;- join(d,d,'id')<br />
<br/> &gt; colnames(djoin) &lt;- fixnames(colnames(djoin))<br />
<br/> &gt; djoin &lt;- djoin[djoin$arowNumber!=djoin$browNumber,]<br />
<br/> &gt; djoin<br />
</code></p>
<p>This gives us a table that shows only values of <code>score1</code> from different rows:</p>
<pre>
   id ascore1 ascore2 arowNumber bscore1 bscore2 browNumber
2   1      17      13          1      10      13          4
4   2       5      10          2      13      10          5
6   3       6       5          3       7       5          6
7   1      10      13          4      17      13          1
9   2      13      10          5       5      10          2
11  3       7       5          6       6       5          3
</pre>
<p>And only plots points on the diagonal if &#8220;you have really earned them&#8221;:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2011/07/fig4.png" alt="fig4.png" border="0" width="525" height="525" /><br />
</center></p>
<p>So while the direct <code>join()</code> may not be the immediate perfect answer, it is still a good intermediate to form, as what you want is only a simple data transformation away from it.</p>
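<p>For readers without plyr at hand, a similar self-join can be sketched with base R&#8217;s <code>merge()</code>, whose <code>suffixes</code> argument names the duplicated columns and so stands in for <code>fixnames()</code>.  This is an alternative illustration, not the method used above; the name <code>dpairs</code> and the suffixes <code>.a</code> and <code>.b</code> are our own choices:</p>
<pre>
# Example data, repeated so this sketch runs on its own
d &lt;- data.frame(id=c(1,2,3,1,2,3),
                score1=c(17,5,6,10,13,7),
                score2=c(13,10,5,13,10,5))
d$rowNumber &lt;- seq_len(nrow(d))

# Self-join on id; non-key columns get the suffixes .a and .b
dpairs &lt;- merge(d, d, by="id", suffixes=c(".a", ".b"))

# Drop the pairs where a row was matched with itself
dpairs &lt;- dpairs[dpairs$rowNumber.a != dpairs$rowNumber.b, ]
dpairs
</pre>
<p>The self-join has 12 rows before filtering and 6 after, the same counts as the <code>join()</code> tables above, though the row order and column names differ.</p>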
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2009/08/good-graphs-graphical-perception-and-data-visualization/' rel='bookmark' title='Good Graphs: Graphical Perception and Data Visualization'>Good Graphs: Graphical Perception and Data Visualization</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A Personal Perspective on Machine Learning</title>
		<link>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=a-personal-perspective-on-machine-learning</link>
		<comments>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/#comments</comments>
		<pubDate>Sun, 31 Oct 2010 21:45:48 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[History]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1551</guid>
		<description><![CDATA[Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Having a bit of history as both a user of machine learning and a researcher in the field I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence.  I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature.<span id="more-1551"></span><br />
In the early days <a target="_blank" href="http://en.wikipedia.org/wiki/Machine_learning">machine learning</a> and artificial intelligence were famous for promising far too much and delivering far too little.  This has changed.  Artificial decision and reasoning systems are now everywhere.  One of the things masking the breadth and authority of artificial intelligence is the current prejudice: &#8220;if a system is well understood or works then it is no longer called artificial intelligence.&#8221;  A working system becomes a database, expert system, rules engine, machine learning platform, analytics dashboard, pattern recognition system or statistics warehouse.  We are clearly nowhere near building a conversational intelligence (like Hal from 2001 or <a target="_blank" href="http://mzlabs.com/MZLabsJM/page6/Gerty/Gerty.html">Gerty</a> from Moon).  Yet every day machines decide if your credit card is accepted, advise on medical care, route goods, curate information and control vast industrial plants.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Hal-9000.jpg" alt="Hal-9000.jpg" border="0" width="150" height="150" /><br />
<br/>Hal 9000<br />
</center></p>
<p>There have been vast improvements in artificial intelligence.  Much of the improvement has been driven by the engineering effects of Moore&#8217;s Law (resulting in my mobile phone&#8217;s processor having 12 times the clock speed and over 32 times the memory of an $8 million <a target="_blank" href="http://en.wikipedia.org/wiki/Cray-1">Cray 1 supercomputer</a>) and by significant machine learning research results.  These changes in machine size happened during the productive careers of many researchers, so ideas are often evaluated at a series of radically different machine capabilities and data scales.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Cray-1-deutsches-museum.jpg" alt="Cray-1-deutsches-museum.jpg" border="0" width="487" height="536" /><br />
<br/>Cray 1<br />
</center></p>
<p>von Neumann himself commented that scale was a major limiting factor in early computers.  He asked how you could be expected to achieve anything significant even from a roomful of geniuses if (as with his early computers) all notes, communication and memory were limited to less than a single typed page.  von Neumann&#8217;s comment stands in contrast to science fiction scientists and early boosters of artificial intelligence who always seem to be in awe of their own creations.  Computers are certainly much larger now, but we need to be humble and put off deciding if we are yet in the era of large computers (compared to human or animal brains).  Everything we are doing now may still just be artificial intelligence&#8217;s pre-history and prologue.  Feynman in his lectures on computation mentions that RNA transcription can be estimated to take around 100 kT of energy to transcribe a bit while a transistor may easily use 100,000,000 kT energy units to switch states.  This means for the amount of heat the human head dissipates (energy supply and heat dissipation are rapidly becoming the most relevant measures of computational power) you could do a million times more work using RNA techniques (if you knew how) than with transistors.  So computers may not yet be what we should call large (though they are likely getting there).  What we currently call <a target="_blank" href="http://www.win-vector.com/blog/2009/08/on-the-hysteria-over-the-cloud/">&#8220;datacenters&#8221;</a> are in fact block-sized computers (consuming an enormous amount of energy and dissipating a huge amount of heat).</p>
<p><center><br />
<img  target="_blank" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/google_data_center_lenoir.jpg" alt="google_data_center_lenoir.jpg" border="0" width="512" height="355" /><br />
<br/>A datacenter (or a block-sized computer)<br />
</center></p>
<p>Not all improvements in machine intelligence have come from (or are to come from) improvements in hardware.  Many of the improvements came from machine learning research results and these are what I will outline below.</p>
<p>Early machine learning algorithms were driven by analogy.  This led us to perceptrons (1957, fairly early in the history of computer science) and neural nets.  These methods have their successes but were largely overused, and were developed before researchers had a good list of desirable properties of a machine learning method.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/220px-Neural_network_example.svg_.png" alt="220px-Neural_network_example.svg.png" border="0" width="220" height="293" /><br />
<br/>Neural Net diagram<br />
</center></p>
<p>These methods live on but are,  in my opinion, not currently competitive.  Some of their important ideas and contributions have been revived from time to time, such as the online update rules becoming what we now call stochastic gradients.</p>
<p>A list of (often incompatible) desirable properties of a machine learning algorithm is the following:</p>
<ul>
<li>Able to represent complicated functions</li>
<li>Good generalization performance (quality predictions on data not seen during training)</li>
<li>Unique optimal model for a given set of data and feature definitions</li>
<li>Efficient and well characterized solution method</li>
<li>Consistent summary statistics</li>
<li>Preference for simple models</li>
</ul>
<p>We divert from this list for a bit of background and context.</p>
<p>The neural net was largely celebrated for its ability to represent complex functions and the perceived efficiency of its newer back-propagation based training method (related to the <a target="_blank" href="http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/">efficient calculation of gradients</a>).  The downsides were you never knew if your neural net was the right one (even assuming you had the right features, layout and training data) and could not be sure you were biasing towards simple models that might perform well on novel queries.  Great effort was expended in extending neural nets based on the supposition they should work as they were an analogy to how we imagined biological neurons might function.  An almost mystic hope was derived from the non-linear nature and special properties of the sigmoid curve (which was in fact a curve already known to statisticians).</p>
<p>Methods other than neural nets also had early success.  The field of information retrieval (which was not &#8220;sexy&#8221; prior to the Web) has had huge success since the 1960s with <a target="_blank" href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Rocchio_Classification">Rocchio Classification</a>, and <a target="_blank" href="http://en.wikipedia.org/wiki/Tf–idf">TF/IDF</a> methods.  The early success of these methods may have in fact delayed research on current hot research areas such as segmentation and author topic models.</p>
<p>Theoretical computer science initially sought to characterize machine learning methods in non-statistical language.  In the 1980s a great amount of ink was spilled on &#8220;learning boolean functions.&#8221;  Papers proving nothing was learnable (by picking a function related to cryptography) alternated with papers proving everything was learnable (for example via amplification techniques like boosting).  Generalization of models to new data remained a theoretical problem that was dealt with by appeals to model complexity and <a target="_blank" href="http://en.wikipedia.org/wiki/Minimum_description_length">MDL</a> (minimum description length).  A major breakthrough was the <a target="_blank" href="http://en.wikipedia.org/wiki/Probably_approximately_correct_learning">PAC model</a> (the probably approximately correct framework), which finally allowed direct treatment of generalization performance.</p>
<p>We now have enough context to discuss some of the current best-of-breed machine learning techniques (that address many of the desired properties mentioned above):</p>
<ul>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine">Support Vector Machines</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Kernel_methods">Kernel Methods</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Logistic_regression">Logistic Regression</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Principle_of_maximum_entropy">Maximum Entropy Methods</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">Regularization</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Graphical_model">Graphical Models</a></li>
<li><a target="_blank" href="http://en.wikipedia.org/wiki/Conditional_random_field">Conditional Random Fields</a></li>
</ul>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/556px-Svm_max_sep_hyperplane_with_margin.png" alt="556px-Svm_max_sep_hyperplane_with_margin.png" border="0" width="278" /><br />
<br/><br />
Typical SVM maximum margin diagram<br />
</center></p>
<p>Not all of these methods are new (Logistic Regression for example dates from 1925 and is itself based on regression which goes back to Gauss).  But the concerns these methods address are all much more statistical than artificial intelligence in nature.  For example we don&#8217;t  suppose that there is some cryptographically obscured combination of features that we need to find to make the best prediction.  We instead worry about detecting which features are useful and note that it is a significant (though solvable) problem to correctly use combinations of useful features (phrased as statistical concerns: feature to feature dependencies and higher order interactions).  Machine learning has always run where statisticians fear to tread.   But more and  more often we are seeing that the methods and concerns of statisticians are what are needed to achieve many of the listed desired properties of machine learning models.</p>
<p>The methods I have singled out for praise are very effective and achieve a number of our listed desired properties.  For example:  both logistic regression and maximum entropy have a unique solution that is easy to find.  They are also both consistent with all summaries known during training.  That is: if 30% of the positive training data has a feature present then 30% of the data also has the feature present when weighted by the model&#8217;s score (so the model score shares a lot of properties with training truth).  Support Vector Machines also have well understood solutions and a theory (called maximum margin) that directly addresses generalization (good predictions on new data).  Kernel Methods (both as used in SVMs and elsewhere) allow controlled introduction of very complex functions.  Graphical Models and Conditional Random Fields also allow the controlled introduction of modeled dependencies in the data.</p>
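<p>The "consistent with all summaries" property above can be written out as a short worked equation. This is a sketch in notation that is not in the original post: y_i in {0,1} is the training label, p_i is the fitted model score and x_{ij} is the value of feature j on example i. It assumes an unregularized maximum likelihood fit that includes an intercept term.</p>

```latex
% Setting the gradient of the logistic log-likelihood to zero gives, for every feature j:
\sum_i x_{ij}\,(y_i - p_i) = 0
\quad\Longrightarrow\quad
\sum_i y_i\, x_{ij} = \sum_i p_i\, x_{ij}.
% The intercept (a feature that is always 1) gives \sum_i y_i = \sum_i p_i, so dividing:
\frac{\sum_i y_i\, x_{ij}}{\sum_i y_i} = \frac{\sum_i p_i\, x_{ij}}{\sum_i p_i}.
```

<p>For a 0/1 feature the left side is the fraction of positive training examples with the feature present and the right side is the same fraction computed with every example weighted by its model score.</p>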
<p>It is now common to call what was previously thought of as artificial intelligence or machine learning: &#8220;statistical machine learning.&#8221;  This reflects that the kind of prediction and characterization we expect from machine learning algorithms are in fact statistical concerns that we can deal with if we have enough data and enough computational resources. </p>
<p>The current important issues for statistical machine learning include:</p>
<ul>
<li>Dealing with very large datasets (driving the return of simpler methods like Naive Bayes)</li>
<li>Dealing with lack of training data (driving interest in clustering and manifold regularization methods)</li>
<li>Dealing with unstructured data and text mining (driving interest in information extraction and segmentation via generative models)</li>
</ul>
<p>Just as Wigner famously wrote about &#8220;The Unreasonable Effectiveness of Mathematics&#8221; in the 1960s, Halevy, Norvig and Pereira write about the &#8220;Unreasonable Effectiveness of Data.&#8221;   They argue that we are in the age of big data (or the age of analysts).   Or, as Varian observed: &#8220;it is a good time to supply a good complementary to data&#8221; (i.e. it is a good time to be an analyst).  I would temper this by noting that we are likely in the age of unmarked and unstructured data.  Less often are we asked to automate a known prediction and more often we are asked to cluster, characterize and segment wild data. In my opinion the hard problem in machine learning has moved from prediction to characterization.  With enough marked training data (that is data for which we know both the observables and desired outcome) it is now quite possible to use standard techniques and libraries to build a very good predictive model.  However, it is still hard to characterize, segment or extract useful information from the wealth of unstructured and unmarked data that is upon us.  And this is where a lot of the current research in statistical machine learning is directed.  </p>
<p>Of course characterization and clustering have their own infamous history.  Rota wrote: &#8220;&#8230; Or a subject is important, but nobody understands what is going on; such is the case with quantum field theory, the distribution of primes, pattern recognition and cluster analysis.&#8221;  Artificial intelligence may be moving from areas where computer scientists have over-promised to areas where statisticians have over-promised.  But this is not a disaster: the most valuable research tends to be done in hectic times in messy fields, not in calm times in neat fields.  And the already large scale adoption of statistical machine learning techniques means there is immediate great client value in even seemingly small improvements in understanding, explanation, documentation, training, tools, libraries and techniques.</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2010/10/Xbarst1.jpg" alt="Xbarst1.jpg" border="0" width="384" height="398" /><br />
<br/><br />
Classic attempt to add structure to text<br />
</center></p>
<p>(images from Wikipedia)</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/' rel='bookmark' title='A Demonstration of Data Mining'>A Demonstration of Data Mining</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A Demonstration of Data Mining</title>
		<link>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=a-demonstration-of-data-mining</link>
		<comments>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/#comments</comments>
		<pubDate>Thu, 20 Aug 2009 01:16:27 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[Regression]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=252</guid>
		<description><![CDATA[REPOST (now in HTML in addition to the original PDF). This paper demonstrates and explains some of the basic techniques used in data mining. It also serves as an example of some of the kinds of analyses and projects Win Vector LLC engages in. August 19, 2009 John Mount1 A Demonstration of Data Mining 1&#160;&#160;Introduction [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
<li><a href='http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/' rel='bookmark' title='The Data Enrichment Method'>The Data Enrichment Method</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/' rel='bookmark' title='Your Data is Never the Right Shape'>Your Data is Never the Right Shape</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>REPOST (now in HTML in addition to the original  <a href="http://www.win-vector.com/dfiles/ADemonstrationOfDataMining.pdf"> PDF</a>).</p>
<p>This paper  demonstrates and explains some of the basic techniques used in data mining.  It also serves as an example of some of the kinds of analyses and projects Win Vector LLC engages in.<span id="more-252"></span>
<div class="p"><!----></div>
<h3 align="center">August 19, 2009 </h3>
<h3 align="center">John Mount<a href="#tthFtNtAAB" name="tthFrefAAB"><sup>1</sup></a> </h3>
<h1 align="center">A Demonstration of Data Mining </h1>
<div class="p"><!----></div>
<h2><a name="tth_sEc1"><br />
1</a>&nbsp;&nbsp;Introduction</h2>
<div class="p"><!----></div>
<p> A major industry in our time is the collection of large data sets in preparation for the magic of data mining [<a href="#NYTStat" name="CITENYTStat">Loh09</a>,<a href="#Halevy:2009p2327" name="CITEHalevy:2009p2327">HNP09</a>].  There is extreme excitement about both the possible applications (identifying important customers, identifying medical risks, targeting advertising, designing auctions and so on) and the various methods for data mining and machine learning.  To some extent these methods are classic statistics presented in a new bottle.  Unfortunately, the concerns, background and language of the modern data-mining practitioner are different from those of the classic statistician, so some demonstration and translation is required.  In this writeup we will show how much of the magic of current data mining and machine learning can be explained in terms of statistical regression techniques and show how the statistician&#8217;s view is useful in choosing techniques.</p>
<div class="p"><!----></div>
<p> Too often data mining is used as a black box. It is quite possible to use statistics to clearly understand the meaning and mechanisms of data mining.</p>
<div class="p"><!----></div>
<h2><a name="tth_sEc2"><br />
2</a>&nbsp;&nbsp;The Example Problem</h2>
<div class="p"><!----></div>
<p> Throughout this writeup we will work on a single idealized example problem.  For our problem we will assume we are working with a company that sells items and that this company has recorded its past sales visits.  We assume they recorded how well the prospect matched the product offering (we will call this &#8220;match factor&#8221;), how much of a discount was offered to the prospect (we will call this &#8220;discount factor&#8221;) and if the prospect became a customer or not (this is our determination of positive or negative outcome).  The goal is to use this past record as &#8220;training data&#8221; and build a model to predict the odds of making a new sale as a function of the match factor and the discount factor.  In a perfect world the historic data would look a lot like Figure&nbsp;<a href="#fig:IdealFitting">1</a>.  In Figure&nbsp;<a href="#fig:IdealFitting">1</a> each icon represents a past sales-visit, the red diamonds are non-sales and the green disks are successful sales.  Each icon is positioned horizontally to correspond to the discount factor used and vertically to correspond to the degree of product match estimated during the prospective customer visit.  This data is literally too good to be true in at least three qualities: the past data covers a large range of possibilities, every possible combination has already been tried in an orderly fashion and the good and bad events &#8220;are linearly separable.&#8221;  The job of the modeler would then be to draw the separating line (shown in Figure&nbsp;<a href="#fig:IdealFitting">1</a>) and label every situation above and to the right of the separating line as good (or positive) and every situation below and to the left as bad (or negative).</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg1"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/IdealFitting.png" alt="IdealFitting.png" /></p>
<p></center><center>Figure 1: Ideal Fitting Situation</center><br />
<a name="fig:IdealFitting"><br />
</a></p>
<div class="p"><!----></div>
<p> In reality past data is subject to what prospects were available (so you are unlikely to have good range and an orderly layout of past sales calls) and also heavily affected by past policy.  An example policy might be that potential customers with good product match factor may never have been offered a significant discount in the past; so we would have no data from that situation.  Finally each outcome is a unique event that depends on a lot more than the two quantities we are recording, so it is too much to hope that the good prospects are simply separable from the bad ones.</p>
<div class="p"><!----></div>
<p> Figure&nbsp;<a href="#fig:IdealFitting">1</a> is a mere cartoon or caricature of the modeling process, but it represents the initial intuition behind data mining.  Again: the flaws in Figure&nbsp;<a href="#fig:IdealFitting">1</a> represent the implicit hopes of the data miner.  The data miner wishes that the past experiments are laid out in an orderly manner, data covers most of the combinations of possibilities and there is a perfect and simple concept ready to be learned.</p>
<div class="p"><!----></div>
<p> Frankly, an experienced data miner would feel incredibly fortunate if the past data looked anything like what is shown in Figure&nbsp;<a href="#fig:EmpiricalData">2</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg2"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/empirical1.png" alt="empirical1.png" /></p>
<p></center><center>Figure 2: Empirical Data</center><br />
<a name="fig:EmpiricalData"><br />
</a></p>
<div class="p"><!----></div>
<p> The green disks (representing good past prospects) and the red diamonds (representing bad past prospects) are intermingled (which is bad).  There is some evidence that past policy was to lower the discount offered as the match factor increased (as seen in the diagonal spread of the green disks).  Finally we see the red diamonds are also distributed differently than the green disks. This is both good and bad.  The good is that the center of mass of the red diamonds differs from the center of mass of the green disks.  The bad is that the density of red diamonds does not fall any faster as it passes into the green disks than it falls in any other direction.  This indicates there is something important and different (and not measured in our two variables) about at least some of the bad prospects.  It is the data miner&#8217;s job to be aware of this and to press on.</p>
<div class="p"><!----></div>
<h3><a name="tth_sEc2.1"><br />
2.1</a>&nbsp;&nbsp;The Trendy Now</h3>
<div class="p"><!----></div>
<p> In truth data miners often rush where classical statisticians fear to tread.  Right now the temptation is to immediately select from any number of &#8220;red hot&#8221; techniques, methods or software packages.  My short list of super-star method buzzwords includes:</p>
<div class="p"><!----></div>
<ul>
<li> Boosting[<a href="#Schapire:2001p1019" name="CITESchapire:2001p1019">Sch01</a>,<a href="#Breiman:2000p1134" name="CITEBreiman:2000p1134">Bre00</a>,<a href="#Freund:2003p1009" name="CITEFreund:2003p1009">FISS03</a>]
<div class="p"><!----></div>
</li>
<li> Latent Dirichlet Allocation[<a href="#Blei:2003p1063" name="CITEBlei:2003p1063">BNJ03</a>]
<div class="p"><!----></div>
</li>
<li> Linear Regression[<a href="#statistics" name="CITEstatistics">FPP07</a>,<a href="#Agresti" name="CITEAgresti">Agr02</a>]
<div class="p"><!----></div>
</li>
<li> Linear Discriminant Analysis[<a href="#Fisher:1936p2576" name="CITEFisher:1936p2576">Fis36</a>]
<div class="p"><!----></div>
</li>
<li> Logistic Regression[<a href="#Agresti" name="CITEAgresti">Agr02</a>,<a href="#Klein:2003p261" name="CITEKlein:2003p261">KM03</a>]
<div class="p"><!----></div>
</li>
<li> Kernel Methods[<a href="#kernel1" name="CITEkernel1">CST00</a>,<a href="#kernel2" name="CITEkernel2">STC04</a>]
<div class="p"><!----></div>
</li>
<li> Maximum Entropy[<a href="#Klein:2003p261" name="CITEKlein:2003p261">KM03</a>,<a href="#Grunwald:2005p108" name="CITEGrunwald:2005p108">Gru05</a>,<a href="#Stern:1989p1480" name="CITEStern:1989p1480">SC89</a>,<a href="#Dudik:2006p954" name="CITEDudik:2006p954">DS06</a>]
<div class="p"><!----></div>
</li>
<li> Naive Bayes[<a href="#Lewis:1998p105" name="CITELewis:1998p105">Lew98</a>]
<div class="p"><!----></div>
</li>
<li> Perceptrons[<a href="#Beigel:2008p1027" name="CITEBeigel:2008p1027">BRS08</a>,<a href="#Dasgupta:2005p2013" name="CITEDasgupta:2005p2013">DKM05</a>]
<div class="p"><!----></div>
</li>
<li> Quantile Regression[<a href="#quantile" name="CITEquantile">Koe05</a>]
<div class="p"><!----></div>
</li>
<li> Ridge Regression[<a href="#Breiman:1997p1133" name="CITEBreiman:1997p1133">BF97</a>]
<div class="p"><!----></div>
</li>
<li> Support Vector Machines[<a href="#kernel1" name="CITEkernel1">CST00</a>]
<div class="p"><!----></div>
</li>
</ul>
<div class="p"><!----></div>
<p> Based on some of the above referenced writing and analysis I would first pick &#8220;logistic regression&#8221; as I am confident that, when used properly, it is just about as powerful as any of the modern data mining techniques (despite its somewhat less than trendy status).  Using logistic regression I immediately get just about as close to a separating line as this data set will support: Figure&nbsp;<a href="#fig:LinearSepartor">3</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg3"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lin1.png" alt="lin1.png" /></p>
<p></center><center>Figure 3: Linear Separator</center><br />
<a name="fig:LinearSepartor"><br />
</a></p>
<div class="p"><!----></div>
<p> The separating line actually encodes a simple rule of the form: &#8220;if 2.2*DiscountFactor + 3.1*MatchFactor &#8805; 1 then we have a good chance of a sale.&#8221;  This is classic black-box data mining magic.  The purpose of this writeup is to look deeper into how to actually derive and understand something like this.</p>
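<p>As a minimal sketch, the quoted rule can be applied mechanically to new prospects. The two weights and the threshold are the ones quoted above; the function names and the example prospects are hypothetical, and the scale of the two factors is assumed here to be roughly 0 to 1 (the text does not say).</p>

```python
# Weights and threshold quoted in the text for the separating line.
DISCOUNT_WEIGHT = 2.2
MATCH_WEIGHT = 3.1
THRESHOLD = 1.0


def rule_score(discount_factor, match_factor):
    """Left-hand side of the separating-line rule."""
    return DISCOUNT_WEIGHT * discount_factor + MATCH_WEIGHT * match_factor


def good_chance_of_sale(discount_factor, match_factor):
    """True when the prospect falls on the 'good' side of the line."""
    return rule_score(discount_factor, match_factor) >= THRESHOLD


# Hypothetical prospects as (discount factor, match factor) pairs.
prospects = [(0.10, 0.10), (0.20, 0.30), (0.05, 0.40)]
for discount, match in prospects:
    print(discount, match, good_chance_of_sale(discount, match))
```

<p>The rule is only a yes/no cut; it says nothing about how confident we should be near the line, which is part of what the rest of this writeup tries to unpack.</p>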
<div class="p"><!----></div>
<h2><a name="tth_sEc3"><br />
3</a>&nbsp;&nbsp;Explanation</h2>
<div class="p"><!----></div>
<p> What is really going on?  Why is our magic formula at all sensible advice, why did this work at all and what motivates the analysis?  It turns out regression (be it linear regression or logistic regression) works in this case because it somewhat imitates the methodology of linear discriminant analysis (described in: [<a href="#Fisher:1936p2576" name="CITEFisher:1936p2576">Fis36</a>]).  In fact in many cases it would be a better idea to perform a linear discriminant analysis or perform an analysis of variance than to immediately appeal to a complicated method.  I will first step through the process of linear discriminant analysis and then relate it to our logistic regression.  Stepping through understandable stages lets us see where we were lucky in modeling and what limits and opportunities for improvement we have.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg4"><br />
</a><br />
<center></p>
<table>
<tr>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/posDat.png" alt="posDat.png" /></td>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/negDat.png" alt="negDat.png" />
</td>
</tr>
</table>
<p></center><center>Figure 4: Separate Plots</center><br />
<a name="fig:SeparatePlots"><br />
</a></p>
<div class="p"><!----></div>
<p> Our data initially looks very messy (the good and bad groups are fairly mixed together).  But if we examine our data in separate groups we can see we are actually incredibly lucky in that the data is easy to describe.  As we can see in Figure&nbsp;<a href="#fig:SeparatePlots">4</a>: the data, when separated by outcome (plotting only all of the good green disks or only all of the bad red diamonds), is grouped in simple blobs without bends, intrusions or other odd (and more work to model) configurations.</p>
<div class="p"><!----></div>
<p> We can plot the idealizations of these data distributions (or densities) as &#8220;contour maps&#8221; (as if we are looking down on the elevations of a mountain on a map) which gives us Figure&nbsp;<a href="#fig:SeparateDistributions">5</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg5"><br />
</a><br />
<center></p>
<table>
<tr>
<td><img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/posDist.png" alt="posDist.png" /></td>
<td> <img width="250" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/negDist.png" alt="negDist.png" />
</td>
</tr>
</table>
<p></center><center>Figure 5: Separate Distributions</center><br />
<a name="fig:SeparateDistributions"><br />
</a></p>
<div class="p"><!----></div>
<h3><a name="tth_sEc3.1"><br />
3.1</a>&nbsp;&nbsp;Full Bayes Model</h3>
<div class="p"><!----></div>
<p> From Figure&nbsp;<a href="#fig:SeparateDistributions">5</a> we can see that while our data is not separable there are significant differences between the groups.  The difference in the groups is more obvious if we plot the difference of the densities on the same graph as in Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a>.  Here we are visualizing the distribution of positive examples as a connected pair of peaks (colored green) and the distribution of negative examples as a deep valley (colored red) located just below and to the left of the peaks.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg6"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/diff1.png" alt="diff1.png" /></p>
<p></center><center>Figure 6: Difference in Density</center><br />
<a name="fig:DifferenceInDensity"><br />
</a></p>
<div class="p"><!----></div>
<p> This difference graph is demonstrating how both of the densities or distributions (positive and negative) reach into different regions of the plane.  The white areas are where the difference in densities is very small which includes the areas in the corners (where there is little of either distribution) and the area between the blobs (where there is a lot of mass from both distributions competing).  This view is a bit closer to what a statistician wants to see: how the distributions of successes and failures differ (this is a step to take before even guessing at or looking for causes and explanations).</p>
<div class="p"><!----></div>
<p> Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a> is already an actionable model: we can predict the odds a new prospect will buy or not at a given discount by looking where they fall on Figure&nbsp;<a href="#fig:DifferenceInDensity">6</a> and checking if they fall in a region of strong red or strong green color.  We can also recommend a discount for a given potential customer by drawing a line at the height determined by their degree of match and tracing from left to right until we first hit a strong green region.  We could hand out a simplified Figure&nbsp;<a href="#fig:FullBayesModel">7</a> as a sales rulebook.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg7"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/bayesModel1.png" alt="bayesModel1.png" /></p>
<p></center><center>Figure 7: Full Bayes Model</center><br />
<a name="fig:FullBayesModel"><br />
</a></p>
<div class="p"><!----></div>
<p> This model is a full Bayes model (but not a Naive Bayes model, which is oddly more famous and which we will cover later).  The steps we took were: first we summarized or idealized our known data into two Gaussian blobs (as depicted in Figure&nbsp;<a href="#fig:SeparateDistributions">5</a>).  Once we had estimated the centers, widths and orientations of these blobs we could then say, for any new point, how likely the point is under the modeled distribution of sales and how likely the point is under the modeled distribution of non-sales.  Mathematically we claim we can estimate P(x,y &#124;sale)<a href="#tthFtNtAAC" name="tthFrefAAC"><sup>2</sup></a> and P(x,y &#124; non-sale) (where x is our discount factor and y is our matching factor).<a href="#tthFtNtAAD" name="tthFrefAAD"><sup>3</sup></a> Neither of these is what we are actually interested in (we want: P(sale &#124; x,y)<a href="#tthFtNtAAE" name="tthFrefAAE"><sup>4</sup></a>).  We can, however, use these values to calculate what we want to know.  Bayes&#8217; law is a law of probability that says if we know P(x,y &#124; sale), P(x,y &#124; non-sale), P(sale) and P(non-sale)<a href="#tthFtNtAAF" name="tthFrefAAF"><sup>5</sup></a> then:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn1.png"/><br />
</center></p>
<p>Figure&nbsp;<a href="#fig:FullBayesModel">7</a> depicts a central hourglass shaped region (colored green) that represents the region of x, y values where P(sale &#124;x,y) is estimated to be at least 0.5 and the remaining (darker red region) are the situations predicted to be less favorable.  Here we are using priors of P(sale) = P(non-sale) = 0.5; for different priors and thresholds we would get different graphs.</p>
<div class="p"><!----></div>
<p> Even at this early stage in the analysis we have already accidentally introduced what we call &#8220;an inductive bias.&#8221;  By modeling both distributions as Gaussians we have guaranteed that our acceptance region will be an hourglass figure (as we saw in Figure&nbsp;<a href="#fig:FullBayesModel">7</a>).  One undesirable consequence of the modeling technique is the prediction that sales become unlikely when both match factor and discount factor are very large.  This is somewhat a consequence of our modeling technique (though the fact that the negative data does not fall quickly as it passes into the green region also added to this).  This unrealistic (or &#8220;not physically plausible&#8221;) prediction is called an artifact (of the technique and of the data) and it is the statistician&#8217;s job to see this, confirm they don&#8217;t want it and eliminate it (by deliberately introducing a &#8220;useful modeling bias&#8221;).</p>
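<p>The full Bayes model and the artifact just described can be sketched in a few lines. This is an illustration under stated assumptions and not the code behind the figures: the data points are made up (a tight blob of sales and a wider blob of non-sales), all function names are hypothetical, and the priors are the 0.5/0.5 priors used above.</p>

```python
import math


def fit_gaussian(points):
    """Estimate the mean and 2x2 covariance of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)


def gaussian_density(point, mean, cov):
    """Density of a 2D Gaussian with covariance [[sxx, sxy], [sxy, syy]]."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form d^T cov^{-1} d, using the closed-form 2x2 inverse.
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def posterior_sale(point, sale_model, non_sale_model, prior_sale=0.5):
    """Bayes' law: P(sale | x, y) from the two class densities and priors."""
    p_sale = gaussian_density(point, *sale_model) * prior_sale
    p_non = gaussian_density(point, *non_sale_model) * (1.0 - prior_sale)
    return p_sale / (p_sale + p_non)


# Made-up training data: x is discount factor, y is match factor.
sales = [(0.5, 0.6), (0.6, 0.5), (0.7, 0.6), (0.6, 0.7), (0.6, 0.6)]
non_sales = [(0.1, 0.4), (0.4, 0.1), (0.7, 0.4), (0.4, 0.7), (0.4, 0.4)]
sale_model = fit_gaussian(sales)
non_sale_model = fit_gaussian(non_sales)

near_sales = posterior_sale((0.6, 0.6), sale_model, non_sale_model)
far_corner = posterior_sale((1.5, 1.5), sale_model, non_sale_model)
print("P(sale | 0.6, 0.6) =", near_sales)
print("P(sale | 1.5, 1.5) =", far_corner)
```

<p>In this toy data the wider non-sale Gaussian dominates far from both blobs, so the model predicts "no sale" for a prospect with very large discount and very large match. That is the same kind of artifact: a consequence of the Gaussian idealization and not of anything observed in that corner.</p>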
<div class="p"><!----></div>
<h3><a name="tth_sEc3.2"><br />
3.2</a>&nbsp;&nbsp;Linear Discriminant</h3>
<div class="p"><!----></div>
<p> To get around the bad predictions of our model in the upper-right quadrant we &#8220;apply domain knowledge&#8221; and introduce a useful modeling bias as follows.  Let us insist that our model be monotone: that if moving in some direction is good then moving further in the same direction is better.  In fact let&#8217;s insist that our model be a half-plane (instead of the hourglass region of the full Bayes model).  We want a nice straight separating cut, which brings us to linear discriminant analysis.  We have enough information to apply Fisher&#8217;s linear discriminant technique and find a separator that maximizes the variance of data across categories while minimizing the variance of data within one category and within the other category.  This is called the linear discriminant and it is shown in Figure&nbsp;<a href="#fig:LinearDiscriminant">8</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg8"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/lda1.png" alt="lda1.png" /></p>
<p></center><center>Figure 8: Linear Discriminant</center><br />
<a name="fig:LinearDiscriminant"><br />
</a></p>
<div class="p"><!----></div>
<p> The blue line is the linear discriminant (similar to the logistic regression line depicted earlier in Figure&nbsp;<a href="#fig:LinearSepartor">3</a>).  Everything above or to the right of the blue line is considered good and everything below or to the left of the blue line is considered bad.  Notice that this advice, while not quite as accurate as the Bayes Model near the boundary between the two distributions, is much more sensible about the upper right corner of the graph.</p>
<div class="p"><!----></div>
<p> To evaluate a separator we collapse all variation parallel to the separating cut (as shown in Figure&nbsp;<a href="#fig:collapse">9</a>).  We then see that each distribution becomes a small interval or streak.  A separator is good if these resulting streaks are both short (the collapse packs the blobs) and the two centers of the streaks are far apart (and on opposite sides of the separator).  In Figure&nbsp;<a href="#fig:collapse">9</a> the streaks are fairly short and despite some overlap we do have some usable separation between the two centers.</p>
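<p>The collapse described above can be sketched as a projection onto the direction perpendicular to the cut. The score used here (squared gap between the streak centers divided by the sum of the two streak variances) is one common way to combine "short streaks" and "far apart centers" and is an assumption, not necessarily the exact quantity plotted in the figure. The data points and function names are made up for illustration.</p>

```python
import math


def project(points, direction):
    """Collapse 2D points onto a direction, leaving one number per point."""
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    return [p[0] * ux + p[1] * uy for p in points]


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def separation_score(positives, negatives, direction):
    """Squared gap between streak centers over the summed streak spreads."""
    pos = project(positives, direction)
    neg = project(negatives, direction)
    gap = mean(pos) - mean(neg)
    return gap * gap / (variance(pos) + variance(neg))


# Made-up data; "direction" is perpendicular to the candidate separating cut.
positives = [(0.5, 0.7), (0.6, 0.6), (0.7, 0.5), (0.6, 0.8), (0.8, 0.6)]
negatives = [(0.2, 0.4), (0.3, 0.3), (0.4, 0.2), (0.3, 0.5), (0.5, 0.3)]

good_cut = separation_score(positives, negatives, (1.0, 1.0))
poor_cut = separation_score(positives, negatives, (1.0, -1.0))
print("score for direction (1, 1): ", good_cut)
print("score for direction (1, -1):", poor_cut)
```

<p>In this toy data, collapsing along the first direction gives short, well separated streaks; collapsing along the second stacks the two streaks on top of each other.</p>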
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg9"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/collapse2.png" alt="collapse2.png" /></p>
<p></center><center>Figure 9: Evaluating Quality of Separating Cut</center><br />
<a name="fig:collapse"><br />
</a></p>
<div class="p"><!----></div>
<p> To make the above precise we switch to mathematical notation.  For the i-th positive training example form the vector v<sub>+,i</sub> and the matrix S<sub>+,i</sub> where</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn2.png"/><br />
</center></p>
<div class="p"><!----></div>
<p> where x<sub>i</sub> and y<sub>i</sub> are the known x and y coordinates for this particular past experience.  Define v<sub>&#8722;,i</sub>, S<sub>&#8722;,i</sub> similarly for all negative examples.  In this notation we have for a direction &#947;: the distance along the &#947; direction between the center of positive examples and center of negative examples is: &#947;<sup>T</sup> ( &#8721;<sub>i</sub> v<sub>+,i</sub> / n<sub>+</sub> &#8722; &#8721;<sub>i</sub> v<sub>&#8722;,i</sub> / n<sub>&#8722;</sub>) (where n<sub>+</sub> is the number of positive examples and n<sub>&#8722;</sub> is the number of negative examples).  We would like this quantity to be large.  The degree of spread or variance of the positive examples along the &#947; direction is &#947;<sup>T</sup> (&#8721;<sub>i</sub> S<sub>+,i</sub> / n<sub>+</sub>) &#947;.  The degree of spread or variance of the negative examples along the &#947; direction is &#947;<sup>T</sup> (&#8721;<sub>i</sub> S<sub>&#8722;,i</sub> / n<sub>&#8722;</sub>) &#947;.  We would like the last two quantities to be small.  The linear discriminant is picked to maximize:</p>
<p><center><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn3.png"/><br />
</center></p>
<p>It is a fairly standard observation (involving the Rayleigh quotient) that this form is maximized when:<br />
<center><br />
<a name="eq:lda"><br />
</a><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn4.png"/><br />
</center></p>
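<p>For two variables the maximizing direction can be computed by hand. The sketch below assumes the standard textbook form of the solution, in which the direction is proportional to the inverse of the summed class covariance matrices applied to the difference of the class centers; the overall scale of the direction is arbitrary, so this may differ from the displayed equation by a constant factor. The data and function names are made up.</p>

```python
def class_mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def class_covariance(points):
    """Average of (v - mean)(v - mean)^T, returned as (sxx, sxy, syy)."""
    mx, my = class_mean(points)
    n = len(points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    return sxx, sxy, syy


def fisher_direction(positives, negatives):
    """Solve (S_plus + S_minus) gamma = mean_plus - mean_minus for gamma."""
    pxx, pxy, pyy = class_covariance(positives)
    nxx, nxy, nyy = class_covariance(negatives)
    a, b, d = pxx + nxx, pxy + nxy, pyy + nyy
    mp, mn = class_mean(positives), class_mean(negatives)
    gx, gy = mp[0] - mn[0], mp[1] - mn[1]
    det = a * d - b * b
    # Closed-form inverse of the symmetric 2x2 matrix [[a, b], [b, d]].
    return ((d * gx - b * gy) / det, (a * gy - b * gx) / det)


# Made-up data: the negatives are the positives shifted down and to the left.
positives = [(0.5, 0.7), (0.6, 0.6), (0.7, 0.5), (0.6, 0.8), (0.8, 0.6)]
negatives = [(0.2, 0.4), (0.3, 0.3), (0.4, 0.2), (0.3, 0.5), (0.5, 0.3)]

gamma = fisher_direction(positives, negatives)
print("discriminant direction:", gamma)
```

<p>For this symmetric toy data the direction points along the diagonal, so the separating cut runs from upper left to lower right, as in the figures.</p>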
<div class="p"><!----></div>
<p> As we have said, the linear discriminant is very similar to what is returned by a regression or logistic regression.  In fact in our diagrams the regression lines are almost identical to the linear discriminant.  A large part of why regression can be usefully applied in classification comes from its close relationship to the linear discriminant.</p>
<div class="p"><!----></div>
<h3><a name="tth_sEc3.3"><br />
3.3</a>&nbsp;&nbsp;Linear Regression</h3>
<div class="p"><!----></div>
<p> Linear regression is designed to model continuous functions subject to independent normal errors in observation.  Linear regression is incredibly powerful at characterizing and eliminating correlations between the input variables of a model.  While function fitting is different than classification (our example problem) linear regression is so useful whenever there is any suspected correlation (which is almost always the case) that it is an appropriate tool.  In our example, in the positive examples (those that led to sales) there is clearly a historical dependence between the degree of estimated match and amount of discount offered.  Likely this dependence is from past prospects being subject to a (rational) policy of &#8220;the worse the match the higher the offered discount&#8221; (instead of being arranged in a perfect grid-like experiment as in our first diagram: Figure&nbsp;<a href="#fig:IdealFitting">1</a>).  If this dependence is not dealt with we would under-estimate the value of discount because we would think that discounted customers are not signing up at a higher rate (when these prospects are in fact clearly motivated by discount, once you control for the fact that many of the deeply discounted prospects had a much worse degree of match than average).</p>
<div class="p"><!----></div>
<p> For analysis of categorical data linear regression is closely linked to ANOVA (analysis of variance).[<a href="#Agresti" name="CITEAgresti">Agr02</a>] Recall that variance was a major consideration with the linear discriminant analysis, so we should by now be on familiar ground.</p>
<div class="p"><!----></div>
<p>In our notation the standard least-squares regression solution is:<br />
<center><br />
<a name="eq:leastsquares"><br />
</a><br />
<img src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/eqn5.png"/><br />
</center></p>
<p>where y<sub>+,i</sub> = 1 for all i and y<sub>&#8722;,i</sub> = &#8722;1 for all i.</p>
<div class="p"><!----></div>
<p> If we have the same number of positive and negative examples (i.e.  n<sub>+</sub> = n<sub>&#8722;</sub>) then Equation&nbsp;<a href="#eq:lda">1</a> and Equation&nbsp;<a href="#eq:leastsquares">2</a> are identical and we have &#946; = &#947;.  So in this special case the linear discriminant equals the least-squares linear regression solution.  We can even ask how the solutions change if the relative proportions of positive and negative training data change.  The linear discriminant is carefully designed not to move, but the regression solution will tilt to an angle that is more compatible with the larger of the example classes and shift to cut less into that class.  The linear regression solution can be fixed (by re-weighting the data) to also be insensitive to the relative proportions of positive and negative examples, but it does not behave that way &#8220;fresh out of the box.&#8221;</p>
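<p> The equal-class-size case can be checked numerically.  The sketch below is only an illustration under stated assumptions: the points are made up, the helper names (<tt>mean2</tt>, <tt>scatter2</tt>, <tt>solve2</tt>) are hypothetical, and the linear discriminant is taken to be the usual Fisher form (inverse within-class scatter applied to the difference of class means), which may differ in scaling from Equation&nbsp;1.  It compares that direction with the least-squares direction for targets +1 and &#8722;1.</p>

```python
# Sketch: with equal class sizes, the Fisher discriminant direction and the
# least-squares direction (targets +1 / -1) are parallel.
# All data and helper names are made up for illustration.

def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points, center):
    """2x2 scatter matrix, returned as (sxx, sxy, syy), around center."""
    sxx = sum((p[0] - center[0]) ** 2 for p in points)
    sxy = sum((p[0] - center[0]) * (p[1] - center[1]) for p in points)
    syy = sum((p[1] - center[1]) ** 2 for p in points)
    return sxx, sxy, syy

def solve2(sxx, sxy, syy, b0, b1):
    """Solve [[sxx, sxy], [sxy, syy]] w = (b0, b1) by Cramer's rule."""
    det = sxx * syy - sxy * sxy
    return ((syy * b0 - sxy * b1) / det, (sxx * b1 - sxy * b0) / det)

pos = [(2.0, 2.0), (3.0, 1.0), (2.0, 3.0), (3.0, 3.0)]     # targets +1
neg = [(0.0, 0.0), (-1.0, 1.0), (0.0, -1.0), (1.0, -1.0)]  # targets -1

# Assumed Fisher form: (within-class scatter)^-1 (mean_pos - mean_neg).
mp, mn = mean2(pos), mean2(neg)
sp, sn = scatter2(pos, mp), scatter2(neg, mn)
within = tuple(a + b for a, b in zip(sp, sn))
fisher = solve2(*within, mp[0] - mn[0], mp[1] - mn[1])

# Least squares with an intercept: center inputs and targets, then solve
# the 2x2 normal equations for the slope vector.
pts = pos + neg
ys = [1.0] * len(pos) + [-1.0] * len(neg)
m = mean2(pts)
ybar = sum(ys) / len(ys)
total = scatter2(pts, m)
c0 = sum((p[0] - m[0]) * (y - ybar) for p, y in zip(pts, ys))
c1 = sum((p[1] - m[1]) * (y - ybar) for p, y in zip(pts, ys))
lsq = solve2(*total, c0, c1)

# Parallel vectors have a zero 2-D cross product.
cross = fisher[0] * lsq[1] - fisher[1] * lsq[0]
print(fisher, lsq, cross)
```

<p> The two vectors differ in length but not in direction, so they describe the same family of parallel separating lines; only the direction is compared here, not the intercept.</p>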
<div class="p"><!----></div>
<h3><a name="tth_sEc3.4"><br />
3.4</a>&nbsp;&nbsp;Logistic Regression</h3>
<div class="p"><!----></div>
<p> While linear regression is designed to pick a function that minimizes the sum of squared errors, logistic regression is designed to pick a separator that maximizes something called <em>the plausibility of the data</em> (more commonly called the likelihood).  In our case, since the data is so well behaved, the logistic regression line is essentially the same as the linear regression line.  It is in fact an important property of logistic regression that there is always a re-weighting (or choice of re-emphasis) of the data that causes some linear regression to pick the same separator as the logistic regression.  Because linear and logistic regression are only identical in specific circumstances, it is the job of the statistician to know which of the two is more appropriate for a given data set and given intended use of the resulting model.</p>
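<p> The re-weighting property can be made concrete with iteratively re-weighted least squares, a standard way to fit a logistic regression.  The following is a minimal sketch, not the procedure used for the figures: the one-variable data set is invented, and <tt>weighted_line_fit</tt> is a hypothetical helper.  Each pass is an ordinary weighted linear regression whose weights and working targets come from the current fit.</p>

```python
import math

# Sketch: logistic regression fitted by iteratively re-weighted least squares.
# Each pass is a weighted linear regression; the weights and working targets
# are recomputed from the current fit. The data is made up.

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 1, 0, 1, 1, 0, 1]   # 1 = sale, 0 = non-sale (not separable)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def weighted_line_fit(xs, zs, ws):
    """Weighted least-squares fit of z ~ b0 + b1 * x; returns (b0, b1)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    zbar = sum(w * z for w, z in zip(ws, zs)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxz = sum(w * (x - xbar) * (z - zbar) for w, x, z in zip(ws, xs, zs))
    b1 = sxz / sxx
    return zbar - b1 * xbar, b1

b0, b1 = 0.0, 0.0
for _ in range(25):
    etas = [b0 + b1 * x for x in xs]
    ps = [sigmoid(e) for e in etas]
    ws = [p * (1.0 - p) for p in ps]                         # the re-weighting
    zs = [e + (y - p) / w for e, y, p, w in zip(etas, ys, ps, ws)]
    b0, b1 = weighted_line_fit(xs, zs, ws)

# At the maximum-likelihood fit the score equations vanish.
ps = [sigmoid(b0 + b1 * x) for x in xs]
score0 = sum(y - p for y, p in zip(ys, ps))
score1 = sum((y - p) * x for y, p, x in zip(ys, ps, xs))
print(b0, b1, score0, score1)
```

<p> Once the loop stops changing, the last weighted linear regression and the logistic regression agree on the separator, which is the sense in which a suitable re-weighting makes the two methods coincide.</p>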
<div class="p"><!----></div>
<h2><a name="tth_sEc4"><br />
4</a>&nbsp;&nbsp;Other Methods and Techniques</h2>
<div class="p"><!----></div>
<h3><a name="tth_sEc4.1"><br />
4.1</a>&nbsp;&nbsp;Kernelized Regression</h3>
<div class="p"><!----></div>
<p> One way to greatly expand the power of modeling methods is a trick called kernel methods.  Roughly, kernel methods are those methods that increase the power of machine learning by moving from a simple problem space (like ours in variables x and y) to a richer problem space that may be easier to work in.  A lot of ink is spilled about how efficient the kernel methods are (they work in time proportional to the size of the simple space, not the complex one) but this is not their essential feature.  The essential feature is the expanded explanatory power, and this is so important that even the trivial kernel methods (such as directly adjoining additional combinations of variables) pick up most of the power of the method.  Kernel methods are also overly associated with Support Vector Machines, but they are just as useful when added to Naive Bayes, linear regression or logistic regression.</p>
<div class="p"><!----></div>
<p> For instance: Figure&nbsp;<a href="#fig:KernelizedRegression">10</a> shows a bow-tie like acceptance region found by using linear regression over the variables x, y, x<sup>2</sup>, y<sup>2</sup> and x y (instead of just x and y).  Note how this result is similar to the full Bayes model (but comes from a different feature set and fitting technique).</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg10"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/kRegression.png" alt="kRegression.png" /></p>
<p></center><center>Figure 10: Kernelized Regression</center><br />
<a name="fig:KernelizedRegression"><br />
</a></p>
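<p> The &#8220;directly adjoining additional combinations of variables&#8221; idea can be sketched in a few lines.  This is an illustration only: the bow-tie labels are synthetic (not the data behind Figure&nbsp;10), and <tt>expand</tt>, <tt>solve</tt> and <tt>least_squares</tt> are hypothetical helpers.  Ordinary least squares is run over the expanded variables x, y, x<sup>2</sup>, y<sup>2</sup> and x y plus a constant.</p>

```python
# Sketch: "kernelizing" by directly adjoining x^2, y^2 and x*y to the inputs,
# then running ordinary least squares on the expanded features.
# The bow-tie labels below are synthetic, not the article's data.

def expand(x, y):
    """Feature map: constant, x, y, x^2, y^2, x*y."""
    return [1.0, x, y, x * x, y * y, x * y]

def solve(a, b):
    """Solve a w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(rows, targets):
    """Fit weights via the normal equations (X^T X) w = X^T t."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xtt = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xtt)

# Bow-tie target: +1 where |x| exceeds |y|, -1 otherwise; points on the
# diagonals are skipped because their label would be ambiguous.
grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3) if abs(x) != abs(y)]
labels = [1.0 if abs(x) > abs(y) else -1.0 for x, y in grid]

w = least_squares([expand(x, y) for x, y in grid], labels)

def score(x, y):
    """Linear score in the expanded feature space."""
    return sum(wi * fi for wi, fi in zip(w, expand(x, y)))

correct = sum(1 for (x, y), t in zip(grid, labels) if score(x, y) * t > 0)
print(correct, "of", len(grid), "training points on the correct side")
```

<p> The fitted function is still linear in its inputs; all of the extra shape comes from the richer feature set, which is why the same trick carries over to other linear methods.</p>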
<div class="p"><!----></div>
<h3><a name="tth_sEc4.2"><br />
4.2</a>&nbsp;&nbsp;Naive Bayes Model</h3>
<div class="p"><!----></div>
<p> We briefly return to the Bayes model to discuss a more common alternative called &#8220;Naive Bayes.&#8221;  A Naive Bayes model is like a full Bayes model except that an additional modeling simplification is introduced by assuming that P(x,y&#124;sale) = P(x&#124;sale)P(y&#124;sale) and P(x,y&#124;non-sale) = P(x&#124;non-sale)P(y&#124;non-sale).  That is, we are assuming that the distributions of the x and y measurements are essentially independent (once we know which outcome happened).  This assumption is the opposite of what we do with regression, in that we ignore dependencies in the data (instead of modeling and eliminating the dependencies).  However, Naive Bayes methods are quite powerful and very appropriate in sparse-data situations (such as text classification).  The &#8220;naive&#8221; assumption that the input variables are independent greatly reduces the amount of data that needs to be tracked (it is much less work to track values of variables instead of simultaneous values of pairs of variables).  The curved separator from this Naive Bayes model is illustrated in Figure&nbsp;<a href="#fig:NaiveBayesModel">11</a>.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg11"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/naiveBayesModel1.png" alt="naiveBayesModel1.png" /></p>
<p></center><center>Figure 11: Naive Bayes Model</center><br />
<a name="fig:NaiveBayesModel"><br />
</a></p>
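<p> The factorization P(x,y&#124;sale) = P(x&#124;sale)P(y&#124;sale) translates into code quite directly.  The sketch below assumes one normal density per variable per class, which is a choice made here for illustration and not necessarily the density model behind Figure&nbsp;11; the data and the function names are likewise invented.</p>

```python
import math

# Sketch: a two-class Naive Bayes scorer with one normal density per variable
# per class. The normal densities and the data are assumptions for illustration.

def fit_normal(values):
    """Mean and variance of a list of numbers."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def log_normal(v, mu, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (v - mu) ** 2 / (2.0 * var)

def fit_class(points):
    """Fit P(x|class) and P(y|class) separately: the 'naive' step."""
    return fit_normal([p[0] for p in points]), fit_normal([p[1] for p in points])

def log_joint(model, prior, x, y):
    """log P(class) + log P(x|class) + log P(y|class)."""
    (mx, vx), (my, vy) = model
    return math.log(prior) + log_normal(x, mx, vx) + log_normal(y, my, vy)

sale = [(2.0, 2.0), (3.0, 1.0), (2.0, 3.0), (3.0, 3.0)]
non_sale = [(0.0, 0.0), (-1.0, 1.0), (0.0, -1.0), (1.0, -1.0)]
n = len(sale) + len(non_sale)
m_sale, m_non = fit_class(sale), fit_class(non_sale)

def predict_sale(x, y):
    """True when the sale class has the larger posterior."""
    return (log_joint(m_sale, len(sale) / n, x, y)
            > log_joint(m_non, len(non_sale) / n, x, y))

print(predict_sale(2.5, 2.0), predict_sale(0.0, 0.0))
```

<p> Only per-variable summaries are stored for each class (here a mean and a variance), never any joint statistics of x and y together, which is the saving in bookkeeping described above.</p>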
<div class="p"><!----></div>
<p> The Naive Bayes version of the advice or policy chart is always going to be an axis-aligned parabola as in Figure&nbsp;<a href="#fig:NaiveBayesDecision">12</a>.  Notice how both the linear discriminant and the Naive Bayes model make mistakes (they place some colors on the wrong side of the curve), but they are simple, reliable models that have the desirable property of having connected prediction regions.</p>
<div class="p"><!----></div>
<div class="p"><!----></div>
<p><a name="tth_fIg12"><br />
</a><br />
<center><img width="400" src="http://www.win-vector.com/blog/wp-content/uploads/2009/08/naiveBayesModel2.png" alt="naiveBayesModel2.png" /></p>
<p></center><center>Figure 12: Naive Bayes Decision</center><br />
<a name="fig:NaiveBayesDecision"><br />
</a></p>
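<p> One way to see where the axis alignment comes from is to write out the log-odds.  The following is a sketch under an assumption not stated in the text: that each per-variable density is normal, with hypothetical means &#956;<sub>+,v</sub>, &#956;<sub>&#8722;,v</sub> and standard deviations &#963;<sub>+,v</sub>, &#963;<sub>&#8722;,v</sub> for the sale (+) and non-sale (&#8722;) classes.</p>

```latex
\log\frac{P(\text{sale}\mid x,y)}{P(\text{non-sale}\mid x,y)}
  = \log\frac{P(\text{sale})}{P(\text{non-sale})}
  + \sum_{v\in\{x,y\}}\left[
      \log\frac{\sigma_{-,v}}{\sigma_{+,v}}
      - \frac{(v-\mu_{+,v})^2}{2\sigma_{+,v}^2}
      + \frac{(v-\mu_{-,v})^2}{2\sigma_{-,v}^2}
    \right]
```

<p> The decision curve is where this expression equals zero.  Because the sum treats x and y separately, the expression is a quadratic in x plus a quadratic in y with no x y cross term, so the resulting curve cannot be rotated relative to the axes.</p>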
<div class="p"><!----></div>
<h3><a name="tth_sEc4.3"><br />
4.3</a>&nbsp;&nbsp;More Exotic Methods</h3>
<div class="p"><!----></div>
<p> Many of the hot buzzword machine learning and data mining methods we listed earlier are essentially different techniques of fitting a linear separator over data.  These methods seem very different but they all form a family once you realize many of the details of the methods are determined by:</p>
<div class="p"><!----></div>
<ul>
<li> Choice of Loss Function
<div class="p"><!----></div>
<p> This is the notion of &#8220;goodness of fit&#8221; being used.  It can be normalized mean-variance (linear discriminants), un-normalized variance (linear regression), plausibility (logistic regression), L1 distance (support vector machines, quantile regression), entropy (maximum entropy), probability mass and so on.</p>
<div class="p"><!----></div>
</li>
<li> Choice of Optimization Technique
<div class="p"><!----></div>
<p> For a given loss function we can optimize in many ways (though most authors make the mistake of binding their current favorite optimization method deep into their specification of technique): EM, steepest descent, conjugate gradient, quasi-Newton, linear programming and quadratic programming to name a few.</p>
<div class="p"><!----></div>
</li>
<li> Choice of Regularization Method
<div class="p"><!----></div>
<p> Regularization is the idea of forcing the model not to pick extreme parameter values that over-fit irrelevant artifacts in the training data.  Methods include MDL, controlling energy/entropy, Laplace smoothing, shrinkage, bagging and early termination of optimization.  Non-explicit treatment of regularization is one reason many methods completely specify their optimization procedure (to get some accidental regularization).</p>
<div class="p"><!----></div>
</li>
<li> Choice of Features/Kernelization
<div class="p"><!----></div>
<p> The richness of the feature set the method is applied to is the single largest determinant of model quality.</p>
<div class="p"><!----></div>
</li>
<li> Pre-transformation Tricks
<div class="p"><!----></div>
<p> Some statistical methods are improved by pre-transforming the outcome data to look more normal or be more homoscedastic.<a href="#tthFtNtAAG" name="tthFrefAAG"><sup>6</sup></a></p>
<div class="p"><!----></div>
</li>
</ul>
<div class="p"><!----></div>
<p> If you think along a few axes like these (instead of evaluating methods by their name and lineage) you tend to see different data mining methods more as embodying different trade-offs than as being unique, incompatible disciplines.</p>
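<p> The axes listed above can be made concrete by holding everything fixed except the loss function.  The sketch below is only an illustration: the data is invented, plain gradient descent stands in for the optimizer, L2 shrinkage stands in for the regularizer, and the hinge loss is used here as the support-vector-style loss.  None of these choices is claimed to match any particular published method.</p>

```python
import math

# Sketch: one linear separator, one optimizer (plain gradient descent), one
# regularizer (L2 shrinkage), and a swappable loss. Losses are written as
# derivatives with respect to the margin m = label * score. Data is made up.

def d_square(m):      # loss (1 - m)^2
    return -2.0 * (1.0 - m)

def d_logistic(m):    # loss log(1 + exp(-m))
    return -1.0 / (1.0 + math.exp(m))

def d_hinge(m):       # loss max(0, 1 - m), using a subgradient at the kink
    return -1.0 if 1.0 - m > 0.0 else 0.0

def fit(points, labels, d_loss, steps=5000, rate=0.01, shrink=0.01):
    """Gradient descent on mean loss + (shrink / 2) * |w|^2."""
    w, b = [0.0, 0.0], 0.0
    n = len(points)
    for _ in range(steps):
        gw, gb = [shrink * w[0], shrink * w[1]], 0.0
        for (x, y), t in zip(points, labels):
            g = d_loss(t * (w[0] * x + w[1] * y + b)) * t / n
            gw[0] += g * x
            gw[1] += g * y
            gb += g
        w = [w[0] - rate * gw[0], w[1] - rate * gw[1]]
        b -= rate * gb
    return w, b

points = [(2.0, 2.0), (3.0, 1.0), (2.0, 3.0), (3.0, 3.0),
          (0.0, 0.0), (-1.0, 1.0), (0.0, -1.0), (1.0, -1.0)]
labels = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]

def accuracy(w, b):
    """Fraction of training points on the correct side of the separator."""
    hits = sum(1 for (x, y), t in zip(points, labels)
               if t * (w[0] * x + w[1] * y + b) > 0.0)
    return hits / len(points)

results = {}
for name, d_loss in [("square", d_square), ("logistic", d_logistic),
                     ("hinge", d_hinge)]:
    w, b = fit(points, labels, d_loss)
    results[name] = accuracy(w, b)
    print(name, [round(v, 3) for v in w], round(b, 3), results[name])
```

<p> Swapping the loss changes the weights that come out but not the shape of the procedure, which is the sense in which these methods form a family.</p>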
<div class="p"><!----></div>
<div class="p"><!----></div>
<h2><a name="tth_sEc5"><br />
5</a>&nbsp;&nbsp;Conclusion</h2>
<div class="p"><!----></div>
<p> Our goal for this writeup was to fully demonstrate a data mining method and then survey some important data mining and machine learning techniques.  Many of the important considerations are &#8220;too obvious&#8221; to be discussed by statisticians and &#8220;too statistical&#8221; to be comfortably expressed in terms popular with data miners.  The theory and considerations from statistics, when combined with the experience and optimism of data-mining/machine-learning, truly make it possible to achieve the important goal of &#8220;learning from data.&#8221;</p>
<div class="p"><!----></div>
<p>This expository writeup is also meant to serve as an example of the<br />
types of research, analysis, software and training supplied by<br />
Win-Vector LLC <a href="http://www.win-vector.com"><tt>http://www.win-vector.com</tt></a> .  Win-Vector LLC<br />
prides itself on depth of research and specializes in identifying,<br />
documenting and implementing the &#8220;simplest technique that can<br />
possibly work&#8221; (which is often the most understandable, maintainable,<br />
robust and reliable).  Win-Vector LLC specializes in research but<br />
has significant experience in delivering full solutions (including<br />
software solutions and integration with existing databases).</p>
<div class="p"><!----></div>
<p><font size="-1"></p>
<h2>References</h2>
<dl compact="compact">
<dt><a href="#CITEAgresti" name="Agresti">[Agr02]</a></dt>
<dd>
Alan Agresti, <em>Categorical data analysis (Wiley series in probability and<br />
  statistics)</em>, Wiley-Interscience, July 2002.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBreiman:1997p1133" name="Breiman:1997p1133">[BF97]</a></dt>
<dd>
Leo Breiman and Jerome&nbsp;H Friedman, <em>Predicting multivariate responses in<br />
  multiple linear regression</em>, Journal of the Royal Statistical Society, Series<br />
  B (Methodological) <b>59</b> (1997), no.&nbsp;1, 3-54.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBlei:2003p1063" name="Blei:2003p1063">[BNJ03]</a></dt>
<dd>
David&nbsp;M Blei, Andrew&nbsp;Y Ng, and Michael&nbsp;I Jordan, <em>Latent Dirichlet<br />
  allocation</em>, Journal of Machine Learning Research <b>3</b> (2003),<br />
  993-1022.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBreiman:2000p1134" name="Breiman:2000p1134">[Bre00]</a></dt>
<dd>
Leo Breiman, <em>Special invited paper. additive logistic regression: A<br />
  statistical view of boosting: Discussion</em>, Ann. Statist. <b>28</b> (2000),<br />
  no.&nbsp;2, 374-377.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEBeigel:2008p1027" name="Beigel:2008p1027">[BRS08]</a></dt>
<dd>
Richard Beigel, Nick Reingold, and Daniel&nbsp;A Spielman, <em>The perceptron<br />
  strikes back</em>, 6.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEkernel1" name="kernel1">[CST00]</a></dt>
<dd>
Nello Cristianini and John Shawe-Taylor, <em>An introduction to support<br />
  vector machines and other kernel-based learning methods</em>, 1 ed., Cambridge<br />
  University Press, March 2000.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEDasgupta:2005p2013" name="Dasgupta:2005p2013">[DKM05]</a></dt>
<dd>
Sanjoy Dasgupta, Adam&nbsp;Tauman Kalai, and Claire Monteleoni, <em>Analysis of<br />
  perceptron-based active learning</em>, CSAIL Tech. Report (2005), 16.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEDudik:2006p954" name="Dudik:2006p954">[DS06]</a></dt>
<dd>
Miroslav Dudik and Robert&nbsp;E Schapire, <em>Maximum entropy distribution<br />
  estimation with generalized regularization</em>, COLT (2006), 15.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEFisher:1936p2576" name="Fisher:1936p2576">[Fis36]</a></dt>
<dd>
Ronald&nbsp;A Fisher, <em>The use of multiple measurements in taxonomic problems</em>,<br />
  Annals of Eugenics <b>7</b> (1936), 179-188.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEFreund:2003p1009" name="Freund:2003p1009">[FISS03]</a></dt>
<dd>
Yoav Freund, Raj Iyer, Robert&nbsp;E Schapire, and Yoram Singer, <em>An efficient<br />
  boosting algorithm for combining preferences</em>, Journal of Machine Learning<br />
  Research <b>4</b> (2003), 933-969.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEstatistics" name="statistics">[FPP07]</a></dt>
<dd>
David Freedman, Robert Pisani, and Roger Purves, <em>Statistics 4th edition</em>,<br />
  W. W. Norton and Company, 2007.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEGrunwald:2005p108" name="Grunwald:2005p108">[Gru05]</a></dt>
<dd>
Peter&nbsp;D Grunwald, <em>Maximum entropy and the glasses you are looking<br />
  through</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEHalevy:2009p2327" name="Halevy:2009p2327">[HNP09]</a></dt>
<dd>
Alon Halevy, Peter Norvig, and Fernando Pereira, <em>The unreasonable<br />
  effectiveness of data</em>, IEEE Intelligent Systems (2009).</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEKlein:2003p261" name="Klein:2003p261">[KM03]</a></dt>
<dd>
Dan Klein and Christopher&nbsp;D Manning, <em>Maxent models, conditional<br />
  estimation, and optimization</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEquantile" name="quantile">[Koe05]</a></dt>
<dd>
Roger Koenker, <em>Quantile regression</em>, Cambridge University Press, May<br />
  2005.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITELewis:1998p105" name="Lewis:1998p105">[Lew98]</a></dt>
<dd>
David&nbsp;D Lewis, <em>Naive (Bayes) at forty: The independence assumption in<br />
  information retrieval</em>.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITENYTStat" name="NYTStat">[Loh09]</a></dt>
<dd>
Steve Lohr, <em>For today’s graduate, just one word: Statistics</em>,<br />
  <a href="http://www.nytimes.com/2009/08/06/technology/06stats.html"><tt>http://www.nytimes.com/2009/08/06/technology/06stats.html</tt></a>, August 2009.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITER:Sarkar:2008" name="R:Sarkar:2008">[Sar08]</a></dt>
<dd>
Deepayan Sarkar, <em>Lattice: Multivariate data visualization with R</em>,<br />
  Springer, New York, 2008, ISBN 978-0-387-75968-5.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEStern:1989p1480" name="Stern:1989p1480">[SC89]</a></dt>
<dd>
Hal Stern and Thomas&nbsp;M Cover, <em>Maximum entropy and the lottery</em>, Journal<br />
  of the American Statistical Association <b>84</b> (1989), no.&nbsp;408,<br />
  980-985.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITESchapire:2001p1019" name="Schapire:2001p1019">[Sch01]</a></dt>
<dd>
Robert&nbsp;E Schapire, <em>The boosting approach to machine learning an<br />
  overview</em>, 23.</p>
<div class="p"><!----></div>
</dd>
<dt><a href="#CITEkernel2" name="kernel2">[STC04]</a></dt>
<dd>
John Shawe-Taylor and Nello Cristianini, <em>Kernel methods for pattern<br />
  analysis</em>, Cambridge University Press, June 2004.</dd>
</dl>
<p></font></p>
<div class="p"><!----></div>
<p><center><b>APPENDIX</b><br />
</center></p>
<div class="p"><!----></div>
<h2><a name="tth_sEcA"><br />
A</a>&nbsp;&nbsp;Graphs</h2>
<div class="p"><!----></div>
<p>The majority of the graphs in this writeup were produced using &#8220;R&#8221;<br />
<a href="http://www.r-project.org/"><tt>http://www.r-project.org/</tt></a> and Deepayan Sarkar&#8217;s Lattice<br />
package[<a href="#R:Sarkar:2008" name="CITER:Sarkar:2008">Sar08</a>].</p>
<div class="p"><!----></div>
<hr />
<h3>Footnotes:</h3>
<div class="p"><!----></div>
<p><a name="tthFtNtAAB"></a><a href="#tthFrefAAB"><sup>1</sup></a><br />
<a href="mailto:jmount@win-vector.com"><tt>mailto:jmount@win-vector.com</tt></a><br />
<a href="http://www.win-vector.com/"><tt>http://www.win-vector.com/</tt></a><br />
<a href="http://www.win-vector.com/blog/"><tt>http://www.win-vector.com/blog/</tt></a></p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAC"></a><a href="#tthFrefAAC"><sup>2</sup></a>Read P(A &#124; B) as: &#8220;the probability that A will<br />
  happen given that we know B is true.&#8221;</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAD"></a><a href="#tthFrefAAD"><sup>3</sup></a>Technically we are working with densities, not<br />
  probabilities, but we will use probability notation for its<br />
  intuition.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAE"></a><a href="#tthFrefAAE"><sup>4</sup></a>P(sale &#124; x,y) is the probability of<br />
making a sale as a function of what we know about the prospective<br />
customer and our offer.  Whereas P(x,y&#124;sale) was just how likely it is<br />
to see a prospect with the given x and y values, conditioned on knowing we made<br />
a sale to this prospect.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAF"></a><a href="#tthFrefAAF"><sup>5</sup></a> P(sale) and<br />
  P(non-sale) are just the &#8220;prior odds&#8221; of a sale: our<br />
  estimate of our chances of success before we look at any<br />
  facts about a particular customer.  We can use our historical<br />
  overall success and failure rates as estimates of these quantities.</p>
<div class="p"><!----></div>
<p><a name="tthFtNtAAG"></a><a href="#tthFrefAAG"><sup>6</sup></a>A situation is homoscedastic if the errors are independent of where we are in the parameter space (our x,y or match factor and discount factor).  This property is very important for meaningful fitting/modeling and interpreting significance of fits.</p>
<hr /><small>File translated from<br />
T<sub><font size="-1">E</font></sub>X<br />
by <a href="http://hutchinson.belmont.ma.us/tth/"><br />
T<sub><font size="-1">T</font></sub>H</a>,<br />
version 3.85.<br />On 29 Aug 2009, 11:43.</small></p>
]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

