<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Win-Vector Blog</title>
	<atom:link href="http://www.win-vector.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.win-vector.com/blog</link>
	<description>The Applied Theorist&#039;s Point of View</description>
	<lastBuildDate>Fri, 11 May 2012 16:58:25 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Enhance OSX Finder</title>
		<link>http://www.win-vector.com/blog/2012/05/enhance-osx-finder/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=enhance-osx-finder</link>
		<comments>http://www.win-vector.com/blog/2012/05/enhance-osx-finder/#comments</comments>
		<pubDate>Fri, 11 May 2012 16:58:20 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computers]]></category>
		<category><![CDATA[Public Service Article]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Finder]]></category>
		<category><![CDATA[OSX]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1997</guid>
		<description><![CDATA[I tend to prefer command line Linux and full window OSX for my work. The development and data handling tool chain is a bit better in Linux and the user interface reliability of the complete vertical stack is a bit better in OSX. I repeat here a couple of tips I found to improve the [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I tend to prefer command line Linux and full window OSX for my work.  The development and data handling tool chain is a bit better in Linux and the user interface reliability of the complete vertical stack is a bit better in OSX.   I repeat here a couple of tips I found to improve the OSX finder.<span id="more-1997"></span>A key feature missing from the OSX finder is a convenient &#8220;open another finder in the current directory.&#8221;  The finder does have the &#8220;open a new finder on each folder-click option&#8221; but that litters your desktop with many useless intermediate finders.  You can also right click on a folder in the Finder pathbar (which defaults to not visible) to get a menu up allowing a new finder to launch via &#8220;Open Enclosing Folder.&#8221;  For me (even with the path bar on) neither of these methods really flows.</p>
<p>Fortunately Apple does supply the user with incredibly powerful tools such as AppleScript.  This will allow us to add an always present single click (no control or right-click and no menu) action to our Finder toolbar.  </p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2012/05/Finder.png" alt="Finder" title="Finder.png" border="0" width="300" height="128" /></p>
<p>To add the ability to open a new finder with a single click, all one has to do is the following (we are not distributing this as finished code as it is not our work and we don&#8217;t want to encourage people to download and use unexamined code).</p>
<ol>
<li>Open the AppleScript Editor</li>
<li>Select File->New</li>
<li>Paste the following into the AppleScript Editor (from: <a target="_blank" href="http://hints.macworld.com/article.php?story=20080108144434753">dtomasch (Daniel T)  on Macworld Mac OSX hints</a>)<br />
<code></p>
<pre>
-- From: http://hints.macworld.com/article.php?story=20080108144434753
try
	tell application "Finder"
		activate
		set this_folder to (the target of the front window) as alias
		set {x1, y1} to position of front window
		make new Finder window to this_folder
		set position of front window to {(x1 + 50), (y1 + 150)} --This offsets the new window more than the average Finder tiling does
	end tell
end try
</pre>
<p></code>
</li>
<li>
Press the compile hammer.
</li>
<li>
Select File->Save As, choose file format Application and save as &#8220;OpenFinder.app&#8221;.
</li>
<li> (optional, for a nice icon)<br />
In a Finder window, right-click on the new application and select Show Package Contents.  Then replace Contents/Resources/applet.icns with <a target="_blank" href="http://eggy.deviantart.com/art/Finder-Icon-156265322">Pieter Stroink&#8217;s finder icon <img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2012/05/NewImage.png" alt="NewImage" title="NewImage.png" border="0" width="300" height="200" /></a> (the file Finder_without_Bag.icns from the downloadable zip file Finder_Icon_by_eggy.zip).
</li>
<li>
In a Finder window, right-click on some blank space in the upper toolbar and select Customize Toolbar.  Drop the new application in the toolbar.
</li>
</ol>
<p>Now all of your finders show the new application as an icon in the upper toolbar.  When you click on this icon you get a new finder in the current context.</p>
<p>But you can do more: you can repeat this pattern with other applications.  In my case I really like being able to open the command line terminal in the current finder directory.  One of the really cool things about OSX is it has a good command line terminal.  If you already are a terminal user you will like the following.</p>
<p>You can get the terminal in on this game by using <a target="_blank" href="http://hints.macworld.com/article.php?story=20020426093503563">TomWoozle (Tom Anthony)&#8217;s  Macworld Mac OSX hint</a> in the same pattern as above to create a button that opens a new terminal shell at the current finder location.  You just use this block of AppleScript code:</p>
<p><code></p>
<pre>
-- from: http://hints.macworld.com/article.php?story=20020426093503563
on run
	tell application "Finder"
		try
			activate
			set frontWin to folder of front window as string
			set frontWinPath to (get POSIX path of frontWin)
			tell application "Terminal"
				activate
				do script with command "cd \"" &#038; frontWinPath &#038; "\""
			end tell
		on error error_message
			beep
			display dialog error_message buttons ¬
				{"OK"} default button 1
		end try
	end tell
end run
</pre>
<p></code></p>
<p>and get the icons from <code>/Applications/Utilities/Terminal.app/Contents/Resources/Terminal.icns</code>.  </p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2012/05/Terminal.png" alt="Terminal" title="Terminal.png" border="0" width="300" height="300" />The absolute icing on the cake is OSX&#8217;s built in <code>"open"</code> command.  You can open a Finder pointing to the current terminal&#8217;s working directory by typing <code>"open ."</code> in the terminal (completing the circle).</p>
<p>Note: we have seen the open terminal app &#8220;stutter&#8221; or sometimes open two terminals.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/05/must-have-software/' rel='bookmark' title='Must Have Software'>Must Have Software</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/05/enhance-osx-finder/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>The differing perspectives of statistics and machine learning</title>
		<link>http://www.win-vector.com/blog/2012/05/the-differing-perspectives-of-statistics-and-machine-learning/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-differing-perspectives-of-statistics-and-machine-learning</link>
		<comments>http://www.win-vector.com/blog/2012/05/the-differing-perspectives-of-statistics-and-machine-learning/#comments</comments>
		<pubDate>Sun, 06 May 2012 16:05:58 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[machine learning perspective]]></category>
		<category><![CDATA[statistical perspective]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1985</guid>
		<description><![CDATA[In both working with and thinking about machine learning and statistics I am always amazed at the differences in perspective and view between these two fields. In caricature it boils down to: machine learning initiates expect to get rich and statistical initiates expect to get yelled at. You can see hints of what the practitioners [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/12/why-you-can-not-to-use-statistics-to-dispute-magic/' rel='bookmark' title='Why you can not to use statistics to dispute magic'>Why you can not to use statistics to dispute magic</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>In both <a target="_blank" href="http://www.win-vector.com/RecentClients/RecentClients.html">working with</a> and <a target="_blank" href="http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/">thinking about</a> machine learning and statistics I am always amazed at the differences in perspective and view between these two fields.  In caricature it boils down to: machine learning initiates expect to get rich and statistical initiates expect to get yelled at.  You can see hints of what the practitioners expect to encounter by watching their preparations and initial steps.<span id="more-1985"></span>Machine learning experts anticipate solving a code or riddle.   The assumption seems to be we will encounter a problem that is difficult due to its intrinsic structure or shape (and the difficulty is not from something as mundane as measurement).</p>
<p>Telling stereotypical machine learning examples and methods include:</p>
<ul>
<li>
The XOR problem as an important violation of linear separability: Minsky, Marvin Lee, and Seymour Papert.  Perceptrons: an introduction to computational geometry, 1st ed. Cambridge, Massachusetts: MIT Press, 1969.
</li>
<li>
The &#8220;two spirals problem&#8221; found in Kevin J. Lang and Michael J. Witbrock, &#8220;Learning to Tell Two Spirals Apart&#8221;, in Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, 1988.
</li>
<li>
The continuing fascination and re-discovery of biologically inspired learning techniques (in the hope that they may work even if we don&#8217;t yet know why): <a target="_blank" href="http://en.wikipedia.org/wiki/Perceptron">perceptrons</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Self-organizing_map">self organizing maps</a> and <a target="_blank" href="http://en.wikipedia.org/wiki/Cellular_automaton">cellular automata</a>.
</li>
<li>
Logical methods like <a target="_blank" href="http://en.wikipedia.org/wiki/Version_space">version spaces</a>; and other <a target="_blank" href="http://en.wikipedia.org/wiki/Novum_Organum">Novum Organum</a> inspired methods (like <a target="_blank" href="http://en.wikipedia.org/wiki/Resolution_(logic)">resolution</a> and classic symbolic AI planning).
</li>
<li>
The complementary relation of cryptography and machine learning (at most one of them can be an easy endeavor): Cryptography and Machine Learning by Ronald L. Rivest. Proceedings ASIACRYPT &#8217;91 (Springer 1993), 427&#8211;439.
</li>
</ul>
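<p>As a concrete instance of the first item, the sketch below trains a plain perceptron on the four-row AND and XOR truth tables.  The function name <code>perceptron_converges</code>, the learning rate of 1 and the cap of 100 epochs are illustrative choices of ours and are not taken from the cited references.  Failing to converge within the cap is only a demonstration (XOR is known to have no linear separator), not a proof.</p>
<pre>
# Hypothetical helper: classic perceptron updates on 2-input binary data.
def perceptron_converges(examples, max_epochs=100):
    """Return True if some epoch finishes with zero classification errors."""
    weights = [0, 0]
    bias = 0
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), target in examples:
            activation = weights[0] * x1 + weights[1] * x2 + bias
            prediction = 1 if activation > 0 else 0
            delta = target - prediction  # -1, 0 or +1
            if delta != 0:
                errors += 1
                weights[0] += delta * x1
                weights[1] += delta * x2
                bias += delta
        if errors == 0:
            return True
    return False

AND_TABLE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_TABLE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND converges:", perceptron_converges(AND_TABLE))
print("XOR converges:", perceptron_converges(XOR_TABLE))
</pre>
<p>AND is linearly separable, so the updates settle after a few passes.  For XOR no single set of weights classifies all four rows, so an error-free pass never occurs.</p>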
<p>There is an identifiable theme that all of the data is before us and it is just a matter of finding its secrets using either well founded methods or arcane methods.  Even if none of the variables or measurements initially available are immediately useful perhaps some combination of them will be.  The first order of business is to find the right combination or transformation.  It is just a matter of sufficiently clever computation.   </p>
<p>The initial activity of a machine learning practitioner is often to choose among sophisticated representations, model forms and tools.   And having such powerful tools machine learning practitioners rush where statisticians traditionally fear to tread.</p>
<p>Statisticians, on the other hand, have very good descriptions of what often goes wrong in even observing simple data.  </p>
<p>For example:  artificial intelligence and machine learning ideas such as version spaces, linear separability and logical entailment all depend on data without a single transcription error.  A single positive example deep in the center of the mass of positive examples can kill these methods if it is mis-transcribed as a negative.  These issues can be solved (for example, the emergence of <a target="_blank" href="http://en.wikipedia.org/wiki/Support_vector_machine#Soft_margin">soft margin classifiers</a> to deal with error).  But it is telling that error and data distribution were not the first concerns.</p>
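<p>A minimal sketch of that fragility, using made-up one-dimensional data: the helper <code>separable_by_threshold</code> (a hypothetical name) checks exactly whether any single threshold splits the labels, and one flipped label in the middle of the positives is enough to make the answer change.</p>
<pre>
# Hypothetical helper: exact separability test for 1-D data with distinct x values.
def separable_by_threshold(points):
    """points: list of (x, label) pairs with labels 0 or 1 and no repeated x."""
    labels = [label for _, label in sorted(points)]
    for cut in range(len(labels) + 1):
        left, right = labels[:cut], labels[cut:]
        # Negatives on the left and positives on the right...
        if set(left).issubset({0}) and set(right).issubset({1}):
            return True
        # ...or the mirror image.
        if set(left).issubset({1}) and set(right).issubset({0}):
            return True
    return False

clean = [(-3, 0), (-2, 0), (-1, 0), (1, 1), (2, 1), (3, 1)]
# Same data, but the positive example at x=2 is mis-transcribed as a negative.
noisy = [(x, 0 if x == 2 else label) for x, label in clean]

print("clean separable:", separable_by_threshold(clean))
print("noisy separable:", separable_by_threshold(noisy))
</pre>
<p>The clean data splits at zero.  The noisy copy has a negative sitting between two positives, so no threshold in either orientation separates it.</p>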
<p>Statisticians produce a lot of results describing data quality issues like:</p>
<ul>
<li>
Error/Noise.  As mentioned above, we must assume there may be data in error.  That is why methods like logistic regression use maximum likelihood (try to be consistent with as much of the mass of the data as possible) instead of ideas like margin and separability.
</li>
<li>
Collinearity.   An example of this is when variables that individually are useful fail to perform even better when used together (as they correlate or are collinear with each other, so they each have reduced marginal value once you have some of these variables in your model).
</li>
<li>
<a target="_blank" href="http://en.wikipedia.org/wiki/Simpson's_paradox">Simpson&#8217;s paradox</a>.   Where a treatment can look better in all sub-experiments, yet look worse overall.
</li>
<li>
<a target="_blank" href="http://en.wikipedia.org/wiki/Nuisance_variable">Nuisance variables</a>.   Variables that predict the outcome, but not in a useful or controllable way.   A simple example would be the day of week&#8217;s impact on web traffic.  You can&#8217;t control the day of the week (pay to have more Mondays) but if you don&#8217;t deal with its influence you may mistakenly assign some of its influence to a treatment that overlapped more Mondays than an alternative.
</li>
</ul>
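<p>The Simpson&#8217;s paradox item is easiest to believe with numbers in hand.  The sketch below uses made-up success counts (the treatment names, group names and counts are all our own assumptions) in which treatment A has the higher success rate in each group yet the lower rate overall.</p>
<pre>
# Made-up (successes, trials) counts, chosen only to exhibit the reversal.
counts = {
    "A": {"easy cases": (9, 10), "hard cases": (30, 90)},
    "B": {"easy cases": (80, 90), "hard cases": (3, 10)},
}

def rate(successes, trials):
    return successes / trials

def overall_rate(groups):
    """Pool every group of one treatment before taking the rate."""
    successes = sum(s for s, _ in groups.values())
    trials = sum(t for _, t in groups.values())
    return rate(successes, trials)

for group in ("easy cases", "hard cases"):
    a = rate(*counts["A"][group])
    b = rate(*counts["B"][group])
    print(f"{group}: A={a:.3f} B={b:.3f}")

print(f"overall: A={overall_rate(counts['A']):.3f} B={overall_rate(counts['B']):.3f}")
</pre>
<p>The reversal comes from the allocation: A was mostly tried on hard cases and B mostly on easy ones, so the pooled rates largely measure case mix and not the treatment.</p>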
<p>The statistical practitioner usually starts by examining single variable effects.  They test if variables are reliable and test to what extent they remain useful after adding more variables.  The statistician doesn&#8217;t expect some clever combination of variables to out-perform all of its constituent parts, but to build an ensemble of variables such that each variable is not performing much worse than when it was used alone (so the quality of the model is nearly additive in the number of chosen variables).</p>
<p>It is one of our maxims that the major source of deep statistical problems is poor record keeping (or experimental design).   With perfect records you would not need a lot of the more powerful statistical tools.  If you had sufficiently detailed records of intermediate states you would not need big statistical tools like <a target="_blank" href="http://en.wikipedia.org/wiki/Bayesian_network">Bayesian networks</a> or <a target="_blank" href="http://en.wikipedia.org/wiki/Conditional_random_field">conditional random fields</a>.  However, much of what we are calling &#8220;bad record keeping&#8221; is the excusable failure to have records of important <em>unobservable</em> states (though the statistics gets just as hard when things that could have been recorded are accidentally mixed, aggregated, truncated or censored).</p>
<p>The statistician tends to use sophisticated methods to validate and repair data, not to decode complex hidden relations.  A <a target="_blank" href="http://www.linkedin.com/pub/philip-apps/1/23/42b">friend of mine</a> characterizes the statistical view as &#8220;you have to get up pretty early in the morning to beat linear regression.&#8221;</p>
<p>The machine learning practitioner tends to have much better tools for dealing with difficult relations (they are not forced to think in linear terms, especially with the use of <a target="_blank" href="http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/">kernel methods</a> to allow richer model forms, though the statistician does have their own methods like <a target="_blank" href="http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/">generalized additive models</a>).</p>
<p>To know which view is more advantageous for a given problem you just need to think clearly about whether you are more worried about functional form (so you should look to machine learning) or issues of measurement (so you should look to statistics).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2010/10/a-personal-perspective-on-machine-learning/' rel='bookmark' title='A Personal Perspective on Machine Learning'>A Personal Perspective on Machine Learning</a></li>
<li><a href='http://www.win-vector.com/blog/2011/12/why-you-can-not-to-use-statistics-to-dispute-magic/' rel='bookmark' title='Why you can not to use statistics to dispute magic'>Why you can not to use statistics to dispute magic</a></li>
<li><a href='http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/' rel='bookmark' title='Statistics to English Translation, Part 2b: Calculating Significance'>Statistics to English Translation, Part 2b: Calculating Significance</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/05/the-differing-perspectives-of-statistics-and-machine-learning/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How to remember point shape codes in R</title>
		<link>http://www.win-vector.com/blog/2012/04/how-to-remember-point-shape-codes-in-r/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-to-remember-point-shape-codes-in-r</link>
		<comments>http://www.win-vector.com/blog/2012/04/how-to-remember-point-shape-codes-in-r/#comments</comments>
		<pubDate>Tue, 24 Apr 2012 17:17:30 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[graphing]]></category>
		<category><![CDATA[plotting]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1971</guid>
		<description><![CDATA[I suspect I am not unique in not being able to remember how to control the point shapes in R. Part of this is a documentation problem: no package ever seems to write the shapes down. All packages just use the &#8220;usual set&#8221; that derives from S-Plus and was carried through base-graphics, to grid, lattice [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/' rel='bookmark' title='Your Data is Never the Right Shape'>Your Data is Never the Right Shape</a></li>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2011/12/my-favorite-graphs/' rel='bookmark' title='My Favorite Graphs'>My Favorite Graphs</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I suspect I am not unique in not being able to remember how to control the point shapes in <a target="_blank" href="http://cran.r-project.org/">R</a>.  Part of this is a documentation problem: no package ever seems to write the shapes down.  All packages just use the &#8220;usual set&#8221; that derives from S-Plus and was carried through base-graphics, to grid, lattice and ggplot2.  The quickest way out of this is to know how to generate an example plot of the shapes quickly.  We show how to do this in <a target="_blank" href="http://had.co.nz/ggplot2/">ggplot2</a>.  This is trivial, but you get tired of not having it immediately available.<span id="more-1971"></span><code></p>
<pre>
library(ggplot2)
ggplot(data=data.frame(x=c(1:16))) + geom_point(aes(x=x,y=x,shape=x))
</pre>
<p></code></p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2012/04/symbols1.png" alt="Symbols1" title="symbols1.png" border="0" width="600" height="600" /><br />
Or if you are feeling more daring:</p>
<p><code></p>
<pre>
ggplot(data=data.frame(x=c(1:16))) + geom_point(aes(x=x,y=x,shape=x)) +
   facet_wrap(~x,scales='free')
</pre>
<p></code></p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2012/04/symbols2.png" alt="Symbols2" title="symbols2.png" border="0" width="600" height="600" /></p>
<p>Update 4-25-2012</p>
<p>As Idris kindly pointed out in the comments, the above no longer works.  I suspect what changed is something in the transition to R 2.15.0.  The following hack does work:</p>
<p><code></p>
<pre>
sum &lt;- ggplot()
for(i in 1:16) {
   sum &lt;- sum +
      geom_point(data=data.frame(x=c(i)),aes(x=x,y=x),shape=i)
}
sum
</pre>
<p></code></p>
<p>The trick is that outside the aes() an integer appears to work, and at least 16 values work (instead of only around 6 inside the aes()).  Frankly we are doing a lot of work to dance around R&#8217;s delayed evaluation and variable binding rules in this case.  The ggplot2 examples have always plugged factors into the shapes, but I had always assumed that was to densify the indexing (get all the shape numbers into an interval) and not a requirement for using the shape parameter.  The <a target="_blank" href="http://had.co.nz/ggplot2/geom_point.html">online documentation</a> of shape has always been a line of text of the form:</p>
<table width='80%'>
<tr>
<th>Aesthetic</th>
<th>Default</th>
<th>Related scales</th>
</tr>
<tr>
<td>shape</td>
<td>16</td>
<td>identity, manual, shape</td>
</tr>
</table>
<p>Which to my mind doesn&#8217;t specify if you have to supply a factor or not.  But you really should buy Professor Hadley Wickham&#8217;s excellent book: <a href="http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403">ggplot2: Elegant Graphics for Data Analysis</a> (not an affiliate link).  My copy (like almost everything else I own) just happens to be packed in a box right now, so I can not easily consult it.</p>
<p>And it looks like we do get some nice bug fixes with this update.  The missing scales on some of the facet_wrap panes appear to be fixed (yey!):</p>
<p><code></p>
<pre>
ggplot(data=data.frame(x=c(1:16))) + geom_point(aes(x=x,y=x)) +
    facet_wrap(~x,scales='free')
</pre>
<p></code></p>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2012/04/Rplot.png" alt="Rplot" title="Rplot.png" border="0" width="300" height="300" /></p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/' rel='bookmark' title='Your Data is Never the Right Shape'>Your Data is Never the Right Shape</a></li>
<li><a href='http://www.win-vector.com/blog/2011/02/the-cranky-guide-to-trying-r-packages/' rel='bookmark' title='The cranky guide to trying R packages'>The cranky guide to trying R packages</a></li>
<li><a href='http://www.win-vector.com/blog/2011/12/my-favorite-graphs/' rel='bookmark' title='My Favorite Graphs'>My Favorite Graphs</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/04/how-to-remember-point-shape-codes-in-r/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Congratulations to both Dr. Nina Zumel and EMC- great job</title>
		<link>http://www.win-vector.com/blog/2012/04/congratulations-to-both-dr-nina-zumel-and-emc-great-job/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=congratulations-to-both-dr-nina-zumel-and-emc-great-job</link>
		<comments>http://www.win-vector.com/blog/2012/04/congratulations-to-both-dr-nina-zumel-and-emc-great-job/#comments</comments>
		<pubDate>Sat, 21 Apr 2012 21:01:32 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[data science]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[data analytics]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Training]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1966</guid>
		<description><![CDATA[A big congratulations to Win-Vector LLC&#8216;s Dr. Nina Zumel for authoring and teaching portions of EMC&#8216;s new Data Science and Big Data Analytics training and certification program. A big congratulations to EMC, EMC Education Services and Greenplum for creating a great training course. Finally a huge thank you to EMC, EMC Education Services and Greenplum [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2012/04/setting-expectations-in-data-science-projects/' rel='bookmark' title='Setting expectations in data science projects'>Setting expectations in data science projects</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Sorting Used in Anger'>Sorting Used in Anger</a></li>
<li><a href='http://www.win-vector.com/blog/2011/12/my-favorite-graphs/' rel='bookmark' title='My Favorite Graphs'>My Favorite Graphs</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>A big congratulations to  <a target="_blank" href="http://www.win-vector.com/">Win-Vector LLC</a>&#8216;s <a target="_blank" href="http://www.win-vector.com/Staff/NinaZumel/NinaZumel.html">Dr. Nina Zumel</a> for authoring and teaching portions of <a target="_blank" href="http://www.emc.com/">EMC</a>&#8216;s new <a target="_blank" href="http://education.emc.com/guest/campaign/data_science.aspx">Data Science and Big Data Analytics</a> training and certification program.  A big congratulations to EMC, <a target="_blank" href="https://education.emc.com/default_guest.aspx">EMC Education Services</a> and <a target="_blank" href="http://www.greenplum.com/">Greenplum</a> for creating a great training course.  Finally a huge thank you to EMC, EMC Education Services and Greenplum for inviting Win-Vector LLC to contribute to this great project.</p>
<p><a target="_blank" href="http://education.emc.com/guest/campaign/data_science.aspx"><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2012/04/389273_10150730223199318_602824317_9375276_1010737649_n.jpg" alt="389273 10150730223199318 602824317 9375276 1010737649 n" title="389273_10150730223199318_602824317_9375276_1010737649_n.jpg" border="0" width="427" height="600" /></a><span id="more-1966"></span><a target="_blank" href="http://education.emc.com/guest/campaign/data_science.aspx"><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2012/04/NZLecture.jpg" alt="NZLecture" title="NZLecture.jpg" border="0" width="600" height="314" /></a></p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2012/04/setting-expectations-in-data-science-projects/' rel='bookmark' title='Setting expectations in data science projects'>Setting expectations in data science projects</a></li>
<li><a href='http://www.win-vector.com/blog/2008/04/sorting-in-anger/' rel='bookmark' title='Sorting Used in Anger'>Sorting Used in Anger</a></li>
<li><a href='http://www.win-vector.com/blog/2011/12/my-favorite-graphs/' rel='bookmark' title='My Favorite Graphs'>My Favorite Graphs</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/04/congratulations-to-both-dr-nina-zumel-and-emc-great-job/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Setting expectations in data science projects</title>
		<link>http://www.win-vector.com/blog/2012/04/setting-expectations-in-data-science-projects/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=setting-expectations-in-data-science-projects</link>
		<comments>http://www.win-vector.com/blog/2012/04/setting-expectations-in-data-science-projects/#comments</comments>
		<pubDate>Sat, 21 Apr 2012 18:13:08 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[data science]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[analytics project planning]]></category>
		<category><![CDATA[data science projet planning]]></category>
		<category><![CDATA[project planning]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1958</guid>
		<description><![CDATA[How is it even possible to set expectations and launch data science projects? Data science projects vary from &#8220;executive dashboards&#8221; through &#8220;automate what my analysts are already doing well&#8221; to &#8220;here is some data, we would like some magic.&#8221; That is you may be called to produce visualizations, analytics, data mining, statistics, machine learning, method [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/' rel='bookmark' title='Hello World: An Instance Of Rhetoric in Computer Science'>Hello World: An Instance Of Rhetoric in Computer Science</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/' rel='bookmark' title='Your Data is Never the Right Shape'>Your Data is Never the Right Shape</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>How is it even possible to set expectations and launch data science projects?</p>
<p>Data science projects vary from &#8220;executive dashboards&#8221; through &#8220;automate what my analysts are already doing well&#8221; to &#8220;here is some data, we would like some magic.&#8221;   That is, you may be called on to produce visualizations, analytics, data mining, statistics, machine learning, method research or method invention.  Given the wide range of wants, diverse data sources, required levels of innovation and methods, it often feels like you can not even set goals for data science projects.</p>
<p>Many of these projects either fail or become open ended (become unmanageable).</p>
<p>As an alternative we describe some of our methods for setting quantifiable goals and front-loading risk in data science projects.<span id="more-1958"></span><br />
<h2>The typical situation</h2>
<p>Data science projects are often considered untrackable because either there is &#8220;some magic&#8221; expected, there is no prior bound on how dirty the incoming data is or there is no prior definition of what a good result would look like.  An example might be &#8220;invent a method to use website visit data to predict who is likely to purchase from us.&#8221;</p>
<p>When magic is expected you are really talking about invention and not a data science deployment.  The research that more naturally fits into a data science project is both: learning the nature of the domain, problem and data; and traditional literature research (are there known methods that help with our situation?).  You can schedule intervals of invention spikes into a larger project; but you can not really specify outcomes of these spikes (&#8220;magic method solves problem X by February 7th&#8221;).  So these should be seen as tasks in a larger project that may or may not help.  The entire project should not fully depend on them.  The project must be able to succeed even when the invention spikes fail.</p>
<p>Additional red flags include lack of description of the input data and no concrete definition of the outcome we are trying to predict (Purchase ever? Purchase in the next month? Spend at least $100?).   </p>
<p>Often the first fix to the project ask happens as: &#8220;we better at least quantify expected performance: let&#8217;s insist on an accuracy of 95%.&#8221;  This often happens in a business meeting late in the project launch when it is noticed that what is likely a large and important project has absolutely no acceptance criteria. Unfortunately this &#8220;bar&#8221; is often set without any research into whether accuracy <a target="_blank" href="http://www.win-vector.com/blog/tag/precision-and-recall/">is even the correct measure</a>, whether 95% is easy or hard, or even whether the enterprise will be profitable at this accuracy.  The intent is good, but the arbitrary goal (that nobody will really be held to) is a step backwards.  </p>
<p>What we need to do is: schedule dedicated time to learn about the domain and data before writing project goals and scope.  This itself can be part of a small concrete expectation setting project.   To complete the expectation setting project we need reusable methods to set useful, realistic goals that really measure if a data science project is on track (i.e. that a data science project can be held to).   We outline a few methods to generate prior estimates for two of the important data science project measures: model performance and business utility.</p>
<h2>Minimal components to safely ensure success</h2>
<p>At a minimum a project must have an observable quantifiable measure of success.  So it makes sense to work on setting this expectation first.  That does not mean the success criterion needs to be set in stone- as this is often not possible.  Instead it means you often have to commission an initial research project to quantify what sort of outcome is even possible and if such an outcome would make sense for the business.  This unknown result determines if the project even has a chance to succeed, so it makes sense to try and eliminate the hidden project risk it represents by determining success criteria as a separate project.   The overall modeling project often should not even be commissioned until the expectation setting project is complete.  The result may indicate no further data science work is appropriate until features are added to engineering systems or the business.  But this is good: nobody wants to start doomed projects, they instead want to know what changes to implement to allow a successful project to be launched later.  </p>
<p>You can in fact run data science projects as you would run any development project (all projects have risks and unknowns- so these problems are not in fact unique to data science projects).  It is just that you can not, unfortunately, run data science projects in parallel with developing initial measurement and feedback systems. This is one case where starting a bridge from two shores to meet in the middle does not decrease project time.  Specific measurement, control and feedback in a data science project requires running a few cars across the bridge (but won&#8217;t require all lanes be ready at the start).</p>
<h2>Methods for expectation setting</h2>
<p>The expectation setting part of a data science project is to estimate how well a very good model would perform <em>without paying the time and cost of producing the model</em>.   This may seem impossible, but there are methods that estimate performance and utility with moderate effort.  To show the flavor of this idea we list a few methods to estimate performance and a few methods to estimate utility.</p>
<h3>Methods to prior estimate model performance</h3>
<p>You need to know what prediction performance to commit to.  Some ways to prior estimate this are given here:</p>
<ul>
<li>Current performance
<p>Data scientists don&#8217;t work in a vacuum.  Usually we are trying to build a model that will be used to improve a business process.  This implies there is already some process in place.   It could be something as simple as &#8220;offer all return visitors a recommendation&#8221; or even a hand tuned set of business rules.  Do not be too proud, too polite or too rushed to measure the current system&#8217;s performance <em>as if it were a classifier</em>.  For example if the current policy is &#8220;offer all return visitors a recommendation&#8221; measure what percentage of them buy (getting at precision) and what percentage of first time visitors buy (getting at recall).</p>
<p>If the current system looks like a very high performance classifier (near perfect precision and recall) then you are not going to be able to usefully improve on it (so nobody should want the task of improving on it).  You may want the task of automating it (if parts of it are human driven)- but you now have guidance where to go for rules and training data.  If the current system looks like a low performance classifier then you should not agree to develop a very high performance replacement as your immediate project.  If the business is running with 50% accuracy, then it is plausible the business will run better with 70% accuracy and it does not make sense to propose hitting 95% accuracy as a first project (such a project may be unrealistic or may just take longer).
</li>
<li>3 by 5 card estimation
<p>This is an especially useful technique in trying to automate high quality human judgements (often done so they can be applied at a larger scale or higher speed).  Get some time with the experts whose work you are trying to extend.  If you can&#8217;t get the time for the initial project- then you already know any larger project would be doomed, so this itself is a good up-front test.  Then ask them to do their job on paper.  For example: suppose the task is to send a coupon to somebody likely to use it; prepare by sending a lot of coupons at random and recording who used the coupons (again this is a good gatekeeper or risk that should be front-loaded; if an organization is not ready for quick measurements and A/B tests, adding these capabilities is far more important than any model construction).  Have the experts pick (using all information available) who should have gotten coupons.  Ask what information they used.   Use your known hidden outcomes to evaluate performance.  Prepare 3 by 5 cards with only the chosen information and see if they can indeed predict at their historic rate with the limited information.  If they can do it, a model may be able to do it (and the less you need to put into the model the less engineering is needed); if they can not do it, a model may not be able to do it either.
</li>
<li>Bayes limit estimation
<p>One property shared by most models is that they report the same prediction given two identical examples.  This &#8220;same input produces same output&#8221; observation puts an upper bound on classifier performance called the Bayes limit.  Even a perfect model can not outperform the Bayes limit.   You can design a cross-validation study on your training data to estimate the Bayes limit without building a real deployable model.  Generate pairs of training examples with what you consider to be identical or nearly identical input patterns.  You then see how good the known outcome from one example is at predicting the known outcome of the other.  This is an instance of a permutation test or <a target="_blank" href="http://en.wikipedia.org/wiki/Resampling_(statistics)">resampling simulation</a>.  We do not have to pair all of the training data, just find nearest neighbors for an appropriate random sample to estimate training data variation (with respect to the whole data set, this can be done with a single table scan).  </p>
<p>An actual implementation of this estimate as a final method would be a <a target="_blank" href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">nearest neighbor model</a>, which may or may not be advisable for actual deployment.  Two of the downsides of nearest neighbor algorithms are their computational cost and poor generalization.   The computational cost is proportional to the number of queries times the training set size (which is okay for scoring a sample of examples but unacceptable in production, meaning you often must bring in some sort of <a target="_blank" href="http://en.wikipedia.org/wiki/Locality-sensitive_hashing">sophisticated dimension reduction technique</a>).  Poor generalization (or over fitting) is why nearest neighbor algorithms tend not to perform as well on truly novel data as they do during naive cross validation.</p>
<p>Think of the Bayes limit as an estimated upper bound, you probably won&#8217;t do better than it and you likely will (intentionally) do worse (as you trade performance on historic data for better performance on new data and efficiency).  If your initial Bayes limit is poor, then no amount of modeling will help- you need more features and feature engineering (a different sort of project).
</li>
</ul>
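<p>The nearest neighbor pairing described under Bayes limit estimation can be sketched in a few lines of R.  This is only an illustration under stated assumptions: the data is synthetic, and the names <code>estimateNNAgreement</code>, <code>d</code>, <code>x1</code>, <code>x2</code> and <code>y</code> are made up for the example.  For a random sample of rows it finds each row&#8217;s nearest neighbor (by Euclidean distance, itself an assumption about what &#8220;nearly identical&#8221; means) and reports how often the two known outcomes agree.</p>
<blockquote>
<pre>
# Sketch: how well do near-identical examples predict each other?
# All names here (estimateNNAgreement, d, x1, x2, y) are hypothetical.
estimateNNAgreement &lt;- function(features, outcome, nSample = 200) {
  features &lt;- as.matrix(features)
  n &lt;- nrow(features)
  idx &lt;- sample.int(n, size = min(nSample, n))
  agree &lt;- vapply(idx, function(i) {
    # squared Euclidean distance from row i to every row (one table scan)
    d2 &lt;- rowSums(sweep(features, 2, features[i, ])^2)
    d2[i] &lt;- Inf            # never match a row to itself
    j &lt;- which.min(d2)      # index of the nearest neighbor
    outcome[i] == outcome[j]
  }, logical(1))
  mean(agree)               # fraction of pairs with matching outcomes
}

set.seed(2012)
n &lt;- 1000
d &lt;- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# synthetic outcome: driven by x1 plus noise the features can not explain
d$y &lt;- (d$x1 + rnorm(n)) &gt; 0
agreement &lt;- estimateNNAgreement(d[, c("x1", "x2")], d$y)
print(agreement)
</pre>
</blockquote>
<p>Treat the agreement rate as a ballpark figure and not an exact limit: two noisy outcomes observed at the same input agree with each other no more often than the best possible prediction would be right.  Features on very different scales should probably be rescaled before computing distances.</p>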
<h3>Methods to prior estimate business utility</h3>
<p>What the business really needs to know is if the promised increase in classifier performance leads to a desired increase in business (customers, revenue or anything as long as it is specific and measurable).  From your performance calibration exercises you should have a reasonable target classification or modeling improvement in mind.  You want to quantify the expected business impact of the proposed amount of classification improvement.  This is where we are converting from statistical significance (is the math on our side) to <a target="_blank" href="http://en.wikipedia.org/wiki/Clinical_significance">clinical significance</a> (will it drive an appreciable change in outcome).</p>
<ul>
<li>Retrospective simulation
<p>From the historic business data choose uniformly at random a sample of interactions that meet your target classifier performance.  That is, if you think your achievable goal is a 10% increase in precision and a 2% decrease in recall (often you want to or are forced to trade precision and recall): generate a sample of customers from historic offers that meet this pattern.  This doesn&#8217;t require a model- just access to historic data.  Then measure the change in revenue if these had been your entire customer set versus an unbiased sample of the same size.
</li>
<li>Secant line method
<p>This can be a prospective or active study.  If you feel you can build a model which increases precision by 10% then get permission to degrade the running site precision by 10% (for example: make 10% more offers at random to degrade the current model or procedure).  If the degradation has no effect then probably the improvement will have no effect.   You want to run this study on a small sub-population as it confirms utility by losing money.  The idea is: it is easier to break things than improve them and the rate of change in one direction isn&#8217;t a bad approximation of the rate of change in the opposite direction.
</li>
<li>Wizard of Oz method
<p>Deploy a simulation of a better model by more expensive means (with the intention of doing the engineering work to get a reasonable implementation if the experiment yields good results).  If your goal is to build a machine that &#8220;produces color combinations for you as good as a designer&#8221; simulate the effect of success by paying a designer to work with a small subset of your customers.  If the designer doesn&#8217;t improve revenue and/or customer satisfaction then even an algorithm as smart as the designer will also fail.  It is important that nobody confuses this experiment with a sustainable method that will be left in place.
</li>
</ul>
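<p>As a rough illustration of the retrospective simulation idea, here is a small R sketch.  Everything in it is an assumption made for the example: the <code>history</code> table is synthetic, the column names <code>bought</code> and <code>revenue</code> and the function <code>simulateRevenue</code> are hypothetical, and the &#8220;10% increase in precision&#8221; is read as ten percentage points.  It only models the precision change (not the recall trade) by drawing a sample of historic offers with the target mix of buyers and comparing its revenue to an unbiased sample of the same size.</p>
<blockquote>
<pre>
# Sketch of a retrospective simulation on synthetic "historic" offer data.
set.seed(2012)
n &lt;- 5000
history &lt;- data.frame(bought = runif(n) &lt; 0.2)
# assumption: revenue is zero for non-buyers and lognormal for buyers
history$revenue &lt;- ifelse(history$bought, rlnorm(n, meanlog = 3), 0)

simulateRevenue &lt;- function(history, targetPrecision, sampleSize) {
  buyers &lt;- which(history$bought)
  others &lt;- which(!history$bought)
  nBuy &lt;- round(targetPrecision * sampleSize)
  # draw offers so the chosen set has (about) the target precision
  chosen &lt;- c(sample(buyers, nBuy), sample(others, sampleSize - nBuy))
  sum(history$revenue[chosen])
}

currentPrecision &lt;- mean(history$bought)
targetPrecision &lt;- currentPrecision + 0.10   # assumed achievable improvement
sampleSize &lt;- 1000
# baseline: an unbiased sample of the same size
baseline &lt;- sum(history$revenue[sample.int(n, sampleSize)])
improved &lt;- simulateRevenue(history, targetPrecision, sampleSize)
print(c(baseline = baseline, improved = improved))
</pre>
</blockquote>
<p>With heavy tailed revenue a single draw can be noisy, so in practice you would likely repeat the sampling many times and look at the distribution of the difference.</p>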
<h2>Review of Purpose</h2>
<p>In all cases you are trying to get specific early and to front load risk.   You are trying to fill in unknowns (what is our current performance?, what is our sensitivity?, what do we need?) and front load risk (do we have the data?, does the data even differ between good and bad prospects?).  The results of a project like this can serve both as a gatekeeper to and source of specification for a follow-up actual data-science research and implementation project.  The first project itself should have a specific description like: &#8220;be confident in the estimates of the following measures of possible model quality, data availability and probable business impact by this date.&#8221;  After that those values can be used as part of a project scoping exercise for an actual data science implementation.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/' rel='bookmark' title='Hello World: An Instance Of Rhetoric in Computer Science'>Hello World: An Instance Of Rhetoric in Computer Science</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/' rel='bookmark' title='Your Data is Never the Right Shape'>Your Data is Never the Right Shape</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/04/setting-expectations-in-data-science-projects/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Small github reorganization</title>
		<link>http://www.win-vector.com/blog/2012/03/small-github-reorginization/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=small-github-reorginization</link>
		<comments>http://www.win-vector.com/blog/2012/03/small-github-reorginization/#comments</comments>
		<pubDate>Wed, 28 Mar 2012 21:12:02 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Administrativia]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[code examples]]></category>
		<category><![CDATA[examples]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[github]]></category>
		<category><![CDATA[projects]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1953</guid>
		<description><![CDATA[I would like to remind readers we are sharing more of our project code at https://github.com/WinVector. Also a heads-up that the former SQL-Screwdriver project has been split into WinVector/SQLScrewdriver and WinVector/Logistic (with a dependence of Logistic on SQLScrewdriver).
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
<li><a href='http://www.win-vector.com/blog/2010/09/lanchesters-law-why-small-advantages-swell-in-starcraft/' rel='bookmark' title='Lanchester&#8217;s Law: why small advantages swell in StarCraft'>Lanchester&#8217;s Law: why small advantages swell in StarCraft</a></li>
<li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I would like to remind readers we are sharing more of our project code at <a target="_blank" href="https://github.com/WinVector">https://github.com/WinVector</a>.<span id="more-1953"></span>Also a heads-up that the former SQL-Screwdriver project has been split into <a target="_blank" href="https://github.com/WinVector/SQLScrewdriver">WinVector/SQLScrewdriver</a> and <a target="_blank" href="https://github.com/WinVector/Logistic">WinVector/Logistic</a> (with a dependence of Logistic on SQLScrewdriver).</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/01/sql-screwdriver/' rel='bookmark' title='SQL Screwdriver'>SQL Screwdriver</a></li>
<li><a href='http://www.win-vector.com/blog/2010/09/lanchesters-law-why-small-advantages-swell-in-starcraft/' rel='bookmark' title='Lanchester&#8217;s Law: why small advantages swell in StarCraft'>Lanchester&#8217;s Law: why small advantages swell in StarCraft</a></li>
<li><a href='http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/' rel='bookmark' title='Gradients via Reverse Accumulation'>Gradients via Reverse Accumulation</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/03/small-github-reorginization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Modeling Trick: the Signed Pseudo Logarithm</title>
		<link>http://www.win-vector.com/blog/2012/03/modeling-trick-the-signed-pseudo-logarithm/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=modeling-trick-the-signed-pseudo-logarithm</link>
		<comments>http://www.win-vector.com/blog/2012/03/modeling-trick-the-signed-pseudo-logarithm/#comments</comments>
		<pubDate>Fri, 02 Mar 2012 05:19:42 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[arcsinh]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[signed pseudo logarithm]]></category>
		<category><![CDATA[stabilizing transform]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1946</guid>
		<description><![CDATA[Much of the data that the analyst uses exhibits extraordinary range. For example: incomes, company sizes, popularity of books and any &#8220;winner takes all process&#8221; (see: Living in A Lognormal World). Tukey recommended the logarithm as an important &#8220;stabilizing transform&#8221; (a transform that brings data into a more usable form prior to generating exploratory statistics, [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/' rel='bookmark' title='Your Data is Never the Right Shape'>Your Data is Never the Right Shape</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Much of the data that the analyst uses exhibits extraordinary range.  For example: incomes, company sizes, popularity of books and any &#8220;winner takes all process&#8221; (see: <a target="_blank" href="http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/">Living in A Lognormal World</a>).  Tukey recommended the logarithm as an important &#8220;stabilizing transform&#8221; (a transform that brings data into a more usable form prior to generating exploratory statistics, analysis or modeling).  One benefit of such transforms is: data that is <a target="_blank" href="http://en.wikipedia.org/wiki/Normal_distribution">normal</a> (or Gaussian) meets more of the stated expectations of common modeling methods like <a target="_blank" href="http://en.wikipedia.org/wiki/Least_squares">least squares</a> linear regression.  So data from distributions like the lognormal is well served by a <code>log()</code> transformation (that transforms the data closer to Gaussian) prior to analysis.  However, not all data is appropriate for a log-transform (such as data with zero or negative values).  We discuss a simple transform that we call a signed pseudo logarithm that is particularly appropriate to signed wide-range data (such as profit and loss).<span id="more-1946"></span>Log-transforming data is essential when analyzing systems that operate in relative terms or are &#8220;scale invariant&#8221; (such as financial returns).   For example <a target="_blank" href="http://en.wikipedia.org/wiki/Geometric_Brownian_motion">Geometric Brownian motion</a>  is a stochastic process central to a number of financial models (such as option pricing).  Geometric Brownian motion is actually the exponential of a standard linear Brownian motion (where increments are normal or Gaussian).   
In this case a logarithmic transformation actually moves from the observed data to a more natural frame of reference (where increments are additive instead of being multiplicative).  However, a major shortcoming of the log transform is its inability to deal with zero values and negative values.</p>
<p>A signed value we have often been asked to characterize or predict is: profit and loss (often called P&#038;L).    One natural model for P&#038;L would be as the difference between a revenue and an expense.  If the revenue and expense were both normally distributed then their difference would also be normal (and we would not need any stabilizing transform).  However, if the revenue and expense were both log-normally distributed (say both proportional to task size or some other log-normal parameter) then the difference is not normal (it retains the propensity for extreme values or heavy tails of the original distributions).  And for many financial size measures (company size, contract size and so on) the log-normal distribution is a much more realistic model than the normal distribution.  In some situations P&#038;L&#8217;s are formed from completely observed revenues and expenses (so we can model everything without sign problems); in other situations the signed P&#038;L comes from an unobserved (or unrecorded) underlying process and we are forced to deal with signed quantities.</p>
<p>For signed data we suggest the following transformation (code in <a target="_blank" href="http://cran.r-project.org/">R</a>):</p>
<blockquote><p>
<code><br />
pseudoLog10 &lt;- function(x) { asinh(x/2)/log(10) }<br />
</code>
</p></blockquote>
<p><code>asinh()</code> is a somewhat ugly function that is the inverse of <code>sinh()</code>.   <code>sinh()</code> is defined as:</p>
<blockquote><p>
<code><br />
sinh(x) = (e^x - e^(-x))/2<br />
</code>
</p></blockquote>
<p>The important point is for <code>x</code> such that <code>|x|</code> is large <code>2*sinh(x)</code> rapidly approaches <code>sign(x)*e^(|x|)</code>.  Thus we should expect <code>asinh(x/2)</code> to look a lot like <code>sign(x)*log(|x|)</code> (which is why we call it a signed pseudo logarithm).   For <code>pseudoLog10()</code> we take the previous function divided by <code>log(10)</code> to ensure that we are in log-10 like units (i.e. <code>pseudoLog10(100)</code> is nearly 2, <code>pseudoLog10(1000)</code> is nearly 3 and so on).  Business audiences tend to have an easier time with log-10 (or dB) units (which can be explained as counting the number of decimal digits) than natural log or log-e units.</p>
<p>So for large positive numbers <code>pseudoLog10()</code> pretty much behaves like <code>log10()</code> (itself a standard transform).  In fact <code>pseudoLog10()</code> has the following nice properties:</p>
<ol>
<li><code>pseudoLog10(x)</code> is defined for all real <code>x</code>.</li>
<li><code>pseudoLog10(0) = 0</code>.</li>
<li><code>pseudoLog10(-x) = -pseudoLog10(x)</code>.</li>
<li><code>pseudoLog10(x)</code> is monotone in <code>x</code>.</li>
<li>For <code>x</code> such that <code>|x|</code> is large: <code>pseudoLog10(x)</code> is very near <code>sign(x)*log10(|x|)</code>.</li>
</ol>
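<p>The listed properties are easy to spot check numerically.  The following is a small sketch (the test points in <code>x</code> and <code>big</code> are arbitrary choices made for this illustration); each <code>stopifnot()</code> fails loudly if the corresponding property does not hold at those points.</p>
<blockquote>
<pre>
pseudoLog10 &lt;- function(x) { asinh(x/2)/log(10) }
x &lt;- c(-1e6, -1000, -10, -1, 0, 1, 10, 1000, 1e6)
# properties 1 and 2: finite for all inputs, and zero at zero
stopifnot(all(is.finite(pseudoLog10(x))), pseudoLog10(0) == 0)
# property 3: odd symmetry
stopifnot(isTRUE(all.equal(pseudoLog10(-x), -pseudoLog10(x))))
# property 4: monotone (x is sorted increasing, so all steps must be positive)
stopifnot(all(diff(pseudoLog10(x)) &gt; 0))
# property 5: close to sign(x)*log10(|x|) when |x| is large
big &lt;- c(-1e6, -1000, 1000, 1e6)
print(pseudoLog10(big) - sign(big)*log10(abs(big)))
# near the origin the two differ: log10(1) is 0, pseudoLog10(1) is about 0.209
print(pseudoLog10(1))
</pre>
</blockquote>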
<p>We strongly recommend trying this transformation before feeding heavy tail data into a linear or logistic model.</p>
<p>However, we can not recommend the transformation for presentation.   Consider the simple case of plotting the distribution or density of normal data with mean zero and standard-deviation 100 (see  <a target="_blank" href="http://www.win-vector.com/blog/2011/12/my-favorite-graphs/">My Favorite Graphs</a> for description of a density plot):</p>
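<blockquote>
<pre>
library(ggplot2)
# signed pseudo logarithm, in log-10 like units
pseudoLog10 &lt;- function(x) { asinh(x/2)/log(10) }
# 1000 draws from a normal distribution with mean 0 and standard deviation 100
d &lt;- data.frame(x=rnorm(n=1000,sd=100))
# density plot of the untransformed data
ggplot(d) + geom_density(aes(x=x))
</pre>
</blockquote>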
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2012/03/normalDensity.png" alt="NormalDensity" title="normalDensity.png" border="0" width="400" height="400" /></p>
<p>The density plot shows what we would expect- a near normal distribution (most points towards the center and mass falling off quickly as we move away).  However, the plot of the pseudoLog10 transformed data is not what we would hope:</p>
<blockquote>
<pre>
# transform the data, then plot the density of the transformed values
d$pseudoLog10x &lt;- pseudoLog10(d$x)
ggplot(d) + geom_density(aes(x=pseudoLog10x))
</pre>
</blockquote>
<p><img style="display:block; margin-left:auto; margin-right:auto;" src="http://www.win-vector.com/blog/wp-content/uploads/2012/03/pseudoLog10Density.png" alt="PseudoLog10Density" title="pseudoLog10Density.png" border="0" width="400" height="400" /></p>
<p>The data density (falsely) appears bimodal!  This is because the <code>pseudoLog10()</code> transform is compressing ranges more and more violently as we move away from the origin (and not compressing near the origin).  So as we move away from the origin: the product of the real data density times the degree of range compression climbs, achieves a maximum and then falls.   This phenomenon (which is just a &#8220;change of variables&#8221; for densities) gives us the bimodal appearance for unimodal distributions that have significant mass outside of the range [-10,10].   The bimodal appearance is mostly a fact about the transform, not really a feature of the underlying data.</p>
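<p>The change of variables can be written out directly.  If <code>y = pseudoLog10(x)</code> then <code>x = 2*sinh(y*log(10))</code>, so the density of <code>y</code> is the density of <code>x</code> at that point times the derivative <code>2*log(10)*cosh(y*log(10))</code>.  The sketch below evaluates this for the mean zero, standard deviation 100 normal example (the function name <code>pseudoLog10Density</code> and the grid are choices made for this illustration, not part of the original analysis).</p>
<blockquote>
<pre>
# Density of pseudoLog10(X) for X normal with mean 0, by change of variables.
pseudoLog10Density &lt;- function(y, sd = 100) {
  x &lt;- 2*sinh(y*log(10))              # inverse of pseudoLog10
  dxdy &lt;- 2*log(10)*cosh(y*log(10))   # derivative of the inverse
  dnorm(x, mean = 0, sd = sd)*dxdy
}
step &lt;- 0.01
y &lt;- seq(-3.5, 3.5, by = step)
f &lt;- pseudoLog10Density(y)
# the density integrates to (about) 1 over this range
print(sum(f)*step)
# the peaks sit near +2 and -2, that is where |x| is near 100
print(abs(y[which.max(f)]))
# the value at the origin is well below the peak value
print(c(atZero = pseudoLog10Density(0), atPeak = max(f)))
</pre>
</blockquote>
<p>The dip at the origin and the two peaks appear even though the underlying normal density has its single maximum at zero, which is the artifact seen in the plot.</p>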
<p>We see value in examining the relative sizes and centers of these two modes for asymmetric distributions (such as the profit and loss statement for a set of accounts that are mostly losing money).  The position and relative sizes of the modes give us an initial hint what to look for (helps with questions like: &#8220;are total losses driven by many accounts or by few accounts&#8221; and so on).    We can not, however, recommend the <code>pseudoLog10()</code> transform for presentation.  The most striking feature of the graph is almost always the bimodal appearance of the data; and the bimodal appearance is almost always an artifact of the transform (not a real feature of the data).   You can not in good conscience push a presentation where the most prominent and exciting observation is not in fact in the data.</p>
<p>We do still recommend trying the <code>pseudoLog10()</code> transform when building a linear or logistic model with wide ranged data.  The transformation usefully compresses range which allows the modeled coefficients to be a function of most of the data and not a function of a few extreme values.  Models that depend on most of their data (or on central estimates from their data) tend to be safer, achieve higher statistical significance and cross-validate more reliably.  Models that are dominated by a few extreme values tend to be unsafe, not achieve statistical significance and not cross-validate reliably.  The bimodal artifact can work in the favor of modeling as it tends to compress a transformed variable into &#8220;typical positive example&#8221; and &#8220;typical negative example&#8221; while still allowing magnitudes to enter the model in some form.</p>
<p>Used with care the <code>pseudoLog10()</code> or <code>arcsinh()</code> transform can be an important data preparation step for signed data with large range.  Many financial summaries (such as P&#038;L) meet these conditions and often profit from the transform.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/07/your-data-is-never-the-right-shape/' rel='bookmark' title='Your Data is Never the Right Shape'>Your Data is Never the Right Shape</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/03/modeling-trick-the-signed-pseudo-logarithm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why I don&#8217;t like Dynamic Typing</title>
		<link>http://www.win-vector.com/blog/2012/02/why-i-dont-like-dynamic-typing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=why-i-dont-like-dynamic-typing</link>
		<comments>http://www.win-vector.com/blog/2012/02/why-i-dont-like-dynamic-typing/#comments</comments>
		<pubDate>Sat, 25 Feb 2012 14:27:37 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Rants]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Dynamically Typed Languages]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Static Typing]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1937</guid>
		<description><![CDATA[A lot of people consider the static typing found in languages such as C, C++, ML, Java and Scala as needless hairshirtism. They consider the dynamic typing of languages like Lisp, Scheme, Perl, Ruby and Python as a critical advantage (ignoring other features of these languages and other efforts at generic programming such as the [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/08/programmers-should-know-r/' rel='bookmark' title='Programmers Should Know R'>Programmers Should Know R</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2011/12/my-favorite-graphs/' rel='bookmark' title='My Favorite Graphs'>My Favorite Graphs</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>A lot of people consider the static typing found in languages such as C, C++, ML, Java and Scala as needless hairshirtism.  They consider the dynamic typing of languages like Lisp, Scheme, Perl, Ruby and Python as a critical advantage (ignoring other features of these languages and other efforts at generic programming such as the STL).</p>
<p>I strongly disagree.  I find the pain of having to type or read through extra declarations is small (especially if you know how to copy-paste or use a modern IDE).  And certainly much smaller than the pain of the dynamic language driven anti-patterns of: lurking bugs, harder debugging and more difficult maintenance.  Debugging is one of the most expensive steps in software development- so you want to incur less of it (even if it is at the expense of more typing).  To be sure, there <em>is</em> significant cost associated with static typing (I confess: I had to read the book and post a question on Stack Overflow to design the type interfaces in <a target="_blank" href="http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/">Automatic Differentiation with Scala</a>; but this is up-front design effort that has ongoing benefits, not hidden debugging debt).</p>
<p>There is, of course, no prior reason anybody should immediately care if I do or do not like dynamic typing.  What I mean by saying this is I have some experience and observations about problems with dynamic typing that I feel can help others.</p>
<p>I will point out a couple of example bugs that just keep giving.  Maybe you think you are too careful to ever make one of these mistakes, but somebody in your group surely will.  And a type checking compiler finding a possible bug early is the cheapest way to deal with a bug (and static types themselves are only a stepping stone for <a target="_blank" href="http://altdevblogaday.com/2011/12/24/static-code-analysis/">even deeper static code analysis</a>).<span id="more-1937"></span>For my examples I will pick on the programming language <a target="_blank" href="http://cran.r-project.org/">R</a> (which we have used and <a target="_blank" href="http://www.win-vector.com/blog/tag/r/">written about in the past</a>).</p>
<p>One of the supposed advantages of dynamically typed languages is that &#8220;everything is a macro.&#8221;  That is you write a function and it is really a template that specializes and works over many different data types.  For example: suppose we decided to write our own function to compute sample variance in R:</p>
<style type="text/css">
td.linenos { background-color: #f0f0f0; padding-right: 10px; }
span.lineno { background-color: #f0f0f0; padding: 0 5px 0 5px; }
pre { line-height: 125%; }
body .hll { background-color: #ffffcc }
body  { background: #f8f8f8; }
body .c { color: #408080; font-style: italic } /* Comment */
body .err { border: 1px solid #FF0000 } /* Error */
body .k { color: #008000; font-weight: bold } /* Keyword */
body .o { color: #666666 } /* Operator */
body .cm { color: #408080; font-style: italic } /* Comment.Multiline */
body .cp { color: #BC7A00 } /* Comment.Preproc */
body .c1 { color: #408080; font-style: italic } /* Comment.Single */
body .cs { color: #408080; font-style: italic } /* Comment.Special */
body .gd { color: #A00000 } /* Generic.Deleted */
body .ge { font-style: italic } /* Generic.Emph */
body .gr { color: #FF0000 } /* Generic.Error */
body .gh { color: #000080; font-weight: bold } /* Generic.Heading */
body .gi { color: #00A000 } /* Generic.Inserted */
body .go { color: #808080 } /* Generic.Output */
body .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
body .gs { font-weight: bold } /* Generic.Strong */
body .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
body .gt { color: #0040D0 } /* Generic.Traceback */
body .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
body .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
body .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
body .kp { color: #008000 } /* Keyword.Pseudo */
body .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
body .kt { color: #B00040 } /* Keyword.Type */
body .m { color: #666666 } /* Literal.Number */
body .s { color: #BA2121 } /* Literal.String */
body .na { color: #7D9029 } /* Name.Attribute */
body .nb { color: #008000 } /* Name.Builtin */
body .nc { color: #0000FF; font-weight: bold } /* Name.Class */
body .no { color: #880000 } /* Name.Constant */
body .nd { color: #AA22FF } /* Name.Decorator */
body .ni { color: #999999; font-weight: bold } /* Name.Entity */
body .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
body .nf { color: #0000FF } /* Name.Function */
body .nl { color: #A0A000 } /* Name.Label */
body .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
body .nt { color: #008000; font-weight: bold } /* Name.Tag */
body .nv { color: #19177C } /* Name.Variable */
body .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
body .w { color: #bbbbbb } /* Text.Whitespace */
body .mf { color: #666666 } /* Literal.Number.Float */
body .mh { color: #666666 } /* Literal.Number.Hex */
body .mi { color: #666666 } /* Literal.Number.Integer */
body .mo { color: #666666 } /* Literal.Number.Oct */
body .sb { color: #BA2121 } /* Literal.String.Backtick */
body .sc { color: #BA2121 } /* Literal.String.Char */
body .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
body .s2 { color: #BA2121 } /* Literal.String.Double */
body .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
body .sh { color: #BA2121 } /* Literal.String.Heredoc */
body .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
body .sx { color: #008000 } /* Literal.String.Other */
body .sr { color: #BB6688 } /* Literal.String.Regex */
body .s1 { color: #BA2121 } /* Literal.String.Single */
body .ss { color: #19177C } /* Literal.String.Symbol */
body .bp { color: #008000 } /* Name.Builtin.Pseudo */
body .vc { color: #19177C } /* Name.Variable.Class */
body .vg { color: #19177C } /* Name.Variable.Global */
body .vi { color: #19177C } /* Name.Variable.Instance */
body .il { color: #666666 } /* Literal.Number.Integer.Long */
  </style>
<div class="highlight">
<pre>
variance <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> <span class="p">{</span>
   n <span class="o">&lt;-</span> length<span class="p">(</span>x<span class="p">)</span>
   sumX <span class="o">&lt;-</span> sum<span class="p">(</span>x<span class="p">)</span>
   sumXX <span class="o">&lt;-</span> sum<span class="p">(</span>x<span class="o">*</span>x<span class="p">)</span>
   <span class="p">(</span>n<span class="o">/</span><span class="p">(</span>n<span class="o">-</span><span class="m">1</span><span class="p">))</span><span class="o">*</span><span class="p">(</span>sumXX<span class="o">/</span>n <span class="o">-</span> <span class="p">(</span>sumX<span class="o">/</span>n<span class="p">)</span><span class="o">*</span><span class="p">(</span>sumX<span class="o">/</span>n<span class="p">))</span>
<span class="p">}</span>
</pre>
</div>
<p>This works great and even matches the built-in function <code>var()</code>:</p>
<div class="highlight">
<pre>
<span class="o">&gt;</span> variance<span class="p">(</span>c<span class="p">(</span><span class="m">1000000</span><span class="p">,</span><span class="m">2000000</span><span class="p">,</span><span class="m">3000000</span><span class="p">,</span><span class="m">4000000</span><span class="p">,</span><span class="m">5000000</span><span class="p">))</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">2.5</span>e<span class="o">+</span><span class="m">12</span>
<span class="o">&gt;</span> var<span class="p">(</span>c<span class="p">(</span><span class="m">1000000</span><span class="p">,</span><span class="m">2000000</span><span class="p">,</span><span class="m">3000000</span><span class="p">,</span><span class="m">4000000</span><span class="p">,</span><span class="m">5000000</span><span class="p">))</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">2.5</span>e<span class="o">+</span><span class="m">12
</span>
</pre>
</div>
<p>That is, it works until we (either knowingly or unknowingly) apply the function to data of a different type:</p>
<div class="highlight">
<pre>
<span class="o">&gt;</span> variance<span class="p">(</span>as.integer<span class="p">(</span>c<span class="p">(</span><span class="m">1000000</span><span class="p">,</span><span class="m">2000000</span><span class="p">,</span><span class="m">3000000</span><span class="p">,</span><span class="m">4000000</span><span class="p">,</span><span class="m">5000000</span><span class="p">)))</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="kc">NA
Warning message:
In x * x : NAs produced by integer overflow
</span>
</pre>
</div>
<p>Our macro specializes to calculate over the integers when given integer arguments and then fails due to overflow.   Here it is obvious, but in a dynamically typed language we don&#8217;t always know the type of what we are passing in as we may have gotten the value from somewhere else.   If we define <code>variance()</code> as a function over doubles in a statically typed language then the language would force either an explicit (programmer supplied) or implicit (language supplied) coercion when attempting to use the function on a vector of integers.  The problem is: it is a bigger responsibility to write a correct macro (as the macro has to work over more possible types than a simple function).  The dynamic language pushes this onto us and sometimes we get burnt and sometimes everything is okay.  This sort of consideration is one of the reasons functional programming advocates prefer anonymous functions to declaring classes on the fly: less is possible so it is easier to safely implement what is implied.</p>
<p>Some of the problem can be dispelled with test driven development. I am a proponent of test driven development, so much so that I don&#8217;t want to waste my valuable test budget testing for things that a decent type system can defend against.  Also, by starting broad (assuming it is fair to re-use a function on many different types of arguments) you have entered into a bad bargain where you either have to document what subset of arguments the function works properly on (which is essentially declaring types!), add extra defensive code to cast the arguments on the way in (a waste, and <a target="_blank" href="http://www.win-vector.com/blog/2010/02/postels-law-not-sure-who-to-be-angry-with/">needlessly defensive coding brings in its own problems</a>) or write enough tests to document proper function on a whole bunch of types you don&#8217;t actually care about (char, byte, short int &#8230;).  Unexpected properties of real world data will throw you enough testing and debugging challenges (for example: <a target="_blank" href="http://www.win-vector.com/blog/2008/04/sorting-in-anger/">the effect of unexpected constant data in bad quicksort implementations</a>) that you don&#8217;t need additional hidden challenges that a static type system could exclude.</p>
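<p>To make the failure concrete, the sketch below inspects the storage type at each step and then shows one possible repair: a variant that coerces its argument to double before doing any arithmetic.  The name <code>varianceDouble()</code> is hypothetical (it is not a built-in), and the sketch assumes the values are meant to be treated as real numbers.</p>
<div class="highlight">
<pre>
x &lt;- as.integer(c(1000000, 2000000, 3000000, 4000000, 5000000))
typeof(x)          # "integer": x * x is computed in 32-bit integer arithmetic
typeof(x * 1.0)    # "double": mixing in a double promotes the result

# hypothetical variant: same formula, but forced to double arithmetic
varianceDouble &lt;- function(x) {
   x &lt;- as.numeric(x)    # explicit coercion, so integer input cannot overflow
   n &lt;- length(x)
   sumX &lt;- sum(x)
   sumXX &lt;- sum(x*x)
   (n/(n-1))*(sumXX/n - (sumX/n)*(sumX/n))
}

varianceDouble(x)   # 2.5e+12, the same value the double input gave
</pre>
</div>
<p>The coercion is exactly the kind of implicit contract (&#8220;this function works over doubles&#8221;) that a static type declaration would have stated up front.</p>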
<p>My second complaint is that most dynamically typed languages go further and force the horrible anti-pattern of automatic (or zero-declaration) variables on us.   Since we are not, in a dynamically typed language, required to declare types, it is considered a waste to force the user to declare variables at all (statements like &#8220;<code>var colTypeClass</code>&#8221;).  This argument is seductive because another supposed advantage of dynamically typed languages is conciseness, and variable declarations appear to have little value if you are not declaring types.  However consider the following code:</p>
<div class="highlight">
<pre>
sqlColType <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span>colTypeName<span class="p">)</span> <span class="p">{</span>
   colTypeClass <span class="o">&lt;-</span> <span class="s">&#39;unhandled&#39;</span>
   <span class="kr">if</span><span class="p">(</span>colTypeName <span class="o">%in%</span> list<span class="p">(</span><span class="s">&#39;smallint&#39;</span><span class="p">,</span><span class="s">&#39;integer&#39;</span><span class="p">,</span><span class="s">&#39;bigint&#39;</span><span class="p">,</span><span class="s">&#39;decimal&#39;</span><span class="p">,</span><span class="s">&#39;numeric&#39;</span><span class="p">,</span><span class="s">&#39;real&#39;</span><span class="p">,</span><span class="s">&#39;double precision&#39;</span><span class="p">,</span><span class="s">&#39;serial&#39;</span><span class="p">,</span><span class="s">&#39;bigserial&#39;</span><span class="p">,</span><span class="s">&#39;money&#39;</span><span class="p">))</span> <span class="p">{</span>
      colTypeClass <span class="o">&lt;-</span> <span class="s">&#39;numeric&#39;</span>
   <span class="p">}</span> <span class="kr">else</span> <span class="kr">if</span><span class="p">(</span>colTypeName <span class="o">%in%</span> list<span class="p">(</span><span class="s">&#39;character varying&#39;</span><span class="p">,</span><span class="s">&#39;character&#39;</span><span class="p">,</span><span class="s">&#39;text&#39;</span><span class="p">,</span><span class="s">&#39;boolean&#39;</span><span class="p">))</span> <span class="p">{</span>
      colTypeClass <span class="o">&lt;-</span> <span class="s">&#39;categorical&#39;</span>
   <span class="p">}</span> <span class="kr">else</span> <span class="kr">if</span><span class="p">(</span>colTypeName <span class="o">%in%</span> list<span class="p">(</span><span class="s">&#39;interval&#39;</span><span class="p">,</span><span class="s">&#39;date&#39;</span><span class="p">))</span> <span class="p">{</span>
      colTypeGlass <span class="o">&lt;-</span> <span class="s">&#39;temporal&#39;</span>
   <span class="p">}</span> <span class="kr">else</span> <span class="kr">if</span><span class="p">(</span>length<span class="p">(</span>grep<span class="p">(</span><span class="s">&#39;time&#39;</span><span class="p">,</span>colTypeName<span class="p">))</span><span class="o">&gt;</span><span class="m">0</span><span class="p">)</span> <span class="p">{</span>
      colTypeClass <span class="o">&lt;-</span> <span class="s">&#39;temporal&#39;</span>
   <span class="p">}</span>
   colTypeClass
<span class="p">}</span>
</pre>
</div>
<p>This code (for better or for worse, and at some point we all have to write or use something this ugly) is attempting to map specific SQL column type names into broad classes of types (numeric, categorical and temporal).   However there is a typo-bug in the above code that is only possible in a language with automatic variable declaration.  Consider the following two applications of <code>sqlColType()</code>:</p>
<div class="highlight">
<pre>
<span class="o">&gt;</span> sqlColType<span class="p">(</span><span class="s">&#39;integer&#39;</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="s">&quot;numeric&quot;</span>
<span class="o">&gt;</span> sqlColType<span class="p">(</span><span class="s">&#39;date&#39;</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="s">&quot;unhandled&quot;</span>
</pre>
</div>
<p>The first result is as designed and the second is wrong.  What happened is that in the if-block where &#8220;date&#8221; should have been identified we accidentally spelled &#8220;Class&#8221; with a &#8220;G&#8221; and the result we meant to return was trapped in a shiny new automatic variable that never escapes the function.  You may consider this particular bug unlikely, but in a language without automatic variable declaration it is literally impossible.  And you don&#8217;t even have to actually have this bug in your code to suffer from it.  This mistake is something you have to check for when inspecting/debugging faulty code (because you have no prior guarantee that it cannot happen).</p>
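<p>One way to shrink the room for this particular typo inside R, offered only as a sketch, is to make the if/else chain itself the value of the function so there is no intermediate variable name to misspell.  The names <code>sqlColTypeExpr()</code> and <code>typoExample()</code> are hypothetical.  The last lines assume the <code>codetools</code> package (a recommended package normally distributed with R) is installed; its <code>checkUsage()</code> function is a static check that may report local variables that are assigned but never used, which is the footprint this kind of typo leaves behind.</p>
<div class="highlight">
<pre>
sqlColTypeExpr &lt;- function(colTypeName) {
   # in R an if/else chain is an expression; its value is returned directly
   if(colTypeName %in% c('smallint','integer','bigint','decimal','numeric','real','double precision','serial','bigserial','money')) {
      'numeric'
   } else if(colTypeName %in% c('character varying','character','text','boolean')) {
      'categorical'
   } else if(colTypeName %in% c('interval','date') || grepl('time', colTypeName)) {
      'temporal'
   } else {
      'unhandled'
   }
}

sqlColTypeExpr('date')

# a tiny function with the same kind of typo, to exercise the static check
typoExample &lt;- function() {
   colTypeClass &lt;- 'unhandled'
   colTypeGlass &lt;- 'temporal'    # misspelled name: creates a new local variable
   colTypeClass
}
codetools::checkUsage(typoExample, name='typoExample')
</pre>
</div>
<p>Neither device is a guarantee: both are conventions or tools layered on top of the language, not something the language enforces.</p>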
<p>My third complaint is the common lack of significant refactoring tools for dynamically typed languages.  The ability to automatically apply larger scale meaningful code changes (such as when using Eclipse&#8217;s Java development environment) is big.  Dynamic type advocates would argue that most of the successful refactorings are just the IDE shepherding around type cruft that is not present in a dynamic language.  This is not true.  In addition to the trivial code motion and package management there are significant code transformations: method extraction, method signature alteration and safe variable renaming just to name three.  It is a real luxury to work with a system that can safely rename a variable (and all of its references) even when there are other strings and variables using the same token.  It is also a luxury to work in teams where nobody can say &#8220;yeah, we wanted to remove that argument from the method, but nobody has time to update and test all of the consumers.&#8221;  Most dynamic languages don&#8217;t even have the very clever &#8220;poor man&#8217;s refactoring&#8221; (change the method declaration, attempt a re-compile and then insert changes every place the compiler flags an error).   When changing a method signature in a dynamically typed language you are typically left with the lurking worry that some bit of code somewhere is still attempting to use the old signature and will exhibit a runtime error when the exact set of circumstances required to execute the bad path happens in production (i.e. that you won&#8217;t be lucky enough to find it in a test).  
IDEs have a somewhat dirty reputation as being a crutch (somewhat due to horrible interface builders and large boilerplate systems), but the treatment of code as an object subject to a series of meaningful transformations is game changing (and is most commonly associated with statically typed languages, somewhat by historic accident but also likely due to the presence of extra declaration blocks often in statically typed languages and not due to the actual type system itself).</p>
<p>To sum up: dynamic typing allows more expressive code and saves space.  But we pay a large cost downstream in more expensive debugging and much weaker ability to refactor or analyze.  I favor the compromise where most code is statically typed and either only language supplied functions are capable of dynamic typing or there are user escapes out (like templating).  While there is some doubt as to whether you can design a language as powerful as Scheme or Python without dynamic typing (some attempts have failed and some attempts are still evolving) I still prefer static typing. Or (more accurately) I prefer to deal with statically typed code (and am willing to put up with some expense to have it).  Initial coding is not the only phase of the software lifecycle.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/08/programmers-should-know-r/' rel='bookmark' title='Programmers Should Know R'>Programmers Should Know R</a></li>
<li><a href='http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/' rel='bookmark' title='Automatic Differentiation with Scala'>Automatic Differentiation with Scala</a></li>
<li><a href='http://www.win-vector.com/blog/2011/12/my-favorite-graphs/' rel='bookmark' title='My Favorite Graphs'>My Favorite Graphs</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/02/why-i-dont-like-dynamic-typing/feed/</wfw:commentRss>
		<slash:comments>26</slash:comments>
		</item>
		<item>
		<title>Ergodic Theory for Interested Computer Scientists</title>
		<link>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ergodic-theory-for-interested-computer-scientists</link>
		<comments>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/#comments</comments>
		<pubDate>Sat, 04 Feb 2012 17:42:07 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Expository Writing]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Ergodic Theorem]]></category>
		<category><![CDATA[Gibbs Sampler]]></category>
		<category><![CDATA[Markov Chains]]></category>
		<category><![CDATA[Random Sampling]]></category>
		<category><![CDATA[Randomized Algorithms]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1933</guid>
		<description><![CDATA[We describe ergodic theory in modern notation accessible to interested computer scientists. The ergodic theorem (http://en.wikipedia.org/wiki/Ergodic theory (link)) is an important principle of recurrence and averaging in dynamical systems. However, there are some inconsistent uses of the term, much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/' rel='bookmark' title='Hello World: An Instance Of Rhetoric in Computer Science'>Hello World: An Instance Of Rhetoric in Computer Science</a></li>
<li><a href='http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/' rel='bookmark' title='Six Fundamental Methods to Generate a Random Variable'>Six Fundamental Methods to Generate a Random Variable</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>We describe ergodic theory in modern notation accessible to interested computer scientists.</p>
<p>The ergodic theorem (<a target="_blank" href="http://en.wikipedia.org/wiki/Ergodic_theory">link</a>) is an important principle of recurrence and averaging in dynamical systems. However, there are some inconsistent uses of the term, much of the machinery is intended to work with deterministic dynamical systems (not probabilistic systems, as is often implied), and often the conclusion of the theory is mis-described as its premises.</p>
<p>By “interested computer scientists” we mean people who know math and work with probabilistic systems, but know not to accept mathematical definitions without some justification (actually a good attitude for mathematicians also).<span id="more-1933"></span>Please click through to read <a target="_blank" href="http://www.win-vector.com/dfiles/ErgodicTheory.pdf">Ergodic Theory for Interested Computer Scientists</a>.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2008/02/hello-world-an-instance-rhetoric-in-computer-science/' rel='bookmark' title='Hello World: An Instance Of Rhetoric in Computer Science'>Hello World: An Instance Of Rhetoric in Computer Science</a></li>
<li><a href='http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/' rel='bookmark' title='Six Fundamental Methods to Generate a Random Variable'>Six Fundamental Methods to Generate a Random Variable</a></li>
<li><a href='http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/' rel='bookmark' title='What Did Theorists Do Before The Age Of Big Data?'>What Did Theorists Do Before The Age Of Big Data?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/02/ergodic-theory-for-interested-computer-scientists/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Six Fundamental Methods to Generate a Random Variable</title>
		<link>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=six-fundamental-methods-to-generate-a-random-variable</link>
		<comments>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/#comments</comments>
		<pubDate>Fri, 20 Jan 2012 19:23:38 +0000</pubDate>
		<dc:creator>John Mount</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[Ergodic Theory]]></category>
		<category><![CDATA[Markov Chains]]></category>
		<category><![CDATA[Markov Monte Carlo]]></category>
		<category><![CDATA[Random Sampling]]></category>

		<guid isPermaLink="false">http://www.win-vector.com/blog/?p=1925</guid>
		<description><![CDATA[Introduction To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them? The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing [...]
Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/' rel='bookmark' title='What is a large enough random sample?'>What is a large enough random sample?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<h2> Introduction</h2>
<p>To implement many numeric simulations you need a sophisticated source of instances of random variables.  The question is: how do you generate them?  </p>
<p>The literature is full of algorithms requiring random samples as inputs or drivers (<a target="_blank" href="http://en.wikipedia.org/wiki/Conditional_random_field">conditional random fields</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Bayesian_network">Bayesian network models</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Particle_filter">particle filters</a> and so on). The literature is also full of competing methods (<a target="_blank" href="http://en.wikipedia.org/wiki/Pseudorandom_number_generator">pseudorandom generators</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy sources</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs samplers</a>, <a target="blank" href="http://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis–Hastings algorithm</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov chain Monte Carlo methods</a>, <a target="_blank" href="http://en.wikipedia.org/wiki/Bootstrapping">bootstrap methods</a> and so on).  Our thesis is: this diversity is supported by only a few fundamental methods.  And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box &#8220;brand name&#8221; technique. </p>
<p>We will discuss the half dozen basic methods that all of these techniques are derived from.<span id="more-1925"></span>To our mind all of the famous random variate generation/sampling techniques are derived from combinations of the following six fundamental methods:</p>
<ol>
<li>Physical sources.</li>
<li>Empirical resampling.</li>
<li>Pseudo random generators.</li>
<li>Simulation/Game-play.</li>
<li>Rejection Sampling.</li>
<li>Transform methods.</li>
</ol>
<p>The technical fights (such as: &#8220;is Gibbs sampling superior to, or even distinguishable from, Markov chain Monte Carlo?&#8221;) are all in the details, history and citation conventions.   Each field and particular method accretes its own traditions.  We will quickly discuss the fundamental methods we listed.  As we will see: complexity goes up as we move through the list (so at some point things are no longer fundamental but instead derived, allowing us to end the list).</p>
<h2>The Methods</h2>
<h3>Physical sources</h3>
<p>This is the most basic way (though not as practical in the computer age) to generate random variables.  Observe the flip of a real coin, shuffle actual cards, mix numbered balls or count the number of ticks from an actual radioactive source.  In all of these the randomness comes from physical principles (such as <a target="_blank" href="http://en.wikipedia.org/wiki/Chaos_theory">chaotic dynamics</a> for coin flips or <a target="_blank" href="http://en.wikipedia.org/wiki/Quantum_mechanics">quantum mechanics</a> for radioactive decay).</p>
<p>These sources are &#8220;outside of computer science&#8221; so we will say the least about them.</p>
<h3>Empirical resampling</h3>
<p>This is what used to be called &#8220;tables&#8221; (which were themselves often generated from physical processes).   The observation is that sometimes to run a simulation you need access to instances of random variables that are distributed in a very precise way, but you don&#8217;t have a usable description of the desired distribution.  You would think that in this case you could do nothing.  But the principle of empirical resampling is that you can approximately generate new samples by taking samples (with repetition or replacement) from an old sample.  This is the cornerstone of Bootstrap methods.</p>
<p>As an example:  suppose we were given the sample of numbers 5, 5, 10, 5, 5 which has mean equal to 6.  Further suppose we have no description of how these numbers were generated but we wanted to know if a mean of at least 8 is likely or unlikely for five more numbers drawn the same way.  We can approximate this by drawing many samples of size five from this original sample (allowing the same number to be in our new sample multiple times) and get the bootstrap estimate of the probability of seeing a mean of at least 8 as around 5.8% (a resampled mean of at least 8 needs three or more of the five draws to be the 10; a mean strictly greater than 8 needs four or more and has probability around 0.7%).</p>
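<p>The bootstrap estimate can be checked with a few lines of R.  This is a sketch: the seed and the number of replications are arbitrary choices of ours.  For this particular sample a resampled mean of at least 8 requires at least three of the five draws to be the value 10 (each draw is 10 with probability 1/5), so the exact resampling probability is also available from the binomial distribution for comparison.</p>
<div class="highlight">
<pre>
original &lt;- c(5, 5, 10, 5, 5)
set.seed(2012)       # arbitrary seed, only for repeatability
nRep &lt;- 100000       # arbitrary number of bootstrap replications

# each replication: resample five values with replacement, record the mean
bootMeans &lt;- replicate(nRep, mean(sample(original, size=length(original), replace=TRUE)))

estimate &lt;- mean(bootMeans &gt;= 8)          # simulated probability of a mean of at least 8
exact &lt;- 1 - pbinom(2, size=5, prob=0.2)  # exact: three or more draws equal to 10
c(estimate=estimate, exact=exact)
</pre>
</div>
<p>The exact value is about 0.058.  Requiring a mean strictly greater than 8 (four or more draws equal to 10) instead gives <code>1 - pbinom(3, size=5, prob=0.2)</code>, about 0.0067.</p>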
<p>This may seem trivial- but it is very important.</p>
<h3>Pseudo random generators</h3>
<p>In the computer age, to avoid the need for external tables or expensive and slow peripherals we tend to use pseudo random generators.  That is, we treat the output of deterministic iterative procedures as equivalent to true random sources.  The science of pseudo randomness has evolved from cobbled together procedures passing ad-hoc tests (such as in Knuth Volume 2) to more formal pseudo randomness based on important properties (like provably being k-wise independent) or complexity (being computationally indistinguishable from a truly random source on a time or space bounded machine).  Behind the canned routines of all of the basic &#8220;random generators&#8221; commonly available is a pseudo random source.  </p>
<p>Good references for the modern theory include: 	</p>
<ul>
<li>
&#8220;Pseudorandomness and Cryptographic Applications&#8221; Michael Luby 1996.
</li>
<li>
&#8220;Modern Cryptography, Probabilistic Proofs and Pseudorandomness&#8221; Oded Goldreich, 1999.
</li>
</ul>
<p>The most basic form of a sequential pseudo random generator is a sequence of states s(1), s(2), s(3) &#8230; where s(i+1) = g(s(i)) and g() is our deterministic function that maps state to state.  The observed random variables are then h(s(i)) where h() is some deterministic function that maps state to observables.  For example for the <a target="_blank" href="http://en.wikipedia.org/wiki/Linear_congruential_generator">linear congruential generator</a>  found in glibc we have g(x) = (1103515245*x + 12345) modulo 2^32 and h(x) = x modulo 2^30 (x an integer from 0 to 2^32 &#8211; 1).  An example application: the output of this generator when divided by (2^30 &#8211; 1) might return numbers passably uniformly distributed in the interval [0,1].  Two such variates might be used as a uniform sample from the unit square.</p>
<p>That a simple iterated deterministic system (like the modulo arithmetic or even a physical system like coin flipping) would even superficially appear random (let alone be safe to use as a pseudo random source) turns out to be the main consequence of <a target="_blank" href="http://en.wikipedia.org/wiki/Ergodic_theory">Ergodic theory</a> (which we will touch on in a later article).  The point is: it should not be obvious (without bringing in some more theory) why you should trust pseudo-random sources.</p>
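<p>The recurrence above can be sketched in R as follows.  The function names are ours, and the constants are the ones quoted in the text (we have not checked them against the glibc source).  R stores these numbers as doubles, which represent integers exactly only up to 2^53, so the multiplication is split into 16-bit pieces to keep every intermediate value exact.</p>
<div class="highlight">
<pre>
lcgA &lt;- 1103515245
lcgC &lt;- 12345
two16 &lt;- 2^16
two32 &lt;- 2^32

# g(): state update, (lcgA*x + lcgC) modulo 2^32, computed without exceeding 2^53
g &lt;- function(x) {
   xHigh &lt;- x %/% two16      # upper 16 bits of the state
   xLow &lt;- x %% two16        # lower 16 bits of the state
   # (lcgA * xHigh * 2^16) modulo 2^32 depends only on (lcgA * xHigh) modulo 2^16
   (((lcgA * xHigh) %% two16) * two16 + lcgA * xLow + lcgC) %% two32
}

# h(): observable, the state modulo 2^30
h &lt;- function(x) x %% 2^30

# draw n variates scaled into [0,1], starting from a given state
lcgUniform &lt;- function(n, state) {
   out &lt;- numeric(n)
   for(i in seq_len(n)) {
      state &lt;- g(state)
      out[i] &lt;- h(state) / (2^30 - 1)
   }
   out
}

lcgUniform(4, state=1)
</pre>
</div>
<p>Keeping g() and h() as separate functions mirrors the state/observable split in the description: the full state is never exposed, only the reduced observable.</p>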
<h3>Simulation/Game-play</h3>
<p>Another fundamental method is direct simulation or game play.  Suppose we wanted a random variable that was 1 with probability equal to the odds of being dealt a full house from a standard shuffled deck of 52 cards (and zero otherwise).  We can generate such a variable by simulating shuffling a deck, drawing a hand and returning 1 if the hand drawn is a full house (and returning 0 otherwise).  Notice in this case we are combining many random variables to get a single result.</p>
<p>One of the most important simulation techniques is Markov chain Monte Carlo methods (related to Gibbs sampling, simulated annealing and many other variations).  These methods implement a complex procedure over a stream of random inputs to generate a more difficult to achieve sequence of random outputs.</p>
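<p>The full house variable can be simulated directly.  In this sketch (the names are ours, and the seed and replication count are arbitrary) a deck is represented only by its ranks, since suits do not matter for a full house, and the estimate is compared with the exact count of 13*4*12*6 = 3744 full house hands among the choose(52,5) possible hands.</p>
<div class="highlight">
<pre>
deckRanks &lt;- rep(1:13, each=4)    # 52 cards, suits ignored

# returns 1 if a random five card hand is a full house, 0 otherwise
fullHouseIndicator &lt;- function() {
   hand &lt;- sample(deckRanks, size=5)       # shuffle and deal five cards
   counts &lt;- tabulate(hand, nbins=13)      # how many cards of each rank
   # a full house is exactly one pair plus one three of a kind
   if(identical(sort(counts[counts &gt; 0]), c(2L, 3L))) 1 else 0
}

set.seed(52)                                # arbitrary seed
estimate &lt;- mean(replicate(100000, fullHouseIndicator()))
exact &lt;- 13*choose(4,3)*12*choose(4,2)/choose(52,5)
c(estimate=estimate, exact=exact)
</pre>
</div>
<p>Each call to <code>fullHouseIndicator()</code> consumes a whole simulated deal to produce a single 0/1 value, which is the sense in which many random variables are combined into one result.</p>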
<p>For example:  Let T be the set of pairs of non-negative integers x, y such that x + y &le; 1000.   We could implement a Markov chain on this set from a source of coin flips.  Given a point (x,y) in T we take three coin flips and move to new point (x&#8217;,y&#8217;) (also in T) using the following procedure:</p>
<ol>
<li>Let m = 1 if the first flip is heads and m=0 if the first flip is tails.</li>
<li>Let v = (1,0) if the second flip is heads and v=(0,1) if the second flip is tails.</li>
<li>Let d = +1 if the third flip is heads and d = -1 if the third flip is tails.</li>
<li>If (x,y) + m*d*v is in T let (x&#8217;,y&#8217;) = (x,y) + m*d*v, otherwise let (x&#8217;,y&#8217;) = (x,y) (stay put).</li>
</ol>
<p>Repeating this procedure a large number of times produces a sequence of points (x,y) such that (x,y) is distributed (approximately) uniformly on T (again this follows from ergodic principles).  Simulating, or playing the game of following, a Markov chain in this way is a very fundamental method for generating more complicated random variates.  Its correctness is something we will write more about in an article dealing with the ergodic principle (the relation of connectedness to showing averages over time equal averages over space).</p>
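<p>The four-step procedure above can be sketched in Python as follows.  The names in_T(), step() and run_chain() are our own, and we assume each coin flip is modeled by one pseudo-random bit.</p>

```python
import random

N = 1000  # T = {(x, y) : x >= 0, y >= 0, x + y <= N}

def in_T(x, y):
    """Membership test for the triangle of lattice points T."""
    return x >= 0 and y >= 0 and x + y <= N

def step(point, rng):
    """One move of the chain, driven by three coin flips (1 = heads, 0 = tails)."""
    x, y = point
    m = 1 if rng.getrandbits(1) else 0            # first flip: move or not
    v = (1, 0) if rng.getrandbits(1) else (0, 1)  # second flip: which axis
    d = 1 if rng.getrandbits(1) else -1           # third flip: direction
    nx, ny = x + m * d * v[0], y + m * d * v[1]
    # Stay put if the proposed point leaves T.
    return (nx, ny) if in_T(nx, ny) else (x, y)

def run_chain(steps, start=(0, 0), seed=0):
    """Follow the chain for a given number of steps and return the final point."""
    rng = random.Random(seed)
    point = start
    for _ in range(steps):
        point = step(point, rng)
    return point
```

<p>This sketch says nothing about how many steps are needed before the point is close to uniformly distributed; with single-unit moves on a triangle this size we would expect that number to be large.</p>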
<p>For simple shapes (rectangles, triangles) there are more efficient ways to generate points uniformly at random.  For squares we exploit independence and just generate the coordinates independently.  For triangles we could rejection sample from a bounding rectangle.   Or we could use a transform method: write down a counting function that indexes all the points in the triangle and generate points by index (for example it is easy to work out there are 501501 points in our example T, so if we generate a random integer uniformly from 1 to 501501 we can just pick the point with the given index as our sample).</p>
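<p>One possible counting function is sketched below in Python.  The particular enumeration (walking the diagonals x + y = 0, 1, 2, ... in order) and the 0-based index are our own choices for illustration; any one-to-one indexing of the points would serve.</p>

```python
import math
import random

N = 1000
COUNT = (N + 1) * (N + 2) // 2  # number of lattice points with x + y <= N

def point_from_index(k):
    """Map an index k in 0..COUNT-1 to a distinct point (x, y) of the triangle.

    Diagonal s (the points with x + y = s) holds s + 1 points, and the
    diagonals before it hold s*(s+1)/2 points in total.
    """
    s = (math.isqrt(8 * k + 1) - 1) // 2   # largest s with s*(s+1)/2 <= k
    x = k - s * (s + 1) // 2               # position along diagonal s
    return (x, s - x)

def sample_uniform_point(rng):
    """Draw one point uniformly from the triangle using a single random index."""
    return point_from_index(rng.randrange(COUNT))
```

<p>Here COUNT works out to 501501, and a single uniform integer yields an exactly uniform point with no rejected trials and no waiting for a chain to mix.</p>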
<p>For general convex shapes (in high dimensions) these methods become intractable and Markov chain methods are one of the few options remaining.</p>
<h3>Rejection Sampling</h3>
<p>Rejection sampling is another way to convert one sequence of random variables into another.  If we assume we can generate a random variable according to the distribution p(x) we can &#8220;rejection sample&#8221; to a new distribution using an &#8220;acceptance function&#8221; q(x) which returns a number in the interval [0,1].  Our procedure is to repeat the following: generate x with probability p(x), generate a random variable y uniformly in the interval [0,1], and if y &le; q(x) accept x as our answer and quit (otherwise draw a new x and repeat).</p>
<p>The distribution that rejection sampling draws from is such that if x and y originally had relative odds of being drawn of p(x)/p(y), then under the rejection procedure they have relative odds of (p(x)q(x))/(p(y)q(y)).  An important special case is when q() is always 0 or 1; in this case we are drawing with relative odds proportional to p(x) from the subset of x with q(x)=1.</p>
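<p>As a rough Python sketch of the general procedure: the function rejection_sample() and the toy distribution below are illustrative assumptions of ours.  We use a strict comparison so that an acceptance value of exactly 0 never accepts; for a continuous uniform y this differs from y &le; q(x) only on an event of probability zero.</p>

```python
import random

def rejection_sample(draw_p, q, rng):
    """Draw x from draw_p repeatedly, accepting each draw with probability q(x)."""
    while True:
        x = draw_p(rng)
        y = rng.random()      # uniform in [0, 1)
        if y < q(x):          # accept with probability q(x)
            return x

# Toy example: p is uniform on {1, 2, 3}; the acceptance function keeps
# 1 always, 2 half the time, and 3 never.
def draw_uniform_123(rng):
    return rng.choice([1, 2, 3])

ACCEPT = {1: 1.0, 2: 0.5, 3: 0.0}

def toy_sample(rng):
    return rejection_sample(draw_uniform_123, ACCEPT.get, rng)
```

<p>In the toy example the original odds of 1 versus 2 are 1:1, so after rejection they become (1/3 * 1.0):(1/3 * 0.5) = 2:1, and 3 is never returned.</p>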
<p>As an example: consider the problem of trying to draw a point (x,y) such that x^2 + y^2 &lt; 1 (the open unit disk) uniformly at random.  The rejection sampling solution is: repeat the following until you have a success: generate x and y independently uniformly in the interval [-1,1], if x^2 + y^2 &lt; 1 then accept them as our sample (otherwise repeat).  This procedure is very fast as the unit disk that represents our acceptance region has area pi and the square we are generating trials from has area 4: so we have over a 78% chance of success on each trial and expect to have to run fewer than 1.28 trials (on average) to get a sample.</p>
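<p>A short Python sketch of the disk example follows.  The function name and the extra returned trial count (included only so the expected number of trials can be checked empirically) are our own additions.</p>

```python
import random

def sample_unit_disk(rng):
    """Return (x, y, trials): a uniform point in the open unit disk and the
    number of square-uniform trials it took to find it."""
    trials = 0
    while True:
        trials += 1
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y < 1.0:   # accept only points inside the disk
            return x, y, trials

def average_trials(samples, seed=0):
    """Empirical mean number of trials per accepted point."""
    rng = random.Random(seed)
    return sum(sample_unit_disk(rng)[2] for _ in range(samples)) / samples
```

<p>The average returned by average_trials() should settle near 4/pi (about 1.27) as the number of samples grows.</p>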
<h3>Transform methods</h3>
<p>A transform method is used when we have the ability to generate instances of a random variable according to one distribution and we would like instances according to another distribution.</p>
<p>One method is used when we have access to the inverse of the <a target="_blank" href="http://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution function</a> of the distribution we are trying to generate.  In this case we can use this function to convert uniform variates from the interval [0,1] into our target distribution.  The cumulative distribution function is the function cdf() where cdf(x) is the probability a random variate generated according to our distribution is less than or equal to x.  The inverse function is icdf(), where icdf(y) is such that cdf(icdf(y)) = y.  For example the <a target="_blank" href="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a> has an inverse cumulative distribution function icdf(y) = -ln(1-y)/lambda.  So if y is generated uniformly in the interval [0,1] then icdf(y) is a random variable generated according to the exponential distribution with parameter lambda.</p>
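<p>The exponential example can be sketched in Python as below.  The parameter is spelled lam because lambda is a reserved word in Python; the function names are our own.</p>

```python
import math
import random

def exponential_cdf(x, lam):
    """cdf(x) = 1 - exp(-lam * x) for x >= 0."""
    return 1.0 - math.exp(-lam * x)

def exponential_icdf(y, lam):
    """Inverse of exponential_cdf: icdf(y) = -ln(1 - y) / lam for 0 <= y < 1."""
    return -math.log(1.0 - y) / lam

def sample_exponential(lam, rng):
    """Transform one uniform variate into one exponential variate."""
    # rng.random() lies in [0, 1), so 1 - y is never zero and the log is defined.
    return exponential_icdf(rng.random(), lam)
```

<p>A quick sanity check is that the sample mean of many draws should be close to 1/lam, the mean of the exponential distribution.</p>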
<p>A great example of transform methods is generating Gaussian random variables.  We could directly use the inverse cumulative distribution function method described above, but to do this we would require a special function library to perform the required calculation of the inverse cumulative distribution (or inverse of <a target="_blank" href="http://en.wikipedia.org/wiki/Error_function">erf()</a>).  Another way is the <a target="_blank" href="http://en.wikipedia.org/wiki/Marsaglia_polar_method">polar method</a>: generate x,y uniformly from the open unit disk (by, for example, rejection sampling as described earlier, also discarding the point (0,0)), set s = x^2 + y^2 and return  x*sqrt(-2 ln(s)/s),  y*sqrt(-2 ln(s)/s) as two independent Gaussian random variables.   The trick being: the radius r = sqrt(-2 ln(s)) of the returned pair has a density of the form r*e^(-r*r/2), which leads to an elementary cumulative distribution function (unlike the original Gaussian density of the form e^(-r*r/2)) that is easy to invert.</p>
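<p>A minimal Python sketch of the polar method, under the assumption that we draw the disk point by rejection from the square [-1,1] x [-1,1] and skip s = 0 to avoid dividing by zero.  The function name gaussian_pair() is our own.</p>

```python
import math
import random

def gaussian_pair(rng):
    """Return two independent standard Gaussian variates via the polar method."""
    while True:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        s = x * x + y * y
        if 0.0 < s < 1.0:      # inside the open unit disk, excluding the origin
            break
    scale = math.sqrt(-2.0 * math.log(s) / s)
    return x * scale, y * scale

def gaussian_stream(count, seed=0):
    """Collect count variates (count should be even) for quick inspection."""
    rng = random.Random(seed)
    values = []
    for _ in range(count // 2):
        values.extend(gaussian_pair(rng))
    return values
```

<p>Only a logarithm and a square root are needed; no inverse of erf() appears anywhere.  The sample mean and variance of gaussian_stream() output should be near 0 and 1 respectively.</p>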
<h2>Conclusion</h2>
<p>Our thesis is: all major methods to generate random variables use aspects of the six methods we have listed here as fundamental.  At the very least, you should have a fluent understanding of these methods.  You should be able to break down big &#8220;brand name&#8221; methods (like Gibbs sampling) roughly into their constituent parts (so you can reason about them).   One example: notice how ratios of probabilities enter into Markov chain Monte Carlo methods (they cause step rejections); from this you can reason that if your problem has bounded ratios it is a good candidate for direct application of the technique (and if it does not you need to add some more ideas, as was demonstrated in:  <a target="_blank" href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.9794">&#8220;Sampling from Log-Concave Distributions,&#8221; Alan Frieze, Ravi Kannan, Nick Polson, Ann. Appl. Prob, 1994</a>).</p>
<p>The first two methods we discuss (physical sources and empirical re-sampling) are of the class of solutions &#8220;already have the right answer.&#8221;  Pseudo random generators are the primary way to negate the need for physical sources and resampling techniques.  Simulation, rejection sampling and transform methods are the main tools for building new distributions out of old.</p>
<p>It is a matter of taste if a given trick fits into this ad-hoc taxonomy or not.   You can invent new and better generation methods, but these methods are easily derived using ideas from the fundamental methods we mentioned here.</p>
<p>Related posts:<ol>
<li><a href='http://www.win-vector.com/blog/2011/06/what-is-a-large-enough-random-sample/' rel='bookmark' title='What is a large enough random sample?'>What is a large enough random sample?</a></li>
<li><a href='http://www.win-vector.com/blog/2011/10/kernel-methods-and-support-vector-machines-de-mystified/' rel='bookmark' title='Kernel Methods and Support Vector Machines de-Mystified'>Kernel Methods and Support Vector Machines de-Mystified</a></li>
<li><a href='http://www.win-vector.com/blog/2011/07/book-review-ensemble-methods-in-data-mining-seni-elder/' rel='bookmark' title='Book Review: Ensemble Methods in Data Mining (Seni &amp; Elder)'>Book Review: Ensemble Methods in Data Mining (Seni &#038; Elder)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.win-vector.com/blog/2012/01/six-fundamental-methods-to-generate-a-random-variable/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>


