Win-Vector Blog http://www.win-vector.com/blog The Applied Theorist's Point of View Mon, 29 Jun 2009 16:03:35 +0000 http://wordpress.org/?v=2.8 en hourly 1 Public Service Article: JSTOR and other Useful Research Archives http://www.win-vector.com/blog/2009/06/public-service-article-jstor-and-other-useful-research-archives/ http://www.win-vector.com/blog/2009/06/public-service-article-jstor-and-other-useful-research-archives/#comments Sun, 28 Jun 2009 17:41:44 +0000 Nina Zumel http://www.win-vector.com/blog/?p=169
    How do you get access to current and historical research articles if you are not affiliated with a university or large research organization? Our second public service article discusses some useful online research archives.

    Most readers of this blog probably keep track of the latest developments in their field through journal subscriptions and memberships in appropriate professional associations. Perhaps some of you even splurge on digital library subscriptions, such as IEEE Xplore or the INFORMS Digital Library, both of which I have found quite useful. In our field (Computer Science), academic researchers are generally conscientious about making their research papers available through their websites.

    But researchers in other fields are not always so good about making copies of their papers easily available, and older classic papers (say, for example, Bradley Efron’s 1979 Annals of Statistics paper on the Jackknife) are often still worth reading, but are not always easy to find. Where to go?

    This is a list of some resources that I’ve discovered over the years. The list isn’t comprehensive, by any means, but I offer them here because maybe you will find them helpful, too. The list, and my opinions, are biased towards research in the mathematical and computer sciences, but many of these resources are potentially useful for any research area, including the humanities.

    JSTOR

    JSTOR is a digital archive of over one thousand scholarly journals, covering topics in the humanities, social and physical sciences and mathematics. I love JSTOR. It is an incredibly useful resource, containing the full contents of every issue of every journal in their collection up to within 3-5 years of the present time (it’s a moving wall). The collection is full-text searchable. I use JSTOR to find classic papers in Math, Statistics, and Computer Science, as well as more recent papers that have been published in journals that are otherwise not available to me.

    Access to JSTOR is available to members of participating institutions, mostly universities, but also many public libraries. I have access to JSTOR free with my San Francisco Public Library card, via the SFPL website. (I believe that any resident of California is eligible for a SFPL library card with proof of California residency; good news if you are in California and your local library doesn’t subscribe).


    As a side note, San Francisco Public Library subscribes to several quite useful digital research services, including FirstSearch, the OED, Encyclopaedia Britannica, and Morningstar. Some of these other services also provide access to selected full-text articles. SFPL also participates in ILL (Interlibrary Loan) and Link+, a similar cross-library loan service. All good reasons to support your local library!

    ArXiv

    ArXiv is a pre-print server hosted by Cornell, serving pre-prints of papers in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. Many important researchers use ArXiv to get around the fact that major journal publishers insist on holding the copyright to articles published in their journals. “Pre-prints” haven’t yet been published, and hence the authors are free to distribute them freely. Fields Medalist Terence Tao regularly distributes his about-to-be published work through ArXiv.

    On the other hand, ArXiv has very open submission policies, so you should be more careful with the papers you find here than you would be with a refereed or curated source, such as JSTOR or PubMed Central (which we will discuss later). ArXiv has, unfortunately, more than its fair share of what Augustus de Morgan used to politely call “paradoxers”. The “Journal Reference” field of the article summaries will generally give you an indication of whether or not the paper is legitimate, in the sense of having been peer-reviewed; but note, for instance, this paper on a polynomial-time algorithm for Traveling Salesman (the Traveling Salesman problem is provably NP-complete, so a result of this magnitude would win the Clay Millennium Prize, if true).

    Another side note: I’ve linked to the Amazon page on de Morgan’s Budget of Paradoxes because that was the first synopsis I found. The copyright on the book has expired, so if you are actually interested in reading it (it’s fairly funny, in places), you can find the full version on Google Books or Project Gutenberg.

    CiteSeerX

    CiteSeer was the original search engine and archive for online technical papers; it got me through graduate school, and my first post-PhD position at SRI. I don’t believe that the original CiteSeer system is still active, but its successor, CiteSeerX, is being developed and hosted at Penn State. It concentrates on computer science literature, as did the original. CiteSeerX builds its corpus by webcrawling, so again, the papers it finds are not necessarily refereed. Like its predecessor, CiteSeerX search results include the paper’s abstract, a BibTex citation, a list of the paper’s references, a pointer to the paper’s original location, and (usually) an archived version of the paper, in case the original link has gone dead. Good stuff.

    AccessMyLibrary

    AccessMyLibrary is a service that pools the periodical resources of several libraries across the United States. Any article in a periodical held by a participating library is available for free download to anyone who holds a library card in any other participating library. I find this service less useful than JSTOR: the holdings are generally newspapers and popular magazines, although there are some journals represented, as well as law and business reviews. The download format strips all of the original formatting from articles, which makes them rather ugly and a bit harder to read. I think you lose the figures, too. Still, it’s free if you have a library card, and it’s a good place to search for an article if you can’t find it anywhere else.

    Questia

    Questia is a for-pay service that claims to have “the world’s largest online collection of books and journal articles in the humanities and social sciences, plus magazine and newspaper articles”. Their collection is full-text searchable and, as they say, “you can read every title cover to cover”. Good luck doing so, though — articles and book chapters are not downloadable. Instead, you have to read them through Questia’s online interface, which is pretty clunky. On the plus side, they allow you to build your own “bookshelves” to collect books and articles that are relevant to you by topic or project. You can bookmark key sections, and highlight key passages. I used Questia when I was involved in research projects with psychology and organizational science aspects. I could get hold of articles or textbooks that I wanted to look at faster than through Interlibrary Loan, and more conveniently than going down to Stanford. The subscription fee at the time was cheaper than a membership to the APA or buying the articles piecemeal from Elsevier, or whoever.

    Currently, Questia’s subscription fee is $19.95/month for full library access; you can also subscribe to specific collections (such as Psychology, Literature, or Philosophy) for $9.95 per collection per month.

    Mendeley

    Another way to find useful literature is to connect with other people out there who share your interests. Mendeley is a tool that allows you to organize your collection of research papers, share it with colleagues, and to peruse the collections of other researchers with similar interests. I haven’t used it myself; but a friend of ours who is an active and influential AI researcher recommends it. It’s certainly worth a mention.

    PubMed Central

    PubMed Central is a free digital archive of biomedical and life sciences journal literature, sponsored and managed by the NIH. We don’t do life science research here at Win-Vector, but I’m mentioning PubMed because of this awesome policy by the NIH:

    nih.jpg

    NSF and DoD should institute similar policies, too.

    Google Books, Google Scholar

    Yes, they’re out there. Personally, I find them less useful than JSTOR or a subscription to (say) IEEE Xplore. Google Scholar generally returns the abstracts of articles at sites that don’t provide open access to the full-text article, such as the website of the journal that published the article, or the website of a restricted research archive, like the ACM. This is useful, in that it tells you that the article exists, but it’s rather frustrating, too. I don’t find Google Scholar to be significantly more helpful than doing a general Google search on the same keywords. On the other hand, some people swear by Google Scholar, so obviously your mileage may vary.

    Google Books has a very annoying habit of returning hits on your search terms, then not giving you read access to the page in question. Useless. If you happen to be doing research in an area where older books in the public domain are still of interest (for instance, my amateur interest in folklore and mythology), then Google Books can be quite helpful; of course, this situation is generally not true in technical research.

    Offline: Your Local University Library

    Here in the Bay Area, we are fortunate because the Stanford Library System has generous visitor access policies. The visitors’ policy statement is here; briefly, non-Stanford visitors are allowed 7 courtesy visits per year, with no borrowing privileges. For more visits, you can purchase an access card. I used the Stanford Libraries when my company was down in Mountain View, and I’m grateful for their openness. I don’t think many universities are as generous as Stanford is, but if you are near a university campus, it doesn’t hurt to check. For instance, the University of San Francisco will sell access cards to their library, with or without borrowing privileges, to non-affiliated visitors (it ain’t cheap), and allows practicing California attorneys access to their Law Library. San Francisco State has a Friends of the Library program, whereby non-affiliated visitors have access and borrowing privileges to the CSUSF library collection for $45/year.

    And there you have it. Research away!

    Related posts:

    1. Something I don’t get about business and bailouts
    2. YAYGDA (Yet Another Yahoo Google Deal Article)
    3. Exciting Technique #1: The “R” language.

    Public Service Article: Back Up http://www.win-vector.com/blog/2009/06/public-service-article-back-up/ http://www.win-vector.com/blog/2009/06/public-service-article-back-up/#comments Fri, 12 Jun 2009 00:09:04 +0000 John Mount http://www.win-vector.com/blog/?p=144
    This is a public service article encouraging all of us to back up our data (which more and more is our lives). I sketch some methods and resources for doing this.

    As more of our life becomes digital (work, finances, passwords, pictures, contacts, diaries, videos and email) we must be more diligent in backing up our data. If your hard drive fails at work you might lose some spreadsheets (and you might not lose anything if your IT department is on its toes); if your computer fails at home you lose your wedding album. Your hard disk will fail and try to take all of your data (life) with it; it is a matter of when, not a matter of if. You want this to be an inconvenience, not a disaster. Become expert at backing up and take the time to help others.

    First some definitions. Everything stored on your computer is called “data” and it is most commonly stored on a single “hard drive.” The act of making an extra copy of your data is traditionally called “backing up.” The act of trying to get access to your extra copy of your data is traditionally called “restoring.” The whole point of backing up is to be able to restore. If you can’t get your data back it really doesn’t matter what steps you took. Backing up with no ability to restore is just cargo cult behavior.

    If you have a professional service available they will likely do a better job than you can (this is one reason that larger businesses have professional IT staffs). However, at home you are likely on your own.

    This is an opinion piece and I am advocating backing up everything (whole drives) locally. If you do not back up everything you will need to choose what to back up and what to skip, and you will make mistakes and lose things. If you do not have a local back up, you might not be able to restore (the back up service goes out of business, or internet connections are still too slow to be practical). At the very least you should have a local back up; a remote back up is a good second step. Remote back up services are a good idea for important data and there are some high quality ones, but a few have gone out of business (Xdrive), so you do not want one to be your only chance of salvation.

    Let us first address a technical issue: what sort of set-up are you backing up? The three most common situations are: Windows, OSX and other Unix (Linux/BSD; yes, I know OSX is a Unix). Each of these has different appropriate tools:

    • Windows:

      For Windows Home type operating systems you are unlikely to have access to Microsoft’s back up tools (which is a real shame, the tools are more useful at home than at a business). So you need to install something.

      I have not researched the Windows world extensively, so I can not give advice. I can, however, relate my experiences and current policies.

      I now avoid EMC Retrospect (often comes free with USB drives) at all costs. My experience has been that EMC Retrospect is hard to use to restore your data (the whole point of backing up). For me it often refused to run (due to licensing issues) and it was very sensitive to the exact version of the Microsoft .NET framework that was installed on my Windows system. Two separate times an update to the Microsoft .NET framework rendered EMC Retrospect unusable (and broke nothing else).

      I have happily purchased Acronis True Image three times now (twice for myself and once for a friend). Their website is a bit confusing (you must be careful to get the retail product, not the enterprise product that costs many thousands of dollars). The software seems to be very good. It can back up, restore and can even read data from an “image” (which means you can get to your data without even restoring).

    • OSX:

      An embarrassment of riches:

      The free options include following Jamie Zawinski’s wonderful advice (which I am shamelessly stealing from here), using the free copy of SuperDuper! (which is very good and a complete back up solution even in the free version) or Time Machine (the back up utility included in the current Mac operating system: Leopard).

      One huge advantage of modern Macs is that if you have formatted your drive correctly you can boot off a USB drive. So if you use the above instructions you can plug your back up in and use it to run (delaying your need to open up your machine or attempt a restore until later). This is also important in rehearsing your restore procedures.

      Finally, if you have the cash there is the somewhat over-priced (but wonderful) Time Capsule. You can live without Time Capsule, but it is part of my “dream set up” (described below).

    • Unix:

      Follow any sort of advice on how to script back ups (such as Jamie Zawinski’s) and you should be protected. Rsync is a great tool.
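As a rough illustration of what a scripted back up does, here is a minimal Python sketch of a one-way mirror: it copies a file only when the copy on the back up drive is missing or differs in size or modification time. This is not rsync and it is not Jamie Zawinski's script; the names `mirror` and `_needs_copy` are made up for this example, and a real tool handles deletions, permissions and links far more carefully.

```python
import os
import shutil


def _needs_copy(src, dst):
    """True when dst is missing or differs from src in size or modification time."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def mirror(src_root, dst_root):
    """Copy new or changed files from src_root to dst_root; return the copied paths."""
    copied = []
    for folder, _subdirs, files in os.walk(src_root):
        rel = os.path.relpath(folder, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            src = os.path.join(folder, name)
            dst = os.path.join(target_dir, name)
            if _needs_copy(src, dst):
                shutil.copy2(src, dst)  # copy2 also carries over the modification time
                copied.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(copied)
```

Running `mirror` a second time with nothing changed should copy nothing, which is the "incremental" behaviour that keeps a weekly back up quick.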

    More important than the back up tools is having a precise back up goal and a matching back up plan. I use my own goal and plan as an example and you can use it as a basis for safer or more risky plans (depending on your resources and needs).

    My goal is to: (with very high probability) not lose more than a week of my life. The plan to achieve this is a full local back up every week and the willingness to buy some new equipment if I have to do a restore. A failure could delay my work for a day or so, but not put me out of business. For my business it does not make sense to ensure “no down time”- this is an unreasonably expensive thing to try to achieve (and the inappropriateness of this goal is one reason many people have no back ups at all). My worst case “restore” plan is to drive to a store and buy the cheapest temporary computer. A more likely case is I just need to use one of my extra drives to do my restore (very cheap). I would then restore the back up onto a fresh drive (or the temporary computer) and work from there until I could repair or replace my major system.

    My back up plan has several “eyes open” weaknesses. I only back up every week, so I could lose a week of data if my disk dies right before a back up. Also, restoring my data could take a day and $500 (a trip to the store to buy a temporary computer and hours to restore the drive contents). Knowing these weaknesses is the point of the back up plan: I am trading hoping that my drive doesn’t blow up and take all of my data away for hoping my drive doesn’t blow up and cost me a day of work and a few hundred dollars. That is, I am trading the Sword of Damocles for worrying about something like stubbing my toe. Drive failures, while inevitable, are not frequent. If I put a quarter in a jar every day I don’t have a drive failure I would more than likely have the $500 needed to perform an emergency restore saved up long before I have a drive failure. By not purchasing excess equipment (computers) before the failure I save money by maybe not having to purchase it at all, or at least by purchasing cheaper and better equipment at the time of failure (instead of now).
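The quarter-a-day arithmetic is easy to check. The sketch below uses only the two figures from the text ($500 and 25 cents a day); the function name `days_to_save` is just for illustration.

```python
def days_to_save(target_dollars, dollars_per_day):
    """Number of whole days of saving needed to reach the target."""
    cents_target = round(target_dollars * 100)
    cents_per_day = round(dollars_per_day * 100)
    return -(-cents_target // cents_per_day)  # ceiling division on integer cents


days = days_to_save(500, 0.25)
print(days, "days, about", round(days / 365.25, 1), "years")
# prints: 2000 days, about 5.5 years
```

So the jar covers the emergency restore as long as the drive survives roughly five and a half years; with a shorter-lived drive you would want to put in a bit more per day.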

    Now to describe my implementation of my plan. First I purchased the following things:

    • Time Capsule (optional):

      apple-time-capsule_1.jpg

    • Thermaltake External Hard Drive SATA Dock ($40 : Newegg):

      dock-station.jpg

    • Two 1TB drives ($90 each at Newegg; these are the cheaper “internal” drives that go into desktop computers or into the Thermaltake dock. If you don’t like the ugly you could skip the Thermaltake dock and buy USB drives instead.):

      HD-S1000S32.jpg

    • So for a little over $220 I am in business. Every week I could take one of the drives out of its envelope, stick it in the Thermaltake dock and use one of the tools described above to create a complete back up. What I actually do is even better. Any time I want I ask my computer to use Time Machine to back up to the Time Capsule (typically takes about 20 minutes) and then once a week I stick a drive in the Thermaltake dock and let the Time Capsule copy itself onto the drive (so both my computer and I are completely uninvolved in the 8 hours this step can take). For offsite back ups (to defend against things like fire) I can take one of the drives to a safe place off site (locker, safe deposit box). I recommend physical protection (locks, fire safes) to protect your drives (not encryption; there is a good chance you will get something wrong with encryption and not be able to restore).

      Using Time Machine gives me the benefit of having multiple back ups so I can look at earlier versions of files and the speed of only needing to perform incremental back ups (only what has changed needs to be copied). Another way to get the advantage of having extra versions of all of your files is to put most of your files under management of a “source control system” like Bazaar. Systems like this (free, runs on Windows, OSX and Unix) let you keep all versions of all of your files (answers things like “what did I have in the file before I deleted it last week?”) and are incredibly useful (you will wonder how you lived without them).

      Finally I end with some “defensive thinking” required to succeed with back ups. I have not said why I purchased two extra drives. This is so I can rotate which extra drive I back up onto. Drives most often fail when being used- so it is very plausible that my main machine could die while backing up. If the main machine dies while backing up then not only is its data lost but the back up is also useless (as the main machine was interrupted while trying to write it out). This is not quite ironic because while it is contrary to what you would want it is not unexpected. To be safe from a failure during the back up procedure you must have a second drive that is not being used. Only after the first back up is known to have succeeded can you then back up onto the other drive.
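The rotation rule above can be written down in a few lines. This is only a sketch of the idea, not a tool I actually run: the class `BackupDrive`, the week numbers and the two helper functions are invented for the example. The point is that the drive holding the newest verified back up is never the one being overwritten.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackupDrive:
    label: str
    last_good_week: Optional[int] = None  # week of the last verified back up, None if never


def pick_target(drives: List[BackupDrive]) -> BackupDrive:
    """Pick the drive holding the oldest verified back up (or no back up at all)."""
    return min(drives, key=lambda d: -1 if d.last_good_week is None else d.last_good_week)


def record_success(drive: BackupDrive, week: int) -> None:
    """Call this only after the new back up has been checked."""
    drive.last_good_week = week


drives = [BackupDrive("A"), BackupDrive("B")]
history = []
for week in range(1, 5):
    target = pick_target(drives)
    history.append(target.label)
    record_success(target, week)  # pretend every back up succeeded
print(history)
# prints: ['A', 'B', 'A', 'B']
```

If a back up is interrupted, `record_success` is never called, so the same drive is picked again next time and the other drive's good copy stays untouched.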

      You must rehearse and think through all of your back up steps. If you are lucky you will find flaws in your plan during rehearsals instead of when you go to restore. For example, tape back up procedures are notorious for writing out years of incremental back ups that don’t work during a restore attempt. Use a system that allows safe rehearsals (such as trying to boot from a bootable back up or inspecting a file from an Acronis image or Time Machine archive). Plans that only allow restores are not safely rehearsable (if the rehearsal fails you damage something on your primary machine). Also: if you are really trying to restore you are not likely to be in a good mood; iron out potential kinks during rehearsals, not during a panic.
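One cheap rehearsal that cannot damage the primary machine is to read files back from the back up and compare them with the originals. The sketch below does this with SHA-256 checksums; it assumes the back up is a plain directory tree you can mount and read (as with a bootable clone), and the function names are made up for this example. It only reads, it never writes.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(src_root, backup_root):
    """Return relative paths that are missing from, or differ in, the back up."""
    problems = []
    for folder, _subdirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(folder, name)
            rel = os.path.relpath(src, src_root)
            copy = os.path.join(backup_root, rel)
            if not os.path.isfile(copy) or file_digest(src) != file_digest(copy):
                problems.append(rel)
    return sorted(problems)
```

An empty result means every file on the source has an identical copy in the back up; files changed since the last back up will naturally show up as differences.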

      No plan is perfect- we can not cheaply eliminate all risk. In this case what we can do is eliminate exposure to likely scenarios. Data loss can still happen, but it is not inevitable.

    Related posts:

    1. Public Service Article: JSTOR and other Useful Research Archives
    2. YAYGDA (Yet Another Yahoo Google Deal Article)
    3. Exciting Technique #1: The “R” language.

    What is “Genetic Art?” http://www.win-vector.com/blog/2009/06/what-is-genetic-art/ http://www.win-vector.com/blog/2009/06/what-is-genetic-art/#comments Tue, 02 Jun 2009 05:17:57 +0000 John Mount http://www.win-vector.com/blog/?p=125
    What is “genetic art?” My answer to this is http://www.geneticart.org (redirects to http://www.mzlabs.com), but this requires some explanation.
    The quick answer is this is genetic art:


    pic1.png


    pic2.png

    The longer answer is that a number of times different forms of algorithmic art have been invented. Algorithmic art is art generated by mathematical procedures. Such art is similar to earlier mechanical and kinetic art forms. One branch of mathematics often used to generate such art is called “fractals.” We looked somewhere else for our inspiration (our art is not strictly fractal in nature). What we worked on we called “genetic art” to emphasize the role of encoding and re-combination in the works.

    In the early 90’s Karl Sims presented a number of art installations based on at least three interesting ideas:

    • Transforming images
    • Evolving combinations of transforms
    • Direct participation

    (see: Karl Sims. Artificial Evolution for Computer Graphics. Proceedings of SIGGRAPH 1991 and Karl Sims’ homepage).

    The part that caught a number of people’s imaginations was the evolution aspect. Karl Sims defined a method of combining transformations of original source images. He then allowed people to manipulate his art installations and “vote” on art they liked best. The more popular pieces were combined (or bred) to create newer works that were then put up for criticism. After many breedings (or generations) the combinations of transforms were quite complicated and a number of unexpected images were created.

    At CMU Shumeet Baluja, Dean Pomerleau and Todd Jochem were interested both in the evolutionary aspects of the art and also seeing if a machine could learn to model user tastes (see Shumeet Baluja, Dean Pomerleau and Todd Jochem. Simulating User’s Preferences: Towards Automated Artificial Evolution for Computer Generated Images. Technical Report CMU-CS-93-198. Carnegie Mellon University. Pittsburgh, PA. October 1993. ). They built a much simpler art system that combined primitive elements (elements closer to brush strokes than to original pictures) and tried to learn user preferences for complex pictures.

    Figures of this era looked much like this (well better than this, this comes from a scan of a black and white printing of the paper):


    evolve.png.

    Scott Neal Reilly built a new, more simplified system; and with Michael Witbrock put the whole thing on the Web. This was an unimaginably primitive time on the Web. Cutting edge interaction was sites like “Blue Dog Can Count.” The Mac had no forms capable browser and Amazon.com was still a year away from launching. An interactive art exhibition running directly on the Web (and manipulated by anybody) was a significant step forward.

    Michael Witbrock was influenced by the stories of Heikegani Crabs and Alan Turing’s 1952 paper “The Chemical Basis of Morphogenesis” (which theorized how simple systems could develop textures).

    At this point I (John Mount) got interested in the project and felt that much more could be done with how such systems handled color. The art had been simplified to primitive elements that one could think of as brushes (really more like gradients) but the art was essentially grey-scale with a false-color map applied at the last step. Karl Sims had made transformations on images his primitive operations; I wanted my primitive operations to be transformations on color.

    Being a math-nerd I chose to encode color inside a mathematical system called “Quaternions” (see Ebbinghaus et al. Numbers Springer-Verlag, Second Edition, 1988). Colors are often represented as three brightness terms, for example intensity of red, intensity of green and intensity of blue. The Quaternions were discovered by Sir William Rowan Hamilton in 1843 (see Wikipedia: Quaternion). Hamilton was trying to solve the problem of encoding positions in space in a nice structure and was so excited by his discovery that he carved his fundamental formula for them in the Brougham Bridge the night he had his breakthrough. Quaternions are represented as four standard numbers (so they have enough “slots” to encode a position in Hamilton’s case or in our case a color) and they themselves behave a lot like individual numbers. There are rules for adding, subtracting, multiplying and even dividing Quaternions. This means you can write formulas over them and these formulas are now directly manipulating colors (instead of manipulating geometry or intensities as in the earlier systems).
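To make "four standard numbers that behave like a single number" concrete, here is a small Python sketch of Quaternion addition and multiplication. Hamilton's fundamental formula is i² = j² = k² = ijk = −1, and the multiplication rule below follows from it. The class is written for this article and is not the code of the art system; in particular it does not say how the four components were turned into a displayed color, and division is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quaternion:
    w: float  # real part
    x: float  # coefficient of i
    y: float  # coefficient of j
    z: float  # coefficient of k

    def __add__(self, other):
        return Quaternion(self.w + other.w, self.x + other.x,
                          self.y + other.y, self.z + other.z)

    def __mul__(self, other):
        # Hamilton product; note that it is not commutative.
        a1, b1, c1, d1 = self.w, self.x, self.y, self.z
        a2, b2, c2, d2 = other.w, other.x, other.y, other.z
        return Quaternion(
            a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
            a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
            a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
            a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
        )


I = Quaternion(0, 1, 0, 0)
J = Quaternion(0, 0, 1, 0)
K = Quaternion(0, 0, 0, 1)


def gradient(x, y):
    """The formula x - i*y - j*x - k*y evaluated at pixel position (x, y)."""
    return Quaternion(x, -y, -x, -y)
```

With this in hand, `I * J` gives `K` while `J * I` gives minus `K`, and a formula over pixel coordinates such as `gradient(x, y)` yields one Quaternion per pixel.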

    So, as with Neal Reilly’s system, we represented all of our transformations as formulas and represented “breeding” as ripping a bit of one formula out and combining it with another. For instance these two rather uninteresting color gradients were represented by the formulas:

    ( x – i y ) : x_iy.png
    ( x – i y – j x – k y ) : x_iy_jx_ky.png.

    One of our arithmetic operations was named “mod” and we could use it to combine the two items into a more complicated formula and somewhat more interesting picture:

    ( mod ( x – i y) ( x – i y – j x – k y ) ) : mod.png.
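Formulas like these are naturally stored as trees, and "breeding" then becomes swapping a piece of one tree for a piece of another. The sketch below is one simple way to do that, assuming formulas are nested tuples with the operator first; the representation, the `"*"` and `"-"` operator names and the `breed` function are illustrative choices, not the actual implementation of the system.

```python
import random

# A formula is either a leaf (a string such as "x") or a tuple (operator, arg, arg, ...).
GRADIENT_1 = ("-", "x", ("*", "i", "y"))
GRADIENT_2 = ("-", ("-", ("-", "x", ("*", "i", "y")), ("*", "j", "x")), ("*", "k", "y"))


def subtrees(formula, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indexes."""
    yield path, formula
    if isinstance(formula, tuple):
        for index, child in enumerate(formula[1:], start=1):
            yield from subtrees(child, path + (index,))


def replace_at(formula, path, new_subtree):
    """Return a copy of formula with the node at path swapped for new_subtree."""
    if not path:
        return new_subtree
    head, rest = path[0], path[1:]
    return formula[:head] + (replace_at(formula[head], rest, new_subtree),) + formula[head + 1:]


def breed(parent_a, parent_b, rng):
    """Rip a random piece out of parent_b and graft it into a random spot in parent_a."""
    path, _old = rng.choice(list(subtrees(parent_a)))
    _path, graft = rng.choice(list(subtrees(parent_b)))
    return replace_at(parent_a, path, graft)


def show(formula):
    """Render a formula in parenthesised prefix style, like the mod example."""
    if isinstance(formula, tuple):
        return "(" + " ".join(show(part) for part in formula) + ")"
    return formula


combined = ("mod", GRADIENT_1, GRADIENT_2)
child = breed(GRADIENT_1, GRADIENT_2, random.Random(0))
```

Because every graft is itself a well-formed formula, any child produced this way can be evaluated and drawn just like its parents.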

    After enough generations of selection and breeding the formulas get long and complicated (luckily nobody but the machines has to look at them) and the pictures get interesting:


    silver.png.

    Michael Witbrock and Scott Neal Reilly supplied an updated web interface. And at this point we got our 15 minutes of fame:

    From Wired 3.01 January 1995 p. 147 Kristin Spence’s “Net Surf” column:


    Wired3.01.png.

    We also made large prints (using a parallel computation system named “WAX” by Peter Stout) and set up an exhibition in a Pittsburgh coffee house. The development work was largely done on a machine with a black and white monitor (later we got access to a grey scale monitor) so it really was a treat that color was able to fend for itself.

    On an unrelated track in 1991 another CMU student, Scott Draves, was pursuing a serious project and building art based on iterated function systems (more related to fractals than our work) and in 1999 added some genetic ideas and released the Electric Sheep client/server oriented screen saver (inspired by SETI@home).

    Recently we have gotten back to some more of Sims’ ideas and allowed more geometric transformations and real source images, such as incorporating some of Dover’s royalty free Japanese textile designs:


    texture.png.

    The original system was group interactive and an incredible hit (and had its own Zephyr discussion instance- the then equivalent of Twitter). The current demo is stand alone and server-free, so it is a single player game.

    All the user has to do is point a Java 1.4 (or better) capable browser at: Genetic Art Program. Click (and hold) on the “Action Menu” of any of the sub-windows and select “take over left selection” on one picture that interests you and “take over right selection” on another picture that interests you. You will notice doing this copies the pictures to the left and right of the “breed pictures” button. Now press “breed pictures” as many times as you want. Each time you press it you will get a new picture built by combining elements of the two small pictures. At any time you can have some other picture you like take over a breeding position (again by using its action menu). And you can also scroll the strip of five pictures around and click on them to introduce previous favorites from the earlier server version of the program into your space of opportunities.

    We were originally a bit uncomfortable calling the work “Art” but a number of the images have made significant impressions on us and others- so perhaps it qualifies.

    Related posts:

    1. Programs reduced to statistics

    Programs reduced to statistics http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/ http://www.win-vector.com/blog/2009/05/programs-reduced-to-statistics/#comments Sun, 31 May 2009 16:24:58 +0000 John Mount http://www.win-vector.com/blog/?p=110
    An interesting article on programming languages by Guillaume Marceau is making the rounds: The speed, size and dependability of programming languages. The article points out very clearly what some of the differences in major programming languages are. The author uses benchmarking and graphs in an interesting way.

    I have had a soft spot for this kind of study ever since I read: Donald E. Knuth: An Empirical Study of FORTRAN Programs. Softw., Pract. Exper. 1(2): 105-133 (1971). In that article Knuth admits to breaking into people’s accounts to collect statistics on what evil people were feeding into the FORTRAN compiler.

    Lets look at the gestalt of a few popular programing following button-sized excerpts from Marceau’s article:


    compPlots.png.

    To build these graphs 19 challenge problems were implemented in 72 programming languages. Each square is a programming language; the x-axis is run time and the y-axis is code size (large is bad on both of these). Each line segment connects the code size and run time of one example program run to the centroid of all such runs for the language. We all know code size is not a very good stand-in for programming difficulty (compare C, a merely primitive language, to C++, an outright programmer-hostile language), but the pictures actually tell a credible story.

    • GCC (or C) is very very fast but takes a lot of code (its graph is a vertical bar running up and down the left).
    • Java mostly works like C, but every once in a while lets you down on performance (this is leaving out that Java is far safer than C and far more wasteful of memory).
    • Javascript and Ruby have such bad implementations that their centers are off the graph (this brings up a point the original authors well understand: you cannot benchmark a language, only a specific run of a specific program using a specific language implementation).
    • Perl and Erlang have similar run time performance (though they sit at completely opposite poles of elegance; elegance is not plotted on the graph).
    • Ruby’s implementation makes Python look fast.
    • OCaml lives up to its reputation of being simultaneously very expressive and efficient (but expressive power is not a direct measure of ease of use, think of APL).

    The benchmarking depends on people donating example programs and the problem types are heavily biased towards the puzzle area (where C, Java and OCaml excel) and not to the “it’s a one-liner because it is already done in a framework” area (Perl, Python, Ruby).

    For all the problems inherent in such a study I think it is actually interesting what a little quantitative data lets us think about.

    Related posts:

    1. Sorting Used in Anger
    2. I know, I am the one being a jerk
    3. Hello World: An Instance Of Rhetoric in Computer Science

    The Joy of Calculation http://www.win-vector.com/blog/2009/05/the-joy-of-calculation/ http://www.win-vector.com/blog/2009/05/the-joy-of-calculation/#comments Tue, 12 May 2009 15:27:22 +0000 John Mount http://www.win-vector.com/blog/?p=87
    I recently had the pleasure of finding a copy of the manual for my favorite calculator. I know it is incredibly nerdy to have a favorite calculator (and even more nerdy to read the manual), but it really got me thinking.

    The manual subtly sold an incredible point of view: the engineer’s view. The manual appears trivial at the surface but is in fact very good rhetoric pushing a fascinating point of view: you can infer things quickly. This led me to think about a number of technical viewpoints (the engineer’s point of view, the scientist’s point of view and lastly the mathematician’s point of view). They are all lumped together as “quantitative” but they are radically different.

    Listen to this (from the beginning of the HP15C calculator manual):

     
    CalculationExample.gif
     

    Notice the emphasis on the physical activity of calculation. The emphasis is not on equations, mathematics or physics. The calculation is deliberately described as key strokes. No attempt is made to justify any of the steps or numbers used. The point being made is: if you are agile and ready (have the correct fore-knowledge) you can calculate. If you can calculate you can know things. Robert Heinlein made this point about slide-rules in his science fiction story: “Have Spacesuit- Will Travel.” And likely a similar joy can be felt while accounting on an abacus.

    This is the engineer’s view: the world continuously gives up many small and simple clues as to what is going on around you. These are like “tells” in poker. You can reason from them and build incredible things using them. The smallness and simplicity of the techniques are pure comfort.

    In Michael Lewis’s “Liar’s Poker” the author mentions a moment when he knows that everything he is being told about the market is a lie. He knows this because he attempts to convert one statement about the market into another using his calculator. When he attempts the conversion (figuring out something he was not supposed to know from clues coming from something he was told) it does not add up. Importantly he describes working this out on his calculator, not using a sophisticated computer model or a spreadsheet. He is comfortable in his heterodox position because he calculated it by hand in small and simple steps.

    This joy in comparing one conclusion to another (using a calculator) differs from the idealized scientist’s view in that there is no derivation or application of deeper laws. The engineer’s view is: if you can remember it or guess at it then you don’t need to derive it.

    Some of the great scientists (Enrico Fermi) and mathematicians (Stanislaw Ulam) became masters of the engineering view and could dazzle with it.

    One of Fermi’s famous stunts was measuring the yield of a nuclear bomb test by observing how far scraps of paper were moved. Fermi may have worked from first principles, but he could also have used a simple pre-prepared trick. Suppose he had observed how far scraps of paper moved in an earlier conventional bomb test (whose yield he knew). A simple engineering trick called “dimensional analysis” would then let him reason that the amount of work observed (how far the slips of paper were moved) depended linearly on the bomb yield and decreased as the cube of how far away he was from the explosion. So all he had to do was compute the ratio of how far the slips moved in each test and then divide this three times in succession by the ratio of how far away he was from the center of each test. Merely being able to divide told Fermi something (the new bomb yield) before he was officially allowed to know it. Notice how he did not need to use any facts about the bombs being tested, the speed of sound, atmospheric pressure, density or temperature.
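
    The division trick above can be sketched in a few lines. This is only an illustration of the scaling rule as stated in the text (shift proportional to yield divided by distance cubed); the function name, its parameters and the numbers are hypothetical and are not Fermi’s measurements.

```python
def scaled_yield(ref_yield, ref_shift, ref_distance, new_shift, new_distance):
    """Estimate a new yield from a reference test.

    Assumes (as in the text) that the paper shift is proportional to
    yield / distance**3. All arguments are illustrative inputs.
    """
    # Ratio of how far the slips moved in each test.
    estimate = ref_yield * (new_shift / ref_shift)
    # Ratio of how far away the observer was in each test.
    distance_ratio = ref_distance / new_distance
    # "Divide this three times in succession" by the distance ratio.
    for _ in range(3):
        estimate /= distance_ratio
    return estimate


# Made-up numbers: the slips move twice as far while standing ten times
# farther away than in the reference test of yield 1.0.
estimate = scaled_yield(ref_yield=1.0, ref_shift=2.0, ref_distance=1.0,
                        new_shift=4.0, new_distance=10.0)
print(estimate)
```

    Nothing but multiplication and division is needed, which is the point: the reference test carries all of the physics.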

    Such reasoning may seem crude- but it is far more informative and far more exciting than the published work of many lesser scientists. The bulk of most merely poor scientific work (as opposed to outright wrong work) is of the form: “here are some pointless measurements I got by applying an expensive new instrument in exactly the situations the manufacturer designed it for.” Or “here are some manipulations that seem original since I don’t feel I have to cite any non-physicists.”

    I side with the mathematicians (not the engineers or even scientists) and I think it is safe to say that mathematicians (who have their own particular view) are more sympathetic to the engineer’s view than to the scientist’s view.

    One joke that has been told about me is that I am not happy at a presentation unless there is an equation on the board. This is typical of mathematicians. The excitement comes from the opportunity to “kick the tires.” Once you remove enough details an equation is a simple statement of the form “A=B” (to borrow the title of a wonderful book by Marko Petkovsek, Herbert Wilf and Doron Zeilberger). An equation is a welcome moment of concreteness in contrast to the many painful abstractions that are necessary for much of mathematics. The dirty secret is that mathematicians perk up when an equation is on the board not because they like equations, but because they are hoping to plug in values for “A” and “B” such that the equation is shown to be false. My branch of mathematics (theoretical computer science) is more a competitive than a cooperative field. One measure of audience interest in my field was whether somebody tried to grab the magic marker out of your hand to write down a counter-example to what you were trying to demonstrate. Gian-Carlo Rota tells a similar tale where someone in a mathematical audience grabs the chalk and tries to complete the presentation.

    One reason I side with the mathematicians and not the engineers is: if pressed too far the engineer’s view goes wrong. The way it goes wrong is found in the thick classic comprehensive engineering handbooks. These books attempt to store and systematize all of a given field’s engineering knowledge. Once you attempt to become comprehensive and are devoting all of your intellect to memorizing and applying the standard approximations and estimates you are lost.

    I also do not side with the scientists because mathematicians have no sympathy for trying to “buy your way out of solving a hard problem” by running an expensive experiment. Mathematicians do work with data (even messy data) but we call this “application” not “proof.”

    To me the best view is: if you can derive anything then you do not need to remember anything.

    Related posts:

    1. Hello World: An Instance Of Rhetoric in Computer Science
    2. Something I don’t get about business and bailouts
    3. Do Not Let Your Medical Records Be Used Against You

    The Data Enrichment Method http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/ http://www.win-vector.com/blog/2009/04/the-data-enrichment-method/#comments Fri, 01 May 2009 01:03:06 +0000 John Mount http://www.win-vector.com/blog/?p=80
    We explore some of the ideas from the seminal paper “The Data-Enrichment Method” ( Henry R Lewis, Operations Research (1957) vol. 5 (4) pp. 1-5). The paper explains a technique of improving the quality of statistical inference by increasing the effective size of the data-set. This is called “Data-Enrichment.”

    Now more than ever we must be familiar with the consequences of these important techniques. Especially if we don’t know if we might already be a victim of them.


    “The Data-Enrichment Method” is an absolutely wonderful 1957 tongue in cheek parody of a very tempting method of accidental data falsification. The method presented is spookily plausible and actually anticipates some very important (and correct) methods later used in the EM, Jackknife, Bootstrap and other resampling techniques (for example see: “Bootstrap Methods: Another Look at the Jackknife”, Bradley Efron. Ann. Statist. (1979) vol. 7 (1) pp. 1-26).

    The idea is innocently presented with an accompanying data-set: perception of a sound at different presented decibel levels (loudnesses):

    Source.DB Detections Failures
    62 5 40
    65 10 30
    68 15 20
    71 20 10
    74 25 5
    77 30 3

    From this table it is obvious that the number of detections is increasing (and the number of failures is decreasing) as the sound is presented louder and louder. This makes sense and puts a quantitative rate to our prior expectation that detection gets easier as loudness increases. For this data the trend is quite obvious and we can easily plot a regression line that accurately models the effect of Source.DB on detection rate:


    SourceDBDetectionRate.gif

    But we want more. Can we increase our model precision and confidence by incorporating our domain knowledge? If we are only trying to accurately estimate the rate that loudness increases the detection level and we are willing to assume that it really does increase, then: could we not pre-prepare the data to use our domain knowledge?

    The method suggested is to add in some contra-factuals that we feel confident about. For example we could (using our assumption that loudness increases detection, just to an unknown degree) notice that the 30 failures at 65 DB certainly would not have been heard if they had been run at 62 DB (even quieter). By the same reasoning we can assume that the 5 detections at 62 DB would have been heard had they been run at 65 DB, 68 DB, 71 DB, 74 DB or 77 DB. In this way we have used our starting “seed data” and our domain knowledge to boost into a much larger data set that shows the expected relation much more strongly.

    The above paragraph is, of course, nonsense. I am doing the original paper an injustice by summarizing- because in the original paper the procedure seems perfectly plausible (and useful). It is not until the author works a second example that has a poor initial relation (that actually needs the enrichment) that the joke is revealed.

    The second example is coin flipping. The author applies an inductive bias that “clearly standing higher up on a staircase increases the chances of a coin flip coming up heads” and then uses the data enrichment method to enhance the data set. The original data set is indeed too noisy to show the effect and the enhancement is in fact quite dramatic. The original data:

    Stair.Step Heads Tails
    1 4 6
    2 5 5
    3 7 3
    4 4 6
    5 6 4
    6 5 5
    7 6 4
    8 6 4
    9 3 7
    10 4 6

    The enhanced data is much more interesting:

    Stair.Step Virtual.Heads Virtual.Tails
    1 4 50
    2 9 44
    3 16 39
    4 20 36
    5 26 30
    6 31 26
    7 37 21
    8 43 17
    9 46 13
    10 50 6
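
    One reading of the enrichment step that reproduces the table above exactly: every head observed on a lower step is also counted as a head on all higher steps, and every tail observed on a higher step is also counted as a tail on all lower steps. The sketch below assumes that reading (the function name is made up for illustration) and is, like the method itself, not a recommendation.

```python
from itertools import accumulate

# Original coin-flip counts for stair steps 1..10, from the first table.
heads = [4, 5, 7, 4, 6, 5, 6, 6, 3, 4]
tails = [6, 5, 3, 6, 4, 5, 4, 4, 7, 6]


def enrich(heads, tails):
    """Apply the (deliberately bogus) enrichment as running totals."""
    # Heads carry upward: step k collects the heads of steps 1..k.
    virtual_heads = list(accumulate(heads))
    # Tails carry downward: step k collects the tails of steps k..10.
    virtual_tails = list(accumulate(reversed(tails)))[::-1]
    return virtual_heads, virtual_tails


virtual_heads, virtual_tails = enrich(heads, tails)
for step, (h, t) in enumerate(zip(virtual_heads, virtual_tails), start=1):
    print(step, h, t)
```

    Written this way it is plain that the “enriched” columns are monotone by construction, whatever the original counts were.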

    It is easier to see what is going on in the following plots (which show measured success rates as a function of number of stairs up the staircase and show a smoothed fit of the relationship). The original data is a noisy mess:


    CoinSmoothed.gif

    And the enriched data is more trend-like:


    VirtualSmoothed.gif

    In fact the regression line fit onto the raw data even has the wrong sign (points down instead of up):


    CoinFit.gif
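
    The sign of the raw-data fit can be checked by hand. The sketch below assumes the fit is an ordinary least-squares line through the per-step fraction of heads; the plotted fit may have been produced differently, so treat this as a consistency check only.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


steps = list(range(1, 11))
heads = [4, 5, 7, 4, 6, 5, 6, 6, 3, 4]
tails = [6, 5, 3, 6, 4, 5, 4, 4, 7, 6]
rates = [h / (h + t) for h, t in zip(heads, tails)]

slope, intercept = least_squares_line(steps, rates)
print(slope, intercept)  # the slope comes out slightly negative
```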

    Now, obviously this is a joke. The enhancement procedure did not so much enhance the data as obliterate it. The procedure makes no sense and it is treating the procedure with undue respect to point out any one feature as being “what is wrong with it.” But the original desire is legitimate: can we use informed assumptions to gain a useful inductive bias? If we do know something should we not need less data?

    The answer is yes- but we have to be careful. We must read up on the differences between Bayesian, frequentist and empirical methods and decide which set of methods is best for us. Up until now we have been fitting “by standard methods” which is really just minimizing how far the data is from the model (by moving the model around). That isn’t the only way to fit (see: “Controversies In The Foundation Of Statistics” Bradley Efron, American Mathematical Monthly (1978) vol. 85 (4) pp. 231-246).

    For example a Bayesian might say that the goal of model fitting is not to pick a model that is closest to the data (maximizes the data’s plausibility with respect to the model) but to pick a model that simultaneously maximizes the product of the data’s plausibility with respect to the model and the model’s acceptability. For example we could say all models for coin-flips with negative slopes are unacceptable and pick the best model with a non-negative slope. However, assigning degrees of acceptability (or priors) to every possible model is laborious and may require more knowledge than we have from our “reasonable prior domain knowledge.”
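
    The crudest version of this idea, declaring every negative slope unacceptable and treating all other lines as equally acceptable, can be sketched for a least-squares line. This is an illustrative assumption about how one might encode the prior, not a method taken from the references. Because the squared-error objective is convex, the constrained optimum is either the unconstrained fit or sits on the boundary (slope zero, intercept equal to the mean).

```python
def fit_nonnegative_slope(xs, ys):
    """Least-squares line restricted to slope >= 0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    if slope < 0:
        # The unconstrained answer is "unacceptable"; the best acceptable
        # line is flat and passes through the mean of the data.
        return 0.0, mean_y
    return slope, mean_y - slope * mean_x


steps = list(range(1, 11))
rates = [0.4, 0.5, 0.7, 0.4, 0.6, 0.5, 0.6, 0.6, 0.3, 0.4]  # heads / 10
slope, intercept = fit_nonnegative_slope(steps, rates)
print(slope, intercept)
```

    On the coin data the constraint is active, so the fit reports no trend at all rather than a manufactured upward one.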

    Another method is to use more sophisticated notions. One such method is Quantile Regression ( Roger Koenker, Cambridge University Press 2005). This methodology treats regression as a constrained optimization problem- so it is a simple matter to add in more constraints (like the slope must be positive) without having to assign arbitrary plausibilities to every possible model. Another (huge) advantage is that Quantile Regression is much more stable and even without any entered constraints recognizes that the coin-flip data is likely trend free. Here we plot the Quantile Regression analysis of the coin-data (without having added any prior constraints):


    QuantileRegression.gif

    To be honest: the method got lucky- the fit is better than should be expected. But Quantile Regression is the perfect framework for adding in domain-constraints.
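
    To make “regression as a constrained optimization problem” concrete, here is a brute-force sketch of the median (least absolute deviation) special case with an optional slope constraint. Real quantile regression software solves a linear program; the enumeration below is a stand-in that relies on the fact that an optimal line can always be found passing through two data points, or through one data point when the slope constraint is tight. The function names are illustrative.

```python
from itertools import combinations


def lad_loss(points, slope, intercept):
    """Sum of absolute residuals of the line on the points."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points)


def lad_line(points, min_slope=None):
    """Least-absolute-deviation line, optionally with slope >= min_slope."""
    candidates = []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 != x2:
            slope = (y2 - y1) / (x2 - x1)
            candidates.append((slope, y1 - slope * x1))
    if min_slope is not None:
        candidates = [c for c in candidates if c[0] >= min_slope]
        # Lines on the constraint boundary that pass through one point.
        candidates += [(min_slope, y - min_slope * x) for x, y in points]
    return min(candidates, key=lambda c: lad_loss(points, *c))


steps = range(1, 11)
rates = [0.4, 0.5, 0.7, 0.4, 0.6, 0.5, 0.6, 0.6, 0.3, 0.4]  # heads / 10
points = list(zip(steps, rates))

free_fit = lad_line(points)
constrained_fit = lad_line(points, min_slope=0.0)
print(free_fit, constrained_fit)
```

    Adding the domain constraint is a one-argument change, with no need to score every possible model.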

    So: while The Data Enrichment Method is a fraud, there are ways to enhance analysis to incorporate domain knowledge into results. Instead of saying “any bias (even useful bias) ruins fitting” one should have a cookbook of methods ready to be applied. These cookbooks hide under names like “Econometric Society Monographs” (in my opinion the econometricians really own the interface between theoretical statistics and hard-nosed applications).

    Related posts:

    1. Exciting Technique #1: The “R” language.
    2. A Quick Appreciation of the Sharpe Ratio
    3. Paper on stock trading

    What does the Market Think? http://www.win-vector.com/blog/2009/03/what-does-the-market-think/ http://www.win-vector.com/blog/2009/03/what-does-the-market-think/#comments Wed, 18 Mar 2009 18:23:43 +0000 John Mount http://www.win-vector.com/blog/?p=71
    What does the market think about IBM’s proposed acquisition of Sun?
    Given the differences in size between the two companies it is definitely a case of “IBM + Sun = IBM.” Also, one might think that IBM being down over 2% in price (1:30pm Eastern March 18 2009: mid day after the news got out) on a neutral day (Dow down 0.5%, NASDAQ and S&P 500 up) is a strong vote against the merger. A more careful analysis shows that the market has not really expressed a strong opinion yet.

    A lot of ink is spilled about the “information markets” but a lot of writers ignore just how much of the pricing of markets is due to information-free arbitrage and represents how markets work (and not information). For example the over 2% price decline in IBM actually tells us almost nothing- it is to be expected.

    The quantity to look at is not price, but the total market value of IBM plus Sun.

    If the market is in equilibrium Sun’s price should increase by an amount equal to the premium IBM is thought to be willing to pay for the stock times the perceived probability of the deal going through. Sun right now is $8.72 (up from $4.97) so it is safe to assume IBM is offering nearly a 100% premium on Sun stock and the market is fairly certain the deal will go through. So Sun’s total market cap (price per share times total number of shares) rose from $3.7 billion to $6.5 billion (or a net increase of $2.8 billion).

    IBM’s price slid from $92.91 to $90.83; this means that its market cap fell from $124.7 billion to $121.9 billion, a loss of around $2.8 billion. That is almost identical to the increase that priced Sun up 75%.

    So the market is pricing IBM + Sun today very, very close to the sum of yesterday’s prices. I would interpret this as yielding no information other than the fact that the market feels IBM will pay a substantial premium for Sun (and that the deal is likely to go through, yielding the 75% single-day price increase in Sun). The fact that the sum of market caps is so well preserved indicates that no big player has yet started trading on a strong opinion about whether the deal is good or bad.
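
    The bookkeeping above is simple enough to check directly. The sketch uses only the prices and market caps quoted in this article (rounded to the precision given), so small rounding differences are to be expected.

```python
# Figures quoted in the text; market caps are in billions of dollars.
sun_price_before, sun_price_after = 4.97, 8.72
sun_cap_before, sun_cap_after = 3.7, 6.5
ibm_cap_before, ibm_cap_after = 124.7, 121.9

sun_gain = sun_cap_after - sun_cap_before
ibm_loss = ibm_cap_before - ibm_cap_after
combined_before = sun_cap_before + ibm_cap_before
combined_after = sun_cap_after + ibm_cap_after
sun_percent_rise = (sun_price_after / sun_price_before - 1.0) * 100.0

print(f"Sun gained {sun_gain:.1f}B, IBM lost {ibm_loss:.1f}B")
print(f"IBM + Sun: {combined_before:.1f}B before, {combined_after:.1f}B after")
print(f"Sun price rise: {sun_percent_rise:.0f}%")
```

    The combined value barely moves, which is the whole argument: the price changes are a transfer between the two stocks, not a verdict on the deal.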

    All of the above arguments are “arbitrage-like.” The idea is if the market mis-priced the sum of IBM plus Sun then an informed trader could profit by taking an informed contrary position (it is not true arbitrage because the trader would have to take a risky position for a period of time) and waiting until some time after the merger finishes (or fails) to take a profit. Of course, as is now painfully obvious, all such arguments are irrelevant if the market locks up. Without a fluid market there is no reason for complicated combinations of investments to price in lock step with each other.

    Related posts:

    1. It is not all the quants’ fault.
    2. Is Search Advertising a Market for Lemons?
    3. Something I don’t get about business and bailouts

    It is not all the quants’ fault. http://www.win-vector.com/blog/2009/03/it-is-not-all-the-quants-fault/ http://www.win-vector.com/blog/2009/03/it-is-not-all-the-quants-fault/#comments Thu, 05 Mar 2009 20:08:01 +0000 John Mount http://www.win-vector.com/blog/?p=51
    There is plenty of blame to go around from the current global financial crisis. But, I would like to point out that it is not “all the quants’ fault.” We are all now, unfortunately, sitting in the middle of a high quality (and extremely expensive) lesson in financial mathematics. I would hate for some of the truly important points to be lost to paying too much attention to some of the shiny details.

    One fascinating article ( Recipe for Disaster: The Formula That Killed Wall Street by Felix Salmon, Wired, February 2009) has so popularized assigning blame to one formula (and one mathematician) that posting the image of a formerly obscure statistical formula (called a “copula”) is now considered good for a laugh.


    copula.gif

    A copula formula.

    However, the original mathematical paper being castigated (“On Default Correlation: A Copula Function Approach” by David X Li, Risk Metrics (2000)) is in fact good work. What is wrong is not the formula but the over-reliance on the formula. If we place all the blame on “copulas” we will be too ready to repeat the current disaster with some other “better” model.

    We need to think more like Michael Lewis and use specific failures as miniature laboratories to learn larger lessons. A great example is his write-up of the Iceland financial collapse ( Wall Street on the Tundra by Michael Lewis, Vanity Fair, April 2009 ) which, if you read carefully, contains a general indictment of speculative greed and getting rich by pushing around bits of paper (instead of pursuing activities that create actual value).

    So back to the copulas: what is to be learned (now at great expense) there? I would like to work through some of the important points of Dr. Li’s paper and explain some of the painful points in our current lesson. In my opinion none of the flaws are mathematical (or in the paper)- the flaws are all deep defects in logic and reason (and found in the later behavior of traders).

    The main purpose of Dr. Li’s paper was to figure out how to price a newer and more complicated financial instrument (the credit default swap) in terms of older underlying instruments (mortgages). In addition to developing the necessary mathematics the paper contains several clever ideas based on the logic of reasonable markets. As the markets became very large and very unreasonable the logic no longer applied. That is what went wrong.

    Credit default swaps (in their simplest form) essentially started as insurance policies against mortgages defaulting. Unfortunately, credit default swaps were unregulated financial instruments instead of regulated insurance policies. Credit default swaps degenerated into “bets” (or derivative securities) when they were separated from the underlying mortgages. You could, in essence, buy or sell insurance on your neighbor defaulting on their mortgage.

    The legitimate use of credit default swaps would be to set up a market for insurance and re-insurance. If you borrowed money to buy a house a bank might buy a credit default swap to partially insure against you defaulting on the loan. However, the market went somewhat insane. Since everything was measured in dollars and probabilities (instead of specific contracts and records) a bank that had a million dollars of exposure from lending you money (to buy your mansion) would end up buying insurance on your neighbor’s mansion (which they did not finance) from somebody with no real ability to pay off in the event of default. From a pure balance-sheet point of view the numbers “made sense”: a bank with a million dollars of exposure bought the appropriate amount of insurance. From a business point of view: they purchased insurance on the wrong property and purchased it from somebody they should not be doing business with.

    So credit default swaps eventually made no sense as insurance policies. How did they fare as financial instruments? Even if credit default swaps made no sense for the institutions originating (creating) them there was a market trading them. So, ignoring what they were: if you could buy them when they were cheap and sell them when they were expensive you could make money. This is where Dr. Li’s paper comes in: he figured out how to estimate the underlying theoretical value of a credit default swap. With this knowledge you would know when the market price for a credit default swap was cheap (the trading price would be below the theoretical price) or expensive (the trading price would be above the theoretical price). Traders could make more money.

    And this is where things went very very wrong. With more profit there were more traders. With more traders there was a larger market to accept credit default swaps. Since there were no rules anybody could originate (create) them. In particular there was no rule that said there could not be more credit default swaps than underlying mortgages. And this is where the insanity of the market no longer matched the reasonable logic of Dr. Li’s original paper.

    The idea of assigning a theoretical value to items using information from another market depends critically on two financial concepts. The first one is well known and is called “price taker.” The second one is more obscure and I will call it “information taker.” Due to extreme scale both reasonable assumptions became false.

    A “price taker” is a trader in the market that is small enough that the trader does not radically change prices. This is the opposite of a “price maker,” a trader whose activity is so great that they essentially drive prices. The assumption was that the credit default market would be a “price taker” with respect to other markets. The theory was that you could disassemble a credit default swap into some mortgages, some interest bearing annuities and some other pieces. You could then get the prices for all of these components from other markets and know if the credit default swap was cheap or expensive relative to the current price of its constituent parts. This works for a single credit default swap. But what happens if you needed to take apart a larger number of them at once? That might require acquiring more mortgages than actually exist. Attempting to acquire or dump the components would have a huge price-making impact on all of the other markets. The idea that the credit default swap should price at the current price of its components falls apart; the very attempt to disassemble them would re-price the other markets. Even worse: the markets could “lock up” and stop trading (if for example you dumped so many mortgages into the markets that nobody wanted to buy any at any price).

    What I call an “information taker” is a newer idea. One of the clever steps in Dr. Li’s paper is that some of the unknown quantities needed by the theoretical model for credit default swaps could be estimated from the market pricing of mortgages. For example: an estimate of future mortgage default rates is one component needed to correctly price credit default swaps. One way to estimate future mortgage default rates is to learn a lot about actual mortgage holders, learn a lot about macro economics and try to predict future default rates in a number of plausible future scenarios. This is expensive and it is by no means certain (since you really cannot predict the future). Another way to estimate future mortgage default rates is to examine the “credit spread” or difference in market pricing of mortgages as compared to less risky securities. If these other markets are working correctly (or “in equilibrium”) you can infer the future default rates from the pricing. This idea works, until too many people use it. If everybody else in the market is performing expensive research on future default rates then the pricing of mortgages (relative to other less risky assets) will necessarily give you the information needed to solve for your model’s unknowns. However, once everybody is an “information taker” (using market pricing to try to estimate unmeasured fundamentals) the market is just one big “echo chamber” with no actual data being injected. You can no longer correctly estimate parameters from the market because there are no informed players to steal from. Even worse if those markets go out of equilibrium, lock up or stop trading you don’t even hear echoes: you become completely deaf.
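
    To show what “inferring the default rate from the pricing” can look like, here is a deliberately crude one-period sketch. It is not the model in Dr. Li’s paper; it assumes a risky loan that pays 1 + y if it survives and a fraction R of that on default, and that the market prices it so the expected payoff equals the risk-free payoff 1 + r. The function names and numbers are hypothetical.

```python
def implied_default_probability(risky_yield, risk_free_rate, recovery=0.0):
    """Back out the default probability p from a one-period credit spread.

    Solves (1 + y) * (1 - p * (1 - R)) = 1 + r for p.
    """
    if not 0.0 <= recovery < 1.0:
        raise ValueError("recovery must be in [0, 1)")
    spread = risky_yield - risk_free_rate
    return spread / ((1.0 + risky_yield) * (1.0 - recovery))


def fair_risky_yield(default_probability, risk_free_rate, recovery=0.0):
    """Forward direction: the yield an informed lender would demand."""
    expected_fraction_paid = 1.0 - default_probability * (1.0 - recovery)
    return (1.0 + risk_free_rate) / expected_fraction_paid - 1.0


# An informed lender prices in a 5% default rate (made-up numbers) ...
y = fair_risky_yield(default_probability=0.05, risk_free_rate=0.03, recovery=0.4)
# ... and an "information taker" recovers it from the quoted yield alone.
print(implied_default_probability(y, risk_free_rate=0.03, recovery=0.4))
```

    The second function only returns something meaningful if somebody actually ran the first one with real research behind it; if every yield in the market was itself set by information takers, the inversion just plays back an echo.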

    These simple flaws in reasoning (in addition to bubble-driven greed) are behind the current global disaster. We need to protect ourselves from all of these fundamental causes (which will occur again and again), not vilify some formerly obscure financial mathematics (which will never appear in the same skin twice).

    Related posts:

    1. What does the Market Think?
    2. A Quick Appreciation of the Sharpe Ratio
    3. Something I don’t get about business and bailouts

    Volunteers in Large Clubs: The Theorist’s View http://www.win-vector.com/blog/2009/02/volunteers-in-large-clubs-the-theorists-view/ http://www.win-vector.com/blog/2009/02/volunteers-in-large-clubs-the-theorists-view/#comments Thu, 26 Feb 2009 23:43:56 +0000 John Mount http://www.win-vector.com/blog/?p=48
    I have just posted a new write-up: Volunteers in Large Clubs: The Theorist’s View. This paper describes some interesting issues in organizing volunteers in a large club and tries to show (without math) how a theoretical computer scientist attacks such problems. This is just a quick project (more of a start than a finish). However since so much client work is confidential I like to put out something that shows a bit of the workflow and thought pattern found at Win-Vector.

    Related posts:

    1. Sorting Used in Anger
    2. What does the Market Think?
    3. Exciting Technique #1: The “R” language.

    Map Reduce: A Good Idea http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/ http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/#comments Sun, 25 Jan 2009 20:32:20 +0000 John Mount http://www.win-vector.com/blog/?p=30
    Some time ago I subscribed to The Database Column because it would be fun to see what these incredible people wanted to discuss. We owe much of our current database technology to Professor Stonebraker and Vertica sounds like an incredible product. And I definitely want to continue to subscribe.

    However, the reading experience is marred by some flaw in their RSS system that keeps marking the article “MapReduce: A major step backwards” as a new article. This causes the article to appear in my RSS reader every few weeks as “new.” This wouldn’t bother me too much except that the article runs so counter to experience that it is itself offensive.

    I have no desire to defend Google (the home of MapReduce)- they don’t need it and are clearly laughing all the way to the bank. However the points used to kick at MapReduce are so broad and so devalue practitioner experience that they are insulting. I find the individual arguments offensive and wish to stand against them. I am not that concerned about the conclusion, use MapReduce or don’t. For some things MapReduce is a good tool and for some things it is not.

    Let’s limit ourselves to the 5 primary complaints from the article. The article (verbatim) says MapReduce is:

    1. A giant step backward in the programming paradigm for large-scale data intensive applications.

    2. A sub-optimal implementation, in that it uses brute force instead of indexing.

    3. Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago.

    4. Missing most of the features that are routinely included in current DBMS.

    5. Incompatible with all of the tools DBMS users have come to depend on.

    Now let us comment:

    1. “A giant step backward in the programming paradigm for large-scale data intensive applications.”

    Actually, no.

    MapReduce represents a continuity in a stream of ideas that made UNIX great: composable transient tools. Not everything is a database or data warehouse. A lot of the grungy UNIX tools (like sort, sed, awk, join) have often been combined to do large scale (at the time) research because they all worked “out of core” fairly well. This makes for a horrible baling-wire set-up. However, it often handles problems of a size much larger than would otherwise have been possible on the hardware at the time.

    In addition the author trots out the “it’s Codasyl all over again” argument. This argument refers to the ongoing pain and expense derived from binding algorithmic details too close to the data representation. In earlier writing it was a fantastic point that warned that the up and coming object oriented databases were going to be the same nasty pointer chasing nightmares that hierarchical databases had been. I can see why an author might feel that just saying “it’s Codasyl” could win any argument.

    2. “A sub-optimal implementation, in that it uses brute force instead of indexing.”

    MapReduce does not use brute force.

    MapReduce uses the idea (one that goes back to merge sort) that parallel traversals (that is: running through two lists in the same order synchronously) are a very powerful technique that can, among other things, produce indices. MapReduce is so efficient that it has been shown to be competitive with the best large scale sorting algorithms on their home-turf: sorting.

    MapReduce looks brutish because it drops a lot of popular design features. One such feature is trying to speed things up through local caching and combining. However, on the data that MapReduce is commonly used on (free form written text) it is a provable property of the data that local caching is an ineffective complication (due to the heavy-tailedness of the data). So many of the graceful features missing from MapReduce are actually no help on the types of data it is used on. There is a certain grace in leaving only the features that are actually helping.
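
    The parallel-traversal idea from merge sort mentioned above can be sketched in a few lines. This is only an illustration, not code from the MapReduce paper: the function name `merge_join` and the sample data are made up, and it assumes both inputs are already sorted by key.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key.

    Each list is walked once, in the same order, so neither side needs
    an index or has to fit in a lookup table.
    """
    i, j = 0, 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # key only on the left side, skip it
        elif lk > rk:
            j += 1          # key only on the right side, skip it
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing for this key, then advance both sides.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


# Hypothetical sample data: which documents contain a word, and a per-word count.
docs = [("apple", "doc1"), ("apple", "doc7"), ("pear", "doc2")]
counts = [("apple", 2), ("kiwi", 5), ("pear", 1)]
joined = merge_join(docs, counts)
```

    The point of the sketch is that the join never looks anything up: it only moves forward through two sorted streams, which is why the approach still works when the data is far larger than memory.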

    3. “Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago.”

    A nasty attack.

    MapReduce is a good explanation of some merging techniques that have been common knowledge for quite a while. This is a legitimate expository goal: explaining something everybody already “knows” better. In fact this is very hard to do and considered a legitimate accomplishment in many fields (for example Rota stated it was a legitimate goal in mathematics). I myself looked at some of my own older code for dealing with very large data sets after reading the MapReduce paper. I saw that the paper was describing what I was already doing (splitting the data into streams for later re-joining) and explaining it so well that it was now a method and no longer a hack. When a paper successfully teaches you about something you already “know” it is a good work.

    The attack is also inaccurate: the ideas are not 25 years old, they are closer to 120 years old.
    We could easily trace the lineage of MapReduce back to Hollerith style sorting machines that pre-date general purpose computers (i.e. going back to before 1889). MapReduce refers back to a time when all computation was performed by what we now call external sorting and tabulation. These 19th century technologies may seem archaic but they were developed in a world similar to ours: a world where the amount of data is in excess of your conveniently reconfigurable computational resources.

    4. “Missing most of the features that are routinely included in current DBMS.”

    Unfortunate.

    I miss a lot of those features.

    However, because MapReduce is such a lean technique I have seen engineers implement their own MapReduce systems in a day (to solve a problem they are working on). That is, they are successfully sorting, joining, indexing and summarizing hundreds of gigabytes of data on a consumer PC within a couple of days of being asked to. This is from scratch after reading the MapReduce paper.
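
    To give a feel for how lean the core idea is, here is a toy, in-memory sketch of the map / sort / reduce pattern. It is not the system described in the MapReduce paper and not anybody's production code: the names `map_reduce`, `word_mapper` and `count_reducer` are invented for illustration, and a real implementation would replace the in-memory sort with an external (on-disk) sort.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, mapper, reducer):
    # Map: each record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle: sorting by key brings equal keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: one pass over the sorted stream, one reducer call per key.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def word_mapper(line):
    for word in line.lower().split():
        yield (word, 1)


def count_reducer(word, ones):
    return (word, sum(ones))


lines = ["the cat sat", "the cat ran", "a dog sat"]
word_counts = map_reduce(lines, word_mapper, count_reducer)
```

    Everything interesting in a real system (splitting work across machines, spilling to disk, recovering from failures) sits around this skeleton, but the skeleton itself is small enough to write in an afternoon.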

    The “make versus buy” decision should not always come out “make.” But it is not wise to artificially bloat up requirements so that the decision can only be “buy.”

    5. “Incompatible with all of the tools DBMS users have come to depend on.”

    Good.

    Frankly for a lot of analytic practitioners many DBMS systems and tools have become expensive obstacles in the way of getting results. Yes, we enjoy humiliating an interview candidate that does not know all of the Codd normal forms (or can’t remember which of the alphabet soups of OLTP or OLAP is the “good one”) as much as the next person. But to many of us a lot of these tools and procedures are more obstacles than solutions.

    This may sound nasty, but if it were not the case why would companies like Vertica be producing radical new database tools? The fact is existing DBMS tools were designed for a different type and scale of data than we regularly see on the web (and column oriented database designers seem to share this view). The situation is so bad that “roach motel” is a common analyst’s slang for “data warehouse” (derived from: “data checks in but it never checks out”).

    This isn’t meant to be a hagiography of MapReduce, but given that MapReduce has paid the bills I feel it deserves a small show of respect along the lines of “dance with the one who brung you.”

    MapReduce is not a panacea. One of the tasks I have hated most in my career was maintaining a seven step MapReduce based system. I would love to have avoided all the detail fiddling that set-up required. However, the system paid our bills by performing a calculation that was beyond the scale of simpler methods and it would have been unaffordable to buy a solution.

    Related posts:

    1. Sorting Used in Anger
    2. Exciting Technique #1: The “R” language.
    3. I know, I am the one being a jerk

    ]]>
    http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/feed/ 0
    Exciting Technique #1: The “R” language. http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/ http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/#comments Thu, 22 Jan 2009 19:59:01 +0000 John Mount http://www.win-vector.com/blog/?p=26
  • The Data Enrichment Method
  • New Paper
  • Programs reduced to statistics
  • ]]>
    Our first “exciting technique” article is about a statistical language called “R.”

    R is a language for statistical analysis available from http://cran.r-project.org/ . The things you can immediately do with it are incredible. You can import a spreadsheet and immediately spot relationships, trends and anomalies. R gives you instant access to top notch visualization methods and sophisticated statistical methods.

    R is so hot (a strange thing to say about a statistics package) that it was the subject of a recent New York Times article: http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html . If you read between the lines some of the interviewees come off as being slightly threatened by R (there is a slight hint of “R is very good for others”). In fact R is simply very good. A good statistician with R can do things that a great statistician without R cannot. Like all tools R is dangerous: ask for the wrong analysis and you will draw wrong and misleading conclusions. Ask for the right analysis and R will correctly perform it while tracking critical implementation details that would take you hundreds of hours to master on your own.

    Want to produce graphs using the theories of perception and analysis of W. S. Cleveland? Simple- use Deepayan Sarkar’s “Lattice” package, which even has a wonderful book.

    Want to find subtle relationships in your data using logistic regression (one of the more complicated cousins of linear regression)? That is built into the base R system.

    Need to re-run all of your analyses because the data has changed? R is script based and stores your command history. A single paste can re-run a 20 step analysis and re-build a 10 slide presentation.

    Impressed by a particular type of analysis? Take, for example, Roger Koenker’s “Quantile Regression” (which is a brilliant idea backed by a masterpiece of a book). Guess what, the original author has supplied a free R-module that implements the ideas.

    Want to give a client working software? Easy, R is open source and comes with very good automated installers for OSX, Linux and Windows.

    Want to train somebody to use R? Easy, R has an extensive library of excellent books and there is even an exciting set of books with a series title “Use R!”

    Want to learn the internals of R from John M. Chambers (one of the inventors of the “S” language that R is an implementation of)? You are in luck: the latest book by Chambers is “Software for Data Analysis, Programming with R.” R is so popular that it has managed to pull one of the creators of the S language and the proprietary S+ implementation into its world.

    It is almost getting to the point where you need to justify not using R.

    Related posts:

    1. The Data Enrichment Method
    2. New Paper
    3. Programs reduced to statistics

    ]]>
    http://www.win-vector.com/blog/2009/01/exciting-technique-1-the-r-language/feed/ 1
    New “exciting techniques” series of articles. http://www.win-vector.com/blog/2009/01/new-exciting-techniques-series-of-articles/ http://www.win-vector.com/blog/2009/01/new-exciting-techniques-series-of-articles/#comments Thu, 22 Jan 2009 19:46:37 +0000 John Mount http://www.win-vector.com/blog/?p=25
  • Exciting Technique #1: The “R” language.
  • Betting Best-Of Series
  • I know, I am the one being a jerk
  • ]]>
    I am starting a new “exciting techniques” series of articles on the Win-Vector blog. The primary purpose of the Win-Vector blog remains identifying and describing needs, but I am starting a new sub-series of articles about techniques.

    Occasionally each of us is asked “what are some things you are excited about?” It is an exciting question, which I never really feel free to actually answer. The reason is that this is usually asked within an obvious context. It would be naive not to recognize that the question is usually really about something specific. In my case the context is usually data handling and storage (an important platform that my work stands on). Usually I give some weak answer about the quick utility of MapReduce or the promise of column oriented or streaming databases.

    Most of the things I am deeply excited about (limiting myself to technology to avoid issues like family, charity work or politics) are techniques not products. I would like to give myself the opportunity to mention some of them here.

    I am going to be a bit broad in my interpretation of the word “technology.” One dictionary definition of technology is: “the application of scientific knowledge for practical purposes.” I am going to emphasize the “knowledge” portion (ideas, techniques) as being the true core of technology and ignore the artifacts (like databases, web servers or Macbook Pros). This is a matter of taste; I find the ideas much more exciting than the artifacts.

    To write about some of these exciting things I am going to try to split the articles on the Win-Vector blog into some more categories. The main stream of articles will be articles about applications. Even identifying how different industries can use mathematical and statistical methods is a very big and a very important task. The most important problem remains correctly identifying needs.

    However, I am also very interested in writing about the techniques. Unfortunately, articles about techniques are a bit more esoteric and may not be as useful to my intended audience as application articles. So I will tag these articles as “techniques” to try and segregate them a bit.

    Related posts:

    1. Exciting Technique #1: The “R” language.
    2. Betting Best-Of Series
    3. I know, I am the one being a jerk

    ]]>
    http://www.win-vector.com/blog/2009/01/new-exciting-techniques-series-of-articles/feed/ 0
    The Purpose of this Blog http://www.win-vector.com/blog/2008/11/the-purpose-of-this-blog/ http://www.win-vector.com/blog/2008/11/the-purpose-of-this-blog/#comments Wed, 05 Nov 2008 19:40:18 +0000 John Mount http://www.win-vector.com/blog/?p=24
  • New Paper
  • Programs reduced to statistics
  • How Market Designs Set Prices
  • ]]>
    The purpose of this blog (which is not quite “blog like” in its promise of a once a month longish technical article) is to educate, share the Win-Vector principles and learn more about writing (through practice).

    I am a big fan of “understanding through writing” (you learn through trying to explain). The difficulty in technical writing is always balancing the incompatible competing needs for conciseness, clarity, correctness and utility. There is a next-level of writing and understanding (that I am not at, but I am becoming more able to recognize) where these things synergize instead of compete. This post will close with such an example from Edsger Dijkstra (in its entirety):

    Elegance is not a dispensable luxury but a factor that decides between success and failure.

    This covers so much of what I am trying to say.

    (And thank you to Peteris Krumins for blogging on this)

    Related posts:

    1. New Paper
    2. Programs reduced to statistics
    3. How Market Designs Set Prices

    ]]>
    http://www.win-vector.com/blog/2008/11/the-purpose-of-this-blog/feed/ 0
    Something I don’t get about business and bailouts http://www.win-vector.com/blog/2008/10/something-i-dont-get-about-business-and-bailouts/ http://www.win-vector.com/blog/2008/10/something-i-dont-get-about-business-and-bailouts/#comments Wed, 01 Oct 2008 20:11:25 +0000 John Mount http://www.win-vector.com/blog/?p=23
  • It is not all the quants’ fault.
  • A Quick Appreciation of the Sharpe Ratio
  • The Joy of Calculation
  • ]]>
    I don’t really know what the right answer to the $700 Billion Dollar Bailout Question is (I have not read the bill, and I wonder if the bill really describes what would happen). But the whole situation does remind me of a related question: is it really the end of the world if the “credit markets freeze?” It is a disaster if the equity markets tank for a period of longer than a year or so (prevents people from retiring and so on)- but I am not sure if all of the consequences we are being told really follow.

    If the reason to bail-out Wall Street is to “protect retirement savings” why not just dump the $700 Billion into Social Security and make Social Security a needs based program?

    Here is one story we are being told. Without the bailout the credit markets freeze. If the credit markets freeze then “main street” is hurt because businesses can not borrow money. This causes actual economic damage and everybody is really poorer.

    I understand the credit markets increase the money supply and generally promote growth. But it might be okay if they were temporarily unavailable. My experience in trading is that when you buy or sell in a finance market, most of the time the counter-party in a trade is another hedge fund or speculator- not somebody actually doing anything productive with the money. So most of the transactions lost to a freeze really helped nobody you would care about.

    Also, why should businesses always need credit? Many respected businesses (Apple, Google, Microsoft) have tens of billions in cash reserves and likely do not need credit. Others (Oracle) have huge cash positions and debt positions at the same time. I have a feeling that a credit crunch will punish over-extended businesses that have weird balance sheets more than businesses that run on a sound financial basis.

    If you think all businesses have a “rocket science” accounting ability (they produce reports that nobody else can understand and we just balance our check books), consider this. I have regularly seen offers of early payment (net-10 days instead of net-20 days) if you accept a 10% cut in your bill. Now if you think of this as a favor you might accept it (a supplier should always be willing to offer 10% off to make a client happier). If you think of this as a payday loan (which is what it is) the supplier would be paying an interest rate that compounds to 569% annually just to get their money 20 days earlier. If this is what is considered “financial engineering” we do not need it in this world.
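
    The payday-loan reading of an early-payment discount is easy to check with a few lines of arithmetic. The sketch below is only one way to do the compounding: the function name `annualized_rate`, the 365-day year, and the choice to compound once per 20-day period are assumptions, so the result will not match the figure quoted above to the digit, but it lands in the same several-hundred-percent range.

```python
def annualized_rate(discount, days_early, days_per_year=365):
    """Effective annual rate paid by a supplier who gives up `discount`
    of the bill in order to be paid `days_early` days sooner."""
    # Accepting 0.90 now instead of 1.00 later is growth of 1/0.90 per period.
    growth_per_period = 1.0 / (1.0 - discount)
    periods_per_year = days_per_year / days_early
    return growth_per_period ** periods_per_year - 1.0


rate = annualized_rate(discount=0.10, days_early=20)
print(f"effective annual rate: {rate:.0%}")
```

    Whatever day-count convention you pick, the conclusion is the same: a 10% discount for a few weeks of earlier payment is an extremely expensive way to borrow.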

    Related posts:

    1. It is not all the quants’ fault.
    2. A Quick Appreciation of the Sharpe Ratio
    3. The Joy of Calculation

    ]]>
    http://www.win-vector.com/blog/2008/10/something-i-dont-get-about-business-and-bailouts/feed/ 0
    A Quick Appreciation of the Sharpe Ratio http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/ http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/#comments Wed, 01 Oct 2008 03:15:07 +0000 John Mount http://www.win-vector.com/blog/?p=22
  • How Market Designs Set Prices
  • What does the Market Think?
  • It is not all the quants’ fault.
  • ]]>
    The current state of the global financial markets has gotten more people than usual worrying about the technical aspects of finance. One method for reasoning about investment returns and risk is a tool called the Sharpe Ratio. It is well worth reviewing this measure and seeing how, if used properly, it doesn’t favor any of the mistakes that underlie our current financial crisis.

    The Sharpe ratio is a famous measure of “risk adjusted return” and is defined as “the ratio of the expected excess return from an investment divided by standard deviation of the excess return.” It is most easily demonstrated by an example (which we work in pieces).

    If an investment is expected to generate a profit of 15% in the next year and an insured bank account would generate 10% profit then the expected excess return is 15% – 10% = 5%. A rational investor would never take a risky investment that did not have a positive excess return (else they would expect to make more money at a bank). “Expected” is a technical term which means the average return of the investment averaged over all possible outcomes (weighted by the odds of each outcome); we can explain this by working a couple of examples.

    Consider investment “A” which is a generally good idea that returns a 20% profit in half the possible years and a 10% profit in the other half of the years. Investment A has an expected return of 0.5*20% + 0.5*10% = 15%. Investment “A” has 15% – 10% = 5% excess return.

    Also consider another investment “B” which is a risky bet that returns 20% profit most years (around 95.8% of them) and goes bankrupt in the other years. The expected return of investment “B” is 0.958*20% + 0.042*(-100%) = 14.96%, or essentially 15%. Investment “B” has 15% – 10% = 5% excess return.

    As we can see “expectation” alone cannot really tell these two investments apart. That is why the second component of the Sharpe ratio is something called the standard deviation. The standard deviation is defined as the square-root of the expected squared deviation of the return from the target value of 15%. What we do is measure for each possible outcome how far off the return is from the target of 15%, multiply this number by itself (called squaring it), weight it by the likelihood of that outcome, and then take the square-root of the sum of all such values. Again, this is best explained by an example.

    Investment “A” has a standard deviation of:
    square-root( 0.5 * (20% – 15%)*(20% – 15%) + 0.5 * (10% – 15%)*(10% – 15%) ) = 5%

    And investment “B” has a standard deviation of:
    square-root( 0.958 *( 20% – 15%)*( 20% – 15%) + 0.042*(-100% – 15%)*(-100% – 15%) ) = 24%

    Just like in the calculation of expectation we are taking every possible situation and summing (weighted by the likelihood) our value of interest (in this case the squared variation).

    The standard deviation’s opinion is that investment “B” is about five times riskier than investment “A.” And this is the grace of the Sharpe ratio: it says that investment “A”’s value is (15% – 10%)/5% = 1 and “B”’s value is (15% – 10%)/24% = 0.2.
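
    The worked numbers above are easy to reproduce. The sketch below represents each investment as a list of (probability, return) pairs taken from the text; the function names are just illustrative. One small difference: it uses investment “B”’s exact expected return (14.96%) rather than the rounded 15%, so its Sharpe ratio comes out near, not exactly at, 0.2.

```python
from math import sqrt


def expected(outcomes):
    """Probability-weighted average of the returns."""
    return sum(p * r for p, r in outcomes)


def std_dev(outcomes):
    """Square-root of the probability-weighted squared deviation from the mean."""
    mean = expected(outcomes)
    return sqrt(sum(p * (r - mean) ** 2 for p, r in outcomes))


def sharpe(outcomes, risk_free):
    """Expected excess return divided by the standard deviation of excess return."""
    excess = [(p, r - risk_free) for p, r in outcomes]
    return expected(excess) / std_dev(excess)


RISK_FREE = 0.10                                  # the insured bank account
investment_a = [(0.5, 0.20), (0.5, 0.10)]         # 20% or 10%, equally likely
investment_b = [(0.958, 0.20), (0.042, -1.00)]    # 20% most years, else bankrupt

sharpe_a = sharpe(investment_a, RISK_FREE)
sharpe_b = sharpe(investment_b, RISK_FREE)
print(f"A: {sharpe_a:.2f}  B: {sharpe_b:.2f}")
```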

    An interesting feature of the Sharpe ratio is that, unlike Wall Street, it does not believe that leveraging increases profitability. A common desperation move is to take an investment that has a moderate return and borrow money to simulate larger returns by having larger exposure. For instance an investment that returns 15% can try to simulate a higher return by borrowing. If for every $1,000 invested we borrow another $1,000 to invest (paying the risk-free rate of 10% for the money) one can show an apparent rate of return of ($2,000*15% – $1,000*10%)/$1,000 or 20%. However, this is not free money- the investor is taking on twice as much risk for only a third more return (20% instead of 15%). In fact with sufficient leverage (three times, four times, thirty times) one can convert a safe investment into a risky investment that could even go bankrupt. The Sharpe ratio (by design) is not fooled by this sort of manipulation. Investing $1,000 in investment A has the exact same Sharpe ratio as investing $1,000 plus $1,000 more borrowed at the risk-free rate (this is part of the cleverness of using excess returns instead of un-adjusted returns).
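
    The claim that leverage leaves the Sharpe ratio unchanged can be checked directly. The sketch below is a minimal illustration under the text’s assumptions (borrowing at the 10% risk-free rate, investment “A” as the underlying asset); the helper name `leverage` is made up for this example.

```python
from math import sqrt


def sharpe(outcomes, risk_free):
    mean = sum(p * r for p, r in outcomes)
    std = sqrt(sum(p * (r - mean) ** 2 for p, r in outcomes))
    # Subtracting a constant risk-free rate does not change the spread,
    # so the standard deviation of the excess return equals `std`.
    return (mean - risk_free) / std


def leverage(outcomes, risk_free, factor):
    """Return per dollar of own money when `factor` dollars are invested
    and the other (factor - 1) dollars are borrowed at the risk-free rate."""
    return [(p, factor * r - (factor - 1) * risk_free) for p, r in outcomes]


RISK_FREE = 0.10
investment_a = [(0.5, 0.20), (0.5, 0.10)]
levered_a = leverage(investment_a, RISK_FREE, factor=2)

plain_sharpe = sharpe(investment_a, RISK_FREE)
levered_sharpe = sharpe(levered_a, RISK_FREE)
levered_return = sum(p * r for p, r in levered_a)
print(f"levered return {levered_return:.0%}, Sharpe {plain_sharpe:.2f} -> {levered_sharpe:.2f}")
```

    Leverage scales the excess return and the standard deviation by the same factor, so the ratio of the two does not move.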

    Unfortunately to use the Sharpe ratio you need good estimates of three things:

    1) The expected return of the investment.

    2) The risk-less rate available in the market (to compute excess).

    3) The standard deviation of the investment.

    All three of these facts are about the future, so we don’t really know any of them. The historic returns of an investment are not the same thing as the expected returns in the future, interest rates can change and the standard deviation is especially hard to estimate. However, if you have a model (or at least a theory) of what your investments are supposed to do then you can plug in estimates for these three quantities and use the Sharpe ratio to determine which investments really are best.

    If you knew how investment “A” worked and could estimate that it returned 20% about half the time and 10% the other times you could estimate its Sharpe ratio as 1. And if you knew investment “B” was a gamble that almost always paid off at 20% with a single rare event that causes bankruptcy you could estimate its Sharpe ratio as 0.2. Even if your estimates were inaccurate (say you estimate investment “A”’s Sharpe ratio is 0.7 and investment “B”’s Sharpe ratio as 0.3) the indication is to stay away from investment “B.”

    This is in stark contrast to the conclusion you would draw if you thought of these investments as a “black box” (like a fund of funds does) and looked only at their historic performance. If you looked at around 5 years of historic performance of both investments you would (incorrectly) think the following:

    Investment A looks kind of noisy: some years it returns 10% and some years it returns 20%. You would estimate (correctly) the return as averaging to 15% and you can even get a historic estimate of its standard deviation that is actually about right (5%).

    Investment B looks like easy money. With about 80% chance you would not have seen a bankruptcy, just 5 years of 20% returns. You would mis-estimate the return as being 20% (all you have ever seen) and further mis-estimate the standard deviation as 0%.
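
    The “about 80%” figure follows from treating each year as an independent draw with the 95.8% success rate given earlier for investment “B”. A two-line check (the variable names are just for illustration):

```python
P_GOOD_YEAR = 0.958   # chance investment "B" pays 20% in any one year
YEARS = 5             # length of the historic record being examined

# Chance that the 5-year record contains no bankruptcy at all.
p_clean_history = P_GOOD_YEAR ** YEARS
print(f"chance of a spotless {YEARS}-year record: {p_clean_history:.1%}")
```

    In other words, roughly four times out of five the historic record of a bet that eventually goes bankrupt shows nothing but steady 20% returns.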

    Based on historic data alone you would fire the manager of investment “A”, give the manager of investment “B” a huge bonus and invest all of your money. And a few years later you would go bankrupt.

    What is going on is very well explained by Nassim Nicholas Taleb as “the turkey paradox.” Domestic turkeys are all killed at about the exact same age (say 60 days). For somebody that understands commercial poultry farming there is not any mystery or uncertainty about it. 60 days before you want to sell a turkey carcass you buy a turkey chick. There is an inevitability and reverse causality- the desire for the turkey’s carcass funds and causes the turkey’s start of life 60 days earlier. Now if the turkey is a statistical empiricist (perhaps with a PhD in machine learning) things look good. The turkey sets up a model of each day having an unknown chance of being good or bad. The turkey figures that each day’s outcome is an independent trial drawn from this single unknown probability. The turkey collects evidence: every day it gets fed. Each day is more evidence that all days will be good. And then on day 60 the turkey gets a nasty surprise. The turkey’s life was a bad investment from day one, all of the “evidence” the turkey collects along the way was irrelevant because the model was wrong. And the model was wrong because the turkey guessed at the model instead of investigating the nature of poultry farming.

    Much is the same in many investments. There are investments that look like investment “B” when you open the hood. Many of them involve writing “out of the money options” and “default swaps.” These are essentially selling insurance on events that nobody thinks are likely. Selling insurance that usually is not used is profitable, until the insurance gets used. This is why insurance companies (if they are ethical) don’t treat the entirety of collected payments as profit- but as a stockpile that must be kept to pay the claims that will inevitably some day come true.

    It is important to point out the Sharpe ratio will give you incorrect results if you plug bad estimates into it. Overall the Sharpe ratio prefers good investments and diversification but it can be led astray. In fact that is the whole point: no amount of smart math will undo the inevitable consequences of wrong models that are used because “you need something you can solve” (like the turkey) or “everybody else is getting rich using them” (like investment “B”).

    Related posts:

    1. How Market Designs Set Prices
    2. What does the Market Think?
    3. It is not all the quants’ fault.

    ]]>
    http://www.win-vector.com/blog/2008/09/a-quick-appreciation-of-the-sharpe-ratio/feed/ 0
    What is Mathematics, Really? http://www.win-vector.com/blog/2008/08/what-is-mathematics-really/ http://www.win-vector.com/blog/2008/08/what-is-mathematics-really/#comments Mon, 11 Aug 2008 01:28:50 +0000 John Mount http://www.win-vector.com/blog/?p=21
  • Betting Best-Of Series
  • Map Reduce: A Good Idea
  • The Joy of Calculation
  • ]]>
    I recently had one of those “practitioner’s epiphanies” that I
    really feel captures the core of the issue and quickly explains a lot
    about mathematics.

    My current definition is:

    Mathematics is the minimal environment to preserve ideas.

    This time (due to the absolute horribleness of HTML) I have created a PDF containing my essay:
    What is Mathematics, Really? .

    I know PDF is not liked by many, but it really is a much better document container than HTML markup.

    Related posts:

    1. Betting Best-Of Series
    2. Map Reduce: A Good Idea
    3. The Joy of Calculation

    ]]>
    http://www.win-vector.com/blog/2008/08/what-is-mathematics-really/feed/ 0
    YAYGDA (Yet Another Yahoo Google Deal Article) http://www.win-vector.com/blog/2008/06/yaygda-yet-another-yahoo-google-deal-article/ http://www.win-vector.com/blog/2008/06/yaygda-yet-another-yahoo-google-deal-article/#comments Wed, 25 Jun 2008 16:03:52 +0000 John Mount http://www.win-vector.com/blog/?p=20
  • What does the Market Think?
  • Public Service Article: JSTOR and other Useful Research Archives
  • How Market Designs Set Prices
  • ]]>
    InformationWeek describes the current “Yahoo/Google deal” as being one that would “allow Yahoo to place Google ads on its site and collect the revenue.” But in reality it is a deal that will allow Google to sell Yahoo the rope to hang itself. To the theorist’s eye the deal looks like a doomsday machine designed along the lines of a simple game called a “stag hunt.”

    In “stag hunt” a number of hunters set out to cooperate and hunt (with a guaranteed result) a single stag together (and split the benefits). The twist in the game is that to succeed the hunters all must cooperate and if a single hunter fails to show up they will not catch the stag. The problem is that each hunter can individually hunt a hare and with certainty come away with a hare (even though the value of the hare is much less than the expected value of the hunter’s portion of the stag). This is a sad game in that it is indisputably the case that for each player the best outcome is when they all hunt the stag together, but most players will likely “defect” and hunt hares. This game is similar to the “prisoner’s dilemma” but worse in that players in stag hunt defect solely out of fear and not for additional selfish benefit (as found in the prisoner’s dilemma).
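
    A two-hunter version of the game makes the structure concrete. The payoff numbers below are illustrative assumptions (the text gives none); all that matters is that a share of the stag beats a hare, a hare is guaranteed, and hunting the stag alone yields nothing. The sketch lists the outcomes where neither hunter can do better by changing only their own move.

```python
from itertools import product

STAG, HARE = "stag", "hare"

# PAYOFF[(my_move, other_move)] -> my payoff (made-up illustrative values).
PAYOFF = {
    (STAG, STAG): 4,   # my share of the stag
    (STAG, HARE): 0,   # I hunted the stag alone and caught nothing
    (HARE, STAG): 1,   # a hare is certain, whatever the other hunter does
    (HARE, HARE): 1,
}


def is_equilibrium(a, b):
    """True if neither hunter gains by unilaterally switching moves."""
    a_ok = all(PAYOFF[(a, b)] >= PAYOFF[(alt, b)] for alt in (STAG, HARE))
    b_ok = all(PAYOFF[(b, a)] >= PAYOFF[(alt, a)] for alt in (STAG, HARE))
    return a_ok and b_ok


equilibria = [
    (a, b) for a, b in product((STAG, HARE), repeat=2) if is_equilibrium(a, b)
]
print(equilibria)
```

    Both “everyone hunts stag” and “everyone hunts hare” are stable. Unlike in the prisoner’s dilemma, a hunter who defects from the stag hunt gains nothing extra when the other cooperates; hunting hare is purely insurance against the other hunter failing to show up.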

    How does this relate to the Yahoo/Google deal? Google got a minimum commitment from Yahoo to serve $83 million worth of Google ads on the Yahoo portal. Google is one of the few entities for which $83 million is a pittance. However it is enough traffic to generate statistics that will make it obvious to each and every Yahoo executive that the Google ads are worth around 30% more than Yahoo self-served ads (the typical historic difference in the quality of matching of the two services). So every quarter each and every Yahoo division head can decide whether to “hunt stag” (route advertising into the Yahoo system and help collect the data and experience to eventually eliminate any Google premium) or “hunt hare” and route more of their division’s business to Google for a higher immediate revenue.

    Without this deal (and the intense scrutiny Yahoo is under) Yahoo would literally have all the time in the world to fix their advertising system. They are a very large company and their current system, though inferior, is profitable. So Yahoo can finance self-improvements indefinitely. It is unfortunate that Yahoo’s last few attempts at improvement (“Panama”) were not enough- but there was no reason this should have been the last attempt.

    With this deal Yahoo rapidly routes all of its revenue through Google’s system. Actually because “division performance” is a positional good it could happen very rapidly (even if a division executive has the moral strength to not take the Google profits, he or she will be out-competed by a sister division executive that does and gets promoted past them). Yahoo is rapidly reduced to a farmer selling land and machinery to (temporarily) feed their family.

    Once all of Yahoo’s revenues are routed through Google, Yahoo will be completely blind to how their revenue is derived (unable to even confirm they are getting their promised cut: see Comparing Apples and Oranges). Furthermore Yahoo will move from being a trove of valuable content to fulfilling Google’s definition of being a “link farm” (Google has never posted the definition of this undesirable designation, but empirically you seem to match it if you serve Google ads).

    Finally, Yahoo will be completely at Google’s mercy.

    To quote Vinton Cerf, VP and Chief Internet Evangelist at Google:

    “In the case of Yahoo, the company believes that it will be beneficial to assist Yahoo with its experiment.”

    ]]>
    http://www.win-vector.com/blog/2008/06/yaygda-yet-another-yahoo-google-deal-article/feed/ 0
    How Market Designs Set Prices http://www.win-vector.com/blog/2008/06/how-market-designs-set-prices/ http://www.win-vector.com/blog/2008/06/how-market-designs-set-prices/#comments Mon, 16 Jun 2008 21:53:52 +0000 John Mount http://www.win-vector.com/blog/?p=19
  • What does the Market Think?
  • Is Search Advertising a Market for Lemons?
  • I know, I am the one being a jerk
  • ]]>
    Hal Varian (Chief Economist, Google) recently shared a concise article with the title “How auctions set ad prices”. The article is a clear exposition of how ad prices determine the sorting order of bidders for online advertising. However, the tone of the article is not quite compatible with how it feels from the outside.

    Roughly, the first idea for ad bidding is to use a traditional auction and to sort bidders by price. That is, the highest bidder is given position one, the next highest bidder is given position two, and so on. There is some sophistication in that each bidder can be asked to pay (if their ad is clicked) a minimum increment more than the maximum bid of the position below them. This variation (often called “Vickrey”) makes the auction a little more dynamic in that it simulates each bidder lowering their bid to the minimum amount that would have yielded them the position they have been awarded.

    The second idea is to not sort by bids, but instead sort by bid times an estimate of click-through probability. This was a simple and brilliant idea. With this innovation bids are now sorted in terms of their “expected value.” This is because in “pay per click” advertising the ad-bidder pays only if their advertisement is clicked on. When we say the bids are now sorted in terms of “expected value” we mean that the bids are now sorted in terms of how much money the advertising supplier can expect to make. A low-quality ad that has no chance to be clicked on (and thus will generate no revenue) is no longer sorted ahead of slightly lesser bids from more lucrative bidders.
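
The first two ideas can be sketched in a few lines. The bidder names, bids, click probabilities and the one-cent increment are all made up for illustration, and the pricing rule is the simple "just enough to hold your position" reading of the description above, not anything taken from the article being discussed.

```python
def rank_ads(bids, click_prob=None):
    """Order bidders by bid, or by bid * click probability when probabilities are given."""
    def score(name):
        return bids[name] * (click_prob[name] if click_prob else 1.0)
    return sorted(bids, key=score, reverse=True)

def price_per_click(ranking, bids, click_prob, increment=0.01):
    """Smallest per-click price (plus an increment) that still holds each position."""
    prices = {}
    for pos, name in enumerate(ranking):
        if pos + 1 < len(ranking):
            below = ranking[pos + 1]
            needed = bids[below] * click_prob[below] / click_prob[name]
            prices[name] = round(needed + increment, 2)
        else:
            prices[name] = increment  # hypothetical floor for the last position
    return prices

# Made-up example: A bids the most but is almost never clicked.
bids = {"A": 1.00, "B": 0.90, "C": 0.50}
ctr = {"A": 0.002, "B": 0.05, "C": 0.04}

print(rank_ads(bids))       # ['A', 'B', 'C']  (first idea: sort by bid)
print(rank_ads(bids, ctr))  # ['B', 'C', 'A']  (second idea: sort by expected value)
print(price_per_click(rank_ads(bids, ctr), bids, ctr))
```

Sorting by expected value moves the rarely clicked high bidder from first place to last, which is the effect described above.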

    The third idea is where the explanation becomes difficult to follow. A second adjustment factor called “Quality Score” (meant to reflect the quality of the web site and page the advertisement is pointing to) is introduced. How is this quality score determined? Here is a quote from the original article:

    “Where does this Ad Quality Score come from? It was originally determined by historical click through rates but has been refined over the years using sophisticated statistical models.”

    The phrase “sophisticated statistical models” signals that the quality score cannot be reproduced or estimated by bidders (unlike click-through probability). So bidders do not really know what the quality score is. I admit I do not know what the quality score is, but I do know something one could (in theory) do with it.

    Suppose for a single ad-word you had two bidders named Cain and Abel who are both willing to bid $1.00 for their ad to be in the first position (a common situation). Further suppose that each bidder is willing to spend a fixed amount per day and stops bidding when their daily budget is exhausted (a common bidding policy). If Abel has a much smaller budget than Cain then Abel may run out of money early in the day and be removed from later auctions in the day. From that point on Cain can place ads in position one for the minimum bid (maybe $0.10) and a lot less money changes hands (than when Abel was in the auction). If, however, Abel was given a special quality score that just happened to be such that the ratio of Abel’s quality score to Cain’s quality score was around the same size as the ratio of Cain’s daily budget to Abel’s daily budget then things change radically. With this higher quality score (remember Cain’s budget is larger, so by our description Abel’s quality score is larger) Abel can compete with Cain’s ad with a deeply discounted bid. This additional bidding power is exactly what is needed to keep Abel in the auction for the whole day. This in turn is exactly what is needed to make Cain pay higher competitive prices (instead of the minimum bid) for the whole day. The additional money extracted from Cain can far exceed the discount given to Abel.
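
A toy simulation makes the arithmetic concrete. Everything here is an assumption chosen for illustration: one click is sold per auction, tied bidders take turns in position one, the winner pays just enough to match the other bidder's score (or the minimum bid when alone), and the budgets, the 200 auctions per day and the quality scores are invented. Amounts are in cents to keep the arithmetic exact.

```python
MIN_BID = 10  # cents; stands in for the "maybe $0.10" minimum bid

def run_day(bidders, auctions=200):
    """Return total cents paid by each bidder over one simulated day.

    bidders maps name -> {"bid": cents, "budget": cents, "quality": int}.
    The top score (bid * quality) wins each auction; tied bidders alternate.
    """
    paid = {name: 0 for name in bidders}
    for t in range(auctions):
        # A bidder drops out once the remaining budget cannot cover their bid.
        active = [n for n, b in bidders.items() if b["budget"] - paid[n] >= b["bid"]]
        if not active:
            break
        score = {n: bidders[n]["bid"] * bidders[n]["quality"] for n in active}
        best = max(score.values())
        tied = [n for n in active if score[n] == best]
        winner = tied[t % len(tied)]
        others = [score[n] for n in active if n != winner]
        if others:
            # Pay the smallest price whose score still matches the best rival.
            needed = -(-max(others) // bidders[winner]["quality"])  # ceiling division
            price = max(MIN_BID, needed)
        else:
            price = MIN_BID
        paid[winner] += price
    return paid

equal_quality = {
    "Cain": {"bid": 100, "budget": 10000, "quality": 1},
    "Abel": {"bid": 100, "budget": 1000, "quality": 1},
}
boosted_abel = {
    "Cain": {"bid": 100, "budget": 10000, "quality": 1},
    "Abel": {"bid": 10, "budget": 1000, "quality": 10},  # discounted bid, 10x quality
}

print(run_day(equal_quality))  # {'Cain': 2800, 'Abel': 1000}
print(run_day(boosted_abel))   # {'Cain': 10000, 'Abel': 1000}
```

In this sketch Abel spends the same $10 either way, but with the boosted quality score Abel stays in the auction all day and Cain's spend rises from $28 to $100.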

    At this point many bidders might wish for a more transparent “Quality Score.”

    A great technical article describing this kind of mathematics is: “AdWords and Generalized On-line Matching,” Aranyak Mehta, Amin Saberi, Umesh V. Vazirani and Vijay V. Vazirani (2006).

    ]]>
    http://www.win-vector.com/blog/2008/06/how-market-designs-set-prices/feed/ 1
    Betting Best-Of Series http://www.win-vector.com/blog/2008/05/betting-best-of-series/ http://www.win-vector.com/blog/2008/05/betting-best-of-series/#comments Wed, 28 May 2008 01:23:04 +0000 John Mount http://www.win-vector.com/blog/?p=18
  • What is Mathematics, Really?
  • A Quick Appreciation of the Sharpe Ratio
  • The Data Enrichment Method
  • ]]>
    Here is a new expository paper describing the mathematics involved in betting on something like the United States’ Major League Baseball World Series. It isn’t so much about baseball as about demonstrating some of the really great ideas from mathematical finance in a simplified setting. This sort of analysis is the “secret sauce” in a lot of financial models and I am trying to share the thrilling feeling of working with these techniques in an elementary essay (with diagrams).

    ]]>
    http://www.win-vector.com/blog/2008/05/betting-best-of-series/feed/ 0
    Is Search Advertising a Market for Lemons? http://www.win-vector.com/blog/2008/05/is-search-advertising-a-market-for-lemons/ http://www.win-vector.com/blog/2008/05/is-search-advertising-a-market-for-lemons/#comments Tue, 13 May 2008 16:29:59 +0000 John Mount http://www.win-vector.com/blog/?p=17
  • What does the Market Think?
  • How Market Designs Set Prices
  • YAYGDA (Yet Another Yahoo Google Deal Article)
  • ]]>
    author: John Mount, 5-13-2008

    Anand Rajaraman recently wrote a very thought-provoking entry on his Datawocky blog. He asks “Is Search Advertising a Giffen Good?” As he explains, a Giffen Good is a sort of economic doomsday machine in which some segment of consumers is forced to buy more of an inferior good as the price of that good goes up. His article is well written and really invites one to think about the issue. Anand’s question made me think about a number of issues (which I will outline here) and I will leave off with a question of my own.

    The classic example of a Giffen situation is when rice or noodles are sold to the poor. If the price of rice goes up this segment of consumers has no choice but to curtail their spending on more expensive legumes, vegetables and meat to put what remains of their spending power into the cheapest source of calories (which could remain rice, even though the price of rice increased). This isn’t really free choice, or stockpiling in anticipation of further price increases, but a simple grim economic trap. Giffen behaviors have long been suspected but not really documented with much quality until recently (see “Giffen Behavior: Theory and Evidence,” Robert T. Jensen and Nolan H. Miller, NBER Working Paper 13243 (2007)).
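
A stylized calculation shows how the trap works. The model and all of its numbers are assumptions made for illustration: the consumer has a fixed budget, must buy a fixed number of calorie units, gets one calorie unit from each unit of rice or meat, and buys as much meat as the budget allows once the calorie requirement is covered.

```python
def rice_demand(rice_price, meat_price=5.0, budget=30.0, calories_needed=10.0):
    """Units of rice bought by a consumer who must hit a calorie floor on a budget.

    Solves  rice + meat = calories_needed  and
            rice_price * rice + meat_price * meat = budget.
    """
    if rice_price * calories_needed > budget:
        raise ValueError("budget cannot cover the calorie requirement")
    if meat_price * calories_needed <= budget:
        return 0.0  # rich enough to live on meat alone
    return (meat_price * calories_needed - budget) / (meat_price - rice_price)

for price in (1.0, 2.0, 3.0):
    print(price, round(rice_demand(price), 2))
```

In this made-up setting rice purchases rise from 5 units at a price of 1 to 10 units at a price of 3: every increase in the rice price crowds out meat, and the lost calories have to be replaced with more rice.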

    It is hard to determine if advertisers are Giffen consumers. For one, marketing and advertising are “Positional Goods,” that is, goods that derive some of their value from ranking (like market share). Marketing and advertising also have large negative externalities. That is, every advertising dollar spent by Company A not only takes business away from Company B (the more famous zero-sum part of advertising) but also drives up unit costs of advertising for all advertisers (part of the negative externality). These sorts of goods can drive a lot of very strange (and counter-intuitive) market behaviors.

    The first strange market behavior is an unlimited ratchet effect. It is hard to pinpoint what portion of advertising really grows the market and what portion merely moves consumers from brand to brand (television advertising of cigarettes in the United States is an interesting example; see “The Effect of the 1971 Advertising Ban on Behavior in the Cigarette Industry,” Craig A. Gallet, Managerial and Decision Economics, Vol. 20, No. 6 (Sep., 1999), pp. 299-303). To the extent that advertising is not growing the market you just have dollars chasing each other. Society can experience a ratcheting effect where you move from a reasonable amount of advertising spend to a place, as described in Cory Doctorow’s “The Rebranding of Billy Bailey,” where so much is spent on advertising that people have to lease out advertising space on their own skin. That is, people cannot afford to buy goods at inflated prices unless they earn additional income by subjecting themselves to marketing campaigns, and prices are in turn high because so much is spent on marketing campaigns.

    This ratcheting effect is so strong that we see hints of game-theoretic situations every bit as strange as those described in Herman Kahn’s “On Thermonuclear War.”

    Returning to the United States cigarette example, we can speculate whether the 1970s ban on TV advertising was really an “arms control treaty” among cigarette manufacturers to decrease television spend (remember that under United States law it would be illegal collusion for competing companies to negotiate a spending cap among themselves). We also saw use of “credible threats” in the form of advertising deliberately spent in very inefficient channels (such as expensive golf sponsorships). Such spending is dollars deliberately wasted to demonstrate that one company could instantly move dollars into more effective channels (magazine ads, billboards, NASCAR) if any competing company “defected” and moved more of its dollars into effective channels. The companies themselves do not need to be incredibly clever or Machiavellian to come up with these strategies; the competition in the market can lead them into these behaviors.

    In on-line advertising “targeted ads” (that is, ads shown to people who have just typed in a search related to a product) are by far the most valuable. This is, of course, because these are often the people that are closest to making a purchase. But these are also the “zero sum” people: you are not growing the overall market when you advertise to them. So if you could get your competitors to agree not to advertise to them you would also be happy not to advertise to them (somebody would still make the sale and you would all save a lot of money).

    Now I will get to my question: is search advertising a market for lemons? A “market for lemons” is a market where goods are hard to examine so it is marginally profitable to try to get away with selling defective goods in the market. Usually such markets collapse as buyers cannot afford to pay fair value (as they know they will often get defective goods) and sellers stop placing any non-defective goods (as buyers are no longer able to offer fair prices). The name comes from the American slang for a defective car and the ideas (including an analysis of used car markets leading to the invention of “dealer certified” guarantees) eventually led to a Nobel Prize in Economics.

    We must realize that high spend in advertising is not always proof that there is high value in advertising. The dynamics of the market can cause high spend independent of true value. Right now we are seeing very high and increasing spend in search advertising. I argue that spending alone is not enough to determine the value of search advertising. We have seen an on-line advertising boom/bust cycle once before, back when everybody was trading traffic through affiliate networks (the “eyeballs and money” era of the Internet). Affiliate networks were definitely a market for lemons: full of aggregators that mixed premium traffic with low quality traffic and sold the aggregate for more than the sum of the parts. To avoid this we now market advertising impressions (often banners, priced as CPM) and advertising clicks (often driven by targeted text ads, priced as CPC). However both impressions and clicks are just traffic seen from the other side. Once you get clever with targeting, modeling and manipulating “click through rates” you see that each advertising click is in fact equivalent to some large (but predictable) amount of traffic.
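
The equivalence between clicks and traffic is simple arithmetic once a click-through rate is assumed. The $0.50 price per click and the 2% click-through rate below are invented for illustration.

```python
def impressions_per_click(click_through_rate):
    """Average number of impressions it takes to produce one click."""
    return 1.0 / click_through_rate

def effective_cpm(cost_per_click, click_through_rate):
    """Price per 1000 impressions implied by a per-click price."""
    return cost_per_click * click_through_rate * 1000

# Made-up example: a $0.50 click at a 2% click-through rate.
print(impressions_per_click(0.02))  # about 50 impressions per click
print(effective_cpm(0.50, 0.02))    # about $10 per 1000 impressions
```

Seen this way a CPC price is a CPM price scaled by the click-through rate, so whoever can predict (or move) the click-through rate can convert one kind of traffic into the other.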

    Given so many clever players the question becomes: does search advertising really remain a fundamentally different market than affiliate traffic?

    ]]>
    http://www.win-vector.com/blog/2008/05/is-search-advertising-a-market-for-lemons/feed/ 1