Win-Vector Blog: The Applied Theorist's Point of View
http://www.win-vector.com/blog
Wed, 18 Aug 2010 15:11:02 +0000

Statsmanship: Failure Through Analytics Sabotage
http://www.win-vector.com/blog/2010/08/statsmanship-failure-through-analytics-sabotage/
Mon, 16 Aug 2010 18:48:13 +0000, John Mount
  Related posts:
  • Deming, Wald and Boyd: cutting through the fog of analytics
  • Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’
  • Map Reduce: A Good Idea
    Ambitious analytics projects have a tangible risk of failure. Uncertainty breeds anxiety. There are known techniques to lower the uncertainty, guarantee failure and shift the blame onto others. We outline a few proven methods of analytics sabotage and their application. In honor of Stephen Potter we call this activity “statsmanship,” which we define as pursuing the goal of making your analytics group cry.


    Panthouse-klompen.jpg

    • Squander A/B testing bandwidth on bugfixes:

      A/B testing is the art of testing two or more variations of a product in parallel to try and directly detect or measure an important difference between them. The idea is to lock in any positive changes and back out any negative changes. An A/B platform needs to manage a lot of measurements to get sample sizes large enough to return reliable results. A typical result of measuring a series of 10 attempts to raise revenue per customer might look like the following:


      ab1.png

      Some of these changes are good and some are bad. However, if you don’t have the bandwidth to run the A/B tests you don’t know which are which, and you end up essentially forced to take all of the changes that “sounded good” (shown as the blue curve in the next chart). If you do have the measurements you back out the bad changes and keep accumulating the good ones (the violet curve in the next chart). The difference is dramatic.


      ab2.png

      As you can see, with enough useful experiments you can cherry pick a bunch of risky ideas into accumulated improvements.

      All of this depends on a stream of good ideas and having enough bandwidth (customers segregated into different treatment and control groups) to make all of these measurements as each of these variations is applied. A good way to lower the demand for new ideas is to clog up the A/B testing infrastructure.

      One way to clog up the A/B testing infrastructure is to reserve the testing and statistics infrastructure for documenting that the group is meeting its goals. For example, collect a lot of statistics on a necessary bug fix. A policy saying the impact of all bug fixes (even those you have no choice but to implement) must be quantified can easily eat up all of your A/B bandwidth without testing any new ideas. If asked why you are doing this, say it is to ensure that bug fixing is meeting its ROI targets.

    • Encourage the A/B testing framework to sting itself to death:

      The hardest thing to measure statistically is a non-effect. This is because a non-effect (that is, a change that does nothing) is identical to what statisticians call the “null hypothesis” (which is the hypothesis you are trying to reject). Any attempt to measure a non-effect will return a result that isn’t quite zero but doesn’t quite have enough data to show there is an effect significant at the current effect size being looked at. A repeated study with more data will get the same sort of equivocal result, just for a smaller effect size. This is why when designing a study you need to first establish a lower bound on effect sizes or be willing to say something like “we see no change below x% as being clinically relevant.” Otherwise, if there really is no effect, you get a series of bad studies as you are tempted to misuse larger and larger sample sizes to study smaller and smaller effects. Statistics can never “prove zero”; they can only prove an effect is below a given bound.

      A famous example is attempting to test the difference between 41 shades of blue on a thin border. You know this can make no real difference, but the poor suckers running the test will only get equivocal results. You can then send them back to run larger tests (which will also fail to achieve statistical significance) because “at our scale even a very small effect is important.” Insist the statistician prove there is absolutely no effect (don’t let them get away with proving any effect is below a given size).

    • Don’t provide a domain expert or product manager:

      One of the more useful tenets of modern software development (in particular some of the variations of “agile”) is that for useful work to be done a domain expert or product manager must be integral to the effort. Often this role is called “the customer” and it is an individual in the company (not a real customer) who has the experience, intuition and authority to declare success or failure (in addition to supplying ideas and useful intermediate goals). Ambitious research is even riskier than development, so make sure the research group does not even meet good development practices (let alone good research practices).

      For example:
      Statistics/Analytics is very good at testing and quantifying possible profitable hunches but (despite some of the broader claims attributed to data mining) has no systematic way of generating non-trivial hunches. So you can slow down an analytics effort by not supplying any intuition. Insist that it all “come from the data.”

    • Insist on retrospective studies:

      Convince management that the market will not tolerate experimentation (customers will revolt, competitors will see our secret sauce, …). So any proposed change can only be analyzed by attempting a retrospective study on older data. Instead of exposing new customers to variations on proposed improvements, have the analytics group sift through old data and model (guess at) the impacts these changes would have had using machine learning or statistical modeling. Machine learning is particularly painful without training data and statistics depends on meaningful measurements. Retrospective studies are very important, but they cannot be your only tool.

    • Insist on perfectly clean studies:

      If the retrospective trap doesn’t work you are in a good position to push for “perfectly clean studies.” Only one variation can be tried at a time (else variations interfere) and you can’t even end the trial on disaster (“could be a fluke, backing out the change now would give our data a censorship/stopping bias”). With enough procedures, and by insisting on sample sizes specified before having any hint at the effect size you are trying to measure, you can completely crush analytics.

    • Self service analytics:

      If the “clean study” gambit doesn’t work then you are in a good position to advocate “self service analytics.” Push control of the A/B testing infrastructure to all of the engineers. Any engineer can request a fraction of the site traffic to try a variation on. Each customer might see many different variations from the many different engineers- “but hey, with a little linear algebra the stat guys can iron that out.”

    • Security:

      They can’t analyze the data if they can’t get to it. Partition that data into different areas of sensitivity, build elaborate procedures and protocols so data from different sensitivity areas can not be combined. Or just deny analysts access to all of the data. Your IT/Networking department can do this for you with complicated chains of trusted clients, VPNs and approved builds.

    • Dining Philosophers:

      Make sure you don’t provision enough resources (machines, disk, memory, database nodes) for all of your analysts to work at the same time. Get them to turn on each other. This is sometimes called the datamart method.

    • Catch 22 ROI:

      Don’t budget a study until you know the expected ROI of the result and don’t accept an ROI estimate that isn’t backed up by a study.

    • Blue Ocean Strategy:

      Make sure your analyst has a “blue ocean opportunity.” Give them data that nobody has ever used or looked at before. Wait a while and then say “they are way too expensive to be running down these picayune data cleaning issues.”

    • Run before you can walk:

      Insist the analysis scale to “billions of records” on the first try. Or try the early spec gambit: “this needs to go into development parallel to you doing the research.”

    • “I could have done that in Excel”:

      The dual to the run before you can walk strategy. Don’t allow any easy victories (like “we found all of the currently unprofitable accounts”). Insist on exotic models and above all “prediction” (“predict which accounts will become unprofitable”).

    • “Needs to be more explainable”:

      The entire analysis technique needs to fit onto a single Powerpoint slide- “for upper management.” This is the dual strategy to the “I could have done that in Excel” strategy.

    • Insist on Excel:

      Insist on and enjoy the deadly dance of pivot tables, office data connections and plugin solvers.

    • Death by software engineering:

      Insist not on a result or procedure but a “dashboard” with “an intuitive UI.”

    • Postulate sub-populations of non-customers:

      Even with a product manager you can force failure by concentrating analytics on the wrong questions. Postulate three to five customer types that find your product lacking for different contradictory reasons (“too technical”, “not for power users”, …). Now you can squander effort on: characterizing the customer groups, estimating the size of the customer groups and estimating the improved uptake each incompatible change to your product would induce in each group. Instead of working on your product you are now working on psychology, demographics and many incompatible variations of your product.
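The trap described under “Encourage the A/B testing framework to sting itself to death” can be put in numbers. The following is a minimal sketch, assuming the standard normal-approximation sample size formula for comparing two means (two-sided 5% test, 80% power). The object name `SampleSize` and the example values of `sigma` and `delta` are hypothetical and chosen only for illustration.

```scala
object SampleSize {
  // Standard normal quantiles: 97.5% (two-sided 5% test) and 80% (power).
  val zAlpha = 1.959964
  val zBeta = 0.841621

  // Approximate number of subjects needed in EACH arm to detect a difference
  // of `delta` in means when the per-subject standard deviation is `sigma`.
  def perArm(sigma: Double, delta: Double): Double =
    2.0 * math.pow((zAlpha + zBeta) * sigma / delta, 2)

  def main(args: Array[String]): Unit = {
    // Each halving of the effect size quadruples the required sample.
    for (delta <- Seq(0.1, 0.05, 0.025, 0.0125))
      println(f"delta=$delta%.4f  subjects per arm=${perArm(1.0, delta)}%.0f")
  }
}
```

Because the required sample grows as 1/delta², chasing an effect that is truly zero never terminates: every “larger test” only buys a smaller bound, which is why the bound has to be fixed before the study starts.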

    In conclusion: if you can’t win this game against the analysts, you aren’t really trying. Or for the non-tongue-in-cheek version: successful ambitious analytics requires a minimum amount of attention and flexibility. All of the “blockers” here are variations of valid concerns that only become blockers when there is no attention or adaptation.

    Fast Portfolio re-Balancing as a Fractional Linear Program
    http://www.win-vector.com/blog/2010/08/fast-portfolio-re-balancing-as-a-fractional-linear-program/
    Fri, 13 Aug 2010 04:11:41 +0000, John Mount
  Related posts:
  • “Easy” Portfolio Allocation
  • What Did Theorists Do Before The Age Of Big Data?
  • Programs reduced to statistics
    Fast Portfolio re-Balancing as a Fractional Linear Program is an example of the kind of work we have done encoding client problems (in this case optimal portfolio selection) as optimization problems (so we can use purchased software to solve them). It’s a bit mathy, but we are excited we got permission to share this.
    An example figure from the article:


    Vertices.png

    What Did Theorists Do Before The Age Of Big Data?
    http://www.win-vector.com/blog/2010/08/what-did-theorists-do-before-the-age-of-big-data/
    Mon, 02 Aug 2010 18:42:45 +0000, John Mount
  Related posts:
  • Good Graphs: Graphical Perception and Data Visualization
  • A Demonstration of Data Mining
  • The Data Enrichment Method
    We have been living in the age of “big data” for some time now. This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: “The Unreasonable Effectiveness of Data” Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)). But I have gotten to thinking about the period before this. The period before we had easy access to so much data, before most computation was aggregation and before we accepted numerical analysis style convergence as “efficient.” A small problem I needed to solve (as part of a bigger project) reminded me what theoretical computer scientists did then: we worried about provable worst case efficiency.


    The problem that got me thinking is this:

    Given a sequence of n integers x1 through xn and an integer k (1 ≤ k ≤ n), find the mean value of all of the medians of the k-sized selections from x1 through xn. Or as a formula:


    EMedian.png
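The formula image is not reproduced in this text. Reading it off the definitions given in the surrounding paragraphs, it can presumably be written as follows (the index-set notation under the sum is our own labeling):

```latex
\[
\frac{1}{\binom{n}{k}}
\sum_{\substack{s \subseteq \{1,\dots,n\} \\ |s| = k}}
\operatorname{median}(x_s)
\]
```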

    where x_s is defined as the sequence of integers whose indices are in the set s (not necessarily a contiguous sequence). The median is the “value in the middle” (a value such that half of the selected data are above it and half are below) and “(n choose k)” is the number of ways to choose k items from a collection of n items (which is just the number: n!/((n-k)! k!)). So our sum is adding up a number of terms and then dividing by the number of such terms to get the average or mean of the terms. We will call this sum a “mean of medians”.

    Some obvious special cases are: for k=1 the expression simplifies to the sum of the x_i divided by n (or just the mean of all the x_i) and for k=n it simplifies to the median of all of the x_i. For intermediate values of k it is not immediately obvious how to efficiently compute the value of the sum. Directly adding all (n choose k) terms (as the sum is written) would be very slow for large n with even moderate sized k. Instead we look to find some method that has the same value of the total sum, without directly computing all of the terms.

    This gets us to the ad-hoc side of theoretical computer science. We need a clever idea. In this case the idea is simple. To keep things simple: suppose all of the n integers in our sequence have different values and that k is odd (neither of these conditions is important; they just let us avoid some non-essential technicalities). What values can median(x_s) take on? median(x_s) is always x_i for some i in the subset s. In fact our sum is equivalent to:


    Sum2.png

    This new sum has a reasonable number of terms- so we can actually calculate it directly if we knew the values of the terms. Without loss of generality assume the x_i are sorted in increasing order. Then the number of times x_i is the median of some x_s is exactly:


    term.png

    (and 0 for i < 1+(k-1)/2 or i > n – (k-1)/2). This is just the number of ways to place x_i in the center of a subset- by choosing (k-1)/2 smaller neighbors and (k-1)/2 larger neighbors. The count is given by multiplying the number of ways to choose smaller neighbors by the number of ways to choose the larger neighbors.
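In symbols, writing h = (k-1)/2 (a shorthand introduced here for readability) and assuming the x_i are distinct and sorted in increasing order, the count described above is:

```latex
\[
\#\{\, s : |s| = k,\ \operatorname{median}(x_s) = x_i \,\}
  = \binom{i-1}{h}\binom{n-i}{h},
\qquad h = \frac{k-1}{2}
\]
```

As a sanity check, every size-k subset has exactly one median, so these counts summed over i = 1, …, n must equal (n choose k).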

    The complete solution calculating the mean of medians for distinct sorted x_i is:


    fullsum.png
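The formula image is not reproduced in this text. Combining the count above with the definition of the mean of medians, the complete expression is presumably:

```latex
\[
\frac{1}{\binom{n}{k}}
\sum_{i=1}^{n}
\binom{i-1}{(k-1)/2}\,\binom{n-i}{(k-1)/2}\; x_i
\]
```

Here the binomial coefficients are taken to be zero when the lower index exceeds the upper one, which covers the out-of-range values of i.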

    A statistician would recognize this expression as a kind of centrally weighted Winsorized mean. The shape of the graph of weights (in this case n=10, k=5) is suggestive of a bounded normal window (though i is a rank, not a free-ranging value):


    10w5.png
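The derivation above can be checked mechanically. The following is a minimal sketch, not code from the original article: `closedForm` implements the weighted sum under the same assumptions as the text (distinct values, odd k), and `bruteForce` adds up all (n choose k) medians directly. All names are hypothetical.

```scala
object MeanOfMedians {
  // n choose k as an exact BigInt; 0 when k is out of range.
  def choose(n: Int, k: Int): BigInt =
    if (k < 0 || k > n) BigInt(0)
    else (1 to k).foldLeft(BigInt(1))((acc, j) => acc * (n - k + j) / j)

  // Number of size-k subsets whose median is the i-th smallest value, i = 1..n.
  def weights(n: Int, k: Int): Seq[BigInt] = {
    val h = (k - 1) / 2
    (1 to n).map(i => choose(i - 1, h) * choose(n - i, h))
  }

  // Closed form: O(n log n) for the sort plus n binomial evaluations.
  // Assumes distinct values and odd k, as in the text.
  def closedForm(xs: Seq[Int], k: Int): Double = {
    require(k % 2 == 1 && k >= 1 && k <= xs.length)
    val sorted = xs.sorted
    val total = weights(sorted.length, k).zip(sorted).map { case (w, x) => w * x }.sum
    (BigDecimal(total) / BigDecimal(choose(sorted.length, k))).toDouble
  }

  // Direct evaluation of the definition: one term per size-k index set.
  def bruteForce(xs: Seq[Int], k: Int): Double = {
    val medians = xs.indices.combinations(k)
      .map(s => s.map(xs).sorted.apply((k - 1) / 2).toDouble)
      .toVector
    medians.sum / medians.length
  }
}
```

For n=10, k=5 `weights` gives 0, 0, 21, 45, 60, 60, 45, 21, 0, 0, which sums to 252 = (10 choose 5). The brute force version enumerates index sets rather than values so that it still visits every selection if values repeat.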

    Likely we have re-invented a data treatment known to statisticians. But the above steps were really just combinatorics. What a theorist does is abstract something down to this sort of problem and think of variations and solutions. The formal side of theoretical computer science is attempting to organize all of the problems in the world into two classes: problems presumed easy and problems presumed hard.

    For example: what if we had wanted to know the median of many means instead of the mean of many medians?
    It turns out a small variation of the median of means problem is already known to be difficult. The hard version of the reversed problem is called “Kth largest subset” (this is a different K than we have been using up until now). The Kth largest subset problem is: given a sequence of integers and constants K and B, do K or more distinct subsets have a sum of no more than B? The Kth largest subset problem is known to be “NP hard,” which means that a whole host of other problems thought to be hard could be solved if we had the ability to solve this problem (see “Computers and Intractability: A Guide to the Theory of NP-Completeness” Michael R. Garey and David S. Johnson, 1979). The median of many means is not quite as expressive as the Kth largest subset problem (so we have not proven the median of many means is itself NP hard) but the relation is strong: to even verify that we had the right median we would need to solve a Kth largest subset problem with K=(n choose k)/2 and B=k*the_claimed_median (assuming we modified the subset problem to restrict itself to size k sub-sequences). In fact even checking if a given number is one of the possible means (let alone if it is the median of the means) is equivalent to the NP Complete knapsack problem. This correspondence makes us suspect that something as simple as the trick we devised for the mean of medians problem will not solve the median of means problem. One should not take this argument too seriously though, as the same argument would seem to apply to the pair of problems “min of means” and “mean of mins,” both of which are in fact easy. We have not proven the median of means problem to be hard; we have just found relevant algorithms and related problems.
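For contrast with the mean of medians, the only obviously correct way to get the median of means is to enumerate every selection. The following is a minimal sketch of that exponential approach (hypothetical names; taking the lower of the two middle values when the number of means is even is our own convention, not something the text specifies).

```scala
object MedianOfMeans {
  // Brute force: enumerate every size-k index set, average each, sort the means.
  // The work grows as (n choose k), so this is only usable for small inputs.
  def bruteForce(xs: Seq[Int], k: Int): Double = {
    require(k >= 1 && k <= xs.length)
    val means = xs.indices.combinations(k)
      .map(s => s.map(xs).sum.toDouble / k)
      .toVector
      .sorted
    means((means.length - 1) / 2) // lower median when the count is even
  }
}
```

Unlike the mean of medians, there is no apparent way to count how many selections produce a given mean without examining sums of subsets, which is where the connection to subset-sum style problems comes from.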

    What theorists do is find these analogs and then do the heavy lifting (continue the work even when the solution is not simple) to find a way to either efficiently solve the problem at hand or prove it is in fact as hard as other important problems. This kind of thing is hard work for small gains, but the value accumulates because each of these gains is permanent. Finally additional variations of the problem are tried and characterized, to help check we are not “leaving money on the table” (missing nearby improvements). Some of this may seem like pointless work on toy problems, but every once in a while one of these solutions or characterizations is exactly what is needed to take a large system (like a search engine) to the next level of performance.

    Gradients via Reverse Accumulation
    http://www.win-vector.com/blog/2010/07/gradients-via-reverse-accumulation/
    Thu, 15 Jul 2010 00:00:04 +0000, John Mount
  Related posts:
  • Automatic Differentiation with Scala
  • “Easy” Portfolio Allocation
  • A Quick Appreciation of the Sharpe Ratio
    We extend the ideas from Automatic Differentiation with Scala to include reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients.
    As the tables, diagrams and equations do not translate well into HTML, our full article is available here in PDF: http://www.win-vector.com/dfiles/ReverseAccumulation.pdf.

    The purpose of our article is to explain reverse accumulation automatic differentiation clearly (and to release some sample code and timing results). A side effect of the article is to make sense of the following two diagrams:

    If the following is a picture of standard or forward differentiation:

    cutFwd.png

    then the following is a picture of reverse accumulation:

    cutRev.png
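To make the idea concrete without the PDF, here is a minimal sketch of reverse accumulation. It is an illustration written for this summary, not the code released with the article: it supports only `+` and `*`, and every name in it is hypothetical. The function is evaluated once while recording which values fed into which, then the derivative of the output is pushed backwards through that record.

```scala
import scala.collection.mutable

object ReverseSketch {
  // One node per intermediate value; parents holds (input node, local partial derivative).
  final class Node(val value: Double, val parents: List[(Node, Double)]) {
    var adjoint: Double = 0.0 // d(output)/d(this node), filled in by the reverse sweep
    def +(o: Node): Node = new Node(value + o.value, List((this, 1.0), (o, 1.0)))
    def *(o: Node): Node = new Node(value * o.value, List((this, o.value), (o, value)))
  }

  // Returns (f(x), gradient of f at x) from one forward evaluation and one reverse sweep.
  def valueAndGradient(f: Array[Node] => Node, x: Array[Double]): (Double, Array[Double]) = {
    val inputs = x.map(v => new Node(v, Nil))
    val out = f(inputs)
    // Depth-first post-order: every node is listed after all of its inputs.
    val order = mutable.ArrayBuffer.empty[Node]
    val seen = mutable.HashSet.empty[Node] // Node uses reference equality
    def visit(n: Node): Unit =
      if (seen.add(n)) { n.parents.foreach(p => visit(p._1)); order += n }
    visit(out)
    out.adjoint = 1.0
    // Walk from the output back to the inputs, pushing adjoints to parents.
    for (n <- order.reverseIterator; (p, d) <- n.parents) p.adjoint += n.adjoint * d
    (out.value, inputs.map(_.adjoint))
  }
}
```

For example, `ReverseSketch.valueAndGradient(a => a(0) * a(1) + a(0), Array(3.0, 4.0))` yields the value 15.0 and the gradient (5.0, 3.0). The whole gradient comes from a single reverse sweep regardless of the number of inputs, which is the source of the speedup over running one forward pass per input. The recursive `visit` is fine for a sketch but would need to be made iterative for very deep expressions.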

    Automatic Differentiation with Scala
    http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/
    Tue, 15 Jun 2010 04:19:20 +0000, John Mount
  Related posts:
  • Gradients via Reverse Accumulation
  • R examine objects tutorial
  • Survive R
    This article is a worked-out exercise in applying the Scala type system to solve a small scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion.
    Usually we work using a combination of databases, Java, optimization libraries and analysis suites (like R). The reason is that, for our typical problems, Java hits a sweet spot of trading off runtime performance against ease of development and maintenance. In the tens of gigabytes range (data sets larger than the Wikipedia but smaller than the Web) Java outperforms the scripting languages (Ruby, Python …) and is much easier to develop in and document than C++. This sweet spot is both subjective and situational: if the tasks were smaller and in a services framework Python is a better choice, if performance is paramount then C or C++ (with the STL) and Hadoop are a better choice, if pre-built statistical libraries are needed then R becomes a better choice. For the type of problem we present here Scala is a very good choice.

    Our Example Problem

    Our small scale problem is this: we have a number of target points on a map and we want to pick a central point to directly connect to all of these points with wire. Our goal is to minimize the total amount of wire used. This problem is called the “Geometric Median”. So we are trying to find a point that minimizes the sum of distances from our chosen center to every target point. If we were trying to minimize the sum of squared distances from our chosen center to every target point the answer would be obvious: the average or mean (which by Hooke’s law is also the point where a set of identical springs would relax to). The mean is in fact a fairly good guess, but you can do better (which could be important if the “wire” is expensive, such as cutting irrigation or drainage ditches). For example given the three target points (20,0), (-1,-1) and (-1,1) the optimal point is (-0.42,0) not the mean (6,0) and the choice of optimal point represents an over 19% savings in total wiring distance (see figure).


    points.png

    This is a substantial saving in cost.
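The figures quoted for the three-point example are easy to confirm. The following is a small sketch (hypothetical names) that evaluates the total wiring distance at the mean and at the stated optimum:

```scala
object WireCost {
  // The three target points from the example.
  val targets: Seq[(Double, Double)] = Seq((20.0, 0.0), (-1.0, -1.0), (-1.0, 1.0))

  // Total straight-line wire needed from a center (cx, cy) to every target.
  def cost(cx: Double, cy: Double): Double =
    targets.map { case (x, y) => math.hypot(cx - x, cy - y) }.sum

  // The mean of the targets, which is (6, 0) here.
  val mean: (Double, Double) =
    (targets.map(_._1).sum / targets.size, targets.map(_._2).sum / targets.size)

  val atMean: Double = cost(mean._1, mean._2) // about 28.14
  val atOptimum: Double = cost(-0.42, 0.0)    // about 22.73
  val saving: Double = 1.0 - atOptimum / atMean // about 0.19
}
```

The cost at the mean is 14 + 2·√50 ≈ 28.14, against roughly 22.73 at (-0.42, 0), which is where the “over 19%” comes from.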

    The problem changes as we consider variations. If indirect connections (such as routing one point through another, which may or may not be possible for reasons of capacity or safety) and multiple new centers are allowed we then have an instance of the Steiner Tree Problem which is harder to solve (since it is known to be NP complete). If no new centers are allowed (all routing must be between pre-existing target points) then we have a Spanning Tree Problem- which admits very quick solutions.

    We bring up the geometric median as a mere example. We don’t intend for our code to solve only the geometric median problem and we don’t intend to touch on the literature of specialized methods for solving the geometric median problem. Instead we are trying to demonstrate the speed at which you can develop prototype solutions if you have a few good tools (like various optimizers) available in your toolkit. Numeric optimizers may sound exotic, but they often are the kind of thing you want to experiment with and link directly into your code.

    Optimization as General Tool

    Now that we have the example problem we can describe a solution strategy. In this case the solution uses code “we wished we had lying around” before we started on the problem. We will pretend we have the tools we want ready to solve our problem and then we will pay our debt and build the required tools. The issue is that there is not an obvious closed form for the solution of the geometric median problem. So we are forced to work a bit harder. In this case harder means we need to solve an optimization problem. Consider the contour plot of the total wiring cost as a function of where we choose to place our center. Our optimal point (-0.42,0) had a wiring cost of 22.73 and the contour plot given here shows concentric regions of solution positions with higher cost.


    contour.png

    In general it is unwise to throw an optimizer at an arbitrary problem and hope to find the globally best solution. But in this case (and in many similar situations) we can prove that a simple local optimizer will in fact find the unique best solution. This is a property of the problem not of the optimizer. The concentric regions shown in the contour plot have a very nice shape: they are convex. That is: they have no intrusions; for any two points drawn from one of these shapes the straight line segment between these points stays inside the given shape. We don’t have to depend on observation: we can actually prove this is always the case for this problem. The wiring cost from a proposed center to any single target point is a convex function of where we choose to place our center (a convex function is a function whose graph never reaches above the secant line drawn between any two points on its graph). The total wiring cost is just the sum of the wiring costs to each target point. And to finish: the sum of a collection of convex functions is itself a convex function. Since the contours of a convex function enclose only convex shapes, we have proven the statement.
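The two facts used in this argument can be written out. In the notation below (ours, not the article's), a and b are two candidate centers, x_k is a target point and 0 ≤ λ ≤ 1. The first line is the definition of convexity; the second shows that the distance to a single target satisfies it, by the triangle inequality.

```latex
\[
f(\lambda a + (1-\lambda) b) \;\le\; \lambda f(a) + (1-\lambda) f(b)
\]
\[
\lVert \lambda a + (1-\lambda) b - x_k \rVert
 = \lVert \lambda (a - x_k) + (1-\lambda)(b - x_k) \rVert
 \;\le\; \lambda \lVert a - x_k \rVert + (1-\lambda) \lVert b - x_k \rVert
\]
```

Adding the second inequality over all target points gives the first inequality for the total wiring cost.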

    But how does this help us? There is a standard technique to find “local minima” of a function by inspecting a function for places where the gradient is zero (points where there is no obvious down hill direction on the contour plot). This technique usually can only be guaranteed to find local minima (places where no small change improves your situation). But there is no guarantee that the local minimum you find is in fact the global minimum (the best possible solution). Except when you are dealing with a convex function. When a function is convex then all of the local minima are always grouped together into a single convex connected shape (if not a line drawn between two remote minima would violate the convexity definition). And if the function is never flat then this set is a single unique point: the unique best solution. Our inspection technique will be a gradient driven optimizer- that is an optimizer that when the gradient is non-zero improves its objective by running down hill and halts when the gradient is zero.
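As an illustration of such a gradient driven optimizer, here is a minimal sketch of plain gradient descent with a backtracking step size, applied to the wiring cost of the three example points. This is deliberately not the conjugate gradient method the article goes on to use; the names, the central-difference gradient and the tolerances are all assumptions made for the sketch.

```scala
object DescentSketch {
  val dat: Array[Array[Double]] =
    Array(Array(20.0, 0.0), Array(-1.0, 1.0), Array(-1.0, -1.0))

  // Total Euclidean distance from p to every target point.
  def cost(p: Array[Double]): Double =
    dat.map(t => math.sqrt(p.indices.map(i => (p(i) - t(i)) * (p(i) - t(i))).sum)).sum

  // Central-difference estimate of the gradient of f at p.
  def grad(f: Array[Double] => Double, p: Array[Double], h: Double = 1.0e-6): Array[Double] =
    p.indices.map { i =>
      val up = p.clone(); up(i) += h
      val dn = p.clone(); dn(i) -= h
      (f(up) - f(dn)) / (2.0 * h)
    }.toArray

  // Run downhill until the gradient is (nearly) zero.
  def minimize(f: Array[Double] => Double, p0: Array[Double],
               tol: Double = 1.0e-5, maxIter: Int = 10000): Array[Double] = {
    var p = p0.clone()
    var iter = 0
    var done = false
    while (!done && iter < maxIter) {
      val g = grad(f, p)
      val gNorm = math.sqrt(g.map(v => v * v).sum)
      if (gNorm < tol) done = true // gradient is zero: halt
      else {
        val fp = f(p)
        def trial(s: Double): Array[Double] = p.indices.map(i => p(i) - s * g(i)).toArray
        // Backtracking: halve the step until the cost drops by a sufficient amount.
        var step = 1.0
        while (step > 1.0e-12 && f(trial(step)) > fp - 0.5 * step * gNorm * gNorm) step /= 2.0
        if (step <= 1.0e-12) done = true else p = trial(step)
      }
      iter += 1
    }
    p
  }
}
```

Starting from the mean, `DescentSketch.minimize(DescentSketch.cost, Array(6.0, 0.0))` should settle near (-0.42, 0). Because the cost is convex, the point where the gradient vanishes is the global optimum rather than merely a local one.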

    The stated function to minimize is to sum the distance from our proposed center to each target point. We can write this as the sum of the distances:


    dist1.png

    ( euclid1.png which is the traditional Euclidean or L2 distance). This function actually has one subtle flaw that we will deal with in the appendix (see: Fixing Smoothness).
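The two formula images are not reproduced in this text. From the description, and consistent with the Scala code that follows, they presumably read as below, where p is the proposed center, x_1, …, x_m are the target points and i runs over coordinates (the symbol names are our own):

```latex
\[
f(p) = \sum_{k=1}^{m} \lVert p - x_k \rVert_2 ,
\qquad
\lVert p - x \rVert_2 = \sqrt{\sum_{i} (p_i - x_i)^2}
\]
```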

    Using Scala to Apply the Optimization Solution

    To find our optimal center placement using Scala we first write our cost or objective as a Scala function:

        val dat:Array[Array[Double]] = Array(
          Array( 20, 0.0),
          Array( -1.0, 1.0),
          Array( -1.0, -1.0)
        )
    
        def fx(p:Array[Double]):Double = {
          val dim = p.length
          val npoint = dat.length
          var total = 0.0
          for(k <- 0 to (npoint-1)) {
            var term = 0.0
            for(i <- 0 to (dim-1)) {
              val diff = p(i) - dat(k)(i)
              term = term + diff*diff
            }
            total = total + scala.math.sqrt(term)
          }
          total
        }
    

    Scala is succinct and it is a great convenience to have a function definition capture data from its environment. What we would like to do is generate an initial guess at the solution (we use the mean as our initial guess) and then call an optimizer (in this case a conjugate gradient optimizer) to do all the work:

     val p0:Array[Double] = mean(dat)
     val (pF,fpF) = CG.minimize(fx,p0)
    

    At this point we would be done, except the conjugate gradient method (which is superior to gradient descent and many of the non-gradient methods) requires a gradient.
    We could provide a numeric estimate of the gradient by the following divided difference method:

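      // Estimate the gradient of f at p by forward divided differences.
      def gradientD(f:Array[Double]=>Double,p:Array[Double]):Array[Double] = {
        val xdim = p.length
        val p2 = p.clone()   // working copy, so the caller's p is never modified
        val base = f(p2)     // f at the unperturbed point
        val ret = new Array[Double](xdim)
        val delta = 1.0e-6   // step size for the difference quotient
        for(i <- 0 until xdim) {
          p2(i) = p(i) + delta   // perturb coordinate i only
          val fplus = f(p2)
          p2(i) = p(i)           // restore coordinate i
          val diff = (fplus-base)/delta
          ret(i) = diff
        }
        ret
      }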
    

    This numeric divided difference method often outperforms non-derivative optimization methods (like Powell’s Method and the Nelder-Mead Amoeba method). But the technique can run into numeric difficulties. We can remedy this if we are willing to write our function in a slightly more general way. If we re-encode our function in a generic manner we can use automatic differentiation (not to be confused with numeric differentiation or with symbolic differentiation) to produce a reliable gradient for optimization. What we need to do is re-write our function to work over an abstract field of numbers instead of only the machine supplied doubles. In fact what we need to do is specify a generic function that will work over any field, with the field to be determined later. The code to do this in Scala is very similar to the non-generic code:

       val genericFx = new VectorFN {
          def apply[Y <: NumberBase[Y]](p:Array[Y]):Y = {
            val field = p(0).field
            val dim = p.length
            val npoint = dat.length
            var total = field.zero
            for(k <- 0 to (npoint-1)) {
              var term = field.zero
              for(i <- 0 to (dim-1)) {
                val diff = p(i) - field.inject(dat(k)(i))
                term = term + diff*diff
              }
              total = total + smoothSQRT(term)
            }
            total
          }
        }
    

    Notice that the code is very similar to the “def fx()” code. The key differences are that we had to define genericFx as extending a trait (a type of Scala interface) called VectorFN and inside this trait extension we defined a parameterized function named apply(). apply() is a generic function that is willing to work over any type Y where Y is at least of type NumberBase[Y] (we will get more into what that means in a moment). The difference in notation is that while the Scala function syntax cannot specify a generic function with free type parameters (the incompletely specified Y) the Scala semantics are strong enough to implement this. In fact standard function definitions (such as “def fx()”) are just syntactic sugar for extending the Scala built-in Function1 trait. With a generic objective function in hand all we need is conjugate gradient code that is expecting a VectorFN (and willing to call apply() instead of just using naked function parenthesis) and some type NumberBase[Y] that can compute gradients for us. The Scala compiler can specialize our genericFx() into one version for quick calculation and another for gradients. How this is done is what we will discuss next. From our point of view our problem is solved with the following one line of code:

    val (pF,fpF) = CG.minimize(genericFx,p0)
    

    This should always be your goal: prepare enough that your last step is an “obvious one-liner.”

    What Tools we Wish we Had Lying Around

    We supply in our example some workable conjugate gradient code, but that is standard so we will not discuss it. What is of interest (and facilitated by Scala’s parameterized type system) is the implementation of dual numbers as a framework to supply automatic differentiation. An implementation of dual numbers as a NumberBase[DualNumber] type is the core of our demonstration.

    Dual numbers are an algebraic structure written as pairs of real numbers “(a,b)”. The arithmetic table for dual numbers is given below:

    (a,b) + (c,d) = ((a+c) , (b+d))
    (a,b) – (c,d) = ((a-c) , (b-d))
    (a,b) * (c,d) = ((a*c) , (a*d+b*c))
    (a,b) / (c,d) = ((a/c) , ((b*c-a*d)/(c*c)))

    In a dual number (a,b) “a” is the “large” or “standard” part of the number. You can check from the arithmetic table that the pair of dual numbers (a,0) and (c,0) behave just as we would expect the real numbers a and c to behave. In the dual number (a,b) “b” is the “small” or “ideal” portion of the number. From the multiplication rule above we can observe two rules: (0,b) * (c,0) = (0,b*c) (something small times anything else is small) and (0,b)*(0,d) = (0,0) (two small things become zero when multiplied). Essentially the dual numbers are carrying around the first two terms of a Taylor series: we get as a result both the function value and the function derivative. For a function f() over the real numbers we extend f() to work over the dual numbers by defining: f((a,b)) = (f(a),b f’(a)) (which is consistent with the previously defined arithmetic). We can check that the dual numbers obey the usual laws of arithmetic (associative, commutative, distributive and identities), with inverses available for every dual number whose standard part is non-zero. The punchline is that over the dual numbers the divided difference estimate of f’(x) (the derivative of f() evaluated at x) is in fact exact in the sense that f((x,1)) = (f(x),f’(x)) (or f((x,0)+(0,1)) – f((x,0)) = (0, f’(x))). Implementing the DualNumber class is little more than transcribing the above arithmetic table into Scala.
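    To make the arithmetic table concrete, here is a minimal sketch of such a transcription. It is not the DualNumber class from the distributed jar: the names Dual, DualDemo, f and valueAndDerivative are made up for this illustration, and only the four basic operations are covered.

```scala
// Hypothetical minimal dual number: a is the standard part, b the infinitesimal part.
final case class Dual(a: Double, b: Double) {
  def +(that: Dual): Dual = Dual(a + that.a, b + that.b)
  def -(that: Dual): Dual = Dual(a - that.a, b - that.b)
  def *(that: Dual): Dual = Dual(a * that.a, a * that.b + b * that.a)
  // quotient rule; only meaningful when that.a is non-zero
  def /(that: Dual): Dual =
    Dual(a / that.a, (b * that.a - a * that.b) / (that.a * that.a))
}

object DualDemo {
  // an example function written using only the arithmetic above:
  // f(x) = x^3 + x/(x+1)
  def f(x: Dual): Dual = x * x * x + x / (x + Dual(1.0, 0.0))

  // evaluate at (x,1): the result carries (f(x), f'(x))
  def valueAndDerivative(x: Double): (Double, Double) = {
    val r = f(Dual(x, 1.0))
    (r.a, r.b)
  }

  def main(args: Array[String]): Unit = {
    val (v, d) = valueAndDerivative(2.0)
    println("f(2) = " + v + ", f'(2) = " + d)
  }
}
```

    By hand f’(x) = 3x^2 + 1/(x+1)^2, so the second component at x = 2 should agree with 12 + 1/9 up to rounding. No step size was chosen anywhere, which is the advantage over a numeric divided difference.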

    We have already seen how to write code that uses NumberBase[Y] types (genericFx() itself is an example). A more complicated example is the CG.minimize() code which not only accepts a generic function (in the form of VectorFN) but then specializes it to NumberBase[DualNumber] to compute gradients and also specializes to NumberBase[MDouble] for quick calculation during line searches (MDouble is just an adapter for machine Doubles, used for speed). The ability to re-specialize a function is one of the advantages of a parameterized type system. The DualNumbers are an example of forward automatic differentiation. We could also use the same object framework to capture a representation of the computation path and apply more sophisticated methods such as reverse automatic differentiation.

    We give a link to a jar containing complete Scala source code including this example, the DualNumber implementation, a conjugate gradient implementation and some JUnit tests (all under a GPLv3 license) and will go on to describe some of the design decisions. The code is the bulky part of this work, so we will move on to discuss something more compact: types.

    Types

    If code is ever beautiful it is only when it is succinct. Among the most succinct forms of code are individual type signatures and interfaces (though the indiscriminate repetition of type signatures is rightly considered ugly bloat, which Scala works to avoid). Since we are distributing complete source we will describe only types and method signatures. The entry points to the code are the JUnit tests (organized in the ScalaDiff/test source directory and depending on JUnit, which is not included) and the demo program in ScalaDiff/src/demo/Demo.scala.

    To be a usable arithmetic type (like DualNumber or MDouble) you must extend the following parameterized abstract class:

    abstract class NumberBase[NUMBERTYPE <: NumberBase[NUMBERTYPE]] {
      // basic arithmetic
      def + (that: NUMBERTYPE):NUMBERTYPE
      def - (that: NUMBERTYPE):NUMBERTYPE
      def unary_-():NUMBERTYPE
      def * (that: NUMBERTYPE):NUMBERTYPE
      def / (that: NUMBERTYPE):NUMBERTYPE  // that not equal to zero
      // more complicated
      def pow(that:Double):NUMBERTYPE
      def exp:NUMBERTYPE
      def log:NUMBERTYPE // this is positive
      // comparison functions
      def > (that: NUMBERTYPE):Boolean
      def >= (that: NUMBERTYPE):Boolean
      def == (that: NUMBERTYPE):Boolean
      def != (that: NUMBERTYPE):Boolean
      def < (that: NUMBERTYPE):Boolean
      def <= (that: NUMBERTYPE):Boolean
      // utility
      def field:Field[NUMBERTYPE]
    }
    

    In particular DualNumber extends NumberBase[DualNumber]. This deliberate circular reference has a big purpose: it allows publicly visible covariant return types (returning nearly the exact type we really are instead of a base type). This allows us to have strict type arguments so that trying to add an MDouble to a DualNumber is a type error (even though they both extend the same base class). The automatic differentiation technique encapsulated in the DualNumber class only works if all of the calculation is in the DualNumber types, and this strict type enforcement allows the compiler to help prevent results sneaking in and out through other types. All of the methods on NumberBase are obviously related to arithmetic except the field() method. This method gives us access to a Field object which is responsible for carrying around the runtime type information (this is a common problem in Java and Scala: some type information known at compile time, such as the choice of template types, is not easily accessed at runtime). The Field class is as follows:

    abstract class Field [NUMBERTYPE <: NumberBase[NUMBERTYPE]] {
      def zero:NUMBERTYPE            // return canonical zero in field
      def one:NUMBERTYPE             // return canonical one in field
      def inject(v:Double):NUMBERTYPE  // return canonical representation of number in field
      def project(v:NUMBERTYPE):Double // return standard-number represented in field
      def array(n:Int):Array[NUMBERTYPE] // return an array of this type
    }
    

    The Field class is where we have factories for numbers (zero, one, arrays, injection from standard Doubles) and casting (projection back to standard Doubles).
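    A cut-down sketch may help show how the two classes cooperate. Everything here is a stand-in invented for illustration (Num, NumField, Plain, PlainField and FieldDemo are not names from the distributed code), and only a few of the methods are reproduced. Plain plays the role of a wrapper around machine Doubles in the spirit of MDouble.

```scala
// Hypothetical trimmed versions of NumberBase and Field.
abstract class Num[T <: Num[T]] {
  def +(that: T): T
  def *(that: T): T
  def field: NumField[T]
}

abstract class NumField[T <: Num[T]] {
  def zero: T
  def inject(v: Double): T
  def project(v: T): Double
}

// Adapter around a machine Double.
final case class Plain(v: Double) extends Num[Plain] {
  def +(that: Plain): Plain = Plain(v + that.v)
  def *(that: Plain): Plain = Plain(v * that.v)
  def field: NumField[Plain] = PlainField
}

object PlainField extends NumField[Plain] {
  def zero: Plain = Plain(0.0)
  def inject(v: Double): Plain = Plain(v)
  def project(v: Plain): Double = v.v
}

object FieldDemo {
  // Generic code: the caller picks T, and constants are obtained from the field
  // carried by the arguments rather than written as Double literals.
  def sumOfSquares[T <: Num[T]](p: Array[T]): T = {
    val field = p(0).field
    var total = field.zero
    for (x <- p) total = total + x * x
    total
  }

  // Up-convert from Doubles, compute generically, then down-convert.
  def run(xs: Array[Double]): Double = {
    val p = xs.map(x => PlainField.inject(x))
    PlainField.project(sumOfSquares(p))
  }
}
```

    Because the argument of + is T rather than the base class, an expression mixing Plain with some other Num subtype would be rejected by the compiler, which is the strictness described above.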

    With these types defined we can actually read intent off some of the method signatures.

    For example our conjugate gradient optimizer is accessed through the following method signature:

     def minimize(fn:VectorFN,x0:Array[Double]):(Array[Double],Double) // return x,f(x)
    

    The above can be read as: CG.minimize() requires a VectorFN (our trait representing single argument functions with a free type parameter) and an initial point (in standard Doubles). The code will then return a pair of the optimum point and the function evaluated at the optimum point. From the type signature we can see that CG.minimize() expects to re-specialize the function “fn” to types of its own choosing (else it could have accepted a parameterized argument instead of our custom trait) and will handle all up-conversion and down-conversion between machine Doubles and NumberBase[Y]s itself. This sort of type information is hard to express (let alone enforce) in a dynamically typed language.

    A slightly more complicated example is the lineMinD() method:

    def lineMinD[Y<:NumberBase[Y]](field:Field[Y],
     f:Array[Y]=>Y,
     xm:Array[Double],
     di:Array[Double]):(Array[Double],Double)
    

    Notice it is willing to work with any type parameterized function (which means it is willing to let the caller pick the actual type of NumberBase[Y] and work with that). Most callers will call with Y=MDouble (the wrapper for machine Doubles) and lineMinD() will then work with that (without ever really knowing the actual underlying type).

    A lot of fans of dynamic languages consider type systems to be mere hairshirt penance. But that is not so. Broken type systems (like Java’s collections before generics, implemented through type erasure, were introduced in Java 1.5) are indeed more trouble than they are worth. Working type systems (like C++ Templates/STL, Java 1.5+ and Scala) allow you to solve problems (and enforce decisions) during the design phase (which is much much cheaper than during the deployment phase). You can’t set your types in stone (you are likely going to have them subtly wrong for the first few iterations). You must be willing to think like a “language lawyer” to find out what parts of your work can be specified and enforced in the language type system. To use an analogy: static types are your blueprint or your underpainting.

    Tests

    One argument against static types is that you can get much of their benefit from unit tests. My opinion is you never have enough unit tests, so putting more pressure on your test suite is not wise. Static types plus tests are strictly more powerful than static types alone or tests alone.

    Even for this toy-scale example project we have included a JUnit test set to pursue a number of goals:

    • Confirm our number implementations (DualNumber and MDouble) correctly model machine Doubles (perform parallel calculations and compare).
    • Confirm DualNumber obeys expected laws of algebra composition and cancellation including the portions that can not be modeled in machine Doubles.
    • Confirm DualNumbers compute gradients.
    • Confirm operations of optimizers and optimizer components.

    Many of these tests are related, but they don’t all imply each other and they give different perspectives on the errors they catch. For example no amount of parallel computation between DualNumbers and machine Doubles is going to confirm the infinitesimal portion of the DualNumber is propagating correctly (since this is not a property of machine Doubles). So we add extra tests that expect DualNumber to obey algebraic relations like: a*(b+c) = a*b + a*c. It is then another step to confirm that whatever the DualNumbers calculate is not only self-consistent, but also models a truncated Taylor Series or differentiation.
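    The three kinds of check can be sketched in a few lines. This is not the JUnit suite from the jar: D and DualChecks are hypothetical names, the dual number is cut down to + and *, and the checks are plain Boolean functions so the sketch stays self-contained.

```scala
// Hypothetical cut-down dual number used only for these checks.
final case class D(a: Double, b: Double) {
  def +(t: D): D = D(a + t.a, b + t.b)
  def *(t: D): D = D(a * t.a, a * t.b + b * t.a)
}

object DualChecks {
  // Algebraic law: distributivity, including the infinitesimal parts.
  // Exact equality is only safe for inputs whose arithmetic is exact
  // (small integers); general inputs would need a tolerance.
  def distributes(x: D, y: D, z: D): Boolean =
    x * (y + z) == x * y + x * z

  // Parallel calculation: the standard part tracks plain Double arithmetic.
  def standardPartMatches(x: Double, y: Double): Boolean =
    (D(x, 0.0) * D(y, 0.0) + D(x, 0.0)).a == x * y + x

  // Differentiation: for p(t) = t^3 + t the derivative is 3 t^2 + 1.
  def derivativeMatches(t: Double): Boolean = {
    val x = D(t, 1.0)
    val r = x * x * x + x
    math.abs(r.b - (3.0 * t * t + 1.0)) < 1e-9
  }
}
```

    Note that standardPartMatches can pass even if the second component is computed wrongly, which is exactly why the other two checks are needed.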

    Conclusion

    We hope we have demonstrated how the complexity of a mathematical programming problem can be managed by breaking the problem into an objective function that is separate from the optimizer (allowing the optimizer to be both good and hidden) and using a static type system (such as Scala’s) to help enforce required properties of a calculation (such as all numbers being routed through a required representation). With this sort of tool available many formerly hard problems (that are often, unfortunately, solved by over-specifying direct inefficient iterative improvement techniques) become “if I can write a reasonable objective function this may already be solved by an optimizer in my library.” The more of these tools you have (either in your code or in your reference library) the more of these problems become easy (this is the topic of my earlier paper: The Local to Global Principle).

    Appendix: Fixing Smoothness

    Our chosen example objective function is very nice (i.e. convex) but it has a small (but correctable) problem. The derivative or gradient has some jump discontinuities that could cause an optimizer to exit prematurely (not at the global optimum). Consider the simple form of this for wiring a center to a single point at the origin (even in 1 dimension). The wiring cost function sqrt(x*x) has a cost graph as shown here.


    abs.png

    This is convex, but the derivative is not continuous, as we see in the included graph of the derivative of sqrt(x*x).


    dabs.png

    So: in this case if the optimizer stops at one of the target points we can’t be sure that it stopped at the global optimum (it may have stopped due to the discontinuity in the gradient). For some simple problems the optimum is necessarily at a target point. For example on the number line take the target points 0,1 and x. As long as x≥0 and x≤1 the optimum placement will be x itself.

    One way to defend against this is to use some sort of smoothed version of sqrt() that essentially decreases a little faster near the origin. Our cost function becomes:


    cost2.png

    where s() is our suitable approximation of the sqrt() function. Two candidates are s(x) = (x+tau)^(1/2) and s(x) = x^(1/2 + tau); where tau is a small constant. As long as tau is greater than zero we have no derivative discontinuity in s(x^2) and convexity is preserved (even made a bit stricter). Other ways to deal with this include adding additional coordinates to the problem and small perturbations on these coordinates. Finally, a point found by optimizing with respect to s(x) can be “polished” by re-starting the optimization at the first found solution and using sqrt(x) as the new objective (if the original point is not near any of the target points).
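    A small numeric check illustrates the effect of the first candidate, s(x) = (x+tau)^(1/2), on the one dimensional example above. This is only a sketch: SmoothDemo, the value of tau and the jump() helper are assumptions made for the illustration, and the smoothSQRT used in the earlier code may differ.

```scala
object SmoothDemo {
  // Hypothetical smoothing constant; any small positive value works here.
  val tau: Double = 1.0e-6

  // First candidate from the text: s(x) = (x + tau)^(1/2)
  def smoothSqrt(x: Double): Double = math.sqrt(x + tau)

  // One dimensional cost of wiring a center at x to a point at the origin.
  val rawCost: Double => Double = x => math.sqrt(x * x)
  val smoothCost: Double => Double = x => smoothSqrt(x * x)

  // Difference between the one-sided slopes just right and just left of 0.
  // A jump discontinuity in the derivative shows up as a value that does
  // not shrink as h shrinks.
  def jump(f: Double => Double, h: Double): Double =
    (f(h) - f(0.0)) / h - (f(0.0) - f(-h)) / h

  def main(args: Array[String]): Unit = {
    for (h <- Seq(1.0e-4, 1.0e-5, 1.0e-6)) {
      println("h=" + h + " raw jump=" + jump(rawCost, h) +
        " smooth jump=" + jump(smoothCost, h))
    }
  }
}
```

    The raw cost keeps a slope jump of about 2 however small h gets (slope -1 on the left, +1 on the right), while the smoothed cost has derivative x/sqrt(x*x+tau), which is 0 at the origin, so its measured jump shrinks with h.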

    Related posts:

    1. Gradients via Reverse Accumulation
    2. R examine objects tutorial
    3. Survive R

    ]]>
    http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/feed/ 5
    Must Have Software http://www.win-vector.com/blog/2010/05/must-have-software/?utm_source=rss&utm_medium=rss&utm_campaign=must-have-software http://www.win-vector.com/blog/2010/05/must-have-software/#comments Fri, 28 May 2010 17:26:07 +0000 John Mount http://www.win-vector.com/blog/?p=1461
  • Microsoft Store Again
  • Exciting Technique #1: The “R” language.
  • Public Service Article: JSTOR and other Useful Research Archives
  • ]]>
    Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX) and Windows (NT4, 2000, XP, Vista and 7) for quite a while I have seen a lot of different software tools. I would like to quickly exhibit my “must have” list. These are the packages that I find to be the single “must have offerings” in a number of categories. I have avoided some categories (such as editors, email programs, programming languages, IDEs, photo editors, backup solutions, databases, database tools and web tools) where I have no feeling of having seen a single absolute best offering.

    The spirit of the list is to pick items such that: if you disagree with an item in this list then either you are wrong or you know something I would really like to hear about.

    Encryption, disk images: TrueCrypt (open source: Linux, Windows, OSX)
    TrueCrypt can create portable encrypted virtual disks (files that can be mounted as a disk on any operating system).
    Encryption, files: GnuPG (open source: Linux, Windows, OSX)
    GnuPG is the tool to use to encrypt files for email.
    Presentation: Apple Keynote (commercial: OSX)
    Keynote is not quite as friendly as Microsoft PowerPoint, but it quickly produces beautiful presentations.
    Reference Library: Papers (commercial: OSX)
    “iTunes for PDF.” Manage thousands of PDFs and references, annotate with meta-data, place papers into multiple project folders. An interesting runner-up is BibDesk (open source: OSX).
    Spreadsheet: Microsoft Excel (commercial: Windows, OSX)
    Open Office and Google Docs are getting better every day, but neither comes close to Microsoft Excel in functionality and versatility of user interface. If you are on a platform that supports Excel, work regularly with spreadsheets and use something other than Excel, it really means that you do not value your time.
    Statistics Software: R (open source: Linux, Windows, OSX)
    R is rapidly becoming the platform of choice for statisticians and is (with the addition of lattice and ggplot2) the best way to produce graphs. R has a fairly nasty programming language, but has so many statistical operations available that it can not be avoided.
    Technical Documentation: LaTeX (open source: Linux, Windows, OSX)
    It may seem antiquated but TeX/LaTeX is still far more powerful than the “WYSIWYG” pretenders. The separation of presentation from specification, automatic management of references and tables of contents, and being able to include PDFs from external files (which get refreshed when you re-build the document) are all lifesavers.
    Version Control: git (open source: Linux, Windows, OSX)
    Just about the only version control system that: doesn’t damage the data you are trying to manage by adding dot-files into all of the directories, can routinely handle large files and can work productively without a network connection. Perforce is a powerful central-server commercial option (with the ability to have central policies, control and review).

    I look forward to learning which of my choices are considered poor and what your must-haves are.

    ]]>
    http://www.win-vector.com/blog/2010/05/must-have-software/feed/ 2
    Algorithmic Movie (with texture) http://www.win-vector.com/blog/2010/04/algorithmic-movie-with-texture/?utm_source=rss&utm_medium=rss&utm_campaign=algorithmic-movie-with-texture http://www.win-vector.com/blog/2010/04/algorithmic-movie-with-texture/#comments Tue, 27 Apr 2010 16:44:52 +0000 John Mount http://www.win-vector.com/blog/?p=1457
  • What is “Genetic Art?”
  • Gradients via Reverse Accumulation
  • ]]>
    We would like to share a new algorithmic movie we have created.

    Since the mid-90s we have been dabbling off and on with a combination of algorithmic and genetic art (see: What is “Genetic Art?” or try running the Java code directly in your browser). Every once in a while we return to the project and generate something we would like to share.


    For this project we have used formulas over the variables “x” and “y” to describe how color varies as a function of position on our canvas.

    This has allowed formulas like:

    ( + ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j x + k y ) k ) ) ) ) ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j x + k y ) k ) ) ) ) )

    To generate pictures like this:


    gartPicture2010_04_27_09.20.21.794.jpg

    We then add a source-texture from C. Estrade’s “Full-Color Japanese Textile Designs CD-ROM and Book” (Dover, unrestricted use):


    023.jpg

    Which (with a slightly modified formula) yields a picture like this:

    ( + ( subst ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j x + k y ) k ) ) ) ) Img23 ) ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j x + k y ) k ) ) ) ) )


    gartPicture2010_04_18_09.12.24.212.jpg

    We can further modify the formula to depend on time (represented by the new variable “z”):

    ( + ( subst ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j (x +z) + k (y + z) ) k ) ) ) ) Img23 ) ( mod ( iexp k ) ( isin ( / j ( / ( x + i y + j (x +z) + k (y + z) ) k ) ) ) ) )

    And get a movie like this:



    What we have previously called “genetic art” was the system of automatically combining and re-combining fragments of formulas using user votes and preferences (so nobody would have to see or understand these ugly formulas to produce art). What we now present is a larger “algebra” of “simple picture plus pattern = complicated pictures” and “picture plus time transformations = movie.”

    ]]>
    http://www.win-vector.com/blog/2010/04/algorithmic-movie-with-texture/feed/ 0
    SIGACT Review of: Combinatorics the Rota Way http://www.win-vector.com/blog/2010/04/sigact-review-of-combinatorics-the-rota-way/?utm_source=rss&utm_medium=rss&utm_campaign=sigact-review-of-combinatorics-the-rota-way http://www.win-vector.com/blog/2010/04/sigact-review-of-combinatorics-the-rota-way/#comments Wed, 21 Apr 2010 03:51:56 +0000 John Mount http://www.win-vector.com/blog/?p=1450
  • What is Mathematics, Really?
  • The Joy of Calculation
  • Sorting Used in Anger
  • ]]>
    SIGACT News review of: Combinatorics the Rota Way. Also found on Professor Gasarch’s page and ACM SIGACT News Volume 41, Issue 2 (paywall)

    Review of
    Combinatorics The Rota Way
    by Joseph P.S. Kung, Gian-Carlo Rota and Catherine H. Yan
    Cambridge, 2009
    396 pages, Trade Paperback
    Review by
    John Mount, jmount@win-vector.com
    April 20, 2010

    Introduction

    Combinatorics, as it matures, becomes harder to succinctly describe. The field has progressed from the basic study of finite sets and counting techniques to being the discipline where questions involving counting, graphs, connectivity, mappings and partial orders all naturally reside. But the objects that combinatorics studies turn out not to be the correct foundation to support modern combinatorial methods. Many combinatorial methods were dismissed as mere technique until combinatorics expanded to include the natural domains of these methods: lattices, formal power series, valuation rings, matroids and many diverse algebras. One person who pushed hard for this coherence and unity was Gian-Carlo Rota.

    An example of a high-school level combinatorial trick is proving the equation

    $\displaystyle \sum_{i=0}^{n} \binom{n}{i} = 2^n $

    by applying the binomial theorem to $ (1+1)^n$ . This trick is transformed into a method when you recognize that you really should be working in the ring of formal power series and invent the Umbral Calculus. With the Umbral Calculus you can use the equivalence of the following two equations:

    $\displaystyle b^n = (a+1)^n = \sum_{i=0}^{n} \binom{n}{i} a^i$
    $\displaystyle a^n = (b-1)^n = \sum_{i=0}^{n} (-1)^{n-i} \binom{n}{i} b^i$

    (i.e. $ b = a+1$ is equivalent to $ a=b-1$ ) to prove that for any two arbitrary infinite sequences $ a_i,b_i$ the following two statements are also equivalent:


    $\displaystyle b_n = \sum_{i=0}^{n} \binom{n}{i} a_i \;\text{for all}\; n \qquad (1)$
    $\displaystyle a_n = \sum_{i=0}^{n} (-1)^{n-i} \binom{n}{i} b_i \;\text{for all}\; n. \qquad (2)$

    For example: we could pick $ a_i = i$ and substitute it into Equation 1. With some work we see this implies $ b_i= 2^{i-1} i$ .[1] Then by the Umbral result we know Equation 2 must also be true so we get a new identity: $ n = \sum_{i=0}^{n} (-1)^{n-i} \binom{n}{i} 2^{i-1} i$ . This algebraic production of a new identity is very different from the classical method of “counting two ways” (or being lucky enough to come up with a clever bijection to prove the identity).
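    The equivalence of Equations 1 and 2 is easy to spot-check by machine for small n. The following Scala sketch is not from the book or the review; BinomialInversion and its helper names are made up for the illustration, and a finite check is of course evidence rather than proof.

```scala
object BinomialInversion {
  // Binomial coefficient C(n,k) in exact integer arithmetic.
  // After step j the accumulator equals C(n-k+j, j), so each division is exact.
  def binom(n: Int, k: Int): BigInt =
    (1 to k).foldLeft(BigInt(1))((acc, j) => acc * (n - k + j) / j)

  // Equation 1: b_n = sum_{i=0..n} C(n,i) a_i
  def forward(a: Int => BigInt, n: Int): BigInt =
    (0 to n).map(i => binom(n, i) * a(i)).sum

  // Equation 2: a_n = sum_{i=0..n} (-1)^(n-i) C(n,i) b_i
  def inverse(b: Int => BigInt, n: Int): BigInt =
    (0 to n).map { i =>
      val term = binom(n, i) * b(i)
      if ((n - i) % 2 == 0) term else -term
    }.sum

  def main(args: Array[String]): Unit = {
    val a: Int => BigInt = i => BigInt(i)   // the example a_i = i
    val b: Int => BigInt = k => forward(a, k)
    for (n <- 1 to 8) {
      println("n=" + n + " b_n=" + b(n) + " recovered a_n=" + inverse(b, n))
    }
  }
}
```

    For $ a_i = i$ the printed $ b_n$ should match $ 2^{n-1} n$ , and applying Equation 2 to those values should return $ n$ , which is the new identity stated above.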

    Summary

    The book “Combinatorics the Rota Way” is itself hard to succinctly describe. The first and third authors tell of writing this book using notes from the Massachusetts Institute of Technology’s course 18.315 collected over a span of more than 30 years. Gian-Carlo Rota himself was added as a posthumous author. The book itself contains more than a single course-year’s worth of material and is packed very densely.

    The book’s emphasis is abstract and algebraic. The exercises are not to teach, but are instead to identify applications of combinatorics in other mathematical disciplines. The book is the product of a strong push to demonstrate many combinatorial methods in their most powerful, but not most obvious, forms. This work is clearly a labor of love and contains some remarkable material. However, due to the large breadth of the work not much time is spent on motivation or on concrete examples.

    Chapter 1: Sets, Functions and Relations

    The first chapter covers the definitional foundations of combinatorics: sets, lattices, partial orders, functions and relations. These are the discrete objects that the book will reason about by later building more complicated algebraic objects. This section is very dense and reads like a compressed Bourbaki treatment of discrete mathematics.

    One portion of this chapter that is problematic is the section on entropy that seems to serve no purpose other than to prepare the reader for exercise 1.4.10 which demonstrates an abstraction of entropy. Also, exercises 1.2.5(j,k) are needlessly cruel in asking the reader to recreate the Robertson-Seymour graph minor theorem. There have been books where the reader is successfully guided through a major result by exercises, such as the Weak Perfect Graph Theorem in Lovász’s “Combinatorial Problems and Exercises”, but this book is not structured in that manner.

    Chapter 2: Matching Theory

    The second chapter is a welcome change in tone. It opens with a quote from Harper and Rota describing matching theory, and a clever 1979 Putnam exam problem is worked into the exercises and solutions. Central to the chapter is the “marriage theorem”, which determines when matchings are possible. Also discussed is Birkhoff’s Theorem, which states that every doubly stochastic matrix is a convex combination of permutation matrices, and which relates matchings to matrices. The text is lively and includes a number of well-researched asides, such as the origin of the name “The Hungarian Method.” However, there are some problems with forward reference: for example the reader is asked to work a couple of exercises (2.4.5 and 2.4.6) using the Binet-Cauchy formula, which isn’t discussed at length until chapter 6.

    Chapter 3: Partially Ordered Sets and Lattices

    This chapter begins with a very exciting presentation of the Möbius Function (the convolutional inverse of what is essentially the indicator function of a partial order). It is a real pleasure to see this material well presented in a general lattice setting, instead of the more common and specialized number theoretic setting. The chapter moves on to chains (ordered sequences in lattices) and anti-chains (sets of incomparable elements) in partial orders. The authors present Dilworth’s theorem which states that every partial order can be covered by a number of chains no larger than the size of the largest anti-chain.[2] The chapter continues with Sperner Theory, which relates counting anti-chains to binomial coefficients. Chapter 3 concludes with valuation rings and Möbius Algebras: a transition to the more algebraic style found in Chapter 4.

    Chapter 4: Generating Functions and the Umbral Calculus

    This is a key chapter. The book introduces the Umbral Calculus, a transform space automating the manipulation of generating functions. The algebra of delta operators is introduced, which provides an abstraction of differentiation. Finally co-algebras are explored, which abstract the processes of factoring.

    A rare (and unfortunate) typo on page 190 mis-defines a basic sequence $ p_n(x)$ for the delta operator $ Q$ as obeying $ Q p_n(x) = p_{n-1}(x)$ instead of the correct equation: $ Q p_n(x) = n p_{n-1}(x)$ . A careful reader can spot the mistake as it is inconsistent with the subsequent demonstrations and uses.

    Chapter 5: Symmetric Functions and Baxter Algebras

    This chapter treats a number of important algebraic topics. Symmetric functions are studied and identified as being the obvious class of functions that contains all of the well-known generating functions already studied. Pólya’s Enumeration Theory, which is the method of counting the number of equivalence classes of distinct arrangements, is given a very interesting exposition. But the book skips the classic examples and exercises, such as counting the number of ways to construct distinct necklaces from colored beads, that would be needed for the topic to be fully approachable. Baxter Algebras, which abstract both summation and integration by parts, are introduced via a study of the sequence shift operator. By this point the book has abstract versions of both differentiation and integration, providing a combinatorial groundwork to prove theorems on “the calculus” that are more general than is possible in any one theory of differentiation or integration.

    Chapter 6: Determinants, Matrices and Polynomials

    This chapter is most similar to classical polynomial invariant theory, the study of symmetric functions of the roots of polynomials such as the discriminant. A major theme of this chapter is the study of the relations between properties of polynomial coefficients and the locations of roots of the polynomials. The study of matrices brings us to the remarkable Binet-Cauchy Formula for the determinant of a product of matrices. The results are deep, but it is a shame that more time isn’t spent on simple concrete applications such as using the Binet-Cauchy formula to count the number of spanning trees in a graph. This chapter reveals the parts of combinatorics that come from analysis and the study of locations of roots of polynomials (via group theory), in contrast to the parts that come from enumerating finite sets, linear algebra and abstract algebra. This is also the chapter where the exterior algebra, a favorite tool of Rota’s, is most discussed.

    A typo on page 275 (a potentially confusing comma in the definition of the $ eval()$ operation) can be recovered from because the authors have the nice habit of explicitly calling out the domain and range of functions.

    Opinion

    Some important questions about this book are: is Gian-Carlo Rota a coauthor, what is the purpose of the book and who is the best audience?

    Gian-Carlo Rota seems appropriately labeled as a co-author, as clearly a lot of his work went into the book. The book is not suitable to be used as an introductory text book or as a reference. It is a book meant to be read. The ideal audience is capable of graduate level mathematics, is comfortable with a high degree of abstraction and algebra and is already familiar with many of the structures and techniques of combinatorics: sets, graphs, matrices, alternating sequences and generating functions. A mathematician or computer scientist wanting to learn more about the science of combinatorics will find a good read here.

    The book works best as a second read of the topics covered. If you already know of a combinatorial method, like Pólya’s Enumeration Theory, this book is a good place to find the starting point for an alternate and powerful treatment of the topic. The book admits to not being self contained, and has a few forward-reference problems. However, this is forgivable when you realize the goal of this book is not to teach some easy discrete mathematics before you move on to analysis, but to extract the important combinatorial methods and themes from all of mathematics.

    The content is well written, very accurate and well edited. The index is good, but not quite up to the job. The bibliography is very good and divided into three useful sections: papers by Gian-Carlo Rota and coworkers, books for further reading and a section of references.

    We close with an extract from the book at hand. Many mathematicians have used the phrase “merely combinatorial proof” as a phrase of dismissal. However, when properly founded, combinatorial proofs are in fact more general than proofs that depend on additional specific details from the original problem domain. The authors take some justifiable pleasure in including points like: “Hilbert’s basis theorem is equivalent to the ‘trivial combinatorial fact’ given in Gordan’s lemma.” This is certainly a taste of combinatorics the Rota way.


    Footnotes

    [1] For this use the binomial theorem to expand $ (1+x)^n$ , differentiate with respect to $ x$ and then substitute in $ x=1$ .
    [2] From this they derive just about the only Ramsey-theoretic style result in the book: any large partial order must have a large chain or large anti-chain.


    ]]>
    http://www.win-vector.com/blog/2010/04/sigact-review-of-combinatorics-the-rota-way/feed/ 0
    Deming, Wald and Boyd: cutting through the fog of analytics http://www.win-vector.com/blog/2010/04/deming-wald-and-boyd-cutting-through-the-fog-of-analytics/?utm_source=rss&utm_medium=rss&utm_campaign=deming-wald-and-boyd-cutting-through-the-fog-of-analytics http://www.win-vector.com/blog/2010/04/deming-wald-and-boyd-cutting-through-the-fog-of-analytics/#comments Tue, 20 Apr 2010 22:53:03 +0000 John Mount http://www.win-vector.com/blog/?p=1421
  • Statsmanship: Failure Through Analytics Sabotage
  • ]]>
    This article is a quick appreciation of some of the statistical, analytic and philosophic techniques of Deming, Wald and Boyd. Many of these techniques have become pillars of modern industry through the sciences of statistics and operations research.

    We start with W. Edwards Deming. Deming was a statistician who designed many of the production methods of post-war occupied Japan. Deming’s work on quality quantification, measurement and continuous improvement formed the fundamental basis of Japan’s later rise as a respected manufacturing superpower. Many of the further improved techniques were later imported into the United States as “eastern wisdom.” However, some of the lesser ideas were perverted by eager followers into destructive cargo-cult rituals like “six sigma” (we must remember that it was the depth and power of Deming’s ideas that attracted the imitators).

    One of Deming’s most fundamental ideas was the “PDCA loop.”


    PDCA.png

    The PDCA loop is a cycle of conceptual and analytic effort that sequences repeatedly through the stages Plan, Do, Check and Act. The cycle starts with a plan, and the next cycle’s plan is influenced by the results of the previous cycle. The explicit Check and Act steps show the presumption that the Do step will always need measurement and correction. This cycle is designed to help mitigate the observation, usually attributed to Helmuth von Moltke the Elder, that “no campaign plan survives first contact with the enemy.” Deming’s idea is essentially the systematic application of the scientific method (“propose/test,” or Francis Bacon’s Novum Organum of 1620) to the adaptation and implementation of plans.

    While Deming was teaching planning and “statistical process control” to boost US wartime production, a number of other statisticians were having great success in developing reactive strategies. One of the best stories is that of Abraham Wald. Wald became interested in Allied aircraft mortality during World War II. He prepared a number of studies and charts of surviving aircraft, tabulating where bullet and shrapnel damage was most extensive. He could, for example, combine inspections of many returning bombers to determine where the returning bombers had the most damage (say the bulk area of the fuselage and the leading edges of the wings):


    b25b.png

    Wald then had the genius idea of proposing additional armor on the parts of the aircraft that never showed any hits on surviving aircraft (reasoning that aircraft routinely took damage everywhere, so the undamaged areas in surviving aircraft must be the areas more often damaged in the unobserved, non-returning lost aircraft). From the above diagram we might propose to add more armor near the pilots, engines and trailing control surfaces. Wald later published sophisticated statistical techniques for imputing the distribution of hits (and therefore the distribution of vulnerabilities) on the unobserved aircraft: “A Method of Estimating Plane Vulnerability Based on Damage of Survivors,” Abraham Wald, Center for Naval Analyses (1943).
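
    Wald’s reasoning can be illustrated with a small simulation. The following is only a sketch with made-up numbers, not Wald’s actual estimation method: the section names, the per-hit loss probabilities and the fleet size are all assumptions chosen for illustration. Hits are spread evenly over the aircraft, yet the sections where a hit is most often fatal show the fewest hits among the planes that return.

```r
# Hypothetical illustration of survivorship bias (all numbers are invented).
set.seed(2010)

sections <- c("fuselage", "wings", "engines", "cockpit")
# Assumed chance that a single hit in each section downs the plane.
pLossPerHit <- c(fuselage = 0.05, wings = 0.05, engines = 0.6, cockpit = 0.6)

nPlanes <- 10000
hitsPerPlane <- 3

# Each row is a plane; each entry is the section struck by one hit.
# Hits land uniformly across sections, so no section attracts extra fire.
hits <- matrix(sample(sections, nPlanes * hitsPerPlane, replace = TRUE),
               nrow = nPlanes, ncol = hitsPerPlane)

# A hit is fatal with the probability assumed for its section.
fatal <- matrix(runif(nPlanes * hitsPerPlane) < pLossPerHit[as.vector(hits)],
                nrow = nPlanes, ncol = hitsPerPlane)
survived <- rowSums(fatal) == 0

# Hits per section over the whole fleet versus over returning planes only.
allHits <- table(factor(hits, levels = sections))
observedHits <- table(factor(hits[survived, , drop = FALSE], levels = sections))

print(allHits)
print(observedHits)
# Fraction of each section's hits that is ever seen by an inspector.
print(round(observedHits / allHits, 2))
```

    An inspector who sees only `observedHits` would find the engines and cockpit comparatively clean, which is exactly the pattern Wald read as a sign of vulnerability rather than safety.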

    This art of reactive observation was later systematized by Colonel John Boyd. Boyd invented what he called the “OODA loop.”


    OODA.png

    This loop cycles similarly to Deming’s through a pattern of Observe, Orient, Decide and Act. The OODA loop differs from the PDCA loop in that it assumes a world that looks back and adapts against your actions. Boyd added ideas of tempo and pace such as “short cutting the loop” (skipping from act to orient or even act to decide) to adapt faster than nature or than your enemy.

    Boyd is also famous for applying his and Wald’s ideas in the design of the A-10 Warthog. The A-10 is a unique non-stealth, subsonic close air support plane. It is considered one of the ugliest things to ever fly. The A-10 was not state of the art when it was introduced but it was scientifically designed for survival in the style of Wald. The engine intakes are partially protected by the wings, there is extra titanium armor around the pilot and a primitive direct lever control system in addition to the traditional hydraulics. The A-10 is known for its “lingering ability,” or ability to stay near troops under fire to deliver support. It has also allowed pilots like Major Kim Reed-Campbell to fly for an hour and return to base after losing pieces of wing and all hydraulics. Here is a picture of Reed-Campbell inspecting her damaged A-10 in 2003 after safely landing:


    Reed-Campbell.jpg

    Deming, Wald and Boyd were able to move statistics and analytics beyond description and use mathematics for prescription. The techniques they developed for planning, measurement and reasoning remain relevant to this day.

    Related posts:

    1. Statsmanship: Failure Through Analytics Sabotage

    ]]>
    http://www.win-vector.com/blog/2010/04/deming-wald-and-boyd-cutting-through-the-fog-of-analytics/feed/ 0
    R annoyances http://www.win-vector.com/blog/2010/03/r-annoyances/?utm_source=rss&utm_medium=rss&utm_campaign=r-annoyances http://www.win-vector.com/blog/2010/03/r-annoyances/#comments Sat, 20 Mar 2010 18:49:42 +0000 John Mount http://www.win-vector.com/blog/?p=1407
  • R examine objects tutorial
  • Relative returns: a banker versus trader paradox
  • Sorting Used in Anger
  • ]]>
    Readers returning to our blog will know that Win-Vector LLC is fairly “pro-R.” You can take that to mean “in favor of R” or “professionally using R” (both statements are true). Some days we really don’t feel that way.
    Consider the following snippet of R code where we create a list with a single element named “x” that refers to a numeric vector. We start with a demonstration of the hard-coded method of pulling the x-value back out using the “$” operator.

    > l <- list(x=c(1,2,3))
    > l$x
    [1] 1 2 3
    

    But suppose we wanted to automate this; that is, pass in the name of the value we want in a variable. We are after all using a computer, so automating a step seems like a reasonable desire. R supplies a notation for this using the “[]” operator. But something slightly different comes out under the “[]” operator than under the “$” operator:

    > varName <- 'x'
    > l[varName]
    $x
    [1] 1 2 3
    

    Notice that the printed outputs are slightly different (one echoes "$x" and one does not). Let's use the "class()" function to see what is actually being returned in each case.

    > class(l$x)
    [1] "numeric"
    > class(l['x'])
    [1] "list"
    

    The two operators return completely different types (in one case a numeric vector, in the other a general list), and these types are not interchangeable.
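
    For reference, R also has a double-bracket operator "[[ ]]" that does accept a name held in a variable and extracts the element itself. The following is a minimal sketch reusing the list from above; the names "sub" and "elem" are introduced here only for illustration. It does not remove the surprise that the visually similar "[]" behaves differently.

```r
l <- list(x = c(1, 2, 3))
varName <- 'x'

# Single brackets return a sub-list that still wraps the vector.
sub <- l[varName]
# Double brackets extract the element itself, as `$` does.
elem <- l[[varName]]

print(class(sub))            # "list"
print(class(elem))           # "numeric"
print(identical(elem, l$x))  # TRUE
```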

    At this point you may think it is time to turn in our "pro" label and call ourselves "newb" (Internet slang for "newbie" or "idiot"). But let's slow down for a bit. When two views of the same situation disagree (such as the difference in opinion between the authors of R and myself on whether the "[]" and "$" operators should return the same type) you know only that at least one of those views is wrong. You don't know whether either view is right or, if one is, which one. I can, however, bring in an additional argument to try to show that the design of R is in fact wrong: "The Principle of Least Astonishment." This principle roughly says that it is a mistake to introduce unnecessary differences in outcomes (which to the unprepared user are unpleasant surprises). There may be some deep (yet obscure) reasons the two operators return different results. But the fact that you would have to find a way to document and explain these differences should make one think that this situation is a mis-design and the "explanation" is an attempt at a workaround. Or to put it more rudely: there may be an explanation, but there is no excuse.

    For another example consider creating a 3 by 3 matrix:

    > m <- matrix(c(1,2,3,1,1,1,0,0,1),nrow=3,ncol=3)
    > m
         [,1] [,2] [,3]
    [1,]    1    1    0
    [2,]    2    1    0
    [3,]    3    1    1
    

    Now select the last two rows of the matrix.

    > m[c(FALSE,TRUE,TRUE),]
         [,1] [,2] [,3]
    [1,]    2    1    0
    [2,]    3    1    1
    >
    

    Now (for the punchline) try to select just the middle row of the matrix.

    > m[c(FALSE,TRUE,FALSE),]
    [1] 2 1 0
    

    Notice that once again (and without warning) the result is subtly different. I admit that it seems paranoid to worry about such small differences, but when you are debugging a system that should work these are exactly the killing mistakes you are looking for. In this case the problem is pretty bad. See what happens if you ask for the dimension of each of these differing returns:

    > dim(m[c(FALSE,TRUE,TRUE),])
    [1] 2 3
    > dim(m[c(FALSE,TRUE,FALSE),])
    NULL
    

    The first case works fine (reports 2 rows and 3 columns). The second case returns "NULL" (instead of 1 row and 3 columns). In R, NULL is sometimes used as an error-value (instead of throwing an exception) and this value will poison any further conditions or calculations it is involved in. The main way to deal with the arbitrary introduction of such NULLs is the incredibly tedious and uncertain defensive coding practices that we argue against in Postel’s Law: Not Sure Who To Be Angry With. Such code weakens both programs and programmers.
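
    To make the "poisoning" concrete, here is a small sketch of how the NULL travels through an otherwise innocent row-count check. The variable names are hypothetical and the check itself is only an assumed example of downstream code. The error does not appear at the subsetting step but later, at the "if".

```r
m <- matrix(c(1,2,3,1,1,1,0,0,1), nrow=3, ncol=3)
middle <- m[c(FALSE,TRUE,FALSE),]

# dim() is NULL for a plain vector, so the "row count" is NULL as well.
nRows <- dim(middle)[1]
# Comparing NULL with a number yields a zero-length logical, not TRUE/FALSE.
comparison <- nRows > 0

# Using that zero-length value as a condition is an error; capture its message.
outcome <- tryCatch(
  { if (comparison) "has rows" else "empty" },
  error = function(e) conditionMessage(e))

print(is.null(nRows))
print(length(comparison))
print(outcome)
```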

    But what is going on in this example? Once again we use the "class()" function to inspect the subtly different results.

    > class(m[c(FALSE,TRUE,TRUE),])
    [1] "matrix"
    > class(m[c(FALSE,TRUE,FALSE),])
    [1] "numeric"
    

    The result is disappointing. For a two-row select R returns a matrix (what we would expect). For a single-row select R does us the "favor" of converting the result into a vector. This is a disaster. A single-row matrix is similar to a vector, but even R itself does not support the same set of operations and outcomes on vectors as it does on matrices (for example the failure of "dim()" above). It is not safe to calculate further with these results (without converting the result by hand back to a single-row matrix, which R can in fact represent). In my case this created crashing bugs deep in a long-running analysis (and was hard to diagnose as the bug was in an "innocent operation" not in a "risky calculation").
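
    One way to guard against this, sketched below, is the "drop" argument that R's matrix subsetting accepts: passing drop=FALSE asks "[" to keep dimensions of extent one. The variable name "middle" is ours. Note the cost: the argument has to be remembered at every subsetting site, which is the kind of defensive coding complained about above.

```r
m <- matrix(c(1,2,3,1,1,1,0,0,1), nrow=3, ncol=3)

# drop=FALSE keeps the result a matrix even when only one row is selected.
middle <- m[c(FALSE,TRUE,FALSE), , drop=FALSE]

print(middle)
print(dim(middle))        # 1 3
print(is.matrix(middle))  # TRUE
```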

    All of this has to violate John Chambers' "Prime Directive" for data: "an obligation on all creators of software to program in such a way that the computations can be understood and trusted." Chambers' opinion is relevant, as he is the author of the S language (of which R is an open source re-implementation). We continue to recommend R, but we also recommend being exceptionally careful when using it (which unfortunately adds time to projects).

    Related posts:

    1. R examine objects tutorial
    2. Relative returns: a banker versus trader paradox
    3. Sorting Used in Anger

    ]]>
    http://www.win-vector.com/blog/2010/03/r-annoyances/feed/ 10