Win-Vector Blog http://www.win-vector.com/blog The Applied Theorist's Point of View Wed, 03 Feb 2010 23:56:37 +0000 http://wordpress.org/?v=2.9.1 en hourly 1 Living in A Lognormal World http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/ http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/#comments Wed, 03 Feb 2010 23:46:37 +0000 Nina Zumel http://www.win-vector.com/blog/?p=1388
  • Statistics to English Translation, Part 2b: Calculating Significance
  • A Demonstration of Data Mining
  • Good Graphs: Graphical Perception and Data Visualization
    Recently, we had a client come to us with (among other things) the following question:
    Who is more valuable, Customer Type A, or Customer Type B?

    This client already tracked the net profit and loss generated by every customer who used his services, and had begun to analyze his customers by group. He was especially interested in Customer Type A; his gut instinct told him that Type A customers were quite profitable compared to the others (Type B) and he wanted to back up this feeling with numbers.

    He found that, on average, Type A customers generate about $92 profit per month, and Type B customers average about $115 per month (The data and figures that we are using in this discussion aren’t actual client data, of course, but a notional example). He also found that while Type A customers make up about 4% of the customer base, they generate less than 4% of the net profit per month. So Type A customers actually seem to be less profitable than Type B customers. Apparently, our client was mistaken.

    Or was he?

    A little more elementary statistics revealed that the median profit generated by Type A customers is $65; that is, half the customers from group A generate more than $65 profit per month. The median for Type B customers is about $4.80, so half the customers from group B generate less than five dollars profit every month. Maybe our client’s instincts aren’t completely off-base.

    Let’s look at the distribution of net profit across both customer populations:

    densityAll.png
    Figure 1: Distribution of net profit for Type A customers (blue) and Type B customers (red). The x-axis gives the net profit or loss, and the y-axis gives the fraction of the population that generates a given net profit.

    This pattern is typical among the customers of many businesses. The majority of customers generate relatively moderate profit (or loss); but there is an important minority of large-profit and large-loss customers out on both tails. In this case, the monthly customer value actually ranges from losses in the tens of thousands to profits of several hundred thousands (I clipped the graph, for “clarity”).

    I hesitate to call these large magnitude customers “outliers” because that term implies anomalous, possibly erroneous, data. In this case, the “outliers” are relatively rare, but important, customers who can potentially make the difference between a company that is in the black or in the red. Still, they are the exception and their behavior doesn’t necessarily tell you anything about the behavior of your typical customer. Knowing the mean profitability of a given customer group is important, of course, but the estimate will be dominated by your exceptionally profitable or lossy customers in that group, and as we’ve seen, that hides information about the majority of your customers.

    You might remember from our Good Graphs article that if you have positively skewed data with a wide dynamic range, graphing the data on a log scale helps you see phenomena across the entire range of data that you might miss on the ordinary graph. Unfortunately, we have data here in both the positive and negative range. So let’s split the customers into three groups: profitable, unprofitable, and break-even. About 5-6% of the customer base is break-even, roughly the same proportion in Groups A and B; we’ll ignore them for now, and look at the profitable customers first (over 80% of the customers, in both groups).

    positiveCusts.png
    Figure 2: Distribution of profit from profitable Type A customers (blue) and Type B customers (red). The x-axis gives net profit on a log 10 scale, so every labelled tick corresponds to a change by a factor of 100 (e.g. 10^0 = $1, 10^2 = $100, and so on). The y-axis represents the fraction of the profitable customer base that generates a given profit.

    Now we can clearly see that (among profitable customers) the typical Type A customer is in fact more profitable than the typical Type B customer. The mean profit from profitable Type A customers is about $227, and the median profit is about $93 (shown by the dashed blue line). About 2/3 of the profitable Type A customers generate between $21 and $400 in profit, and over 95% of them generate between $5 and $1721 in profit. We can call that 95% the set of “typical” profitable Type A customers. That’s not a standard definition, but it’s an intuitive one, and useful for this discussion.

    Approximately 2.5% of Type A customers generate profits greater than $1721; let’s call them the Type A “best-customers,” some of whom generate profits in the tens of thousands. They are responsible for 30% of the profit that comes from profitable Type A customers, and 3% of the profit that comes from all profitable customers (even though they only make up 0.2% of that population).

    Profitable Type B customers generate $148 mean profit, and about $7.67 median profit (the red dashed line). A typical profitable Type B customer generates between six cents and $1031 in profit — a lower range than what the typical Type A customer generates, although the very highest-performing Type B customers are competitive with the highest-performing Type A customers (about 130 Type B customers outperform all the Type A customers).

    Unfortunately, when Type A customers are unprofitable, they are typically more unprofitable than those of Type B. This is another reason why the mean profit from Type A customers overall was so low. Our client correctly perceived that Type A customers are typically quite profitable, but there is a small population of real clunkers in the group, too.

    negativeCusts.png
    Figure 3: Distribution of loss from unprofitable Type A customers (blue) and Type B customers (red). The x-axis gives loss on a log 10 scale; further to the right on the graph means a larger loss. An unprofitable Type A customer loses a median of $137 a month, and a mean of $1180. Unprofitable Type B customers lose a median of $4.80, and a mean of $210.

    We can do a similar analysis for the entire base of profitable customers. We would find that the typical profitable customer generates between six cents and $1200 in profit every month (median $8.65, mean $153), and that the 2.5% of best-customers generate over 60% of the profits.

    The Lognormal Distribution

    lognormalComp.png
    Figure 4: (Left) Distribution of profitable customers (graph clipped at $10,000). The x-axis gives the net profit, and the y-axis gives the fraction of the population that generates a given net profit. (Right) Distribution of profitable customers plotted on a log scale.

    The distribution of highly skewed positive data, like the value of profitable customers, incomes, sales, or stock prices, can often be modelled as a lognormal distribution: that is, the log of the data is distributed in a bell-shaped curve centered (in log space) at the log of the median of the data (remember, for a normal curve, the median and the mean are the same). In our case, both the profits (seen above, in Figure 4) and the losses are distributed approximately lognormally. For lognormal populations, the mean is generally much higher than the median, and the bulk of the contribution towards the mean will be made by a small population of highest-valued data points. If you use the mean as a stand-in for value, you will overstate the value of most of your customers.

    If your customer value data is distributed approximately lognormally, then you can quickly estimate the range of values that 95% of your customers will fall into. About 95% of normally distributed data will fall within plus/minus two standard deviations of the mean, and taking logarithms converts multiplication into addition. So: if sd is the standard deviation of the natural log of your customer value data, M is the median profit, and k = exp(sd), then 95% of your customers will fall in the value range (M/(k*k), M*k*k). The 2.5% of customers who generate more than M*k*k profit are your best-customers, who often drive a majority of your profit.
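    The rule of thumb above can be sketched in a few lines of Python. This is only an illustration on synthetic data: the median ($8.65) and log-scale standard deviation (2.47) are assumptions chosen to loosely resemble the notional figures in this article, and the helper name typical_range is hypothetical.

```python
import math
import random
import statistics


def typical_range(values):
    """Return (low, median, high) with (low, high) = (M/(k*k), M*k*k).

    sd is the standard deviation of the natural log of the data,
    M is the median and k = exp(sd). All values must be positive.
    """
    logs = [math.log(v) for v in values]
    sd = statistics.stdev(logs)
    m = statistics.median(values)
    k = math.exp(sd)
    return m / (k * k), m, m * k * k


# Synthetic, made-up customer profits drawn from a lognormal distribution.
random.seed(0)
profits = [random.lognormvariate(math.log(8.65), 2.47) for _ in range(100_000)]

low, median, high = typical_range(profits)

# Fraction of customers inside the "typical" range (should be near 95%).
inside = sum(low <= p <= high for p in profits) / len(profits)

# Share of total profit generated by the best-customers above the range.
best_share = sum(p for p in profits if p > high) / sum(profits)

print(f"typical range: ${low:.2f} to ${high:.2f} (median ${median:.2f})")
print(f"fraction of customers inside the range: {inside:.3f}")
print(f"share of profit from customers above the range: {best_share:.3f}")
```

    On real data the fit to a lognormal is only approximate, so it is worth checking the fraction that actually falls inside the estimated range before relying on it.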

    Long Tail Theory

    The distribution of customers above sounds a lot like Chris Anderson’s Long Tail Theory of consumer goods. Most of the revenue of (for example) a bookseller or a music store comes from a few “hits”, or blockbusters, with the rest of the merchant’s inventory out along the tail of Figure 5, moving a relatively small volume per title.

    LongTailComp.png
    Figure 5: (Top) A notional long tail curve. The y-axis represents sales volume, and the x-axis represents goods ranked from most to least popular. The highest selling goods are to the left. Note that this figure represents the sales curve differently from how the distribution of customer value is represented on the left side of Figure 4. (Bottom) The customer value data (top 10,000 customers) from Figure 4, plotted in the style above. The y-axis has been limited to $50,000 for clarity.

    Anderson generally assumes that sales of such goods are distributed as a power law distribution, rather than a lognormal; the log of power law data isn’t distributed symmetrically, but actually has a longer tail to the right. This means that even for the log of the data, the mean is higher than the median. In fact, in some cases, the mean of a power law distribution can be infinite. If sales volume is power law distributed, then top-selling hits are responsible for an even larger percentage of total sales volume than would be the case with a lognormal.

    The Pareto Distribution, which is one form of a power law distribution, has been proposed as an alternative to the lognormal for modelling income distribution and other similar phenomena. Researchers have debated whether lognormal or Pareto is a better model for income distribution since at least the 1950s. Qualitatively, the two distributions have similar behavior. There are certain estimation and forecasting tasks where it does make a difference if your data follows a power law rather than a lognormal, but for the purposes of this discussion, it doesn’t really matter. For those who are interested, Michael Mitzenmacher has a fairly approachable discussion about the difference between power laws and lognormal distributions.

    Back to Long Tail Theory. Historically, merchants tend to concentrate on high-volume items, due to space limitations and the cost of holding inventory. Overall, however, the sum total of tail-product sales will add up to a respectable volume, especially for web retailers who have unlimited “floor space” — or so the Long Tail theory goes. A retailer must then decide whether to follow the traditional “hits-oriented” strategy, or a more “tail-oriented” strategy that caters to the numerous niche markets.

    If we draw an analogy with customer value, then best-customers are “hits.” Obviously, our client would like to “fire” his unprofitable customers while retaining his best-performing customers, and even attract more customers like them. But what about his little customers — the 95% of customers in the typical range? If his retention and growth strategy focuses primarily on attracting and retaining big customers, he is following a hits-oriented strategy. If his campaign also includes reaching out to little customers, then he is following something analogous to a tail strategy.

    Not all business works like a music or book seller; the appropriate strategy will vary. Still, we can think of a few reasons why keeping little customers happy is a good idea.

    For one thing, big customers are not only rare, but they are the ones that your competitors covet the most. Little customers, meanwhile, can still add up to a respectable chunk of change (close to 40% of net profit in our example above). A solid cushion of smaller customers may soften the blow to your profit margin, should a few of your bigger customers defect.

    logos.png

    Consider computer sales. Microsoft and Dell serve both the corporate and consumer markets. To judge from their past marketing practices, they consider business customers to be the more valuable segment (see here for a rant somewhat related to this topic). But business IT sales have declined in the current moribund economic climate; analysts attribute the growth in computer sales for the last quarter of 2009 primarily to consumer spending. Dell’s market growth for that last quarter was much lower than that of HP, Acer, and Apple, which are more consumer-oriented companies. It’s also worth noting that Microsoft saw a 14% decline in revenue for the quarter ending September 30, 2009, compared to the year-ago quarter (and their earnings were in large part due to sales of the Xbox, a consumer product), while at the same time, consumer-oriented Apple saw a 24% increase in revenue from its year-ago quarter.

    Your pool of little customers is also a pool of potential future best-customers. And you can’t always guess which ones. So a wise strategy might be to allocate part of your retention and growth campaign to providing loyalty incentives to smaller customers, and educating them about how your higher-end services or products might benefit them. Those little customers who have the means or opportunity to move on to the next level might very well appreciate your efforts, and stay with you, rather than defecting to a competitor.

    Optimizing Sales vs. Optimizing Customers

    One last thought about retail hits and high-value customers. McPhee’s Theory of Exposure, which is cited by Anita Elberse in her Harvard Business Review article “Should You Invest in the Long Tail?”, states that the popularity of music, film, TV or books is largely driven by “marginal audience participants”: the casual, or light, consumer. Casual consumers gravitate to already popular products because they have limited exposure to alternatives, and hence limited knowledge of them. Consumers of more obscure products, on the other hand, tend to be heavy (and knowledgeable) consumers: voracious readers, dedicated music or film buffs, or enthusiasts of specific genres, like science-fiction or horror.

    albums.png

    McPhee’s research was done in 1963, using subjects who had a fairly small range of choices, compared to internet scale. Elberse found, however, that the phenomena McPhee described still held for the internet merchants that she studied. She uses this observation (along with McPhee’s companion theory of Double Jeopardy) to argue that retailers should not substantially alter their traditional hits-based strategies. There is an alternative interpretation:

    If your business follows McPhee’s theory, then hit products disproportionately attract low-value (low-volume) customers, and vice-versa.

    So an overly hits-oriented strategy will skew you towards a base of low-value customers. Indeed, Seth Godin argues that iTunes and Amazon, who are in a better position to implement a more tail-oriented strategy, are thriving at the expense of physical stores exactly because they have been able to steal the quality (high-volume) customers away.

    The moral is that both sales and customer value live in a lognormal world, where blockbuster products are marketed to a large cloud of low revenue customers, and high revenue best-customers are supported by large catalogues of low volume products. Fail to serve one side of this relationship, and you risk losing the other side.

    Related posts:

    1. Statistics to English Translation, Part 2b: Calculating Significance
    2. A Demonstration of Data Mining
    3. Good Graphs: Graphical Perception and Data Visualization

    http://www.win-vector.com/blog/2010/02/living-in-a-lognormal-world/feed/ 0
    Winter 2010 Subscription Campaign http://www.win-vector.com/blog/2010/01/winter-2010-subscription-campaign/ http://www.win-vector.com/blog/2010/01/winter-2010-subscription-campaign/#comments Mon, 18 Jan 2010 21:57:42 +0000 John Mount http://www.win-vector.com/blog/?p=1356
  • Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’
  • Statistics to English Translation, Part 2b: Calculating Significance
  • Hello World: An Instance Of Rhetoric in Computer Science
    We at Win-Vector LLC would like to invite our loyal readers to help with our Winter 2010 Subscription Campaign. Please encourage your erudite friends and colleagues to read and subscribe to http://www.win-vector.com/blog/.
    Here are some of our most popular articles broken down by area of interest:

    Related posts:

    1. Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’
    2. Statistics to English Translation, Part 2b: Calculating Significance
    3. Hello World: An Instance Of Rhetoric in Computer Science

    http://www.win-vector.com/blog/2010/01/winter-2010-subscription-campaign/feed/ 0
    “Easy” Portfolio Allocation http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/ http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/#comments Thu, 14 Jan 2010 20:09:13 +0000 John Mount http://www.win-vector.com/blog/?p=1342
  • A Quick Appreciation of the Sharpe Ratio
  • A Discrete Model Gauging Market Efficiency
  • What is the gambler’s equivalent of Amdahl’s Law?
    This is an elementary mathematical finance article. This means if you know some math (linear algebra, differential calculus) you can find a quick solution to a simple finance question. The topic was inspired by a recent article in The American Mathematical Monthly (Volume 117, Number 1 January 2010, pp. 3-26): “Find Good Bets in the Lottery, and Why You Shouldn’t Take Them” by Aaron Abrams and Skip Garibaldi, which said optimal asset allocation is now an undergraduate exercise. That may well be, but there are a lot of people with very deep mathematical backgrounds who have yet to see this. We will fill in the details here. The style is terse, but the content should be about what you would expect from one day of lecture in a mathematical finance course.

    Portfolio allocation is not the “magic predict the future” part of finance; it is the scheme for correctly applying magic predictions of the future. The idea is that if you had a prediction of future returns of a number of assets, the naive thing to do would be to invest everything into the asset with highest predicted return. Portfolio theory, while still taking the predictions at face value, picks an investment pattern that will (in risk-adjusted dollars) outperform the naive strategy even if the predictions are correct and is a bit safer when the predictions are wrong.

    Suppose you had $ n$ different assets you could invest in. For the $ i$ -th asset there is an expected excess relative return of $ \mu_i$ and an estimated variance of $ s_i$ (for a definition of relative return see Relative returns: a banker versus trader paradox and for a definition of variance see A Quick Appreciation of the Sharpe Ratio). Let the vector $ X$ be such that $ X_i$ represents the number of dollars we invest in the $ i$ -th asset. If $ X_i$ is positive then our plan is “to go long” or buy some of the $ i$ -th asset. If $ X_i$ is negative our plan is “to short” or sell some of the $ i$ -th asset to somebody else (It is called going short as we actually sell something we do not have. This is often allowed in finance, as long as we make the same pay-outs to the buyer that the buyer would receive if we really had the item to sell).

    When we appeal to the idea of optimizing the portfolio Sharpe Ratio (again, see A Quick Appreciation of the Sharpe Ratio) then we say a good portfolio is one that doesn’t just maximize expected relative returns (which is $ X^{\top} \mu$ ) but maximizes the ratio of expected relative return to standard deviation:

    $\displaystyle \frac{X^{\top} \mu}{\sqrt{X^{\top} C X}} $

    where (for now) $ C$ is the diagonal matrix with the variances $ s_i$ on its diagonal ($ C_{ii} = s_i$ and zeroes elsewhere). This ratio is called a “risk adjusted return” (versus the un-adjusted form $ X^{\top} \mu$ ). Also notice that the ratio is homogeneous in $ X$ (doubling $ X$ does not change the ratio as it simultaneously doubles the numerator and the denominator) so an optimal solution $ X$ describes not how much to invest, but what pattern to invest in. This allows us to introduce an important practical constraint: we are only going to allow ourselves to risk a total of $ T$ dollars (both long and short). That is: we insist $ \sum_{i=1}^{n} \vert X_i\vert = T$ . We will ignore this total investment constraint until the end when we can satisfy the constraint by simply re-scaling a partial solution.

    To solve for $ X$ we introduce an old friend: Lagrange Multipliers (or equivalently the Karush-Kuhn-Tucker conditions of optimality). Since the fraction we are trying to optimize is homogeneous in $ X$ we can convert the denominator into a constraint and arbitrarily insist that $ \sqrt{X^{\top} C X} = 1$ without changing the nature of the problem. We are now trying to maximize $ X^{\top} \mu$ subject to $ \sqrt{X^{\top} C X} = 1$ . The Lagrangian conditions of optimality state that at the optimum the gradient of the objective must be proportional to the gradient of the constraint, or:

    $\displaystyle \nabla_X X^{\top} \mu = \lambda \nabla_X ( \sqrt{X^{\top} C X} - 1 ) $

    for some (to be determined) constant $ \lambda$ . Pushing the gradient operator through we get:

    $\displaystyle \mu = \lambda (1/2) ( X^{\top} C X )^{-1/2} 2 C X . $

    A similar equation could be gotten by appealing to a Rayleigh Quotient argument.

    We do not yet know $ X$ (that is what we are trying to solve for), so we do not know what $ X^{\top} C X$ is. However, this is just a scalar and since we are just trying to solve up to a multiple we can throw it out and introduce a new multiple and see that it is enough to solve:

    $\displaystyle \mu = \lambda' C X $

    where $ \lambda'$ is a new (still unknown) scalar. This means we have:

    $\displaystyle X = (1/\lambda') C^{-1} \mu $

    so our desired solution is some re-scaling of $ C^{-1} \mu$ .

    As we stated earlier we have a total investment constraint of $ \sum_{i=1}^{n} \vert X_i\vert = T$ . We can achieve this with the following adjusted solution:

    $\displaystyle X = \frac{T}{\sum_{i=1}^{n} \vert(C^{-1} \mu)_i\vert} C^{-1} \mu $

    as our desired optimal portfolio allocation. In the end we can solve for the optimal portfolio by merely solving a linear system (we don’t need anything as expensive as a general purpose optimizer in this case).
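    The recipe above (solve $ C y = \mu$ , then rescale so the absolute allocations sum to $ T$ ) can be sketched in plain Python. This is a minimal illustration, not a production allocator: the returns, variances and budget are made-up numbers, the example covariance is diagonal (assets assumed independent), and the helper names solve_linear, allocate and sharpe are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def allocate(mu, cov, total):
    """Return X proportional to C^{-1} mu with sum(|X_i|) == total."""
    y = solve_linear(cov, mu)
    scale = total / sum(abs(v) for v in y)
    return [scale * v for v in y]


def sharpe(x, mu, cov):
    """Ratio of expected return X^T mu to standard deviation sqrt(X^T C X)."""
    n = len(x)
    ret = sum(xi * mi for xi, mi in zip(x, mu))
    var = sum(x[i] * cov[i][j] * x[j] for i in range(n) for j in range(n))
    return ret / var ** 0.5


# Made-up example: three independent assets.
mu = [0.08, 0.05, 0.03]                  # expected excess returns
cov = [[0.04, 0.0, 0.0],
       [0.0, 0.01, 0.0],
       [0.0, 0.0, 0.0025]]               # variances on the diagonal
T = 1000.0                               # total dollars at risk

X = allocate(mu, cov, T)
naive = [T, 0.0, 0.0]                    # everything in the highest-return asset

print("allocation:", [round(v, 2) for v in X])
print("risk adjusted return, optimized:", round(sharpe(X, mu, cov), 4))
print("risk adjusted return, naive:    ", round(sharpe(naive, mu, cov), 4))
```

    In this made-up example most of the money goes to the lowest-return asset, because its variance is so much smaller; that is the risk adjustment at work.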

    These are very old results (going back as long as there have been Sharpe Ratios and portfolio theory). A good example reference is: “The Valuation of Risk Assets and the Selection of Risky Investments in Stock Portfolios and Capital Budgets,” John Lintner, The Review of Economics and Statistics (1965) vol. 47 (1) pp. 13-37. These results are the basis for advice like: “diversify.” Without modeling risk you would tend to put all of your money in the predicted highest paying asset. When modeling risk you tend to put some of your money in each high paying asset and as long as they do not all fail at the same time you have some safety. Another (very different) route to diversification is the Kelly Criterion (discussed in What is the gambler’s equivalent of Amdahl’s Law?).

    A very important risk we have not yet modeled is that our assets may have a tendency to fail at the same time (meaning we may not have really diversified usefully). The notion that assets may fail at the same time brings us to the ideas of correlation and covariance. When we took $ C$ to be the diagonal matrix of the variances $ s_i$ we were implicitly assuming (or modeling), without justification, that each possible asset was independent of all the others (that there was no correlation between asset returns). This is, of course, not going to be anywhere near true in practice. Instead we should take $ C$ to be the Covariance Matrix that represents our estimate of the asset-to-asset correlations. In this case the solution methods above all work exactly as before. Companies such as MSCI Barra have made complete businesses out of producing and selling estimates of $ C$ .

    Another issue is when we do not allow ourselves to “short” (or take a negative allocation of) assets. In this case we have the additional constraints $ X \ge 0$ which complicate our solution. For the special case where the assets are assumed to be independent (i.e. $ C$ is diagonal, with only the variances $ s_i$ on its diagonal) it is enough to solve as above and merely replace any negative allocations with zero when inspecting and scaling the final step of the solution. When the covariances are non-trivial ($ C$ has non-zero off-diagonal entries) this solution may not be optimal. In this case the Karush-Kuhn-Tucker conditions are more complicated and at the point of optimal solution we have the following conditions:

    $\displaystyle \mu + \lambda C X - \sum_{i=1}^{n} \tau_i E^i = 0 $
    $\displaystyle X \ge 0 $
    $\displaystyle \sum_{i=1}^{n} X_i = T $
    $\displaystyle \tau \ge 0 $
    $\displaystyle \tau^{\top} X = 0 $



    where $ X$ is the allocation vector we wish to solve for, $ \lambda$ is an unknown scalar, $ \tau$ is a new unknown vector and $ E^i$ is the vector with $ (E^i)_i = 1$ and zeroes elsewhere. Using the Karush-Kuhn-Tucker conditions has allowed us to again almost linearize the problem, but we now have sign constraints on $ X$ and $ \tau$ and what is called a complementarity constraint: $ \tau^{\top} X = 0$ . This sort of problem is essentially what is called a “Linear Complementarity Problem” and is about as hard as solving a linear program (the typical solution method is a variation of the simplex method called “Lemke’s algorithm”). (Technically the $ \lambda$ prevents the problem from being in the right form, but $ \lambda$ can be eliminated from the problem by inspection.) The problem can still be solved; you just need a bit more software. If we cannot short assets (or at least simulate shorting assets) we not only eliminate many possible portfolios from consideration (so we likely end up with a less profitable portfolio than we would like) we also make the mathematics and computation a bit harder.

    The goal of this writeup has been to show how to systematically convert investment advice like “this stock is going to really take off” into an allocation of assets (which in turn implies a pattern of trades). We take as unexamined premises where to get such advice and whether to use the Sharpe ratio or some other notion of risk and/or utility. The point is that even though it may be complicated, from this point it is just calculation and calculation is easy to automate.

    Related posts:

    1. A Quick Appreciation of the Sharpe Ratio
    2. A Discrete Model Gauging Market Efficiency
    3. What is the gambler’s equivalent of Amdahl’s Law?

    http://www.win-vector.com/blog/2010/01/easy-portfolio-allocation/feed/ 0
    Relative returns: a banker versus trader paradox http://www.win-vector.com/blog/2010/01/relative-returns-a-banker-versus-trader-paradox/ http://www.win-vector.com/blog/2010/01/relative-returns-a-banker-versus-trader-paradox/#comments Thu, 07 Jan 2010 22:20:04 +0000 John Mount http://www.win-vector.com/blog/?p=1296
  • Thievery considered harmful
  • Programs reduced to statistics
  • Survive R
    Quick Joke.

    Q: What is the difference between a banker and a trader?
    A: A banker will try and tell you a 10% loss followed by a 10% gain is breaking even.


    This is a bit less arcane than some of the issues we usually discuss on the Win-Vector Blog, but it is a fun one. And it does take some effort to disabuse yourself of the banker’s fallacy.

    It turns out that a lot of our instincts about something as simple as ratios are not quite right. Likely this is because the innate skill of counting leads to a deep understanding of addition and not of multiplication. Take for example the opening joke: a 10% loss followed by a 10% gain sounds like it should be exactly breaking even. But in fact it is exactly a 1% loss.

    To compute the loss and gain on $100 we would say after the 10% loss we have $100*(1-0.1) = $90. A 10% gain on this remaining portion would be written as $90*(1 + 0.1) = $99 which, as predicted, is missing a dollar. An incorrect explanation would be something along the lines of “well the loss was first, so it applied to a larger number than the gain.” But relative losses and gains work by multiplying and are therefore insensitive to order. It is a fact that a 10% loss followed by a 10% gain is exactly the same as a 10% gain followed by a 10% loss (which eliminates the attempted explanation). The correct explanation is that the flaw was far earlier than you would think: you should not believe that the opposite of a 10% loss is a 10% gain. To undo the effect of a 10% loss you need just over an 11% gain (an 11.1111111% gain).
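    The arithmetic above is easy to check mechanically. The following is a small sketch; the function names apply_changes and gain_to_undo are hypothetical, introduced only for this illustration.

```python
def apply_changes(start, changes):
    """Apply a sequence of relative changes (e.g. -0.10 for a 10% loss)."""
    value = start
    for change in changes:
        value *= 1.0 + change
    return value


def gain_to_undo(loss):
    """Relative gain needed to exactly undo a relative loss (0 <= loss < 1)."""
    return 1.0 / (1.0 - loss) - 1.0


loss_then_gain = apply_changes(100.0, [-0.10, +0.10])
gain_then_loss = apply_changes(100.0, [+0.10, -0.10])
recovery = gain_to_undo(0.10)

print(f"10% loss then 10% gain on $100: ${loss_then_gain:.2f}")
print(f"10% gain then 10% loss on $100: ${gain_then_loss:.2f}")
print(f"gain needed to undo a 10% loss: {100 * recovery:.7f}%")
```

    Both orderings end at the same value, and the recovery gain comes out as one ninth, which is the 11.1111111% quoted above.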

    For a more dramatic example consider the Dow Jones Industrial Average. It was at $12827 on January 7th 2008; by March 5th of 2009 it had fallen to $6594, a 48% loss. By January 4th 2010 it had experienced a 60% gain relative to March 5th 2009, but that only got us to $10583, still a 17% loss relative to January 7th 2008. The opposite of a 48% loss is in fact a 92% gain (which obviously has not happened).

    Bankers typically quote interest rates as if they were additive. Things like points and fees are all added. This is almost correct for small interest rates. This nearly right (but actually wrong) language is why we have a bestiary of confusing terms describing interest: simple interest, compound interest and yield. The bankers need some way to signal which numbers will actually be used for computing your mortgage payments versus which numbers will be used for advertising (and in the US they tended not to tell you many of the more important numbers until they were required to by law).

    Traders, on the other hand, are very comfortable with multiplying relative losses and relative gains. The main trick of achieving such mastery is to convert multiplication into addition. The way to do this is the log() function (or the logarithm).

    The log() function is a simple function that has the property that log(a*b) = log(a) + log(b). For convenience let’s pick our notation so that log(10) = 1. From this we can deduce that it must be the case that:


    statement justification
    log(1000) = 3 because: log(1000) = log(10*10*10) = log(10) + log(10) + log(10) = 1 + 1 + 1
    log(1) = 0 because: log(1) = log(1*1) = log(1) + log(1)
    log(0.1) = -1 because: 0 = log(1) = log(0.1 * 10) = log(0.1) + 1 .

    log() can not be used on zero or negative numbers (at least not if you expect a real number as a result). For other values we use our calculator.

    A trader uses logarithms to think additively about relative changes (also called “returns”). Breaking even is represented as 0 (our friend log(1)), relative increases are represented as positive numbers and relative decreases are represented as negative numbers. For example a 10% loss is represented additively using logarithms as log(1- 0.1) = -0.0458. Now in this logarithm notation the additive opposite of a -0.0458 is in fact (as you would hope) +0.0458. You can even double check: log(1 + 0.1111111) = 0.0458. In this notation the mathematics and the language work together- the opposite of a loss is a gain with the same magnitude (and positive sign).

    Returning to our initial example: a 10% loss is represented as -0.0458 and a 10% gain is represented as log(1 + 0.1) = 0.0414, so if we add them (which is how we combine operations in the logarithmic notation) we get -0.0044. Notice this is not zero, and is in fact equal to log(1 - 0.01), or a net loss of 1%.
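    The arithmetic above is easy to check in R. This is only a small sketch: the variable names are ours, and it uses log10() so that log(10) = 1 as in the text.

```r
# Base-10 logarithms, so that log10(10) = 1 as in the text.
loss10 <- log10(1 - 0.10)   # a 10% loss, about -0.0458
gain10 <- log10(1 + 0.10)   # a 10% gain, about  0.0414

# In log notation, combining the two moves is just addition.
net <- loss10 + gain10      # about -0.0044, not zero

# Converting back: the net effect is a multiplier of 0.9 * 1.1 = 0.99.
net_return <- 10^net - 1    # a net loss of 1%

# The gain that exactly undoes a 10% loss is the additive opposite of loss10.
undo_gain <- 10^(-loss10) - 1   # about 0.1111, i.e. an 11.1% gain

print(c(loss10 = loss10, gain10 = gain10, net = net,
        net_return = net_return, undo_gain = undo_gain))
```

    Adding the logs and converting back gives the same answer as multiplying 0.9 by 1.1 directly, which is the whole point of the trader's notation.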

    The point is that even trivial math becomes difficult if you are forced, by language or convention, to work from false premises.

    Statistics to English Translation, Part 2b: Calculating Significance http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/ http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2b-calculating-significance/#comments Mon, 14 Dec 2009 07:02:40 +0000 Nina Zumel http://www.win-vector.com/blog/?p=1281
    In the previous installment of the Statistics to English Translation, we discussed the technical meaning of the term “significant”. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance in research papers: statements like “$(F(2, 864) = 6.6, p = 0.0014)$”.

    As in the last article, we will concentrate on situations where we want to test the difference of means. You should read that previous article first, so you are familiar with the terminology that we use in this one.

    A pdf version of this current article can be found here.

    How is Significance Determined?

    Generally speaking, we calculate significance by computing a test statistic from the data. If we assume a specific null hypothesis, then we know that this test statistic will be distributed in a certain way. We can then compute how likely it is to observe our value of the test statistic, if we assume that the null hypothesis is true.

    We’ll explain the use of a test statistic with our Sneetch example from the last installment.

    The t-test for Difference of Means

    Suppose that the test scores for both Star-Bellies and Plain-Bellies are normally distributed, with the means and standard deviations as given in the table below.

                     $n$ (number of subjects)    $m$ (mean score)    $s$ (standard deviation)
    Star-Bellies     50                          78                  7
    Plain-Bellies    40                          74                  8

    Remember from the previous installment that we can estimate the true population means $\mu_1$ and $\mu_2$ as normally distributed around the empirical population means $m_1$ and $m_2$ respectively, with variances $\sigma^2/{n_1}$ and $\sigma^2/{n_2}$. This is shown in Figure 1. Informally speaking, there is no significant difference in the two populations if the shaded overlap area in Figure 1 is large.

    Figure 1: The estimates of the means for two populations
    Image overlap

    Calculating this area is somewhat involved. Instead, we calculate the t-statistic:

    $\displaystyle t = \frac{(m_2 - m_1)}{s_D}$ (1)



    where $s_D$ is the standard error of the difference between the two means, computed from the pooled variance of the two populations:

    $\displaystyle {s_D}^2 = \frac{n_1\cdot {s_1}^2 + n_2\cdot {s_2}^2}{n_1 + n_2 - 2} \cdot (1/n_1 + 1/n_2)$ (2)


    For our Sneetch example, $s_D = 1.6$, and $t=2.499$, or the negative of that, depending on which group is Group 1. There are $50 + 40 - 2 = 88$ degrees of freedom.
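    As a check on these numbers, here is a short R sketch that applies Equations 1 and 2 to the summary statistics in the table. The variable names are ours; pt() is R's CDF for Student's distribution, and the two-tailed p-value is taken as twice the area in one tail.

```r
# Summary statistics from the Sneetch table.
n1 <- 50; m1 <- 78; s1 <- 7   # Star-Bellies
n2 <- 40; m2 <- 74; s2 <- 8   # Plain-Bellies

# Equation 2: squared standard error of the difference of means.
s_D <- sqrt((n1 * s1^2 + n2 * s2^2) / (n1 + n2 - 2) * (1 / n1 + 1 / n2))

# Equation 1, with the groups ordered so that t comes out positive.
t_stat <- (m1 - m2) / s_D
df <- n1 + n2 - 2

# Two-tailed p-value: area under both tails of Student's distribution.
p <- 2 * pt(-abs(t_stat), df)

print(c(s_D = s_D, t = t_stat, df = df, p = p))
```

    With raw scores instead of summary statistics, R's t.test() with var.equal = TRUE runs the equal-variance test directly, though it pools the usual $n - 1$ sample variances rather than the $n_i \cdot {s_i}^2$ form of Equation 2, so its $t$ can differ slightly from the one computed here.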

    If the null hypothesis is true, and the two populations are identical, then $t$ is distributed according to Student’s distribution with $n_1 + n_2 - 2$ degrees of freedom. Student’s distribution is sort of a “stretched out” bell curve; as the degrees of freedom increase ($n_1 + n_2 \rightarrow \infty$), Student’s distribution approaches the standard normal distribution, $N(0, 1)$ (see footnote 1).

    In other words, if the null hypothesis is true, $ t$ should be near zero. The probability of seeing a $ t$ of a certain magnitude or greater under the null hypothesis is given by the area under the tails of Student’s distribution:

    Figure 2: The area under the tails for a given $ t$
    Image twotailedtest

    This area is $ p$ . For the Sneetch example, $ p = 0.014$ .

    The further out on the tails $t$ is, the stronger the evidence that you should reject the null hypothesis. If you know for some reason that the mean of one population will be greater than or equal to the other, then you can use the one-tailed test:

    Figure 3: The one-tailed test for a given $ t$
    Image onetailedtest

    This test halves the p-value as compared to the two-tailed test, making a given $t$ value twice as significant. When in doubt about which to use, the two-tailed test is more conservative against false positives (see footnote 2).

    In discussions of t-tests, you will often see statements of the form:

    The t-test rejects the hypothesis that two means are equal if

    $\displaystyle \vert t\vert > t_{\alpha/2, \nu}$    


    for a two-tailed test, or

    $\displaystyle t > t_{\alpha, \nu}$    


    for a (right-sided) one-tailed test.

    The quantities on the right hand side of the two equations above are called the critical values for a given significance level $\alpha$ (usually, $\alpha = 0.05$) and $\nu$ degrees of freedom. The critical values are the values for which the area of the right hand tail is equal to $\alpha$.

    Figure 4: Critical value for a one-tailed test. Reject the null hypothesis if
    $ t > t_{crit}$
    Image onetailedcritval

    For a two-tailed test, you must halve the area under a single tail.

    Figure 5: Critical value for a two-tailed test. Reject the null hypothesis if
    $ \vert t\vert > t_{crit}$
    Image twotailedcritval

    This convention dates back to the time when computational resources were scarce, and researchers had to use pre-computed tables of critical values, rather than calculating $ p$ directly. Today, general statistical packages such as R or Matlab can compute the CDFs of any number of standard distributions; once you can compute the CDF, directly computing $ p$ (the area under the tails) is straightforward. Despite this, many tutorials of the t-test (and of the F-test, and other significance tests) still adhere to the convention of comparing test statistics to critical values. This tends to needlessly ritualize the whole process, and make it seem more complicated and mysterious than it actually is, at least in my opinion.
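    To make the comparison concrete, the following R sketch computes both the critical values and the p-values for the Sneetch statistic. It assumes $\alpha = 0.05$; qt() is the quantile function and pt() the CDF of Student's distribution, and the variable names are ours.

```r
t_stat <- 2.499   # Sneetch t-statistic from the text
df     <- 88      # degrees of freedom
alpha  <- 0.05    # assumed significance level

# Critical-value convention: compare t to a quantile of Student's distribution.
t_crit_two <- qt(1 - alpha / 2, df)   # two-tailed: alpha/2 in each tail
t_crit_one <- qt(1 - alpha, df)       # one-tailed: alpha in the right tail
reject_two <- abs(t_stat) > t_crit_two
reject_one <- t_stat > t_crit_one

# Direct convention: compute the tail area itself.
p_one <- pt(t_stat, df, lower.tail = FALSE)
p_two <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

print(c(t_crit_two = t_crit_two, t_crit_one = t_crit_one,
        p_two = p_two, p_one = p_one))
```

    The two conventions always agree on whether to reject, but only the p-value tells the reader how far past the threshold the result landed.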

    David Freedman was very much against the continued practice of using critical values, rather than reporting the actual p-value. The last chapter of Freedman, Pisani and Purves [FPP07] is worth reading for its discussion of this, and other potential pitfalls of significance tests.

    Some standard packages for evaluating t-tests, F-tests, or the ANOVA also present analysis results in terms of critical values. Most of them also print the actual p-value, along with the value of the test statistic and the degrees of freedom. Most researchers rightfully report the test statistics along with the actual significance levels: “we conclude that there is a significant difference in mathematical performance (t(88) = 2.499, p = 0.014)… .” Here, 88 gives the degrees of freedom, $t(88)$ is the value of the t-statistic, and $p$ is of course the p-value.

    Similar comments apply to the F-test, discussed in more detail below.

    Assumptions

    Strictly speaking, the t-test is only valid for normally distributed data where both populations have equal variance. However, the test is fairly robust to non-normal data [Box53]. You can verify that the sample variances are “equal enough” – that is, they could plausibly both be sampled observations from populations with the same variance, by using the F-test. The F-statistic

    $\displaystyle F = {s_1}^2/{s_2}^2 $

    is distributed according to the F distribution with $(n_1 - 1, n_2 - 1)$ degrees of freedom.

    Figure 6: The F distribution
    Image Ftest

    In practice, the larger variance is usually put in the numerator, so $F > 1$. The test should still be two-tailed, so you should double the area under the right-hand tail (see footnote 3). In this situation, you want to check if you should accept the null hypothesis (that $F \approx 1$) at a given significance level. If so, then you can go ahead and apply the t-test.
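    For the Sneetch data, this variance check might look like the sketch below. It treats the tabulated $s$ values as sample standard deviations (an assumption on our part) and uses pf(), R's CDF for the F distribution. The last lines confirm the identity from footnote 3.

```r
n1 <- 50; s1 <- 7   # Star-Bellies
n2 <- 40; s2 <- 8   # Plain-Bellies

# Put the larger variance in the numerator so that F > 1.
F_stat <- s2^2 / s1^2
df_num <- n2 - 1
df_den <- n1 - 1

# Two-tailed test: double the area under the right-hand tail.
right_tail <- pf(F_stat, df_num, df_den, lower.tail = FALSE)
p <- 2 * right_tail

# Footnote 3: area right of F with (a, b) df equals area left of 1/F with (b, a) df.
left_tail <- pf(1 / F_stat, df_den, df_num)

print(c(F = F_stat, p = p, right_tail = right_tail, left_tail = left_tail))
```

    Here the p-value is well above 0.05, so the two variances are "equal enough" and the t-test can be applied.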

    There is a variation of the t-test for distributions of unequal variance, called Welch’s t-test [Wikc]. In this case, you are only checking if the means are equal, not that the distributions are the same.
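    A minimal sketch of Welch's test on the Sneetch summary statistics follows. It uses the standard Welch statistic and the Welch-Satterthwaite approximation for the degrees of freedom, and assumes the tabulated $s$ values are the usual sample standard deviations; the variable names are ours.

```r
n1 <- 50; m1 <- 78; s1 <- 7   # Star-Bellies
n2 <- 40; m2 <- 74; s2 <- 8   # Plain-Bellies

# Per-group variance of the estimated mean; no pooling across groups.
v1 <- s1^2 / n1
v2 <- s2^2 / n2

t_welch <- (m1 - m2) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom (not an integer).
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

p_welch <- 2 * pt(-abs(t_welch), df_welch)

print(c(t = t_welch, df = df_welch, p = p_welch))
```

    With raw data, t.test(x, y) in R performs Welch's test by default (var.equal = FALSE).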

    The F-test for Analysis of Variance (ANOVA)

    ANOVA is an extension of the difference of means test above to the case of more than two populations. The null hypothesis in this case is that all the population means are equal, or more strictly, that all the treatment groups are drawn from the same population.

    The simplest version of the ANOVA is the one-way ANOVA, where there are $k$ treatment groups (populations) with $n_i$ subjects (or repetitions, or replications) each, for a total of $N$ subjects. Each population corresponds to a different single factor (a treatment or a condition: for example, a type of medicine, or a Star-Bellied Sneetch vs. a Plain-Bellied Sneetch vs. a Grinch). Two- or three-way ANOVAs correspond to varying two or three different factors combinatorially. For example, we could do a two-way ANOVA of Sneetch math performance by considering both the belly type and the gender of the Sneetches.

    Figure 7: Table for a Two-way ANOVA of Sneetch math performance
    Image twowayANOVA

    We will only discuss one-way ANOVA in this article, since that covers all the relevant ideas about calculating significance.

    For a one-way ANOVA, we have the population means $ m_i$ and variances $ {s_i}^2$ . We can also calculate the overall mean $ m_0$ , over the entire aggregate population.

    The between-groups mean sum of squares, which is an estimate of the between-groups variance, is given by

    $\displaystyle {s_B}^2 = \frac{1}{k-1} \sum_i {n_i \cdot (m_i - m_0)^2}$ (3)


    ${s_B}^2$ is sometimes designated $MS_B$. It is a measure of how the population means vary with respect to the grand mean.

    The within-group mean sum of squares is an estimate of the within-group variance:

    $\displaystyle {s_W}^2 = \frac{1}{N-k} \sum_i^k \sum_j^{n_i} (x_{ij} - m_i)^2$ (4)


    $ {s_W}^2$ is sometimes designated $ MS_W$ . It is a measure of the “average population variance”.

    Figure 8: Within-group and between-group variance
    Image sigmas

    If the null hypothesis is true, then

    $\displaystyle F = {s_B}^2/{s_W}^2 $

    is distributed according to the F distribution with $(k-1, N-k)$ degrees of freedom, where $N$ is the total number of subjects.

    Figure 9: p-value for the one-tailed F-test
    Image Ftest

    That is, under the null hypothesis, the within-group and between-group variances should be about equal: $F \approx 1$. If $F < 1$, then some of the treatment groups overlap other groups substantially, so practically speaking, one might as well accept the null hypothesis. Hence, a one-sided F test is good enough. As with the t-test, research papers usually give the value of the F statistic, the degrees of freedom, and the p-value: “$(F(2, 864) = 6.6, p = 0.0014)$”. In this example, the test statistic value is 6.6, and it was evaluated against the F distribution with (2, 864) degrees of freedom, which means that there were $k = 3$ groups and $864 + 3 = 867$ subjects in total. The p-value is 0.0014.
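    The sketch below works through Equations 3 and 4 in R on a small made-up data set (the scores and group names are hypothetical, chosen only for illustration), cross-checks the result against R's built-in anova(), and then recomputes the p-value for the quoted F(2, 864) = 6.6.

```r
# Hypothetical scores for three treatment groups (made-up numbers).
scores <- list(
  star  = c(78, 82, 75, 80, 77),
  plain = c(74, 70, 76, 72, 73),
  other = c(81, 85, 79, 83, 84)
)

k  <- length(scores)            # number of groups
n  <- sapply(scores, length)    # group sizes n_i
N  <- sum(n)                    # total number of subjects
m  <- sapply(scores, mean)      # group means m_i
m0 <- mean(unlist(scores))      # grand mean

# Equation 3: between-groups mean sum of squares.
s2_B <- sum(n * (m - m0)^2) / (k - 1)

# Equation 4: within-group mean sum of squares.
s2_W <- sum(sapply(scores, function(x) sum((x - mean(x))^2))) / (N - k)

F_stat <- s2_B / s2_W
p <- pf(F_stat, k - 1, N - k, lower.tail = FALSE)   # one-sided F test

# Cross-check against R's built-in one-way ANOVA.
d <- data.frame(score = unlist(scores),
                group = factor(rep(names(scores), times = n)))
F_builtin <- anova(lm(score ~ group, data = d))[["F value"]][1]

# The p-value quoted in the text for F(2, 864) = 6.6.
p_quoted <- pf(6.6, 2, 864, lower.tail = FALSE)

print(c(F = F_stat, F_builtin = F_builtin, p = p, p_quoted = p_quoted))
```

    The hand computation and anova() agree, which is a useful sanity check when reading the output of an ANOVA package for the first time.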

    Assumptions

    Like the t-test, ANOVA assumes that the data is normally distributed with equal variances. According to Box [Box53], ANOVA is fairly robust to unequal variances when the population sizes are about the same, but you might want to check anyway. If all the populations are the same size (all the $n_i$ are the same), the easiest way to check for equality of variances is an F-test of the statistic $F = {s_{max}}^2/{s_{min}}^2$ with $n-1$ degrees of freedom [Sac84]. In other cases, you can use Bartlett’s Test [Wika] or Levene’s Test [Wikb]. Bartlett’s test uses a test statistic that is distributed as the $\chi^2$ distribution, and Levene’s test uses one that is distributed as the F distribution. Levene’s test does not assume normally distributed data.
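    As a rough illustration, the following R sketch computes the ratio of the largest to the smallest group variance and runs Bartlett's test using bartlett.test() from R's stats package. The data are hypothetical, made up for this example. Levene's test is not in base R, so it is left out here.

```r
# Hypothetical scores for three equally sized groups (made-up numbers).
scores <- data.frame(
  score = c(78, 82, 75, 80, 77,
            74, 70, 76, 72, 73,
            81, 85, 79, 83, 84),
  group = factor(rep(c("star", "plain", "other"), each = 5))
)

# Ratio of the largest to the smallest sample variance.
group_var <- sapply(split(scores$score, scores$group), var)
F_max <- max(group_var) / min(group_var)

# Bartlett's test: the statistic is compared to a chi-squared distribution
# with (number of groups - 1) degrees of freedom.
bt <- bartlett.test(score ~ group, data = scores)

print(group_var)
print(c(F_max = F_max, bartlett_p = bt$p.value))
```

    A large p-value from Bartlett's test means there is no evidence against equal variances, so the ANOVA assumption is plausible for these groups.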

    If the data are not normally distributed, or have unequal variance, often they can be transformed to a form that is closer to obeying the assumptions of ANOVA. The following table of transformations is based on [Sac84, p. 517], and other sources [Hor].

    Figure 10: Table of Transformations
    (Table body not recoverable; the only surviving entry is the condition $\sigma \approx k\mu$.)

    Jim Deacon from the University of Edinburgh lists some suggestions as well [Dea]. He also reminds us that running ANOVA on the transformed data will identify significant differences in the transformed data. This is not the same as saying there are significant differences in the original data!

    Once the Null Hypothesis is Rejected

    If you are able to reject the ANOVA null hypothesis, you will usually want to know which population means are significantly different from the rest. Often, in fact, you are primarily interested in which population had the highest mean. For example, if you are comparing the efficacy of a new medicine A against existing medicines B and C, you are probably not too concerned about whether B and C perform significantly differently from each other, only about whether A is significantly better than both.

    If all you care about is whether the highest mean is significantly higher than the others, you can simply test where the statistic

    $\displaystyle (m_1 - m_2) / \sqrt{{s_W}^2 \cdot \frac{n_1 + n_2}{n_1\cdot n_2}} $

    falls on the Student-t distribution with $n-k$ degrees of freedom. Here, ${s_W}^2$ is the within-group variance, as calculated in Equation 4, $m_1$ and $m_2$ are the highest and second highest population means, $n_1$ and $n_2$ are the sizes of those two groups, $n$ is the total number of samples ($n = \sum{n_i}$), and $k$ is the number of treatment groups.

    This test is usually written

    $\displaystyle m_1 - m_2 > t_{(n-k, \alpha/2)} \cdot \sqrt{{s_W}^2 \cdot \frac{n_1 + n_2}{n_1\cdot n_2}} = LSD_{(1,2)} $

    where $t_{(n-k, \alpha/2)}$ is the (two-sided) critical value for significance level $\alpha$ and $n-k$ is the number of degrees of freedom to use. The right-hand side is called the least significant difference (LSD) between the highest and second highest means, and the test is usually called the LSD test.
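    Here is a sketch of the LSD test in R on hypothetical data (the scores and group names are made up for illustration, and $\alpha = 0.05$ is assumed). It compares the gap between the two highest group means against the least significant difference.

```r
# Hypothetical scores for three treatment groups (made-up numbers).
score <- c(78, 82, 75, 80, 77,
           74, 70, 76, 72, 73,
           81, 85, 79, 83, 84)
group <- factor(rep(c("star", "plain", "other"), each = 5))

by_group <- split(score, group)
m <- sapply(by_group, mean)      # group means
n <- sapply(by_group, length)    # group sizes
k <- length(by_group)
N <- length(score)

# Within-group variance, as in Equation 4.
s2_W <- sum(sapply(by_group, function(x) sum((x - mean(x))^2))) / (N - k)

# Indices of the highest and second highest means.
ord    <- order(m, decreasing = TRUE)
top    <- ord[1]
second <- ord[2]

alpha <- 0.05   # assumed significance level
lsd <- qt(1 - alpha / 2, N - k) *
  sqrt(s2_W * (n[top] + n[second]) / (n[top] * n[second]))

gap <- m[top] - m[second]
significant <- unname(gap > lsd)

print(c(gap = unname(gap), lsd = unname(lsd)))
print(significant)
```

    In this made-up example the gap between the top two means exceeds the LSD, so the highest mean would be declared significantly higher at the 0.05 level.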

    If you want to test all the population differences $m_i - m_j$ for significance (or test the highest value against all of the others explicitly), then you need to take some care with the LSD test. Remember that a significance level of $\alpha$ means that with probability $\alpha$ you will make a false positive error. To test all possible population differences is $K$ = ($k$ choose $2$) comparisons, or $K = k-1$ comparisons, if you sort all the means in descending order and compare adjacent ones. Testing the highest mean against all the lower values is also $K = k-1$ comparisons. This means the probability of making at least one false positive error can be as high as roughly $K \cdot \alpha$. So if you want the overall significance level to be $\alpha$, each individual comparison should use a stricter significance threshold $p \leq \alpha/K$.
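    The correction is simple enough to sketch in a few lines of R. The number of groups and the three p-values below are hypothetical; p.adjust() is R's built-in helper, and its "bonferroni" method multiplies each p-value by the number of comparisons (capped at 1), which is equivalent to comparing the raw p-values against $\alpha/K$.

```r
alpha <- 0.05
k <- 5                       # hypothetical number of treatment groups

K_all      <- choose(k, 2)   # all pairwise comparisons
K_adjacent <- k - 1          # adjacent (sorted) or highest-versus-rest comparisons

# Chance of at least one false positive if the K tests were independent,
# next to the simpler upper bound K * alpha.
familywise_indep <- 1 - (1 - alpha)^K_all
familywise_bound <- K_all * alpha

# Stricter per-comparison thresholds.
threshold_all      <- alpha / K_all
threshold_adjacent <- alpha / K_adjacent

# Equivalent view: inflate the p-values instead of shrinking the threshold.
p_raw <- c(0.014, 0.030, 0.200)   # hypothetical p-values from three comparisons
p_adj <- p.adjust(p_raw, method = "bonferroni")

print(c(K_all = K_all, familywise_indep = familywise_indep,
        familywise_bound = familywise_bound, threshold_all = threshold_all))
print(p_adj)
```

    With ten comparisons at $\alpha = 0.05$, the chance of at least one false positive is already around 40%, which is why the uncorrected LSD test is risky for all-pairs testing.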

    A preferred way to compare multiple means for significance (once the ANOVA null hypothesis has been rejected) is to use a multiple range test [Dea] or Tukey’s method [oST06], rather than the LSD test. Tukey’s method tests all pairwise comparisons simultaneously, and the multiple range test starts with the broadest range (the highest and the lowest means), and works its way in until significance is lost.
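    In R, Tukey's method is available as TukeyHSD() in the stats package, applied to a fitted aov() model. The sketch below runs it on hypothetical data (scores and group names are made up); it is only meant to show what the output looks like.

```r
# Hypothetical scores for three treatment groups (made-up numbers).
scores <- data.frame(
  score = c(78, 82, 75, 80, 77,
            74, 70, 76, 72, 73,
            81, 85, 79, 83, 84),
  group = factor(rep(c("star", "plain", "other"), each = 5))
)

fit <- aov(score ~ group, data = scores)   # one-way ANOVA
tk  <- TukeyHSD(fit, conf.level = 0.95)    # all pairwise comparisons at once

# One row per pair of groups: difference in means, confidence interval,
# and a p-value already adjusted for the multiple comparisons.
pairs <- tk$group
print(pairs)
```

    Each row can be read on its own: a pair differs significantly at the 0.05 level when its adjusted p-value is below 0.05, or equivalently when its interval excludes zero.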

    Conclusion

    We’ve skimmed over many complications in this discussion. Hopefully, though, what we have gone over is enough to demystify much of the statistical discussion in research papers. Perhaps, it will demystify the output of standard ANOVA and t-test packages for you, as well.

    Chong-ho Yu’s site [hY] gives a brief discussion of some of the issues that I’ve skimmed over. It also lists a few common non-parametric tests. These are tests that do not make assumptions about how the data is distributed, and so they may be more appropriate for data that is very non-normal, or for discrete data. They tend to have less power than parametric tests (that is, they have a lower true positive rate); so if the data is at all normal-like, parametric tests are preferred.

    Significance tests are used in other applications beyond testing the difference in means or variances. They are used for testing whether events follow an expected distribution, for testing if there is a correlation between two variables, and for evaluating the coefficients of a regression analysis. We hope to cover some of these applications in future installments of this series.

    Bibliography

    Box53
    G.E.P. Box, Non-normality and tests on variances, Biometrika 40 (1953), no. 3/4, 318-335.
    Dea
    Jim Deacon, A multiple range test for comparing means in an analysis of variance, http://www.biology.ed.ac.uk/research/groups/jdeacon/statistics/tress7.html.
    FPP07
    David Freedman, Robert Pisani, and Roger Purves, Statistics, 4th ed., W. W. Norton & Company, New York, 2007.
    Hor
    Rich Horsley, Transformations, http://www.ndsu.nodak.edu/ndsu/horsley/Transfrm.pdf, Class notes, Plant Sciences 724, North Dakota State University.
    hY
    Chong ho Yu, Parametric tests, http://www.creative-wisdom.com/teaching/WBI/parametric_test.shtml.
    oST06
    National Institute of Standards and Technology, Tukey’s method, NIST/SEMATECH e-Handbook of Statistical Methods, 2006, http://itl.nist.gov/div898/handbook/prc/section4/prc471.htm.
    Sac84
    Lothar Sachs, Applied statistics: A handbook of techniques, 2nd ed., Springer-Verlag, New York, 1984.
    Wika
    Wikipedia, Bartlett’s test, http://en.wikipedia.org/wiki/Bartlett's_test.
    Wikb
    —–, Levene’s test, http://en.wikipedia.org/wiki/Levene's_test.
    Wikc
    —–, Welch’s t test, http://en.wikipedia.org/wiki/Welch's_t_test.


    Footnotes

    1
    Remember from the last installment that when you are estimating the mean of a distribution with unknown mean $\mu$ and unknown variance $\sigma^2$, the 95% confidence interval around your estimate is $m \pm 2\cdot \sigma/\sqrt{n}$. Intuitively speaking, Student’s distribution is what you get if you calculate confidence intervals using the estimated standard deviation $s$ instead of the true but unknown standard deviation $\sigma$. The distribution is stretched out compared to the normal distribution to reflect this increased uncertainty.
    2
    In his textbook Statistics, Freedman tells an anecdote about a study that was published in the Journal of the AMA, claiming to demonstrate that cholesterol causes heart attacks. The treatment group that took a cholesterol reducing drug had “significantly fewer” heart attacks than the control group ($p \approx 0.035$). A closer reading revealed that the researchers used a one-tailed test, which is equivalent to assuming that the treatment group was going to have fewer heart attacks. What if the drug had increased the risk of heart attack? The proper two-tailed significance of their results would have been $p \approx 0.07$, which is higher than JAMA’s strict significance threshold of 0.05. [FPP07, p. 550]
    3
    The area to the right of $ F$ with $ (a,b)$ degrees of freedom is equal to the area to the left of $ 1/F$ , with $ (b,a)$ degrees of freedom.


    CRU graph yet again (with R) http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/ http://www.win-vector.com/blog/2009/12/cru-graph-yet-again-with-r/#comments Sun, 13 Dec 2009 19:25:00 +0000 John Mount http://www.win-vector.com/blog/?p=1195
    IowaHawk has an excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produce similarly bad results using R.

    If the reconstructed technique is close to what was originally done, then so many bad moves were taken that you can’t learn much of anything from the original “result.” This points out some of the pitfalls of not performing hold-out tests, not examining the modeling diagnostics, and not remembering that linear regression models tend to fail as low-variance models (i.e., when they fail they do a good job predicting the mean and vastly underestimate variance).

    Our article is not an article on global warming, but an article on analysis technique. Human-driven global warming is either happening or not happening independent of any bad analysis. Finding the physical truth is a bigger, harder job than eliminating some bad reports (the opposite of a bad report is not necessarily the truth). Bad analyses can have many different sources (mistakes, trying to jump ahead of your colleagues on something you believe is true, trying to fake something you believe is false, or being figments of overly harsh critics) and we have not heard enough to make any accusations.

    First: load the data (I re-formatted it a bit so R can read it: jonesmannrogfig2c.txt, data1400.dat_.txt), perform the principal components reduction and fit a first model.

    > library(lattice)
    > d1400 <- read.table('data1400.dat.txt',sep='\t',header=FALSE)
    > d1400r <- as.matrix(d1400[,2:23])
    > pcomp <- prcomp(na.omit(d1400r))
    > plot(pcomp)
    > vars <- data.frame(cbind(Year=d1400[,1],d1400r %*% pcomp$rotation),row.names=d1400[,1])
    > jones <- read.table('jonesmannrogfig2c.txt',sep='\t',header=TRUE)
    > datUnion <- merge(vars,jones,all=TRUE)
    > datUnion$avgTemp <- with(datUnion,(NH+CET+Central.Europe+Fennoscandia)/4.0)
    > model <- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 ,dat=datUnion)
    > summary(model)
    
    Call:
    lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5, data = datUnion)
    
    Residuals:
           Min         1Q     Median         3Q        Max
    -0.8811679 -0.2658117  0.0008174  0.2933058  1.0450044 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.0065252  0.5750696   0.011   0.9910
    PC1         -0.0001683  0.0003912  -0.430   0.6679
    PC2         -0.0003678  0.0010114  -0.364   0.7168
    PC3          0.0003177  0.0014821   0.214   0.8307
    PC4          0.0044084  0.0019351   2.278   0.0246 *
    PC5          0.0188520  0.0205137   0.919   0.3601
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Residual standard error: 0.4505 on 113 degrees of freedom
      (484 observations deleted due to missingness)
    Multiple R-squared: 0.05223,	Adjusted R-squared: 0.01029
    F-statistic: 1.245 on 5 and 113 DF,  p-value: 0.2927
    

    We used only 5 principal components as modeling variables because, as is typical of principal component analysis, beyond the first few components the components become vanishingly small and unsuitable for use in modeling (see graph pcomp below).
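    The same pattern is easy to reproduce on simulated data. The sketch below is not the CRU data: it builds ten hypothetical proxy columns out of two hidden signals plus a little noise (all of these choices are assumptions for illustration) and shows how quickly the share of variance per component falls off.

```r
set.seed(1400)   # arbitrary seed so the sketch is repeatable

# Simulated "proxies": 10 columns driven by 2 hidden signals plus small noise.
n  <- 200
z1 <- rnorm(n)
z2 <- rnorm(n)
a  <- runif(10, 0.5, 1.5)    # loadings on the first hidden signal
b  <- runif(10, -1.5, 1.5)   # loadings on the second hidden signal
X  <- outer(z1, a) + outer(z2, b) + matrix(rnorm(n * 10, sd = 0.1), nrow = n)

pc <- prcomp(X)

# Share of total variance carried by each principal component.
prop <- pc$sdev^2 / sum(pc$sdev^2)
print(round(prop, 4))
```

    Nearly all the variance sits in the first two components; the rest are mostly the added noise, which is why fitting on the trailing components is asking for trouble.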

    However, this gave a model with far smaller R-squared than people are reporting, so let's add in a lot of components like everybody else does (bad!).

    > model <- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +PC10 +PC11 +PC12 + PC13 ,dat=datUnion)
    > summary(model)
    
    Call:
    lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 +
        PC8 + PC9 + PC10 + PC11 + PC12 + PC13, data = datUnion)
    
    Residuals:
         Min       1Q   Median       3Q      Max
    -0.87249 -0.25951  0.03996  0.25055  0.99039 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  7.431e-01  1.424e+00   0.522   0.6028
    PC1         -1.796e-04  3.665e-04  -0.490   0.6253
    PC2         -4.179e-04  9.759e-04  -0.428   0.6694
    PC3          3.306e-05  1.430e-03   0.023   0.9816
    PC4          3.416e-03  1.803e-03   1.894   0.0609 .
    PC5          4.032e-02  1.978e-02   2.039   0.0440 *
    PC6         -3.260e-03  2.660e-02  -0.123   0.9027
    PC7         -7.134e-02  3.620e-02  -1.971   0.0514 .
    PC8         -1.339e-01  7.895e-02  -1.696   0.0928 .
    PC9          7.577e-02  5.734e-02   1.321   0.1892
    PC10         2.700e-01  5.878e-02   4.594 1.22e-05 ***
    PC11         8.562e-02  6.741e-02   1.270   0.2068
    PC12        -8.057e-02  1.053e-01  -0.765   0.4461
    PC13        -4.099e-02  1.064e-01  -0.385   0.7008
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Residual standard error: 0.4141 on 105 degrees of freedom
      (484 observations deleted due to missingness)
    Multiple R-squared: 0.2558,	Adjusted R-squared: 0.1637
    F-statistic: 2.777 on 13 and 105 DF,  p-value: 0.001961
    

    This is a degenerate model that essentially didn’t fit (though the significance on the PC10 component fools the fitter; PC10 can’t be usable, as it is essentially noise). Graphically we can see the fit is not very useful (despite having a little bit of R-squared) by looking at the graph of the fit plotted in the region of fitting. Notice how the fit variance is much smaller than the true data variance even in the region of training data; this is typical of bad regression fits.

    > datUnion$prediction <- predict(model,newdata=datUnion)
    > dRange <- datUnion[datUnion$Year>=1856 & datUnion$Year<=1980,]
    > xyplot(avgTemp + prediction ~Year,dat=dRange,type='l',auto.key=TRUE)
    

    Now the statement they wanted to make is that the present looks nothing like the past. The past is only available through the fit model so what you would hope is that the model looks like the present and then the model itself separates the past and present. Instead as you see in the graphs above and below this fails two ways: the model looks nothing like the present and the model’s past looks a lot like the model’s present.

    > datUnion$prediction <- predict(model,newdata=datUnion)
    > xyplot(avgTemp + prediction ~Year,dat=datUnion,type=c('p','smooth'),auto.key=TRUE)
    

    What we could do to falsely drive the conclusion (which itself may or may not be true, it just is not supported by this technique, model or data) is create the infamous graph where we switch from modeled data in the past to actual data in the present and then act surprised that the two did not line up (which they did at no step during the fitting). I don’t have the heart to unify the colors or remove the legend, but here is the graph below:

    > datUnion$dinked <- datUnion$prediction
    > datUnion$dinked[!is.na(datUnion$avgTemp)] <- NA
    > xyplot(avgTemp + dinked ~Year,dat=datUnion,type=c('p','smooth'),auto.key=TRUE)
    

    The reason the blue points look different than the others is they came from the average temperature data instead of the model (where everything else came from). Switching the series is essentially assuming the conclusion that the recent past looks very different than the far past.

    Essentially this methodology was so poor it could neither illustrate nor contradict recent global warming. There are plenty of warning signs that the model fitting is problematic, and the conclusion illustrated in the last graph cannot actually be proved or disproved from this data (the proxy variables are too weak to be useful; that is not to say there are not other, better proxy variables or modeling techniques).
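    Two of those warning signs can be demonstrated on pure noise. The sketch below uses simulated data, not the CRU data, and the sample sizes are arbitrary assumptions. It regresses an unrelated outcome on 13 noise predictors, shows that the in-sample R-squared is still positive, shows that the variance of the fitted values is exactly R-squared times the variance of the data, and then scores the model on held-out rows.

```r
set.seed(2009)   # arbitrary seed so the sketch is repeatable

# Simulated data: 13 predictors of pure noise and an unrelated outcome.
n <- 120
p <- 13
d <- as.data.frame(matrix(rnorm(n * p), nrow = n))
d$y <- rnorm(n)

train <- d[1:80, ]
test  <- d[81:n, ]

fit <- lm(y ~ ., data = train)

# In-sample R-squared: positive even though nothing real is being modeled.
r2_train <- summary(fit)$r.squared

# For least squares with an intercept, the variance of the fitted values is
# exactly R-squared times the variance of the outcome, so a low R-squared
# forces the predictions to vary much less than the data.
var_ratio <- var(fitted(fit)) / var(train$y)

# Hold-out R-squared: 1 - (prediction error) / (error of predicting the test mean).
pred    <- predict(fit, newdata = test)
r2_test <- 1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2)

print(c(train = r2_train, var_ratio = var_ratio, holdout = r2_test))
```

    On noise like this the hold-out R-squared typically comes out near zero or negative, which is the signal a hold-out test would have given before any graph was drawn.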

    Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’ http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/ http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-%e2%80%99significant%e2%80%99-doesn%e2%80%99t-always-mean-%e2%80%99important%e2%80%99/#comments Fri, 04 Dec 2009 20:39:20 +0000 Nina Zumel http://www.win-vector.com/blog/?p=1186
    In this installment of our ongoing Statistics to English Translation series (see footnote 1), we will look at the technical meaning of the term “significant”. As you might expect, what it means in statistics is not exactly what it means in everyday language.

    As always, a pdf version of this article is available as well.

    Does too much salt cause high blood pressure, or doesn’t it? That debate has raged for decades, with a slew of studies finding “yes” and a slew of others finding “no.” Two new studies out today in the journal Hypertension tip the scales in favor of reducing sodium – particularly for those 1 in 4 Americans who have high blood pressure. One study found that reducing salt intake from 9,700 milligrams a day to 6,500 milligrams decreased blood pressure significantly in blacks, Asians, and whites who had untreated mild hypertension. Another study found that switching to a lower-salt diet helped lower blood pressure in folks with treatment-resistant hypertension.
    - “10 salt shockers that could make hypertension worse,” U.S. News & World Report [Kot09]

    “Great!” you think. “Who needs to spend money on high-blood pressure meds? I can just cut down my salt!” Well, maybe so, maybe not. To come to that conclusion, you need more information than you were given in that paragraph. What was the “significant” decrease in blood pressure? What was the “before” and the “after”? Does “significant” mean important, or useful? And why has there been so much controversy over this?

    Let’s discuss the important points with an example.

    Image sneetches

    Suppose that we wanted to test for a difference in intelligence between two groups, say Star-Bellied Sneetches and Plain-Bellied Sneetches (see footnote 2). We take a group of 50 Star-Bellies and a group of 40 Plain-Bellies, and give them both a series of tests designed to measure their mathematical, linguistic, and problem-solving abilities. After evaluating the data, we conclude that there is “a significant difference in mathematical performance (t(88) = 2.499, p = 0.014) between the two groups”. The mean mathematics score of the Star-Bellies is 78, with a standard deviation of 7, and the mean mathematics score of the Plain-Bellies is 74, with a standard deviation of 8, for a difference of 4 points (see footnote 3).

    Should we interpret this result to mean that Star-Bellied Sneetches are better than Plain-Bellied ones at math? It depends.

    How Hypothesis Tests Work

    The Sneetch example above and the blood-pressure study cited earlier are both examples of hypothesis tests. In hypothesis testing, researchers set their proposed hypothesis (that there is an effect or a relationship) against the null hypothesis that there is no effect or relationship. In this article, we consider proposed relationships of the form

    The mean value of X measured for group A is different from the mean value of X measured for group B.

    In this case, the null hypothesis is

    The mean value of X is the same for groups A and B, and any difference observed in the data is only by observational chance.

    In fact, we are actually testing the stricter null hypothesis:

    The distribution of X is the same for groups A and B, and any difference observed is only by observational chance.

    A and B are sometimes called treatment groups; this terminology comes from the original applications of hypothesis testing procedures, in agriculture and medicine. In the blood pressure study above, the treatment is daily salt intake. One group ingests about 9,700 milligrams of sodium a day, the other group about 6,500 milligrams a day. The question of interest is: does the difference in sodium intake make a difference in the average blood pressure of the two groups? The null hypothesis is “No.”

    Significance

    We call an observed difference significant – meaning that a difference as large as we observed is probably not by chance – if the value $ 1-p$ is “high enough.” In the Sneetch example, $ p = 0.014$ is the significance level of the result. To interpret the p-value, suppose the null hypothesis is true: there is truly no difference between Star-Bellied math scores and Plain-Bellied math scores. If this is so, then there is only a 0.014 (1.4%) chance that the difference in the average scores of the two groups will be 4 points or larger. In other words, if the null hypothesis is true, and we administer this same test to different groups of 50 Star-Bellies and 40 Plain-Bellies a hundred times, then the difference in scores will be 4 points or more only about once or twice.

    We interpret the fact that we have seen a difference that should be rare to be evidence that the null hypothesis isn’t true. So we reject the null hypothesis and say that there is a “significant difference” in the performance of the two groups. Alternatively, we could say that Star-Bellied Sneetches performed “significantly better” than Plain-Bellied Sneetches on the math test.
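
    The quoted statistic can be reproduced from the summary numbers alone. The following is a minimal sketch, not the calculation from the original exercise: it assumes a pooled two-sample t-test and assumes the quoted standard deviations use the divide-by-n convention. The variable names are ours.

```r
# Summary statistics quoted in the text.
n_star  <- 50; mean_star  <- 78; sd_star  <- 7   # Star-Bellies
n_plain <- 40; mean_plain <- 74; sd_plain <- 8   # Plain-Bellies

df <- n_star + n_plain - 2   # 88, as in t(88)

# Assumption: the quoted standard deviations divide by n, so the pooled
# estimate is sqrt((n1*s1^2 + n2*s2^2) / (n1 + n2 - 2)).
sigma_hat <- sqrt((n_star * sd_star^2 + n_plain * sd_plain^2) / df)

# t statistic for the difference of means, and its two-sided p-value.
t_stat  <- (mean_star - mean_plain) / (sigma_hat * sqrt(1 / n_star + 1 / n_plain))
p_value <- 2 * pt(-abs(t_stat), df)

print(round(c(t = t_stat, p = p_value), 3))
```

    If the standard deviations are instead treated as divide-by-(n-1) estimates, the same recipe gives a slightly larger t (about 2.53); the conclusion does not change.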

    Effect Size

    Four points (or about a 5% difference) is the effect size of the comparison. The effect size represents what might be called the “practical significance” of the result. In general, the larger the effect size, the better. In this example, Star-Bellies might truly outperform Plain-Bellies by about four points on average, but if we were to examine the relationship between math scores and real-life math performance (say, how well college-attending Sneetches do in their math and science courses), we might discover that it takes a test score difference of ten points or more to reliably predict which Sneetches will do better. In that case, a four point average difference would not be a practical difference.

    Evaluating a Result

    When evaluating a result, you should look both for its significance and its effect size. In practice, researchers usually consider a finding to be significant if
    $ p \leq 0.05$ . This is actually a pretty large $ p$ ; it means even if the null hypothesis is true, you would still observe a difference as large as the one that you observed about five times out of every one hundred trials. In fact, Sachs noted that
    $ p < 0.0027$ used to be the commonly used threshold for significance ([Sac84, p. 114]).

    Sometimes results are reported using an asterisk convention: (*) means
    $ p \leq 0.05$ , (**) means
    $ p \leq 0.01$ , and (***) means
    $ p \leq 0.001$ . Hopefully, the actual significance level is reported (it isn’t always), as well as the actual effect size (it isn’t always).

    Image cup_of_coffee

    The effect size in medical studies is often reported in the popular press with statements like “those who abstained from coffee had triple the risk of contracting colon cancer compared to those who drank three or more cups a day.” Does that mean that all confirmed Lapsang Souchong drinkers and the uncaffeinated should run out and learn to embrace Starbucks? Well, no. First of all, ask yourself: what is the baseline risk of colon cancer? If abstaining from coffee triples the risk from 0.01% to 0.03%, well, it probably isn’t worth worrying about. On the other hand, if the risk triples from 5% to 15%, perhaps that is a reason to take up espressos. You should also see who were the subjects of the study, and how similar they are to you. Suppose the study was done on Caucasian males in the U.S., ages 55-65, with no family history of colon cancer. If you are a young white American male, it’s possible that this study says something about your future health. If you are female or non-Caucasian or not living in the U.S., the finding may or may not be relevant to you. It depends on the mechanism that drives the relationship, and whether or not it applies to you as well as to the subjects of the study.
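
    The baseline-risk arithmetic is easy to check. This sketch uses the two notional baselines from the paragraph above; the helper name risk_summary is ours.

```r
# Relative risk is a ratio of risks; the absolute increase is their difference.
risk_summary <- function(baseline, relative_risk) {
  exposed <- baseline * relative_risk
  c(baseline = baseline,
    exposed = exposed,
    absolute_increase = exposed - baseline,
    extra_cases_per_10000 = (exposed - baseline) * 10000)
}

rare   <- risk_summary(0.0001, 3)  # 0.01% tripling to 0.03%
common <- risk_summary(0.05, 3)    # 5% tripling to 15%

print(rbind(rare, common))
```

    The same "triple the risk" headline corresponds to about 2 extra cases per 10,000 people in the first scenario and about 1,000 per 10,000 in the second.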

    “Significant” is not the same as “Important”

    With a large sample, even a small difference can be “statistically significant”… . This doesn’t necessarily make it important. Conversely, an important difference may not be statistically significant if the sample size is too small.
    - Freedman, Pisani and Purves, Statistics [FPP07, p. 550]

    The ability of a study to detect a significant difference depends almost entirely on its size. When a researcher designs a study, she has to decide how much risk of error – and what type of error – she is willing to tolerate.

    How big a risk [of inventing a difference] between two indistinguishable treatments are we willing to put up with? This risk is known as the significance level $ \alpha$ . [Sac84, p. 214]

    $ \alpha$ is the probability of rejecting a null hypothesis that should be accepted. This is a Type I error (a false positive). $ \alpha$ enters the design of the study as the threshold for p-values that the researcher will accept as significant.

    How big a risk do we allow of missing a substantial difference between two treatments? … This risk is called $ \beta$ . [Sac84, p. 214]

    $ \beta$ is the probability of accepting a null hypothesis that should have been rejected. This is a Type II error (a false negative). The quantity $ 1-\beta$ is known as the power of the test: the probability that the test will correctly reject the null hypothesis when the alternative hypothesis is true.

    How small a difference should still be recognized as significant? This difference is called $ \delta$ . [Sac84, p. 214]

    $ \delta$ is the minimum effect size that we are willing to consider “practically significant.”

    It is important to consider all three of $ \alpha$ , $ \beta$ , and $ \delta$ when determining an appropriate sample size for a trial. The power of a test and the significance of a result both increase as the sample size $ n$ increases. So if $ \delta$ is not specified, any difference can appear significant, with a large enough $ n$ , even if the difference is really by chance.
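
    To see how power grows with $ n$ , one can use R's built-in power.t.test. This is a sketch under assumed values: a common standard deviation of 7.5 (between the Sneetch groups' 7 and 8), a true difference of 4 points, and $ \alpha = 0.05$ . None of these choices come from a real study design.

```r
# Assumed design values, for illustration only.
delta <- 4      # true difference in means
sigma <- 7.5    # common standard deviation
alpha <- 0.05   # significance threshold

# Power (1 - beta) of a two-sample t-test at several per-group sample sizes.
sizes <- c(10, 25, 50, 100, 250)
power_by_n <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = delta, sd = sigma, sig.level = alpha)$power
})
names(power_by_n) <- sizes
print(round(power_by_n, 3))

# Per-group sample size needed for 80% power (beta = 0.2) at the same delta.
n_needed <- power.t.test(delta = delta, sd = sigma, sig.level = alpha, power = 0.8)$n
print(ceiling(n_needed))
```

    Fixing $ \alpha$ , $ \beta$ and $ \delta$ first and then solving for $ n$ , as in the last call, is the direction in which a study would normally be designed.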

    The Central Limit Theorem

    To see why the above statement is true, we need a few more facts about estimating the mean. Suppose we have a random variable $ X$ that is normally (or nearly normally) distributed, with a true mean $ \mu $ and (unknown) variance $ \sigma^2$ . You want to estimate $ \mu $ by drawing $ n$ samples; the sample mean $ \bar{x}$ gives you an estimate of $ \mu $ . According to the Central Limit Theorem, if you were to repeat this experiment over and over again, you would see that the estimated $ \bar{x}$ has a normal distribution, with mean $ \mu $ and variance
    $ \sigma^2/n$ . So $ \bar{x}$ is a good estimate of $ \mu $ , one that improves with a larger sample size $ n$ .

    Another fact about normal distributions is that a little over 95% of the probability mass is within $ \pm 2$ standard deviations of the mean. So, for a single experiment, we can reason that the true mean $ \mu $ is in the interval
    $ \bar{x} \pm 2 \sigma/\sqrt{n}$ with 95% probability4.

    Figure 1: Confidence bounds on the estimate of $ \mu $ for different values of $ n$
    Image fig1

    So, as $ n$ gets larger, we zoom in on $ \mu $ 5.
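
    A short simulation makes both facts concrete. This is only a sketch: the true mean, standard deviation, sample size and seed below are arbitrary choices, not values from any study.

```r
set.seed(2009)          # arbitrary seed, for reproducibility only
mu <- 74; sigma <- 8    # hypothetical true mean and standard deviation
n <- 40; trials <- 10000

# Each trial: draw n samples and record the sample mean.
xbar <- replicate(trials, mean(rnorm(n, mean = mu, sd = sigma)))

sd_observed <- sd(xbar)          # spread of the simulated sample means
sd_theory   <- sigma / sqrt(n)   # spread predicted in the text

# Fraction of trials in which mu lies inside xbar +/- 2*sigma/sqrt(n).
coverage <- mean(abs(xbar - mu) <= 2 * sd_theory)

print(c(sd_observed = sd_observed, sd_theory = sd_theory, coverage = coverage))
```

    The observed spread should land close to $ \sigma/\sqrt{n}$ , and the coverage close to 95%; rerunning with a larger n shrinks the spread but leaves the coverage about the same.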

    Now, back to the problem of checking for the difference of means. We’ll take $ n$ samples from population $ A$ and $ n$ from population $ B$ . Let’s assume for now that the variances are equal.

    Figure 2: Confidence bounds overlap; means may not be truly different
    Image fig2

    With 95% probability,
    $ \mu_A \in \bar{x}_A \pm 2\sigma/\sqrt{n}$ , and
    $ \mu_B \in \bar{x}_B \pm 2\sigma/\sqrt{n}$ . If
    $ \vert\bar{x}_A - \bar{x}_B\vert$ is small compared to
    $ 4 \sigma/\sqrt{n}$ , then the two confidence intervals overlap substantially, and we cannot reject the null hypothesis that
    $ \mu_A = \mu_B$ .

    If, on the other hand,
    $ \vert\bar{x}_A - \bar{x}_B\vert$ is wide compared to
    $ 4 \sigma/\sqrt{n}$ :

    Figure 3: Confidence bounds don’t overlap; means are significantly different
    Image fig3

    then the confidence intervals are well separated, and we can reject the null hypothesis.

    So $ \delta$ , the minimum significant distance – the “resolution” of the experiment – is about the distance when the two confidence intervals touch:
    $ 4 \sigma/\sqrt{n}$ , if our desired significance level is 0.05.

    Figure 4: Minimum significant distance for a given sample size $ n$
    Image fig4

    If $ \delta$ is too large, the experiment may be unable to detect important differences because the confidence intervals overlap too soon. This means that the sample size was too small (the test didn’t have enough power), and the experiment should be repeated with a larger test population.

    If $ \delta$ is too small, then the experiment will potentially detect statistically significant differences that are, for all practical intents and purposes, meaningless. To go back to the Sneetch example, if the math exam has one hundred questions, then an effect size of two points would correspond to one group answering two additional questions correctly, on average. Practically speaking, that’s probably not a very big difference. But if we made the experiment big enough, about 250 Sneetches in each group, it would be a statistically significant difference, to the 0.05 level. In theory, we could even make a difference of less than one point statistically significant! That is why knowing the effect size of a significant result is important.
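
    The "about 250" figure follows from the rule of thumb above, $ \delta \approx 4 \sigma/\sqrt{n}$ , solved for $ n$ . In this sketch the function name is ours, and the values of $ \sigma$ (7.5 and 8) are assumptions in line with the Sneetch groups' standard deviations.

```r
# Smallest per-group n at which a difference delta reaches the
# "confidence intervals just touch" threshold: delta = 4 * sigma / sqrt(n).
n_for_delta <- function(delta, sigma) ceiling((4 * sigma / delta)^2)

n_for_delta(delta = 2, sigma = 7.5)  # two-point difference
n_for_delta(delta = 2, sigma = 8)
n_for_delta(delta = 1, sigma = 8)    # one-point difference
```

    The first two calls give 225 and 256, bracketing the figure quoted in the text. Halving $ \delta$ quadruples the required sample, which is why a sub-one-point difference needs over a thousand Sneetches per group.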

    “Significant” is not the same as “True”

    The power and significance level of a test play similar roles to the sensitivity and specificity of a diagnostic test. You’ll remember from Part 1 of this series6 that sensitivity and specificity are properties of the test, not how the test performs in a given population. To know the practical accuracy of a screening test, you must know the underlying prevalence of the condition that it is screening for. If it is crucial that the screening not miss any positive cases, then the test will be designed to be highly sensitive, possibly at the cost of specificity. In that case, the test will tend to have a high false positive rate if the condition is relatively rare. And yet, this same screening test will have a lower overall false positive rate when used in a population where the condition is more prevalent.

    The same is true for hypothesis tests. The probability that a statistically significant result is actually true depends on the underlying probability that results “of that type” tend to be true in the domain of study. It also depends on whether the researcher was trying to minimize the chance of a false positive error, or a false negative error.
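
    The dependence on the underlying rate of true hypotheses can be sketched with Bayes' rule. This treats a hypothesis test exactly like a screening test; the priors below are hypothetical values chosen to show the trend, and the function name is ours.

```r
# Probability that a "significant" result reflects a real effect.
#   prior : fraction of tested hypotheses in the field that are actually true
#   alpha : significance threshold (false positive rate)
#   power : 1 - beta (true positive rate)
prob_true_given_significant <- function(prior, alpha = 0.05, power = 0.8) {
  (power * prior) / (power * prior + alpha * (1 - prior))
}

# Hypothetical priors, for illustration only.
priors <- c(0.5, 0.1, 0.01)
ppv <- sapply(priors, prob_true_given_significant)
names(ppv) <- priors
print(round(ppv, 2))
```

    With the same $ \alpha$ and power, the chance that a significant finding is real falls from about 0.94 to about 0.64 to about 0.14 as true hypotheses become rarer.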

    You should also be careful interpreting the results of exploratory work, where the researchers have run a series of several different studies, but only highlight the “significant” ones. Running twenty experiments and having one of them return a significant result to the $ p=0.05$ level is actually not significant at all.
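
    The arithmetic behind that last claim, as a sketch that assumes the twenty experiments are independent and that the null hypothesis is true in every one of them:

```r
alpha <- 0.05   # per-experiment significance threshold
k <- 20         # number of experiments run

# Chance that at least one experiment comes out "significant" by luck alone.
p_at_least_one <- 1 - (1 - alpha)^k

# Expected number of spuriously significant experiments.
expected_false_positives <- alpha * k

print(c(p_at_least_one = p_at_least_one,
        expected_false_positives = expected_false_positives))
```

    The chance of at least one false positive is about 0.64, and the expected number of false positives is one, so a single "hit" in twenty tries is what chance alone would produce.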

    John Ioannidis discusses these points (and a few others) in his 2005 essay “Why Most Published Research Findings Are False” [Ioa05]. The essay made a few waves at the time of its publication, and it is still available online. We recommend that you read it, along with the 2007 followup article by Moonesinghe et al. [MKJ07]. Now that you’ve read the first two installments of the Statistics to English translation, both essays should be a breeze!

    Some Points to Remember

    • “Significant” is a statistical statement that an observed relationship is unlikely to be by chance. It is not necessarily a statement about the magnitude or the importance (or the truth!) of the relationship.
    • Knowing the effect size of a significant result will help you decide if the relationship is “practically significant.”
    • With a large enough sample size, any difference in means can appear significant, even when it is by chance.

    You now have a general idea what a “statistically significant result” is. The next installment will go into a little more technical detail of how significance is calculated. You should read that installment if you want to decipher statements in research papers like “
    $ (F(2, 864) = 6.6, p = 0.0014)$ ” — or if you are simply curious.

    Bibliography

    FPP07
    David Freedman, Robert Pisani, and Roger Purves, Statistics, 4th ed., W. W. Norton & Company, New York, 2007.
    Ioa05
    John P. A. Ioannidis, Why most published research findings are false, PLoS Med 2 (2005), no. 8, e124, Available as http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124.
    Kot09
    Deborah Kotz, 10 salt shockers that could make hypertension worse, U.S. News & World Report (2009), Online as http://health.usnews.com/articles/health/heart/2009/07/20/10-salt-shockers-that-could-make-hypertension-worse.html.
    MKJ07
    Ramal Moonesinghe, Muin J Khoury, and A. Cecile J. W Janssens, Most published research findings are false — but a little replication goes a long way, PLoS Med 4 (2007), no. 2, e28, Available as http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0040028.
    Sac84
    Lothar Sachs, Applied statistics: A handbook of techniques, 2nd ed., Springer-Verlag, New York, 1984.
    SS99
    Murray R. Spiegel and Larry J. Stephens, Schaum’s outline of statistics, 4th ed., McGraw-Hill, New York, 1999.


    Footnotes

    … series1
    http://www.win-vector.com/blog/category/statistics-to-english-translation/
    … Sneetches2
    “The Sneetches,” from The Sneetches and Other Stories by Dr. Seuss.
    http://www.youtube.com/watch?v=Ln3V0HgW4eM
    and http://www.youtube.com/watch?v=s0LgMpfLD1Y
    … points3
    This example is based on Exercise 10.17 in [SS99]; the original exercise did not, unfortunately, involve Sneetches.
    … probability4
    The correct way to state this is that for a given (unknown) $ \mu $ , the estimate $ \bar{x}$ falls in the interval
    $ \mu \pm 2 \sigma/\sqrt{n}$ just over 95% of the time. This gets awkward to reason about. Luckily, symmetry arguments let us center the appropriate confidence interval around $ \bar{x}$ instead.
    5
    Of course, we don’t actually know $ \sigma$ , so we don’t know exactly how fast we zoom in. That doesn’t affect our argument, though, since only $ n$ changes.
    … series6
    http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/


    Nina Zumel 2009-12-04

    R examine objects tutorial http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/ http://www.win-vector.com/blog/2009/11/r-examine-objects-tutorial/#comments Sat, 21 Nov 2009 15:39:21 +0000 John Mount http://www.win-vector.com/blog/?p=1134
  • Survive R
  • CRU graph yet again (with R)
  • The Data Enrichment Method
  • ]]>
    This article is a quick concrete example of how to use the techniques from Survive R to lower the steepness of The R Project for Statistical Computing’s learning curve (so an apology to all readers who are not interested in R). What follows is for people who already use R and want to achieve more control of the software.
    I am a fan of R. The R software does a number of incredible things and is the result of a number of good design choices. However, you can’t fully benefit from R if you are not already familiar with the internal workings of R. You can quickly become familiar with the internal workings of R if you learn how to inspect the objects of R (as an addition to using the built in help system). Here I give a concrete example of how to use the R system itself to find answers, with or without the help system. R documentation has the difficult dual responsibility of attempting to explain both how to use the R software and explain the nature of the underlying statistics; so the documentation is not always the quickest thing to browse.

    First let’s give R the commands to build a fake data set that has a variable y that turns out to be 3 times x (another variable) plus some noise:

    > n <- 100
    > x <- rnorm(n)
    > y <- 3*x + 0.2*rnorm(n)
    > d <- data.frame(x,y)
    

    This data set (by design) has a nearly linear relation between x and y. We can plot the data as follows:

    > library(ggplot2)
    > ggplot(data=d) + geom_point(aes(x=x,y=y))
    


    dat1.png

    With data like this the most obvious statistical analysis is a linear regression. R can very quickly perform the linear regression and report the results.

    > model <- lm(y~x,data=d)
    > summary(model)
    
    Call:
    lm(formula = y ~ x, data = d)
    
    Residuals:
         Min       1Q   Median       3Q      Max
    -0.41071 -0.12762 -0.00651  0.10240  0.62772 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.02609    0.02102  -1.241    0.217
    x            2.99150    0.02202 135.858   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Residual standard error: 0.2102 on 98 degrees of freedom
    Multiple R-squared: 0.9947,	Adjusted R-squared: 0.9947
    F-statistic: 1.846e+04 on 1 and 98 DF,  p-value: < 2.2e-16
    

    We can read the report and see that the estimated fit formula is: y = 2.99150*x – 0.02609 (which is very close to the true formula y = 3*x). At this point the analysis is done (if the goal of the analysis is to just print the results). However, if we want to use the results in a calculation we need to get at the numbers shown in the above printout. This printout contains a lot of information (such as the estimated fit coefficients, the standard errors, the t-values and the significances) that a statistician would want to see and want to use in further calculations. But it is unclear how to get at these numbers. For example: how do you get the “standard errors” (the numbers in the “Std. Error” column) from the returned model? Are we forced to cut and paste them from the printed report? What can you do?

    The documentation nearly tells us what we need to know. help(lm) yields:


    The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.

    To a newer R user this may not be clear (as there are technical issues from both R and statistics quickly being run through). However, the experienced R user would immediately recognize from this help that what is returned from summary(model) is an object (not just a blob of text) and that looking at the class of the returned object (which turns out to be summary.lm) might tell them what they would need to know.

    Typing:

    > class(summary(model))
    [1] "summary.lm"
    > help(summary.lm)
    

    Yields:


    coefficients: a p x 4 matrix with columns for the estimated coefficient, its standard error, t-statistic and corresponding (two-sided) p-value. Aliased coefficients are omitted.

    But if you are not very familiar with R you might miss that the summary function returns a useful object (instead of a blob of text). Also you might only know to look at help(summary) which does not describe the location of the desired standard errors (but does have a reference to summary.lm, so if you are patient you might find it). We describe how to find the information you need by using R’s object inspection facilities. This is a “doing it the hard way” technique for when you do not understand the help system or you are using a package with less complete help documentation.

    First (using the techniques described in the slides: Survive R) examine the model to see if the standard errors are there:

    > names(model)
     [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
     [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"    
    
    > model$coefficients
    (Intercept)           x
    -0.02609243  2.99150259
    

    We found the coefficients, but did not find the standard errors. Now we know the standard errors are reported by summary(model), so they must be somewhere. Instead of performing a wild goose chase to find the standard errors let’s instead trace how the summary method works to find where it gets them. If we type print(summary) we don’t get any really useful information. This is because summary is a generic method and we need to know what type-qualified name the summary of a linear model is called.

    > class(model)
    [1] "lm"
    

    So we see our model is of type lm so the summary(model) call would use a summary method called summary.lm (which as we saw is also the returned class of the summary(model) object). As we mentioned the solution is in help(summary.lm), but if the solution had not been there we could still make progress: we could dump the source of the summary.lm method:

    > print(summary.lm)
    function (object, correlation = FALSE, symbolic.cor = FALSE,
        ...)
    {
        ....
        class(ans) <- "summary.lm"
        ans
    }
    

    We actually deleted the bulk of the print(summary.lm) result because the important thing to notice is that the method is huge and that it returns an object instead of a blob of text. The fact that the method summary.lm was huge means that it is likely calculating the things it reports (confirming that the standard errors are not part of the model object). The fact that an object is returned means that what we are looking for may be sitting somewhere in the summary waiting for us. To find what we are looking for we convert the summary into a list (using the unclass() method) and look for something with the name or value we are looking for:

    > unclass(summary(model))
    $call
    lm(formula = y ~ x, data = d)
    ...
    $coefficients
                   Estimate Std. Error    t value      Pr(>|t|)
    (Intercept) -0.02609243 0.02102062  -1.241278  2.174662e-01
    x            2.99150259 0.02201930 135.858209 2.095643e-113
    ...
    

    And we have found it. The named slot summary(model)$coefficients is in fact a table that has what we are looking for in the second column. We can create a new list that will let us look up the standard errors by name (for the variable x and for the intercept):

    > stdErrors <- as.list(summary(model)$coefficients[,2])
    

    Now that we have the stdErrors in list form we can look up the numbers we wanted by name.

    > stdErrors['x']
    $x
    [1] 0.0220193
    
    > stdErrors['(Intercept)']
    $`(Intercept)`
    [1] 0.02102062
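
    Once the structure of the summary object is known, the same numbers can be pulled out by column name rather than by column position. The sketch below rebuilds a data set like the one above; the seed is our own addition (the original session was not seeded), so the exact values will differ from the printout.

```r
set.seed(1134)   # arbitrary seed, added here only for reproducibility
n <- 100
x <- rnorm(n)
y <- 3 * x + 0.2 * rnorm(n)
model <- lm(y ~ x, data = data.frame(x, y))

# coef() applied to the summary object returns the coefficient table.
coef_table <- coef(summary(model))
se_by_name <- coef_table[, "Std. Error"]

# Cross-check: the standard errors are the square roots of the diagonal
# of the estimated covariance matrix of the coefficients.
se_from_vcov <- sqrt(diag(vcov(model)))

print(se_by_name["x"])
print(all.equal(se_by_name, se_from_vcov))
```

    Selecting by name ("Std. Error") is a little more robust than selecting column 2, since it fails loudly if the table layout is not what we expect.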
    

    And we finally have the standard errors. But why did we want the standard errors? In this case I wanted the standard errors so I could plot the fit model and show the uncertainty of the model. As is often the case, R already has a function that does all of this. Also (as is often the case) the R function that does this asks the right statistical question (instead of the obvious question) and can draw error bars that display the uncertainty of future predictions. The uncertainty in future prediction is in fact different than the uncertainty of the estimate (what was most obvious to calculate from the standard errors) and (after some reflection) is what I really wanted. Having this sort of distinction already thought out is why we are using a statistics package like R instead of just coding everything up. These calculations are all trivial to implement, but remembering to perform the calculations that answer the right statistical questions can be difficult. The built in R solution of plotting the fit model (black line) and the region of expected prediction uncertainty (blue lines) is as follows:

    > pred <- predict.lm(model,interval='prediction')
    > dfit <- data.frame(x,y,fit=pred[,1],lwr=pred[,2],upr=pred[,3])
    > ggplot(data=dfit) + geom_point(aes(x=x,y=y)) +
        geom_line(aes(x,fit)) +
        geom_line(aes(x=x,y=lwr),color='blue') + geom_line(aes(x=x,y=upr),color='blue')
    


    fit1.png
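
    The difference between the uncertainty of the estimate and the uncertainty of future predictions can also be seen numerically. This sketch again uses a regenerated data set with a seed of our own choosing, and compares the two interval types that predict offers for lm models at a few hypothetical x values.

```r
set.seed(1134)   # arbitrary seed, added here only for reproducibility
n <- 100
x <- rnorm(n)
y <- 3 * x + 0.2 * rnorm(n)
model <- lm(y ~ x, data = data.frame(x, y))

new_x <- data.frame(x = c(-1, 0, 1))   # hypothetical query points

# "confidence": uncertainty of the fitted line itself.
conf <- predict(model, newdata = new_x, interval = "confidence")
# "prediction": uncertainty of a new observation around that line.
pred <- predict(model, newdata = new_x, interval = "prediction")

conf_width <- conf[, "upr"] - conf[, "lwr"]
pred_width <- pred[, "upr"] - pred[, "lwr"]
print(cbind(conf_width, pred_width))
```

    Both calls return the same fitted values, but the prediction intervals are wider because they also account for the noise in each new observation.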

    And we are done.

    The Local to Global Principle http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/ http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/#comments Wed, 11 Nov 2009 16:37:53 +0000 John Mount http://www.win-vector.com/blog/?p=1123
  • A Demonstration of Data Mining
  • Should your mom use Google search?
  • Betting Best-Of Series
  • ]]>
    We describe “the local to global principle.” It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods. We have produced both a stand-alone PDF (more legible) and an HTML/blog form (more skimmable).

    The Local to Global Principle

    John Mount1

    Date: November 11, 2009


    Abstract:

    We describe “the local to global principle.” It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Instead of giving a formal definition of the principle we quickly define it and discuss a few examples and methods.

    Introduction

    A common vain hope of computer scientists and algorithm designers is that a domain expert has already “boiled down” a problem to a precise, but unsolved, algorithmic core. On this point the mathematician Gian-Carlo Rota wrote:

    One of the rarest mathematical talents is the talent for applied mathematics, for picking out of a maze of experimental data the two or three parameters that are relevant, and to discard all other data. This talent is rare. It is taught only at the shop level. [Rot97, “A Mathematician’s Gossip”]

    We describe a useful tool for designing algorithmic applications and solutions which we call “the local to global principle.” The local to global principle is the method of deriving applications and solutions by specifying “local” (and deliberately myopic) heuristics, critiques and methods followed by using a powerful general method to “globalize” this specification into a complete solution.

    There are many important problem solving prescriptions and methods of thought already systematically described and taught:

    • Bacon’s “New Organon” and Mill’s principles of inductive logic.[Mil02]
    • Feynman’s genius method. [Rot97, “Ten Lessons I Wish I Had Been Taught”]
    • Reductionism (top down and bottom up).
    • Divide and conquer.[CLRS09]
    • Forward deduction, backwards induction.
    • Root Cause Analysis.
    • Polya’s heuristic and conjecture and prove patterns [Pol71,Pol54a,Pol54b]
    • Doron Zeilberger’s “Method of Undetermined Generalization and Specialization.” [Zei95]
    • Zbigniew Michalewicz and David B. Fogel’s presentation of evolutionary algorithms.[MF00]

    The local to global principle is more of an organizational pattern than a “computer aided technique,” as no one specific species of software or family of notation is required.

    The local to global principle can be identified in a number of previous important applications, but it is not currently an identified principle.2 The principle is very general, so any succinct description of it is going to be painfully vague. Instead, we explain the principle by discussing some example applications and methods. For each of our example applications we deliberately use a different globalization technique. The effective algorithmist or practitioner must in fact come to each problem already familiar with a reasonably large set of already known local and global techniques, so we conclude with some appropriate fields of study and preparation.

    The local to global principle is divided into two parts: local encoding of the problem followed by a globalization step that uses the encoding. The guiding feature of local encodings is that they are usually easy to compute from the data at hand. Any extension that looks like enumeration, search or optimization is best left to the global step. The local step is essentially the translation of your problem into an abstract language that is ready for the globalization step. In contrast globalization methods are often “off the shelf” in that once you abstract and encode the particulars of your problem you can look for pre-existing useful methods or software to finish your solution. The idea of globalization is to find a best overall or global compromise between competing local criteria. The local step does not so much have to avoid conflicts but instead “price them.” There is also an important trade-off that sophisticated local techniques allow the use of simpler globalization methods and more powerful globalization methods allow the use of simpler local techniques.

    The Examples

    To demonstrate the breadth of the local to global principle we choose a diverse collection of example applications: web page link analysis, natural language processing and machine learning. For each example application we will set up the problem, introduce a reasonable set of local criteria and pick an appropriate globalization technique. We will favor finishing each example without describing the globalization technique in detail, as this would distract from our point and is best left to the given references. These examples are previously solved problems; our contribution is demonstrating the shared underlying principle.

    Web Page Link Analysis

    For our first example application we demonstrate web page link analysis in the form of the famous PageRank score.[PBMW98]

    One of the many good ideas leading up to the early Google search engine was the design of a non-text based measure of importance or interestingness of web pages. A search engine that could fold “interestingness” or popularity into its notion of relevance could better sort important pages into the search user’s view. When the web got so large that there were many pages that were exact matches to any common user query, popularity became a critical consideration. A link based notion of popularity exploits what is important about the web (the link structure; for example, see [Kle97]) and avoids having to depend on a lot of natural language understanding technology. This technique also uses authority outside of the given page, so it has some hope of being resistant (though not immune) to web-spam.

    Taken all at once, designing a score of page importance is a daunting task. However, by working in stages (as the local to global principle prescribes) we can quickly derive interesting scores including the famous PageRank score. We start with the idea that popularity (or the amount of web traffic a page receives) is (loosely) correlated with importance. So for our first approximation step we decide to estimate popularity (or web traffic) and use this estimate as our importance score. Accurately estimating web traffic is itself a hard problem and a big industry (just a few of the major companies involved in this are: Google/Urchin, Quantcast, Nielsen, comScore, Alexa, Hitwise and LookSmart). For our second approximation step we are going to try to estimate popularity from the link structure4 of the web (using no other measurements or historic data) and use this as our score. This link based estimate is unlikely to completely reproduce real web surfing patterns, but it is very interesting in its own right and has been proven in the market to be a useful score.

    Now the problem is to estimate the popularity of a web page from the link structure of the web. We claim: we can generate a useful (but not necessarily accurate) estimate of web traffic from the web’s link structure alone. Consider Figure 1 where we have a universe of three web pages A, B and C that link to each other in the pattern illustrated by what is called a graph.5

    Figure 1: A set of Mutually Linked Web Pages

    In Figure 1 we can consider each link to another page as evidence the other page is interesting or popular. One idea is to simulate a very simple web surfer who clicks on the links on a page uniformly at random. This is called “the random surfer model” and even a model this simple allows us to read some useful information from the link structure of the web. For instance, we could ask what fraction of their time the random surfer spends on each web page, with an eye to the idea that the pages the random surfer visits more often are the more important ones. Let $ p(A)$ denote the proportion of time the random web surfer spends on page A (and define $ p(B)$ and $ p(C)$ similarly). While we do not know any of
    $ p(A), p(B)$ or $ p(C)$ we can derive some relationships between them by inspecting the link graph:

    $\displaystyle p(A) = \frac{1}{2} p(B) + p(C)$
    $\displaystyle p(B) = \frac{1}{2} p(A)$
    $\displaystyle p(C) = \frac{1}{2} p(A) + \frac{1}{2} p(B) .$


    The first equation just reads from the graph that all visits to page-A must come from pages B and C: half of the visitors on page-B continue on to A and all of the visitors on page-C continue on to A. The second and third equations are the corresponding summaries of how traffic is routed to pages B and C. We can insist that $ p(A) + p(B) + p(C) = 1$ as we want these numbers to represent the fraction of time the random web surfer spends on each page. A more sophisticated model would add more features6 to get a more useful result.
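
    The random surfer can also be simulated directly. The following is a minimal sketch, not part of the original analysis: the link table is our reading of the three equations above (Figure 1 is not reproduced here), and the function name, step count and seed are illustrative choices.

```python
import random

# Link structure read off the three equations above (an assumption):
# each page maps to the list of pages it links to.
LINKS = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}

def simulate_random_surfer(links, steps, seed=0):
    """Return the fraction of steps a uniformly random surfer spends on each page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in links}
    page = "A"  # arbitrary starting page
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(links[page])  # click one outgoing link uniformly at random
    return {page: count / steps for page, count in visits.items()}

fractions = simulate_random_surfer(LINKS, steps=200_000)
for page in sorted(fractions):
    print(page, round(fractions[page], 3))
```

    With enough steps the printed fractions should approximately satisfy the three flow equations and sum to one; the simulation is only an estimate, which is part of why the exact linear algebra route is attractive.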

    It turns out we have already encoded enough local rules to completely determine $ p(A), p(B)$ and $ p(C)$ . In this example application an algorithmist already familiar with linear algebra [Str76] would recognize these local conditions as “a system of linear equations.” Solving even web-scale systems of linear equations is considered easy with modern techniques and modern computers. For our small example the solution is:
    $ p(A) = \frac{4}{9}$ ,
    $ p(B) = \frac{2}{9}$ , and
    $ p(C) = \frac{3}{9}$ . The role of the local steps was to reduce a new problem (estimating the importance or popularity of a web page from the link structure) to something with already known techniques (like solving a linear system as illustrated in Figure 2).

    Figure 2: Linear Algebra Solution: As Taught in School

    So page-A is the most important page by the PageRank measure.
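
    As a hedged illustration of the global step, the small system can be solved exactly with a few lines of elimination. This is a sketch only: the helper solve_linear_system is our own, and we replace the (redundant) third flow equation with the normalization condition so that the system has a unique solution.

```python
from fractions import Fraction

def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs exactly by Gauss-Jordan elimination over Fractions."""
    n = len(rhs)
    # Augmented matrix of exact rationals.
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Swap a row with a non-zero entry in this column into the pivot position.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot entry is 1.
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

half = Fraction(1, 2)
# Unknowns are ordered (p(A), p(B), p(C)).
matrix = [
    [1, -half, -1],   # p(A) - 1/2 p(B) - p(C) = 0
    [-half, 1, 0],    # p(B) - 1/2 p(A) = 0
    [1, 1, 1],        # p(A) + p(B) + p(C) = 1 (normalization)
]
rhs = [0, 0, 1]
p_a, p_b, p_c = solve_linear_system(matrix, rhs)
print(p_a, p_b, p_c)  # prints: 4/9 2/9 1/3
```

    Exact rational arithmetic is practical only for toy systems like this one; web-scale systems would use numerical sparse methods instead.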

    In this example application the local step was setting up the system of linear equalities (which are easy to derive from the web link graph) and the global step was solving the entire system for the final scores (which were not obvious). You spend most of your time encoding the problem and then use a known technique (in this case solving a linear system) to finish the solution.

    Natural Language Processing

    Our next example application is natural language processing [Cha96,Cha97]. Speech recognition (the alignment or transcription of recognized intelligible segments of sound to written text) is an important problem in natural language processing. An example problem is the need to find the most likely text matching a sequence of sounds such as is shown in Figure 3.

    Figure 3: A Sequence of Sounds

    Consider Figure 4 (which shows a bad transcription) and Figure 5 (which shows a good transcription).

    Figure 4: A Bad Transcription

    Figure 5: A Good Transcription

    Our claim: we can (given access to training data, and this is the age of data [HNP09]) solve this problem with a local step that is a set of simple criticisms of proposed transcriptions. A good starting point is a database of previous sounds to text transcriptions. This database allows the construction of a series of tables that give the historic frequency (or probability) of all of the following:

    • Prior probability of each sound
    • Probability of each sound given the immediately previous sound
    • Prior probability of each word
    • Probability of each word given the immediately previous word
    • Which combinations of word fragments are legitimate words
    • Probability of each sound being assigned to each word fragment (syllables, phonemes and so on).

    These tables encode a “speech model” (the rules involving sounds only), a language model (the rules involving text or words only) and the linkage between the two models. These models are deliberately simple in that they capture only local interactions (like probability of a word given the word before it) but no long range interactions (like subject predicate agreement).

    Each box, nested box and arrow on our diagram represents one possible local critique. For each item in our diagram (again, the boxes and arrows) we can use our tables to assign a goodness or plausibility score. For instance bad word to word transitions (like “won” $ \rightarrow$ “won”) will be rare in our historic tables, so just looking up probabilities from the tables (or, better, using the logarithms of probabilities) gives us a “plausibility score” that prefers known patterns of language. Then a score for the overall transcription can be derived by multiplying all of the local scores together (or adding them, if logarithms are used). These local scores (though simple) already have encoded enough evidence to prefer the good transcription to the bad transcription without requiring any deep knowledge of speech, text or the meaning of the text. This is because the bad transcription has a series of obvious flaws such as: unlikely sound to word fragment assignments and unlikely word to word transitions.

    Figure 6: Naively Extending a Partial Transcription

    For example consider Figure 6 where a naive solver is in the process of considering selecting the word “one” as the third word to fill in. The only local critiques they need to consider are:

    • how likely the word “one” is in general (call this $ P[one]$ )
    • how likely the word “one” is to follow the word “nine” (call this
      $ P[one \vert nine]$ )
    • how likely the letter sequence “o” is given the sound “wə” (call this
      $ P[o \vert \text{w\textschwa}]$ )
    • how likely the letter sequence “ne” is given the sound “n” (call this
      $ P[ne \vert \text{n}]$ ).

    So the local plausibility of the fill-in word “one” is:
    $ P[one] \times P[one \vert nine] \times P[o \vert \text{w\textschwa}] \times P[ne \vert \text{n}]$ .
    We will call this the critique of “one” in position 3 and write it as $ C_3(w_2,one)$ where $ w_2$ is the word known to be in position 2. Similarly we can generate all of the possible critiques $ C_1(w_1)$ , $ C_2(w_1,w_2)$ , $ C_3(w_2,w_3)$ , $ C_4(w_3,w_4)$ and the overall critique of a sequence $ w_1 \; w_2 \; w_3 \; w_4$ :
    $ C_1(w_1) \times C_2(w_1,w_2) \times C_3(w_2,w_3) \times C_4(w_3,w_4)$
    from our pre-computed tables of probabilities. Notice for all of these critiques only the immediately previous word and the nearby sounds were used to determine the plausibility of the word we are attempting to fit in. Instead of using these critiques to directly fill in a possible solution (or using search) we will package up these critiques (in the form of the $ C_i()$ ) and pass them on to a powerful separate globalization step called Dynamic Programming [Bel57].
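
    To make the local critique concrete, here is a small sketch of how one might compute it from lookup tables. Everything numeric here is hypothetical: the probabilities are invented for illustration (not estimated from any corpus), and the sound labels "W_SCHWA" and "N" are simply our ASCII stand-ins for the sounds in the figure.

```python
import math

# Hypothetical probability tables (illustrative numbers only).
WORD_PRIOR = {"one": 0.004, "won": 0.0005}
WORD_GIVEN_PREVIOUS = {("nine", "one"): 0.03, ("nine", "won"): 0.0002}
FRAGMENT_GIVEN_SOUND = {
    ("o", "W_SCHWA"): 0.2,
    ("wo", "W_SCHWA"): 0.1,
    ("ne", "N"): 0.3,
    ("n", "N"): 0.6,
}

def critique(previous_word, word, fragment_sound_pairs):
    """Local plausibility of `word` after `previous_word`: a product of table lookups."""
    score = WORD_PRIOR[word] * WORD_GIVEN_PREVIOUS[(previous_word, word)]
    for fragment, sound in fragment_sound_pairs:
        score *= FRAGMENT_GIVEN_SOUND[(fragment, sound)]
    return score

def log_critique(previous_word, word, fragment_sound_pairs):
    """The same critique as a sum of logarithms, which avoids underflow on long inputs."""
    total = math.log(WORD_PRIOR[word])
    total += math.log(WORD_GIVEN_PREVIOUS[(previous_word, word)])
    for fragment, sound in fragment_sound_pairs:
        total += math.log(FRAGMENT_GIVEN_SOUND[(fragment, sound)])
    return total

# Critique of "one" versus "won" in position 3, given "nine" in position 2.
c3_one = critique("nine", "one", [("o", "W_SCHWA"), ("ne", "N")])
c3_won = critique("nine", "won", [("wo", "W_SCHWA"), ("n", "N")])
log_c3_one = log_critique("nine", "one", [("o", "W_SCHWA"), ("ne", "N")])
print(c3_one, c3_won)
```

    Under these made-up tables “one” scores far higher than “won” after “nine”, using nothing but local lookups.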

    The globalization or finding of a best overall transcription is not trivial even though our score is simple. This is because the overall best sequence could depend on clever non-local fill-ins (like deliberately picking a less likely first word to allow a later favored transition to a fantastically good third word). Dynamic Programming does not fill in the transcription from left to right, but instead uses a table of scores derived from the left to right arrows and the $ C_i()$ . In our example Dynamic Programming consists of building a table of information as shown in Figure 7. Let $ i$ represent the word position we are looking at (so $ i$ ranges from 1 to 4) and let $ w$ be a variable that ranges over every word in the dictionary. Our table is indexed by $ i$ and $ w$ and, when filled in, $ T(i,w)$ stores the highest “plausibility score” of any partial sequence of words where words 1 through $ i$ have been filled in and the $ i$ -th word is $ w$ .

    Figure 7: Dynamic Programming: Back Chaining in $ T()$ for a Solution

    If we already had this magic table $ T()$ we could find a best possible sequence by “back chaining.” We start by finding a fourth word ($ w_4$ ) such that $ T(4,w_4)$ is maximal (in this case “one”). We then find a best third word ($ w_3$ ) by enumerating all words and picking $ w_3$ such that $ T(3,w_3) \times C_4(w_3,w_4) = T(4,w_4)$ . We continue back until we have found words $ w_2$ and $ w_1$ to get a complete best sequence. Notice that we work from right to left (backwards) and, except for the starting step, we pick each word to match the calculation we are trying to un-roll, not to be the maximal entry in its column. For instance we pick $ w_1 = dial$ not because it has the highest score in its column (it does not), but because $ T(1,dial) \times C_2(dial,nine) \times C_3(nine,one) \times C_4(one,one) = T(4,one)$ is the maximal complete chain.

    Of course, we don’t start with the table $ T()$ already filled in, so we need a procedure to build it. This procedure is the heart of the Dynamic Programming method (for more examples see: “Introduction to Algorithms” [CLRS09]). Notice that $ T(1,w)$ can be filled in for all $ w$ just by plugging in words and computing the critiques $ C_1(w)$ (i.e. $ T(1,w) = C_1(w)$ ). Once all the $ T(1,w)$ are filled in we can fill in the $ T(2,w)$ with the general (and slightly trickier) formula:

    $\displaystyle T(i+1,w) = \max_{v} T(i,v) C_{i+1}(v,w) $

    as we illustrate for $ T(2,nine)$ in Figure 8.

    Figure 8: Dynamic Programming: Building the Table $ T()$

    The magic of the Dynamic Programming technique is: by being careful not to store too much in the table $ T(i,w)$ we avoid an explosion in record keeping that would render the method inefficient. Dynamic Programming exploits the small dependence structure encoded in $ C_i()$ (each box in our diagram depending on only a few arrows) and as we have shown can find “clever” solutions (such as taking a sub-optimal first word to get better transitions into preferred later words). For those who want more detail on solving this problem we recommend [Cha96] (as our goal is not to fully explain Dynamic Programming, but to demonstrate how it could be applied to the transcription problem as a pre-packaged globalizer).
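
    The table building and back chaining steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation behind the figures: the vocabulary and all critique scores are invented, missing word pairs are scored as zero, and back chaining uses an argmax (which selects a predecessor reproducing the stored score) rather than a floating point equality test.

```python
def best_sequence(words, first_critique, critiques):
    """
    words: candidate words.
    first_critique: dict word -> C_1(word); missing words score 0.
    critiques: list of dicts; critiques[k][(v, w)] is the critique of word w at
        position k+2 following word v. Missing pairs score 0.
    Returns (best overall score, best word sequence).
    """
    # table[i][w] is T(i+1, w): best score of any partial sequence ending in w.
    table = [{w: first_critique.get(w, 0.0) for w in words}]
    for step in critiques:
        previous = table[-1]
        table.append({
            w: max(previous[v] * step.get((v, w), 0.0) for v in words)
            for w in words
        })
    # Back chaining: start from the best final word, then work right to left,
    # each time picking a predecessor that achieves the stored table score.
    last = max(words, key=lambda w: table[-1][w])
    sequence = [last]
    for i in range(len(critiques) - 1, -1, -1):
        w = sequence[0]
        v = max(words, key=lambda u: table[i][u] * critiques[i].get((u, w), 0.0))
        sequence.insert(0, v)
    return table[-1][last], sequence

# Hypothetical critique scores (illustrative numbers only).
WORDS = ["dial", "dye", "nine", "one", "won"]
C1 = {"dial": 0.3, "dye": 0.4}
C2 = {("dial", "nine"): 0.5, ("dye", "nine"): 0.1}
C3 = {("nine", "one"): 0.6, ("nine", "won"): 0.2}
C4 = {("one", "one"): 0.5, ("won", "one"): 0.1, ("one", "won"): 0.1}

score, sequence = best_sequence(WORDS, C1, [C2, C3, C4])
print(score, sequence)
```

    In this toy data “dye” has the best first-position score, yet the best complete chain starts with “dial”; a greedy left to right fill-in would miss it. The table costs roughly (number of positions) times (number of words) squared lookups, instead of enumerating every possible word sequence.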

    In this example the local step was the graph link based critiques and the globalization step was Dynamic Programming. The separation of concerns from the local scoring to the globalizing step is a strength of the local to global principle.

    Machine Learning

    Our final example application is machine learning. Machine learning is loosely defined as computer programs that adapt or learn from data. Thomas Mitchell helps distinguish this activity as a specialty of artificial intelligence that concentrates on “well-posed learning problems.” [Mit97] Trevor Hastie, Robert Tibshirani and Jerome Friedman emphasize the relation to statistics (versus more traditional symbolic AI) [TH09]. A simple demonstration can be found in [Mou09b].

    Machine learning is perhaps the strongest example of the local to global principle, and our treatment of it is inspired by the work of Kristin P. Bennett and Emilio Parrado-Hernandez [BPH06]. In hindsight many machine learning algorithms (each of which has had a turn at being “the most exciting breakthrough ever” for a while) can be seen as the pairing of a performance criterion (which we call a local criterion as it applies to one specific set of parameter values at a time) and an optimization method (what we have been calling the globalization step). The work of Bennett and Parrado-Hernandez calls this distinction out and shows how it is not productive to present machine learning systems as unique named monolithic units, but instead to consider how to break them into an objective function and an optimizer. This allows both the choice of better optimizers (such as replacing the inferior gradient descent method wherever it occurs) and explicit control of important concepts such as hypothesis regularization and control of over-fitting (which some algorithms claim to achieve by deliberately using early exit from an inferior optimizer).

    At a “30,000 feet level” we can build a table of common machine learning techniques and name what is commonly used to implement their local and global steps. When a machine learning algorithm is defined by what conditions are meant to be true at the optimum we are no longer bound by details of the original implementation and can examine, fix and improve the components.7 Table 1 is a crude summary of a wide selection of machine learning algorithms that may be more likely to offend everybody than just offend somebody. But this is also the point: it is the algorithmist’s job to think fluidly (beyond given names and provenances) and to invent scaffolding to convert partial analogies into practical correspondences.

    Table 1: Various Machine Learning Techniques
    Machine Learning Method | Local Criterion | Globalization Method
    Linear Regression [BF97] | square error | Linear Algebra
    Linear Discriminant Analysis [Fis36] | square error | Linear Algebra
    Logistic Regression [Kom08] | logit penalty | Newton’s Method
    Perceptron [BRS91] [BD02] | error rate | error based update
    Naive Bayes [MK00] [Mar61] [Lew98] | frequency tables | arithmetic
    Nearest Neighbor [AC06] [IM99] [AI06] | Kernel Methods | enumeration, projection
    Decision Trees [BFSO84] | information theory | partitioning
    clustering [CV05] | square error | partitioning
    MaxEnt [Gru00] [GD04] [Ski88] | entropy penalty | Newton’s Method
    Neural Net with Back Propagation [Hus99] | sigmoid penalty function | Automatic Differentiation, steepest descent
    Winnow [KWA95] | error rate | multiplicative error based update
    Boosting [FS99] [Bre00] [CSS02] [TTV08] | weighted errors, data re-weighting | Conjugate Gradient
    HMM [KCVM04] | probability penalty | Gibbs Sampler
    Latent Dirichlet Allocation [BNJ03] | KL divergence | Variational Methods
    Support Vector Machine [Joa98] [STC00] | L1 Margin, Kernel Methods | Quadratic Optimization

    This table is a necessarily crude summary. For example: notice that several known techniques can not even be distinguished from each other by the local and global columns of the table.

    There are a few points we would like to make. Back propagation was considered unique to Neural Nets for quite a while: because it was so entwined with the technique, it was not recognized as the simple application of Automatic Differentiation [RC96] that it is. Support Vector Machines (SVM) are remarkable for their uniformly very good choice of component methods (maximum L1 margin objective regularization, Kernel Methods [STC04] and sophisticated optimization methods [Joa06]). Many of the machine learning methods that SVM outperforms become competitive again when they adopt some of SVM’s technologies (especially using kernel methods to produce synthetic features).

    Beyond these points we invoke a “globalizers are pre-packaged” principle and leave the discussion of machine learning and optimization to our reference: [BPH06]. In this example the local step is a per-example score or penalty and the globalization step is optimization.
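
    The objective/optimizer split can be illustrated with a deliberately tiny case: fitting a line by least squares. In this sketch (our own example, with made-up data, function names, step size and iteration count) the same per-example square error is handed to two interchangeable globalizers.

```python
def squared_error(params, data):
    """Objective: sum over examples of the per-example (local) square error."""
    slope, intercept = params
    return sum((slope * x + intercept - y) ** 2 for x, y in data)

def solve_by_linear_algebra(data):
    """Globalizer 1: closed-form least squares for a line (2x2 normal equations)."""
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def solve_by_gradient_descent(data, rate=0.01, steps=5000):
    """Globalizer 2: plain gradient descent on the same objective."""
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        grad_slope = sum(2 * (slope * x + intercept - y) * x for x, y in data)
        grad_intercept = sum(2 * (slope * x + intercept - y) for x, y in data)
        slope -= rate * grad_slope
        intercept -= rate * grad_intercept
    return slope, intercept

# Made-up data, roughly following y = 2x + 1.
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]
exact = solve_by_linear_algebra(data)
descent = solve_by_gradient_descent(data)
print(exact, squared_error(exact, data))
print(descent, squared_error(descent, data))
```

    Both optimizers arrive at essentially the same parameters because the optimum is defined by the objective, not by the procedure used to reach it; the step size here was chosen by hand to suit this particular data and is not a general recommendation.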

    Some Methods

    The application of the local to global principle is similar to the Feynman “genius method.” Feynman’s method is to always have in mind a list of problems and a list of solution methods. The genius step is: any time you see a new problem or a new solution method, immediately try it against every item from the complementary list. [Rot97, “Ten Lessons I Wish I Had Been Taught”] This deliberate retention and activity greatly increases your problem solving ability. The power of the local to global principle is itself proportional to the number of local methods times the number of globalization strategies. Of course, to even start, the practitioner must already have available a number of candidate local and globalization methods. We list some methods and some guidance on variation and invention.

    Local Methods

    Good sources of ideas and analogies for local methods include:

    • Introduce a Graph Structure

      A graph structure is a network of nodes connected by edges. Use of graphs was demonstrated both in the natural language processing and web page link analysis examples. We can dress up how we solved these problems and say we used a “Hidden Markov Model”, but the real power was we encoded our problem in a simple graph. Some problems (especially those from logic or those involving time) are essentially solved once they are translated out of their original form and into graph notation (for an example see: [Mou00]).

    • Appeal to Physical Conservation Laws

      A good example physical law is Kirchhoff’s law or conservation of flow. All of the web page link analysis’s equations were derived by saying that the attention at a node is essentially the sum of attentions from other nodes (more sophisticated versions of the analysis actually do create and destroy flow, but they do it in a principled way).

    • Encode the Problem into an Objective Function

      This method is essentially your declaration that you intend to use an optimizer for the globalization step. In operations research this specific technique has long been the practice (with no disrespect: a very productive part of operations research has been translating different problems into linear programs so the simplex method can be applied, for an example see [Mou09a]).

    • Gradient Like Computations

      Includes Gradients, Secants, Lagrangians and other ideas from calculus. Gradients can drive optimizer based globalizers, and techniques like Lagrangians are often powerful enough to use mere inspection as the globalization step.

    • Violation Driven Updates

      This method is particularly effective when your problem is not amenable to continuous optimization. A good example is the Lin-Kernighan heuristic for solving the traveling salesman problem.[LK73] This heuristic looks at subsets of the problem and suggests improving “surgeries” (until no more such improvements are possible).

    • Introduction of Symbols

      Often, as with the web page link analysis example, you can not specify specific values for the unknowns, but you can specify relationships. You often can then solve for the symbols or introduce additional conditions and use an optimizer to complete the solution (see for example the maximum entropy method as described in [Ski88]).

    • Over Specification

      If we anticipate using a global step like search, enumeration, summation or integration then over specification is a good local idea.

      For example: consider computing the probability that a fair coin flipped 10 times comes up with heads exactly 3 times. The easiest way to perform this calculation is to specify exactly which 3 flips come up heads (the local over-specification) and then sum over all choices of 3 out of 10 flips (the global step). In mathematical notation this is:

      $\displaystyle P[\text{exactly 3 heads out of 10 flips}] = \binom{10}{3} 2^{-10} \approx 0.117 $

      or just under 12%.

    • Under Specification

      One of the core principles of Dynamic Programming is to forget as much as possible about partial solutions, keeping only partial solution cost and just enough information to extend partial solutions. If you anticipate using something like Dynamic Programming as your globalization step then your goal should be to under specify.

    • Tables

      A key step of the natural language processing example was the use of tables of past experience to determine which sounds likely corresponded to which words, which words likely followed each other (and so on). Encoding domain knowledge or expertise as probability tables is a very effective problem solving strategy (especially if the globalization strategy is going to be search or Dynamic Programming). In natural language processing examples tables and statistics are much easier to manage than comprehensive rules or grammars.

    • Set up as Ranking or Machine Learning Problem

      This tactic is especially appropriate if your solution success metric is counts, frequencies or probabilities (instead of having to always be correct or always be optimal).
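
    The coin flipping calculation from the Over Specification item above can be spelled out as a literal enumeration. This is a small sketch of that one example only; the variable names are ours.

```python
from itertools import combinations
from math import comb

FLIPS, HEADS = 10, 3

# Local over-specification: name exactly which flips come up heads.
# Each fully specified outcome of 10 fair flips has probability 2**-10.
probability_of_one_outcome = 0.5 ** FLIPS

# Global step: sum over every choice of 3 head positions out of 10.
total = sum(probability_of_one_outcome for _ in combinations(range(FLIPS), HEADS))

print(total)                           # prints: 0.1171875
print(comb(FLIPS, HEADS) / 2 ** FLIPS)  # the closed form gives the same value
```

    The enumeration visits 120 fully specified outcomes, which is exactly the binomial coefficient in the formula.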

    Globalization Methods

    The universe of possible globalization methods is very diverse (in particular globalization is not always optimization).

    • Search / Enumeration

      Search can be slow, but it is always an option to consider. If your problem translates naturally into a graph structure or your solutions are naturally seen as being composed of small pieces, search should be considered. One of the big advantages of using the local phase to formally encode your problem’s structure and putting search off to the global phase is: you can use advanced search techniques. Once you are freed from your specific problem details it becomes much easier to consider search techniques like branch and bound, A*, game theoretic search and general speed-up techniques like hashing and caching.

    • Dynamic Programming

      If your problem has a bit more structure (in that partial solutions summarize and compose easily) then you can likely replace search with Dynamic Programming. The advantage is that Dynamic Programming typically offers an incredible speed up when compared to search.

    • Optimization

      If your problem is continuous (involves numbers instead of discrete or categorical decisions), can be encoded as a reasonable objective function (linear, positive definite quadratic) and has reasonable constraints (linear or convex) then you can immediately apply an optimizer as your globalization step. Typical optimization methods include: conjugate gradient, Newton methods, quasi Newton methods, linear programming and quadratic programming.

    • Combinatorial Optimization

      If your problem includes “discrete variables” (that is, variables that take on one of a fixed set of values instead of values from a numeric range) then you may not be able to apply standard optimization techniques. At this point you may want to use more expensive combinatorial optimization techniques like integer linear programming or constraint satisfaction.

    • Fixed Point Methods / Iteration

      Fixed point methods are based on the idea: “incrementally improve until there is no incremental improvement possible.” If the problem is continuous this is similar to steepest descent. If the problem is discrete then this is similar to the Lin-Kernighan heuristic.

    • Linear Algebra

      The web page link analysis and optimization examples were essentially solved once we reduced them to linear algebra. If you can write your problem as a linear relationship between unknowns or as the fixed-point of a linear operator (i.e. an $ x$ such that $ A x = x$ ) then you can immediately use linear algebra to solve the problem at very large scale (e.g. web scale).

    • Sampling / Problem Kernels

      A very successful line of attack on large problems is to reduce to a smaller problem containing most of the essential difficulty. David Karger has produced a number of effective algorithms for graph cuts and flows using a theory of sampling [Kar98]. Rod Downey and M. Fellows have demonstrated an effective theory of “problem kernels” that finds solutions by focusing on smaller sub-problems (on which we can afford to use more expensive procedures).[DF98]

    • Amortized Analysis / Economic Mechanism Methods

      Daniel Sleator and Robert Tarjan’s ideas of amortized analysis [ST85] allow approximation schemes similar to problem kernels. The method is to approximately optimize by pairing a bunch of unavoidable large penalties (conditions we can’t meet) with some accounting credits (say bonuses from other conditions we are meeting very well). We then isolate these paired items and optimize the rest of the problem exactly. The technique often works by showing the approximation can not be too bad because, due to the pairing of large penalties to good credits, there can not be too many large penalties. An informal example is: if it is impossible to pick someplace where all of an office will eat for lunch, perhaps you can solve the problem by paying one person to accept a restaurant they do not like (if the removal of their objection opens up a venue that is acceptable to everybody else).

    • Relaxation / Homotopic methods

      These methods involve changing hard constraints to soft penalties (so allowing the constraints to be violated, but at a slowly increasing cost). After such a relaxation the homotopic (or continuous deformation) method is to increase the cost of violation and re-solve to try and get a trajectory of better and better nearly acceptable solutions that point to a possible overall solution.
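
    As one concrete instance from the list above, the fixed point formulation $ A x = x$ mentioned under Linear Algebra can be attacked by simple iteration. The sketch below reuses the three-page web example; the matrix is our encoding of that example's flow equations, and the tolerance and iteration cap are arbitrary illustrative choices.

```python
# Entry [i][j] is the probability of moving to page i from page j
# (pages ordered A, B, C); each column sums to one.
TRANSITION = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

def fixed_point(matrix, tolerance=1e-12, max_iterations=10_000):
    """Repeatedly apply the matrix to a probability vector until it stops changing."""
    n = len(matrix)
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iterations):
        new_x = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_x, x)) < tolerance:
            return new_x
        x = new_x
    return x

scores = fixed_point(TRANSITION)
print([round(s, 6) for s in scores])
```

    For this small matrix the iteration settles quickly on the same scores as the direct solution of the linear system; whether simple iteration converges in general depends on properties of the matrix that this sketch does not check.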

    Conclusion

    The purpose of this article has been to make more visible an idea we call the local to global principle. This principle is an organizing tool useful both in designing and analyzing a wide variety of applications. Essentially the whole point of this writeup is to set up enough framework to quickly write down a table of advice such as Table 2 (and for such a table to mean something).

    Table 2: Various Applications, Local Steps and Global Steps
    Example | Local Step | Global Step
    speech transcription | tables | Dynamic Programming
    PageRank | graph structure, linear equations | Linear Algebra
    machine learning | objective function | optimization

    The principle is not universal; not everything can be fit into such a table. For example the local to global decoupling is not a feature of the famous EM algorithm [DLR77], which depends on mixing predictions and corrections.

    To conclude: the recipe is as follows. If you come to a problem with a large shopping bag of possible ways to build local criteria and powerful globalization procedures then you stand a very good chance of solving the problem quickly. Also, if you keep the local to global principle in mind you are more likely to identify and retain potential local tricks and globalizers when you see them and thus have a larger more nimble set of tools available to solve problems when the time comes.

    Bibliography

    AC06
    Nir Ailon and Bernard Chazelle, Approximate nearest neighbors and the fast johnson-lindenstrauss transform, STOC (2006).
    AI06
    Alexandr Andoni and Piotr Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions.
    BD02
    Avrim Blum and John Dunagan, Smoothed analysis of the perceptron algorithm for linear programming, SODA (2002), 11.
    Bel57
    Richard Bellman, Dynamic programming, Princeton University Press, 1957.
    BF97
    Leo Breiman and Jerome H Friedman, Predicting multivariate responses in multiple linear regression, Journal of the Royal Statistical Society, Series B (Methodological) 59 (1997), no. 1, 3-54.
    BFSO84
    Leo Breiman, Jerome Friedman, Charles J. Stone, and R. A. Olshen, Classification and regression trees, Chapman & Hall/CRC, January 1984.
    BNJ03
    David M Blei, Andrew Y Ng, and Michael I Jordan, Latent dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993-1022.
    BPH06
    Kristin P. Bennett and Emilio Parrado-Hernandez, The interplay of optimization and machine learning research, Journal of Machine Learning Research 7 (2006), 1265-1281.
    Bre00
    Leo Breiman, Special invited paper. additive logistic regression: A statistical view of boosting: Discussion, Ann. Statist. 28 (2000), no. 2, 374-377.
    BRS91
    R Beigel, N Reingold, and D Spielman, The perceptron strikes back, Structure in Complexity Theory Conference 6 (1991), 286-291.
    Cha96
    Eugene Charniak, Statistical language learning, MIT Press, 1996.
    Cha97
    Eugene Charniak, Statistical techniques for natural language parsing, AI Magazine 18 (1997), no. 4, 33-44.
    CLRS09
    Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, Introduction to algorithms, MIT Press, 2009.
    CSS02
    Michael Collins, Robert E Schapire, and Yoram Singer, Logistic regression, adaboost and bregman distances, Machine Learning 48 (2002), no. 1/2/3, 30.
    CV05
    Rudi Cilibrasi and Paul M.B Vitanyi, Clustering by compression, IEEE Transactions on Information Theory 51 (2005), no. 4, 1523-1545.
    DF98
    Rod G. Downey and M. R. Fellows, Parameterized complexity, Monographs in Computer Science, Springer, November 1998.
    DLR77
    A P Dempster, N M Laird, and D B Rubin, Maximum likelihood from incomplete data via the em algorithm, Journal of the Royal Statistical Society, Series B (Methodological) 39 (1977), no. 1, 1-38.
    Fis36
    Ronald A Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936), 179-188.
    FS99
    Yoav Freund and Robert E Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence 14 (1999), no. 5, 771-780.
    GD04
    Peter D Grunwald and A Philip Dawid, Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory, Ann. Statist. 32 (2004), no. 4, 1367-1433.
    Gru00
    PD Grunwald, Maximum entropy and the glasses you are looking through, Conference on Uncertainty in Artificial Intelligence (2000), 238-246.
    HNP09
    Alon Halevy, Peter Norvig, and Fernando Pereira, The unreasonable effectiveness of data, IEEE Intelligent Systems (2009).
    Hus99
    Dirk Husmeier, Neural networks for conditional probability estimation, Springer, 1999.
    IM99
    Piotr Indyk and Rajeev Motwani, Approximate nearest neighbors: Towards removing the curse of dimensionality.
    Joa98
    Thorsten Joachims, Making large-scale svm learning practical, Advances in Kernel Methods – Support Vector Learning (1998).
    Joa06
    Thorsten Joachims, Training linear svms in linear time, KDD (2006).
    Kar98
    David R Karger, Randomization in graph optimization problems: A survey, Optima: Mathematical Programming Society Newsletter 58 (1998).
    KCVM04
    Trausti Kristjansson, Aron Culotta, Paul Viola, and Andrew Kachites McCallum, Interactive information extraction with constrained conditional random fields, AAAI (2004).
    Kle97
    Jon M Kleinberg, Authoritative sources in a hyperlinked environment, ACM SIAM Symposium on Discrete Algorithms (1997).
    Kom08
    Paul Komarek, Logistic regression for data mining and high-dimensional classification, CMU CS Thesis (2008), 138.
    KWA95
    J Kivinen, Manfred K Warmuth, and P Auer, The perceptron algorithm vs. winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant, COLT (1995), 289-296.
    Lew98
    David D Lewis, Naive (bayes) at forty: The independence assumption in information retrieval, find journal (1998).
    LK73
    S Lin and BW Kernighan, An effective heuristic algorithm for the traveling-salesman problem, Operations Research (1973), 498-516.
    Mar61
    M E Maron, Automatic indexing: An experimental inquiry, RAND Technical Report (1961), 404-417.
    MF00
    Zbigniew Michalewicz and David B. Fogel, How to solve it: Modern heuristics, Springer, 2000.
    Mil02
    John Stuart Mill, A system of logic, University Press of the Pacific, 2002.
    Mit97
    Thomas Mitchell, Machine learning, McGraw-Hill, 1997.
    MK00
    M E Maron and J L Kuhns, On relevance, probabilistic indexing and information retrieval, 1960 (2000), 1-29.
    Mou00
    John A Mount, Automatic detection of potential deadlock, Dr. Dobbs Journal (2000).
    Mou09a
    John Mount, Automatic generation and testing of un-rolls for profitable technical trades, http://www.win-vector.com/blog/2007/10/paper-on-stock-trading/, 2009.
    Mou09b
    John Mount, A demonstration of data mining, http://www.win-vector.com/blog/2009/08/a-demonstration-of-data-mining/, 2009.
    PBMW98
    Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, The pagerank citation ranking: Bringing order to the web, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768 (1998).
    Pol54a
    G. Polya, Induction and analogy in mathematics, Princeton University Press, 1954.
    Pol54b
    G. Polya, Patterns of plausible inference, Princeton University Press, 1954.
    Pol71
    G. Polya, How to solve it, Princeton University Press, November 1971.
    RC96
    Louis B Rall and George F Corliss, An introduction to automatic differentiation, SIAM: Computational Differentiation: Techniques, Applications and Tools (1996), 1-18.
    Rot97
    Gian-Carlo Rota, Indiscrete thoughts, Birkhauser, 1997.
    Ski88
    John Skilling, The axioms of maximum entropy, Maximum Entropy and Bayesian Methods in Science and Engineering 1 (1988), 173-187.
    ST85
    Daniel Dominic Sleator and Robert Endre Tarjan, Amortized efficiency of list update and paging rules, Communications of the ACM 28 (1985), no. 2.
    STC00
    John Shawe-Taylor and Nello Cristianini, Support vector machines, Cambridge University Press, 2000.
    STC04
    John Shawe-Taylor and Nello Cristianini, Kernel methods for pattern analysis, Cambridge University Press, 2004.
    Str76
    Gilbert Strang, Linear algebra and its applications, Academic Press, Inc., 1976.
    TH09
    Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The elements of statistical learning: Data mining, inference and prediction, Springer, 2009.
    TTV08
    Luca Trevisan, Madhur Tulsiani, and Salil Vadhan, Regularity, boosting, and efficiently simulating every high-entropy distribution, Electronic Colloquium on Computational Complexity (2008), 18.
    Zei95
    Doron Zeilberger, The method of undetermined generalization and specialization illustrated with fred galvin’s amazing proof of the dinitz conjecture, http://arxiv.org/abs/math/9506215, 1995.

    Acknowledgement

    A thank you to readers who supplied help and comments on earlier drafts.


    Footnotes

    … Mount1
    email: mailto:jmount@win-vector.com web: http://www.win-vector.com/
    … principle.2
    The pre-existing practice that comes closest to the local to global principle is found in operations research, where encoding a problem to be solved by an optimizer is a central technique. We claim the natural statement of the local to global principle is more general than always encoding constraints for a particular optimizer (in particular, globalization is not always optimization).
    … structure4
    By “link structure” we mean which web pages link to which other web pages.
    … graph5
    Remember, a graph is a diagram consisting of nodes and edges (here depicted as arrows).
    … features6
    For example the model could account for:

    • surfers entering and leaving the model
    • link odds that vary where they are on a page
    • surfers staying on a page proportional to how much text is on the page
    • matching known traffic and click behavior where we have such data.

    For simplicity we will just stick with the given example.

    … components.7
    When a system is named and defined as an exact set of procedures the system can, by definition, not be improved. This is because with any change in procedure we have a new system that no longer matches the original definition and therefore requires a new name.



    John Mount 2009-11-11


    ]]>
    http://www.win-vector.com/blog/2009/11/the-local-to-global-principle/feed/ 0
    “I don’t think that means what you think it means;” Statistics to English Translation, Part 1: Accuracy Measures http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/ http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/#comments Tue, 03 Nov 2009 16:14:00 +0000 Nina Zumel http://www.win-vector.com/blog/?p=1050
  • Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’
  • Statistics to English Translation, Part 2b: Calculating Significance
  • Living in A Lognormal World
  • ]]>
    Scientists, engineers, and statisticians share similar concerns about evaluating the accuracy of their results, but they don’t always talk about it in the same language. This can lead to misunderstandings when reading across disciplines, and the problem is exacerbated when technical work is communicated to and by the popular media.

    The “Statistics to English Translation” series is a new set of articles that we will be posting from time to time, as an attempt to bridge the language gaps. Our goal is to increase statistical literacy: we hope that you will find it easier to read and understand the statistical results in research papers, even if you can’t replicate the analyses. We also hope that you will be able to read popular media accounts of statistical and scientific results more critically, and to recognize common misunderstandings when they occur.

    The first installment discusses some different accuracy measures that are commonly used in various research communities, and how they are related to each other. There is also a more legible PDF version of the article here.

    The Basics

    In informal language and in popular press articles, “accuracy” is often discussed as if it were a one-dimensional property of a diagnostic test or a classifier.

    In general though, a single number is not enough. A test or classifier should detect what’s interesting, and ignore what’s not. How well it accomplishes these two tasks is related to the two kinds of mistakes that a test or classifier can make: false negatives, and false positives.

    For a classification task, positive means that an instance is labeled as belonging to the class of interest: we may want to automatically gather all news articles about Microsoft out of a news feed, or identify fraudulent credit card transactions. For a screening test, positive means that the test detects whatever it was designed to look for: an HIV test detects the presence of human immunodeficiency virus, for example, while an allergy test detects the presence of an allergic reaction. A negative is obviously the opposite of a positive.

    A false positive is concluding that something is positive when it is not. False positives are sometimes called Type I errors. A false negative is concluding that something is negative when it is not. False negatives are sometimes called Type II errors. The terms “Type I error” and “Type II error” are not terribly mnemonic, but they are commonly used, and therefore worth knowing.

    For binary classification or binary test procedures, the False Positive Rate, $ FPR$ , is the fraction of negative instances that are erroneously classified as positive.

    $\displaystyle FPR = \frac{\mbox{\# false positives}}{\mbox{all negative instances}} = \frac{\mbox{\# false positives}}{\mbox{\# false positives} + \mbox{\# true negatives}}$ (1)


    Likewise, the False Negative Rate, $ FNR$ , is the fraction of positive instances that are erroneously classified as negative.

    $\displaystyle FNR = \frac{\mbox{\# false negatives}}{\mbox{all positive instances}} = \frac{\mbox{\# false negatives}}{\mbox{\# false negatives} + \mbox{\# true positives}}$ (2)


    The True Positive Rate, $ TPR$ , is the fraction of positive instances that are correctly identified as such. It follows from Definition 2 above that
    $ TPR = 1 - FNR$ .

    The True Negative Rate, $ TNR$ , is the fraction of negative instances that are correctly identified as such. It follows from Definition 1 above that
    $ TNR = 1 - FPR$ .
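
    The four rates above can be sketched in a few lines of Python. This is only an illustration: the function name `rates` and the counts passed to it are made up for the example and do not come from any real test or classifier.

```python
def rates(tp, fp, tn, fn):
    """Compute the four basic rates from counts of true/false positives/negatives."""
    positives = tp + fn  # all truly positive instances
    negatives = fp + tn  # all truly negative instances
    return {
        "TPR": tp / positives,  # positives correctly identified
        "FNR": fn / positives,  # positives missed (Definition 2)
        "FPR": fp / negatives,  # negatives wrongly flagged (Definition 1)
        "TNR": tn / negatives,  # negatives correctly identified
    }


# Hypothetical counts, chosen only for illustration.
r = rates(tp=90, fp=30, tn=870, fn=10)
for name, value in r.items():
    print(f"{name} = {value:.3f}")
```

    Because each pair of rates shares a denominator, $ TPR + FNR = 1$ and $ TNR + FPR = 1$ hold for any counts.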

    Sensitivity and Specificity

    The terms sensitivity and specificity generally refer to diagnostic or screening procedures, such as HIV or allergy tests. The sensitivity of a test is its true positive rate; the specificity is its true negative rate, although it can be more intuitive to think of specificity as the complement of the false positive rate: Specificity =
    $ TNR = (1 - FPR)$ .

    The Wikipedia entry on Sensitivity and Specificity [Wik] uses a nice example to illustrate the difference: think of a drug-sniffing dog as a screening test for illicit drugs. If the dog’s nose is highly sensitive to the smell of drugs, then it will detect all the hidden packets of drugs; if it is less sensitive, then it will fail to detect some of the packets. At the same time, the dog should react specifically to drugs, and not, say, jambalaya or doggie biscuits. If the dog is highly specific in its reactions, it will only react to drugs; if it is less specific, then it will react to the occasional care package of yummy home cooking from Mom.

    Screening tests may trade off specificity for sensitivity (and vice-versa). To go back to our drug-sniffing example, we might treat every suitcase and bag that comes through the airport as if it contained drugs; this procedure is perfectly sensitive (it will detect every packet of drugs, for sure), but not specific at all. Or, we might assume that no one is carrying drugs. This is perfectly specific (we will never make a false accusation), but not sensitive at all.

    A more realistic example, inspired by a discussion of mandatory AIDS testing by Joshua Rosenau [Ros06], is the use of the ELISA screening test to detect HIV-infected blood donations. The ELISA test is designed to be very sensitive: it detects 99.7% of the cases of HIV-infection, which gives a false negative rate of
    $ 3 \times 10^{-3}$ . On the other hand, it is not very specific: it has a 1.9% false positive rate1. If you assume that the incidence of HIV-positive individuals in the general population is about 448 out of every 100,000 people [Hig08], then a positive test result is correct only about 19% of the time: one case of true infection out of every five positives. This error rate may be appropriate for screening blood donations, since it is better to discard four perfectly good pints of blood, “just in case”, than to allow a pint of HIV-infected blood into the blood bank. But it is not appropriate to assume that all five of those poor blood donors are HIV-positive, without followup tests to increase the specificity of the screening procedure.

    Sensitivity, Specificity, and Prevalence

    The example above brings up an important point. Sensitivity and specificity are properties of the test itself, not how the test performs in a given population. The absolute accuracy (as the term is commonly understood) of a screening test will change, depending on the prevalence of the condition that the test is screening for.

    Let’s imagine the ELISA test described above as an HIV-screening daemon, who uses two coins to generate uncertainty. When the daemon is shown a pint of infected blood, she flips an unfair quarter. If the quarter comes up heads (which it does 3 times out of every 1000 flips), then she lies and says the blood is uninfected, otherwise she tells the truth. When the daemon is shown a pint of uninfected blood, she flips a silver dollar that comes up heads about 2 times out of every 100 flips. If the silver dollar comes up heads, she lies and says the blood is infected, or else she tells the truth. The quarter and the silver dollar encode ELISA’s sensitivity and specificity, respectively.

    Figure 1: The ELISA daemon screening an uninfected pint of blood
    Image ELISAflip

    Suppose ELISA looks at the blood of 1000 people a day, drawn from the general population. We can expect that about 5 of them are truly infected. That means that ELISA flips her silver dollar 995 times; it will come up heads about 20 times. That’s about twenty false positives a day. She’ll flip the quarter about 5 times, and with high probability, won’t ever see a head. That’s near zero false negatives a day. In total, ELISA will read positive for about 25 pints of blood every day, and she will be wrong for 80% of those cases.

    But suppose ELISA looks at the blood of 1000 people from a high-risk population, where one out of four people are infected. Then ELISA flips her silver dollar about 750 times, and it will come up heads about 15 times: 15 false positives. She’ll also flip the quarter 250 times; the coin just might come up heads one time. Let’s say it does. Then ELISA will read positive for 249+15 = 264 pints of blood, and she’ll be wrong for only about 6% of those cases – plus that case of infection that she missed.
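
    The coin-flipping daemon is easy to simulate. The sketch below is a rough illustration rather than a model of the real assay: the function name `simulate_daemon`, the fixed random seed, and the sample size of 200,000 are all assumptions made for the example, and the two error rates are the ones quoted above.

```python
import random


def simulate_daemon(n_people, prevalence, fnr=0.003, fpr=0.019, seed=0):
    """Simulate the two-coin daemon and count her true and false calls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_pos = false_pos = false_neg = 0
    for _ in range(n_people):
        infected = rng.random() < prevalence
        if infected:
            # The quarter: heads (probability fnr) means she lies and says "uninfected".
            if rng.random() < fnr:
                false_neg += 1
            else:
                true_pos += 1
        else:
            # The silver dollar: heads (probability fpr) means she lies and says "infected".
            if rng.random() < fpr:
                false_pos += 1
    return true_pos, false_pos, false_neg


for label, prevalence in [("general population", 0.00448),
                          ("high-risk population", 0.25)]:
    tp, fp, fn = simulate_daemon(200_000, prevalence)
    wrong = fp / (tp + fp)  # share of positive readings that are false
    print(f"{label}: {tp + fp} positives, {wrong:.1%} of them wrong, {fn} missed")
```

    With a large sample, the share of wrong positives should land near the hand-worked figures above (roughly four in five for the general population, and around one in twenty for the high-risk group); small differences come from the rounding used in the prose.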

    Same test, same sensitivity and specificity, but different overall accuracy. The percentage of positives that are actually true positives in a given population is called the positive predictive value ($ PPV$ ) of the test within that population; it is the probability for that population that a positive test result correctly predicts a positive instance.

    $\displaystyle PPV = \frac{TPR \times P(+)}{TPR \times P(+) + FPR \times P(-)}$ (3)



    where $ P(+)$ is the probability of a positive instance, or in other words the prevalence of the condition in the population. $ P(-)$ is the probability of a negative instance, and of course
    $ P(+) + P(-) = 1$ .2
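
    Equation 3 translates directly into code. The sketch below plugs in the ELISA numbers quoted earlier (true positive rate 0.997, false positive rate 0.019) for the two populations discussed; the function name `ppv` is simply a label chosen for this example.

```python
def ppv(tpr, fpr, prevalence):
    """Positive predictive value from Equation 3."""
    p_pos = prevalence        # P(+)
    p_neg = 1.0 - prevalence  # P(-)
    return (tpr * p_pos) / (tpr * p_pos + fpr * p_neg)


general = ppv(tpr=0.997, fpr=0.019, prevalence=448 / 100_000)
high_risk = ppv(tpr=0.997, fpr=0.019, prevalence=0.25)
print(f"general population PPV:   {general:.3f}")
print(f"high-risk population PPV: {high_risk:.3f}")
```

    The first value comes out near 0.19, matching the "one in five" figure above, while the high-risk value is close to 0.95: the same test, with very different predictive value.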

    Likelihood Ratios

    Likelihood ratios are another measure of diagnostic test accuracy. The positive likelihood ratio is the true positive rate over the false positive rate:
    $ LR_P = TPR/FPR$ . For our example ELISA test, the positive likelihood ratio is 0.997/0.019 = 52.47. The negative likelihood ratio is the false negative rate over the true negative rate,
    $ LR_N =FNR/TNR$ . For our ELISA example, the negative likelihood ratio is 0.003/.981 = 0.003058.

    Likelihood ratios are a property of the screening test, independent of the prevalence of the condition in the population. If you know the odds of infection for the population of interest,
    $ odds_{pop} = P(+)/P(-)$ , then you can calculate the posterior odds of infection for someone who has tested positive:

    $\displaystyle odds_{post} = LR_P \times odds_{pop} $

    and the posterior odds of infection for someone who has tested negative:

    $\displaystyle odds_{post} = LR_N \times odds_{pop} $
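
    As a quick check of the arithmetic, here is a small sketch that computes both likelihood ratios for the ELISA example and converts the posterior odds back into a probability. The helper names are invented for the illustration, and the prevalence is the general-population figure used earlier.

```python
def likelihood_ratios(tpr, fpr):
    """Return (LR_P, LR_N) given the true and false positive rates."""
    fnr = 1.0 - tpr
    tnr = 1.0 - fpr
    return tpr / fpr, fnr / tnr


def odds_to_probability(odds):
    """Convert odds of p/(1-p) back into the probability p."""
    return odds / (1.0 + odds)


lr_p, lr_n = likelihood_ratios(tpr=0.997, fpr=0.019)
prevalence = 448 / 100_000
odds_pop = prevalence / (1.0 - prevalence)

odds_after_positive = lr_p * odds_pop
odds_after_negative = lr_n * odds_pop
print(f"LR_P = {lr_p:.2f}, LR_N = {lr_n:.6f}")
print(f"P(infected | positive) = {odds_to_probability(odds_after_positive):.3f}")
print(f"P(infected | negative) = {odds_to_probability(odds_after_negative):.6f}")
```

    The probability after a positive result agrees with the positive predictive value computed from Equation 3, as it should: the two frameworks are different bookkeeping for the same calculation.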

    It’s been argued that likelihood ratios make it easier for non-statistically-minded practitioners to interpret the results of a test than sensitivity and specificity do [JGS94]. It’s also been argued the other way [PSBtR05]. Which framework makes more sense depends on whether you prefer thinking in odds or probabilities. In either case you should be leery of “guidelines” of the sort: “$ LR_P > 10$ indicates large and often conclusive increase in the likelihood of the disease.” There is certainly a large increase in the posterior likelihood of infection if $ LR_P > 10$ , but as the ELISA coin-flipping example should have made clear, this posterior likelihood can still be quite small, if the disease is sufficiently rare.

    I occasionally see something called the diagnostic odds ratio. It was developed as “a single indicator of test performance,” and I’ve seen it described as “the odds of the true positive rate divided by the odds of the false positive rate” [HC07]. I could give you the actual formula here, but frankly – it makes no sense. The whole point of having two measures for accuracy is that one is not enough, and if you absolutely must have one number, you are better off using something like the $ F_1$ measure that we describe in the next section.

    Precision and Recall

    Precision and recall are similar (but not identical) to sensitivity and specificity. The measures are popular in the information retrieval and machine learning communities.

    Recall is the same as sensitivity, or the true positive rate: the fraction of positive instances that are correctly classified as such. Precision is defined as the fraction of instances classified as positive that really are positive:

    $\displaystyle \mbox{precision} = \frac{\mbox{\# true positives}}{\mbox{\# true positives} + \mbox{\# false positives}} $

    This is not the same as specificity; it is the same as the positive predictive value, and is a joint property of the classifier and the population that it was evaluated on.

    Information retrieval research concerns itself with efficient discovery of relevant documents from document collections, and that domain motivates the definitions of precision and recall. A library patron queries the library catalog for books on a given topic; the catalog’s search engine should return all of the books relevant to her query, and only those books. Recall is a measure of how well the search delivers “all of the relevant books”, and precision is a measure of how well it delivers “only the relevant books”. If recall is poor, then the patron will miss finding many relevant books; if precision is poor, then she will be inundated with a bunch of book suggestions that have nothing to do with her search.
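
    In the library setting, both measures reduce to simple set arithmetic. The sketch below is illustrative only: the book identifiers are invented, and returning 0.0 when a set is empty is a convention assumed for the example (the ratio is undefined in that case).

```python
def precision_recall(returned, relevant):
    """Precision and recall of a search, given sets of item identifiers."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant items that the search actually found
    # Assumed convention: report 0.0 when the denominator would be zero.
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical catalog query: four relevant books exist, three books come back.
relevant_books = {"book-1", "book-2", "book-3", "book-4"}
returned_books = {"book-1", "book-2", "book-9"}
precision, recall = precision_recall(returned_books, relevant_books)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

    Here two of the three returned books are relevant (precision of two thirds), but only two of the four relevant books were found (recall of one half).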

    As with diagnostic procedures, classifiers may trade precision for recall, and vice-versa. Suppose our library patron is looking for novels about vampires. She could request all novels with the word “vampire” in the title. This search would have almost perfect precision, since presumably a novel with the word “vampire” in the title is going to be about vampires. It would not have perfect recall, since many novels about vampires – like Dracula, or the books from the Twilight series – don’t announce themselves quite that blatantly. Now suppose she is only interested in novels from Anne Rice’s Vampire Chronicles series. She could request all novels authored by Anne Rice. This search would have perfect recall, but not perfect precision, since Ms. Rice did in fact write several novels that are not about vampires.

    These examples show that neither high precision nor high recall alone guarantees a useful classifier. It is the tension between achieving high precision and high recall that leads to good classifiers.

    As we discussed above, the primary difference between precision and specificity is that precision is a property of the algorithm and the population. One could argue that precision is a more appropriate measure than specificity for many classification and machine learning tasks, especially those related to text or natural language. The fundamental assumption, after all, is that such algorithms are trained on data that is representative of the population that the classifier will be deployed in.

    If you insist: Single Score Measures

    There is another measure called $ F_1$ , the harmonic mean of precision and recall:

    $\displaystyle F_1 = \frac{2}{(1/\mbox{precision} + 1/\mbox{recall})} = 2 \times \frac{\mbox{precision} \times \mbox{recall}}{\mbox{precision + recall}}. $

    $ F_1$ is near one when both precision and recall are high, and near zero when they are both low. It is a convenient single score to characterize overall accuracy, especially for comparing the performance of different classifiers.

    Using $ F_1$ to compare classifiers assumes that precision and recall are equally important for the application. If one criterion is more important than the other, then one can also use the weighted harmonic mean:

    $\displaystyle F_\alpha = \frac{(1 + \alpha)(\mbox{precision} \times \mbox{recall})}{\alpha \times \mbox{precision} + \mbox{recall}}. $

    $ \alpha$ describes how much more important recall is than precision: use $ F_2$ if recall is twice as important as precision, $ F_{0.5}$ if precision is twice as important as recall.
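
    A minimal sketch of the weighted score, using the parameterization given in this article, follows. Note that many references write the same family as $ F_\beta$ with $ \beta^2$ in place of $ \alpha$ , so scores computed with other tools may not match these for the same subscript. The function name and the sample precision and recall values are made up for the example.

```python
def f_alpha(precision, recall, alpha=1.0):
    """Weighted harmonic mean of precision and recall, as defined in this article."""
    denominator = alpha * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when both precision and recall are zero
    return (1 + alpha) * precision * recall / denominator


# Hypothetical classifier with mediocre precision and perfect recall.
p, r = 0.5, 1.0
print(f"F_1   = {f_alpha(p, r):.3f}")
print(f"F_2   = {f_alpha(p, r, alpha=2.0):.3f}")   # favors recall
print(f"F_0.5 = {f_alpha(p, r, alpha=0.5):.3f}")   # favors precision
```

    Since this hypothetical classifier is stronger on recall than on precision, its score rises when recall is weighted more heavily and falls when precision is.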

    It is still better to have separate target goals for precision and recall that a candidate classifier must meet. Still, $ F_1$ and $ F_\alpha$ are found in the literature, so they are presented here.

    ROC Curves

    Not all diagnostic tests or classifiers return a simple “yes-or-no” answer. In fact, most probably don’t. Generally, a classification or diagnostic procedure will return a score along a continuum; ideally, the positive instances score towards one end of the scale, and the negative examples towards the other end. It is up to the scientist or the analyst to set a threshold on that score that separates what is considered a positive result from what is considered a negative result. The Receiver Operating Characteristic Curve, or ROC Curve, is a tool that helps set the best threshold.

    Figure 2: Plot of score distributions for positive and negative instances (Class 10 is positive)
    Image ScoreDensityplots

    Suppose we are trying to classify a set of instances into one of two classes, positive and negative3. We’ve gathered a test set of representative samples, and we’ve developed a scoring procedure to try to separate them. Positives tend to score on the high end of the scale, negatives toward the low end. We want to pick a threshold value.

    Figure 2 shows what happens when we score the test set. We can see that the scores of the positive instances (class 10) are in a cluster centered just above 7, and the scores of the negatives (class 0) are in a cluster centered near 5. Still, there is an interval where the two clusters overlap substantially. If we pick a threshold to the right of that interval (say, $ T = 7$ ), almost everything that scores greater than $ T$ will be truly positive (high precision/specificity), but we miss a lot of positives, too (low recall/sensitivity). If we pick a threshold to the left of that interval (say $ T = 5$ ), we will catch almost all the positives (high recall/sensitivity), but we will also pick up a lot of negatives (low specificity/precision). So we want the threshold to be somewhere in the overlap interval, but where?

    Figure 3: ROC Curve corresponding to Figure 2. Selected thresholds are marked on the curve.
    Image ROC

    ROC curves plot the false positive rate on the x-axis and the true positive rate on the y-axis, as we vary the threshold. The point $ (0,0)$ corresponds to rejecting everything; the point $ (1,1)$ corresponds to accepting everything. The ideal point is $ (0,1)$ : accept all positive instances and reject all negative instances. The line $ x=y$ corresponds to random guessing: that is, a procedure that assigns each instance a score uniformly drawn from (in this example) the interval $ [1,8]$ without even checking if the instance is positive or negative.

    The ROC curve represents the tradeoff between true positives and false positives that we make as we increase the threshold from accepting everything to rejecting everything. Figure 3 gives the ROC curve for our example, with a few example thresholds marked on the curve.

    The area between the ROC curve and the $ x=y$ line can be considered a measure of accuracy; the smaller that area, the more the scoring procedure is like random guessing. The larger the area, the better separated the two classes are. We can use the curve to help us decide how to set a threshold that will give us the most acceptable tradeoff between true positives and false positives. For this example, we would probably want to select a threshold somewhere between $ 6$ and $ 6.5$ .
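
    To make the construction concrete, here is a small sketch that sweeps the threshold over a scored test set and collects the resulting (false positive rate, true positive rate) points. The ten scores and labels are invented for the illustration and are not the data behind Figures 2 and 3; the function names are likewise just labels for this example.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points as the threshold is lowered through the scores."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: reject everything
    for t in sorted(set(scores), reverse=True):
        # An instance is called positive when its score is at least t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points  # the last point is (1, 1): accept everything


def area_under(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores; True marks a positive instance.
scores = [7.4, 7.1, 6.8, 6.3, 6.1, 5.9, 5.4, 5.0, 4.7, 4.2]
labels = [True, True, True, False, True, False, True, False, False, False]

points = roc_points(scores, labels)
auc = area_under(points)
print(f"area under the ROC curve:       {auc:.2f}")
print(f"area between curve and x=y line: {auc - 0.5:.2f}")
```

    The area under the $ x=y$ line is one half, so the area between the curve and that line is the area under the curve minus 0.5; a value near zero means the scores separate the classes no better than random guessing.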

    In Conclusion

    Some points to remember:

    • Classifier and diagnostic test performance are not one-dimensional.
    • Different fields use different (but related) measures of accuracy.
    • Classifier and diagnostic test performance depend on the relative cost of Type I and Type II errors, as well as on the proportion of positive and negative instances in the population of interest.

    Bibliography

    HC07
    Children’s Mercy Hospitals and Clinics, Stats: Meta-analysis for a diagnostic test, http://www.childrens-mercy.org/stats/model/diagnostic.asp, 2007.
    Hig08
    Liz Highleyman, CDC updates estimates of HIV prevalence in the United States, http://www.hivandhepatitis.com/recent/2008/100708_a.html, 2008.
    JGS94
    R. Jaeschke, GH Guyatt, and DL Sackett, Users’ guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? The evidence-based medicine working group, JAMA 271 (1994), no. 6, 703-707.
    Org04
    World Health Organization, HIV assays: Operational characteristics (phase 1); Report 15 antigen/antibody ELISAs, http://whqlibdoc.who.int/publications/2004/9241592370.pdf, 2004.
    PSBtR05
    MA Puhan, J. Steurer, LM Bachmann, and G. ter Riet, A randomized trial of ways to describe test accuracy: the effect on physicians’ post-test probability estimates, Annals of Internal Medicine 143 (2005), no. 3, 184-189.
    Ros06
    Joshua Rosenau, AIDS testing, http://scienceblogs.com/tfk/2006/08/aids_testing.php, 2006.
    Wik
    Wikipedia, Sensitivity and specificity, http://en.wikipedia.org/wiki/Sensitivity_and_specificity.

    Appendix: Glossary of Accuracy Terms

    Basic Terms

    Accuracy

    $\displaystyle \frac{\mbox{\# true positives} + \mbox{\# true negatives}}{\mbox{all instances}} $
    Type I error
    False Positive: to conclude something is a positive instance when it is not.
    Type II error
    False Negative: to conclude something is a negative instance when it is not.
    False Positive Rate ($ FPR$ )
    The fraction of negative instances that are erroneously classified as positive.

    $\displaystyle FPR = \frac{\mbox{\# false positives}}{\mbox{all negative instances}} = \frac{\mbox{\# false positives}}{\mbox{\# false positives} + \mbox{\# true negatives}}$


    False Negative Rate ($ FNR$ )
    The fraction of positive instances that are erroneously classified as negative.

    $\displaystyle FNR = \frac{\mbox{\# false negatives}}{\mbox{all positive instances}} = \frac{\mbox{\# false negatives}}{\mbox{\# false negatives} + \mbox{\# true positives}}$


    True Positive Rate ($ TPR$ )
    The fraction of positive instances that are correctly identified as such.
    $ TPR = 1 - FNR$ .
    True Negative Rate ($ TNR$ )
    The fraction of negative instances that are correctly identified as such.
    $ TNR = 1 - FPR$ .
    Prevalence $ P(+)$
    The proportion of positive instances in the population, or the probability that someone drawn from the population at random is a positive instance. $ P(-)$ is the probability of drawing a negative instance from the population at random. The odds of a positive is the ratio of $ P(+)$ to $ P(-)$ .

    $\displaystyle odds_{pop} = P(+)/P(-) $

    Accuracy Terms

    Terms to describe the accuracy of diagnostic tests are conventionally given in terms of sensitivity and specificity. They have been rephrased here in terms of true positive rate, false positive rate, etc., for clarity.

    $ F_1$
    Single score measure of accuracy. The harmonic mean of precision and recall.

    $\displaystyle F_1 = \frac{2}{(1/\mbox{precision} + 1/\mbox{recall})} = 2 \times \frac{\mbox{precision} \times \mbox{recall}}{\mbox{precision + recall}}. $

    $ F_1$ is near 1 for high accuracy, near 0 for low accuracy. Also see Precision, Recall.

    $ F_2$
    Single score measure of accuracy when recall is twice as important as precision. Also see $ F_1$ , Precision, Recall.
    $ F_{0.5}$
    Single score measure of accuracy when precision is twice as important as recall. Also see $ F_1$ , Precision, Recall.
    Likelihood-ratio (negative)

    $ LR_N =FNR/TNR$ . In diagnostic screening, used for calculating the posterior odds of a positive for a subject who has tested negative:

    $\displaystyle odds_{post} = LR_N \times odds_{pop} $
    Likelihood-ratio (positive)

    $ LR_P = TPR/FPR$ . In diagnostic screening, used for calculating the posterior odds of a true positive for a subject who has tested positive:

    $\displaystyle odds_{post} = LR_P \times odds_{pop} $
    Positive Predictive Value
    Probability (with respect to a specific assumed prevalence rate) that a positive result from a diagnostic or screening procedure is a true positive. Same as precision.

    $\displaystyle PPV = \frac{TPR \times P(+)}{TPR \times P(+) + FPR \times P(-)}$

    Also see Precision

    Precision
    In information retrieval, the fraction of returned documents that are actually relevant to the query. In classification, the fraction of all instances classified as class A that are truly in class A. The same as Positive Predictive Value.

    $\displaystyle \mbox{precision} = \frac{\mbox{\# true positives}}{\mbox{\# true positives} + \mbox{\# false positives}} $

    Also see Positive Predictive Value

    Recall
    In information retrieval, the fraction of relevant documents that are returned by the query. In classification, the fraction of all true instances of class A that are classified into class A. The same as sensitivity.

    $\displaystyle \mbox{recall} = \frac{\mbox{\# true positives}}{\mbox{\# true positives} + \mbox{\# false negatives}} = TPR $

    Also see Sensitivity

    ROC Curve
    Plot of true positive rate versus false positive rate for a diagnostic test or binary classifier, as the decision threshold is varied.
    Sensitivity
    The true positive rate $ TPR$ of a diagnostic or screening procedure. Also see Recall.
    Specificity
    The true negative rate
    $ TNR = 1 - FPR$ (or the complement of the false positive rate) of a diagnostic or screening procedure.


    Footnotes

    … rate1
    The ELISA sensitivity and specificity numbers are from WHO’s report on the operational characteristics of HIV Assays [Org04, p. 18], using the lower bounds of the confidence interval. They are slightly different from the numbers Rosenau uses.
    ….2
    The definition of $ PPV$ is conventionally given in terms of sensitivity and specificity (similarly for the likelihood ratios discussed in the following section). The definitions in this article are given in terms of false positive rate, etc., since that is clearer for people reading outside their discipline.
    … negative3
    We are using a classifier in our example, but a diagnostic test would work the same way.



    ]]>
    http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/feed/ 3