How to test XCOM “dice rolls” for fairness

December 11th, 2012

XCOM: Enemy Unknown is a turn-based video game where the player chooses among actions (for example shooting an alien) that are labeled with a declared probability of success.


Ss 14 xl
Image copyright Firaxis Games

A lot of gamers, after missing a shot with an 80% chance of success, start asking if the game’s pseudo random number generator is fair. Is the game really rolling the dice as stated, or is it cheating? Of course the matching question is: are player memories at all fair? Would they remember the other 4 out of 5 times they made such a shot?

This article is intended as an introduction to the methods you would use to test such a question (be it in a video game, in science, or in a business application such as measuring advertisement conversion). There are already some interesting articles on collecting and analyzing XCOM data, on finding and characterizing the actual pseudo random generator code in the game, and on the importance of repeatable pseudo-random results. But we want to add a discussion pointed a bit more at analysis technique in general. We emphasize methods that are efficient in their use of data. This is a statistical term meaning that a maximal amount of learning is gained from the data. In particular we do not recommend data binning as a first choice for analysis, as it cuts down on sample size and thus is not the most efficient estimation technique.

In this article we are going to ignore issues that are unique to pseudo random number generators, such as “save scumming” and solving for the hidden generator state.

Save scumming is exploiting the fact that the sequence of coin flips is in fact deterministic: by re-starting from a save and spending a bad flip on an event they don’t care about, the player can move a good flip to an event they do care about. Statisticians are fairly clever about avoiding this by ensuring that separate processes use separate random number sources, so a change in behavior of one process can’t introduce a change in behavior in another by changing what random numbers the second process sees.

Solving for the hidden state of the generator is when, after watching a sequence of outputs of the generator, you collect enough information to efficiently recover the complete state of the generator. So no coin flip from that point forward will ever be surprising. For example see “Reconstructing Truncated Integer Variables Satisfying Linear Congruences”, Frieze, Hastad, Kannan, Lagarias, Shamir, SIAM J. Comput., Vol. 17, No. 2, April 1988. These are indeed powerful and interesting questions, but they are too specific to computer games and simulations to apply to data that comes from real world situations (such as advertisement conversion rates). So we will leave these to the experts.

A proper analysis needs at least: a goal, method and data. Our goal is to see if there is any simple systematic net bias in the XCOM dice rolls. We are not testing for a bias varying in a clever way depending on situation or history. In particular we want to see if we are missing “near sure thing” shots more than we should (so we want to know if the bias varies as the reported probability of success changes). Our method will be to test if observed summary statistics have surprisingly unlikely values. We collected data on one partial “classic ironman” run of XCOM: Enemy Unknown on an XBox 360. The data is about 250 rows, is all of the first 20 missions and can be found in “strong TSV format” here: XCOMEUstats.txt. We will use R to analyze the data.

A foremost question for the analyst is: is the data appropriate for the questions I need to answer? For example all of this data is from a single game, so it is completely inappropriate for testing if there is any per-game bias (some state set that makes some play throughs “net lucky” and others “net unlucky”). There are, however, around 250 usable rows in the data set, so the data should be sufficient to test if there is a large unchanging bias (that is assumed to not depend on the play through, game state or history). To test for smaller biases or more complicated theories you would need more data and to record more facts. As an aside: notice that I do not talk about a treatment and control set. I have found that a slavish experimental set-up (I won’t call it design) that always appeals to “treatment and control” is absolutely no substitute for actually taking the responsibility of thinking through whether your data actually supports the type of analysis you are attempting. Just because you “have a control” does not mean you have a usable experimental design, and many legitimate experiments do not have a useful group labeled as “control.”

To make the data collection simple and reliable I recorded only a few facts per row:

  • mission number: this is which mission we are on
  • shot number: this was easier to track than player turn, as it is the data-set row number
  • hit probability: the game-reported chance of success of the shot, this is the quantity we are trying to validate, reported as a percentage from 0 to 100
  • hit: the actual outcome of the shot, 1 if we hit, 0 if we missed.
  • grenade: 1 if the action was “throws grenade,” blank otherwise (could be used to condition on these rows from later analysis).
  • rocket: 1 if the action was “rocket firing,” blank otherwise (could be used to condition on these rows from later analysis).
  • headshot: 1 if the action was a “sniper headshot,” blank otherwise (could be used to condition on these rows from later analysis).
  • weapon type: what type of weapon was used, right now always one of “projectile”, “laser” or “arc thrower” (I have not yet unlocked plasma weapons).

The initial goal was to get about 100 observations per major weapon type (arc thrower is a specialist weapon, so it would take a very long time to collect a lot of data on it) from about 10 missions. No analysis was performed prior to stopping at ten missions of data collected. This is a simple (but not entirely necessary) method of avoiding a “stopping bias” as we would expect even a fair coin sequence to appear somewhat unfair on some prefixes (see, for example, the law of the iterated logarithm). So an inspection that played “until something looked off” would have a large bias for false alarms (this is in fact, unfortunately, how most commercial research is done: see Why Most Published Research Findings Are False). We will mention the nature of the false alarm effect when we discuss significance. Like “control groups” this stopping bias isn’t something mystical that can only be avoided through certain rituals- but a real and measurable effect that you need to account for.
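The size of this stopping bias can be estimated by simulation. The following is a minimal sketch and not part of the original analysis: the function name onePeekingRun, the 250 flips, the 0.05 threshold, the normal-approximation test, and the rule of checking after every flip from flip 20 onward are all assumptions chosen for illustration. It flips a fair coin and compares an analyst who tests once at the end with one who raises an alarm the first time anything looks significant.

```r
set.seed(2012)

# One experiment on a FAIR coin. Returns two logical values:
#   final:   does a single test after the last flip reject fairness?
#   peeking: does any test run along the way reject fairness?
onePeekingRun <- function(nFlips = 250, minFlips = 20, alpha = 0.05) {
  flips <- rbinom(nFlips, size = 1, prob = 0.5)
  n <- seq_len(nFlips)
  z <- (cumsum(flips) - n / 2) / sqrt(n / 4)   # z-score after each flip
  reject <- abs(z) >= qnorm(1 - alpha / 2)     # two-sided normal test
  c(final = reject[nFlips], peeking = any(reject[minFlips:nFlips]))
}

runs <- replicate(2000, onePeekingRun())
falseAlarmRates <- rowMeans(runs)   # fraction of fair coins declared unfair
print(falseAlarmRates)
```

The single end-of-run test should raise false alarms at roughly the nominal rate, while the "stop when it looks off" analyst should raise them far more often, even though every simulated coin is fair.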

First we load the data into R:

d <- read.table('http://www.win-vector.com/dfiles/XCOM/XCOMEUstats.txt',
   header=TRUE,sep='\t')
d[is.na(d)] <- 0 # replace all NA (which came from blanks) with 0

The basic plan for analysis is: choose a summary statistic and compute the significance of the value you observe of that statistic. For our first summary statistic we just use “total number of hits.”

sum(d$hit)

Which turns out to be 191. In our data set “hit” is a variable that is written as 1 if we hit and 0 if we missed. We chose this representation because if hit.probability were the actual correct percent chance of hitting then we should have:

sum(d$hit)  nearly equals  sum(d$hit.probability/100.0).

That is because a probability of a hit is just the expected value of the process that gives you 1 point for a hit and 0 for a miss plus the remarkable fact that expected values always add. The fact that expected values always add is both remarkable and an immediate consequence of the definition of expected value (“The Probabilistic Method” by Noga Alon and Joel H. Spencer calls this the “Linearity of Expectation” and devotes an entire chapter to clever uses of this fact). So what is the sum of reported hit probability in our data set?

sum(d$hit.probability/100.0)

Which turns out to be 179.73. So in my single game I actually hit a bit more often (191 times) than the game claimed I would (179.73 times). A quick question is: could this be due to rounding or truncation? We check the average difference per shot, in percentage points:

(sum(100*d$hit) - sum(d$hit.probability))/length(d$hit)

Which is 4.49, too large for rounding (which would be at most +/-1 and hopefully +/- 0.5 on average).
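The claim that the expected hit total is just the sum of the stated probabilities can be checked with a quick simulation. This is only a sketch: the probabilities below are made up for illustration and are not the XCOM data.

```r
set.seed(42)

# Hypothetical shot probabilities (not taken from the data set)
p <- c(0.35, 0.5, 0.65, 0.8, 0.8, 0.95, 1.0)

# Linearity of expectation: the expected number of hits is the sum of
# the per-shot hit probabilities.
expectedHits <- sum(p)

# Replay the same seven shots many times and average the hit totals.
totals <- replicate(20000, sum(runif(length(p)) <= p))
simulatedHits <- mean(totals)

c(expected = expectedHits, simulated = simulatedHits)
```

The two numbers should agree to within simulation noise, even though the individual shots all have different probabilities.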

This brings us to significance. What we want to know is: is this difference of about 11 hits large or small? We in fact want to know if it is large or small in a special sense: was such a sum likely or unlikely to happen (this is significance). The question is usually formed as follows: if I assume exactly what I am trying to disprove (that the game is fair, also called the null hypothesis), how often when I played would I see a difference as large as what I saw? If what I saw is rare (or hard to produce from a fair game), then I may reject the null hypothesis and say I don’t believe my original assumption that the game is fair (which was my intent in setting up the experiment in the first place). Now you can never “prove the null hypothesis” with this sort of experimental design (you can only reject the null hypothesis or fail to reject the null hypothesis). If the null hypothesis were in fact true, every time you collected more data you would likely get another equivocal result: you can’t quite reject the null hypothesis yet, “but more data may help.” So the data scientist will have to use judgement and decide where to stop at some point.

This standard interpretation of significance is why you don’t want to allow “venue shopping” or “data scumming.” Suppose I secretly played 30 different games of XCOM: Enemy Unknown and then showed you only the one play-through where “wow, that set of coin-flips was only 1 in 20 likely- the game must be unfair.” If you know only about the game I showed you, the claim is that you are seeing something that is only 1/20 likely under the null hypothesis (so a p-value of 0.05) and perhaps decent evidence against the null hypothesis (that the game is fair). However if you are then informed I had to play 30 games to find the bad example (and I only showed you the worst) the response would be: of course in 30 plays you would expect to see something that only happens one time in twenty by random chance, as you took more than 20 trials.

Of course data scientists always perform more than one analysis. If it were always a priori obvious what the exact right analysis would be, the job would be a lot easier. The saving fact is that we can use a very crude significance correction: if we ran k experiments and the best one had a p-value of p (small being more interesting) then the p-value of the “cherry pick adjusted” experiment is no more than k*p. So if we run 100 experiments and the best has a p-value of 0.0001 then even after the cherry picking correction we know we have a p-value of no more than 100*0.0001 = 0.01, which is good. The second saving grace is that p-values decrease rapidly when you add more data. If we know we want to try k experiments then collecting a log(k) multiple more data is enough to defend against data scumming or venue shopping.

The thing that is expensive in data is attempting to measure smaller clinical effect sizes. If you halve the size of the effect you are trying to measure, even for a non-existent effect like ESP (“oops I didn’t say I had a 5% advantage guessing wavy cards, I meant a 2.5% advantage”), you need to quadruple the amount of data collected. The effect size you are trying to measure enters your required sample size as an inverse square. This is why it is easy for somebody defending a non-effect to run a cooperating data scientist ragged by revising their claimed expectations.
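Both rules of thumb are easy to put into code. The sketch below is illustrative only: the helper names cherryPickAdjust and roughSampleSize are invented here, the k*p rule (capped at 1) is the standard Bonferroni correction, and the sample size formula is the usual one-sided normal approximation with assumed defaults of a 50% base rate, alpha of 0.05 and power of 0.8.

```r
# Crude cherry-pick correction: k experiments, best p-value p.
# The adjusted p-value is at most k*p (and never more than 1).
cherryPickAdjust <- function(p, k) pmin(1, k * p)

cherryPickAdjust(0.0001, 100)  # 100 experiments, best p-value 0.0001
cherryPickAdjust(0.05, 30)     # 30 play-throughs, best p-value 0.05: capped at 1

# Rough number of observations needed to detect a shift of size `effect`
# in a hit rate near `rate` (one-sided test, normal approximation).
roughSampleSize <- function(effect, rate = 0.5, alpha = 0.05, power = 0.8) {
  (qnorm(1 - alpha) + qnorm(power))^2 * rate * (1 - rate) / effect^2
}

roughSampleSize(0.05)
roughSampleSize(0.025) / roughSampleSize(0.05)  # halving the effect: 4x the data
```

Because the effect size appears squared in the denominator, every halving of the claimed effect multiplies the required data by four, no matter what the other settings are.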

Back to our XCOM analysis. We said the strategy is to propose a summary and compute its significance. There are a few great ways to do this: empirical re-sampling, permutation tests and simulation. We will use simulation. We will write new code to generate hit outcomes directly from the published probabilities:

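library(ggplot2)
# simulate one play-through: each shot hits with its stated probability
simulateC <- function(x) {  # x = probabilities
         simHits <- ifelse(runif(length(x))<=x,1,0)
         sum(simHits)
}
simulateC(d$hit.probability/100.0)
# repeat 10,000 times to get the distribution of hit totals for a fair game
drawsC <- sapply(1:10000,function(x) simulateC(d$hit.probability/100.0))
mean(drawsC)
sC <- sum(d$hit)  # the number of hits we actually observed
ggplot() + geom_histogram(aes(x=drawsC)) + geom_vline(xintercept=sC)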

The above R-code runs the simulation 10,000 times and plots the histogram of how often different numbers of hits show up. Our game experience is added to the graph as a vertical line. The graph is given below:

CountHistogram

In the above graph the mass to the right of the vertical line is how often a random re-simulation saw a count of at least as many hits as us. This is called the “one sided tail” and if there is a lot of mass in this tail then we were not that unlikely (not very significant) and if there is not much mass in this tail our measurement was very rare and very significant. The R commands to compute the mass in the tail are easy:

eC <- ecdf(drawsC)
1 - eC(sC)

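This turns out to be 2.78%. The R command “ecdf()” returns a function that computes the fraction of mass at or below a given threshold. So for a small threshold S, eC(S) gives us the mass not more than S (a “left tail”); for a large threshold L, 1-eC(L) gives us the right tail; and eC(S) + 1 - eC(L) gives us the mass in both tails (the two-sided tail). Note that 1-eC(L) counts only draws strictly greater than L, so to include ties with an integer count L (draws of at least L) use 1-eC(L-1).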
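A tiny hand-checkable example may make the tail bookkeeping concrete. The ten "simulated hit counts" below are made up for illustration.

```r
# Hypothetical simulated hit counts (ten draws, chosen by hand)
toyDraws <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 7)
eToy <- ecdf(toyDraws)

leftTail      <- eToy(2)           # fraction of draws <= 2          -> 0.3
rightStrict   <- 1 - eToy(4)       # fraction of draws >  4          -> 0.2
rightWithTies <- 1 - eToy(4 - 1)   # fraction of draws >= 4 (counts) -> 0.4
direct        <- mean(toyDraws >= 4)  # the same tail computed directly -> 0.4

c(leftTail, rightStrict, rightWithTies, direct)
```

The difference between the strict tail and the tail with ties is exactly the fraction of draws equal to the threshold, which can matter when the statistic is an integer count.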

Note: trusting the simulation significance results means you are trusting the pseudo random generator used to produce them (in this case R’s generator). The only ways to avoid trusting your test pseudo random generator are to use a trusted true-random entropy source or to deliberately pick a test where you know the exact expected theoretical shape of the cumulative distribution. Statisticians are the masters of exact theoretical tests and usually pick from a very limited set of summary statistics (counts, means, standard deviations) so they can apply known theoretical test distributions (t-tests, F-tests and so on).

Our p-value of 0.0278 is considered significant (the usual rule of thumb is that p ≤ 0.05 is considered significant). Notice we are using an empirical p-value (re-simulating generation of hits from the assumed distribution) instead of a parametric p-value (assuming a distribution of the outcomes and using the theoretical mean and a theoretical variance). Empirical p-values are much easier to explain (they are a sampling of what would exactly happen if you repeated the null-experiment again and again) and are so easy to compute that there is really no reason to use the distributional methods (Normal, Student-t, chi-squared and so on) until you are repeating the calculation very many times. It saves one level of explanation to directly estimate the significance through re-simulation rather than to bring in “the standard approximations” (and their attendant assumptions).
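For comparison, here is a sketch of what the parametric route would look like next to the empirical one. It assumes the shots are independent, so that under the null hypothesis the hit total has mean sum(p) and variance sum(p*(1-p)), and then uses a Normal approximation with a continuity correction. The probabilities and the observed total are made up for illustration; they are not the XCOM data.

```r
set.seed(7)

# Hypothetical inputs: 250 shots and an observed hit total
p <- rep(c(0.45, 0.6, 0.75, 0.9, 1.0), times = 50)
observed <- 197

# Parametric right tail: Normal approximation to the sum of independent
# hit/miss outcomes, with a continuity correction for integer counts.
mu <- sum(p)
sigma <- sqrt(sum(p * (1 - p)))
pNormal <- pnorm(observed - 0.5, mean = mu, sd = sigma, lower.tail = FALSE)

# Empirical right tail: re-simulate the shots and count totals at least
# as large as the observed one (ties included).
sims <- replicate(10000, sum(runif(length(p)) <= p))
pEmpirical <- mean(sims >= observed)

c(normal = pNormal, empirical = pEmpirical)
```

On an example like this the two estimates should land close together; the empirical version simply needs fewer assumptions to explain.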

One important consideration is that we didn’t specify before running this experiment that we thought we would experience above-average luck (in fact we came in thinking we were getting ripped off, so we were looking for a low hit count). So we should be looking either at “two sided tails” (accept mass from both ends of the distribution, measuring how far we were from the mean in absolute value terms) or at least double our p-value to 0.0556 to respect that we implicitly ran two experiments. The p-value for the two sided tail is computed as follows:

expectation <- sum(d$hit.probability/100.0)
diff <- abs(sum(d$hit)-expectation)
eC(expectation-diff) + (1-eC(expectation+diff))

Which is 0.0646 (even worse than the 2*p correction). What this means is that if we had started the experiment with the hypothesis that XCOM was under-reporting hit probabilities (or equivalently cheating in our favor) we would have collected just enough data to reject the null hypothesis (that XCOM is perfectly fair) according to standard clinical standards (which I have never liked, as they are far too lenient). However we started with the hypothesis that XCOM was over-reporting hit probabilities (or cheating in its own favor) and switched hypotheses when we saw our hit count was high. Under this situation we did not collect enough data to reject the null hypothesis, as the 2-sided p-value is 0.0646 and the corrected 1-sided p-value becomes 0.0556 (both above the middling 0.05 standard). We would not expect to have to double our data to get better p-values (as p-values fall fast when you add data), but if we were to continue to collect data we should know our hypothesis has now been taken from the data (so we should probably use the 2-sided p-value and still multiply by an additional 2 as we have already run a few experiments or done some venue shopping on this data). Also, remember if XCOM is fair all experiments will look equivocal: they fail to prove it is unfair but do not quite look fair. So really we have seen nothing to be suspicious about at this point.

It is a strange but true fact that statistics is an intentional science: what you know and how much of the data you have snooped really does affect the actual objective significances you experience. If you fail to put in some sort of compensation for how many experiments you have run and how often you switched measurement or hypothesis you will mis-report the ideal theoretical significance of a single clean room experiment (that you really did not run) as the significance of the entangled combination of measurements you actually did implement.

Part of the reason we are being so cagey accepting differences (but you always should be so), is that we strongly suspect (due to the forensic science study of Yawning Angel) that the generator is in fact fair. At least it is fair in a total sense (we are not testing for state-driven cheating or streaks).

Another summary we could look at (instead of total counts) is total surprise. This is a metric more sensitive to effects like “I swear I miss 80% shots half the time, how is that fair?” The surprise of an outcome is the negative of the logarithm (base 2) of the probability of the given outcome. Hitting an 80% shot has low surprise: -log_2(0.8) = 0.32 whereas missing an 80% shot has a high surprise -log_2(1-0.8) = 2.32. The total surprise for the shot sequence I observed is given by:

surprise <- function(x,o) {   # x = probabilities, o=actual outcomes
  sum(ifelse(o>=0.5,-log(x,base=2),-log(1-x,base=2)))
}
s <- surprise(d$hit.probability/100.0,d$hit)

This turns out to be 153.7. So we have a new summary statistic, we now need to know if it is significantly large or small. The theoretical expected surprise of a sequence of probabilities is a quantity called the entropy and this is given by:

entropy <- function(x) {  # x = probabilities
  # entropy in bits of a single shot; a certain outcome (x=0 or x=1) has entropy 0
  ifelse(x<=0,0.0,ifelse(x>=1,0,-x*log(x,base=2) - (1-x)*log(1-x,base=2)))
}
sum(entropy(d$hit.probability/100.0))

The information theoretic entropy is 164.8. So our experienced surprise is in fact lower than expected: outcomes tended to go the majority direction slightly more often than expected (not less, as missing a lot of near sure things would entail). We can again use empirical simulation to get the distribution of surprise under the null hypothesis and estimate the significance:

simulate <- function(x) {
         simHits <- ifelse(runif(length(x))<=x,1,0)
         surprise(x,simHits)
}
simulate(d$hit.probability/100.0)
draws <- sapply(1:10000,function(x) simulate(d$hit.probability/100.0))
ggplot() + geom_density(aes(x=draws),adjust=0.5) + geom_vline(xintercept=s)

Again we see that we are not a very rare event in terms of the possible distributions of surprise:

EntropyDensity

In fact even the one-sided p-value is quite large (and poor) at 0.1 (e <- ecdf(draws); e(s)), let alone the more appropriate two-sided tail probability.
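One way to compute such a two-sided tail for any simulated statistic is sketched below. The helper name twoSidedP is invented here, it measures distance from the mean of the simulated draws (rather than from the theoretical entropy), and the probabilities in the second half are made up for illustration.

```r
# Two-sided empirical p-value: the fraction of simulated values that are at
# least as far from the simulated mean as the observed value is.
twoSidedP <- function(draws, observed) {
  center <- mean(draws)
  mean(abs(draws - center) >= abs(observed - center))
}

# Hand-checkable case: the center is 3, and two of the five draws (1 and 5)
# are at least as far from 3 as the observed value 5.
twoSidedP(c(1, 2, 3, 4, 5), 5)   # 0.4

# Hypothetical surprise simulation (made-up probabilities, not the XCOM data)
set.seed(11)
p <- rep(c(0.45, 0.6, 0.75, 0.9), times = 60)
surpriseOf <- function(x, o) sum(ifelse(o >= 0.5, -log2(x), -log2(1 - x)))
sims <- replicate(5000, surpriseOf(p, ifelse(runif(length(p)) <= p, 1, 0)))

# Stand-in observation whose one-sided (left) tail is about 0.1
observed <- unname(quantile(sims, 0.10))
twoSidedP(sims, observed)
```

For a roughly symmetric distribution the two-sided value comes out near twice the one-sided value, so a one-sided tail of 0.1 is even further from significant once both directions are counted.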

An additional thing to look for is: can we build a useful probability re-mapping table for the reported probabilities? We know the totals are mostly right and the outcomes of near-certain and rare events are largely right. Could there be some band of predictions that is biased (say the 70% to 80% range)? This is also easy to do in R:

ggplot(data=d,aes(x=hit.probability/100.0,y=hit)) +
   geom_point(size=5,alpha=0.5,position=position_jitter(width = 0.01, height = 0.01)) +
   geom_smooth() + geom_abline(slope=1) + theme(aspect.ratio=1) +
   scale_x_continuous(limits=c(0,1)) + scale_y_continuous(limits=c(0,1))

This produces the following figure:

SmoothCurve

The x-axis is the game-reported hit probability, the y-axis is the observed outcome (always either 0 or 1 as each hit either happens or does not). Each black circle represents one of our recorded observations. The blue line with error-band is the smoothed fit (a local regression, which is what geom_smooth() uses by default on a data set this small). It is estimating the fraction of shots that hit as a function of the stated predicted hit probability. Early on the blue curve is low because most black dots are at y=0; for higher x the curve pulls up in proportion to the fraction of points at y=1. Notice how close the blue curve is to the black line y=x, and that the error band hardly pulls off the black line except in the 0.5 to 0.7 region. So maybe mid-values are slightly under predicted, but we don’t have enough data to say so (and more data would probably just show a new tighter correspondence instead of confirming this divergence). A similar plot can be made using the GAM package, but it is harder to get the error bars.

This graph, which is the kind of thing the data scientist should look at, points out yet another data deficiency in our study. The distribution of shot probabilities attempted is what was best for play, possibly not what is best for analysis (a property of all real data when you don’t get complete control over the experimental design). The distribution (again represented as a density) of the shots I attempted is given below:

ProbDist

(for how to read density plots see My Favorite Graphs).
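The re-mapping question can also be asked numerically without binning. The sketch below is a swapped-in technique, not the method used in this article: it fits a logistic recalibration model, logit(true hit rate) = a + b*logit(reported hit rate), using base R’s glm(). A fair game should give a near 0 and b near 1. The data here is simulated from a fair generator with made-up probabilities; reported probabilities of exactly 0 or 1 would have to be dropped or clipped first, since their logit is infinite.

```r
set.seed(3)

# Hypothetical reported probabilities, strictly between 0 and 1
p <- runif(250, min = 0.3, max = 0.98)
# Outcomes drawn from a FAIR generator: 1 = hit, 0 = miss
hit <- ifelse(runif(length(p)) <= p, 1, 0)

# Logistic recalibration: intercept a and slope b on the logit scale
fit <- glm(hit ~ qlogis(p), family = binomial)
est <- coef(summary(fit))
print(est)

# Approximate 95% intervals: estimate +/- 2 standard errors
intervals <- cbind(lower = est[, "Estimate"] - 2 * est[, "Std. Error"],
                   upper = est[, "Estimate"] + 2 * est[, "Std. Error"])
print(intervals)
```

An intercept interval well away from 0 would suggest an overall bias, and a slope interval well away from 1 would suggest that extreme probabilities are systematically over- or under-stated. Like the smoothed plot, this uses every shot at its own stated probability rather than collapsing shots into bins.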

The core purpose of the article hasn’t so much been the analysis of XCOM itself, but to show how to analyze this type of data. We have emphasized methods that can deal with many different probabilities at the same time (as opposed to binning) in the interest of “statistical efficiency.” That is: to get the most results out of what little data we have. This is always important when you are producing annotated data, which is always going to be per-unit expensive, even in this “age of big data.” Finding usable relations and biases is the exciting part of data science, but one of the responsibilities of the data scientist is protecting the rest of their organization from the ruinous effect of pursuing spurious relations. You really don’t want to report a relation where there was none.


  1. Paul F
    December 11th, 2012 at 10:35 | #1

    Saw this on HN browsing the new queue. Very well-crafted article; I might end up using it in my classroom some time in the future when my senior math classes talk about probability.

    If you ever want to do a follow-up, it might be worthwhile to look into Nintendo’s Fire Emblem series of strategy games. Since they focus on knights and whatnot, characters counterattack when attacked. Before you confirm your action, you’re presented with the probabilities that your character and the enemy character will land a hit. For both the “good guys” and the “bad guys”, the “percentages” are incorrect. High percentages are much more likely to land, low ones much less so.

    (The algorithm for how they do this is easy, but I thought you might want the fun of working it out on your own. If you don’t care to do so, feel free to shoot me an email.)

    I can only imagine this is to make the game *feel* more fair to the statistically challenged. “Man, I missed my 67% chance, but the computer hit on its 33%? This game is so cheap!” In Fire Emblem’s math, that 11% chance would instead happen 4.75% of the time. Missing a 90% then nailing a 10% wouldn’t be a 1 in 100 shot; instead, it’s 1 in 2,500.

  2. December 11th, 2012 at 11:20 | #2

    @Paul F
    Thanks, I was hoping that XCOM was cheating, because there are a bunch of interesting questions about engineering player experience that you could then think about. I might look at this as an excuse to get into Fire Emblem (been meaning to, I liked Advance Wars).

  3. Steven Cole
    December 11th, 2012 at 13:10 | #3

    John, this is quite cool. And the whole time, I kept thinking: “John found a way to buy video games as a business expense.” Awesome.

    As far as cheating games go, I recall hearing a talk at GDC a few years ago about the console game “Civilization Revolutions”… And how focus group players *hated* how often they lost battles where they had an 80% success probability. And so the game designers changed the rules, making an 80% success an almost sure thing, where a loss happens about 1 out of 20 times, rather than 1 out of 5 times.

    Happy players tend to spend money on games, after all, so it became far more important to satisfy the players than the statisticians. ;)

  4. Yawning Angel
    December 11th, 2012 at 13:57 | #4

    Oh cool, someone else that uses R and ggplot2. Nice writeup.

    There’s a few things I explicitly did not address in my analysis (primarily since I was trying to keep it accessible to people who aren’t into math or theoretical computer science), mainly centered around the specific algorithm they used.

    The two main things I omitted are:
    1) The algorithm used has serial correlation problems.
    2) Because they used a power of 2 coefficient for m, the lower bits of the PRNG output have rather poor entropy.

    I don’t particularly believe the first thing is a problem because the actual PRNG output is rather opaque. As long as the generator is sufficiently well equidistributed, and is not overly streaky, I believe it is suitable for something like this. While weak, the XCOM PRNG does satisfy those criteria (though more time spent with dieharder or TestU01 is needed to conclusively prove the “not overly streaky” part, though in my “sample output” it is passing diehard_operm5).

    The 2nd point I suspect is one of implementation convenience. It would be easy to correct (at a minimum, “foo.i = (foo.i >> 9) & 0x7FFFFF | 0x3F800000;” reduces the impact of their power of 2 m choice), but at that point it’s also just as easy to use a PRNG that’s not the minimum standard in terms of quality (A combined xorshift/Weyl generator is quite trivial to implement). As it stands, the deficiency is still something that would be near impossible for an end user to detect.

    It was interesting to see how other people approach the problem. Due to past experiences I tend to reach for my programmer tools rather quickly when I want to figure out how things work.

  5. December 11th, 2012 at 14:10 | #5

    @Yawning Angel
    Loved your article, it and your additional points are both excellent. Similar issue with this article: I had to say I am testing only gross totals (counts or surprise), not serial correlation. You can test for serial correlation- but as you stated we already know the PRNG has some known problems there. Finally I explicitly call your work science, whereas mine was just statistics.

  6. December 11th, 2012 at 14:39 | #6

    @Yawning Angel
    Okay, I have the itch. Without a *lot* of data (which you can produce since you tapped into the XCOM PRNG) it is hard to track the serial correlations in a simple principled manner (especially with the varying level of censorship caused by the hit probability varying). In principle with access to the PRNG you can record streaks or every observed outcome sequence of length-k and get at the issue. What I would suggest working from the outside (where it is expensive to produce data and the data is censored) is to upgrade the standard time-series methods used to detect serial correlation to lean on Tobit regression (to deal with the fact you don’t see what was rolled, just if it was under the current hit threshold or not). Looks like the R package for that is this one: http://cran.r-project.org/web/packages/censReg/vignettes/censReg.pdf .

  7. Yawning Angel
    December 11th, 2012 at 16:00 | #7

    @jmount
    Unfortunately I have limited time I can dedicate to this, but since you’re interested, I ported the PRNG algorithm to R for you so you can generate as many data points as you want.

    http://www.schwanenlied.me/yawning/XCOM/XCOMPRNG.R

    You’ll need bitops off CRAN, and the code is not spectacular but it does work. It reseeds the PRNG each time you call the routine, but the function returns a vector for a reason. It should be trivial to modify if you need it to behave differently.

  8. December 11th, 2012 at 21:22 | #8

    @Yawning Angel
    Thanks!

    Having direct access to the pseudo random number generator makes things a lot easier. First I can insist that all simulated shots have the same hit percentage (which makes things much easier) and second I can generate a lot more data. As far as I can tell this is a weak serial correlation- it doesn’t stick out like a sore thumb for my simplistic test until I get to around 50,000 data points. Here is the R-code to find the serial correlation:

    > probs <- xcomPrng(50000);
    > dHit <- ifelse(probs<=0.5,1,0);
    > tab <- table(dHit[1:(length(dHit)-1)],dHit[2:length(dHit)]);
    > fisher.test(tab);
    
    	Fisher's Exact Test for Count Data
    
    data:  tab 
    p-value = 0.001404
    alternative hypothesis: true odds ratio is not equal to 1 
    95 percent confidence interval:
     0.9116370 0.9781879 
    sample estimates:
    odds ratio 
     0.9443433 
    
    > tab
       
            0     1
      0 11958 12676
      1 12676 12689
    

    The table at the end is called a contingency table, and Fisher’s test tells us the significance of how far the counts are from being independent. You read the table row by column: so row 0 column 1 is how many misses were followed by a hit (rows are the earlier outcome, columns the later one, and 1 means a hit). Notice there are too few misses followed by misses. The damning evidence is the low p-value. The same code run with R’s own PRNG looks okay:

    > probs <- runif(50000);
    > dHit <- ifelse(probs<=0.5,1,0);
    > tab <- table(dHit[1:(length(dHit)-1)],dHit[2:length(dHit)]);
    > fisher.test(tab);
    
    	Fisher's Exact Test for Count Data
    
    data:  tab 
    p-value = 0.9287
    alternative hypothesis: true odds ratio is not equal to 1 
    95 percent confidence interval:
     0.9638094 1.0341483 
    sample estimates:
    odds ratio 
     0.9983646 
    
    > tab
       
            0     1
      0 12630 12504
      1 12505 12360
    

    So if your port of the XCOM PRNG to R is correct then we see the XCOM PRNG (as it is used with the re-seeding and so on) is of lower quality than R’s PRNG. However, I couldn’t find a problem until I looked at a large amount of data- so I am not sure if players will see this or not (I may not have tried clever enough tests).

    I agree with the conclusions of your article: the XCOM PRNG is not great, but I doubt the deficiencies are actually player visible.

    And testing the sequence of the last few outcomes versus a most recent outcome:

    > probs <- xcomPrng(50000);
    > dHit <- ifelse(probs<=0.5,1,0);
    > tab <- table(dHit[4:length(dHit)],
       paste(dHit[3:(length(dHit)-1)],
        dHit[2:(length(dHit)-2)],
        dHit[1:(length(dHit)-3)],sep=''))
    > tab
       
         000  001  010  011  100  101  110  111
      0 2845 2962 3140 3058 3265 2822 3376 3193
      1 2962 3236 2947 3510 2933 3635 3192 2921
    > fisher.test(tab,simulate.p.value=T)
    
    	Fisher's Exact Test for Count Data with 
             simulated p-value (based on 2000 replicates)
    
    data:  tab 
    p-value = 0.0004998
    alternative hypothesis: two.sided 
    

    Note: these longer tests I am running here depend more and more on R’s pseudo random source being high quality. In reality we are testing if two pseudo-random sources have similar behavior. There are ways we can fail to achieve a meaningful result. Two obvious bad possibilities are as follows. They could have the same behavior and both be wrong (be related bad implementations), which we would falsely record as “both good.” Or R’s pseudo random generator could be worse than XCOM’s, causing us to misattribute a difference in behavior to an XCOM fault. At some point we would have to pick a summary whose theoretical distribution we know. Thus we would avoid introducing a second pseudo random generator when performing tests.
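
    One possible sketch of such a summary, assuming we are happy with a large-sample approximation, is Pearson’s chi-squared test on the same 2 by 8 table. With its default settings `chisq.test` compares the statistic to the theoretical chi-squared distribution and draws no random numbers, so no second pseudo random generator enters the test. The counts are re-entered from the table printed above, and the name `histTab` is made up for this illustration.

    ```r
    # Counts copied from the 2 x 8 table above
    # (rows: most recent outcome, columns: the three outcomes before it).
    histTab <- matrix(c(2845, 2962, 3140, 3058, 3265, 2822, 3376, 3193,
                        2962, 3236, 2947, 3510, 2933, 3635, 3192, 2921),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(latest = c("0", "1"),
                                      history = c("000", "001", "010", "011",
                                                  "100", "101", "110", "111")))

    # Default simulate.p.value = FALSE: the p-value comes from the
    # chi-squared distribution, not from Monte Carlo replicates.
    res <- chisq.test(histTab)

    res$parameter  # degrees of freedom: (2 - 1) * (8 - 1) = 7
    res$statistic  # Pearson chi-squared statistic
    res$p.value    # p-value from the theoretical distribution
    ```

    This trades the exactness of Fisher’s test for an asymptotic approximation, which seems a reasonable trade at roughly 50000 observations.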

  9. Yawning Angel
    December 11th, 2012 at 23:50 | #9

    @jmount
    Thanks for the analysis. Since you’re only calling my routine once per test, the way I chose to seed the LCG is identical to a single game of XCOM (seed once at the start of the routine, do not re-seed), so the only issues would be bugs in my porting. The code is straightforward enough that I doubt there are any.

    Unless you have any objections I’ll link to this as further reading from my initial writeup, since including the comments it covers most of what I wanted to look at if I ever revisited it.

  10. Brian Slesinsky
    December 11th, 2012 at 23:52 | #10

    Nice article!

    I don’t understand this: “In fact even the one-sided p-value is quite large (and poor) at 0.01.”

    Why is 0.01 considered large in this case? I’m wondering if it’s a typo, since when eyeballing the graph for surprise, the area under the curve below the line looks a lot larger.

  11. December 12th, 2012 at 07:14 | #11

    @Brian Slesinsky
    Typo on my part: the p is 0.1, which is “large” as it is greater than the traditional 0.05. Thanks!

  12. December 12th, 2012 at 07:14 | #12

    @Yawning Angel
    Wow, yes thanks!

  13. SR
    December 22nd, 2012 at 12:54 | #13

    Great article!

    For games like this, I’ve always assumed that the display simply lies: if the displayed figure is 80%, the actual figure used is something else. I’m not sure of an easy way to test this, even with the large amounts of raw data here.

    Then again, if the PRNG fails the streaky-ness test, then this doesn’t matter much.

  14. December 24th, 2012 at 12:31 | #14

    And the conclusion of the XCOM: EU Xbox 360 Classic/Ironman run that donated this data. The game doesn’t have to do anything as subtle as mis-reporting probabilities to cheat.

    I finally won on Classic/Ironman, and right after it played the victory cut scenes it switched over to the “too many countries have left the council” cut scenes (which was not the case: I had lost only Egypt and all other countries were calm at 1-terror bar each) and scored the game as a defeat. The ironman save is right before this, so I watch this unfold again and again, but I cannot change the scored outcome.

    Still, overall I give the game an “A.” It had its problems, may be simplified from its ancestors, and may not be to everybody’s taste. But, it felt like a game. And not all current games have that feeling.

    Update 1/25/2013- an update patch converted the lost game into a win. Yey!

  15. Danielle
    December 27th, 2012 at 12:47 | #15

    I’m currently working on my Masters in Methodology and Statistics, and found this article to be an amazingly well-crafted treatment of statistics as a whole — treating power, expected value, interpretation of the null hypothesis, etc., in an easy-to-follow manner. Providing the data and the R code to analyze it — amazing.

    I love it when my interests overlap! Thanks for this article.

  16. December 27th, 2012 at 13:32 | #16

    @Danielle
    Wow thank you. Things like that were my goal, but you never know if you really get close in such things.
