For team games, a dataset of 61867 matches from OverTrack.gg was loaded into memory. Most matches represent a team of 5 vs. 5, with no possibility for ties. Since tracking matches is volutary, a not insignificant percentage of matches (7.82%) are played against a team of unrated/unknown players. The average number of matchs per unique player is about 2.338, however a decent number of players were tracked for over a hundred matches.
For multiplayer games, a dataset of 2727 games played on rr18xx.com. Abandoned and partial games were not included. Most of these games are played with previously-rated players; of these games, 76 (2.79%) were played where all players were unrated/unknown, and 595 (21.82%) were played where any players were unrated/unknown. The average number of matches per unique player is about 10.90, and the match count of game sizes 2-6 are distributed as
Due to the iterative nature of Trueskill needing more CPU time to converge, it can be expected that Openskill is faster. This is measured using the Benchmark, which samples until a reasonable confidence of the variance is attained. Each operation here represents measuring the impact of one 5v5 Overwatch game.
|Model||Speed (higher is better)||Variance||Samples|
|TrueSkill||2,962 ops/sec||±3.23%||82 runs sampled|
|Openskill/bradleyTerryFull||62,643 ops/sec||±1.09%||91 runs sampled|
|Openskill/bradleyTerryPart||40,152 ops/sec||±0.73%||91 runs sampled|
|Openskill/thurstonMostellerFull||59,336 ops/sec||±0.74%||93 runs sampled|
|Openskill/thurstonMostellerPart||38,666 ops/sec||±1.21%||92 runs sampled|
|Openskill/plackettLuce||23,492 ops/sec||±0.26%||91 runs sampled|
When measuring the impact of a 5 player free-for-all 18xx game, we see similar improvement. The models
thurstonMostellerPart are faster than their
thurstonMostellerFull counterparts because the Full implementations scale O(n^2) with the number of independent teams. Part implementations only change ratings based on the next best and next worse team (a.l. a stepladder), which is also why the accuracy
|Model||Speed (higher is better)||Variance||Samples|
|TrueSkill||2,462 ops/sec||±3.08%||84 runs sampled|
|Openskill/bradleyTerryFull||15,589 ops/sec||±0.54%||92 runs sampled|
|Openskill/bradleyTerryPart||20,381 ops/sec||±0.77%||86 runs sampled|
|Openskill/thurstonMostellerFull||14,456 ops/sec||±0.36%||93 runs sampled|
|Openskill/thurstonMostellerPart||19,889 ops/sec||±0.74%||95 runs sampled|
|Openskill/plackettLuce||12,296 ops/sec||±0.74%||89 runs sampled|
These were run on a 2013 Mac Pro 8-core 3.5 GHz 6-Core Intel Xeon E5, with nodejs 13.11.0 from a command line. The source code for this can be found in the package’s
benchmark directory. The high variance of TrueSkill is likely a result of being an iterative algorithm.
Measuring accuracy is subjective. In a 2 team game, this is simple since there are only two possible outcomes; either the red team or the blue team wins. When we exclude games where two unknown/unrated teams play (which was only 0.4% of games from OverTrack), we find the following
|Model||Correct (higher is better)|
It seems that any ranking algorithm should be able to predict more than 54% of the time, but in pairing a team against another team in Overwatch, the game’s matchmaking system attempts to pair players of similar skill.
With 18xx data, we see a different story. Many matches are either organized among friends, or randomly queued without regard to skill. Predicting the outcome of a match with 3 or more players can be objectively difficult. What is important? Predicting only 1st place? Predicting 1st, 2nd, and 3rd place? The context of the game is also important, for example in a Poker tournament the top 20-25% of players may receive a payout.
Picking Winner, Picking Top 3
First, we look at only the predicted winner of a match. In this case, we’re either right or wrong. This is simplest, and often times the only relevant metric. This is probably the easiest to understand. For this, some marginal leniancy is given toward all algorithms where when multiple players compete with the same rating, if any of them win then the result is considered a success. This reduces noise from new players whose rating sigma might be considered infinite, without biasing any method. Higher here is better.
Second, we consider that the predicted winner is in the top 3 finishers. Again, higher is better, and it can be expected that this should be correct much more often.
|Model||Predicting winner||Predicted in top 3|
Picking entire outcome
Third, we create an artifical metric for comparing the predicted outcome vs. the actual outcome. This is the sum for all players of the distance they are from their predicted rank. For example, if the predicted outcome is
[1, 2, 3, 4] and the actual outcome is
[1, 3, 2, 4], then two players are 1 away from their predicted rank, and the metric for this would be 2. For games with more players, the highest that this could be is higher, so we divide this by the maximum difference; which would be the reverse of the predicted outcome. The negative of this is then taken and added to 1, so that 0.0 means totally wrong, and 1.0 means totally right
|Model||Picking entire outcome|
Picking entire outcome with license
Fourth, we measure the difference from predicted to actual by computing the cosine similarity of the two vectors. This could be more meaningful, since we may have a game of four players, Alice
mu: 40, Betty
mu: 39.999, Cindy,
mu: 20, Daisy,
mu: 0. In this match-up, we would predict the most likely outcome of
Alice, Betty, Cindy, Daisy, however almost as likely would be
Betty, Alice, Cindy, Daisy. That outcome would be much more reasonable than an outcome of
Alice, Betty, Daisy, Cindy, as the difference in Daisy and Cindy is much greater.
|Model||Cosine Similarity to Predicted|
Here this shows that partial paring during re-ranking appears to be significantly better. Further work could be done to incorporate the sigma of each player’s rating, but this should show that
In addition to being unencumbered by patents, it has been shown by various methods that our library is faster, which would be important for large scale machine learning applications; and comparably accurate using the default parameters. It would be appropriate to tune these parameters depending on the application and the specific facets of our game data, for instance Openskill’s ε parameter could be increased to reduce the impact of rating plateaus. Discusson of how these parameters affect prediction is left for future work.
The TrueSkill package used here is ts-trueskill, a TypeScript port of the Python package.