This presents a JavaScript implementation of the Weng-Lin rating system, using the various models described in their paper. The principal difference between this and the more popular Microsoft TrueSkill scheme is that it is not encumbered by patents and licensing. Additionally, it is an order of magnitude faster and delivers competitively accurate match predictions.
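For reference, here is a minimal sketch of how the library might be used, assuming the rating() and rate() exports described in the package documentation, with teams listed in order of placement; the exact API may differ.

```js
// Minimal usage sketch, assuming the package exposes rating() and rate().
const { rating, rate } = require('openskill')

const alice = rating()          // a new, unrated player (mu: 25, sigma: 25/3 by default)
const bob = rating({ mu: 30 })  // a previously rated player

// Teams are arrays of players, listed in order of placement (winner first).
const [[newAlice], [newBob]] = rate([[alice], [bob]])
console.log(newAlice, newBob)
```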

Dataset

For team games, a dataset of 61867 matches from OverTrack.gg was loaded into memory. Most matches are 5 vs. 5 team games with no possibility of ties. Since tracking matches is voluntary, a not-insignificant percentage of matches (7.82%) are played against a team of unrated/unknown players. The average number of matches per unique player is about 2.338, although a fair number of players were tracked for over a hundred matches.

For multiplayer games, a dataset of 2727 games played on rr18xx.com was used. Abandoned and partial games were not included. Most of these games are played with previously rated players: 76 games (2.79%) were played where all players were unrated/unknown, and 595 (21.82%) were played where any player was unrated/unknown. The average number of matches per unique player is about 10.90, and the match counts for game sizes of 2 to 6 players are distributed as follows:

Players Matches
2 40
3 595
4 1300
5 619
6 173

Faster

Because TrueSkill is iterative and needs more CPU time to converge, Openskill can be expected to be faster. This is measured with Benchmark.js, which samples until a reasonable confidence in the variance is attained. Each operation here represents processing one 5v5 Overwatch game.

Model Speed (higher is better) Variance Samples
TrueSkill 2,962 ops/sec ±3.23% 82 runs sampled
Openskill/bradleyTerryFull 62,643 ops/sec ±1.09% 91 runs sampled
Openskill/bradleyTerryPart 40,152 ops/sec ±0.73% 91 runs sampled
Openskill/thurstonMostellerFull 59,336 ops/sec ±0.74% 93 runs sampled
Openskill/thurstonMostellerPart 38,666 ops/sec ±1.21% 92 runs sampled
Openskill/plackettLuce 23,492 ops/sec ±0.26% 91 runs sampled
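A sketch of how one of these measurements could be set up with Benchmark.js (the team construction here is hypothetical; the real harness is in the package's benchmark directory):

```js
// Hypothetical Benchmark.js setup measuring one 5v5 rating update per operation.
const Benchmark = require('benchmark')
const { rating, rate } = require('openskill')

const makeTeam = () => Array.from({ length: 5 }, () => rating())
const red = makeTeam()
const blue = makeTeam()

new Benchmark.Suite()
  .add('openskill 5v5', () => {
    rate([red, blue]) // one operation = processing one 5v5 game
  })
  .on('cycle', (event) => console.log(String(event.target)))
  .run()
```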

When measuring the impact of a 5-player free-for-all 18xx game, we see a similar improvement. The bradleyTerryPart and thurstonMostellerPart models are faster than their bradleyTerryFull and thurstonMostellerFull counterparts because the full implementations scale O(n^2) with the number of independent teams. The partial implementations only adjust ratings based on the next-best and next-worse team (i.e. a stepladder), which is also why their accuracy can differ from the full-pairing versions.

Model Speed (higher is better) Variance Samples
TrueSkill 2,462 ops/sec ±3.08% 84 runs sampled
Openskill/bradleyTerryFull 15,589 ops/sec ±0.54% 92 runs sampled
Openskill/bradleyTerryPart 20,381 ops/sec ±0.77% 86 runs sampled
Openskill/thurstonMostellerFull 14,456 ops/sec ±0.36% 93 runs sampled
Openskill/thurstonMostellerPart 19,889 ops/sec ±0.74% 95 runs sampled
Openskill/plackettLuce 12,296 ops/sec ±0.74% 89 runs sampled
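The scaling difference between the full and partial models can be illustrated with a rough count of pairwise comparisons (a conceptual sketch only, not the library's code):

```js
// Conceptual sketch: comparisons performed for n independent teams.
const fullPairs = (n) => (n * (n - 1)) / 2 // every team against every other team: O(n^2)
const partialPairs = (n) => n - 1          // only adjacent teams in the standings: O(n)

console.log(fullPairs(5), partialPairs(5)) // 10 vs. 4
```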

These were run on a 2013 Mac Pro with a 3.5 GHz 6-core Intel Xeon E5, using Node.js 13.11.0 from the command line. The source code for this can be found in the package's benchmark directory. The high variance of TrueSkill is likely a result of it being an iterative algorithm.

Accuracy

Measuring accuracy is subjective. In a 2-team game it is simple, since there are only two possible outcomes: either the red team or the blue team wins. When we exclude games in which two unknown/unrated teams play (only 0.4% of the games from OverTrack), we find the following:

Model Correct (higher is better)
bradleyTerryFull 54.336254%
bradleyTerryPart 54.336254%
thurstonMostellerFull 54.143135%
thurstonMostellerPart 54.143135%
plackettLuce 54.336254%
trueskill 54.193443%

It might seem that any ranking algorithm should be able to predict correctly more than 54% of the time, but when pairing one team against another in Overwatch, the game's matchmaking system deliberately matches players of similar skill, which limits how predictable the outcomes can be.
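The prediction rule for a 2-team game can be as simple as comparing aggregate team strength. Here is a sketch, under the assumption that the team with the higher total mu is predicted to win; the evaluation code may use a different rule.

```js
// Assumed prediction rule: the team with the larger summed mu is the favourite.
const teamMu = (team) => team.reduce((sum, player) => sum + player.mu, 0)

const predictedWinner = (red, blue) =>
  teamMu(red) >= teamMu(blue) ? 'red' : 'blue'
```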

With the 18xx data, we see a different story. Many matches are either organized among friends or randomly queued without regard to skill. Predicting the outcome of a match with 3 or more players is also harder to evaluate: what is important? Predicting only 1st place? Predicting 1st, 2nd, and 3rd place? The context of the game matters as well; for example, in a poker tournament the top 20-25% of players may receive a payout.

Picking Winner, Picking Top 3

First, we look only at the predicted winner of a match. In this case, we're either right or wrong. This is the simplest and easiest metric to understand, and often the only relevant one. Some marginal leniency is given to all algorithms: when multiple players compete with the same rating, if any of them wins, the result is considered a success. This reduces noise from new players, whose rating sigma might be considered effectively infinite, without biasing any method. Higher here is better.

Second, we consider whether the predicted winner finished in the top 3. Again, higher is better, and this can be expected to be correct much more often.

Model Predicting winner Predicted in top 3
TrueSkill 45.14% 87.82%
Openskill/bradleyTerryFull 44.44% 87.34%
Openskill/bradleyTerryPart 45.14% 87.56%
Openskill/thurstonMostellerFull 39.12% 85.00%
Openskill/thurstonMostellerPart 44.15% 86.98%
Openskill/plackettLuce 44.92% 87.16%
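The leniency rule for tied ratings described above might look something like this (a sketch; the player fields `mu` and `rank` are assumptions):

```js
// Tie leniency: if several players share the top predicted rating,
// a win by any one of them counts as a correct prediction.
const winnerPredictedCorrectly = (players) => {
  const topMu = Math.max(...players.map((p) => p.mu))
  return players.some((p) => p.mu === topMu && p.rank === 1)
}
```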

Picking entire outcome

Third, we create an artificial metric comparing the predicted outcome to the actual outcome: the sum, over all players, of the distance between their actual and predicted rank. For example, if the predicted outcome is [1, 2, 3, 4] and the actual outcome is [1, 3, 2, 4], then two players are 1 away from their predicted rank, so the metric is 2. Because this sum can be larger in games with more players, we divide it by the maximum possible difference, which corresponds to the reverse of the predicted outcome. This value is then subtracted from 1, so that 0.0 means totally wrong and 1.0 means totally right.
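As a sketch, this metric could be computed as follows (a hypothetical helper following the description above, not part of the library):

```js
// Rank-distance metric: 1 minus the total rank displacement, normalized by
// the displacement of the fully reversed (worst-case) outcome.
const outcomeScore = (predicted, actual) => {
  const distance = (a, b) =>
    a.reduce((sum, rank, i) => sum + Math.abs(rank - b[i]), 0)
  const worstCase = [...predicted].reverse() // e.g. [1, 2, 3, 4] -> [4, 3, 2, 1]
  return 1 - distance(predicted, actual) / distance(predicted, worstCase)
}

console.log(outcomeScore([1, 2, 3, 4], [1, 3, 2, 4])) // 1 - 2/8 = 0.75
```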

Model Picking entire outcome
TrueSkill 0.464
Openskill/bradleyTerryFull 0.455
Openskill/bradleyTerryPart 0.456
Openskill/thurstonMostellerFull 0.423
Openskill/thurstonMostellerPart 0.451
Openskill/plackettLuce 0.458

Picking entire outcome with cosine similarity

Fourth, we measure the difference between the predicted and actual outcomes by computing the cosine similarity of the two vectors. This can be more meaningful. Consider a game of four players: Alice (mu: 40), Betty (mu: 39.999), Cindy (mu: 20), and Daisy (mu: 0). In this match-up we would predict the most likely outcome to be Alice, Betty, Cindy, Daisy, but Betty, Alice, Cindy, Daisy is almost as likely. Either would be much more reasonable than Alice, Betty, Daisy, Cindy, since the gap between Daisy and Cindy is far greater.
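Cosine similarity itself is straightforward to compute; a sketch follows, with the assumption that the vectors compared are the predicted and actual outcomes for each player:

```js
// Cosine similarity between two equal-length numeric vectors.
const cosineSimilarity = (a, b) => {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0)
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0))
  return dot / (norm(a) * norm(b))
}

console.log(cosineSimilarity([1, 2, 3, 4], [1, 3, 2, 4])) // ≈ 0.967
```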

Model Cosine Similarity to Predicted
TrueSkill 0.9871333882702259
Openskill/bradleyTerryFull 0.9670280189093087
Openskill/bradleyTerryPart 0.991254308108686
Openskill/thurstonMostellerFull 0.9292664200294969
Openskill/thurstonMostellerPart 0.9940281068887633
Openskill/plackettLuce 0.9691814764747895

Here this shows that partial pairing during re-ranking appears to perform significantly better on this metric. Further work could be done to incorporate the sigma of each player's rating, but this should show that the predictions are already reasonable.

Summary

In addition to being unencumbered by patents, it has been shown by various methods that our library is faster, which would be important for large-scale machine learning applications, and comparably accurate using the default parameters. It would be appropriate to tune these parameters depending on the application and the specific facets of the game data; for instance, Openskill's ε parameter could be increased to reduce the impact of rating plateaus. Discussion of how these parameters affect prediction is left for future work.

Availability

This Openskill package is available on npm as openskill. It is also available as an Elixir package on hex.pm as openskill.

The TrueSkill package used here is ts-trueskill, a TypeScript port of the Python package.