Introducing Player Ratings

May 23rd
written by Barroi

Ratings are often controversial, but to some degree necessary, or at least helpful. The goal of a rating system, whether it focuses on teams or players, should be to add some sort of value. It should be intuitive and accurate with regard to its purpose. A system that has no clear purpose, or cannot accurately represent its purpose, has no value. The esports scene has many examples of failed rating systems, and I am sure you have seen some of them.

EDIT: The rating system has been adjusted with a focus on single-match ratings, whose fluctuations had previously been unreasonably high. The system is now more robust against single-match data without losing any accuracy for cumulative player data. To do this I restructured the data in a way that gave me more data points for almost every hero, which in itself improves rating accuracy for those heroes. The resulting rating distribution has thus come closer to the traditional bell curve.

The Purpose

In Overwatch we currently do not have as many stats as we would wish for, but I’d like to think that Winston’s Lab is quite successful in squeezing out every possible number. During the last few months I have continually tried to develop new metrics that could in some way express a player’s performance. Over time I accumulated a lot of different values that express different things, and it became hard to find your way through all those numbers to figure out how good a player actually is. This is where this player rating comes in.

The purpose of this rating system is to have a single number that expresses whether a player performed well or poorly, was outstanding or underwhelming, carried his team or dragged it down. It evaluates a player’s performance in comparison to the average esports player.

The Process

During the last few weeks I have spent almost all of my time gathering data. I recorded data for 200 hours of VoDs, which equates to 2,400 hours’ worth of player data, because each match features 12 players. The bit of spare time I had left went into this project, which makes use of the full 2,400 hours.

Now, I will not reveal the concrete algorithms I created, with all of their 576 variables and constants, but I will explain what I focused on. Obviously, you can only evaluate a player’s statistics with respect to the hero he is playing. It is highly unreasonable to compare the number of kills of a Mercy player to the number of kills of a Genji player. But for me it was important to have numbers that can be compared no matter which hero was played. Once this is possible, you can also create an overall rating for each player by taking all of his hero-ratings and mixing them together.
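The article does not reveal how the hero-ratings are actually mixed together, but one plausible sketch is a playtime-weighted average; the hero names, minutes, and ratings below are entirely hypothetical:

```python
def overall_rating(hero_ratings):
    """Combine per-hero ratings into one overall rating.

    hero_ratings: dict mapping hero name -> (rating, minutes_played).
    A playtime-weighted average is just one plausible way to 'mix'
    hero-ratings together; the real system's method is not public.
    """
    total_minutes = sum(minutes for _, minutes in hero_ratings.values())
    if total_minutes == 0:
        return None  # no playtime, nothing to rate
    return sum(rating * minutes
               for rating, minutes in hero_ratings.values()) / total_minutes

# Hypothetical player: strong on Tracer, average elsewhere.
player = {"Tracer": (1210, 95), "Genji": (1040, 40), "Winston": (980, 15)}
print(round(overall_rating(player)))  # dominated by the 95 Tracer minutes
```

Weighting by playtime means a hero played for a few minutes barely moves the overall number, which matches the article's point that short samples should matter less.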

How do you get to this point? First you need to focus on each hero individually and create a hero-rating that is unique to every single one. For some heroes it is important to get more kills; for some it is more important not to die. For some it is very important not to die first in a teamfight; you can read more about that in an article I wrote about teamfight statistics.

By using many different metrics and weighting them in a way that makes sense for a hero, you get a single hero-rating. For the individual hero-ratings to be comparable, they all have to have similar characteristics: the same average, a similar spread, the top X% of players should all have a rating of at least Y, et cetera. Since I don’t want to get into the details, let’s just say I built 24 different Lego houses. If you look at a player’s performance on a hero as a room in that hero’s house, you could say that the better he performs, the bigger his room is. Most importantly, that doesn’t make one house bigger than another, and thus it becomes possible to compare rooms in different houses. A player’s overall performance would be the average room size across all the rooms he has in different houses.

To come back to the numbers and be a bit more specific: the Lego houses actually are bell curves, and because they are all similar, they are comparable.
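A minimal sketch of the "same-size houses" idea is plain standardization: map each hero's raw scores to z-scores so every hero's distribution has mean 0 and standard deviation 1. The metric and the numbers below are made up for illustration; the real system combines many metrics with hand-tuned weights.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale raw per-hero scores so their distribution has
    mean 0 and (sample) standard deviation 1."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical raw kill counts per 10 minutes for two very different heroes:
tracer_scores = [55, 60, 62, 70, 78]
mercy_scores = [12, 15, 16, 18, 22]

# After standardizing, a Tracer score and a Mercy score live on the
# same scale and can be compared directly.
print(standardize(tracer_scores))
print(standardize(mercy_scores))
```

Once every hero's distribution is forced into the same shape, "room sizes" from different houses become directly comparable, which is exactly what an overall rating needs.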

The Result

The characteristics of the rating system are all adjustable, meaning I can choose where the average, the top-10% mark and so on lie. To start, I decided to have all numbers fall around 1000: a top-25% performance has a rating of at least about 1150 (a bottom-25% performance is at 850 and below), and a top-10% performance is achieved with a rating of at least 1300 (the bottom 10% equates to 700 and less). There are no hard caps on what numbers can be achieved, but it is pretty hard to get anything above 1.6k or below 400.
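To see how such percentile anchors relate to a bell curve centred at 1000, here is a sketch using Python's `statistics.NormalDist`. The sigma of 230 is a hypothetical choice that lands the quartile and decile marks near the quoted values; the article's actual distribution need not be exactly Gaussian, which is why the published anchors (1150, 1300) do not fall out of a single sigma perfectly.

```python
from statistics import NormalDist

# Hypothetical Gaussian approximation of the rating scale.
scale = NormalDist(mu=1000, sigma=230)

print(round(scale.inv_cdf(0.75)))  # top-25% threshold, ~1155
print(round(scale.inv_cdf(0.90)))  # top-10% threshold, ~1295
print(round(scale.inv_cdf(0.10)))  # bottom-10% threshold, ~705
```

Under this approximation, ratings above 1.6k sit beyond 2.5 standard deviations from the mean, which is consistent with the article's remark that such values are very hard to reach.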

Tracer Players at APEX S3

Rating a performance only becomes really feasible once a player has played at least 10 minutes on a hero; ideally he should have at least 30 minutes. This makes a single-match performance less important than a performance across multiple matches.

The pages on Winston’s Lab that from now on calculate ratings are the match pages as well as the two player-comparison pages (links to ratings for all players from APEX S2 and S3). The latter can, for example, be used to check out who the best Tracers at APEX S3 are, while the former can give some insight into who carried a team in a specific match.

Problems & What the numbers can’t tell you

What those numbers do not express is how well a team did altogether. That is also not something they should be made to express, since they are designed to evaluate a single player, disregarding what his teammates played, how well they did, or against whom he played. The last part is quite important, because it makes it hard to compare an NiP player at a Finnish LAN with a Rogue player at APEX. In the future I might develop a system that filters out matches against “bad” opponents, so that one can understand how good a player is when facing strong opposition, but this rating system definitely does not do so yet.

Another issue that needs to be addressed is that the rating for low-playtime heroes is less accurate. Since there was not a lot of data for such heroes, players might get higher or lower numbers than they should, but the system can adapt as more data becomes available. The biggest problem is Junkrat, whose numbers are probably totally useless at this point. Of the 400 hours in which he could have been picked, he was played for 21 minutes, less than 0.1% and definitely not enough to create a satisfying rating. Nevertheless, a rating system for him exists, and if someone plays him for more than 10 minutes he will get rated. The heroes that still require more data before their ratings are acceptable are: Bastion, Hanzo, Junkrat, Mei, Orisa, Reaper, Sombra, Symmetra and Torbjörn. Heroes that are already fairly accurate but still need more data are: McCree, Mercy and Widowmaker.
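One standard way to keep low-playtime ratings from swinging wildly, sketched below, is to shrink them toward the population average until enough data accumulates. This is a generic Bayesian-style shrinkage illustration, not the article's actual method, and the tuning constant `k` is hypothetical:

```python
def shrunk_rating(raw_rating, minutes_played, prior=1000, k=30):
    """Pull a small-sample rating toward the population average (1000).

    With only a few minutes on a hero the result stays near the prior;
    as playtime grows it approaches the raw rating.  `k` (in minutes)
    controls how much playtime is needed to trust the raw number.
    """
    w = minutes_played / (minutes_played + k)
    return prior + w * (raw_rating - prior)

print(round(shrunk_rating(1400, 5)))    # 5 minutes: stays close to 1000
print(round(shrunk_rating(1400, 300)))  # 300 minutes: close to the raw 1400
```

A Junkrat with 21 recorded minutes would sit mostly at the prior under such a scheme, which mirrors the article's warning that his numbers are not yet trustworthy.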

This is very much a work in progress; you can consider this the beta phase. The system can and must be updated constantly, because hero reworks, meta-changing patches and the introduction of new heroes require the algorithms to be adjusted accordingly.

I am also very open to feedback (from experts) if someone sees any issues with the current numbers. For example, “Player X should be rated higher than Player Y because he got way more kills, even though he died more often” can already be very helpful. But just to stress this again: the algorithms make use of many more values than just kills and deaths.

All ratings that include matches which took place before January 27th are inaccurate for some heroes, so if you would like to explore the numbers, make sure to use the filters on Winston’s Lab to exclude those games.

The last point I want to address is that this rating system doesn’t make human analysis invalid. These numbers are very much based on what I and other experts think, and thus are nothing more than a formalized personal preference. They also don’t capture what positions a player played, how often he rescued his team by sleep-darting an ulting Genji, and so on and so forth (something that could be made possible by an API 😉). What I want to say is that numbers will never be able to replace humans, and that an expert’s judgment is often more accurate than a number can ever be. But in the end, this rating system can give anybody a first understanding of who is good and how much a team depends on a player, without being an expert in the field.

Enjoy the numbers!



Barroi is the founder of Winston's Lab. He is a coder, journalist and statistician all at once.