Abstract
We present 10 different strength-based statistical models that we use to model soccer match outcomes with the aim of producing a new ranking. The models are of four main types: Thurstone–Mosteller, Bradley–Terry, independent Poisson and bivariate Poisson. Their common aspect is that the parameters are estimated via weighted maximum likelihood, the weights being a match importance factor and a time depreciation factor that gives less weight to matches played a long time ago. Since our goal is to build a ranking reflecting the teams’ current strengths, we compare the 10 models on the basis of their predictive performance via the Rank Probability Score at the level of both domestic leagues and national teams. We find that the best models are the bivariate and independent Poisson models. We then illustrate the versatility and usefulness of our new rankings by means of three examples where the existing rankings fail to provide enough information or lead to peculiar results.
Introduction
Football, or soccer, is undeniably the most popular sport worldwide. Which team will win the next World Cup or the Champions League final is a question that leads to heated discussions and debates among football fans, and even attracts the attention of casual watchers. Put more simply, the question of which team will win the next match, whatever the circumstances, excites the fans. Bookmakers have made a business out of football predictions, and they use highly advanced models taking numerous factors (like a team's current form, injured players, the history between both teams, the importance of the game for each team, etc.) into account to obtain the odds of winning, losing and drawing for both teams.
One major appeal of football, and a reason for its success, is its simplicity as a game. This stands somewhat in contrast to the difficulty of predicting the winner of a football match. A help in this respect would be a ranking of the teams involved in a given competition based on their current strength, as this would give football fans and casual watchers a better feeling for who is the favourite and who is the underdog. However, the existing rankings, both at domestic league level and at national team level, fail to provide this, either because they are by nature not designed for that purpose or because they suffer from serious flaws.
Domestic league rankings obey the 3–1–0 principle, meaning that the winner gets 3 points, the loser 0 points and a draw earns each team 1 point. The ranking is very clear and fair, and tells at every moment of the season how strong a team has been since the beginning of the season. However, given that every match has the same impact on the ranking, it is not designed to reflect a team's current strength. A recent illustration of this fact can be found in last year's English Premier League, where the newly promoted team of Huddersfield Town had a very good start to the 2017–2018 season with 7 out of 9 points after the first three rounds. They ended the first half of the season at rank 11 out of 20, with 22 points after 19 games. Their second half of the season was however very poor, with only 15 points scored in 19 games, earning them the second-to-last spot over the second half of the season (overall they ended the year at rank 16, allowing them to stay in the Premier League). There was a clear decline in their performance, which was hidden in the overall ranking by their very good performance at the start of the season.
Contrary to domestic league rankings, the FIFA/Coca-Cola World Ranking of national soccer teams is intended to rank teams according to their recent performances in international games. Bearing in mind that the FIFA ranking forms the basis of the seeding and the draw in international competitions and their qualifiers, such a requirement on the ranking is indeed necessary. However, the current FIFA ranking 1
While the present paper was in the final stages of the revision procedure, FIFA decided to change its ranking in order to avoid precisely the flaws we mention here. Given the tight time constraints, we were not able to study the new ranking and leave this for future research.
In this article, we intend to fill the gap and develop a ranking that does reflect a soccer team's current strength. To this end, we consider and compare various existing and new statistical models that assign one or more strength parameters to each soccer team and where these parameters are estimated over an entire range of matches by means of maximum likelihood estimation. We shall propose a smooth time depreciation function to give more weight to more recent matches. The comparison between the distinct models will be based on their predictive performance, as the model with the best predictive performance will also yield the best current strength ranking. The resulting ranking represents an interesting addition to the well-established rankings of domestic leagues and can be considered as a promising alternative to the FIFA ranking of national teams.
The present article is organized as follows. In Section 2 we present 10 different strength-based models whose parameters are estimated via maximum likelihood or, more precisely, via weighted maximum likelihood, since we introduce two types of weights: the aforementioned time depreciation effect and a match importance effect for national team matches. In Section 3 we describe the exact computations behind our estimation procedures as well as a criterion according to which we define a statistical model's predictive performance. Two case studies allow us to compare our 10 models at domestic league and national team levels in Section 4: we investigate the English Premier League seasons from 2008 to 2017 (Section 4.1) as well as national team matches between 2008 and 2017 (Section 4.2). On the basis of the best-performing models, we then illustrate in Section 5 the advantages of our current strength-based ranking via various examples. We conclude the article with final comments and an outlook on future research in Section 6.
Time depreciation and match importance factors
Our strength-based statistical models are of two main types: Thurstone–Mosteller (TM) and Bradley–Terry (BT) type models on the one hand, which directly model the outcome (home win, draw, away win) of a match, and the independent and bivariate Poisson models on the other hand, which model the scores of a match. Each model assigns strength parameters to all teams involved and models match outcomes via these parameters. Maximum likelihood estimation is employed to estimate the strength parameters, and the teams are ranked according to their resulting overall strengths. More precisely, we shall consider weighted maximum likelihood estimation, where the weights introduced are of two types: time depreciation (domestic leagues and national teams) and match importance (only national teams).
A smooth decay function based on the concept of Half Period
A feature common to all considered models is our proposed decay function, which reflects the time depreciation. Instead of the step-wise decay function employed in the FIFA ranking, we suggest a continuous depreciation function that gives less weight to older matches, with a maximum weight of one for a match played today. Specifically, the time weight for a match which is played $x_m$ days back is calculated as
\[ w_{\mathrm{time},m}(x_m) = \left(\frac{1}{2}\right)^{x_m/\text{Half Period}}, \tag{2.1} \]
meaning that a match played Half Period days ago only contributes half as much as a match played today and a match played $3 \times$ Half Period days ago contributes 12.5% as much as a match played today.
Comparison of the FIFA ranking decay function versus our exponential smoother (2.1). The continuous depreciation line uses a Half Period of 500 days
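The half-period weighting described above can be sketched in a few lines of Python. This is only an illustration under the assumption that the weight halves every Half Period days; the function name `time_weight` and the sample ages are hypothetical, while the Half Period of 500 days is the value used for the comparison figure.

```python
def time_weight(days_ago: float, half_period: float) -> float:
    """Weight of a match played `days_ago` days back.

    The weight equals 1 for a match played today and halves
    every `half_period` days (assumed exponential smoother).
    """
    if days_ago < 0 or half_period <= 0:
        raise ValueError("days_ago must be >= 0 and half_period must be > 0")
    return 0.5 ** (days_ago / half_period)


HALF_PERIOD = 500.0  # days, as in the comparison with the FIFA decay function

if __name__ == "__main__":
    # Print the weight of matches of increasing age.
    for days in (0, 250, 500, 1000, 1500):
        print(days, round(time_weight(days, HALF_PERIOD), 4))
```

Because the weight is a smooth function of the match age, a team's rating does not jump on the day a match crosses a yearly boundary, which is the behaviour the step-wise scheme exhibits.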
While in domestic leagues all matches are equally important, the same cannot be said about national team matches where, for instance, friendly games are way less important than matches played during the World Cup. Therefore we need to introduce importance factors. The FIFA weights seem reasonable for this purpose and will be employed whenever national team matches are analysed. The relative importance of a national match is indicated by
The Thurstone–Mosteller and Bradley–Terry type models
TM models (Thurstone, 1927; Mosteller, 2006) and BT models (Bradley and Terry, 1952) have been designed to predict the outcome of pairwise comparisons. Assume from now that we look at
Thurstone–Mosteller model
The Thurstone-Mosteller model assumes that the performances
If we call
where
The strength parameters are estimated using maximum likelihood estimation on match outcomes. Let
with
In the BT model, the normal distribution is replaced with the logistic distribution. This leads to the assumption that
where again
Bradley–Terry–Davidson model
In the original Bradley-Terry model, there exists no possibility for a draw (
These simple formulae are one of the reasons for the popularity of the BT model. Starting from there, Davidson (1970) modelled the draw probability in the following way:
The draw effect
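To make the three-way outcome probabilities concrete, the following sketch uses the standard Davidson (1970) extension of the Bradley–Terry model, in which the draw term is proportional to the geometric mean of the two strengths. The way the home effect enters (as a multiplier on the home strength) and all names (`davidson_probs`, `draw`, `home`) are assumptions made for this illustration, not necessarily the exact parametrization used in this article.

```python
import math


def davidson_probs(r_home: float, r_away: float,
                   draw: float = 1.0, home: float = 1.0):
    """Return (P(home win), P(draw), P(away win)).

    r_home, r_away: positive strengths (multiplicative scale).
    draw:           draw effect; draw = 0 gives back plain Bradley-Terry.
    home:           assumed multiplicative home advantage.
    """
    if min(r_home, r_away, home) <= 0 or draw < 0:
        raise ValueError("strengths and home effect must be > 0, draw >= 0")
    rh = home * r_home
    tie = draw * math.sqrt(rh * r_away)  # geometric-mean draw term
    total = rh + r_away + tie
    return rh / total, tie / total, r_away / total


if __name__ == "__main__":
    # Two equally strong teams, with and without a home advantage.
    print(davidson_probs(1.0, 1.0))
    print(davidson_probs(1.0, 1.0, home=1.4))
```

With `draw = 0` the function reduces to the two-outcome Bradley–Terry probabilities, which is a convenient sanity check.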
Thurstone–Mosteller, Bradley–Terry and Bradley–Terry–Davidson models with goal difference weights
The basic TM, BT and Bradley–Terry–Davidson models of the previous sections do not use all of the available information. They only take the match outcomes into account, omitting likely valuable information present in the goal difference. A team that wins by 8–0 and loses the return match by 0–1 is probably stronger than the opponent team. Therefore we propose an extension of these models that modifies the basic models in the sense that matches are given an increasing weight when the goal difference grows. The likelihood function is calculated as follows:
where
with
The Poisson models
Poisson models were first suggested by Maher (1982) to model football match results. He assumed the number of scored goals by both teams to be independent Poisson distributed variables. Let
where
Being a count-type distribution, the Poisson is a natural choice to model soccer matches. It bears yet another advantage when it comes to predicting matches. If
Attributing again a single strength parameter to each team, denoted as before by
where
The bivariate Poisson model
A potential drawback of the independent Poisson models lies precisely in the independence assumption. Of course, some sort of dependence between the two playing teams is introduced by the fact that the strength parameters of each team are present in the Poisson means of both teams; however, this may not be a sufficiently rich model to cover the interdependence between two teams.
Karlis and Ntzoufras (2003) suggested a bivariate Poisson model by adding a correlation between the scores. The scores in a match between teams
which is the formula for the bivariate Poisson distribution with parameters
implying that we can again use the Skellam distribution for predicting the winner of future games.
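As an illustration of how win, draw and loss probabilities follow from two Poisson means, the sketch below sums the joint probabilities over a truncated grid of scores, which amounts to evaluating the Skellam distribution of the goal difference numerically. In a bivariate Poisson model built with a common additive component, that component cancels in the goal difference, so the same computation applies to the two team-specific means. The function names and the truncation at 25 goals are assumptions made for this example.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probs(lam_home: float, lam_away: float, max_goals: int = 25):
    """Return (P(home win), P(draw), P(away win)) for independent Poisson scores.

    Scores above `max_goals` are ignored; for realistic football means
    the neglected probability mass is negligible.
    """
    home = draw = away = 0.0
    for x in range(max_goals + 1):
        px = poisson_pmf(x, lam_home)
        for y in range(max_goals + 1):
            p = px * poisson_pmf(y, lam_away)
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    return home, draw, away


if __name__ == "__main__":
    # Hypothetical expected goals: 1.6 for the home team, 1.1 for the away team.
    print(outcome_probs(1.6, 1.1))
```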
One can think of many other ways to model dependent football scores. Karlis and Ntzoufras (2003) also consider bivariate Poisson models where the dependence parameter
In the previous sections we have defined a slightly simplified version of Maher's original idea. In fact, Maher assumed the scoring rates to be of the form
Since every team is given two strength parameters, in this case, one may wonder how to build rankings. We suggest two options—on the one hand, this model can lead to two rankings, one for attacking strengths and the other for defensive strengths. On the other hand, we can simulate a round-robin tournament with the estimated strength parameters and consider the resulting ranking. We refer the reader to Scarf and Yusof (2011) for details about this approach.
Parameter estimation and model selection
In this section we shall briefly describe two crucial statistical aspects of our investigation, namely how we compute the maximum likelihood estimates and which criterion we apply to select the model with the highest predictive performance.
Computing the maximum likelihood estimates
Parameters in the TM and Bradley–Terry type as well as in the Poisson models are estimated using maximum likelihood estimation. To this end, we have used the optim function in
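The weighted maximum likelihood idea can be sketched as follows for an independent Poisson model with one strength per team. Several things here are assumptions for illustration only: the log-linear form of the scoring rates (intercept `c`, home effect `h`, strength difference), the toy match list, and the use of a plain gradient ascent in place of a general-purpose optimizer such as the optim function mentioned above.

```python
import math

# Toy data (hypothetical): (home, away, home goals, away goals, days ago)
MATCHES = [
    ("A", "B", 3, 0, 10), ("B", "C", 2, 1, 40), ("C", "A", 0, 2, 70),
    ("B", "A", 1, 2, 400), ("C", "B", 1, 1, 430), ("A", "C", 4, 1, 460),
]


def time_weight(days_ago, half_period):
    """Weight halves every `half_period` days."""
    return 0.5 ** (days_ago / half_period)


def rates(params, home, away):
    """Assumed log-linear scoring rates for the home and away team."""
    c, h, r = params["c"], params["h"], params["r"]
    return (math.exp(c + h + r[home] - r[away]),
            math.exp(c + r[away] - r[home]))


def weighted_loglik(params, matches, half_period):
    """Time-weighted Poisson log-likelihood (constant log(goals!) terms dropped)."""
    total = 0.0
    for home, away, hg, ag, days_ago in matches:
        w = time_weight(days_ago, half_period)
        lam_h, lam_a = rates(params, home, away)
        total += w * (hg * math.log(lam_h) - lam_h + ag * math.log(lam_a) - lam_a)
    return total


def fit(matches, half_period, lr=0.05, steps=5000):
    """Maximize the weighted log-likelihood by simple gradient ascent."""
    teams = sorted({t for m in matches for t in m[:2]})
    params = {"c": 0.0, "h": 0.0, "r": {t: 0.0 for t in teams}}
    total_w = sum(time_weight(m[4], half_period) for m in matches)
    for _ in range(steps):
        g_c = g_h = 0.0
        g_r = {t: 0.0 for t in teams}
        for home, away, hg, ag, days_ago in matches:
            w = time_weight(days_ago, half_period)
            lam_h, lam_a = rates(params, home, away)
            res_h, res_a = w * (hg - lam_h), w * (ag - lam_a)
            g_c += res_h + res_a           # derivative w.r.t. the intercept
            g_h += res_h                   # derivative w.r.t. the home effect
            g_r[home] += res_h - res_a     # derivative w.r.t. home strength
            g_r[away] += res_a - res_h     # derivative w.r.t. away strength
        params["c"] += lr * g_c / total_w
        params["h"] += lr * g_h / total_w
        for t in teams:
            params["r"][t] += lr * g_r[t] / total_w
    return params


if __name__ == "__main__":
    est = fit(MATCHES, half_period=390.0)
    # Teams sorted from strongest to weakest estimated strength.
    print(sorted(est["r"], key=est["r"].get, reverse=True))
```

Starting all strengths at zero keeps their sum at zero throughout, since every match adds opposite gradient contributions to the two teams involved; this serves as the identifiability constraint in the sketch.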
Measure of predictive performance
The studied models are built to perform three-way outcome prediction (home win, draw or home loss). Each of the three possible match outcomes is predicted with a certain probability but only the actual outcome is observed. The predicted probability of the outcome that was actually observed is thus a natural measure of predictive performance. The ideal predictive performance metric is able to select the model which approximates the true outcome probabilities the best.
The metric we use is the Rank Probability Score (RPS) of Epstein (1969). It represents the difference between cumulative predicted and observed distributions via the formula
where we simplify the previous notations so that
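For readers who want to experiment with the criterion, here is a minimal sketch of the Rank Probability Score for three ordered outcomes (home win, draw, away win), written in its usual cumulative form. The function names and the example forecasts are hypothetical; lower values indicate better predictions.

```python
ORDER = ("H", "D", "A")  # ordered outcomes: home win, draw, away win


def rps(probs, outcome):
    """Rank Probability Score of one forecast.

    probs:   (p_home, p_draw, p_away), summing to 1.
    outcome: observed result, one of "H", "D", "A".
    """
    if outcome not in ORDER:
        raise ValueError("outcome must be 'H', 'D' or 'A'")
    observed = [1.0 if o == outcome else 0.0 for o in ORDER]
    cum_p = cum_o = 0.0
    score = 0.0
    # Compare cumulative predicted and observed distributions; the last
    # cumulative term is always 1 - 1 = 0 and can be skipped.
    for p, o in zip(probs[:-1], observed[:-1]):
        cum_p += p
        cum_o += o
        score += (cum_p - cum_o) ** 2
    return score / (len(ORDER) - 1)


def mean_rps(forecasts, outcomes):
    """Average RPS over a set of matches."""
    scores = [rps(p, o) for p, o in zip(forecasts, outcomes)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    forecasts = [(0.5, 0.3, 0.2), (0.2, 0.3, 0.5)]
    print(mean_rps(forecasts, ["H", "D"]))
```

Because the score works on cumulative probabilities, predicting a draw when the home team wins is penalized less than predicting an away win, which respects the ordering of the outcomes.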
Comparison of the 10 models in terms of their predictive performance
In this section we compare the predictive performance of all 10 models described in Section 2. To this end, we first consider the English Premier League as example for domestic league matches, and then move to national team matches played over a period of 10 years all over the world, that is, without restriction to a particular zone.
Case study 1: Premier League
The engsoccerdata package (Curley, 2015) contains results of the top four tiers of English league football since 1888. The dataset contains the date of the match, the teams that played, the tier as well as the result. The number of teams equals 20 for each of the seasons considered (2008–2017). Matches are predicted for every season separately and on every match day of the season, using two years for training the models. We left out the first five rounds of every season, so a total of 3 300 matches are predicted. The reason for this burn-in period is that, for the teams newly promoted to the Premier League, we cannot yet obtain a good estimate of their strength at the beginning of the season since we lack information about the previous season(s). Matches are predicted in blocks corresponding to each round, and the parameters are updated after every round. In all our models, the Half Period is varied between 30 days and 2 years in steps of 30 days.
Table 1 summarizes the analysis by comparing the best performing models of each of the considered classes, that is, the model with the optimal Half Period. As we can see, the bivariate Poisson model with one strength parameter per team is the best according to the RPS, followed by the independent Poisson model with just one parameter per team. So parsimony in terms of parameters to estimate is important. We also clearly see that all Poisson-based models outperform the TM and BT type models. This was to be expected since Poisson models use the goals as additional information. Considering the goal difference in the TM and BT type models does not improve their performance. It is also noteworthy that the best two models have among the lowest Half Periods.
Comparison table for the best performing models of each of the considered classes with respect to the RPS criterion. The English Premier League matches from rounds 6 to 38 between the seasons 2008–2009 and 2017–2018 are considered
Case study 2: National teams
For the national team match results, we used the dataset ‘International football results from 1872 to 2018’ uploaded by Mart Jürisoo on the website https://www.kaggle.com/. We predicted the outcome of 4 268 games played all over the world in the period from 2008 to 2017. The last game in our analysis was played on 15 November 2017. To keep the computation time manageable, we left out the friendly games in the comparison. The parameters are estimated by maximum likelihood on a period of eight years. The Half Period is varied from half a year to six years in steps of half a year.
The results of our model comparison are provided in Table 2. Exactly as for the Premier League, the bivariate Poisson model with one strength parameter per team comes out first, followed by the independent Poisson model with one strength parameter. We also retrieve all the other conclusions from the domestic level comparison. It is interesting to note that a Half Period of 3 years leads to the lowest RPS for both best models. Given the sparsity of national team matches played over a year, we think that no additional level of detail such as 3 years and 2 months is required, as this may also lead to over-fitting.
Comparison table for the best performing models of each of the considered classes with respect to the RPS criterion. All of the important matches between the national teams in the period 2008–2017 are considered
We now illustrate the usefulness of our new current strength-based rankings by means of various examples. Given the dominance of the bivariate Poisson model with one strength parameter in both settings, we will use only this model to build our new rankings.
Example 1: Rankings of Scotland in 2013
As mentioned in the Introduction, the abrupt decay function of the FIFA ranking caused the ranking of Scotland to vary a lot in 2013 over a very short period of time: ranked 50th in August 2013, the team dropped to rank 63 in September 2013 before jumping to rank 35 in October 2013. In Figure 2, we show the variation of Scotland in the FIFA ranking together with its variation in our ranking based on the bivariate Poisson model with one strength parameter and Half Period of 3 years. While both rankings follow the same trend, we clearly see that our ranking method shows fewer jumps than the FIFA ranking and is much smoother. It thus leads to a more reasonable and stable ranking than the FIFA ranking.
Comparison of the evolution of the FIFA ranking of Scotland in 2013 with the evolution based on our proposed ranking method, using the bivariate Poisson model with one strength parameter and Half Period of 3 years
Another infamous example of the disadvantages of the official FIFA ranking is the position of Poland at the moment of the draw for the 2018 FIFA World Cup (1 December 2017, but the relevant date for the seeding was 16 October 2017). According to the FIFA ranking of 16 October 2017, Poland was ranked 6th, and so it was one of the teams in Pot 1, in contrast to, for example, Spain or England, which were in Pot 2 due to Russia as host occupying one of the eight spots in Pot 1. Poland reached this good position thanks to a very good performance in the World Cup qualifiers and, specifically, by avoiding friendly games during the year before the draw for the World Cup, since friendly games with their low importance coefficient are very likely to reduce the points underpinning the FIFA ranking. This trick of Poland, who intelligently exploited the flaws of the FIFA ranking, led to unbalanced groups at the World Cup: for instance, strong teams such as Spain and Portugal were together in Group B, and Belgium and England were together in Group G. This raised quite some discussion in the soccer world. In the end, Poland was not able to advance to the next stage of the World Cup 2018 from its group with Colombia, Japan and Senegal, where Colombia and Japan finished first and second and Poland finished last. This underlines that the position of Poland was not correct in view of their actual strength.
In Table 3, we compare the official FIFA ranking on 16 October 2017 to our ranking based on the bivariate Poisson model with one strength parameter and Half Period of three years. In our ranking, Poland occupies only position 14 and would not be in Pot 1. Spain and Colombia would enter Pot 1 instead of Poland and Portugal. We remark that, in the World Cup 2018, Spain was ranked first in their group with Portugal being second while, as mentioned earlier, Colombia turned out first of Group H while Poland became last. This demonstrates the superiority of our ranking over the FIFA ranking. A further asset is its readability: One can understand the values of the strength parameters as ratios leading to the average number of goals that one team will score against the other. The same cannot be said about the FIFA points which do not allow making predictions.
Top of the ranking of the national teams on 16 October 2017 according to the bivariate Poisson model with 1 strength parameter and a Half Period of 3 years compared to the Official FIFA/Coca-Cola World Ranking on 16 October 2017
Above: Premier League ranking according to the bivariate Poisson model with 1 strength parameter and Half Period of 390 days, updated every week, starting from the sixth week since the start of the season. Below: Official Premier League ranking, weekly updated, starting from the sixth week
In Figure 3, we compare our ranking based on the bivariate Poisson model with one strength parameter and Half Period of 390 days to the official Premier League ranking for the season 2017–2018, leaving out the first five weeks of the season. At first sight, one can see that our proposed ranking is again smoother than the official ranking, especially in the first part of the season. Besides that, our ranking is constructed in such a way that it depends less on the game schedules, while the intermediate official rankings heavily depend on the latter. Indeed, winning against weak teams can rapidly inflate a team's official ranking, whereas in our ranking, which takes the opponent's strength into account, wins against weak opponents increase that team's strength less. Furthermore, the postponing of matches may even entail that at a certain moment some teams have played more games than others, which of course results in an official ranking that favours the teams which have played more games at that time, a feature that is avoided in our ranking.
Coming back to the example of Huddersfield Town, mentioned in the Introduction, we can see that our ranking was able to detect Huddersfield as one of the weakest teams in the Premier League after 15 weeks, while their official ranking was still high, thanks to their good start of the season. Thus, our ranking fulfills its purpose—it reflects well a team's current strength.
We have compared 10 different statistical strength-based models according to their potential to serve as rankings, reflecting a team's current strength. Our analysis clearly demonstrates that Poisson models outperform TM and Bradley–Terry type models, and that the best models are those that assign the fewest parameters to teams. Both at domestic team level and national team level, the bivariate Poisson model with one strength parameter per team was found to be the best in terms of the RPS criterion. However, the difference between that model and the independent Poisson with one strength parameter is very small, which is explained by the fact that the covariance in the bivariate Poisson model is close to zero. This is well in line with recent findings of Groll et al. (2017) who used the same bivariate Poisson model in a regression context. Applying it to the European Championships 2004–2012, they got a covariance parameter close to zero.
The time depreciation effect in all models considered in the present article makes it possible to take into account the moment in time when a match was played and gives more weight to more recent matches. An alternative approach to the problem of giving more weight to recent matches consists in using dynamic time series models. Such dynamic models, also based on Poisson distributions, were proposed in Crowder et al. (2002), Koopman and Lit (2015) and Angelini and De Angelis (2017). In future work, we shall investigate the dynamic approach in detail and also compare the resulting models to the bivariate Poisson model with one strength parameter based on the time depreciation approach.
Acknowledgements
We wish to thank the associate editor as well as two anonymous referees for their useful comments that led to a clear improvement of our article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors received no financial support for the research, authorship and/or publication of this article.
