Abstract

The use of statistical analysis and modelling techniques in sports has attracted rapidly growing interest over the last few decades, as documented by the extensive scientific production on this topic. In particular, in almost any sport the amount of available data is increasing quickly and data structures are becoming more and more complicated, encouraging more sophisticated statistical models and more elaborate data analysis techniques.
To foster the application of these statistical modelling techniques in sports, Statistical Modelling agreed to publish two special issues with articles that promote and disseminate knowledge representing important developments, extensions and applications in statistical modelling in sports. Accordingly, the contributors to this special issue were encouraged to adhere to the following principles when setting up their submission:
Combining methodology and practice: All contributions should cover statistical modelling aspects of both methodology and practice and should be based on a problem of real substantive interest for the sports analytics community with appropriate data.
Novelty and practical relevance: The contributions should provide clear evidence on the practical relevance of the discussed models/methods, and show why their use results in important new knowledge about the specific sport. Further, submissions should provide justification as to why the work is important to the subject area, and provides gains beyond current methodology applied to the specific sports field. The methodology used should be modern and reasonably sophisticated and have few or no applications so far in the sports analytics literature.
Reproducibility: Readers should be able to reproduce the results presented in the articles, apply the techniques to their own problems, and even develop their own extensions of the methodology. To achieve this, software code and data should be provided as electronic supplements.
Altogether, the two special issues, which comprise a double issue and a single issue, collect ten contributions that focus on baseball, basketball and football and cover a wide range of topics such as technical issues concerning the game and the way of playing, performance analysis, prediction, sports preferences, match-fixing and betting. In this issue, one contribution is devoted to predicting outcome probabilities in baseball, four works focus on different aspects of (association) football, and another contribution analyses sport rating data with clusters of opposite responses.
Powers et al. (2018) propose a penalty approach for multinomial regression, which is a convex relaxation of reduced rank multinomial regression. It has the advantage of leveraging the underlying structure among the response categories to make better predictions. The method is applied to Major League Baseball play-by-play data to predict outcome probabilities based on batter-pitcher matchups. The results confirm subject-area expertise, but also suggest a novel understanding of what differentiates players.
Lasek and Gagolewski (2018) analyse the efficacy of different league formats in ranking teams according to their true latent strengths. For this purpose, they propose a Poisson regression approach that estimates attacking and defensive strengths from match outcomes. The investigated tournament designs are used in the majority of European top-tier association football competitions. The authors show that a two-stage league format comprising a triple round-robin tournament together with an extra single round-robin is the most efficacious setting with regard to selecting the best team as the winner of the league.
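To make the idea of ranking teams by latent strength under a given league format more concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a log-linear Poisson parameterization with an additive home advantage, the usual 3-1-0 points system, and purely hypothetical strength values; the function names scoring_rates, sample_poisson and simulate_league are introduced here for illustration only.

```python
import math
import random


def scoring_rates(home, away, home_adv=0.3, baseline=0.0):
    """Expected goals for both sides under an assumed log-linear Poisson model.

    `home` and `away` are (attack, defence) pairs on the log scale.
    """
    att_h, def_h = home
    att_a, def_a = away
    lam_home = math.exp(baseline + home_adv + att_h - def_a)
    lam_away = math.exp(baseline + att_a - def_h)
    return lam_home, lam_away


def sample_poisson(lam, rng):
    """Draw one Poisson variate with Knuth's multiplication method (small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_league(strengths, rounds, rng):
    """Play `rounds` single round-robins and return the points table.

    The home side alternates between rounds; a win earns 3 points, a draw 1.
    """
    teams = list(strengths)
    points = {team: 0 for team in teams}
    for rnd in range(rounds):
        for i, first in enumerate(teams):
            for second in teams[i + 1:]:
                home, away = (first, second) if rnd % 2 == 0 else (second, first)
                lam_h, lam_a = scoring_rates(strengths[home], strengths[away])
                goals_h = sample_poisson(lam_h, rng)
                goals_a = sample_poisson(lam_a, rng)
                if goals_h > goals_a:
                    points[home] += 3
                elif goals_h < goals_a:
                    points[away] += 3
                else:
                    points[home] += 1
                    points[away] += 1
    return points


# Hypothetical latent strengths: (attack, defence) on the log scale.
strengths = {"A": (0.4, 0.3), "B": (0.1, 0.0), "C": (-0.2, -0.1), "D": (-0.3, -0.2)}
table = simulate_league(strengths, rounds=4, rng=random.Random(1))
```

Repeating such a simulation over many seeds and recording how often the team with the highest latent strength finishes first would give a crude measure of how efficacious a format is; a two-stage design would additionally require splitting the table after the first stage.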
In the contribution of Egidi et al. (2018), the information contained in bookmakers’ betting odds is used to improve fit and predictive accuracy of statistical models for football data. The authors propose a hierarchical Bayesian Poisson model in which the scoring rates of the teams are convex combinations of parameters estimated from historical data and the additional source of the betting odds. The model is fitted to a nine-year data set of the most popular European leagues and its predictive performance is investigated by predicting the match outcomes of a tenth season.
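The role of the convex combination can be illustrated with a small plug-in sketch. It is not the hierarchical Bayesian model of the paper, in which all quantities are estimated jointly; here the two rate estimates and the mixing weight are simply assumed to be given, the combination is taken on the rate scale, and the two teams' goal counts are assumed to be independent. The names mixed_rate and outcome_probabilities, and all numerical values, are hypothetical.

```python
import math


def mixed_rate(rate_hist, rate_odds, weight):
    """Convex combination of a historical and an odds-implied scoring rate."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * rate_hist + (1.0 - weight) * rate_odds


def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_home, lam_away, max_goals=20):
    """Home win, draw and home loss probabilities for independent Poisson scores.

    The double sum is truncated at `max_goals` goals per team.
    """
    win = draw = loss = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                win += p
            elif h == a:
                draw += p
            else:
                loss += p
    return win, draw, loss


# Hypothetical inputs: rates from past results and rates implied by the odds.
lam_home = mixed_rate(rate_hist=1.6, rate_odds=1.9, weight=0.5)
lam_away = mixed_rate(rate_hist=1.1, rate_odds=0.9, weight=0.5)
p_win, p_draw, p_loss = outcome_probabilities(lam_home, lam_away)
```

A weight of 1 ignores the bookmakers' information entirely, while a weight of 0 relies on it alone; intermediate values let the two sources share the estimate.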
Another contribution that focuses on the prediction of football matches is based on random forests (Schauberger and Groll, 2018). Random forests are known for their high predictive power and can be seen as a mixture between machine learning and statistical modelling. Like classical regression approaches, they can be used to analyse and predict results of international football matches while incorporating several potentially influential covariates for a national team's success, such as betting odds or FIFA rankings. Based on all matches from the four FIFA World Cups 2002–2014, the predictive performance of random forests is compared to that of common regression models. The authors consider random forests both for the precise numbers of goals and for the three match outcomes win, draw and loss.
The bookmakers’ odds also play an important role in the contribution of Ötting et al. (2018), where a method for detecting match-fixing in football matches is proposed. While the existing literature and fraud detection systems primarily focus on analysing betting odds, the authors suggest additionally considering the total volume placed on these bets, thereby better exploiting the information available. Both the betting odds and the betting volumes are modelled by flexible distributional regression methods, which are then used to conduct outlier detection. Using data from the Italian second football division, for which it has effectively been proven that some matches were fixed, the authors show that monitoring both betting volumes and odds can lead to more reliable detection of suspicious matches.
Last but not least, Simone and Iannario (2018) analyse sport rating data with clusters of opposite responses. In particular, they focus on data covering questionnaire-based evaluation of sport preferences, measurements of sport participation, and opinions on social implications such as the resurgence of racism, violence in stadiums and doping, where the need arises to establish connections among motivations, subjects’ characteristics and responses. Specifically, a two-component mixture of Inverse Hypergeometric distributions is introduced and tested against competing models in order to yield a multifold interpretation of results.
This special issue has been promoted by BDsports (Big Data Analytics in Sports), a project developed as part of the activities of the Big & Open Data Innovation Laboratory of the University of Brescia. The project was born in 2016 thanks to the financial support of Fondazione Cariplo and Regione Lombardia and is designed to set up a unique collaboration of experts interested in sport analytics from both a scientific and a practical point of view. Finally, we want to express our thanks to all the authors and reviewers who have made this special issue possible; it really was a pleasure for us to have the opportunity to read so many interesting contributions! We also wish to thank our Editor Jeff Simonoff for his strong support throughout the whole publication process.
