Are Armageddon chess games implemented fairly?

Abstract

A strategy to find fair time controls for Armageddon chess games is presented and exemplified with 2750+ ELO players. The research was divided into two stages. The first consisted of finding a time control for Stockfish 9 so that it played with an effective strength equal to that of the average 2750+ ELO player of the 2017 World Blitz Chess Championship. Analysis of this stage showed that a fair time control for an Armageddon game depends on the players’ rating. In the second stage, another instance of Stockfish 9 with a different time control was paired, as the black player, against the engine with the time control found in the first stage, as the white player. The new engine’s time control was adjusted so that the expected result of Armageddon games between these two engines would be a draw. These games were generated because too few Armageddon games have been played to draw statistical conclusions from them. The resulting time controls were found to be $180 \pm 35$ seconds for white and $110 \pm 12.27$ seconds for black plus a $2 \pm 0.39$ second increment per move, starting from move 1. Therefore, the ratio between the initial times for the black and white players is approximately 3 to 5, which is different from the normally implemented 4 to 5 ratio.

Keywords

Armageddon, time control, fairness, chess

1. Introduction

A chess match is a series of chess games played between two chess players. It is usually divided into stages, which can be thought of as a series of mini-matches. If a mini-match is tied, the players move on to the next mini-match, which has a faster time control than the previous one. This can happen often because the most probable outcome of a chess game at the top level is a draw (Milener, 2018). If after a number of those mini-matches the overall match is still tied, the tie is resolved by an Armageddon game. Here, the piece colour is randomly assigned to the players. In the case of the World Chess Championship, the player with the white (black) pieces has 5 (4) minutes on the clock. In the case of the 2019 Altibox Norway Chess competition, the player with the white (black) pieces has 10 (7) minutes. In both tournaments, after the 60th move, an increment of 3 seconds per move is added to each player’s clock. In case of a draw, the player with the black pieces is declared the winner.

So far, there is no scientific background on why the 5(4) or 10(7) minute settings are fair ways of breaking a tied match. In fact, many people think that these specific time controls in the Armageddon game favour black,¹ though no statistical evidence has been published about this claim.

¹ This is probably the reason why the 2019 Altibox Norway Chess competition organizers decided to increase the time pressure on black with respect to the white pieces.

Therefore, this paper aims to answer, as formally as possible, the questions: how fair is the Armageddon chess game for different player strengths? For a given player strength, what is the fairest Armageddon time control? To answer these questions I propose a methodology explained in Section 2, where the main results of this study are provided. Then, in Section 3 the validity of the findings is discussed. Finally, in Section 4 a short summary is provided and the main conclusions of this work are drawn.

2. Methodology and results

Let R be the outcome of an Armageddon chess game, defined by

$$R = \begin{cases} 1 & \text{if white wins the game,} \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

The Armageddon game is defined by the white and black players and by the time controls assigned to them, $t_{w}$ and $t_{b}$, respectively. Here, subscripts referring to the white and black players are omitted in R to simplify the notation. When the players are equally strong and the time controls are the same, a match with an even number of games and alternating colours between the players is expected to end in a draw in normal chess.

An Armageddon game is said to be fair if

$$P (R = 1) = P (R = 0) = 0.5. \tag{2}$$

It is not the purpose of this research to try to balance the chances between the players according to the difference between their ratings. The presumption is that if two human players have reached the Armageddon stage, it is because their playing performances have been practically the same and the players should be treated that way. Instead, it is the dependence of the chances on the overall rating level of the players that is the issue.

The methodology presented below consists of two ideas. The first is to experiment with different time controls in order to find the ones that lead to the condition in Eq. (2). As will be seen, this search is done using computer chess engines because the data on human Armageddon games are too scarce to support a statistical estimate of a fair Armageddon time control. Therefore, games between computer engines must be played in order to gather enough data for drawing conclusions. This introduces the second idea, which is to find a way to simulate the chess strength of human play.

2.1. World Chess Championship Armageddon time control

To check Eq. (2), the following method is proposed. Five matches of Stockfish 9 against itself (with the same parameters) are played, each consisting of 500 Armageddon chess games in which the Perfect 2017 Polyglot opening book² was randomly used up to ply 16 to generate diversity of games.

² https://sites.google.com/site/computerschess/perfect2017books

The time control for match i is

$$\begin{aligned} t_{w, i} &= {(α_{i} \text{ seconds} \mid α_{i} / 100 \text{ seconds})}_{61}, \\ t_{b, i} &= {(4 α_{i} / 5 \text{ seconds} \mid α_{i} / 100 \text{ seconds})}_{61}. \end{aligned} \tag{3}$$

The time control is denoted as ${(x \mid y)}_{z}$, where the first component x is the initial time and the second component y is the time increment per move starting at move z. This time control matches the 2018 World Chess Championship for $α_{i} = 300$, as explained in Section 1. Moreover, the lower the $α_{i}$, the lower the playing strength, regardless of the hardware (which in my case is an Intel® Core™ i5-4460 CPU @ 3.20 GHz × 4 processor with 16 GB of RAM). Results for the five matches are shown in Table 1. The table shows that there is a direct relationship between playing strength (which is related to depth (Ferreira, 2013)) and α (see the works by van Harreveld et al. (2007); Burns (2004); Calderwood et al. (1988); Campitelli and Gobet (2004); Charness (1981) for studies supporting this). Furthermore, $P (R = 1)$ is relatively low for larger α, meaning that the time control in Eq. (3) greatly favours black, but becomes almost exactly fair at weaker playing strengths (shorter time controls). This in itself is disadvantageous since it is clear that fixing a time control would affect players depending on their ELO.
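As a small illustration of Eq. (3), the sketch below builds the white and black time controls for a given α. The `TimeControl` class and the function name are hypothetical helpers introduced here for clarity; they are not part of the original experimental setup.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TimeControl:
    """Time control (x | y)_z, as denoted in the text (hypothetical helper)."""
    initial_s: float     # initial clock time x, in seconds
    increment_s: float   # increment y added per move, in seconds
    increment_from: int  # move z from which the increment applies


def armageddon_controls(alpha: float) -> tuple[TimeControl, TimeControl]:
    """Return the (white, black) time controls of Eq. (3) for alpha in seconds."""
    white = TimeControl(alpha, alpha / 100, 61)
    black = TimeControl(4 * alpha / 5, alpha / 100, 61)
    return white, black


# alpha = 300 gives the World Chess Championship setting described in Section 1.
white, black = armageddon_controls(300)
print(white)
print(black)
```

With α = 300 this yields 300 and 240 seconds of initial time with a 3 second increment from move 61, while the matches of Table 1 use the much smaller values α = 1.5, 3, 6, 12 and 24.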

Table 1

Winning probability for white, depth and nodes per second for each match labelled by the α value. From the 500 games of each match, 40 random samples of size 20 were taken from the results to estimate $P (R_{g} = 1)$ by taking their mean and standard deviation. Similarly, for the depth and nodes per second statistics, 40 random samples of size 200 were taken

$α$ (s)	$P (R = 1)$	White Depth	Black Depth	Nodes per second ($\times 10^{6}$)
1.5	0.485 ± 0.112	13.819 ± 1.080	12.543 ± 0.611	1.876 ± 0.057
3	0.476 ± 0.120	15.623 ± 1.015	13.925 ± 0.578	2.027 ± 0.063
6	0.152 ± 0.077	17.278 ± 0.812	16.104 ± 0.524	2.171 ± 0.052
12	0.161 ± 0.095	18.770 ± 0.658	17.887 ± 0.657	2.212 ± 0.056
24	0.102 ± 0.066	21.337 ± 0.776	20.488 ± 0.579	2.270 ± 0.067
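The subsampling procedure described in the caption of Table 1 can be sketched as follows. This is only an illustration under stated assumptions: the samples are drawn without replacement, the sample standard deviation is used, and the list of game results is synthetic rather than taken from the matches reported above. The function names are hypothetical.

```python
import random
import statistics


def armageddon_result(white_score):
    """Map a normal chess score for white (1, 0.5 or 0) to R of Eq. (1)."""
    return 1 if white_score == 1 else 0


def subsample_estimate(outcomes, n_samples=40, sample_size=20, seed=0):
    """Mean and standard deviation of the white win rate over random subsamples."""
    rng = random.Random(seed)
    rates = [
        statistics.mean(rng.sample(outcomes, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.mean(rates), statistics.stdev(rates)


# Synthetic stand-in for one 500-game match (NOT data from the paper):
# 50 white wins, 300 draws and 150 black wins.
scores = [1.0] * 50 + [0.5] * 300 + [0.0] * 150
outcomes = [armageddon_result(s) for s in scores]
mean, std = subsample_estimate(outcomes)
print(f"P(R = 1) = {mean:.3f} +/- {std:.3f}")
```

Note that draws are mapped to R = 0, since a drawn Armageddon game counts as a win for black.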

On the other hand, one may argue that the two players do not approach an Armageddon chess game in the same way (Guntz et al., 2016). The white player will try to set the board on fire, whereas the black player will try to remain solid. To simulate this kind of behaviour, the Stockfish contempt parameter is helpful. In Stockfish 9, contempt goes from −100 to 100, where −100 favours draws and solid play, and 100 favours risky moves giving more chances for a decisive result. For instance, the Stockfish community decided to set the contempt to 20 in the Top Chess Engine Championship for 2018, aiming at more aggressive play from Stockfish and a greater number of wins. It is noteworthy that higher values of contempt are harmful for the results, as too many risky moves are played and hence punished by the opponent. As an example, results for the five matches with contempt set to −20 (20) for the black (white) player are shown in Table 2. It is observed that, for low values of α, Armageddon is even less fair than in the case with 0 contempt. Moreover, within the quoted uncertainties, fairness decreases steadily as playing strength increases.

Table 2

Same as Table 1 but with contempt set to −20(20) for black(white) player

$α$ (s)	$P (R = 1)$	White Depth	Black Depth	Nodes per second ($\times 10^{6}$)
1.5	0.399 ± 0.114	13.873 ± 0.982	12.774 ± 0.793	1.792 ± 0.046
3	0.406 ± 0.112	15.939 ± 1.109	14.104 ± 0.784	2.041 ± 0.047
6	0.329 ± 0.111	17.409 ± 0.981	15.936 ± 0.612	2.191 ± 0.063
12	0.125 ± 0.078	20.001 ± 0.630	19.165 ± 0.587	2.386 ± 0.052
24	0.117 ± 0.079	21.102 ± 0.914	20.309 ± 0.536	2.323 ± 0.057

This same behaviour has been measured in human play in normal chess, when observing thousands of games of players grouped by rating ranges (Regan, 2016, 2018). Stronger players tend to punish mistakes and convert advantages more easily, as well as defend fiercely, which is why the longer the time control, the more likely a draw is.

2.2. World Blitz Chess Championship time control

From Tables 1 and 2 it is seen that the expected result of an Armageddon game is highly dependent on the player strength: the lower the playing strength, the fairer the result. This dependence poses a big problem when trying to study fair Armageddon time controls for humans. If $P (R = 1)$ did not depend on the engine strength, one could pick a time control for white, say ${(3 \text{ minutes} \mid 2 \text{ seconds})}_{1}$, and search for a time control ${(γ \text{ minutes} \mid 2 \text{ seconds})}_{1}$ for black so that the expected result between two engines is 0.5. Unfortunately, this would be misleading, since the proportion between the white and black time controls for a fair Armageddon game does depend on playing strength, and it is known that engines are much stronger than humans (Campbell et al., 2002; Schaeffer and Plaat, 1997; Wüllenweber et al., 2006). They are so strong that it was only possible to beat them through state-of-the-art experimentation on reinforcement learning using supercomputers, which produced a completely new type of engine based on artificial intelligence (Silver et al., 2018). In fact, engines are widely used to evaluate human performance in chess (Regan et al., 2012; Guid and Bratko, 2011). Therefore, one must first estimate the expected result between a human at some time control t and an engine at some time control $t^{'}$. This estimation cannot be based on empirical data, because too few human vs engine games are available.

This paper shows how to overcome this issue and hence tries to answer what the fairest Armageddon time control is for players above 2750 ELO. The strategy that I follow to answer this question is as follows:

1. Play a 500 × double round-robin chess tournament between Stockfish 9 engines labelled by α with strength determined by a time control $t_{α} = {(b_{α} \text{ nodes} \mid b_{α} / 90 \text{ nodes})}_{1}$, which matches the proportion of the World Blitz Chess Championship time control. Here $b_{α}$ is just a positive integer denoting the number of calculation nodes available to the engine in order to play the game.

2. Select games where 2750+ ELO players played in the 2017 World Blitz Chess Championship. For each 2750+ ELO player, concatenate the moves of all their games into a file $f_{j}$. Also, for each engine, concatenate all its moves in the previously described tournament into a file $e_{j}$.

3. For each file $f_{j}$ and $e_{j}$, perform a fixed-node count analysis with a witness engine, which in this case is Stockfish 9 at 10000 nodes per move. At the end of the analysis of file $f_{j}$ or $e_{j}$, a gain histogram $g [f_{j}]$ or $g [e_{j}]$ (as shown by Ferreira (2012)) is built, which is a histogram of the evaluation differences for a set of consecutive moves. This histogram can be represented as a vector of 601 integer components, where each component denotes the number of moves in the file with a certain gain. The first component of the histogram vector is the number of moves in the file with a gain of −300 centipawns or less, the second component is the number of moves in the file with a gain of −299 centipawns, and so on. The last component is the number of moves in the file with a gain of 300 centipawns or more. Because of the fixed-node count, the witness engine will perform a biased evaluation of the games it analyzes (Ferreira, 2012). The deeper the search, the more accurate the evaluation will be. However, as an approximation, the bias of the witness engine will be the same, on average, for human games in files $f_{j}$ and for engine games in files $e_{j}$, which is why it is acceptable to have a witness engine with such a low node count.

4. Let ${\bar{g}}_{f}$ be the average of the players’ histograms, i.e. for N players ${\bar{g}}_{f} = \sum_{j} g [f_{j}] / N$ (note that ${\bar{g}}_{f}$ is again a 601-component vector). For each engine gain histogram $g [e_{j}]$, perform a cross-correlation with ${\bar{g}}_{f}$, as indicated by Ferreira (2012), in order to obtain the expected result between engine j and the average of human players. The engine θ that is likely to play with the same strength as the average of human players is that for which the aforementioned expected result is 0.5. This correspondence of playing strength becomes more accurate for a witness engine with deeper analysis.

5. For such an engine θ, find an engine ϕ with strength given by its time control $t_{ϕ} = {(b_{ϕ} \text{ nodes} \mid b_{θ} / 90 \text{ nodes})}_{1}$, such that $P (R = 1) = 0.5$ when ϕ plays black against θ as white. Notice that the increment of both engines is the same.

6. To scale the engine time controls to human time controls, it suffices to note that engine θ is as strong as the average 2750+ ELO player when the latter plays with a ${(3 \text{ minutes} \mid 2 \text{ seconds})}_{1}$ time control. Therefore the time controls of a fair Armageddon game would be ${(3 \text{ minutes} \mid 2 \text{ seconds})}_{1}$ for white, and ${(3 b_{ϕ} / b_{θ} \text{ minutes} \mid 2 \text{ seconds})}_{1}$ for black.
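The 601-component gain histogram and the player average ${\bar{g}}_{f}$ used in the strategy above can be sketched as follows. This is a minimal illustration, assuming that the per-move gains in centipawns have already been obtained from the witness engine; the function names and the sample gains are hypothetical and do not come from the analysed games.

```python
def gain_histogram(gains_cp, limit=300):
    """Count moves per gain value, clipping gains to [-limit, +limit] centipawns."""
    hist = [0] * (2 * limit + 1)  # 601 components for limit = 300
    for gain in gains_cp:
        clipped = max(-limit, min(limit, int(round(gain))))
        hist[clipped + limit] += 1  # index 0 holds gains of -limit or less
    return hist


def average_histogram(histograms):
    """Component-wise mean of several histograms of equal length."""
    n = len(histograms)
    return [sum(column) / n for column in zip(*histograms)]


# Hypothetical per-move gains (centipawns) for two players, for illustration only.
player_a = [0, -12, 3, -450, 0, 25]
player_b = [0, 0, -7, 310]
g_a = gain_histogram(player_a)
g_b = gain_histogram(player_b)
g_bar = average_histogram([g_a, g_b])
print(len(g_a), g_a[0], g_b[600], g_bar[300])
```

The first and last components act as overflow bins, so a blunder of −450 centipawns is counted together with all gains of −300 centipawns or less.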

Alternatives to the gain histogram method are the stochastic agent-based method proposed by Regan and Haworth (2011) and the more theoretically robust Markov process approach to decision-making explained by Alliot (2017). However, it has been shown that the gain histogram method provides an excellent characterization of the players, as well as being faster to compute than these two alternatives (Ferreira, 2013, 2012; Alliot, 2017). The results of the engine tournament are shown in the left panel of Fig. 1. It is seen that strong engines (long initial time) win almost 100% of their games against weak engines (short initial time). When engines of equal strength play, the expected score is 50%. The expected results for this tournament, estimated with the gain cross-correlation defined by Ferreira (2013, 2012), are shown in the right panel of Fig. 1. Interestingly, the cross-correlation expected results shrink dramatically along the expected score axis (this has been reported before in Table 6 of the paper by Alliot (2017)). This method states that strong engines have around a 60% chance of beating weak engines, which is wrong, as evidenced by the actual tournament results in the left panel. Despite this important drawback, the method shows the clear and logical behaviour of the strongest machine always having an expected score larger than 0.5. Moreover, the cross-correlation method shows that the expected score between two equally strong engines is 0.5.

Fig. 1.

Tournament results, estimated results with neural network and estimated results with gain cross correlations in the left, centre and right panels, respectively, for each engine defined by the time controls $t_{α}$. The vertical dotted lines indicate each of the opponent engines.³

³ For example, if one wants to know the expected result between the $b_{α} = 1048576$ nodes and the $b_{α} = 131072$ nodes engines, one can look at the blue line in the left panel, see where it intersects the third vertical grey line from left to right, and read off that the expected result is almost 1.

Unfortunately, the cross-correlation method cannot be used to estimate the expected score in general, even though it behaves well around 0.5. As pointed out by Alliot (2017), this problem can be overcome after a deeper interpretation of cross-correlations between gain histograms. In this work, the shrinking problem of the cross-correlation method is dealt with through a machine learning algorithm that can be trained to correctly calculate the expected score. For this, a neural network with just one hidden neuron was used to predict three classes (win for white, draw or loss for white). The neural network with one hidden neuron computes the quantity $\mathrm{ReLU} (\vec{w} \cdot \vec{x} + b)$, where $\vec{w}$ and b are adjustable parameters and $\vec{x}$ are the components of the gain histograms. ReLU is the rectified linear unit function. Depending on the value of this quantity, the neural network classifies the gain histograms, as explained in what follows. For each game i of the engine tournament between two engines α and β, the gain histograms of both engines, $g_{i, α}$ and $g_{i, β}$, are obtained. If $r_{i}$ is the normal chess result for white of game i, i.e. $r_{i}$ is 0 if black wins, 1/2 if they draw, and 1 if white wins, then the samples to train the neural network were

$$x_{i} = (g_{i, α}, g_{i, β}), \quad y_{i} = r_{i}. \tag{4}$$

Therefore, the job of the neural network is to predict the outcome of a game for white. The learning procedure and the dataset to train the neural network can be enriched by also considering the outcome for black. More explicitly, if game i between engines α and β had an outcome $r_{i}$, its corresponding vector of features and label is the one written in Eq. (4). But the vector of features $x_{i}^{'} = (g_{i, β}, g_{i, α})$ with the label $y_{i}^{'} = (1 / 2 - r_{i}) + 1 / 2 = 1 - r_{i}$ is also valid.
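The colour-swapping enrichment of the dataset and the quantity computed by the single hidden neuron can be sketched as follows. This is a minimal illustration with toy three-bin histograms and arbitrary, untrained weights; the function names are hypothetical, and the way the three classes are read off from the hidden quantity is not reproduced here.

```python
def augment(g_white, g_black, r):
    """Return the sample of Eq. (4) together with its colour-swapped mirror."""
    original = (list(g_white) + list(g_black), r)
    mirrored = (list(g_black) + list(g_white), 1.0 - r)
    return [original, mirrored]


def hidden_unit(x, w, b):
    """Quantity computed by the single hidden neuron: ReLU(w . x + b)."""
    pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, pre_activation)


# Toy 3-bin histograms instead of 601-bin ones, purely for illustration.
g_alpha = [1, 5, 2]
g_beta = [3, 4, 1]
samples = augment(g_alpha, g_beta, r=1.0)  # engine alpha, as white, won

w = [0.1, 0.0, -0.2, -0.1, 0.0, 0.2]  # arbitrary weights, not trained values
for x, y in samples:
    print(y, hidden_unit(x, w, b=0.0))
```

A drawn game keeps the label 1/2 under the swap, so the enrichment doubles the dataset without changing the proportion of draws.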

With this enriched dataset, the neural network was trained. The engine tournament games were divided into two sets: one containing 7200 games, and the other 10800. The set containing 7200 games was divided into a training set of 5040 games (70%) and a testing set of 2160 games (30%). The accuracy in predicting the result from the gain histograms was 92.9% on the test set and 95.9% on the training set, showing that the neural network did not overfit. With this neural network, I proceeded to calculate the expected results for the tournament using the remaining 10800 games. These expected results are shown in the central panel of Fig. 1 and are more consistent with the left panel (which shows the empirical expected scores) than those in the right panel, as discussed before. In fact, Pearson’s correlation coefficient between the true expected results and the expected results calculated by the neural network is as high as 99.9%, which is even higher than the 95% achieved with a modified version of the cross-correlation method reported by Alliot (2017).

Now that the neural network has been shown to provide an acceptable prediction of the result of a game between two players given their gain histograms, I compare the human gain histograms with the engine gain histograms. To that end, Fig. 2 shows the performance of the average 2750+ ELO player, as well as the performance of each of the 2750+ ELO players in the World Blitz Chess Championship. For the first 5 engines, the expected result of the average 2750+ ELO player behaves logarithmically with respect to the time control, following the curve

$$\text{Expected score} = (- 0.210 \pm 0.003) \log (\text{Initial time in nodes}) + (3 \pm 0.03), \tag{5}$$

where a linear function of the logarithm was fitted. The value of initial time in nodes for which the expected score matches 0.5 is $(1.44 \pm 0.28) \times 10^{5}$. This means that if the average 2750+ ELO player were to play a series of games at a time control of ${(3 \text{ minutes} \mid 2 \text{ seconds})}_{1}$ versus Stockfish 9 with a time control of ${((1.44 \pm 0.28) \times 10^{5} \text{ nodes} \mid (1.60 \pm 0.31) \times 10^{3} \text{ nodes})}_{1}$, the expected result would be 0.5, or in other words, those two opponents would have the same strength. With this, step 4 from the strategy is completed.
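As a rough check of the crossing point, Eq. (5) can be solved for an expected score of 0.5 using the rounded central values of the fit. The worked steps below assume that the logarithm is the natural logarithm, which the text does not state explicitly but which is the reading consistent with the reported value:

$$\begin{aligned} 0.5 &= - 0.210 \ln b + 3, \\ \ln b &= \frac{3 - 0.5}{0.210} \approx 11.905, \\ b &\approx e^{11.905} \approx 1.48 \times 10^{5} \text{ nodes}. \end{aligned}$$

This lies well within the quoted $(1.44 \pm 0.28) \times 10^{5}$; the small difference presumably comes from the rounding of the published coefficients. The increment then follows from the 90 to 1 proportion of the time control, $1.44 \times 10^{5} / 90 = 1.6 \times 10^{3}$ nodes.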

Fig. 2.

Expected result of human players versus the different engines from the engine tournament. All 2750+ ELO players are shown in the left panel, and the average of the gain histograms of these players is used to compute the expected score vs the engines in the right panel. The orange line and the shadowed region indicate a 1–sigma confidence interval of the fit.

For the next step one must assume the following. Let $t_{h}$ be a human time control and $t_{e}$ be an engine time control, such that the expected score between the human and the engine is 0.5. Let $0 < γ < 1$. I will assume that the expected score between the human with time control $t_{h}$ and the human with time control $γ t_{h}$ is the same as the expected score between the engine with time control $t_{e}$ and the engine with time control $γ t_{e}$. In other words, a decrease in initial time affects the human and the engine in the same way. This assumption is needed because I will find the proportion parameter γ that leads to a fair Armageddon expected score for engine players, and extrapolating this result to human players relies on it.

With this in mind, a grid of 9 different initial times in nodes was picked as the black time control playing against the ${(144000 \text{ nodes} \mid 1600 \text{ nodes})}_{1}$ engine. Each of these 9 different engines played 10000 games as black. Figure 3 shows the average score of each of the engines, which seems to have a logarithmic relation with the initial time in nodes,

$$\text{Expected score} = (- 0.341 \pm 0.02) \log (\text{Initial time in nodes}) + (4.38 \pm 0.03), \tag{6}$$

which is different from Eq. (5) because in this case the increment for the black and the white players is the same. The line crosses the 0.5 mark, or the fair result, at $(8.8 \pm 1.0) \times 10^{4}$ nodes, which means that the fair time control for 2750+ ELO players is

$$\begin{aligned} \text{White} &\to {(144000 \pm 28000 \text{ nodes} \mid 1600 \pm 310 \text{ nodes})}_{1}, \\ \text{Black} &\to {(88000 \pm 10000 \text{ nodes} \mid 1600 \pm 310 \text{ nodes})}_{1}, \end{aligned} \tag{7}$$

which is close to a 3 to 5 ratio of the initial times, different from the 4 to 5 ratio used in the 2018 World Chess Championship and from the 7 to 10 ratio used in the recent 2019 Altibox Norway Chess competition (Peterson, 2018). In this tournament the initial times were 10 minutes for white and 7 minutes for black, with an increment of 3 seconds per move starting from move 61. A total of 34 Armageddon games were played, where white scored 15 points, or 44.1%. Although this is a small sample size, it is the first real experiment with 2750+ ELO players playing Armageddon games. The results presented so far argue that the most likely ratio between black and white initial times for an expected score of 0.5 is 3 to 5. But as was shown, this ratio may strongly depend on the strength of the players. Clearly, the longer the time control, the better the quality of the moves. Therefore, the only certain recommendation drawn from this research is that the most likely time control to achieve a fair result from an Armageddon game for 2750+ ELO players is

$$\begin{aligned} \text{White} &\to {(180 \pm 35 \text{ seconds} \mid 2 \pm 0.39 \text{ seconds})}_{1}, \\ \text{Black} &\to {(110 \pm 12.27 \text{ seconds} \mid 2 \pm 0.39 \text{ seconds})}_{1}. \end{aligned} \tag{8}$$
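The conversion from the engine result in nodes to a human time control in seconds can be sketched as follows. This is only an illustration: it uses the rounded central values of the fit in Eq. (6), assumes a natural logarithm (not stated explicitly in the text), and ignores the propagation of uncertainties. The function and constant names are hypothetical.

```python
import math

HUMAN_WHITE_SECONDS = 180.0   # (3 minutes | 2 seconds)_1 for white, from the text
ENGINE_WHITE_NODES = 144_000  # engine matched to the average 2750+ ELO player


def fair_crossing(slope, intercept, target=0.5):
    """Initial time in nodes where slope * ln(nodes) + intercept equals target."""
    return math.exp((target - intercept) / slope)


def nodes_to_seconds(nodes):
    """Linear rescaling that maps ENGINE_WHITE_NODES onto HUMAN_WHITE_SECONDS."""
    return nodes * HUMAN_WHITE_SECONDS / ENGINE_WHITE_NODES


black_nodes = fair_crossing(slope=-0.341, intercept=4.38)
black_seconds = nodes_to_seconds(black_nodes)
increment_seconds = nodes_to_seconds(1600)
ratio = black_seconds / HUMAN_WHITE_SECONDS
print(round(black_nodes), round(black_seconds, 1), increment_seconds, round(ratio, 2))
```

Under these assumptions the black initial time comes out close to the reported 110 seconds and the black to white ratio close to 3 to 5; small deviations from the published values are expected because the coefficients used here are rounded.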

Fig. 3.

Normal expected score (purple) and Armageddon expected score (blue) for matches as black against the ${(144000 \text{ nodes} \mid 1600 \text{ nodes})}_{1}$ engine as white. The orange line is the logarithmic trend approximation of the expected score and the shadowed region indicates a 1–sigma confidence interval of the fit.

3. Threats to validity

The study presented so far rests on some assumptions and approximations which must be highlighted. A witness engine was used to evaluate chess play and build gain histograms. Because of its finite search horizon, the evaluation that the witness engine provides is thought to have some bias, and I assumed that this bias equally affects both players of a game being analyzed. However, this might sometimes be incorrect. Suppose that the witness engine analyzes white’s move at a fixed node count. When the witness engine analyzes black’s move, white’s move is already on the board, which, in principle, makes the analysis of black’s move more accurate than the previous evaluation of white’s move. Of course, after black moves, another white move follows. This is why I argued that the bias of the witness engine could equally affect both players. But there are scenarios in which this may not happen (Ferreira, 2012).

Even though the neural network that was used to determine the winner of a game given the players’ gain histogram has a high prediction accuracy, it gives some room for error. Therefore, the uncertainty of the time control of the engine that plays at the same strength as the average 2750+ ELO player is prone to be underestimated. A more rigorous approach can be taken to take this neural network error into account, although it must be emphasized that this error is visually negligible around the value 0.5 of the expected result (see central panel of Fig. 1).

Additionally, throughout the work, I have used fixed-node count and not a fixed-depth for the witness engine because fixed-node count assigns the same computational power to every move. Although this condition is fair from my point of view, literature has mostly used fixed-depth (e.g. Guid and Bratko (2017)). In fact, when using fixed-node count, search depths can vary a lot since some positions are more complex than others. This can lead to unequal assessments of the different positions, yielding much better assessments on simple positions compared to those on complex positions. A systematic study of how fixed-node count influences position evaluation in several positions with respect to fixed-depth search is needed to clarify whether the seemingly fair fixed-node count strategy is valid.

This work could be improved most by understanding and quantifying how differently humans and engines make decisions. In particular, the following questions are of interest. Suppose that a certain amount of time is given to an engine and to a human, and they have to decide what move to play. How much does this decision improve if they are given twice as much time as they had in the beginning? Who will improve the most? How much will they improve? How does this depend on the complexity of the position being analyzed? All these questions are important in order to correctly simulate human play. During this research I have tried to modify the time control of the engine so that its effective strength comes close to the human strength. However, some elements of human play are lost and cannot be simulated (at least for now), such as how much time is allotted to each move. Consequently, the assumption that a decrease in initial time affects the human and the engine in the same way is a strong one and has to be checked in future work, and the results should only be considered solid at time controls similar to Eq. (8).

4. Conclusions and perspectives

I presented a method that combines 2750+ ELO human and engine chess play analysis to estimate the optimal time control so that the expected result of an Armageddon game is fair, i.e. is 0.5. Specifically, the method used gain histograms of both engine and human players in order to determine the time control at which the engine plays at the same strength as the top chess players in the world, as well as to determine the optimal estimate of the ratio between white and black time controls in an Armageddon game, so that it is fair. These optimal time controls were found to be $180 \pm 35$ seconds for white and $110 \pm 12.27$ seconds for black plus a $2 \pm 0.39$ second increment per move, starting from move 1. It was shown that the expected score for an Armageddon chess game strongly depends on the players’ strength. One might expect that having a fixed value of time controls for Armageddon games that applies to all competitors (say, in a large tournament with multiple rating-graded sections) would be a mark of fairness. The results presented in this paper imply that, surprisingly, it is not. Instead, the time difference between white and black should be handicapped according to rating. This means that a better and more general tiebreak method should be sought and implemented in chess tournaments (probably the bidding system of the U.S. Chess Championship (Championship, 2016)). However, Armageddon games provide a good spectacle for spectators and chess fans, and they are often set as the ultimate tiebreak of a match.

Hopefully, the present study will contribute as a first step from the scientific perspective towards implementing a fair time control in Armageddon chess games. However, the estimated time controls presented in this paper are just that: an estimation. Therefore I recommend the top chess tournaments that want to include Armageddon games to start testing whether the time control found here is really fair in human play.

For future work, I would like to compare the gain histograms of human white players and engine white players with different values of contempt, and to study whether the aggressiveness of a player can be simulated or approximated by contempt. Also, it seems that the agent-based method by Regan and Haworth (2011) can improve the simulation of human skills using an engine, since in this study I forced the engine to make mistakes by limiting its computing capabilities with short thinking times.

References

Alliot, J.-M. (2017). Who is the master? ICGA Journal, 39(1), 3–43. doi:10.3233/ICG-160012.

Burns, B.D. (2004). The effects of speed on skilled chess performance. Psychological Science, 15(7), 442–447. PMID: 15200627. doi:10.1111/j.0956-7976.2004.00699.x.

Calderwood, R., Klein, G.A. & Crandall, B.W. (1988). Time pressure, skill, and move quality in chess. The American Journal of Psychology, 101(4), 481–493. http://www.jstor.org/stable/1423226. doi:10.2307/1423226.

Campbell, M., Hoane, A.J. & Hsu, F.-H. (2002). Deep Blue. Artificial Intelligence, 134(1), 57–83. http://www.sciencedirect.com/science/article/pii/S0004370201001291. doi:10.1016/S0004-3702(01)00129-1.

Campitelli, G. & Gobet, F. (2004). Adaptive expert decision making: Skilled chess players search more and deeper. ICGA Journal, 27(4), 209–216. doi:10.3233/ICG-2004-27403.

Championship, U.S. (2016). REGULATIONS 2016 U.S. Championship, https://www.uschesschamps.com/information-2016-us-championship/regulations (accessed August 19, 2019).

Charness, N. (1981). Search in chess: Age and skill differences. Journal of Experimental Psychology: Human Perception and Performance, 7(2), 467–476.

Ferreira, D.R. (2012). Determining the strength of chess players based on actual play. ICGA Journal, 35(1), 3–19. doi:10.3233/ICG-2012-35102.

Ferreira, D.R. (2013). The impact of the search depth on chess playing strength. ICGA Journal, 36(2), 67–80. doi:10.3233/ICG-2013-36202.

Guid, M. & Bratko, I. (2011). Using heuristic-search based engines for estimating human skill at chess. ICGA Journal, 34(2), 71–81. doi:10.3233/ICG-2011-34204.

Guid, M. & Bratko, I. (2017). Influence of search depth on position evaluation. In Advances in Computer Games (pp. 115–126). Springer. doi:10.1007/978-3-319-71649-7_10.

Guntz, T., Crowley, J.L., Vaufreydaz, D., Balzarini, R. & Dessus, P. (2016). The role of emotion in problem solving: First results from observing chess. In Proceedings of the Workshop on Modeling Cognitive Processes from Multimodal Data. MCPMD ’18 (pp. 12:1–12:8). New York, NY, USA: ACM. doi:10.1145/3279810.3279846.

Milener, G. (2018). A new angle on understanding the draw problem, https://en.chessbase.com/post/a-new-angle-on-understanding-the-draw-problem (accessed December 1, 2018).

Peterson, M. (2018). Norway Chess Armageddon gambit, https://en.chessbase.com/post/norway-chess-armageddon-gambit (accessed November 30, 2018).

Regan, K. (2016). Magnus and the Turkey Grinder, https://rjlipton.wordpress.com/2016/12/08/magnus-and-the-turkey-grinder (accessed August 19, 2019).

Regan, K. (2018). Sliding-Scale Problems, https://rjlipton.wordpress.com/2018/09/07/sliding-scale-problems (accessed August 19, 2019).

Regan, K.W. & Haworth, G.M. (2011). Intrinsic chess ratings. In Twenty-Fifth AAAI Conference on Artificial Intelligence.

Regan, K.W., Macieja, B. & Haworth, G.M. (2012). Understanding distributions of chess performances. In H.J. van den Herik and Plaat (Eds.), Advances in Computer Games (pp. 230–243). Berlin, Heidelberg: Springer. doi:10.1007/978-3-642-31866-5_20.

Schaeffer, J. & Plaat, A. (1997). Kasparov versus Deep Blue: The Rematch. ICGA Journal, 20(2), 95–101. doi:10.3233/ICG-1997-20209.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K. & Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140–1144. https://science.sciencemag.org/content/362/6419/1140. doi:10.1126/science.aar6404.

van Harreveld, F., Wagenmakers, E.-J. & van der Maas, H.L.J. (2007). The effects of time pressure on chess skill: An investigation into fast and slow processes underlying expert performance. Psychological Research, 71(5), 591–597. doi:10.1007/s00426-006-0076-0.

Wüllenweber, M., Friedel, F. & Feist, M. (2006). Kramnik vs. deep fritz 10: Computer wins match by 4–2. ICGA Journal, 29(4), 208–213. doi:10.3233/ICG-2006-29407.