Abstract
For developers of game-playing programs, it is an important but non-trivial task to measure the strengths of the programs, especially with respect to the theoretically optimal players. The optimal players are hard to obtain in practice since many real-world games are too hard to solve entirely. In this paper, a reduced version of Chinese dark chess (CDC),
Introduction
When developing game-playing programs, one important thing is to measure the strengths of the programs. Assuming that an optimal player exists, a program’s absolute strength can be measured by the win rate playing against the optimal player. However, it is impractical to expect access to such optimal players in many real-world games since these games are too complex to solve entirely (Kishimoto & Mueller, 2015; van den Herik et al., 2002). Thus, for the strength analysis of game-playing programs of these games in practice, many researchers used win rates as well as some other metrics such as prediction rates and mean squared errors to measure their strengths.
The first goal of this paper is to analyze the correlation between these practically measured strengths and absolute strengths for the game of Chinese dark chess (CDC), a non-deterministic game normally played on a
The second goal is to investigate whether the win rates obtained in reduced and small-sized games can be used to predict those in the full-sized games, since games with smaller sizes usually share similar properties to the original ones. A strong correlation in this case implies that new algorithms can be tested in small-sized games instead of full-sized games to save time. The experiment results showed that most of the win rates of the MCTS-based programs (Hsueh et al., 2016) obtained in
The rest of the paper is structured as follows. Sect. 2 describes background knowledge including the game of CDC, the reduced
Background
Chinese dark chess (CDC)
Chinese dark chess (CDC) (B.-N. Chen et al., 2010; Hsueh et al., 2016; Yen et al., 2015) is a two-player zero-sum non-deterministic game which uses the same set of pieces as Chinese chess. The two players, red and black, respectively own one king (
The game is played on a

(a) The initial position and (b) an example of position for CDC.
All pieces can be moved to any of its adjacent empty squares. Pieces other than cannons can be moved to an adjacent square to capture an opponent piece. Whether a piece can capture another piece depends on its and its opponent piece’s ranks, which, from the highest to the lowest, are kings, guards, ministers, rooks, knights, cannons, and pawns. Generally, pieces can capture opponent pieces with equal or lower ranks. Three exceptions are illustrated as follows. First, kings cannot capture pawns. Second, pawns can capture kings. Third, cannons can capture all types of opponent pieces but need to jump over exactly one piece in the same row or the same column, which is called one-step-jump. An example for one-step-jump is shown in Fig. 1(b), where the black cannon (
CDC is one of the games played in the Computer Olympiad since 2010 (Hsueh & Wu, 2015; Tseng et al., 2017; van den Herik et al., 2012; Yen et al., 2010, 2013). Many programs were developed based on alpha-beta search (Chang et al., 2017; B.-N. Chen et al., 2010; J.-C. Chen et al., 2018; Yen et al., 2010). MCTS was first applied to CDC by Yen et al. (2015). Hsueh et al. (2016) further improved an MCTS-based game-playing program by incorporating four techniques, described in more detail in Sect. 2.3.
A reduced version of CDC,
All legal positions were assigned theoretical values from retrograde analysis (Chang & Hsu, 2014). In their work, expected win rates were used as theoretical values. If a position does not contain any unrevealed piece or only one unrevealed piece type remains, the game becomes deterministic. Thus, the theoretical value is 1, 0, or
24 fair and equivalent material combinations in
CDC
24 fair and equivalent material combinations in

Examples of positions in
Monte-Carlo tree search (MCTS) (Browne et al., 2012) is a kind of best-first search where states are evaluated by Monte Carlo simulations. One popular implementation in practice is upper confidence bounds applied to trees (UCT) (Kocsis & Szepesvári, 2006) which applies upper confidence bounds (UCB) (Auer et al., 2002) at selection phase, one of the four phases in MCTS. The UCB function is shown as follows:
To generate a move, MCTS usually consists of several iterations (or simulations). Each iteration contains four phases, selection, expansion, playout, and backpropagation. The selection phase follows the above UCB formula to traverse UCT until a leaf is reached. During the expansion phase, one or more children from the leaf is expanded. A playout policy is then used to play the game to the end, obtaining a game outcome in the process. Backpropagation then refers to the process of updating this outcome of the playout to all nodes traversed during selection. As for playout policies, the simplest one is to randomly select a move. More sophisticated playout policies were used to make the playouts cleverer for CDC (Hsueh et al., 2016; Yen et al., 2015). They used piece weights, as an example listed in Table 2, to calculate scores for moves when captures and/or escapes happened. Roulette-wheel selections were then applied to select moves according to their scores.
Hsueh et al. (2016) further incorporated four techniques into an MCTS-based CDC program to improve the playing strength. The four techniques were early playout terminations, implicit minimax backups, quality-based rewards, and progressive bias, respectively introduced in the following four subsections.
Early playout terminations
The technique of early playout terminations (EPT) terminates playouts before the end of games. For CDC, Hsueh et al. (2016) terminated playouts immediately and returned the corresponding outcomes whenever a material combination of a position is very likely to win (e.g., the player has a king and a guard while the opponent has only two ministers), or very likely to draw (e.g., the player has only one guard while the opponent has two ministers). In their article, all material combinations were analyzed manually for EPT, and labeled as likely wins, likely draws, likely losses, or unknown.
The advantages of EPT were to speed up and make the outcomes of playouts more accurate. In their experiments, they also reversed the outcomes for those positions with likely wins and likely losses with a probability of ε to show the importance of the accuracy of the outcomes of playouts. Clearly, when ε was 0, EPT performed best and obtained a win rate of 60.75% (± 2.76%) against the original program. Note that these programs were run with 30,000 simulations per move, and the same for all experiments mentioned in the rest of this section.
Implicit minimax backups
The technique of implicit minimax backups (IMB) was proposed by Lanctot et al. (2014). The objective was to help MCTS deal with tactical positions by introducing minimax scores from heuristic evaluations of positions. The
The heuristic evaluation function for CDC used by Hsueh et al. (2016) was the weighted material sum, namely the difference of the total piece weights between the player and the opponent. The piece weights were also the ones in Table 2. The minimax scores of chance nodes were weighted according to visit counts instead of probabilities of unrevealed pieces. The experiment results showed that with
Piece weights for CDC used by Hsueh et al. (2016)
Piece weights for CDC used by Hsueh et al. (2016)
The technique of quality-based rewards was proposed by Pepels et al. (2014). The rewards of wins and losses obtained in simulations were adjusted according to simulation lengths (SL) or terminal state qualities (TSQ) collected online. Simulation length was the length of the game in the simulation from the root to the end. Terminal state quality used by Hsueh et al. (2016) for CDC was the remaining number of pieces with considering piece weights,
Progressive bias
The technique of progressive bias (PB) was proposed by several researchers with different forms (Chaslot et al., 2008; Ikeda & Viennot, 2014; Nijssen & Winands, 2011). The idea was to use available heuristics to guide selections among nodes with few simulations. Hsueh et al. (2016) extended the UCB formula (Formula (1)) to:
The heuristic scores of moves designed by Hsueh et al. (2016) for CDC used some basic heuristics including capturing opponent pieces, escaping the player’s pieces, and suiciding the player’s pieces. These basic heuristics are collectively referred to in the remainder of this paper as
Investigated strength analysis metrics
Win rates playing against a designated player
For game-playing programs, win rates against designated players are commonly used as a metric to measure the strengths of the programs. It is usually claimed that the higher the win rate obtained, the stronger the program is (Tesauro, 1995). Playing against human players is also a way (Campbell et al., 2002; Silver et al., 2016, 2017; Tesauro, 1995); however, since it is too slow to obtain statistically significant results (Tesauro, 1995), usually, the designated players are computer programs, often called baseline programs.
Baseline programs in practice can be a random player (randomly make a move), an early version of the tested program, or other available programs. Usually, random players are weak and are easily defeated. In the past, when demonstrating the strengths of new algorithms or heuristics, usually they were made to play against early versions of the tested programs (Chaslot et al., 2008; B.-N. Chen et al., 2010; Gelly & Silver, 2007; Hsueh et al., 2016; Lanctot et al., 2014; Nijssen & Winands, 2011; Pepels et al., 2014; Runarsson & Lucas, 2014; Silver et al., 2016, 2017; Tian & Zhu, 2016; Yen et al., 2015). Sometimes, they were made to play against other openly available programs (Chaslot et al., 2008; Coulom, 2007; Gelly & Silver, 2007; Ikeda & Viennot, 2014; Lanctot et al., 2014; Silver et al., 2016, 2017; Tian & Zhu, 2016; Yen et al., 2015).
However, it is still a question whether and how much the win rates can reflect the absolute strengths of the tested programs. In the solved

A partial tree with theoretical values of positions in the red player’s view from a game of “
Note that, for deterministic positions with wins or losses, the theoretical values did not provide information of distances to win or loss (Chang & Hsu, 2014). Thus, the optimal player does not take into consideration shortest wins and longest losses. In addition, the optimal player does not take into consideration whether a move is risky or not. For the example in Fig. 3, the move b1(?) has a theoretical value of 0.000 from the red player’s view, which is as good as the move a2–a1. Thus, the optimal player just randomly chooses from the two moves. However, the move b1(?) is riskier since the red player has a probability of ¼ to flip the
Assuming that a set of positions with moves played by experts (optimal players at best) are available, prediction rates to those expert moves are commonly used to analyze and measure strengths of programs (Chang et al., 2015; Coulom, 2007; Lai, 2015; Liskowski et al., 2018; Runarsson & Lucas, 2014; Silver et al., 2016, 2017; Tian & Zhu, 2016). The prediction rate of a program is the probability that the moves selected by the program match expert moves. For example, moves from game records played by human players were used as expert moves in the game of Go (Coulom, 2007; Silver et al., 2016, 2017; Tian & Zhu, 2016) and Othello (Liskowski et al., 2018; Runarsson & Lucas, 2014). In some cases, the moves generated by a program with much deeper search also served as expert moves. For example, Lai (2015) used the moves from a program with time-limited searches as expert moves in chess.
In
If a program has a high prediction rate to the expert moves, it can be said that the strength of the program is close to the experts’ and thus can be inferred as a strong program. However, any moves not matching expert moves do not necessarily indicate sub-optimal plays for the following two reasons. First, the expert moves (from human players or programs) may not be the theoretically best moves. Second, it is possible that there exist several equally good moves given a position, but only the one selected by the expert is counted as an expert move. An example that stronger programs have lower prediction rates is shown by Silver et al. (2017) for the game of Go, where AlphaGo Zero had much higher win rates but lower prediction rates to moves by professional players (cf. Figs 3 and 4 by Silver et al. (2017)).
Mean squared errors of values of positions
Assuming that a set of positions with ground-truth values (theoretical values under best circumstances) are available, mean squared errors between the given ground-truth values and the values estimated by game-playing programs can be used to analyze the strengths of the programs (Gelly & Silver, 2007; Silver et al., 2016, 2017). The mean squared error of a program on a given set of positions P and the corresponding ground-truth values
In
If a program has a low mean squared error of values of positions, it means that the program understands the positions better and thus is usually stronger. Mean squared errors were shown to have a strong correlation to the strengths of programs (Gelly & Silver, 2007; Silver et al., 2017).
Experiments for
CDC
In the experiments, all analyzed CDC programs were based on MCTS with different techniques or different parameters (Hsueh et al., 2016). Although the programs were originally written to play
In the following subsections, for each of the settings of MCTS above, win rates are collected given five different baseline programs in Sect. 4.1, prediction rates are collected given three different sets of expert moves in Sect. 4.2, and mean squared errors are collected given three different sources of ground-truth values in Sect. 4.3. All the detailed values of win rates, prediction rates, and mean squared errors are presented in Appendix A.
Details of the 129 analyzed program settings for
CDC
Details of the 129 analyzed program settings for
This subsection investigates whether the win rates against practical and available baseline programs are correlated to the win rates against the optimal player (absolute strengths). In the experiments, the win rate of a program against a designated baseline program was obtained in the following way. The program played against the designated baseline program 1,000 games, with altering turns playing first, in each of the 24 combinations as listed in Table 1. Its win rate was the average score of the 24,000 games where a win was counted as 1, a draw as 0.5, and a loss as 0.
In
Six-ply EMM ran about 150 times slower than that of four-ply. Due to time constraints, experiments on EMM with more plies were not conducted.
In order to analyze the correlation between win rates against practical and available baseline programs and those against OPT, each of the 129 program settings (listed in Table 3) played against the five baseline programs respectively. For these program settings, let WROPT, WREMM, WR1k, WR5k and WR30k be the win rates respectively against OPT, EMM, MCTS-1k, MCTS-5k, and MCTS-30k. Figure 4(a)–(d) show respectively WREMM, WR1k, WR5k, and WR30k with respect to the absolute strength WROPT for all program settings. These plots show very highly positive correlations whose correlation coefficients r (Asuero et al., 2006; Lane et al., 2014), described in more detail in Appendix B, were 0.9633, 0.9790, 0.9880, and 0.9922 respectively. The stronger the baseline program was, the higher the correlation coefficient obtained. The sample standard errors (Lane et al., 2014) with respect to the predicted lines (the dotted lines in Fig. 4) were 1.57%, 1.19%, 0.90%, and 0.73% respectively. Appendix B describes more on the sample standard errors and the predicted lines. The above observations suggested that the baseline program should be as strong as possible in order to obtain accurate strengths of game-playing programs when the optimal player was not available. However, this would be a tradeoff between the accuracy of the metric and the time needed to perform experiments, since a stronger baseline program usually implies more computation time.

Scatter plots for (a) WREMM, (b) WR1k, (c) WR5k, and (d) WR30k to WROPT.
This subsection investigates whether the prediction rates to practical and available expert moves are correlated to those to the best moves (moves by OPT), and whether all of these prediction rates are correlated to the absolute strengths (win rates against OPT). A set of 2,400 games, 100 in each of the 24 material combinations listed in Table 1, were collected and about 47,000 positions in the game set were used for experiments on prediction rates.
Three sets of expert moves were considered, the best moves as described in Sect. 3.2, the moves by MCTS-1k, and the moves by MCTS-1M (the original MCTS with 1,000,000 simulations per move). MCTS-1k and MCTS-1M had win rates of 37.67% (± 0.57%) and 49.21% (± 0.58%) against OPT respectively. Note that MCTS-1k was intentionally included to investigate the cases where the quality of the collected expert moves was low. To obtain the prediction rate of a program, the program played moves for the positions in the selected game set. In the experiment, both MCTS-1k and MCTS-1M obtained prediction rates respectively to the best moves with 84.46% (± 0.33%) and 93.31% (± 0.23%). The results clearly showed that MCTS-1M played the best moves more often than MCTS-1k.
In order to analyze the correlation of prediction rates, each of the 129 program settings (listed in Table 3) played moves for the positions in the selected game set. For these program settings, let PRBEST, PR1k and PR1M be the prediction rates respectively to the best moves, the moves by MCTS-1k, and the moves by MCTS-1M. Figure 5(a) and (b) show respectively PR1k and PR1M with respect to PRBEST for all program settings. Note that the values in PRBEST tended to be higher than those in PR1k and PR1M, for the reason that it was easier to match best moves since a position may have multiple best moves as described in Sect. 3.2. From the experiment results, as expected, PR1k could not reflect PRBEST well, which only had a correlation coefficient of 0.3938. On the contrary, PR1M had a highly positive correlation to PRBEST, which had a correlation coefficient of 0.8998. The sample standard errors with respect to the predicted lines were 3.29% and 1.56%. The results showed that the quality of expert moves influenced the results of prediction rates greatly, and that prediction rates to moves by a stronger program can be used to approximate those to the best moves.

Scatter plots for (a) PR1k and (b) PR1M to PRBEST.
Next, the correlation between the prediction rates and the absolute strengths (win rates against OPT) was investigated. Figure 6(a)–(c) show the prediction rates, PR1k, PR1M and PRBEST, with respect to WROPT. Similar to those shown in Fig. 5, PR1k was not a good metric to the absolute strengths either, which only had a correlation coefficient of 0.5159. PR1M had a very highly positive correlation coefficient of 0.9167 to the absolute strengths. The sample standard errors with respect to the predicted lines were 5.00% and 2.33%. These suggested that when expert moves with high quality were available, prediction rates to those expert moves were good metrics to the strengths of game-playing programs, though not as good as win rates playing against a baseline program as shown in Sect. 4.1. In addition, PRBEST had a higher correlation to absolute strengths than both PR1k and PR1M. Namely, the correlation coefficient of PRBEST to WROPT was 0.9802, and the sample standard error with respect to the predicted line was 1.16%. The analyses showed that the strengths of game-playing programs had strong correlations to the ability to select the best moves.

Scatter plots for (a) PR1k, (b) PR1M, and (c) PRBEST to WROPT.
For
In the experiments, the same set of about 47,000 positions as in Sect. 4.2 was used. Two sources of ground-truth values for these positions were considered, the theoretical values and the outcomes of games from self-plays by MCTS-1M. Namely, a self-play game was performed by MCTS-1M from a given position to the end of the game. The game outcomes, consisting of wins, draws, and losses, were converted to 1’s, 0’s and
In order to analyze the correlation of mean squared errors, each of the 129 program settings (listed in Table 3) ran MCTS on all positions in the game set. For these programs, let MSETH, MSEGO-1k and MSEGO-1M be the mean squared errors respectively with theoretical values, game outcomes from self-plays by MCTS-1k, and game outcomes from self-plays by MCTS-1M. Figure 7(a) and (b) show respectively MSEGO-1k and MSEGO-1M with respective to MSETH for all program settings. Note that the values in MSETH tended to be lower than those in MSEGO-1k and MSEGO-1M, for the reason that the granularity of theoretical values was lower than that of game outcomes. The experiment results showed very highly positive correlation coefficients, 0.9971 and 0.9999, and the sample standard errors with respect to the predicted lines were 0.0008 and 0.0002. These showed that mean squared errors with different sources of game outcomes could predict those with respect to theoretical values well even when the expert had a lower quality such as MCTS-1k.

Scatter plots for (a) MSEGO-1k and (b) MSEGO-1M to MSETH.
Next, the correlation between the mean squared errors and the absolute strengths (win rates against OPT) was investigated. Figure 8(a)–(c) show the mean squared errors MSEGO-1k, MSEGO-1M, and MSETH with respect to WROPT. All showed highly negative correlations, which had correlation coefficients of
In this section, the absolute strengths of game-playing programs in
The details of the 66 program settings for both

Scatter plots for (a) MSEGO-1k, (b) MSEGO-1M, and (c) MSETH to WROPT.
Details of the 66 analyzed program settings for both
Figure 9 shows that

Scatter plot for
In brief,
In this paper, three metrics measuring strengths of game-playing programs are analyzed in a solved game,
In addition, the second metric, prediction rates to expert moves, also has a high correlation to the absolute strengths, as long as expert moves with high qualities were collected. The third metric, mean squared errors of values of positions, also has a high correlation to the absolute strengths, which is influenced less by the quality of the collected data. The results show that the strengths of game-playing programs can be roughly measured by these two metrics.
In addition, the win rates for game-playing programs with the same settings in both
Footnotes
Acknowledgements
The authors would like to thank the Ministry of Science and Technology of the Republic of China (Taiwan) for financial support of this research under contract numbers MOST 106-2221-E-009-139-MY2, 106-2221-E-305-016-MY2, 106-2221-E-305-017-MY2, 107-2634-F-009-011, and 107-2634-F-259-001.
Detailed experiment results
Table A.1 lists the win rates of the 129 program settings against five
Table A.2 lists the prediction rates of the 129 program settings for
Simple linear regression
Simple linear regression, described in more detail by Lane et al. (2014), is a model to derive how two random variables are linearly correlated. This section illustrates it by the example of two random variables X and Y with sample data
The regression line, the best-fitting straight line, is
