Abstract
Monte-Carlo tree search (MCTS) algorithms play an important role in developing computer players, especially for games for which good evaluation functions are hard to obtain, like Go. The performance of MCTS players is often leveraged in combination with online and/or offline knowledge, despite the lack of game-theoretic guarantees. For the games for which we already have good evaluation functions, the use of evaluation functions in MCTS algorithms achieved a success. However, the effect of evaluation functions on the performance of MCTS algorithms have not been investigated well, especially in terms of the quality of evaluation functions. In this study, we try to address this issue by using Othello (Reversi) as the target game. Based on the evaluation function used in Zebra, a top-level open-source player, we design 15 variants of evaluation functions and use them in three ways in MCTS algorithms. We conduct a set of experiments exhaustively and analyze the effect of evaluation functions in MCTS algorithms.
Introduction
In game programming, Monte-Carlo tree search (MCTS) (Coulom (2006); Kocsis and Szepesvári (2006); Browne et al. (2012)) is a method for finding an optimal move by performing random simulations of games called playouts or rollouts and building a search tree according to the results. After the first proposal by Coulom (2006), MCTS has been intensively studied for computer Go. Now MCTS is an absolutely necessary algorithm for many games, for which evaluation functions are difficult to obtain and thus it is hard to apply minimax search (Browne et al. (2012)).
In theory, the upper confidence bounds for trees (UCT) algorithm is proved by Kocsis and Szepesvári (2006) to return an optimal move after an infinite number of playouts. In practice, however, we are requested to select a move in a limited amount of time. For this aim, despite the lack of game-theoretic guarantees, the majority of MCTS players perform tree traversals and/or playouts with some game-specific heuristic bias. Among them, a popular enhancement is rapid action value estimation (RAVE) algorithm by Gelly and Silver (2007), which utilizes online knowledge obtained through playouts.
In rather long history, several researchers and game-player developers devoted themselves to obtaining good position evaluation functions. Several stronger MCTS players have also been developed by combining evaluation functions, for example, for Amazons (Lorentz (2008); Kloetzer et al. (2007)), Breakthrough (Lorentz and Horey (2013)), Havannah (Lorentz (2016)), and Lines of Actions (Winands and Björnsson (2010)). Even for Go, AlphaGo (Silver et al. (2016)) combined MCTS with evaluation functions developed with deep convolutional neural networks and caught up human top-players.
Intuitively, those evaluation functions are desired to be more accurate for stronger MCTS players. However in fact, we often encounter trade-offs between the accuracy and speed of evaluation functions. More interestingly, Gelly and Silver (2007) reported that the use of more accurate evaluation function performed worse when used in playouts. As far as the authors know, the effect of evaluation functions remains unclear especially in the combination of MCTS. Recently, James et al. (2017) studied this topic using a synthetic game and curious evaluation functions.
In this study, we tackle the problem using Othello (Reversi) as the target game, for which we have already obtained very good evaluation functions. We use the evaluation function of Zebra (Andersson (2011)), an open-source champion-level computer player, and its 15 variants. As MCTS enhancements with evaluation functions, we test three popular ones, namely: Node Prior that uses evaluation functions at each node in a search tree (Gelly and Silver (2007); Chaslot et al. (2009)), Progressive Bias that uses evaluation functions for selecting random moves in playouts (Gelly and Silver (2011); Gelly et al. (2006); Bouzy (2005)), and Early Playout Termination that uses evaluation functions for the rewards of playouts (Lorentz (2016); Kloetzer et al. (2007)). We generate a set of test data that consist of game positions associated with the scores of the moves, and analyze the effect of evaluation functions from an exhaustive set of experiments. We also test the strength of MCTS players with evaluation functions against the baseline MCTS player.
Here we summarize some interesting findings in this paper.
For an earlier stage of Othello, we find linear relation between the performance of the single use of evaluation functions and that of MCTS players (Figs 3 and 4). For a later stage of Othello, we find not linear but quadratic (or higher) relation between the performance of the single use of evaluation functions and that of MCTS players (Figs 5 and 6). If the evaluation functions are good enough (including Zebra’s evaluation function), the MCTS players perform even worse than the single use of evaluation functions (Figs 3 and 4). The algorithm called early playout termination performs better in many cases but we find contradictory results with a very bad evaluation function. (Fig. 12)
The rest of the paper is organized as follows. In Section 2, we introduce the test data and the evaluation measures used in this study. In Section 3, we introduce Zebra’s evaluation function and then design 15 variants of it. In Section 4, we review three MCTS algorithms that combines with evaluation functions. In Section 5, we report the experiment results and analyze the results from three viewpoints. We review related work in Section 6 and conclude the paper in Section 7.
Test positions and evaluation measure
In game programming, the players (and algorithms) are often evaluated by the winning ratio or some rating scores through a number of games. In our previous work, we did preliminary evaluation in this manner (Kitamura and Matsuzaki (2018)). In this study, we generated a set of test data that consist of game positions associated with the scores of the moves to make it possible to analyze the characteristics more in detail. It is worth noting that the approach of using test data of game positions was in existing papers (Hashimoto et al. (2013); Matsumoto and Kobayashi (2013)), in which the evaluation was just based on win/lose.
We generated two sets of test data from self-matches of a simple MCTS player with 1000 playouts; T-20 consisting of 100 positions after 20 moves (Fig. 1), and T-40 consisting of 100 positions after 40 moves. For each position, every legal move is associated with a score computed as follows. For the positions after 20 moves (in T-20), we used WZebra (the Windows client of Zebra) to compute the evaluation values with the setting of look-aheads of 24 moves. Note that the evaluation values of WZebra are normalized to the scale of the number of stones. For the positions after 40 moves (in T-40), we perform perfect analysis to compute the difference of the numbers of stones. We removed (almost) one-sided positions from both of data set: more concretely, we excluded the positions if the score of the best move is less than
Using the scores associated with the moves, we can consider the following more useful measures of players instead of win/lose.
Let rank be normalized so that the best move is zero and the worst move is one. Rank rate is the average of normalized ranks of selected moves (the best is zero). Let error be defined as the difference of scores of the best move and the selected move. Error rate is the average of errors of selected moves (the best is the zero). When a move has ties, we count the rank of the selected move as the best of them. The ratio (in percent) that the selected moves are the best moves (the best is 100).

Example position in T-20 and normalized rank and error of moves.
In this study, we borrow the evaluation function used in Zebra, an open-source top-level computer player. The evaluation function uses a pattern-based evaluation scheme (also known as N-tuple networks), where 11 configurations (N-tuples) are from row/column, diagonal, edge + adjacent X-squares, and corners (The patterns are illustrated by black dots in the first row of Table 1).
We designed 15 variants by removing 1–9 of these patterns from the original evaluation function. The evaluation functions used in this study are summarized in Table 1. We tested more than 30 variants of evaluation functions and extracted those 15 variants to cover a wide range of characteristics.
We evaluated the original evaluation function and the 15 variants with the two measures in Section 2. The results are given in the second columns in Tables 2 and 3. Figure 2 plots the error rate for T-20 against that for T-40. As expected, the original evaluation function is the best among the designed ones, meanwhile it still has larger error rates than expected: 2.10 for T-20 and 4.68 for T-40. It is interesting that the error rates may differ significantly by the removal of patterns (for example, see the pairs Z7-b and Z9, or Z3-a and Z8-a). In the later evaluation in Section 5, we look close the results of Z0, Z1-c, Z3-a, Z8-a, Z7-b, and Z9.
N-tuples used in evaluation functions
N-tuples used in evaluation functions

Error rates of designed evaluation functions (single use).
Monte-Carlo tree search (MCTS) Coulom (2006) is a method for finding an optimal move by performing random simulations of games (called playouts) and building a search tree. In our study, we use the upper confidence bounds for trees (UCT) algorithm as the base MCTS algorithm. In each iteration of UCT, we compute as follows.
Traverse the search tree down to a leaf. Here, a child with the highest UCB1 value is selected at each step. If the number of playouts at the leaf exceeds some threshold (in this study, 20), the leaf is expanded and the children are added to the search tree. Perform a playout (random simulation) from the position of the leaf until the end of the game. Update the values on nodes by the result of the playout.
The UCB1 value of a node s is defined as
It is known that using only win/lose mapped to 1/0 is often better than using game scores. In this study, we use the game scores without converting to win/lose to see the effects of evaluation functions more directly.
We extend the UCT algorithm by combining evaluation functions in the following three ways.
( ( (
Comparison of error rate and best-move ratio
We implemented 15 variants of evaluation functions (Section 3) and 3 enhancements of MCTS algorithms (Section 4), and conducted a set of experiments exhaustively for test data T-20 and T-40 changing the number of playouts from
As the first analysis, we compare the error rates of the single use of evaluation functions and those of MCTS players. Figure 3 plots the experiment results for T-20 with
Experiment results for test data T-20. Each column shows the error rate and best-move ratio in brackets
Experiment results for test data T-20. Each column shows the error rate and best-move ratio in brackets
Experiment results for test data T-40. Each column shows the error rate and best-move ratio in brackets
From Figs 3 and 4, we can find several interesting facts in the case of T-20 (an earlier stage). Firstly,

T-20 with

T-20 with

T-40 with

T-40 with
From Figs 5 and 6, we can also find some interesting facts in the case of T-40 (a later stage). As we can see rather clearly in Fig. 6 (
As the second analysis, we plot the error rates of MCTS players changing the number of playouts, for six evaluation functions Z0, Z1-c, Z3-a, Z8-a, Z7-b and Z9 (Figs 7–12). As a general trend, the error rates decrease when we increase the number of playouts. An interesting fact can be seen on the gradients of

MCTS players with Z0.

MCTS players with Z1-c.

MCTS players with Z3-a.

MCTS players with Z8-a.

MCTS players with Z7-b.

MCTS players with Z9.

Win ratios w.r.t. error rates for T-20.

Win ratios w.r.t. error rates for T-40.
As the second set of experiments, we had matches of the MCTS players with evaluation functions against the baseline MCTS player. The baseline MCTS player did not use any evaluation function but it performed 128,000 playouts for each move. The MCTS player with evaluation functions performed 1,000 playouts for each move. The number of matches was 1,000 for each evaluation function.
Figures 13 and 14 plot the win ratio of the MCTS players against the error rate of single-use evaluation function for T-20 and T-40, respectively. Roughly speaking, the players developed with EPT were the strongest, which was congruent with the results of previous experiments. For two evaluations Z3-a and Z3-b, the players with EPT were weaker than ones with P-Bias. Though this is a interest result, we do not have concrete evidences to explain it. Another interesting result is that we have smaller win ratios with evaluation functions with not large error rates (corresponding evaluations functions are Z7-b, Z8-a, and Z8-b).
Related work
Despite the lack of game-theoretic guarantees, the majority of MCTS players perform tree traversals and/or playouts with some game-specific heuristic bias (Browne et al. (2012)). Among them, a popular enhancement is rapid action value estimation (RAVE) algorithm by Gelly and Silver (2007), which utilizes online knowledge obtained through playouts. Similar approaches include move-average sampling technique (MAST) by Finnsson and Björnsson (2008) and playout policy adaptation (PPA) by Cazenave (2015).
As we discussed in Section 4, several MCTS algorithms have been developed to utilize offline knowledge like evaluation functions. The most promising approach is early playout termination (EPT) (Lorentz (2016)), which was used in several players for Amazons (Lorentz (2008); Kloetzer et al. (2007)), Breakthrough (Lorentz and Horey (2013)), Havannah (Lorentz (2016)), Lines of Action (Winands and Björnsson (2010)). In this study, we also confirmed the advantage of EPT, compared with node prior or progressive bias. Recently, Lanctot et al. (2014) proposed an integration of minimax and MCTS to enhance the advantage of evaluation functions. Hsueh et al. (2015) proposed to combine EPT implicitly with minimax, which provided improvements.
In this study, we try to analyze the effect of evaluation functions in MCTS using Othello, for which we already have very strong players (also we can do perfect analysis for later positions of the game). This is on a recent trend of game-informatics research. For instance, the characteristics of the MCTS algorithms were studied using perfectly-analyzed subset of the game (Hashimoto et al. (2013); Matsumoto and Kobayashi (2013); Kato et al. (2015)). The most related study was by James et al. (2017), in which MCTS algorithms combined with evaluation functions were analyzed with a synthetic game.
Conclusion
In this study, we have tried to clarify the effects of evaluation functions combined with MCTS algorithms, using Othello as the target game. The method we used was (i) designing several evaluation functions as variants of Zebra’s evaluation function and (ii) evaluating the performance of evaluation functions through the test positions associated with exact (or almost exact) scores.
We have found several interesting facts from the experiments. For instance, linear relation between error rates of the single use and MCTS players for an earlier stage (T-20) but non-linear relation between them for a later stage (T-40), degradation of error rates combined with MCTS for very good evaluation functions.
We have noticed that several important issues still remain unclear after this study. Since we checked the performance of evaluation functions only for the positions after 20 moves and after 40 moves, we cannot do further analyses for each of evaluation functions. We may find more interesting facts if we plot the error rates over the progress of the games, which remains as our future work. It is also an interesting future work to apply the method to other games to derive game-generic trends.
Footnotes
Acknowledgement
Most of the experiments in this paper were conducted on the IACP cluster in Kochi University of Technology.
