Abstract
This paper extends simulation adjusting that optimizes parameters in a policy function so that the Monte-Carlo method produces correct moves, and gives a firm theoretical basis for simulation adjusting. Preliminary experiments show that improved simulation adjusting is capable of adjusting 28 parameters of the policy function in a Go program.
Introduction
The Monte-Carlo (MC) method was first applied to the game of Go by Brügmann (1993). This method evaluates candidate moves in a board position by performing many random simulations and returns the move with the best win rate. Although theoretically interesting, the methodology presented in the paper had little success from the viewpoint of enhancing the strength of Go programs. Later, the idea of applying the MC method to Go was picked up by Coulom (2007), and Gelly and Silver (2007) in the form of the Monte-Carlo Tree Search (MCTS), which is now viewed as being indispensable for modern strong Go programs. In October 2015, Silver et al. (2016) and their program AlphaGo beat a professional Go player with no handicap for the first time using MCTS and other methods.
MCTS is supported by two major techniques, each of which has made significant contributions to the strength of Go programs. The first technique, Upper Confidence bound applied to Trees (UCT, Kocsis and Szepesvári, 2006), is a system that efficiently allocates computational resources (Gelly and Silver, 2007); the UCT algorithm gradually concentrates computing power on promising nodes. The other important technique is the machine learning method due to Coulom (2007) that learns a policy to perform smarter simulations from human game records. Coulom (2007) considered the problem of predicting the next move from a current position, modeled it using a generalized Bradley-Terry model, and optimized the parameters of the model to fit the expert players’ game records by using the Minorization-Maximization (MM) method (Hunter, 2004). Coulom’s method significantly raises the correct answer ratio of the next move prediction, and it has been empirically shown that the simulation policy developed by Coulom’s method outperforms a uniformly random simulation policy. Coulom’s method is currently widely used, including in AlphaGo.
Looking at these two techniques from the theoretical viewpoint, we can notice that they have different statuses. UCT has a firm theoretical basis that comes from the theory of the bandit problems (Auer et al., 2002; Kocsis and Szepesvári, 2006). And also, UCT has been proved to be correct in the sense that the probability of selecting the best move converges to 1 as the number of simulations grows to infinity. In contrast, the validity of Coulom’s method has been supported on an experimental basis. In fact, Coulom’s model is aimed at better move prediction, not a correct result of MCTS method. As such, a natural question arises; is it possible to develop a better method than Coulom’s method if the aim is a correct result of MCTS method? In other words, can we develop a method to make a policy that gives correct moves when it is used in MCTS? This question motivates the current paper.
One may think that a stronger simulation policy would make an MCTS Go program stronger, but this is not true. Gelly and Silver (2007) reported that a stronger simulation policy may not enhance the strength of an MCTS Go program. Their experiments show that their Go program MoGo was stronger when a handmade simulation policy was used, which was a weak player itself. In particular, the famous policy of MoGo decides a move almost deterministically. MoGo’s policy works fine in practice, and it is sometimes better than policies developed with the MM method. As a result, many MCTS Go programs still use it. To the best of the authors’ knowledge, however, the reason why MoGo style simulation policies work well is as yet unclear.
Silver and Tesauro (2009) proposed a new learning method called simulation balancing to acquire a better simulation policy with the help of a stronger trainer Go program. After that, Huang et al. (2010) tried simulation balancing on a
Inspired by simulation balancing, Araki et al. (2014) proposed a method to learn a simulation policy in a MC method, called simulation adjusting. Simulation adjusting aims to match an AI’s move to a correct move in game records of experts by using gradient descent as in simulation balancing. Unlike Coulom’s method, simulation adjusting aims at directly developing a better simulation policy, not a better move predictor. Moreover, unlike simulation balancing, simulation adjusting does not require a strong trainer AI. We can find some existing work that aim to match moves of AI to those of human experts in chess (Tesauro, 2001) and shogi (Hoki and Kaneko, 2014). Their work trained evaluation functions of heuristic minimax search method. Therefore, it would be interesting to research such idea to train a policy function of MCTS method.
This paper is a continuation of the conference paper (Araki et al., 2014), which for the first time proposed the idea of simulation adjusting and reported preliminary experiments on
After the second submission of this paper, we have noticed the work by Graf and Platzner (2016), which also have reexamined simulation balancing and improved the playing strength of a Go program.
Theory
Following Coulom (2007), we shall use the soft-max model (Bishop, 2006) in simulations of primitive MC. Let
Next, we construct a model parameterized by
The game record
In contrast, we propose to adjust the parameter θ to increase the probability that the selected move after simulations is the experts’ move. For this purpose, we design our objective function in the following way. First, defining the notation
Below, we describe some of the properties of the objective function. We define for
We have, for
It holds that for
We have
From Theorem 3, we can find a descent direction by approximating the partial derivatives, and construct the improved simulation adjusting algorithm. Since
Several remarks on the algorithm follow.

Improved Simulation Adjusting
The subscript
N is the number of simulations starting from
Since the supposed application of improved simulation adjusting is the game of Go, which, in general, has a large set of candidate moves, using all candidate moves poses a significant drawback in performance. However, in other games, it might be possible to use all candidate moves.
The approximate value of the objective function is computed as follows:
As we have mentioned, the conference paper (Araki et al., 2014) already proposed the idea of improved simulation adjusting. In the conference paper, the objective function to be minimized is defined as
First, the objective function had a drawback: with θ minimizing this objective function,
Second, the size of the training data was 60, whereas the number of features was several thousands. Therefore, the learned parameters could easily overfit the data. To overcome this difficulty, we discarded
Third, the objective function value did not decrease steadily, and the variance of it was not negligible. As a result, the variance of correct answer ratio was relatively large, too. To obtain more reliable results, we increase the number of simulations N in the experiments from 50 to 1000 on the larger board in this paper.
Preliminary experiments on improved simulation adjusting
We tested the improved simulation adjusting algorithm on a set of Go endgame problems on
To avoid data bias, we used cross validation as follows. First, we divided
Our original program, called “MC_ark” was used in the experiments. MC_ark is based on the MCTS technique and uses the maximum entropy method (Berger et al., 1996) to generate a policy. MC_ark won 7th place in the 8th UEC Cup.1
As we stated in the previous section, we restricted the candidate moves because of computational performance limitations. Instead of the set of all legal moves on s,
We imposed an upper bound of
Figure 1 shows the experimental results. The bottom axis is the number of iteration L. It took one 16-core machine (Xeon E5-2687W) about half a day to obtain the results. We can see that the objective function values have steadily decreased in the first several iterations. After the decrement, the value fluctuates. Improved simulation adjusting achieved the correct answer rate of test data 0.70, whereas MM method did 0.67. Our experiment showed that improved simulation adjusting reached to a comparable level of MM method. It would be difficult to see a result that outperforms MM method by a statistically significant amount when the policy function is modeled by only 28 parameters and the training and test data contain only 286 correct moves on the

Change in the objective function value estimated by using 11-fold cross validation.
This paper has shown an advanced version of simulation adjusting. The differences from the original work are: (i) the new objective function is capable of avoiding the case where the simulation values reduce to a constant value, (ii) the number of weights is less than the number of training data to avoid overfitting, and (iii) the number of simulations is increased to compute the objective function and its partial derivative values accurately.
We have carried out numerical optimization of 28 weights in the policy function by using
We have revealed a barrier in applying simulation adjusting to a
There are some future works that are necessary to be investigated. First, we shall develop a method to combine simulation adjusting with Upper Confidence Bound (Auer et al., 2002) or its variants to balance exploration and exploitation of candidate moves. Note that, in this paper, candidates were narrowed down by means of the static evaluation function without using simulation results. Second, we shall examine the performance of MCTS when a policy learned by simulation adjusting is used. Note that, in this paper, the performance of policy was measured by using the primitive Monte-Carlo method instead of using MCTS. Third, we shall improve computational efficiency of simulation adjusting to enlarge the size of board and training data.
Footnotes
Acknowledgements
This work was supported by JSPS KAKENHI Grant Numbers JP15J11695, JP26870200, JP25730212, JP26330025, JP16K00503, JP26280005, and JP15H02972.
