A comparison between UCB and UCB-Tuned as selection policies in GGP

Abstract

In this paper, we present a comparative analysis of two selection policies in the General Game Playing (GGP) context: Upper Confidence Bound (UCB) and Upper Confidence Bound Tuned (UCB-Tuned). The aim of the analysis is to identify which policy performs best in terms of victories in the GGP domain, a measure used in most of the literature to evaluate other policies. To carry out the comparison, two agents were programmed using the GGP-Base framework and the Monte Carlo Tree Search (MCTS) method. The games Breakthrough, Knightthrough and Connect Four were used as experimental scenarios; to the best of our knowledge, the two policies have not previously been compared on these games. The results show that UCB-Tuned is better when MCTS is limited to 100 simulations; however, when 1000 simulations are used, the two policies perform more similarly and UCB wins in two of the three games.

Keywords

General Game Playing, Upper Confidence Bound, Upper Confidence Bound Tuned, Policies

1 Introduction

One of the aims of Artificial Intelligence (AI) has been the development of intelligent agents capable of playing board games at the same level as humans, or even better [4]. The challenge in developing playing agents derives from the fact that they must possess characteristics pertaining to human intelligence, such as deduction, reasoning, problem solving, intelligent search, knowledge representation, planning, learning, creativity, perception and natural language processing, among others [21]. AI has produced agents capable of playing at the level of human champions in specific games, like Chinook for Checkers [19] (a game that has since been solved: with perfect play by both sides the result is a draw [18]), Deep Blue for Chess [6] and Dark Knight for Banqi, or Chinese Dark Chess [13].

The field of AI that has focused its efforts on developing intelligent agents capable of playing any kind of game is General Game Playing (GGP). In GGP, agents must be able to develop their own playing strategies autonomously, without human intervention and without having played the game previously, considering only the rules given at the start of the game (often expressed in the Game Description Language) [4, 21].

Although GGP agents cannot match specialized game agents in terms of performance, they are more widely applicable because they can operate in different domains; the techniques developed in GGP can even be used in other areas such as business processes, electronic business and military operations, among others [11, 12].

An important result in the GGP area has been the application of the Monte Carlo Tree Search (MCTS) method to GGP by Finnsson and Björnsson [10], used in the CadiaPlayer agent, which has won the international GGP competition three times. MCTS defined the state of the art for GGP agents [11]. MCTS is guided by Monte Carlo methods [5], using the results of previous explorations to estimate the values of the most promising moves with higher precision [5, 9].

Algorithms based on MCTS use a tree structure where each edge represents a possible move and each node represents a state of the game. Each node has a set of associated statistics, such as the number of victories and the number of visits. The algorithm consists of the iterative application of four steps [9]: selection, expansion, simulation and backpropagation. During the selection step, the tree is traversed from the root until a node that can be expanded is reached, using a selection criterion that determines which node should be visited at each level. Next, in the expansion step, a new node is added to the tree. Then, during the simulation step, the game is played out from the state of the added node, running the moves of each player until the game ends. Finally, in the backpropagation step, the result of the simulation is propagated to each visited node.

Every time MCTS performs the selection step, it faces the following dilemma: which node should be visited, the one that seems to be the best at the moment, or a less promising node that may turn out to be better in later iterations? This is known as the Exploration-Exploitation Dilemma, and the Multi Armed Bandit Problem (MABP) is representative of it [15, 22]. The MABP is described as follows: given a set of K slot machines, each of which pays out a reward with a certain unknown probability, which slot machine should be activated at each step to obtain the greatest possible total reward? Upper Confidence Bound (UCB), proposed in [3], is the most popular algorithm for the MABP, and it has been used in MCTS as the selection policy with excellent results [10]; the combination of UCB and MCTS is known as Upper Confidence Bounds Applied to Trees (UCT).

Due to its results in specific games, UCB is the most widely used policy in GGP [7, 8]: it is domain independent, so no previous knowledge of the game is required. Since the conception of UCB, new policies have been developed for the MABP, such as UCB-Tuned [3], UCB-V, PAC-UCB [2], UCB-Minimal [14], Minimax Optimal Strategy (MOSS) [1], and others [5]. So far there is no evidence of the application of these policies in GGP, which may be due to the excellent results obtained by UCB. However, these policies are also domain independent, so they can be applied to GGP in order to evaluate whether they improve on UCB in terms of victories obtained.

Perick et al. [16] present a comparative study of various policies in MCTS using the game Tron as a case study. Their results show that UCB-Tuned is the policy with the best performance. However, given the variety of games in the GGP domain, applying these policies to other games may yield results different from those obtained by Perick et al.

This paper shows a comparative study of the performance of the UCB and UCB-Tuned policies in GGP using the Breakthrough, Knightthrough and Connect Four games. The purpose of this comparison is to analyze the behavior of these policies and their performance in terms of victories in the GGP context. To the best of our knowledge no comparison has been done previously with these games using these policies.

In Section 2, the Monte Carlo Tree Search is described; in Section 3, the selection policies are described; in Section 4, the comparative analysis is presented; in Section 5, the results are presented. Finally, in Section 6 the conclusions are outlined.

2 Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a guided method based on Monte Carlo simulations [5], which uses the results of previous explorations to estimate the most promising moves in the game with better precision [5, 9]. In the tree structure each edge is a possible move and each node is a state of the game associated with a set of statistics: the number of victories and the number of visits. Figure 1 shows a game tree whose nodes show the victories on the left and the visits on the right. The game tree starts with only the root and is expanded as the method proceeds.

Fig. 1. An example of MCTS for the Tic Tac Toe game.

In each iteration, MCTS performs four steps, which are repeated cyclically until a stop criterion is satisfied [5, 9] (see Algorithm 1 and Fig. 2): Selection, Expansion, Simulation and Backpropagation.

Selection In this step the game tree is traversed from the root to a leaf (see Fig. 2a and Algorithm 2). At each level the method must decide which node should be visited next; this decision is guided by a selection policy. Originally, a policy based on the ratio between the number of victories and the number of visits was used; more recently, UCB has become the usual choice.

Fig. 2. MCTS method.

Algorithm 1 MCTS algorithm, with stop condition by number of simulations
procedure MCTS(s) ⊳ s is a node that represents a game state
    simulations ← 0
    while simulations < max do
        v ← SELECTION(s)
        r ← SIMULATION(v)
        simulations ← simulations + 1
        PROPAGATION(v, r)
    end while
    return MOV(s) ⊳ return a move
end procedure
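
To make the loop of Algorithm 1 concrete, the following is a minimal end-to-end sketch in Python. It is not the GGP-Base agent used in this paper. It assumes a toy game chosen only for illustration (a single pile of stones, each player removes 1 to 3 stones, and whoever takes the last stone wins), UCB as the selection policy and the Robust child criterion for the final move. It also assumes that each node stores victories from the point of view of the player who moved into it, which is a common convention for two-player games and differs from a literal reading of Algorithm 4. All names are hypothetical.

```python
import math
import random


class Node:
    def __init__(self, stones, player, father=None, move=None):
        self.stones = stones          # game state: stones left in the pile
        self.player = player          # player to move in this state (0 or 1)
        self.father = father
        self.move = move              # move that led from father to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.victories = 0            # wins for the player who moved INTO this node


def ucb(child, parent_visits):
    # Equation (1): average reward plus exploration bonus.
    return (child.victories / child.visits
            + math.sqrt(2 * math.log(parent_visits) / child.visits))


def expansion(v):
    move = v.untried.pop()
    h = Node(v.stones - move, 1 - v.player, father=v, move=move)
    v.children.append(h)
    return h


def selection(v):
    while v.stones > 0:               # v is not a terminal state
        if v.untried:                 # v is not completely expanded
            return expansion(v)
        v = max(v.children, key=lambda c: ucb(c, v.visits))
    return v


def simulation(v, rng):
    """Random playout from v; returns the winning player (0 or 1)."""
    stones, player = v.stones, v.player
    if stones == 0:
        return 1 - player             # the previous mover took the last stone
    while True:
        stones -= rng.choice([m for m in (1, 2, 3) if m <= stones])
        if stones == 0:
            return player
        player = 1 - player


def propagation(v, winner):
    while v is not None:
        v.visits += 1
        if winner == 1 - v.player:    # the player who moved into v won
            v.victories += 1
        v = v.father


def mcts(stones, player, max_simulations, seed=0):
    """Assumes stones >= 1, so the root has at least one legal move."""
    rng = random.Random(seed)
    root = Node(stones, player)
    for _ in range(max_simulations):
        v = selection(root)
        winner = simulation(v, rng)
        propagation(v, winner)
    return max(root.children, key=lambda c: c.visits).move  # Robust child


print(mcts(10, 0, 2000))              # number of stones the agent chooses to take
```

The four functions map one-to-one onto the four MCTS steps, so the selection policy can be swapped by replacing `ucb` alone.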

Algorithm 2 Selection step of the MCTS method
procedure SELECTION(v) ⊳ v is a node of the game tree
    while v is not a terminal node do
        if v is not completely expanded then
            return EXPANSION(v)
        else
            v ← SELECTIONPOLICY(v)
        end if
    end while
    return v
end procedure

Algorithm 3 Expansion step of the MCTS method
procedure EXPANSION(v) ⊳ v is a node of the game tree
    h ← new child-node of v
    h.father ← v
    return h
end procedure

Expansion Once the leaf node is found, it is expanded by adding a child-node, which represents a new reachable game state (see Fig. 2b and Algorithm 3).

Simulation From the game state of the child-node, the method performs random moves, simulating a playout between both players until a final result is obtained. A final result can be a victory or a defeat (in this paper a tie is considered a defeat). A disadvantage of using random moves is that the method does not emulate an intelligent player. For example, in Fig. 1b the method can choose the left or central moves, but an intelligent player would choose the right move. Cazenave [7, 8] showed that MCTS can achieve better performance when it favors moves that led to victories in previous iterations.

Backpropagation In this step the last result obtained is propagated to each node that was visited in the Selection step, increasing its number of visits and, if the result was a victory, its number of victories (see Fig. 2d and Algorithm 4).

Algorithm 4 Backpropagation step of the MCTS method
procedure PROPAGATION(v, r) ⊳ v is a node of the game tree, r is the result of the simulation
    while v is not null do
        v.visits ← v.visits + 1
        if r is a victory then
            v.victories ← v.victories + 1
        end if
        v ← v.father
    end while
end procedure

These steps are repeated until a stop criterion is satisfied, which can be a limit on the number of simulations, the run time or the number of iterations. Finally, when the method ends, the agent must choose among the child-nodes of the root-node according to one of the following criteria [5]:

Max child The node with the greatest number of victories is chosen.

Robust child The node with the greatest number of visits is chosen.

Max-Robust child The node with the greatest number of both victories and visits is chosen. If no node satisfies the condition, the MCTS method continues running until one does.

By Policy The node is chosen using the Selection Policy.
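
The first three criteria above can be sketched as follows. This is an illustrative sketch only: the dictionary layout, the function names and the example statistics are assumptions, and the Max-Robust function simply returns `None` to signal that more simulations would be needed.

```python
def max_child(stats):
    """Move whose node has the greatest number of victories."""
    return max(stats, key=lambda move: stats[move][0])


def robust_child(stats):
    """Move whose node has the greatest number of visits."""
    return max(stats, key=lambda move: stats[move][1])


def max_robust_child(stats):
    """Move that is best on both counts, or None if no such move exists yet."""
    best = max_child(stats)
    return best if best == robust_child(stats) else None


# Hypothetical root statistics: move -> (victories, visits)
agree = {"a": (60, 100), "b": (45, 50), "c": (70, 120)}
disagree = {"a": (30, 40), "b": (28, 60)}

print(max_child(agree), robust_child(agree), max_robust_child(agree))
print(max_child(disagree), robust_child(disagree), max_robust_child(disagree))
```

In the second example the two simple criteria disagree, which is exactly the situation in which Max-Robust child asks MCTS to keep running.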

3 Selection policies

In each iteration, during the selection step, at each level of the game tree, the MCTS method faces the following decision: which node should be visited? The best one so far, or some other node that may turn out to be better in future iterations? This is known as the Exploration-Exploitation Trade-Off, to which the Multi Armed Bandit Problem (MABP) also belongs.

The decision of which node should be visited at each level of the game tree is equivalent to deciding which slot machine to activate in the MABP, so each node can be treated as a machine as follows:

The number of victories of a node is equivalent to the number of times that a slot machine has given a reward.

The number of visits of a node is equivalent to the number of times that a slot machine has been activated.

Each level in a game tree is a MABP.

Due to this equivalence, it is possible to use the techniques developed for the MABP in MCTS. A selection policy is an algorithm that decides which slot machine to activate in the MABP. The most popular policy is Upper Confidence Bound.
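
The correspondence between node statistics and slot machines can be illustrated with a small sketch. The class below is hypothetical and assumes rewards of 0 or 1 with a fixed payout probability per machine; its two counters play the same role as the victories and visits of a node.

```python
import random


class SlotMachines:
    """K slot machines with 0/1 rewards (illustrative assumption)."""

    def __init__(self, payout_probabilities, seed=0):
        self.p = list(payout_probabilities)
        self.rewards = [0] * len(self.p)      # analogous to node victories
        self.activations = [0] * len(self.p)  # analogous to node visits
        self.rng = random.Random(seed)

    def activate(self, j):
        """Activate machine j once and record the outcome."""
        reward = 1 if self.rng.random() < self.p[j] else 0
        self.activations[j] += 1
        self.rewards[j] += reward
        return reward


machines = SlotMachines([0.2, 0.5, 0.8])
for _ in range(100):
    machines.activate(2)
print(machines.rewards[2], machines.activations[2])
```

A selection policy only ever reads these two counters, which is why policies designed for the MABP carry over to the nodes of a game tree without knowledge of the game.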

3.1 Upper Confidence Bound

UCB, proposed by Auer et al. [3] for the MABP, establishes that the slot machine to activate is the one that has the maximum value of Equation (1).

${\bar{x}}_{j} + \sqrt{\frac{2 \ln n}{n_{j}}}$ (1)

where ${\bar{x}}_{j}$ is the average reward of slot machine j, $n_{j}$ is the number of times that slot machine j has been activated so far, and n is the total number of times that all machines have been activated.

UCB is the most popular policy used in MCTS because it is easy to implement: it only requires the number of victories and the number of times that the node has been visited [5].
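
As an illustration, Equation (1) can be computed directly from node statistics. The sketch below uses hypothetical function names and assumes that an unvisited node receives an infinite value so that it is tried first; Equation (1) itself is undefined when $n_{j} = 0$, so this is a convention and not part of the original formula.

```python
import math


def ucb_value(victories, visits, total_visits):
    """Equation (1) for one node; unvisited nodes are tried first (assumption)."""
    if visits == 0:
        return math.inf
    mean = victories / visits
    return mean + math.sqrt(2 * math.log(total_visits) / visits)


def select_child(children, total_visits):
    """children: list of (victories, visits); returns the index with the highest value."""
    return max(range(len(children)),
               key=lambda j: ucb_value(*children[j], total_visits))


# Hypothetical statistics for two sibling nodes visited 110 times in total.
children = [(60, 100), (9, 10)]
print(select_child(children, 110))
```

Here the second node is selected: it has both a higher average and far fewer visits, so its exploration term is much larger.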

3.2 Upper Confidence Bound Tuned

Auer et al. [3] propose UCB-Tuned as an improvement to UCB, establishing that the slot machine to activate is the one that has the maximum value of Equation (2).

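${\bar{x}}_{j} + \sqrt{\frac{\ln n}{n_{j}} \min \left\{ \frac{1}{4}, V_{j}(n_{j}) \right\}}$ (2)

$V_{j}(s) = \left( \frac{1}{s} \sum_{\tau = 1}^{s} x_{j,\tau}^{2} \right) - {\bar{x}}_{j,s}^{2} + \sqrt{\frac{2 \ln t}{s}}$ (3)

where s is the number of times that slot machine j has been activated, $x_{j,\tau}$ is the reward obtained the τ-th time it was activated, ${\bar{x}}_{j,s}$ is the average of those s rewards, and t is the total number of times that all machines have been activated (the same quantity as n in Equation (2)). It should be noted that Equation (3) consists of the sample variance of the rewards of machine j plus an exploration term. UCB-Tuned has been applied in games like Go, Othello and Tron [5], but not specifically in GGP.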
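
A minimal sketch of the UCB-Tuned value of a node follows. The function name and argument layout are hypothetical, and unvisited nodes are given an infinite value by convention. When rewards are 0 (defeat) or 1 (victory), as in this paper, the sum of squared rewards equals the number of victories, so no extra statistic has to be stored; that simplification is an assumption of the example, not a requirement of the formula.

```python
import math


def ucb_tuned_value(reward_sum, reward_sq_sum, visits, total_visits):
    """UCB-Tuned value of one node from its accumulated rewards."""
    if visits == 0:
        return math.inf
    mean = reward_sum / visits
    # Sample variance plus exploration term (the quantity V_j).
    v = (reward_sq_sum / visits - mean ** 2
         + math.sqrt(2 * math.log(total_visits) / visits))
    return mean + math.sqrt(math.log(total_visits) / visits * min(0.25, v))


# With 0/1 rewards the sum of squares equals the number of victories.
victories, visits, total = 60, 100, 200
print(ucb_tuned_value(victories, victories, visits, total))
```

Because the minimum is capped at 1/4, the exploration term of UCB-Tuned is never larger than $\sqrt{\ln n / (4 n_{j})}$, that is, at most about 35% of the exploration term in Equation (1). UCB-Tuned therefore explores less aggressively than UCB for the same statistics.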

4 Comparative analysis

In order to make the comparison, the GGP-Base framework was used together with the games Breakthrough and Knightthrough (both used in [7, 8]) and Connect Four.

4.1 GGP-base framework

GGP-Base is a framework written in Java for developing GGP agents. It also includes a Server application for running matches between agents. GGP-Base can connect to servers that provide games written in the Game Description Language, which makes it possible to test the agents in different scenarios.

4.2 Breakthrough

Breakthrough is a turn-based board game for two players, played on a board of size i × j. At the beginning, the first two rows on each player's side are filled with pawns (see Fig. 3). The goal is to take a pawn to the first row of the opponent; to do so, a pawn can move one square forward or diagonally, and can capture only diagonally. Breakthrough has been solved for a board size of 6 × 5, determining that the second player has a strategy that allows it to win no matter what moves the other player makes [17]. For the research presented here, a 6 × 6 board was used, which is available in GGP-Base.

Fig. 3. Breakthrough 6 × 6.

4.3 Knightthrough

Like Breakthrough, Knightthrough is a turn-based board game for two players, played on a board of size i × j, with the difference that it uses knights instead of pawns (see Fig. 4). The goal is to take a knight to the first row of the opponent. For the research presented here, an 8 × 8 board was used, which is available in GGP-Base.

Fig. 4. Knightthrough 8 × 8.

4.4 Connect Four

Connect Four is a turn-based board game for two players (see Fig. 5). Each player has a set of discs of one color, which are dropped in from the top of the board. The goal is to align four discs on the board before the other player does. For the research presented here, an 8 × 6 board was used, which is available in GGP-Base.

Fig. 5. Connect Four 8 × 6.

5 Execution and results

To compare the policies, two agents were created, one based on UCB and the other on UCB-Tuned. The analysis was performed in two stages: in the first one the MCTS method was limited to 100 simulations, while in the second one it was limited to 1000 simulations. In each stage the agents competed in three competitions of 1000 matches each for every game: Breakthrough, Knightthrough and Connect Four. The results show that UCB-Tuned has the best performance in terms of victories when the MCTS method is limited to 100 simulations; however, when the method is limited to 1000 simulations, UCB wins in two of the three games.

5.1 MCTS limited to 100 simulations

In the first stage the results shown in Table 1 were obtained. The agent with the best performance in terms of victories is the one based on UCB-Tuned, which won 64.39% of the matches on average across the three games. It should be noted that in Knightthrough, UCB-Tuned won approximately 78% of the time. In this stage UCB-Tuned is therefore the clear winner, which suggests that it is the better choice when the number of simulations is limited.

Table 1. Results of 100 simulations (victories per competition of 1000 matches; averages with standard deviation in parentheses)

Competition	Breakthrough UCB	Breakthrough UCB-Tuned	Knightthrough UCB	Knightthrough UCB-Tuned	Connect Four UCB	Connect Four UCB-Tuned
1	392	608	214	786	481	519
2	397	603	220	780	432	568
3	385	615	223	777	461	539
Average	391.3 (6.0)	608.7 (6.0)	219.0 (4.6)	781.0 (4.6)	458.0 (24.6)	542.0 (24.6)
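
The summary row of Table 1 and the overall share of victories can be recomputed from the three competitions. The sketch below uses Python's standard `statistics` module and assumes that the values in parentheses are sample standard deviations, which is the interpretation that reproduces the reported figures to within rounding; variable and function names are hypothetical.

```python
from statistics import mean, stdev

# Victories per competition, copied from Table 1.
table1 = {
    "Breakthrough":  {"UCB": [392, 397, 385], "UCB-Tuned": [608, 603, 615]},
    "Knightthrough": {"UCB": [214, 220, 223], "UCB-Tuned": [786, 780, 777]},
    "Connect Four":  {"UCB": [481, 432, 461], "UCB-Tuned": [519, 568, 539]},
}


def summarize(victories):
    """Average and sample standard deviation, rounded to one decimal."""
    return round(mean(victories), 1), round(stdev(victories), 1)


for game, by_policy in table1.items():
    for policy, victories in by_policy.items():
        print(game, policy, summarize(victories))

# Share of all first-stage matches won by UCB-Tuned, in percent.
tuned_total = sum(sum(g["UCB-Tuned"]) for g in table1.values())
matches_total = sum(sum(g["UCB"]) + sum(g["UCB-Tuned"]) for g in table1.values())
tuned_share = round(100 * tuned_total / matches_total, 2)
print(tuned_share)
```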

5.2 MCTS limited to 1000 simulations

In the second stage the results shown in Table 2 were obtained. In this case, UCB won in Breakthrough and Connect Four, while UCB-Tuned won only in Knightthrough. However, the margins between the two agents are, overall, no larger than in the previous stage. For example, UCB wins only 55.1% of the time in Breakthrough and 58.5% in Connect Four, and loses 61.9% of the time in Knightthrough.

Table 2. Results of 1000 simulations (victories per competition of 1000 matches; averages with standard deviation in parentheses)

Competition	Breakthrough UCB	Breakthrough UCB-Tuned	Knightthrough UCB	Knightthrough UCB-Tuned	Connect Four UCB	Connect Four UCB-Tuned
1	539	461	412	588	568	432
2	563	437	380	620	598	402
3	552	448	351	649	589	411
Average	551.3 (12.0)	448.7 (12.0)	381.0 (30.5)	619.0 (30.5)	585.0 (15.4)	415.0 (15.4)
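
The win rates quoted for this stage, and the number of game and stage combinations won by each policy, can be checked with a short tally over the raw counts of Tables 1 and 2. This is only a bookkeeping sketch with hypothetical names; it performs no statistical test.

```python
# stage (simulation limit) -> game -> (UCB victories, UCB-Tuned victories)
results = {
    100: {
        "Breakthrough":  ([392, 397, 385], [608, 603, 615]),
        "Knightthrough": ([214, 220, 223], [786, 780, 777]),
        "Connect Four":  ([481, 432, 461], [519, 568, 539]),
    },
    1000: {
        "Breakthrough":  ([539, 563, 552], [461, 437, 448]),
        "Knightthrough": ([412, 380, 351], [588, 620, 649]),
        "Connect Four":  ([568, 598, 589], [432, 402, 411]),
    },
}


def ucb_win_rate(ucb, tuned):
    """Percentage of matches won by UCB over all competitions of one game."""
    return 100 * sum(ucb) / (sum(ucb) + sum(tuned))


for game, (ucb, tuned) in results[1000].items():
    print(game, round(ucb_win_rate(ucb, tuned), 1))

# Count the game and stage combinations in which UCB-Tuned has more victories.
tuned_wins = sum(
    1
    for games in results.values()
    for ucb, tuned in games.values()
    if sum(tuned) > sum(ucb)
)
print(tuned_wins, "of", sum(len(games) for games in results.values()))
```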

6 Conclusions

This paper presents a comparative analysis of two selection policies, UCB and UCB-Tuned, in the context of GGP, with the purpose of determining which policy has the best performance in terms of number of victories. From the results of the comparative analysis we can conclude that the policy with the best performance in GGP is UCB-Tuned when MCTS is limited in the number of simulations, which happens when the agents are limited in time, as is very common in competitions like the International General Game Playing Competition [11]. When the MCTS method has a greater number of simulations available, the best policy to use is UCB, but only by a small margin over UCB-Tuned.

In the game Knightthrough, the UCB-Tuned policy wins in both stages so we can conclude that in this specific game UCB-Tuned is better to use than UCB.

As the results show that UCB-Tuned wins in four of the six combinations of game and simulation limit, it may be concluded that this policy has better performance than UCB in General Game Playing.

This is preliminary work, so the following is left as future work:

The use of other games that are more complex than the games presented in this paper, for example Amazons, Chess or Go.

An analysis with other policies like UCB-V, PAC-UCB, UCB-Minimal and Minimax Optimal Strategy.

The use of different numbers of simulations in MCTS.

References

1. Audibert J.-Y. and Bubeck, Minimax policies for adversarial and stochastic bandits, In COLT, 2009, pp. 217–226.

2. Audibert J.-Y., Munos and Szepesvári, Tuning bandit algorithms in stochastic environments, In ALT, volume 4754, Springer, 2007, pp. 150–165.

3. Auer, Cesa-Bianchi and Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning 47(2-3) (2002), 235–256.

4. Björnsson and Schiffel, Comparison of GDL reasoners, In Proceedings of the IJCAI-13 Workshop on General Game Playing (GIGA’13), 2013, pp. 55–62.

5. Browne C.B., Powley, Whitehouse, Lucas S.M., Cowling P.I., Rohlfshagen, Tavener, Perez, Samothrakis and Colton, A survey of Monte Carlo tree search methods, IEEE Transactions on Computational Intelligence and AI in Games 4(1) (2012), 1–43.

6. Campbell, Hoane A.J. and Hsu F.H., Deep Blue, Artificial Intelligence 134(1) (2002), 57–83.

7. Cazenave, Playout policy adaptation for games, In Advances in Computer Games, Springer, 2015, pp. 20–28.

8. Cazenave, Playout policy adaptation with move features, Theoretical Computer Science 644 (2016), 43–52.

9. Chaslot, Monte-Carlo tree search, Universiteit Maastricht, Maastricht, 2010.

10. Finnsson and Björnsson, Simulation-based approach to general game playing, In AAAI, volume 8, 2008, pp. 259–264.

11. Genesereth and Björnsson, The international general game playing competition, AI Magazine 34(2) (2013), 107.

12. Genesereth, Love and Pell, General game playing: Overview of the AAAI competition, AI Magazine 26(2) (2005), 62.

13. Hsueh C.-H., Wu I.-C., Tseng W.-J., Yen S.-J. and Chen J.-C., Strength improvement and analysis for an MCTS-based Chinese dark chess program, In Advances in Computer Games, Springer, 2015, pp. 29–40.

14. Maes, Wehenkel and Ernst, Automatic discovery of ranking formulas for playing with multi-armed bandits, In European Workshop on Reinforcement Learning, Springer, 2011, pp. 5–17.

15. Munos, et al., From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning, Foundations and Trends in Machine Learning 7(1) (2014), 1–129.

16. Perick, St-Pierre D.L., Maes and Ernst, Comparison of different selection strategies in Monte-Carlo tree search for the game of Tron, In Computational Intelligence and Games (CIG), 2012 IEEE Conference on, IEEE, 2012, pp. 242–249.

17. Saffidine, Jouandeau and Cazenave, Solving Breakthrough with race patterns and job-level proof number search, In Advances in Computer Games, Springer, 2011, pp. 196–207.

18. Schaeffer, Burch, Björnsson, Kishimoto, Müller, Lake, Lu and Sutphen, Checkers is solved, Science 317(5844) (2007), 1518–1522.

19. Schaeffer, Lake, Lu and Bryant, Chinook the world man-machine checkers champion, AI Magazine 17(1) (1996), 21.

20. Sutton R.S. and Barto A.G., Reinforcement learning: An introduction, volume 1, MIT Press, Cambridge, 1998.

21. Świechowski and Mándziuk, Specialized vs. multi-game approaches to AI in games, In Intelligent Systems’ 2014, Springer, 2015, pp. 243–254.

22. Zhou, A survey on contextual multi-armed bandits, CoRR, abs/1508.03326, 2015.