Abstract
Since DeepMind’s AlphaZero, Zero learning quickly became the state-of-the-art method for many board games. It can be improved using a fully convolutional structure (no fully connected layer). Using such an architecture plus global pooling, we can create bots independent of the board size. The training can be made more robust by keeping track of the best checkpoints during the training and by training against them. Using these features, we release Polygames, our framework for Zero learning, with its library of games and its checkpoints. We won against strong humans at the game of Hex in
Keywords
Introduction
In spite of AlphaGo (Silver et al., 2016), some games still resist to computers and for many games the computational requirement is still huge. We present Polygames, our open-source zero learning framework available at
Zero learning
AlphaGo and AlphaZero (Silver et al., 2016, 2017) proposed a combination between Monte Carlo Tree Search (Coulom, 2007) and Deep Learning. The version in (Silver et al., 2017) is a simplified and elegant version, learnt end-to-end.
Monte Carlo tree search
Monte Carlo consists in approximating values (i.e. expected rewards) by averaging random simulations. MCTS (Monte Carlo Tree Search) consists in biasing these random simulations: using the statistics from previous simulations, we increase the frequency of moves which look good. The most well known variant of MCTS is Upper Confidence Trees (UCT) (Kocsis and Szepesvári, 2006), which uses, for biasing the simulations, a formula inspired from the bandit literature: a move is chosen if it maximises a score as follows:
Neural policies
Let us assume that we have a function tensor which maps a state to a tensor
Zero model of play: How the MCTS is modified by adding the NN
The zero-model is then as follows. First, the UCT formula (Eq. (1)) is adapted as PUCT (Silver et al., 2017), as follows:
The output of the neural network is actually corrupted by a Dirichlet noise: rather than
Second, when a simulation reached a state in which no statistics are available (because no simulation has been performed here), instead of returning the reward of a random rollout until the end of the game, we return the reward estimated by
Let us generate games with a zero-model as above, using e.g. r is the final reward of the game. p is the tensor of frequencies in the MCTS simulations at s. More precisely, with a an action identified to a location in the output tensor as above,
These 3-tuples are then used for training the network so that
Overall architecture
There is a server (typically for us 8 GPUs) and many clients (tested up to 500 GPUs and 5 000 cores in our experiments):
The server receives 3-tuples The clients send data (3-tuples) to the server.
The number of clients should be tuned so that the cycles performed by the trainer are just a bit faster than the speed at which data are provided; Section 2.4 provides a robust solution for ensuring this, in particular for low computational power.
Other open-source frameworks
Many teams have replicated and sometimes improved the Alpha Zero approach for different games.
Elf/OpenGo (Tian et al., 2019) is an open-source implementation of AlphaGo Zero for the game of Go. After two weeks of training on 2 000 GPUs it reached superhuman level and beat professional Go players. Compared to Elf/OpenGo, Polygames features a wide library of games, boardsize-invariance, and growing architectures.
Leela Zero (Pascutto, 2017) is an open-source program that uses a community of contributors who donate GPU time to replicate the Alpha Zero approach. It has been applied with success to Go and Chess. Compared to Polygames, its library of games is smaller, it is not boardsize-invariant and does not support growing architectures.
Crazy Zero by Rémi Coulom is a zero learning framework that has been applied to the game of Go as well as Chess, Shogi, Gomoku, Renju, Othello and Ataxx. With limited hardware it was able to reach superhuman level at Go using large batches in self-play and improvements of the targets to learn such as learning territory in Go. While the predicted final territory information (attribution of each location on the board) is not used by the Zero player, learning territory, as a side information in Go, increases considerably the speed of learning. Polygames does not have this clever use of side information. This is under progress. Polygames, on the other hand, features a large library of games, boardsize invariance, and growing architectures.
KataGo (Wu, 2019) is an open-source implementation of AlphaGo Zero that improves learning in many ways. It converges to superhuman level much faster than alternative approaches such as Elf/OpenGo or Leela Zero. It makes use of different optimizations such as using a low number of playouts for most of the moves in a game so as to have more data about the value in a shorter time. It also uses, as a side information for training the neural network, additional training targets so as to regularize the networks. Compared to KataGo, Polygames has a wide library of games, boardsize invariance, diversity of opponents for the training, and growing architectures.
Galvanise Zero (Emslie, 2019) is an open-source program that is linked to General Game Playing (GGP) (Pitrat, 1968). It uses rules of different games represented in the Game Description Language (GDL) (Love et al., 2006), which makes it a truly general zero learning program able to be applied as is to many different games. The current games supported by Galvanise Zero are Chess, Connect6, Hex11, Hex13, Hex19, Reversi8, Reversi10, Amazons, Breakthrough, International Draughts. Contrarily to Galvanise Zero, Polygames is based on compiled C++ game descriptions, and uses growing architectures and boardsize-invariant neural networks.
To the best of our knowledge, besides differences pointed out above, Polygames is the only framework featuring a population of models (the past checkpoints), ranked by an ELO scale and used for sampling sparring partners for the training. It is also the framework in which growing architectures are most systematized.
Innovations
Some innovations in Polygames consist in using ideas from computer vision inside Zero learning (Section 2.1): global pooling, fully convolutional neural networks. We also use a diverse set of opponents for filling the replay buffer (i.e. the current model plays against previous models sampled according to their ELO rank as described in Section 2.3), and growing architectures (Section 2.2). These elements are detailed in the present section.
Structured zero learning
Fully convolutional models: Taking into account the natural mapping between inputs and outputs
Many zero-learning methods are based on traditional convolutions, followed by fully connected layers. However, policy learning in board games is often closer to image segmentation (i.e. several output values per input pixel) than to classical image classification (fixed output size, independent of the number of pixels) as actions are naturally mapped on boards (a vector of logits for each location on the board). More precisely, for many games:
The input has various channels, and two dimensions matching the board size (spatial dimensions; see Fig. 1). Similarly, the output of the network has various channels, corresponding to various possible moves, and two dimensions matching the board size as well.
Therefore, we can apply fully convolutional models – not a single fully connected layer is necessary in the policy part. This contributes to scale invariance (Section 2.1.2) but also takes into account the partial translation invariance of policy learning. Polygames features fully convolutional networks (Shelhamer et al., 2017), residual fully convolutional networks (He et al., 2015), including with global pooling (Wu, 2019) as detailed below for scale invariance, and U-networks (Ronneberger et al., 2015).

Traditional deep convolutional network. The convolutional layers preserve the spatial coordinates of the shape – possibly not the depth, i.e. the number of feature maps. The number of parameters of those layers, therefore does not depend on the board size, and a same neural network can work in

Structure of a boardsize invariant neural network in Polygames.
The widely cited (Marcus, 2018) pointed out that zero-learning frameworks are not boardsize invariant. In the present section, we point out that Polygames learns in a boardsize invariant manner. As usually in zero learning, our neural network has two heads: one for the policy and one for the value. The one for the policy is fully convolutional (Section 2.1.1), and therefore it works independently of the input size, i.e. independently of the board size. The value part, however, does not have this property if it is fully connected. We therefore use global pooling as in (Wu, 2019). Global pooling replaces each channel c, of shape possibly
Neuroplasticity
Several modifications are almost neutral when initialized close to zero:
addition of residual blocks (i.e. switching from 12 blocks of 3 convolutional layers to 13 or 14 blocks of 3 convolutional layers); addition of new channels; extension of the kernel size (from
Polygames provides a script “convert” that makes such a growth of the neural network easy. Training can be resumed after such extensions of the neural architectures; we can train, then grow (while preserving the quality of the model as it remains almost equal to the previous model as new weights are close to 0), then resume the training with more degrees of freedom.
Tournament mode
In order to fight catastrophic forgetting (McCloskey and Cohen, 1989) or the red queen effect (oscillations of performance (Johansson, 2011)), we add a tournament mode, as follows. Each completed game is used for evaluating the ELO rank of players. Each client, when it must start a new simulated game, randomly draws a checkpoint in the archive. With
Other features
Checkpoints
We provide checkpoints2
Heuristically, we consider that an example, in the replay-buffer, should never be seen more than 8 times. When the clients are not fast enough for filling the replay buffer, for example because of preemption of clients or slow game, we artificially add delays in the learning when an example is seen more than 8 times.
Easy addition of games
Adding a new game can be made by writing a new class that inherits from State and overrides a few methods (see the implementation of Connect Four3
Polygames can handle stochastic games (see e.g. our bot “randototoro” on LittleGolem, playing the game of Einstein Würfelt Nicht), which is not that common in existing frameworks but not conceptually hard: we just need random nodes, instead of only minimization and maximization nodes. The adaptation of MCTS to partial observation is more tricky: we cannot simulate the underlying dynamics from a given state, because players have only a partial observation. Actually, (Buffet et al., 2012) has presented a class of partially observable games that can be converted to an equivalent stochastic game, and therefore can be handled by MCTS (and, equivalently, by Zero learning as shown by Polygames): all games in which the visible information is the same for all players. This class includes a few two-player games: for example Chinese Dark Chess: there is hidden information, but the observable part is the same for both players. It also includes all one-player games, in particular Minesweeper and Mastermind. In those games, the undecidability results such as (Auger and Teytaud, 2012) do not apply.
For self-containedness, let us explain the reasoning in (Buffet et al., 2012). We do not know the state, so instead of a state, s is the history of observations: we must make a decision in a given s. This concept of history of observations is correctly defined because, by assumption, the history of observations is the same for all players. Given an history of observations s (including actions) and an action a, we can solve the transition by simulating Consider Randomly draw If results are not consistent with the observation in s, go back to step 2.
This is fully developed in (Buffet et al., 2012) in the case of MCTS and implemented inside Polygames for Minesweeper and Mastermind. In the case of Minesweeper, we implemented the more sophisticated method from (Buffet et al., 2012), faster than the rejection sampling.
Success stories
Results here are obtained using (i) trainings stabilized by the tournament mode (Section 2.3), (ii) architectures growing in order to double the number or parameters each time the ELO rank stagnates (Section 2.2) (iii) boardsize-invariant architectures (Section 2.1), with top performance typically obtained after 9 days with 500 GPUs (usually playing at a very strong level after 3 days).

Game of Hex
According to (Bonnet et al., 2016), “Since its independent inventions in 1942 and 1948 by the poet and mathematician Piet Hein and the economist and mathematician John Nash, the game of Hex has acquired a special spot in the heart of abstract game aficionados. Its purity and depth has lead Jack van Rijswijck to conclude his PhD thesis with the following hyperbole (van Rijswijck, 2006): ‘Hex has a Platonic existence, independent of human thought. If ever we find an extraterrestrial civilization at all, they will know Hex, without any doubt.’” The rules are simple. Black and white fill an empty cell, in turn (Fig. 3). Black wins if it connects North and South, White wins if it connects West and East. The game is made more fair by a pie rule: at the second move, the second player can decide to swap colors. The game is hard because, as a connection game, its reward is based on a global criterion (no local criterion to sum). Figure 3 shows the first win against a top level human.

Results of our bot Mootwo since its game in Fig. 3.
At TAAI 2019, Polygames was ranked first in Othello10 (3 programs), Breakthrough (4 programs) and Connect6 (2 programs).4
WZebra and Ltbel (Othello8),
the winner of TAAI 2018 at Connect6, namely Kavalan,

Game won by Polygames against Kavalan: move 26 (left), which made the situation excellent for Polygames, and final position (right). Polygames won 15 games out of 15.

Left: The game of Havannah and the three different ways for winning. Middle: a win by Polygames against Tony, Elo rank 2167. Right: a win against Mirko Rahn, Elo rank 2133, winner of many Havannah tournaments. Sources: wikipedia (left) and LittleGolem (middle and right).
Havannah was invented specifically for being hard for computers (Fig. 6). It follows the game play of Hex, also with hexagonal cells but now on an hexagonal board, and winning conditions are more diverse:
Connecting two of the six corners wins (15 possibilities); Connecting three of the six sides wins (20 possibilities); Realizing a loop (even if it does not contain empty cells) wins.
According to (Lorentz, 2015), “The state of Havannah programming is still in its early stages. Though the top programs do play at a reasonable level, about the level of somebody who has played the game for 6 months or a year, they still play with a very unnatural style, and often win their games by virtue of tactical shots missed by the human opponent. Our feeling is that Havannah programs cannot be expected to play at an elite level until they learn to play a more natural, human-like game.”
Draws are theoretically possible but very rare. The game is also played with a pie rule. Some decent players have been defeated by computers, but never the best humans until Polygames. On Littlegolem, an early version won 3 games out of 4 against Mirko Rahn (Elo rank 2133), and 2 games out of 2 against tony (Elo rank 2167), who belong to the top four players on this website (see Fig. 6 middle and right for examples of games). Then, a new version was trained and won two games out of two against Eobllor (Elo rank 2415): both games were quite short, with excellent opening moves by Polygames (Fig. 7). This bot, our strongest version, was using a single Quadro GP100, 40 minutes per move with a shared GPU (used simultaneously

The two games won against Eobllor (Elo rank 2415 before starting to play against Polygames), who dominates most competitions in Havannah. Eobllor won most games against the previous version of Polygames (Fig. 6), so we retrained it and got the present results. We note that the 3 fastest games lost by Eobllor, who rarely loses a game, are these 2 games and the game he lost against the previous version.
We propose a state-of-the-art framework, called Polygames, that can play to various games. It is based on zero learning, with innovations detailed in Section 2, and had success stories detailed in Section 3. Polygames contains new architectures, allows architecture search thanks to neutral components addition and is stabilized by a tournament mode and an overfitting fighting method. It was widely tested in the TAAI competition and on LittleGolem (www.littlegolem.net). The source code is publicly available under an open-source license. Our plan includes ablation studies and experimenting some of the innovations – e.g. the one-player case, including partial observability as in Section 2.4.4.
Footnotes
Acknowledgements
We are grateful to Wikipedia and LittleGolem as image sources and platforms for running experiments. We are grateful to human players who accepted to play with our bot, in particular Eobllor who beat successive preliminary versions of our bot. This work was supported in part by MOST under contracts 109-2634-F-259-001-through Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan. Tristan Cazenave is supported by the PRAIRIE institute.
