Abstract
AlphaZero has achieved superhuman performance in Go, chess, and shogi with a general reinforcement learning (RL) algorithm. This achievement is remarkable because AlphaZero does not rely on any training dataset of strong players. However, AlphaZero-style training requires substantial computational resources. Gumbel AlphaZero, a recently introduced more efficient version of AlphaZero, reduces the computational cost of AlphaZero training. The goal of this study is to further improve the playing strength of Gumbel AlphaZero under a limited amount of computational resources. We focus on the diversity in training games, inspired by procedural generation and domain randomization in RL studies, and propose a novel method, initial state diversification. This method diversifies the initial states of a self-play game to encourage the RL agent to understand the game in a more general manner through diverse experiences. For example, in shogi, the initial state of each self-play game is diversified by rearranging the pieces under realistic domain constraints. Experiments demonstrated that training with initial state diversification improves the playing strength of Gumbel AlphaZero in shogi, within the same computational budget for training.
Introduction
AlphaZero (Silver et al., 2018) achieved superhuman performance in Go, chess, and shogi through self-play with a general reinforcement learning (RL) algorithm. This achievement is remarkable because the algorithm does not rely on any training dataset of strong players. However, AlphaZero-style training requires substantial computational resources, especially for conducting millions of self-play games. For example, the AlphaZero experiments performed by Silver et al. (2018) used more than 5000 TPUs, and the training of ELF OpenGo by Tian et al. (2019) used 2000 GPUs to generate 20 million self-play games.
Gumbel AlphaZero (Danihelka et al., 2022), an efficient successor to AlphaZero, addresses this requirement. It can reliably learn from games produced at much less computational cost. While AlphaZero typically runs Monte Carlo tree search (MCTS) with 400 or 800 simulations per move, Gumbel AlphaZero successfully learns with 2–32 simulations per move. This reduces the computational cost of self-play training. Our research goal is to further improve the playing strength of Gumbel AlphaZero within a limited amount of computational resources.
This paper focuses on how to explore candidate moves during self-play when Gumbel AlphaZero uses a small number of simulations per move. Sufficient exploration, or diversity of experience, is widely recognized as crucial in RL (Sutton and Barto, 2018). To ensure the diversity of game records, the AlphaZero RL agent selects an explorative move more frequently in the first 30 moves of each self-play game. The agent selects explorative moves with probability proportional to the number of simulations associated with each move (Silver et al., 2017, 2018; Danihelka et al., 2022). Intuitively, exploration with this distribution loses granularity as the number of simulations decreases, especially when that number is smaller than the number of legal moves, because a move that receives no simulation can never be selected.
For exploration that is independent of the number of simulations, we propose a novel method, initial state diversification. This method carefully diversifies the initial state of each game in self-play training to encourage the RL agent to learn diverse and general tactics of the game. As a proof of concept in shogi, we introduce two schemes for stochastic rearrangement of the pieces. This approach is inspired by procedural generation (Justesen et al., 2018; Juliani et al., 2019; Cobbe et al., 2020) and domain randomization (Tobin et al., 2017) in other RL domains.
Through experiments in shogi, we demonstrate that training with initial state diversification improves the strength of agents trained by Gumbel AlphaZero. Note that improvements in the agents’ strength within the same computational budget imply improvements in the learning efficiency as well.
Specifically, our contributions are as follows:
We introduce initial state diversification, a novel method for diversifying the initial state of each game in self-play training and thereby encouraging the agent to experience various states that are useful for obtaining a more general understanding of the game (Section 3).
For a specific instantiation of initial state diversification in shogi, we introduce RandomStart2, a simple piece rearrangement method using two random moves, and Shogi816K, a more complex piece randomization method inspired by Chess960 (also known as Fischer Random Chess) (Section 3.2). Experiments demonstrated that the combination of RandomStart2 and Shogi816K achieves the highest playing performance among our candidates (Section 4.3).
We empirically show that training with initial state diversification improves the playing strength of Gumbel AlphaZero within the same computational budget for self-play training (Section 4.3).
Background
Our method of diversifying the initial state of each game is built on top of AlphaZero and Gumbel AlphaZero, so we begin by briefly introducing them. We use RL terminology in addition to terms used in the chess and shogi programming literature. We refer to the position of a game as the state and the move of a player as an action.
AlphaZero
AlphaZero (Silver et al., 2018), a model-based RL algorithm, has outperformed state-of-the-art computer programs in chess, shogi, and Go. Its neural network is trained using tens of millions of self-play games. The network, called a policy value network, has 19 residual blocks (He et al., 2016) followed by a policy head and a value head. The policy head outputs logits over actions, which are converted to a policy probability distribution with the softmax function. The value head outputs a scalar value as a prediction of the game outcome. During the self-play games, the RL agent runs many MCTS simulations (e.g., 800 in the original AlphaZero report) for each move. After a game is finished, the game record is stored in the replay buffer. Data in the replay buffer are sampled to train the neural network.
Gumbel AlphaZero
Gumbel AlphaZero (Danihelka et al., 2022) is an improved version of AlphaZero. Danihelka et al. introduced both Gumbel MuZero and Gumbel AlphaZero, built on top of MuZero and AlphaZero, respectively; we focus on Gumbel AlphaZero because it was chosen for their experiments on chess (Danihelka et al., 2022, Section 7.2). Gumbel AlphaZero can reliably learn from games with few MCTS simulations (even two simulations per move according to the report), whereas the original AlphaZero reportedly fails to learn with small MCTS simulation budgets. This means that Gumbel AlphaZero enables researchers to train agents with relatively few computational resources.
Gumbel AlphaZero follows the original AlphaZero in that it also trains the policy value network through self-play. However, its move selection algorithm is redesigned to guarantee a policy improvement: a move selected by the agent after the game tree search should be at least as good as a move predicted by the policy network alone, even when the simulation budget is limited. The original AlphaZero does not provide this guarantee, especially when the number of simulations is small. The main difference from the original AlphaZero is in the search at the root node: the stochasticity of a candidate move is governed by the Gumbel-Top-k trick, which samples moves without replacement, instead of by the visit count, to ensure a policy improvement. To this end, Gumbel AlphaZero introduced a two-fold search process, sequential halving (Karnin et al., 2013) at the root and MCTS among the descendants, whereas AlphaZero solely uses MCTS with the PUCT (predictor upper confidence for trees) algorithm (Silver et al., 2016). Sequential halving is a multi-armed bandit algorithm for simple regret minimization.
These modifications enable Gumbel AlphaZero to significantly reduce the learning time, especially with limited computational resources. Using Gumbel AlphaZero enabled us to conduct experiments in shogi with relatively few computational resources.
Exploration for quality-diverse games
Intuitively, sufficient variation of self-play games is needed for successful training, which is difficult for greedy (consequently deterministic) agents. Various methods for exploring candidate moves have been used for AlphaZero and Gumbel AlphaZero self-play.
AlphaZero introduced stochasticity in which MCTS agents follow a probability distribution induced by the visit count at the root node. This is reasonable because MCTS typically assigns more simulations to identify more effective moves. AlphaZero uses a visit-count based exploration strategy, as in AlphaGo Zero (Silver et al., 2017,2018), in which actions are selected in proportion to the root visit count for the first 30 moves of each game. In addition, for further exploration, Dirichlet noise is added to the output of the policy network at the root node (Appendix B.3).
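The visit-count based exploration described above can be illustrated with a minimal sketch. It is not the AlphaZero implementation; the function names and the toy visit counts are hypothetical and chosen only to show why the distribution becomes coarse when there are fewer simulations than legal moves.

```python
import random


def sample_move_by_visit_count(visit_counts, rng):
    """Sample a move with probability proportional to its root visit count."""
    moves = list(visit_counts)
    weights = [visit_counts[move] for move in moves]
    return rng.choices(moves, weights=weights, k=1)[0]


def num_selectable_moves(visit_counts):
    """Count the moves that can be selected at all (visit count > 0)."""
    return sum(1 for count in visit_counts.values() if count > 0)


# Hypothetical root node: 30 legal moves but only 4 simulations.
legal_moves = [f"move_{i}" for i in range(30)]
visit_counts = {move: 0 for move in legal_moves}
visit_counts.update({"move_0": 2, "move_1": 1, "move_2": 1})

rng = random.Random(0)
chosen = sample_move_by_visit_count(visit_counts, rng)
selectable = num_selectable_moves(visit_counts)
print(f"{selectable} of {len(legal_moves)} moves can be selected; sampled {chosen}")
```

In this toy setting, 27 of the 30 legal moves have zero probability, so the exploration can only choose among the moves that the search happened to visit.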
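In contrast, Gumbel AlphaZero introduced stochastic move selection, which is equivalent to a softmax policy (i.e., Boltzmann policy) if tree search is omitted. The Gumbel AlphaZero RL agent iteratively narrows down the candidate moves at the root node by sequential halving. At each halving iteration, the agent drops the less promising moves, i.e., moves with scores below the median, and performs tree search to refine the score estimates of the remaining moves. Finally, the agent selects the move with the maximum score from the remaining candidates. Formally, following Danihelka et al. (2022), the move with the maximum score is defined as
$$A = \arg\max_{a} \left( g(a) + \mathrm{logits}(a) + \sigma(\hat{q}(a)) \right),$$
where the maximum is taken over the remaining candidate moves, $g(a)$ is a Gumbel variable, $\mathrm{logits}(a)$ is the logit output by the policy head, $\hat{q}(a)$ is the action value estimated by the search, and $\sigma$ is a monotonically increasing transformation. In practice, we need randomly sampled Gumbel variables, each of which is obtained as
$$g = -\log(-\log u), \quad u \sim \mathrm{Uniform}(0, 1).$$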
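As a rough illustration of the Gumbel-based selection, the following sketch samples Gumbel variables by inverse transform sampling, draws candidate moves without replacement with the Gumbel-Top-k trick, and picks the move with the maximum perturbed score. The helper names, the toy logits, and the `sigma_q` values (standing in for transformed value estimates that a real search would produce) are assumptions made for illustration; this is not the code of Danihelka et al. (2022).

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel variable as g = -log(-log(u)), u ~ Uniform(0, 1)."""
    u = max(rng.random(), 1e-12)  # guard against u == 0.0, where log(u) is undefined
    return -math.log(-math.log(u))


def gumbel_top_k(logits, k, rng):
    """Sample k distinct moves without replacement (Gumbel-Top-k trick)."""
    gumbels = {move: sample_gumbel(rng) for move in logits}
    ranked = sorted(logits, key=lambda m: gumbels[m] + logits[m], reverse=True)
    return ranked[:k], gumbels


def select_move(candidates, gumbels, logits, sigma_q):
    """Return the candidate maximizing g(a) + logits(a) + sigma(q_hat(a))."""
    return max(candidates, key=lambda m: gumbels[m] + logits[m] + sigma_q.get(m, 0.0))


# Hypothetical policy logits for four moves.
logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
rng = random.Random(0)
candidates, gumbels = gumbel_top_k(logits, k=2, rng=rng)

# Hypothetical transformed value estimates; a real agent obtains them from search.
sigma_q = {candidates[0]: 0.1, candidates[1]: 0.4}
best = select_move(candidates, gumbels, logits, sigma_q)

# Without search (no value term), the selected move is the top Gumbel-perturbed move,
# i.e., a sample from the softmax policy.
no_search = select_move(candidates, gumbels, logits, {})
```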
While these two exploration methods are completely different, the visit-count based exploration used in AlphaZero reportedly worked well when combined with Gumbel AlphaZero in 9 × 9 Go experiments (Danihelka et al., 2022, Appendix F). However, their effectiveness was not evaluated for chess variants. Our experiments demonstrated that the proposed method works well for shogi, which is one of the chess variants.
Our initial state diversification method produces diverse game records that are effective for RL training. The main idea is for the initial state to be sampled from a well-designed set of diverse states. This enables the agent to experience a wider set of states and to obtain a more general understanding of the game. For a concrete instantiation in shogi, we used the proposed method to rearrange the pieces to generate a set of diverse initial positions.
After first giving an overview of our method (Section 3.1), we present the specific implementation in shogi (Section 3.2).
Rationale behind initial state diversification
In chess variants including shogi, there is a single initial state of the game, as shown in Fig. 1(a), whereas card games such as poker have various initial states. Additionally, typical AlphaZero-style training is performed through play against oneself rather than against different players. We hypothesize that this combination makes self-play games in chess variants less diverse than those in games with various initial states. This hypothesis is partially based on the empirical observation in shogi that AlphaZero-style agents often fail to find opening strategies frequently used by human professionals. The improvement in the diversity of opening strategies can also be observed in the frequency heatmaps for the kings and rooks (Appendix C). Motivated by these observations, we propose diversifying the initial states of the games.
Our method requires only small changes to AlphaZero-style self-play training as it simply diversifies the initial state at the beginning of a game. In self-play training, each game starts with an initial state uniformly sampled from a set of diverse initial states. Except for the variety of starting states, self-play games are played following the original game rules. The rest of the training procedure is the same as in (Gumbel) AlphaZero.

Figure 1: Initial state diversification for shogi: (a) the original initial state, (b) RandomStart2, and (c) Shogi816K. Highlighted squares denote rearranged pieces.
Algorithm 1 provides an overview of our proposed training procedure. The modifications due to initial state diversification are italicized.

Algorithm 1: Self-play with initial state diversification
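The training procedure with initial state diversification can be sketched as follows. This is a minimal sketch under strong assumptions: a toy counting game stands in for shogi, a random policy stands in for the Gumbel AlphaZero search, and all function names are hypothetical. The only point it illustrates is that each game starts from a state sampled uniformly from a set of initial states, while the rest of the loop is unchanged.

```python
import random


def play_one_game(initial_states, legal_moves, apply_move, is_terminal, choose_move, rng):
    """Play one self-play game whose start is sampled uniformly from initial_states."""
    state = rng.choice(initial_states)  # the only change from standard self-play
    record = [state]
    while not is_terminal(state):
        move = choose_move(state, legal_moves(state), rng)
        state = apply_move(state, move)
        record.append(state)
    return record


# Toy stand-in for shogi: a counter that ends once it reaches 10.
def legal_moves(state):
    return [1, 2]


def apply_move(state, move):
    return state + move


def is_terminal(state):
    return state >= 10


def random_policy(state, moves, rng):
    """Placeholder for the search-based move selection of the agent."""
    return rng.choice(moves)


rng = random.Random(0)
initial_states = [0, 1, 2, 3]  # hypothetical set of diversified initial states
replay_buffer = [
    play_one_game(initial_states, legal_moves, apply_move, is_terminal, random_policy, rng)
    for _ in range(5)
]
```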
For a concrete instantiation in shogi, we devised two diversification schemes, namely, RandomStart2 and Shogi816K (Fig. 1).
RandomStart2
RandomStart2 (Fig. 1(b)) provides simple and intuitive diversification. The initial state set consists of the states reached after each player makes one legal move from the original initial position. In shogi, each player has 30 legal first moves, and the second player’s legal moves do not depend on the first player’s move. The set therefore contains 30 × 30 = 900 unique initial states. We use two moves rather than four or more because, in our observation, four or more random moves often produced unbalanced and/or urgent states involving checks or threats (Appendix A). Compared with Shogi816K, RandomStart2 is conservative in the sense that piece locations are closer to those in the original initial state.
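The count of 900 states can be checked with a short worked example. The per-piece breakdown below is our own tally of the legal moves in the standard shogi initial position (it is not taken from the paper), and the sampler represents a RandomStart2 state only as a pair of move indices, which is a simplification made for illustration.

```python
import random

# Legal first moves per piece type in the standard initial position (own tally):
# knights and the bishop are blocked by their own pawns.
LEGAL_FIRST_MOVES = {
    "pawn": 9,
    "lance": 2,
    "knight": 0,
    "silver": 4,
    "gold": 6,
    "king": 3,
    "rook": 6,
    "bishop": 0,
}

moves_per_player = sum(LEGAL_FIRST_MOVES.values())
num_initial_states = moves_per_player * moves_per_player


def sample_random_start2(rng, moves_per_player=moves_per_player):
    """Sample one RandomStart2 state as (first-player move index, second-player move index)."""
    return rng.randrange(moves_per_player), rng.randrange(moves_per_player)


rng = random.Random(0)
first_index, second_index = sample_random_start2(rng)
print(moves_per_player, num_initial_states)
```

Because the two indices are drawn independently and uniformly, every one of the 900 states is equally likely.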
Shogi816K
Shogi816K (Fig. 1(c)) was inspired by Chess960 (also known as Fischer Random Chess), a chess variant with 960 unique initial positions (FIDE, 2023). Chess960 is an interesting variation of chess, originally invented for human players interested in fresh opening positions; it was not originally intended for use in machine learning. [2]
[2] To the best of our knowledge, no published work has examined whether training with Chess960 helps an RL agent improve its strength in standard chess, although Leela Chess Zero (Lc0), an AlphaZero-style chess engine, supports training with Chess960 game records.
Shogi816K generates each initial state as follows:
Pawns remain untouched.
The “black” rook and bishop are randomly placed in the 8th rank (the 8th row in Fig. 1(c)). Note that the top (bottom) row is referred to as the 1st (9th) rank in shogi, that all shogi pieces have the same color, and that the player making the first move is referred to as the “black player” and the opponent is referred to as the “white player” by convention.
The remaining black pieces (lances, knights, silver generals, gold generals, and king) are randomly placed on the 9th rank.
The white pieces (the pieces in the 1st–3rd ranks from the top in Fig. 1(c)) are arranged in a rotation-symmetric configuration with respect to the black pieces (the pieces in the 7th–9th ranks). [3]
[3] In addition to the rotation-symmetric configuration, a line-symmetric configuration is another interesting option if one wants to optimize the agent for some openings such as “static rook vs ranging rook” (both the black rook and the white rook are placed on the same horizontal side). In our small preliminary experiments, the line-symmetric configuration showed better playing strength for “static rook vs ranging rook” openings at the expense of sub-optimal performance for the other openings.
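The Shogi816K steps listed above can be sketched as a sampler. The board encoding (three rows per player, single-letter piece codes, lowercase for white) and the function names are assumptions made for this sketch. Counting the arrangements exactly as the steps are listed here gives 72 × 22,680 = 1,632,960, which is twice the 816,480 suggested by the name Shogi816K; the sketch assumes, without confirmation from this section, that the factor of two comes from a constraint or a mirror-image equivalence that is not reproduced here.

```python
import random
from math import factorial

FILES = 9


def sample_black_setup(rng):
    """Sample the black side as rows [rank 7, rank 8, rank 9], following the listed steps."""
    rank7 = ["P"] * FILES  # pawns remain untouched
    rank8 = [None] * FILES
    rook_file, bishop_file = rng.sample(range(FILES), 2)  # two distinct squares
    rank8[rook_file] = "R"
    rank8[bishop_file] = "B"
    rank9 = ["L", "L", "N", "N", "S", "S", "G", "G", "K"]
    rng.shuffle(rank9)  # remaining pieces placed randomly on the 9th rank
    return [rank7, rank8, rank9]


def rotate_for_white(black_rows):
    """Rotate the black setup by 180 degrees to obtain white rows [rank 1, rank 2, rank 3]."""
    return [
        [piece.lower() if piece else None for piece in reversed(row)]
        for row in reversed(black_rows)
    ]


def count_arrangements_as_listed():
    """Count black arrangements implied by the listed steps, with no extra constraint."""
    rank8_choices = FILES * (FILES - 1)      # ordered rook and bishop squares: 72
    rank9_choices = factorial(9) // 2 ** 4   # 9! / (2!)^4 = 22680
    return rank8_choices * rank9_choices


rng = random.Random(0)
black = sample_black_setup(rng)
white = rotate_for_white(black)
total = count_arrangements_as_listed()
```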
Compared with RandomStart2, the opening states produced by Shogi816K are diversified aggressively in the sense that piece locations differ greatly from the original ones. On the one hand, this difference may negatively affect the learning of opening strategies. On the other hand, it may contribute to a general understanding of middlegame and/or endgame tactics due to a wider variety of experiences. There is thus a trade-off between the diversity of initial states and the similarity to the original state.
To utilize the benefits of these two diversification schemes, we propose combining them: pre-training with Shogi816K and fine-tuning with RandomStart2. In AlphaZero-style RL, an agent’s neural network gradually improves with the number, typically millions, of self-play games. Therefore, it is intuitive to first train the agent by using Shogi816K to give it a general understanding of the game and then by using RandomStart2 to enable it to better establish an opening strategy starting from the original state. This approach is inspired by the pre-training and fine-tuning approach in computer vision research (Krizhevsky et al., 2012).
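The combined schedule can be written as a simple rule on the number of games played so far. The function name is hypothetical; the switching point of six million games follows the experiments reported in this paper, and treating the switch as a hard threshold on a zero-based game index is an assumption of this sketch.

```python
def diversification_scheme(games_played, switch_at=6_000_000):
    """Return the scheme used for the next self-play game.

    Pre-train with Shogi816K, then fine-tune with RandomStart2.
    """
    return "Shogi816K" if games_played < switch_at else "RandomStart2"


# Example: scheme used at a few points during a 10-million-game run.
checkpoints = [0, 5_999_999, 6_000_000, 9_999_999]
schedule = [diversification_scheme(games) for games in checkpoints]
```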
Table 1: Six variations of Gumbel AlphaZero used in comparison experiments
We experimentally evaluated how the proposed diversification improves an agent’s playing strength by comparing six variations of Gumbel AlphaZero training in shogi. They consist of a baseline (Base), the two diversification schemes (RandomStart2 and Shogi816K), their combination (Shogi816K+RandomStart2, the proposed method), and two alternatives with visit-count exploration (Exploration (PUCT) and Exploration (Sequential halving)), as summarized in Table 1.
Training procedure
We trained each of the six agents through self-play in shogi with the Gumbel AlphaZero algorithm. Each agent was trained using 10 million self-play games. The number of simulations per move (n) was set to 64, and the maximum number of moves considered at the root node (m) was set to 16, as in Gumbel AlphaZero. Whereas the original AlphaZero experiment used 24 million self-play games and 800 simulations per move, we used fewer games and simulations due to limited computational resources. The choice of 10 million games is also supported by the original AlphaZero experiment, which demonstrated that training using 10 million games resulted in performance close to that of the then state-of-the-art computer shogi program.
Since the training process was still time-consuming with our limited computational resources, we introduced domain-specific enhancements to the input features to speed up the training (Appendix B). With our implementation, it took about two weeks to train the six agents twice each on a maximum of 20 GPUs (Appendix F).
Evaluation procedure
We evaluated the playing strength of the agents against that of a reference player, Suisho. [4]
[4] Suisho5-YaneuraOu-v7.6.1.
The evaluation games were played under the following conditions:
The agents selected an action using sequential halving. The number of simulations per move was set to 800, following the 9 × 9 Go experiments in Gumbel AlphaZero (Danihelka et al., 2022, Section 7.1).
Diversification and exploration were disabled.
Each evaluation game started from an opening position randomly sampled from a dataset of games played by professional human players (the dataset is described in Appendix E). For each game, one game record was sampled from the dataset, and the first 30 moves were played following the record. [6]
[6] The hyperparameter “30 moves” was chosen because AlphaZero selects explorative moves for the first 30 moves of each game. In chess as well, human opening positions and TCEC (Top Chess Engine Championship) opening positions were used in the AlphaZero experiments (Silver et al., 2018, “Match conditions” section in the supplementary material).
For fairness and reproducibility, parallel MCTS search (e.g., parallel MCTS with virtual loss (Segal, 2011) used in AlphaGo (Silver et al., 2016)) was disabled in the agent. The reference opponent also did not use parallel search.
Effect of training on Elo rating
We used the Elo rating system, a method for calculating the relative skill levels of players in zero-sum games, as the performance metric. The improvement in playing strength with the number of training games for the six agents is plotted in Fig. 2. The playing strength (y-axis) is the Elo rating relative to the reference player. [8]
[8] The winning probability corresponding to an Elo rating difference $\Delta$ is given by $P = 1 / (1 + 10^{-\Delta/400})$.
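The relation between an Elo rating difference and the winning probability can be made concrete with a small helper. This sketch uses the standard logistic Elo formula with a scale of 400; the function names are hypothetical, and the paper does not state how draws were counted, so the inverse mapping simply takes a win rate strictly between 0 and 1.

```python
import math


def win_probability(elo_diff):
    """Expected score for a player rated elo_diff points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_difference(win_rate):
    """Invert the Elo formula: rating difference implied by a win rate in (0, 1)."""
    return -400.0 * math.log10(1.0 / win_rate - 1.0)


# A rating difference of 0 means an even match; 400 points means 10:1 odds.
even = win_probability(0.0)
strong = win_probability(400.0)
```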
We followed the configuration in Gumbel AlphaZero’s experiments for the 9 × 9 Go (Danihelka et al., 2022, Section 7.1), where agents used 800 simulations per move and the reference player, Pachi, used 10K simulations.
As the blue curve indicates, the proposed method (Shogi816K+RandomStart2) achieved the highest average performance among the six agents, that is, higher than the ablated agents (Shogi816K and RandomStart2) and the others (Exploration (PUCT), Exploration (Sequential halving), and Base), after the same 10 million training games (Fig. 2). It performed better than Shogi816K alone (more aggressive diversification) and RandomStart2 alone (more conservative diversification). Note that the ratings generally improved more rapidly immediately after the six millionth game, when the diversification method was switched from Shogi816K to RandomStart2. This indicates that our instantiation of the pre-training and fine-tuning concept works well in shogi. The proposed method is thus an efficient training method in the sense that it improves the playing strength within the same computational budget.

Figure 2: Relative Elo rating for shogi against number of self-play training games. The Elo rating is computed relative to the reference player (Suisho with 20K search nodes per move).
None of the other five agents reached the performance of the proposed method. Among the four agents trained with a single diversification or exploration method, RandomStart2 performed as well as or better than Exploration (PUCT), in which the agent took explorative actions proportional to the PUCT visit count. This demonstrates the effectiveness of initial state diversification. Neither RandomStart2 nor Shogi816K+RandomStart2 used explorative actions proportional to the visit count. This suggests that initial state diversification can be used as an alternative to explorative actions.
Moreover, exploration with sequential halving and visit count did not work well, while exploration with PUCT and visit count improved agent performance compared with the performance of the base agent to some extent. We hypothesize that this is because sequential halving gives almost equal visit counts to both the best move and the second-best move. This means the best move and the second-best move would be selected with almost equal probabilities. This can degrade the accuracy of the game outcome estimation, especially in states with only one effective move (e.g., recapturing major pieces).
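The hypothesis about nearly equal visit counts can be illustrated with an idealized sequential halving schedule. The sketch assumes that each phase receives an equal share of the budget, that the share is split evenly among the remaining moves, and that the ranking of moves does not change between phases; actual implementations may allocate leftover simulations differently. The function name is hypothetical, and n = 64 and m = 16 are the values used in our training.

```python
import math


def sequential_halving_visits(n, m):
    """Return visit counts per move rank (best first) under an idealized schedule."""
    phases = math.ceil(math.log2(m))
    visits = [0] * m
    remaining = m
    for _ in range(phases):
        per_move = max(1, n // (phases * remaining))
        for rank in range(remaining):
            visits[rank] += per_move
        remaining = max(2, remaining // 2)  # keep the better half of the moves
    return visits


visits = sequential_halving_visits(n=64, m=16)
total = sum(visits)
# Selection probabilities if moves were sampled in proportion to these visit counts.
probabilities = [count / total for count in visits]
```

Under these assumptions, the best and second-best moves end up with the same visit count, so sampling in proportion to visit counts cannot distinguish between them.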
We also evaluated the playing strength for various opening strategies. It is important for shogi engines to be robust enough to play various opening strategies well. If an engine has a weakness in a certain opening, strong opponents often try to exploit it. In addition, since most human users of shogi engines are interested in various openings, it is preferable for the engine to provide analyses of consistent quality.
We evaluated playing performance separately for nine opening strategies popular among human players. Figure 3 plots the win rates against the reference opponent for the nine strategies after training using 10 million self-play games. The agents were evaluated through 2000 games against the reference opponent for each opening strategy. The nine openings are the nine most frequent strategies in the dataset (Appendix E), each of which appeared more than 2000 times in the records. The openings from (1) to (4) are categorized as “static rook” and the openings from (5) to (9) are categorized as “ranging rook.”
Overall, Fig. 3 demonstrates that the proposed method (diversification with Shogi816K+RandomStart2) (blue) achieved the highest performance for all nine openings among the six agents. This indicates that the proposed method improves the robustness of agents in the sense that the resulting agent has no major weakness in any of the nine opening strategies.

Figure 3: Win rate against reference player for nine shogi openings. Error bars denote standard errors from two independent training runs with different random seeds. The proposed method (diversification with Shogi816K+RandomStart2) (blue) consistently performed the best in all nine opening strategies.
In more detail, Shogi816K showed relatively good performance for (7) third file rook and (9) double ranging rook, both of which are categorized as “ranging rook” openings (a strategy of moving the rook to the left (for the black player) or to the right (for the white player) in the openings). In contrast, RandomStart2 performed relatively well for (1) fortress, (2) double wing attack, and (4) bishop exchange, all of which are categorized as “static rook” openings (a strategy of placing the rook on the starting square). Shogi816K+RandomStart2 benefited from the combination of pre-training with Shogi816K and fine-tuning with RandomStart2.
In addition, Fig. 3 demonstrates that performance against the reference opponent depended on the opening strategy. For example, all the agents tended to achieve higher win rates (approximately 70–85%) for (2) double wing attack opening and lower win rates (approximately 50–70%) for (7) third file rook opening. This indicates that some openings may be more difficult to learn than others, suggesting that there is room for further improvement in the diversification and exploration algorithms.
Accelerating AlphaZero training
Many studies have focused on accelerating the self-play training of AlphaZero. Promising ideas have been proposed, including improving the neural network architecture, enhancing the MCTS, using a variable MCTS budget, and setting learning targets. Some of the improvements are introduced below.
We share the research goals of these studies, although our approach differs. We expect that some of these methods may work well in combination with our method to yield further improved agents, depending on the game or the detailed conditions.
Diversity of data in RL
The diversification in initial states with the proposed method can be interpreted as domain randomization or procedural generation in RL or as data augmentation of initial states. Our method was inspired by the success of studies in these areas.
Limitations and future work
The concept of initial state diversification should be useful as well for training agents for other board games. However, only the initial states are diversified, so the proposed method is effective only for domains that need diversification, i.e., those having a limited diversity in initial states or opening variations. Chess variants including shogi are such domains because there is a single and fixed initial state, and the set of legal moves for each state is a very limited subset of all possible moves among all states. In contrast, while Go also has a single fixed initial state (an empty board), the legal moves are much less constrained. Moreover, the initial states of most card games are already diversified with randomly dealt cards.
This paper followed AlphaZero in that our method does not rely on resources from strong players such as game records or opening suites. This property matters when considering applications to other domains. For example, one can consider a method similar to Shogi816K for shogi variants, e.g., minishogi, whereas some domains, such as Go, are apparently incompatible with our method. However, if the goal is purely to make shogi engines stronger, making use of a dataset from strong players may be another option. For example, training from opening positions generated by strong computer players is an interesting choice. Training from opening positions sampled from games of professional players is another interesting configuration.
The design of Shogi816K incorporates domain knowledge for use in rearranging the pieces to keep the resulting states balanced between the two players. Considering these limitations, we are working to devise a more general strategy that can be more easily exported to other games.
Conclusion
We proposed initial state diversification to improve the training of AlphaZero-style agents. The proposed method is based on some properties of shogi, but is still free from strong players’ game records. We demonstrated that initial state diversification improves the efficiency of self-play training in the sense that it improves the playing strength of the agent within the same computational budget. Moreover, experiments showed that initial state diversification improves the robustness of the agent in terms of being able to play well given a variety of opening positions in shogi.
Initial state diversification is an effective alternative to explorative actions. Our experiments demonstrated that agents trained with initial state diversification did not require explorative actions proportional to the visit count. Our approach should thus be particularly valuable when training with a small number of simulations per move.
Statistics on RandomStart2 and Shogi816K
To see how balanced the diversified states are, we empirically evaluated sampled states for three methods: RandomStart2, RandomStart4 (two random moves for each player), and Shogi816K. Figure 4 illustrates the first player’s advantage for the three sets of initial states. The horizontal axis is the first player’s advantage in centipawns, as evaluated by Suisho (Sugimura, 2020) with 20K search nodes. To keep the sample size identical, the distribution plots for RandomStart4 and Shogi816K were obtained from 900 random samples each, while that for RandomStart2 was obtained from all 900 states. The states are more concentrated near the peak for Shogi816K and more widely distributed for the other two methods. Note that the mean value is slightly positive for all sets. This is consistent with the slight first-player advantage that is empirically known in shogi.
Table 2 shows the statistics for these three sets of initial states. The standard deviation of the first mover advantage for four random moves (219.9) is almost twice that for RandomStart2 (128.5). This is because the initial states obtained from four random moves often start with urgent situations such as checks to the king or threats to major pieces. Considering these results, we decided to use RandomStart2 instead of four or more random moves in our experiments.
In summary, Shogi816K had the smallest standard deviation and the largest number of available initial states among the alternatives. This means that Shogi816K has a desirable property: it includes few states that are too advantageous or disadvantageous for either player.
Experimental details
Frequency heatmaps of kings and rooks during self-play training
To provide an intuitive understanding about the variation in opening strategies (by each method), we visualize how the agents play shogi in self-play training. In this section, we focus on two important types of pieces in shogi: kings and rooks. We observe how these pieces are moved by the agents in order to evaluate the diversity of the self-play game records.
Figures 6 and 7 show the heatmaps of the kings and rooks during self-play training. Each heatmap represents the 9 × 9 squares of the shogi board. The color of each square shows how frequently a king or rook was on that square just after the 30th move of each game. We focus on the kings and rooks because these pieces are important: kings are the most valuable pieces and are closely related to opening strategies, and rooks are the strongest attackers in shogi and play a key role in the offense. We collected the positions just after the 30th move because the original AlphaZero selects explorative moves in the first 30 moves of each game, so these positions should represent the diversity of openings during training.
Figure 6 shows the heatmaps of the kings for both sides at 1, 5, and 10 million training games. Each row represents an agent, and each column represents the progress of self-play training. At the bottom right of the figure, we added the heatmap of the kings for human professionals (data collected from the dataset in Appendix E) for comparison. The king positions in Base and Exploration (PUCT) are limited; the kings tend to go to the same square (i.e., the upper left neighbor of the initial placement) in almost all self-play games, especially after 10 million training games. Exploration (Sequential halving) has a wider coverage of king squares, although the frequencies are concentrated around the original king square. In contrast, Shogi816K has the widest coverage of king squares among the six agents; its king frequencies are almost evenly distributed from the second to the eighth file. Like Shogi816K, the proposed method (Shogi816K+RandomStart2) has diverse king distributions at 1M and 5M training games. After the proposed method switched from Shogi816K to RandomStart2 at 6M training games, the proposed method and RandomStart2 show similar distributions at 10M games, and both are similar to the heatmap of professional players. These heatmaps indicate that the agents without initial state diversification prefer a limited set of opening strategies with respect to the arrangement of the kings.
Figure 7 shows the corresponding heatmaps for the rooks. The Base agent (neither diversification nor exploration) almost always used the rooks at the initial placement, especially as the training progressed. The Exploration (PUCT) agent found vertical rook movements, although it tended to stop using rooks horizontally. In contrast, the other methods, including the proposed method, move the rooks in both horizontal and vertical directions. In shogi, moving the rooks horizontally is a famous category of opening strategy called “ranging rook”. This difference may be one of the reasons why the proposed method performs better in the “ranging rook” category of openings than Exploration (PUCT) and Base.
In addition, we plot the entropy over the course of training for the kings (Fig. 8a) and rooks (Fig. 8b). The y-axis shows the entropy of the empirical probability distribution of the piece locations just after the 30th move of each game. The higher the entropy, the more diverse the piece locations. The entropy observed in the human professional dataset is shown as a horizontal line (pink) for comparison. These figures demonstrate that Base has the lowest entropy among the six agents, which is consistent with the heatmaps above. The entropy of Exploration (PUCT) gradually decreases as the training progresses. This may be one of the reasons for the saturation in its playing performance. In contrast, Shogi816K has the highest entropy throughout the training, higher than that of human professionals. The proposed method (Shogi816K+RandomStart2), RandomStart2, and Exploration (Sequential halving) maintain an entropy similar to that of human professionals. However, Exploration (Sequential halving) is significantly weaker than the other two agents in Fig. 2. Therefore, we emphasize that these entropies provide an interesting intuition, but the entropy itself is not suitable as an objective function to be maximized in training.
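The entropy of an empirical distribution over squares can be sketched as follows. The function name and the input format are hypothetical, and the base-2 logarithm (entropy in bits) is an assumption, since the text does not state which base is used.

```python
import math
from collections import Counter


def empirical_entropy(squares):
    """Shannon entropy (in bits) of the empirical distribution of squares.

    `squares` is an iterable of hashable square identifiers, for example
    (file, rank) pairs, with one entry per observed piece location.
    """
    counts = Counter(squares)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum_s p(s) * log2 p(s), where p(s) is the observed frequency.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this definition, a piece that always ends on the same square gives an entropy of 0, while a piece spread uniformly over all 81 squares would give the maximum of log2(81), roughly 6.34 bits.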
Role of opening suites for engine evaluation
In this section, we examine the role of opening suites for engine evaluation in shogi. First, we conducted an additional evaluation of our six agents with computer-generated opening suites instead of human opening suites (Appendix D.1). Second, we evaluated the agents without opening suites (Appendix D.2).
Data availability
We used a game dataset mostly composed of professional human players’ games, which is currently available on the Internet Archive.
We did not use any dataset for training. As in Gumbel AlphaZero, our method requires no dataset to train the agents.
Computational resources
We used a maximum of 20 GPUs as listed below in the experiments. A training run took about 1–2 days depending on the available GPUs.
In each training run, one GPU was used to optimize the network parameters with PyTorch, and a subset of the other GPUs generated self-play games in parallel. Each self-play worker asynchronously generated 20,000 games and fetched the latest network parameters before generating the next set of games.
1 x GeForce RTX 4090
2 x GeForce RTX 4070 Ti
4 x GeForce RTX 3090
2 x GeForce RTX 3080 Ti
12 x GeForce RTX 2080 Ti
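The asynchronous self-play scheme described in this section can be sketched as follows. This is a simplified illustration rather than the authors' implementation: `ParameterStore`, `self_play_worker`, and the `play_game` callback are hypothetical names, and the sketch uses plain Python threads' locking instead of PyTorch and multiple GPUs.

```python
import threading


class ParameterStore:
    """Holds the latest network parameters published by the trainer."""

    def __init__(self, params):
        self._lock = threading.Lock()
        self._params = params
        self._version = 0

    def publish(self, params):
        # Called by the trainer after an optimization step.
        with self._lock:
            self._params = params
            self._version += 1

    def fetch_latest(self):
        # Called by self-play workers before each batch of games.
        with self._lock:
            return self._version, self._params


def self_play_worker(store, play_game, games_per_fetch, num_rounds, out_games):
    """Generate games in batches, refreshing parameters once per batch."""
    for _ in range(num_rounds):
        version, params = store.fetch_latest()
        for _ in range(games_per_fetch):
            # Each game is tagged with the parameter version that produced it.
            out_games.append((version, play_game(params)))
```

With `games_per_fetch=20000`, each worker would refresh its parameters at the interval stated in the text; several workers could run this function in separate threads or processes while the trainer calls `publish`.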
Recent strong computer shogi programs
The most famous computer shogi competition is the annual World Computer Shogi Championship.
We used Suisho (Sugimura, 2020) as a reference player for all the evaluation experiments. Suisho is one of the state-of-the-art computer shogi programs and won the 2020 World Computer Shogi Online Championship. Suisho has an excellent evaluation function based on an efficiently updatable neural network (NNUE) (Nasu, 2018) and runs on YaneuraOu (Isozaki, 2024), which is powered by an alpha-beta search. Suisho is open-source software and is available online.
Dlshogi (Yamaoka, 2024) is a strong shogi program based on deep neural networks, which won the 2022 and 2023 World Computer Shogi Championships. Dlshogi uses a larger neural network, with 30 residual blocks and 384 channels, than the original AlphaZero, which uses 19 blocks and 256 channels. In addition, dlshogi adopts various enhancements (e.g., the network is trained not only with self-play games but also with games against other strong engines). We did not use dlshogi as a reference player in this paper because the latest version of dlshogi is only available through a commercial online service.
Shogi engine developed with initial state diversification
The first author of this paper combined the proposed method with several other improvements to develop a strong shogi engine named Gikou (Demura, 2024). Gikou placed 5th among 45 participants in the 2024 World Computer Shogi Championship.
