Abstract
Agent-based combat simulation is an important research method in military science and system simulation. The behaviour decision model plays a key role in the design of combat simulation agents. The behaviour tree (BT), originally designed for nonplayer characters (NPCs) in games, provides an efficient and concise method for constructing combat simulation agents and has been widely used. Because the rationality of BT construction directly affects the rationality of the agent’s decision logic, designing a reasonable BT is a crucial step. Designing the BT of an operational agent not only relies on rich tactical experience but also requires repeated adjustment and optimization of the BT according to operational deduction and simulation results. To avoid unreasonable BT designs caused by lack of experience and to eliminate the process of repeated debugging, a modelling method for combat simulation agents that combines reinforcement learning with the BT method (RL-BT) is proposed. This method makes the BT design process more automatic and lowers the experience required of combat simulation agent designers. Experiments show that RL-BT effectively integrates reinforcement learning and endows combat simulation agents with battlefield learning ability while they make independent decisions. An agent that uses RL-BT for decision modelling can continuously adjust and optimize its decision process through accumulated experience, and its performance in combat simulation is significantly better than that of an agent using the original BT.
Introduction
Combat modelling technology based on information systems is a quantitative, informatized method for researching modern warfare. It has been used in combat simulation, decision-making aids, command training and other military fields because of its controllable process, low cost, nondestructive nature and repeatable experiments [1, 2]. With deepening military reforms and profound changes in the form of war, the complexity of the war system has been increasing, which in turn promotes the innovation and development of combat modelling methods [3]. Agent-based modelling (ABM) is an effective modelling theory for complex adaptive systems. Since Ilachinski first proposed the theory of combat modelling based on multiple agents [4], ABM has been widely used in the field of combat modelling and simulation [5, 6].
Compared with a general agent system, a combat simulation agent has a complicated behaviour space and decision-making process, and the states of deduction and simulation are variable and unpredictable. More importantly, the rationality, decision-making efficiency and behaviour scheduling strategy of the combat simulation agent for different events in different states jointly determine the rationality and operating efficiency of the whole system [7]. Because of the advantages of BTs, many researchers have in recent years adopted the BT method to construct combat simulation agents. For example, based on an analysis of traditional air combat decision-making research, Dong et al. proposed a behaviour tree modelling and simulation method for air combat decision-making [8]. Combining BT with the systems engineering method, Liu et al. proposed a command and control modelling framework based on BT technology [9]. Fu introduced BTs into a combat simulation system to effectively describe the behaviour of computer-generated forces (CGFs) [10]. These studies indicate that constructing an agent decision model based on a behaviour tree makes combat simulation more efficient and flexible. This method also shows clear advantages over traditional methods in real-time monitoring and dynamic adjustment of simulation systems.
However, a well-designed combat simulation agent should not only be able to make independent decisions but also make more scientific decisions. It should be able to learn continuously in the environment and to adjust and optimize its behaviour strategies. In addition, uncertainty about the state of the environment is an important challenge in the design of an agent’s behaviour [11]. As a machine learning paradigm, reinforcement learning (RL) learns a strategy that maximizes returns through interaction between the agent and the environment. If a reinforcement learning mechanism is added to the decision-making process of the combat simulation agent, the agent can be expected to acquire learning ability while making independent decisions. Reference [12] also demonstrates that reinforcement learning is highly significant for studying the mechanisms of coordination between agents in multiagent systems.
Several studies on the combination of BT and RL have been reported in the literature. For example, a QL-BT construction method was first proposed in [13], and its effectiveness and feasibility were verified using the stalker game as a case study. Reference [14] simulates a fire rescue scenario and verifies that the agent based on RL-BT has a strong learning ability. However, there have been few studies on the application of RL-BT in the field of combat simulation. As a result, the learning ability of combat entities faced with an unfamiliar battlefield environment cannot be fully reflected in combat simulation, which inevitably affects the scientific nature and practical guidance value of combat deduction or simulation [15]. The aim of this paper is to propose an agent construction method for combat simulation in which the agent’s decision model is constructed based on BT. The RL-BT approach gives the agent the ability to learn about the battlefield environment and its own behaviour, and to optimise the BT with the help of this learning ability. Such work is of great value in achieving sound design of agent behaviour schemes and in conducting tactical-level military simulations.
Principles and methods
Agent modelling methods
Agent-based modelling is an important theoretical method in the field of artificial intelligence that is used to study the overall behaviour and evolution of complex systems as they emerge from the interaction between individuals. Currently, there are various ways to build an agent depending on the purpose and domain of application. One option is to use existing agent modelling platforms such as NetLogo, Swarm, Repast and AnyLogic. These platforms provide researchers with a good programmable agent environment, which helps to quickly build target agents and carry out simulation deduction according to simulation requirements. However, these platforms cannot conduct combat simulation directly in the geographical environment, so the complex work of constructing the combat environment is difficult to avoid. In addition, their object interaction capability during spatial analysis and simulation is weak, and their visualisation is difficult to combine with the geographical environment.
Therefore, in the design of the combat simulation agents in this paper, an agent construction approach based on the geographic data model is adopted. However, a key issue remains to be solved, namely the construction of the agent behavioural decision model. According to agent behavioural decision theory in games [16], among the existing methods, naive AI is the simplest but is difficult to adapt to complex decision processes and is difficult to reuse. Finite state machines (FSMs) show improved reusability and modularity, but their decision-making process is still rigid and difficult to expand [17]. The fuzzy state machine has good randomness, but its adaptability to complex scenes is still insufficient [18]. Compared with the above methods, the behaviour tree (BT) has the advantages of simplicity, flexibility and modularity and is more suitable for agent control under complex system conditions. In addition, BT has good randomness, scalability and reusability and can build complex execution logic [19, 20]. These characteristics are driving the development of behaviour tree methods from game development to industrial applications, mainly for applications such as AI robots, automated guided vehicles (AGVs) and unmanned driving [21]. Furthermore, if reinforcement learning methods are integrated with the BT used for combat agent construction to form an RL-BT-based construction model for combat simulation agents, the design of the agent’s BT can be made more automated while the agent also gains battlefield learning capabilities.
Behaviour tree
The behaviour tree was originally developed as a tool for modularizing nonplayer character (NPC) control structures in games [22], and provides better support for code reuse, functional incremental design and efficient testing [23]. From a formal point of view, BT is a directed tree consisting of a root node, internal nodes and leaf nodes. In BT, internal nodes are generally called control flow nodes, and leaf nodes are either execution nodes or conditional nodes. Table 1 shows all of the node types, symbols and their functions in BT.
Node types and function of BT
Among these nodes, the behaviour node, as a leaf node, is at the bottom of the tree and is the last to receive information, which means that it needs to perform specific operations and feedback on the information filtered out by the control node. Sequence nodes, selection nodes, and parallel nodes belong to combination nodes that are mainly responsible for the execution logic judgement, selection and execution sequence of agent behaviour.
Figure 1 shows the execution logic of the three types of nodes. The sequence node executes its children starting from the leftmost child; if it encounters a child that returns false, it does not proceed further and returns false. The sequence node returns true if and only if all child nodes return true. The selection node executes its children in the same order as the sequence node, but treats the return values in the opposite way: as soon as one child returns true, the selection node returns true, and it returns false if and only if all children return false. The parallel node differs from the previous two cases in that its children are executed at the same time.

The execution logic of composite nodes in the behaviour tree.
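The execution logic of the three composite nodes can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: nodes are modelled as plain callables that return true or false, the helper `leaf` is hypothetical, and the success policy of the parallel node (at least `required` children succeed) is an assumption, since the text only states that the children run at the same time.

```python
from typing import Callable, List

# A node is modelled as a zero-argument callable that returns
# True (success) or False (failure) when ticked.
Node = Callable[[], bool]


def sequence(children: List[Node]) -> Node:
    """Tick children left to right; fail at the first failing child."""
    def tick() -> bool:
        for child in children:
            if not child():
                return False
        return True
    return tick


def selector(children: List[Node]) -> Node:
    """Tick children left to right; succeed at the first succeeding child."""
    def tick() -> bool:
        for child in children:
            if child():
                return True
        return False
    return tick


def parallel(children: List[Node], required: int) -> Node:
    """Tick every child in the same step.

    ASSUMPTION: the node succeeds when at least `required` children succeed.
    """
    def tick() -> bool:
        results = [child() for child in children]
        return sum(results) >= required
    return tick


def leaf(name: str, result: bool, trace: List[str]) -> Node:
    """Hypothetical leaf that records its name and returns a fixed result."""
    def tick() -> bool:
        trace.append(name)
        return result
    return tick


trace: List[str] = []
children = [leaf("a", True, trace), leaf("b", False, trace), leaf("c", True, trace)]

seq_result = sequence(children)()              # stops after "b" fails
seq_trace = list(trace)
trace.clear()

sel_result = selector(children)()              # stops after "a" succeeds
sel_trace = list(trace)
trace.clear()

par_result = parallel(children, required=2)()  # ticks all three children
par_trace = list(trace)
```

The recorded traces make the short-circuit behaviour visible: the sequence node never reaches "c", the selection node never reaches "b", and the parallel node visits every child.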
The main task of reinforcement learning is for the agent to try actions repeatedly in the environment, adjust its strategy according to the feedback obtained from these attempts, and finally obtain the optimal strategy. Based on this strategy, the agent knows which action to perform in which state. The quality of a strategy depends on the final cumulative reward obtained after executing it. This type of problem can be expressed as a Markov decision process (MDP). In an MDP, the learning agent interacts with the environment at discrete time steps $t = 1, 2, 3, \ldots$. At each time step, the agent observes the system state $s_t \in S$ (where $S$ is the state space of the system) and performs an action $a_t \in A_{s_t}$, where $A_s$ is the finite and nonempty set of actions that the agent can execute in state $s$. The goal of the agent is to learn a strategy that maximizes the expected long-term reward from each state $s$ [24].
Q-learning is one of the most commonly used reinforcement learning techniques and is mainly used to solve learning problems in uncertain environments. It was first proposed by Watkins. Singh et al. proved the convergence of the Q-learning algorithm [25]. In Q-learning, the rewards of performing different actions in different states are recorded in the Q table. $Q(s, a)$ represents the reward value of action $a$ in state $s$ when the optimal strategy is followed afterwards. In its standard form, the update rule is:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
In the formula, $s$ and $a$ are the current state and the executed action, respectively. $s'$ is the next state reached after executing action $a$ in state $s$, and $a'$ is an action executable in $s'$. $\alpha \in [0, 1]$ is the learning rate, which determines how much of the agent’s previous training is retained: a larger value of $\alpha$ corresponds to weaker retention of the effect of previous training. $r$ is the reward obtained by executing action $a$ in the current state $s$. $\gamma \in [0, 1]$ is a discount factor that controls the importance of future rewards. The Q-learning algorithm applies this update repeatedly while the agent interacts with the environment.
In the algorithm, the action in state s is selected with an ε-greedy strategy, whose parameter sets the randomness and exploration ability of the agent’s behaviour and is chosen according to the specific application. Throughout execution, the algorithm constantly updates the values in the Q table and then uses the new values to evaluate which action to take in each state.
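The procedure described above can be sketched as tabular Q-learning with ε-greedy action selection. This is a minimal illustration under stated assumptions rather than the training code used in this paper: the three-state chain environment, the function names (`train`, `env_step`, `actions_of`) and the episode and step limits are all hypothetical, and the update is the standard Q-learning rule.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """With probability epsilon explore; otherwise pick the highest-Q action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, s, a, r, s_next, next_actions, alpha, gamma):
    """Standard Q-learning update; terminal states contribute 0 future value."""
    best_next = max((q[(s_next, a2)] for a2 in next_actions), default=0.0)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


def train(env_step, start_state, actions_of, episodes,
          alpha=0.9, gamma=0.9, epsilon=0.3, seed=0, max_steps=100):
    rng = random.Random(seed)
    q = defaultdict(float)          # Q table, initialised to 0
    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            actions = actions_of(s)
            if not actions:         # terminal state: episode ends
                break
            a = epsilon_greedy(q, s, actions, epsilon, rng)
            s_next, r = env_step(s, a)
            q_update(q, s, a, r, s_next, actions_of(s_next), alpha, gamma)
            s = s_next
    return q


# Hypothetical toy environment: states 0 -> 1 -> 2, where state 2 is terminal.
def actions_of(state):
    return [] if state == 2 else ["left", "right"]


def env_step(state, action):
    s_next = state + 1 if action == "right" else max(0, state - 1)
    reward = 1.0 if s_next == 2 else 0.0
    return s_next, reward


q = train(env_step, start_state=0, actions_of=actions_of, episodes=200)
```

In this toy chain the learned table prefers "right" in both nonterminal states, which is the behaviour one would expect from a converged Q table.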
To combine the advantages of BT and RL, comprehensively referring to previous research methods and application case documents [13, 26], an RL-BT decision model for combat agents is constructed. The main steps are as follows:
Step 1: Initialize BT, state space S, behaviour space A and Q-table.
Step 2: Perform offline learning according to the Q-learning algorithm to obtain a convergent Q-table, divide the Q-table based on the actions, and obtain the state allowable sub-table of each action;
Step 3: Integrate the sub-tables with BT, replace the conditional nodes under the sequence nodes in the original BT with the corresponding state allowable sub-tables, and take the maximum Q value in each sub-table as the value of its parent node to obtain RL-BT with Q values (see Fig. 2).

Split Q table and Inserting Q-values into RL-BT.
Step 4: Reorder the behaviour tree according to the node value Q, and output the RL-BT with value Q and state allowable list after reordering (see Fig. 3).

Reordering RL-BT children.
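Steps 2 to 4 can be sketched in Python as follows. The sketch rests on assumptions that the text does not spell out: a state is treated as allowable for an action when that action has the highest Q value in the state, the tree is flattened to a single selection node over (action, allowable states, node value) children, and the Q values are invented for illustration.

```python
from collections import defaultdict


def build_rl_bt(q_table):
    """Split a Q table by action and return children sorted by node value.

    q_table maps (state, action) -> Q value.
    ASSUMPTION: a state is allowable for an action when that action is the
    greedy (highest-Q) action in the state.
    """
    best = {}
    for (state, action), q in q_table.items():
        if state not in best or q > best[state][1]:
            best[state] = (action, q)

    sub_tables = defaultdict(dict)          # action -> {state: Q}
    for state, (action, q) in best.items():
        sub_tables[action][state] = q

    # Node value = maximum Q in the action's sub-table (Step 3).
    children = [
        (action, set(states), max(states.values()))
        for action, states in sub_tables.items()
    ]
    # Reorder children by descending node value (Step 4).
    children.sort(key=lambda child: child[2], reverse=True)
    return children


def tick(children, state):
    """Selection node: return the first action whose sub-table allows `state`."""
    for action, allowed, _value in children:
        if state in allowed:
            return action
    return None


# Hypothetical Q values, for illustration only.
q_table = {
    ("near", "Attack"): 8.0, ("near", "Retreat"): 2.0,
    ("far", "Attack"): 1.0, ("far", "Advance"): 5.0,
    ("low_health", "Retreat"): 3.0, ("low_health", "Attack"): -1.0,
}
rl_bt = build_rl_bt(q_table)
order = [action for action, _, _ in rl_bt]
```

With these made-up values the children are ordered Attack, Advance, Retreat, and ticking the tree in state "far" selects Advance.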
Modelling process
Based on method analysis, the construction of a combat simulation agent is carried out. The main steps are shown in Fig. 4:

Modelling process of combat simulation agent based on RL-BT.
(1) Agent object definition: The data structure of the agent adopts the object data model of the pan-spatial information system [27], and the attributes and behaviour lists are defined according to the types of combat entities. For example, a combat agent object needs attribute information such as health value, attack power and movement speed, as well as the ability to attack, advance, retreat, and assess its own health and attack values. When defining the behavioural capabilities of an object, it is necessary to consider information such as the parameters and return values required for behaviour execution.
(2) Behavioural component development and integration: In the pan-spatial information system, the behavioural capabilities of combat agent objects are developed and implemented as components and then integrated into the system, which supports the expansion of the behavioural capabilities of agent objects in a loosely coupled mode.
(3) Construction of the original behaviour tree: According to the basic behaviour rules of the agent formulated by the user, the original behaviour tree is constructed according to the basic principles of the behaviour tree in Section 2.1.
(4) Q-learning: After completing the definition of the behaviour space, state space and initial reward value, the agent is trained according to the Q-learning method in Section 2.2. The training process may involve adjustment of parameters and reward values, and finally, a convergent Q table can be obtained.
(5) RL-BT construction based on the Q table: According to the methods and steps mentioned in Section 2.3, the convergent Q table obtained after training is integrated with the original behaviour tree, and the behaviour tree is sorted according to the Q value to obtain RL-BT. The RL-BT is mounted on the agent object to complete the entire modelling process.
State space
The Combat Agent participates in combat simulation and has autonomous cognitive and behavioural capabilities. According to the solution method of the MDP problem, it is necessary to define its state space, behaviour space, and reward value for performing different behaviours in each state.
Considering the actual combat scenario, the combat agent state space is defined as follows: Health status: Health (None, Low, Medium, High); Number of companion neighbours (None, Low, Medium, High); Distance to Enemy (Near, Medium, Far).
In the state definition, the state values are discretized. Among the four values expressing the agent’s health status, None means dead, and Low, Medium and High mean low, medium and high health, respectively. The number of companion neighbours indicates the number of comrades around the agent, which may affect its behaviour. Distance to Enemy mainly affects the hit probability when the agent executes Attack: the closer the distance, the higher the hit rate. In the three distance states Near, Medium and Far, the hit probability of an attack is 100%, 10% and 1%, respectively.
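The discretized state space and the distance-dependent hit probability can be written down directly. The following Python sketch uses the value sets and probabilities given above; the variable and function names are illustrative assumptions, not identifiers from the experiment system.

```python
import random
from itertools import product

# Discrete state variables as defined in the text.
HEALTH = ("None", "Low", "Medium", "High")
NEIGHBOURS = ("None", "Low", "Medium", "High")
DISTANCE = ("Near", "Medium", "Far")

# Every combination of the three variables is one state: 4 * 4 * 3 = 48.
STATES = list(product(HEALTH, NEIGHBOURS, DISTANCE))

# Hit probability of Attack for each distance state (100%, 10%, 1%).
HIT_PROBABILITY = {"Near": 1.0, "Medium": 0.10, "Far": 0.01}


def attack_hits(distance: str, rng: random.Random) -> bool:
    """Sample whether an attack hits at the given distance."""
    return rng.random() < HIT_PROBABILITY[distance]
```

Because `rng.random()` is always below 1.0, an attack at Near distance always hits, which matches the 100% hit rate stated above.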
Behaviour space
The behaviour space of the combat agent defines the actions that the combat object can perform in the battlefield environment. In this experiment, five common tactical actions of combat objects are defined: Patrol: the agent patrols according to the initially set plan; Advance: the agent moves towards the target object; Attack: the agent performs an attack; Retreat: the agent moves away from the enemy object; Hide: the agent conceals itself.
Reward function
On the basis of the state space and the behaviour space, it is necessary to define the executable behaviours in different states and the reward of each behaviour. To speed up convergence, prior knowledge is embedded first: when action a1 is to be encouraged in state s0, a fixed reward value r is assigned to that state-action pair in advance, where r is an artificially designated empirical value. Table 2 shows the prior reward values corresponding to all states and behaviours. All other state-action pairs receive a reward of 0.
Rewards for state-action pairs
First, the initial behaviour tree of the combat simulation agent is built to define its basic decision logic. The tree structure in Fig. 5 shows how a combat agent judges the actions it needs to perform in each state. Starting from the root node, the agent first judges its own health status. If Health is High or Medium, it goes to the next selection node and judges whether to execute Patrol, Advance or Attack according to the distance from the enemy. Otherwise, if the health status is poor, the agent must consider whether to hide or retreat. If the health value is 0, the agent dies. To keep the initial BT concise, the number of nearby friendly forces is not considered when the agent’s health is Medium or High.

Original BT.
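The decision logic of the original BT can be approximated by a small Python function. Several details are assumptions, because Fig. 5 is not reproduced here: the mapping from distance to Patrol, Advance or Attack, and the rule that chooses between Hide and Retreat from the number of companion neighbours, are hypothetical choices made only to obtain a runnable example.

```python
from typing import Optional


def original_bt(health: str, neighbours: str, distance: str) -> Optional[str]:
    """Return the action chosen by a hand-written BT, or None if the agent is dead."""
    if health == "None":
        return None  # health value 0: the agent is dead

    if health in ("High", "Medium"):
        # ASSUMPTION: far -> Patrol, medium -> Advance, near -> Attack.
        # Neighbours are ignored in this branch, as stated in the text.
        return {"Far": "Patrol", "Medium": "Advance", "Near": "Attack"}[distance]

    # Low health: choose between Hide and Retreat.
    # ASSUMPTION: hide when enough companions are nearby, otherwise retreat.
    return "Hide" if neighbours in ("Medium", "High") else "Retreat"
```

The point of the sketch is that every branch is fixed by the designer in advance, which is exactly what the Q-learning step is later meant to relax.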
After the initial behaviour tree of the agent has been constructed, the agent is trained with the Q-learning method to obtain the Q table. The training environment is constructed independently with reference to the experimental simulation environment, and the combat space is discretized. The behavioural decision-making conditions and rewards of the agent are set according to Table 2. The agent is trained for 1,000,000 iterations. During training, the learning rate (α) and the discount factor (γ) are both set to 0.9, which favours fast learning while balancing the current reward against potential future rewards. The ε-greedy policy is adopted during learning, and the ε parameter is set to 0.3.
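To make the effect of these parameter values concrete, the following Python snippet works through a single standard Q-learning update with α = γ = 0.9 and computes how often an ε-greedy policy with ε = 0.3 picks the greedy action. The old Q value, the reward and the best next Q value are hypothetical numbers, and the calculation assumes that all five actions are executable in the state and that the random branch samples uniformly from them.

```python
ALPHA, GAMMA, EPSILON = 0.9, 0.9, 0.3   # values reported in the text
N_ACTIONS = 5                            # Patrol, Advance, Attack, Retreat, Hide

# Hypothetical numbers for one update (not taken from the experiment).
q_old, reward, best_next_q = 0.0, 10.0, 5.0

# Standard Q-learning update for one state-action pair.
q_new = q_old + ALPHA * (reward + GAMMA * best_next_q - q_old)

# Probability that an epsilon-greedy step selects the greedy action:
# the greedy branch, plus the chance that the random branch picks it too.
p_greedy = (1 - EPSILON) + EPSILON / N_ACTIONS

print(round(q_new, 4), round(p_greedy, 4))
```

With these numbers the updated value is about 13.05 and the greedy action is chosen in about 76% of steps, so roughly a quarter of all training steps are exploratory.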
RL-BT construction
A certain amount of learning time is required before a stable Q table is obtained. After the Q table has been trained through reinforcement learning, it is divided by behaviour according to the method in Fig. 2, and each resulting sub-table replaces the corresponding condition node in the initial behaviour tree. Figure 6 shows the RL-BT after the conditional nodes have been replaced by the sub-tables. Unlike with the original BT in Fig. 5, the agent is no longer required to perform specified tactical actions in the states specified by the designer. We found that the RL-BT agent is more inclined to perform attack actions. The choice between hiding and retreating when health is insufficient also differs: after learning, the agent chooses to hide rather than retreat in more cases.

RL-BT (with max Q-Values).
The experiment takes a land encounter battle between red and blue sides as an example. Because the scenario must first be expressed as an MDP problem, the combat environment is set in a bounded land area. The red and blue sides attack after encountering each other, with each combat agent acting autonomously. In the experiment, every agent object has autonomous behaviour ability and can judge and execute the corresponding tactical actions according to the original BT or RL-BT.
System architecture and environment deployment
The combat agent experiment system is built on the pan-spatial information system platform. This is a new distributed GIS whose object-oriented spatiotemporal data organization can be easily combined with the data structure required for agent construction [28]. In this system, the agent is stored in an object-oriented data model to construct a spatiotemporal entity with autonomous decision-making ability. The whole simulation system adopts a distributed storage and computation framework. During deduction and simulation, the main computing load comes from executing behaviour logic. At each simulation time node, every agent traverses its BT, determines the behaviour to be executed, and generates a task from that behaviour and its parameter information. In a multiagent system, each simulation time node therefore produces a task list. A parallel task handling framework requests computation for these tasks on the server in an orderly manner, so that the execution results of the behaviours are calculated as quickly as possible and stored in an object-oriented database of spatiotemporal entities. The detailed process is shown in Fig. 7.

Calculation process.
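The per-time-node flow described above (decide, generate a task list, execute tasks in parallel, collect results) can be sketched with Python’s standard `concurrent.futures` module. This is only an illustration of the idea, not the architecture of the experiment system: the `Task` fields, the dictionary layout of an agent, the `decide` callback and the placeholder `execute` function are all hypothetical, and result storage is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    agent_id: int
    behaviour: str
    params: Dict[str, object]


def generate_tasks(agents: List[dict], decide: Callable[[dict], str]) -> List[Task]:
    """Build one task per living agent for the current simulation time node."""
    return [
        Task(agent["id"], decide(agent), {"position": agent["position"]})
        for agent in agents
        if agent["health"] > 0
    ]


def execute(task: Task) -> Tuple[int, str]:
    """Placeholder for the behaviour computation performed on the server."""
    return (task.agent_id, task.behaviour)


def run_time_node(agents, decide, max_workers: int = 4):
    """Execute all tasks of one time node in parallel and collect the results."""
    tasks = generate_tasks(agents, decide)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the submitted tasks.
        return list(pool.map(execute, tasks))


# Hypothetical agents and decision rule, for illustration only.
agents = [
    {"id": 1, "health": 80, "position": (0, 0)},
    {"id": 2, "health": 0, "position": (1, 1)},   # dead: generates no task
    {"id": 3, "health": 20, "position": (2, 2)},
]
results = run_time_node(agents, lambda a: "Attack" if a["health"] > 50 else "Retreat")
```

Keeping task generation separate from task execution mirrors the loose coupling between BT traversal and behaviour calculation that the text describes.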
Agent design based on an object-oriented data model, distributed data storage and parallel task management framework, as well as the loosely coupled characteristics of BT design, behaviour calculation, result storage and visual module, can effectively support the needs of different task scenes and large-scale agent calculation.
To compare the different application effects of original BT and RL-BT, two different scenarios were tested.
Scenario 1: Both the red and blue sides made decisions based on the original BT model.
Scenario 2: The blue side took the original BT as the decision model, while the red side took the RL-BT.
Thirty combat objects were deployed, and the force settings and formations of the two sides were identical. For these two scenarios, 30 combat experiments were performed.
In the experiment, the two sides attacked autonomously based on their selected decision models. The visualization system was developed on the Cesium platform and was used to display the simulation results (see Fig. 8), including battlefield details of the simulation experiment and the query and analysis functions of the system. With this visualization system, users can examine losses, combat capability and other information for both sides at each stage of the confrontation from multiple angles. During deduction, users can also perform interactive operations such as scene roaming, information querying and battlefield situation analysis. In addition, the visualization system allows users to pause the simulation at any time node and step back in time, which makes it more convenient to study the details of the whole combat process.

Visualization of tactical deduction simulation based on the RL-BT agent.
Analysis of win-loss statistics
Every confrontation experiment was terminated when the forces of one side were completely eliminated, and the number of remaining forces of the winning side was recorded at the end of the experiment. Figure 9(a) shows the combat simulation results when both the red and blue sides adopted the original BT to make decisions. When the number of troops, formation and battlefield were consistent, the wins and losses of the two sides were approximately equal. The average numbers of surviving agents of the two sides were 1.16 and 1.26, respectively, showing a small overall difference. However, when agents using RL-BT fought agents using the original BT, this difference widened significantly (Fig. 9(b)). First, the side adopting RL-BT won markedly more often than the side adopting the original BT. In addition, the gap between the average numbers of surviving agents increased noticeably, to 2.5 and 0.46, respectively.

Experimental results of multiagent combat with different decision models.
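The win counts and average numbers of survivors reported above are simple aggregates over the repeated experiments. A minimal Python sketch of such a summary is shown below; the function name and the four sample outcomes are hypothetical and are not the experimental data of this paper.

```python
from typing import Dict, List, Tuple


def summarise(results: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarise experiments given as (red_survivors, blue_survivors) pairs.

    An experiment ends when one side is eliminated, so the winner is the
    side with survivors remaining.
    """
    n = len(results)
    return {
        "red_wins": sum(1 for red, blue in results if red > 0 and blue == 0),
        "blue_wins": sum(1 for red, blue in results if blue > 0 and red == 0),
        "red_mean_survivors": sum(red for red, _ in results) / n,
        "blue_mean_survivors": sum(blue for _, blue in results) / n,
    }


# Made-up outcomes of four experiments, for illustration only.
sample = [(3, 0), (0, 2), (2, 0), (5, 0)]
summary = summarise(sample)
```

Note that the mean is taken over all experiments, including those a side lost with zero survivors, which is why average survivor counts can be small even for the side that wins more often.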
The analysis of the result data reveals that the agent based on RL-BT uses better tactics in the combat simulation. In addition, although the configuration and environmental state of the two sides were made as consistent as possible, obvious randomness is still observed in the simulation process. This randomness appears in the results as uncertainty about which side wins and as variation in the final surviving force. In particular, the side using the RL-BT decision-making model shows stronger randomness, with more pronounced fluctuation in the number of surviving agents.
After simulating the two scenarios of the experimental design, data on the change in the average surviving strength of both sides during the confrontation were obtained; Fig. 10 shows the trend of the data.

The process of changing the strength of the red and blue sides in the two scenarios.
The comparison revealed the following conclusions:
(1) The longest simulation end time for Scenario 1 is later than that of Scenario 2, mainly because, in Scenario 1, both sides prefer the strategy of retreat in the H(L) state, which prolongs the overall simulation duration.
(2) Scenario 1 has a lower rate of troop loss than Scenario 2 in the first half of the confrontation, with a sudden increase in the rate of troop loss between simulation times 15:08:10 and 15:09:50, indicating that under the decision logic of Scenario 1, the intense combat confrontation is mainly concentrated in this time period.
(3) The overall strength loss curve for Scenario 2 changes relatively evenly, with Red gaining the advantage after simulation time 15:07:50.
(4) On average, the Red Agent who adopted RL-BT in Scenario 2 was more inclined to adopt an aggressive confrontation strategy and ultimately had better combat results than the Blue Agent who adopted Original BT.
Discussion
The research work in this paper applies the RL-BT method to combat simulation agents. When faced with complex and specialised tactical manoeuvre choices, the RL method can be used to design more reasonable BTs, and this work is of great value for research on how combat simulation agents make strategic choices in different states of the battlefield environment.
In addition, based on GIS, this paper designs and builds a distributed and parallel framework for the generation and computation of agent behavioural tasks, providing an effective solution for the simulation and computation of large-scale intelligent spatio-temporal entities. The solution combines the advantages of both gaming and simulation. On the one hand, the combat simulation agent proposed in this paper is constructed directly based on real geographical coordinates, which can simplify the process of constructing the geographical environment in combat simulation compared to existing simulation platforms, while enhancing the interactivity in the simulation process. On the other hand, the parallel processing architecture and distributed storage structure, compared to games, provide a feasible solution for large-scale spatio-temporal objects to participate in computation and simulation at the same time.
Uncertainty is an important source of the difficulty of accurately predicting the results of complex system simulations. How reasonably uncertainty is expressed in agent modelling is strongly related to the rationality of the deduction or simulation results. Although the behaviours in the original BT include some random factors, such as the different hit probabilities at different attack distances and the stochastic damage caused, the decision-making logic of the BT as a whole is strictly determined by its conditions. The decision-making process in RL-BT after Q-learning is based on the Q values in different states, and the use of the ε-greedy strategy in learning further increases the randomness of agent decision-making.
Conclusions
To build a combat simulation agent in an efficient and simple manner and realize calculation and simulation in a real geographical environment, this paper first studies the previous methods. It is found that the BT method can easily carry out the construction of a combat simulation agent decision model, and its advantages in reusability and scalability also make the design and development of a combat simulation system more efficient and flexible, as was also verified in this paper.
However, designing an optimal decision model is still a difficult task for the designer of a combat simulation agent. It requires not only basic battlefield tactical experience but also a large number of simulations and much debugging after the design is complete to achieve a reasonable effect. In previous work, the behaviour execution conditions of the agent had to be set manually, and commissioning and testing also had to be completed manually.
The combination of BT and reinforcement learning provides researchers with a more convenient means to build a more intelligent, more efficient behaviour scheduling and an agent more capable of learning. In this paper, an agent modelling method of combat simulation based on the RL-BT method was proposed. The reinforcement learning method is successfully integrated in the design of a combat simulation agent decision model, verifying the feasibility and advantages of the RL-BT method in the application of combat simulation.
The BT method selects a specific agent behaviour under fixed conditions. Compared with the original BT, the tactical action selection of RL-BT is more flexible, showing that the agent’s behaviour strategy can be optimized with the help of reinforcement learning. The simulation results show that the combat simulation agent based on RL-BT is significantly superior to the agent using the original BT. This result has important reference value for analysing the effect of combat entities executing different tactical actions in different battlefield environments.
