Minmax fuzzy deterministic policy gradient for zero-sum differential game: Take pursuit-evasion problem as example

Abstract

A novel actor-critic algorithm is introduced and applied to a zero-sum differential game. The proposed structure consists of two actors and one critic. The actors represent the control policies of the opposing players, and the critic approximates the state-action utility function. Instead of neural networks, fuzzy inference systems are used as approximators for the actors and the critic, so that the linguistic fuzzy rules carry a specific practical meaning. Since the goals of the players in the game are completely opposite, the actors of the different players are updated simultaneously in opposite directions during training: one actor is updated in the direction that minimizes the Q value, while the other is updated in the direction that maximizes it. A pursuit-evasion problem with two pursuers and one evader is taken as an example to illustrate the validity of the method. In this problem, the two pursuers share the same actor, and the symmetry of the problem is used to improve the replay buffer. At the end of the paper, confrontations between policies obtained after different numbers of training episodes are conducted.

Keywords

Fuzzy inference system; differential game; reinforcement learning; pursuit-evasion problem; deterministic policy gradient

1 Introduction

Differential games are a class of problems concerning the modeling and analysis of conflicts in dynamic systems. The pursuit-evasion problem is a classical zero-sum differential game, originally proposed by Isaacs [1]. The original pursuit-evasion problem involves optimizing the control policies of two players, the pursuer and the evader, playing against each other. The goal of the pursuer is to catch the evader in as little time as possible, and the goal of the evader is to delay the moment of capture as long as possible [2, 3]. Since Isaacs, a large number of studies on the pursuit-evasion problem have been published, and various solution methods have been proposed [4–6]. In addition, more types of pursuit-evasion problems have been raised, such as problems with multiple pursuers or multiple evaders [7, 8].

In addition to the classical methods listed above, some studies have applied reinforcement learning (RL) algorithms to differential games [9]. RL is an area of machine learning concerned with how agents ought to take actions in an environment in order to maximize the cumulative reward [10, 11]. Unlike supervised learning, which is trained on data provided by humans, RL learns from the experience gained in interaction with the environment [10, 12]. However, there is limited research on applying RL alone to complex problems. With the development of deep learning, many studies use deep neural networks as function approximators to deal with high-dimensional states. These approaches are known as deep RL [13–16].

Fuzzy logic, a widely used engineering method, can acquire fuzzy rules from human experience and can handle knowledge that cannot be described precisely. In recent years, fuzzy logic has been combined with RL and applied to the pursuit-evasion problem [16–18]. In these studies, the controllers were optimized with value-based RL methods such as Q-learning.

Value-based RL can only handle discrete and low-dimensional action spaces. An obvious way to adapt such RL methods to problems with a continuous action space is simply to discretize the action space [15]. In cases with a high-dimensional action space, however, learning fuzzy rules directly with a value-based RL method can lead to the "curse of dimensionality" [19]. To make RL applicable to problems with a high-dimensional state space, the actor-critic (AC) algorithm was proposed. In the AC method, the evaluator (critic) and the controller (actor) are separated and use different approximators [20], such as neural networks. The actor learns to approximate the control policy, i.e. to take an action according to the current system state. The critic learns to estimate the Q value of a state-action pair [21].

Previously, it was thought that representing the control policy by a parametric probability density function was the only way. This approach is known as the stochastic policy gradient (SPG) theorem. Later, the deterministic policy gradient (DPG) theorem was proposed [22]. In the DPG theorem, the output of the controller is deterministic: it no longer depends on a probability distribution but only on the system state. Recently, the combination of deep neural networks and DPG has been applied to problems with continuous action spaces [15].

However, it is generally acknowledged that a neural network is a black box whose practical meaning is difficult to describe [23]. In this paper, to overcome the "curse of dimensionality" and clarify the practical meaning of the learning results, the combination of DPG and fuzzy logic is used to solve the pursuit-evasion problem. An algorithm based on the fuzzy inference system (FIS) and DPG, referred to as the Minmax Fuzzy Deterministic Policy Gradient (Minmax-FDPG), is proposed. In this algorithm, FISs [24] are used to approximate the actors and the critic in a novel AC structure to learn the control policies. Unlike the classical AC structure, the novel structure consists of two actors and one critic. The actors represent the control policies of the different players and, owing to the properties of the zero-sum game, they are optimized in opposite directions. In addition, the input of the critic is the combination of the system state and the actions of all players. As in many existing RL algorithms, nonlinear approximators may cause convergence problems. To ensure the convergence of the proposed Minmax-FDPG, target approximators, a replay buffer and mini-batch techniques are used.

To illustrate the effectiveness of the method, a pursuit-evasion problem with two pursuers and one evader is formulated as a Markov Decision Process (MDP), and the Minmax-FDPG is used to train the control policies of these three players simultaneously. In this problem, the two pursuers share the same actor, while the evader uses the other actor, and the symmetry of the problem is used to improve the replay buffer. The goal is to train the actors of both sides of the game to learn relatively optimal control policies by interacting with the environment.

In general, the main contributions of this paper are as follows:

Instead of deep neural networks, some FISs are applied as approximators in the proposed method to deal with the zero-sum differential game with continuous action space, and a pursuit-evasion problem with two pursuers and one evader is taken as an example.

Instead of training a player’s policy with a fixed opponent’s policy, the policies of all players in a pursuit-evasion game can be optimized simultaneously in the proposed method.

Based on the properties of FIS, the learning results can describe the practical meaning of each rule.

To demonstrate the effectiveness of training, some confrontations between the policies with different training episodes are conducted.

The remainder of this paper is organized as follows. Some background knowledge is introduced in Section 2. In Section 3, the structure of the proposed Minmax-FDPG is described, the gradients of the parameters of each actor and of the critic are derived, and the updating process is introduced. In Section 4, a pursuit-evasion problem is taken as an example to illustrate the validity of the proposed method. Lastly, a brief conclusion is presented in Section 5.

2 Background

2.1 Reinforcement learning

As described earlier, instead of learning from data provided by humans, RL learns from data generated by interaction with the environment. In the majority of existing studies, the environment is modeled as an MDP denoted by a tuple $(S, A, F, R)$ , where $S$ is the state space, $A$ is the action space, $F$ is the transition model and $R$ is the reward function. At each time step t, the agent observes a system state $s_{t} \in S$ , takes an action $a_{t} \in A$ , receives a reward $r_{t} = R (s_{t}, a_{t})$ and ends up in the next state $s_{t + 1} = F (s_{t}, a_{t})$ . The future discounted return at time step t is defined as $R_{t} = \sum_{t^{'} = t}^{\infty} γ^{t^{'} - t} r_{t^{'}}$ , where γ ∈ [0, 1) is the discount factor. An agent’s behavior is defined by a control policy $π : S \to A$ . The state-action value function, also known as the Q function, $Q^{π} (s, a) = E_{π} (R_{t} | s_{t} = s, a_{t} = a)$ , represents the expected future discounted return after taking action a in state s and thereafter following policy π. The Bellman equation gives the condition satisfied by the optimal Q value function [25, 26]:

$Q^{*} (s, a) = R (s, a) + γ \max_{a^{'}} Q^{*} (s^{'}, a^{'})$ (1) where $s^{'} = F (s, a)$ . Based on the Bellman equation, many RL algorithms have been developed, such as the Sarsa algorithm, Q-learning [25, 27–29] and policy gradient methods [30].

AC is the most commonly used RL structure [31]; its architecture is displayed in Fig. 1. Deep Deterministic Policy Gradient (DDPG) is one such AC algorithm. For the critic, a deep neural network Q^θ (s, a) parameterized by θ is used to represent the Q value function. For the actor, another deep neural network A^α (s) with parameters α is used to approximate the control policy.

Fig. 1

The AC architecture.

In each training iteration, the agent interacts with the environment once, the experience (s_t, a_t, r_t, s_{t+1}) is stored in the replay buffer, and a mini-batch of experiences is randomly sampled from the replay buffer to train the actor and the critic. The target critic Q^θ′ and target actor A^α′ are updated by soft copying from Q^θ and A^α at every training iteration.

In this paper, the Minmax-FDPG is modified on the basis of DDPG, and will be introduced in the next section.

2.2 Fuzzy inference system

Unlike the black box represented by a neural network, a FIS is an interpretable mapping consisting of several rules. A typical FIS model with several inputs can be described by the following rule form [16, 32]:

$R_{l} : x_{1} \text{ is } F_{1}^{l} \land \ldots \land x_{n} \text{ is } F_{n}^{l} \Rightarrow f_{l} = K_{l}$ (2) where $F_{i}^{l}$ is the fuzzy set of the ith input in rule R_l and K_l is the output of the consequent f_l.

Fig. 2 shows the structure of a FIS, which consists of five layers. The first layer is the input layer, in which the data consisting of multiple components are fed into the FIS. The second layer is the fuzzy layer, in which the membership of the ith input in the lth rule is defined by a Gaussian function with mean $m_{i}^{l}$ and standard deviation $σ_{i}^{l}$ :

Fig. 2

Components of fuzzy inference system.

$μ_{R_{l}}^{F_{i}^{l}} (x_{i}) = \exp [- {(\frac{x_{i} - m_{i}^{l}}{σ_{i}^{l}})}^{2}]$ (3) The firing strength of the lth rule is:

$ω (R_{l}) = \prod_{i} μ_{R_{l}}^{F_{i}^{l}} (x_{i})$ (4) The third and fourth layers are the normalization layer and the weighting layer, respectively. In these layers, the firing strengths are normalized and the final output is given as the average of the rule outputs K_l weighted by the normalized firing strengths, i.e.

$f = \sum_{l} \frac{ω (R_{l})}{\sum_{l} ω (R_{l})} K_{l}$ (5)
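The forward pass defined by Equ. (3)–(5) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `fis_output` and the two-rule parameter values are hypothetical and are not taken from the paper.

```python
import math

def fis_output(x, means, sigmas, consequents):
    """Evaluate a FIS with Gaussian memberships and constant consequents."""
    strengths = []
    for m_l, s_l in zip(means, sigmas):
        # Firing strength of rule l: product of Gaussian memberships (Equ. 3-4).
        w = 1.0
        for x_i, m, s in zip(x, m_l, s_l):
            w *= math.exp(-((x_i - m) / s) ** 2)
        strengths.append(w)
    total = sum(strengths)
    # Rule outputs K_l weighted by the normalized firing strengths (Equ. 5).
    return sum(w * k for w, k in zip(strengths, consequents)) / total

# Two hypothetical rules over a two-component input.
means = [[0.0, 0.0], [1.0, 1.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
consequents = [-1.0, 1.0]

# Halfway between the two rule centers both rules fire equally,
# so the output is the plain average of the two consequents.
y_mid = fis_output([0.5, 0.5], means, sigmas, consequents)
```

Because each rule has a readable center, width and consequent, the learned parameters can be inspected rule by rule, which is the interpretability argument made above.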

2.3 Pursuit-evasion problem

We consider a pursuit-evasion problem with two pursuers and one evader, as shown in Fig. 3.

Fig. 3

Schematic of the pursuit-evasion problem.

In Fig. 3, the tuples $(x_{p}^{1}, y_{p}^{1}, Φ_{p}^{1})$ and $(x_{p}^{2}, y_{p}^{2}, Φ_{p}^{2})$ denote the positions and heading angles of Pursuer I and Pursuer II, respectively, and the tuple (x_e, y_e, Φ_e) denotes the position and heading angle of the evader. The two pursuers share the same velocity V_p, and the velocity of the evader is denoted as V_e. Both V_p and V_e are constants. The kinematic behaviors of the pursuers and the evader are described by the following differential equations [17]:

$\begin{matrix} \dot{x}_{p}^{1} = V_{p} \cos Φ_{p}^{1} \\ \dot{y}_{p}^{1} = V_{p} \sin Φ_{p}^{1} \\ \dot{Φ}_{p}^{1} = \frac{V_{p}}{L} \tan a^{1} \\ \dot{x}_{p}^{2} = V_{p} \cos Φ_{p}^{2} \\ \dot{y}_{p}^{2} = V_{p} \sin Φ_{p}^{2} \\ \dot{Φ}_{p}^{2} = \frac{V_{p}}{L} \tan a^{2} \\ \dot{x}_{e} = V_{e} \cos Φ_{e} \\ \dot{y}_{e} = V_{e} \sin Φ_{e} \\ \dot{Φ}_{e} = \frac{V_{e}}{L} \tan b \end{matrix}$ (6) where L is the distance from the front axle to the rear axle, a¹, a² ∈ [- a_max, a_max] are the steering angles of the two pursuers, and b ∈ [- b_max, b_max] is the steering angle of the evader. Equation 6 implies that each player has a minimum turning radius:

$\begin{matrix} R_{p \min} = \frac{L}{tan a_{\max}} \\ R_{e \min} = \frac{L}{tan b_{\max}} \end{matrix}$ (7) The evader is rendered as more maneuverable than the pursuer, i.e. b_max > a_max while the pursuers are rendered as faster than the evader, i.e. V_p > V_e. The capture is considered to occur when the distance between the evader and one of the pursuers is less than a threshold value d_capture, i.e.

$d = min (d^{1}, d^{2}) \leq d_{capture}$ (8) where

$\begin{matrix} d^{1} = \sqrt{(x_{e} - x_{p}^{1})^{2} + (y_{e} - y_{p}^{1})^{2}} \\ d^{2} = \sqrt{(x_{e} - x_{p}^{2})^{2} + (y_{e} - y_{p}^{2})^{2}} \end{matrix}$ (9) The goal of the pursuer is to catch the evader in as little time as possible, while the goal of the evader is to delay the moment of capture as much as possible.
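The kinematics in Equ. (6), the turning radius in Equ. (7) and the capture test in Equ. (8)–(9) can be sketched as follows. The paper does not state the integration scheme, so the forward-Euler step used here is an assumption, and the function names `step`, `min_turning_radius` and `captured` are hypothetical.

```python
import math

def step(pose, speed, steer, wheelbase, dt):
    """One forward-Euler step (an assumed scheme) of Equ. 6 for one player."""
    x, y, heading = pose
    x_new = x + speed * math.cos(heading) * dt
    y_new = y + speed * math.sin(heading) * dt
    heading_new = heading + speed / wheelbase * math.tan(steer) * dt
    return (x_new, y_new, heading_new)

def min_turning_radius(wheelbase, max_steer):
    """Minimum turning radius from Equ. 7."""
    return wheelbase / math.tan(max_steer)

def captured(evader, pursuers, d_capture):
    """Capture test from Equ. 8-9: nearest pursuer within d_capture."""
    d = min(math.hypot(evader[0] - p[0], evader[1] - p[1]) for p in pursuers)
    return d <= d_capture

# Illustrative call: a player heading along the x axis with zero steering
# moves straight ahead by speed * dt.
next_pose = step((0.0, 0.0, 0.0), 1.0, 0.0, 0.3, 0.1)
```

The same `step` function serves both pursuers and the evader; only the speed and the steering limits differ between them.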

3 Method details

3.1 MDP modeling

In the pursuit-evasion problem, the next states of both sides of the game are determined solely by the states and actions at the current step. Therefore, the problem can be modelled as an MDP. To simplify the analysis, we consider a state space of reduced dimension. The pursuit-evasion game can be stated in a coordinate system fixed to the body of the evader [33, 34]; see Fig. 4. The state of the system can then be represented as:

Fig. 4

Reduced space of the pursuit-evasion game.

$s_{t} = {[φ_{1}, φ_{2}, φ_{3}, φ_{4}, φ_{5}, φ_{6}]}^{T}$ (10) where

$\begin{matrix} φ_{1} = (x_{p}^{1} - x_{e}) cos Φ_{e} + (y_{p}^{1} - y_{e}) sin Φ_{e} \\ φ_{2} = (y_{p}^{1} - y_{e}) cos Φ_{e} - (x_{p}^{1} - x_{e}) sin Φ_{e} \\ φ_{3} = Φ_{p}^{1} - Φ_{e} \\ φ_{4} = (x_{p}^{2} - x_{e}) cos Φ_{e} + (y_{p}^{2} - y_{e}) sin Φ_{e} \\ φ_{5} = (y_{p}^{2} - y_{e}) cos Φ_{e} - (x_{p}^{2} - x_{e}) sin Φ_{e} \\ φ_{6} = Φ_{p}^{2} - Φ_{e} \end{matrix}$ (11) That means [φ₁, φ₂, φ₃] ^T and [φ₄, φ₅, φ₆] ^T are the states of Pursuer I and Pursuer II in the reduced space of the pursuit-evasion game. In the rest of this paper, these two vectors are denoted as s¹ and s² respectively.
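The change of coordinates in Equ. (11) can be sketched as below. The helper names `relative_state` and `reduced_state` are hypothetical, and, like Equ. (11) itself, the sketch does not wrap the relative heading angle into a fixed interval.

```python
import math

def relative_state(pursuer, evader):
    """Pose of one pursuer in the evader's body frame (one block of Equ. 11)."""
    xp, yp, heading_p = pursuer
    xe, ye, heading_e = evader
    dx, dy = xp - xe, yp - ye
    along = dx * math.cos(heading_e) + dy * math.sin(heading_e)
    across = dy * math.cos(heading_e) - dx * math.sin(heading_e)
    return [along, across, heading_p - heading_e]

def reduced_state(pursuer1, pursuer2, evader):
    """Six-component state of Equ. 10: s1 followed by s2."""
    return relative_state(pursuer1, evader) + relative_state(pursuer2, evader)

# Evader at (1, 1) heading "up"; Pursuer I sits 2 units directly ahead of it,
# Pursuer II sits 2 units to its right.
s = reduced_state((1.0, 3.0, math.pi / 2), (3.0, 1.0, 0.0), (1.0, 1.0, math.pi / 2))
```

In this frame the evader's own pose drops out, which is why nine world coordinates reduce to six state components.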

At time step t, the action consists of the steering angles of all three players, i.e.:

$u_{t} = [a_{t}^{1}, a_{t}^{2}, b_{t}]^{T}$ (12) The reward function is set as:

$r_{t} = D_{t + 1} - D_{t}$ (13) where

$\begin{matrix} D_{t} = \min (\sqrt{(x_{e} (t) - x_{p}^{1} (t))^{2} + (y_{e} (t) - y_{p}^{1} (t))^{2}}, \\ \sqrt{(x_{e} (t) - x_{p}^{2} (t))^{2} + (y_{e} (t) - y_{p}^{2} (t))^{2}}) \end{matrix}$ (14) The reward therefore indicates whether the distance from the evader to the nearest pursuer has grown (positive reward) or shrunk (negative reward) during the step.
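As a small worked instance of Equ. (13)–(14), the reward can be computed from the positions before and after a step. The function names and the positions below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math

def nearest_distance(evader_xy, pursuers_xy):
    """D_t of Equ. 14: distance from the evader to the nearest pursuer."""
    return min(math.hypot(evader_xy[0] - px, evader_xy[1] - py)
               for px, py in pursuers_xy)

def reward(evader_t, pursuers_t, evader_next, pursuers_next):
    """r_t of Equ. 13: change of the nearest distance over one step."""
    return (nearest_distance(evader_next, pursuers_next)
            - nearest_distance(evader_t, pursuers_t))

# The nearest pursuer closes in from distance 5 to distance 4.
r = reward((0.0, 0.0), [(3.0, 4.0), (6.0, 8.0)],
           (0.0, 0.0), [(0.0, 4.0), (6.0, 8.0)])
```

A negative value such as this one is favorable to the pursuers, which is consistent with the pursuers minimizing and the evader maximizing the Q value.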

3.2 Minmax-FDPG structure

In the pursuit-evasion game, the purposes of the pursuers and the evader are diametrically opposed: the goal of the pursuers is to close the distance to the evader, while the goal of the evader is to distance itself from the pursuers. Therefore, the Bellman equation in Equ. 1 needs to be modified. The optimal state-action value function in the pursuit-evasion game is:

$\begin{matrix} Q^{*} (s, a^{1}, a^{2}, b) = R (s, a^{1}, a^{2}, b) \\ + γ max_{b^{'}} min_{{a^{1}}^{'}, {a^{2}}^{'}} Q^{*} (s^{'}, {a^{1}}^{'}, {a^{2}}^{'}, b^{'}) \end{matrix}$ (15) where $s^{'} = F (s, a^{1}, a^{2}, b)$ . Based on the preceding equation, the overall block diagram of the proposed Minmax-FDPG adopted in the pursuit-evasion problem is shown in Fig. 5. The proposed Minmax-FDPG is an improved version based on DDPG. The actor A^α and actor A^β are the actors of the pursuers and the evader respectively. The relationship between the input and output of the actor A^α is divided into two cases: When it takes Pursuer I’s state s¹ = [φ₁, φ₂, φ₃] ^T as input, it outputs the Pursuer I’s steering angle a¹, and when it takes Pursuer II’s state s² = [φ₄, φ₅, φ₆] ^T as input, it outputs the Pursuer II’s steering angle a², i.e.

Fig. 5

Structure of Minmax-FDPG

$\begin{matrix} a^{1} = A^{α} (s^{1}) \\ a^{2} = A^{α} (s^{2}) \end{matrix}$ (16) The actor A^β takes the whole state vector as input and outputs the steering angle b of the evader. The critic then takes the state together with the joint action, i.e. the combination of a¹, a² and b, as input and outputs a Q value. The pursuit-evasion game environment transitions to a new state and returns a reward.

To overcome the divergence caused by nonlinear function approximators, the replay buffer and target value techniques are used [25]. In our method, the replay buffer is additionally improved by exploiting the symmetry of the problem. First, the agents’ experiences generated by interacting with the environment are pushed into the buffer. Second, an iterative update mechanism is adopted to tune the Q values toward target values that are only periodically updated. Third, the two actors are updated in opposite directions: the pursuers’ actor is updated in the direction that minimizes the Q value, while the evader’s actor is updated in the direction that maximizes the Q value. The updating process is described in detail in the following sections.

3.3 Optimization process of the Minmax-FDPG

3.3.1 Replay buffer with consideration of symmetry and mini-batch

The replay buffer is a block of memory used to store the agents’ experiences from each interaction with the environment. In each training step, a mini-batch consisting of experiences randomly selected from the replay buffer is used to update the parameters of the approximators. The replay buffer always stores the newest experience and drops the oldest one. Moreover, the pursuit-evasion game in this paper has a certain symmetry. For the evader, after exchanging the states of Pursuer I and Pursuer II, the situation remains the same and, therefore, the action it should take remains the same. For the pursuers, after exchanging the states of Pursuer I and Pursuer II, the actions they should take are exchanged as well; see Algorithm 1.

Algorithm 1 Replay buffer with consideration of symmetry.

1: Initialize replay buffer $B$ , i ← 0;

2: Observe the state [φ_1,t, φ_2,t, φ_3,t, φ_4,t, φ_5,t, φ_6,t] ^T, take the action $(a_{t}^{1}, a_{t}^{2}, b_{t})$ and obtain the reward r_t and the next state [φ_1,t+1, φ_2,t+1, φ_3,t+1, φ_4,t+1, φ_5,t+1, φ_6,t+1] ^T;

3: if $i = length (B)$ then

4: i ← 0;

5: end if

6: s_t ← [φ_1,t, φ_2,t, φ_3,t, φ_4,t, φ_5,t, φ_6,t] ^T;

7: s_t+1 ← [φ_1,t+1, φ_2,t+1, φ_3,t+1, φ_4,t+1, φ_5,t+1, φ_6,t+1] ^T;

8: $e \leftarrow (s_{t}, a_{t}^{1}, a_{t}^{2}, b_{t}, r_{t}, s_{t + 1})$

9: $B [i] \leftarrow e$ ;

10: i ← i + 1;

11: s_t ← [φ_4,t, φ_5,t, φ_6,t, φ_1,t, φ_2,t, φ_3,t] ^T;

12: s_t+1 ← [φ_4,t+1, φ_5,t+1, φ_6,t+1, φ_1,t+1, φ_2,t+1, φ_3,t+1] ^T;

13: $e \leftarrow (s_{t}, a_{t}^{2}, a_{t}^{1}, b_{t}, r_{t}, s_{t + 1})$

14: $B [i] \leftarrow e$ ;

15: i ← i + 1;

As can be seen in Algorithm 1, once the symmetry is taken into account, the agents generate two experience tuples from each interaction with the environment.
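A minimal Python sketch of such a buffer is given below. The class name `SymmetricReplayBuffer` and its methods are hypothetical, and the write index wraps with a modulo operation rather than the explicit reset in steps 3–5 of Algorithm 1; the stored content is intended to be the same.

```python
import random

class SymmetricReplayBuffer:
    """Ring buffer that stores each experience together with its mirror image."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.index = 0  # next slot to overwrite once the buffer is full

    def _push(self, experience):
        if len(self.data) < self.capacity:
            self.data.append(experience)
        else:
            self.data[self.index] = experience  # drop the oldest entry
        self.index = (self.index + 1) % self.capacity

    def store(self, s, a1, a2, b, r, s_next):
        # Original experience.
        self._push((list(s), a1, a2, b, r, list(s_next)))
        # Mirrored experience: swap the two pursuers' states and actions,
        # keep the evader's action and the reward unchanged.
        s_mirror = list(s[3:]) + list(s[:3])
        s_next_mirror = list(s_next[3:]) + list(s_next[:3])
        self._push((s_mirror, a2, a1, b, r, s_next_mirror))

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

buf = SymmetricReplayBuffer(capacity=4)
buf.store([1, 2, 3, 4, 5, 6], 0.1, 0.2, 0.3, -1.0, [7, 8, 9, 10, 11, 12])
```

Each call to `store` consumes two slots, so a buffer of a given capacity fills after half as many environment steps as it would without the mirrored copies.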

3.3.2 Tuning of the parameters of actors

For the pursuers, the gradient of the actor A^α is:

$\begin{matrix} \nabla_{α} = & \frac{\partial Q^{θ} (s, A^{α} (s^{1}), A^{α} (s^{2}), A^{β} (s))}{\partial α} \\ = & \frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial a^{1}} \frac{\partial a^{1}}{\partial α} \\ + \frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial a^{2}} \frac{\partial a^{2}}{\partial α} \end{matrix}$ (17) As a FIS, the critic’s parameters θ consist of three parts: $m_{l}^{θ}, σ_{l}^{θ}, K_{l}^{θ}$ . Then the Q value can be denoted as:

$\begin{matrix} Q = \sum_{l} [\frac{ω (R_{l}^{θ})}{\sum_{l} ω (R_{l}^{θ})} K_{l}^{θ}] \\ ω (R_{l}^{θ}) = \prod_{i} μ_{R_{l}^{θ}}^{F_{i}^{l θ}} (x_{i}) \end{matrix}$ (18) Here, x_i represents each input of the critic, i.e. x_i∈ { φ₁, φ₂, φ₃, φ₄, φ₅, φ₆, a¹, a², b }. First, the gradient of the Q value with respect to Pursuer I’s steering angle a¹ can be derived using chain rule:

$\frac{\partial Q}{\partial a^{1}} = \sum_{l} \frac{\partial Q}{\partial ω (R_{l}^{θ})} \frac{\partial ω (R_{l}^{θ})}{\partial a^{1}}$ (19) Similarly, the gradient of the Q value with respect to Pursuer II’s steering angle a² is:

$\frac{\partial Q}{\partial a^{2}} = \sum_{l} \frac{\partial Q}{\partial ω (R_{l}^{θ})} \frac{\partial ω (R_{l}^{θ})}{\partial a^{2}}$ (20)

The terms $\frac{\partial ω (R_{l}^{θ})}{\partial a^{1}}$ and $\frac{\partial ω (R_{l}^{θ})}{\partial a^{2}}$ are calculated according to Equ. 3:

$\begin{matrix} \frac{\partial ω (R_{l}^{θ})}{\partial a^{1}} = \frac{\partial}{\partial a^{1}} \exp [- \sum_{i} {(\frac{x_{i} - m_{i}^{l θ}}{σ_{i}^{l θ}})}^{2}] = - 2 \frac{a^{1} - m_{a^{1}}^{l θ}}{{(σ_{a^{1}}^{l θ})}^{2}} ω (R_{l}^{θ}) \\ \frac{\partial ω (R_{l}^{θ})}{\partial a^{2}} = \frac{\partial}{\partial a^{2}} \exp [- \sum_{i} {(\frac{x_{i} - m_{i}^{l θ}}{σ_{i}^{l θ}})}^{2}] = - 2 \frac{a^{2} - m_{a^{2}}^{l θ}}{{(σ_{a^{2}}^{l θ})}^{2}} ω (R_{l}^{θ}) \end{matrix}$ (21) The term $\frac{\partial Q}{\partial ω (R_{l}^{θ})}$ is calculated from Equ. 18:

$\frac{\partial Q}{\partial ω (R_{l}^{θ})} = \frac{\partial}{\partial ω (R_{l}^{θ})} [\frac{\sum_{l} ω (R_{l}^{θ}) K_{l}^{θ}}{\sum_{l} ω (R_{l}^{θ})}] = \frac{K_{l}^{θ} - Q}{\sum_{l} ω (R_{l}^{θ})}$ (22) Combining Equ. (19)–(22), the gradients of Q with respect to a¹ and a² become:

$\begin{matrix} \frac{\partial Q}{\partial a^{1}} = \sum_{l} [\frac{K_{l}^{θ} - Q}{\sum_{l} ω (R_{l}^{θ})} (- 2 \frac{a^{1} - m_{a^{1}}^{l θ}}{{(σ_{a^{1}}^{l θ})}^{2}} ω (R_{l}^{θ}))] \\ \frac{\partial Q}{\partial a^{2}} = \sum_{l} [\frac{K_{l}^{θ} - Q}{\sum_{l} ω (R_{l}^{θ})} (- 2 \frac{a^{2} - m_{a^{2}}^{l θ}}{{(σ_{a^{2}}^{l θ})}^{2}} ω (R_{l}^{θ}))] \end{matrix}$ (23)
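The closed-form gradient in Equ. (23) can be checked numerically against a central finite difference. The sketch below does this for a small hypothetical critic with three rules and two inputs; all names and parameter values are illustrative assumptions, and the same formula applies to any input component of the critic, including a¹, a² and b.

```python
import math

def firing_strengths(x, means, sigmas):
    """Firing strength of every rule for input vector x (Equ. 18)."""
    return [math.exp(-sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, m_l, s_l)))
            for m_l, s_l in zip(means, sigmas)]

def q_value(x, means, sigmas, consequents):
    w = firing_strengths(x, means, sigmas)
    return sum(wl * k for wl, k in zip(w, consequents)) / sum(w)

def dq_dinput(x, j, means, sigmas, consequents):
    """Analytic derivative of Q with respect to input component j (Equ. 23)."""
    w = firing_strengths(x, means, sigmas)
    total = sum(w)
    q = sum(wl * k for wl, k in zip(w, consequents)) / total
    return sum((k - q) / total * (-2.0 * (x[j] - m_l[j]) / s_l[j] ** 2) * wl
               for wl, k, m_l, s_l in zip(w, consequents, means, sigmas))

# Hypothetical critic parameters: three rules, two inputs.
means = [[0.0, -0.5], [1.0, 0.5], [-1.0, 0.2]]
sigmas = [[1.0, 0.8], [0.7, 1.2], [1.5, 1.0]]
consequents = [0.3, -1.0, 2.0]
x = [0.2, 0.1]

analytic = dq_dinput(x, 1, means, sigmas, consequents)
h = 1e-6
numeric = (q_value([x[0], x[1] + h], means, sigmas, consequents)
           - q_value([x[0], x[1] - h], means, sigmas, consequents)) / (2 * h)
```

Agreement between `analytic` and `numeric` is a quick sanity check worth running before the gradient is used to update the actors.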

Second, the gradient of a¹ with respect to α is:

$\frac{\partial a^{1}}{\partial α} = [\begin{matrix} \frac{\partial a^{1}}{\partial m_{i}^{l α}} \\ \frac{\partial a^{1}}{\partial σ_{i}^{l α}} \\ \frac{\partial a^{1}}{\partial K_{l}^{α}} \end{matrix}] = [\begin{matrix} 2 \frac{K_{l}^{α} - a^{1}}{\sum_{k} ω (R_{k}^{α})} ω (R_{l}^{α}) \frac{s_{i}^{1} - m_{i}^{l α}}{(σ_{i}^{l α})^{2}} \\ 2 \frac{K_{l}^{α} - a^{1}}{\sum_{k} ω (R_{k}^{α})} ω (R_{l}^{α}) \frac{(s_{i}^{1} - m_{i}^{l α})^{2}}{(σ_{i}^{l α})^{3}} \\ \frac{ω (R_{l}^{α})}{\sum_{k} ω (R_{k}^{α})} \end{matrix}]$ (24) where $m_{i}^{l α}$ , $σ_{i}^{l α}$ and $K_{l}^{α}$ are the components of the parameters α belonging to rule l and input component i, and $s_{i}^{1}$ is the ith component of s¹.

Similarly, the gradient of a² with respect to α is:

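$\frac{\partial a^{2}}{\partial α} = [\begin{matrix} \frac{\partial a^{2}}{\partial m_{i}^{l α}} \\ \frac{\partial a^{2}}{\partial σ_{i}^{l α}} \\ \frac{\partial a^{2}}{\partial K_{l}^{α}} \end{matrix}] = [\begin{matrix} 2 \frac{K_{l}^{α} - a^{2}}{\sum_{k} ω (R_{k}^{α})} ω (R_{l}^{α}) \frac{s_{i}^{2} - m_{i}^{l α}}{(σ_{i}^{l α})^{2}} \\ 2 \frac{K_{l}^{α} - a^{2}}{\sum_{k} ω (R_{k}^{α})} ω (R_{l}^{α}) \frac{(s_{i}^{2} - m_{i}^{l α})^{2}}{(σ_{i}^{l α})^{3}} \\ \frac{ω (R_{l}^{α})}{\sum_{k} ω (R_{k}^{α})} \end{matrix}]$ (25) where $s_{i}^{2}$ is the ith component of s² and the firing strengths $ω (R_{l}^{α})$ are evaluated at s².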

In practical applications, the gradient in Equ. (17) is estimated from a mini-batch of experiences, denoted as ${s_{i}, a_{i}^{1}, a_{i}^{2}, b_{i}, r_{i}, s_{i}^{'}}_{i = 1}^{N}$ , randomly sampled from the replay buffer, that is:

$\begin{matrix} \nabla_{α} = \\ \frac{1}{N} \sum_{i = 1}^{N} [\frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial a^{1}} |_{s = s_{i}, a^{1} = a_{i}^{1}, a^{2} = a_{i}^{2}, b = b_{i}} \frac{\partial a^{1}}{\partial α} |_{s = s_{i}} \\ + \frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial a^{2}} |_{s = s_{i}, a^{1} = a_{i}^{1}, a^{2} = a_{i}^{2}, b = b_{i}} \frac{\partial a^{2}}{\partial α} |_{s = s_{i}}] \end{matrix}$ (26)

For the evader, similarly, the gradient of Q value with respect to β can be estimated by:

$\begin{matrix} \nabla_{β} = \\ \frac{1}{N} \sum_{i = 1}^{N} [\frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial b} |_{s = s_{i}, a^{1} = a_{i}^{1}, a^{2} = a_{i}^{2}, b = b_{i}} \\ \frac{\partial A^{β} (s)}{\partial β} |_{s = s_{i}}] \end{matrix}$ (27)

Since the goal of the pursuer is to close the distance with the evader, the direction of updating parameters α should be the direction that can minimize the Q value. The parameters α can be optimized by applying the following gradient descent:

$α (t + 1) = α (t) - l_{a} \nabla_{α}$ (28) The goal of the evader is to distance itself from the pursuer. Therefore, the direction of updating parameters β should be the direction that can maximize the Q value. The parameters β can be optimized by applying the following gradient ascent:

$β (t + 1) = β (t) + l_{a} \nabla_{β}$ (29) where l_a is the learning rate of the actors.
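The effect of updating the two actors in opposite directions, as in Equ. (28) and (29), can be illustrated on a toy problem. The quadratic function below is a hypothetical stand-in for the critic, not the paper's Q function, and the scalars a and b stand in for the actor parameters α and β.

```python
def q_toy(a, b):
    """Toy saddle-shaped stand-in for Q: convex in a, concave in b."""
    return a * a - b * b + a * b

def train(a, b, learning_rate, steps):
    for _ in range(steps):
        grad_a = 2.0 * a + b   # dQ/da
        grad_b = a - 2.0 * b   # dQ/db
        # Simultaneous update: descent for the minimizer (Equ. 28),
        # ascent for the maximizer (Equ. 29).
        a, b = a - learning_rate * grad_a, b + learning_rate * grad_b
    return a, b

a_final, b_final = train(1.0, -1.0, 0.1, 200)
```

On this toy function the iterates spiral into the saddle point at a = 0, b = 0, where neither player can improve its outcome alone. Convergence of simultaneous descent and ascent is not guaranteed for general Q functions, which is one reason the stabilizing techniques described in this section matter.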

3.3.3 Tuning of the parameters of critic

The critic network Q^θ (s, a¹, a², b) is learned based on Bellman Equation as in DDPG. The mean square error (MSE) loss function is defined as:

$J (θ) = \frac{1}{N} \sum_{i = 1}^{N} {(Q^{θ} (s_{i}, a_{i}^{1}, a_{i}^{2}, b_{i}) - y_{i})}^{2}$ (30) where $y_{i} = r_{i} + γ Q^{θ^{'}} (s_{i}^{'}, A^{α^{'}} ({s_{i}^{1}}^{'}), A^{α^{'}} ({s_{i}^{2}}^{'}), A^{β^{'}} (s_{i}^{'}))$ . The target y_i also depends on θ, but this dependence can be ignored [15]. The gradient of the loss function with respect to the parameters θ is:

$\begin{matrix} \frac{\partial J}{\partial θ} = \frac{2}{N} \sum_{i = 1}^{N} [(Q^{θ} (s_{i}, a_{i}^{1}, a_{i}^{2}, b_{i}) - y_{i}) \\ \frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial θ} |_{s = s_{i}, a^{1} = a_{i}^{1}, a^{2} = a_{i}^{2}, b = b_{i}}] \end{matrix}$ (31) As for the FISs of the actors, the term $\frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial θ}$ can be expressed as:

$\frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial θ} = [\begin{matrix} \frac{\partial Q}{\partial m_{i}^{l θ}} \\ \frac{\partial Q}{\partial σ_{i}^{l θ}} \\ \frac{\partial Q}{\partial K_{l}^{θ}} \end{matrix}] = [\begin{matrix} 2 \frac{K_{l}^{θ} - Q}{\sum_{k} ω (R_{k}^{θ})} ω (R_{l}^{θ}) \frac{x_{i} - m_{i}^{l θ}}{(σ_{i}^{l θ})^{2}} \\ 2 \frac{K_{l}^{θ} - Q}{\sum_{k} ω (R_{k}^{θ})} ω (R_{l}^{θ}) \frac{(x_{i} - m_{i}^{l θ})^{2}}{(σ_{i}^{l θ})^{3}} \\ \frac{ω (R_{l}^{θ})}{\sum_{k} ω (R_{k}^{θ})} \end{matrix}]$ (32) where $m_{i}^{l θ}$ , $σ_{i}^{l θ}$ and $K_{l}^{θ}$ are the components of the parameters θ belonging to rule l and critic input x_i.

To tune the parameters θ, the following gradient descent algorithm is applied:

$θ (t + 1) = θ (t) - l_{c} \frac{\partial J}{\partial θ}$ (33) where l_c is the learning rate of the critic.
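The target value y_i and the loss in Equ. (30) can be sketched with the critic and the actors passed in as plain callables. The function names and the stand-in lambdas are hypothetical; in the method itself these callables would be the target FISs and the critic FIS.

```python
def td_target(r, s_next, gamma, target_critic, target_pursuer_actor,
              target_evader_actor):
    """y_i of Equ. 30, computed from the next state with the target approximators."""
    a1 = target_pursuer_actor(s_next[:3])   # Pursuer I acts on its own sub-state
    a2 = target_pursuer_actor(s_next[3:])   # Pursuer II shares the same actor
    b = target_evader_actor(s_next)         # the evader sees the whole state
    return r + gamma * target_critic(s_next, a1, a2, b)

def critic_loss(batch, targets, critic):
    """Mean squared error J of Equ. 30 over a mini-batch."""
    errors = [(critic(s, a1, a2, b) - y) ** 2
              for (s, a1, a2, b, _r, _s_next), y in zip(batch, targets)]
    return sum(errors) / len(batch)

# Stand-in approximators, chosen only so the arithmetic is easy to check.
target_critic = lambda s, a1, a2, b: a1 + a2 + b
target_pursuer_actor = lambda sub_state: sum(sub_state)
target_evader_actor = lambda s: 1.0
critic = lambda s, a1, a2, b: 0.5

s_next = [1.0, 0.0, 0.0, 0.0, 2.0, 0.0]
batch = [([0.0] * 6, 0.0, 0.0, 0.0, 0.5, s_next)]
targets = [td_target(exp[4], exp[5], 0.5, target_critic,
                     target_pursuer_actor, target_evader_actor) for exp in batch]
loss = critic_loss(batch, targets, critic)
```

Note that the target is evaluated at the next state of each sampled experience, while the critic being trained is evaluated at the stored state and actions.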

3.3.4 Update of the target FISs

Directly copying the parameters of networks A^α, A^β, Q^θ to the target networks A^α′, A^β′, Q^θ′ might be unstable in some environments [15]. Instead of a direct copy, a soft update method is applied:

$\begin{matrix} θ^{'} (t + 1) = τ θ (t + 1) + (1 - τ) θ^{'} (t) \\ α^{'} (t + 1) = τ α (t + 1) + (1 - τ) α^{'} (t) \\ β^{'} (t + 1) = τ β (t + 1) + (1 - τ) β^{'} (t) \end{matrix}$ (34) where τ is a small positive number.
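The soft update in Equ. (34) amounts to one line per parameter vector. In this sketch the parameters are flat lists of floats and the function name `soft_update` is hypothetical.

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [tau * p + (1.0 - tau) * p_target
            for p, p_target in zip(online_params, target_params)]

# With tau = 0.01 the target moves 1 % of the way toward the online value.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.01)
```

A small τ makes the targets used in the critic loss change slowly, which is the stabilizing effect this subsection relies on.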

3.3.5 Complete algorithm of the proposed Minmax-FDPG

Given all the techniques introduced above, the complete algorithm of the proposed Minmax-FDPG is described in Algorithm 2.

Algorithm 2 Minmax-FDPG Algorithm for the pursuit-evasion problems.

1: Input: Discount factor γ, number of training episodes M, number of training steps T in each episode, batch size N, learning rate for actors l_a and learning rate for critic l_c;

2: Randomly initialize the FISs of the two actors A^α, A^β and of the critic Q^θ;

3: Initialize the target networks A^α′, A^β′, Q^θ′, and α′ ← α, β′ ← β, θ′ ← θ;

4: Initialize replay buffer $B$ ;

5: for j ← 1, . . . , M do

6: Initialize a random noise for exploration $N_{j}$ ;

7: Randomly initialize the state of pursuit-evasion game, observe the state [φ_1,1, φ_2,1, φ_3,1, φ_4,1, φ_5,1, φ_6,1] ^T and denote it as s₁;

8: for t ← 1, . . . , T do

9: $s_{t}^{1} \leftarrow [φ_{1, t}, φ_{2, t}, φ_{3, t}]^{T}$ ;

10: $s_{t}^{2} \leftarrow [φ_{4, t}, φ_{5, t}, φ_{6, t}]^{T}$ ;

11: $a_{t}^{1} \leftarrow A^{α} (s_{t}^{1}) + N_{j}$ ;

12: $a_{t}^{2} \leftarrow A^{α} (s_{t}^{2}) + N_{j}$ ;

13: $b_{t} \leftarrow A^{β} (s_{t}) + N_{j}$ ;

14: Execute $a_{t}^{1}$ , $a_{t}^{2}$ and b_t to the pursuit-evasion game and receive the reward r_t, observe the new state [φ_1,t+1, φ_2,t+1, φ_3,t+1, φ_4,t+1, φ_5,t+1, φ_6,t+1] ^T and denote it as s_t+1;

15: Store transition $(s_{t}, a_{t}^{1}, a_{t}^{2}, b_{t}, r_{t}, s_{t + 1})$ to $B$ ;

16: s_t ← [φ_4,t, φ_5,t, φ_6,t, φ_1,t, φ_2,t, φ_3,t] ^T;

17: s_t+1 ← [φ_4,t+1, φ_5,t+1, φ_6,t+1, φ_1,t+1, φ_2,t+1, φ_3,t+1] ^T;

18: Store transition $(s_{t}, a_{t}^{2}, a_{t}^{1}, b_{t}, r_{t}, s_{t + 1})$ to $B$ ;

19: Randomly sample N transitions as a mini-batch ${(s_{i}, a_{i}^{1}, a_{i}^{2}, b_{i}, r_{i}, s_{i}^{'})}_{i = 1}^{N}$ ;

20: Set $y_{i} \leftarrow r_{i} + γ Q^{θ^{'}} (s_{i}^{'}, A^{α^{'}} ({s_{i}^{1}}^{'}), A^{α^{'}} ({s_{i}^{2}}^{'}), A^{β^{'}} (s_{i}^{'}))$ ;

21: Update Q^θ by minimizing the loss: $\frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - Q^{θ} (s_{i}, a_{i}^{1}, a_{i}^{2}, b_{i}))}^{2}$

22: Calculate the gradient for pursuers’ actor: $\begin{matrix} \begin{matrix} \nabla_{α} \leftarrow \frac{1}{N} \sum_{i = 1}^{N} [{\frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial a^{1}} |}_{s = s_{i}, a^{1} = a_{i}^{1}, a^{2} = a_{i}^{2}, b = b_{i}} {\frac{\partial A^{α} (s^{1})}{\partial α} |}_{s^{1} = s_{i}^{1}} \\ + {\frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial a^{2}} |}_{s = s_{i}, a^{1} = a_{i}^{1}, a^{2} = a_{i}^{2}, b = b_{i}} {\frac{\partial A^{α} (s^{2})}{\partial α} |}_{s^{2} = s_{i}^{2}}] \end{matrix} \end{matrix}$

23: Calculate the gradient for evader’s actor: $\nabla_{β} \leftarrow \frac{1}{N} \sum_{i = 1}^{N} [{\frac{\partial Q^{θ} (s, a^{1}, a^{2}, b)}{\partial b} |}_{s = s_{i}, a^{1} = a_{i}^{1}, a^{2} = a_{i}^{2}, b = b_{i}} {\frac{\partial A^{β} (s)}{\partial β} |}_{s = s_{i}}]$

24: Update the actors of both players: α ← α - l_a ∇ _α, β ← β + l_a ∇ _β;

25: Update the target networks: θ′ ← τθ + (1 - τ) θ′, α′ ← τα + (1 - τ) α′, β′ ← τβ + (1 - τ) β′;

26: end for

27: end for

28: Output: Control policy for Pursuer I a¹ = A^α (s¹), control policy for Pursuer II a² = A^α (s²) and control policy for evader b = A^β (s);

4 Numerical examples

In this section, after transforming the pursuit-evasion problem into an MDP, the Minmax-FDPG method is applied to train the control policies of the pursuers and the evader. The training is conducted on a platform with an Intel Core i7-7800X CPU at a 4.10 GHz clock frequency and 32 GB of RAM. The adaptive moment estimation (ADAM) method is used to optimize the FISs. In the pursuers’ actor FIS, there are three fuzzy sets for each of the three input components, making a total of nine. In the evader’s actor FIS, there are 18 fuzzy sets for the six input components. Since the outputs of the actors are bounded above and below, a hyperbolic tangent function is used as the activation function of each actor. In the critic, each of the nine input components has three fuzzy sets, 27 in total. The output of each rule in each FIS is initialized to zero. The other hyperparameters in this example are listed in Table 1, and the settings of the pursuit-evasion game are outlined in Table 2. At the beginning of each episode, the positions of the pursuers are randomly initialized in the area [-3, 3] × [-3, 3], and their orientations are randomly initialized in the interval [- π, π). An episode is terminated once one of the pursuers catches the evader or the time step reaches 600. The training process starts when the replay buffer is full. Replay buffers with and without the consideration of symmetry are tested. It is assumed that neither the pursuers nor the evader know anything about the optimal control policies; their initial control policies are randomly generated.

Table 1
Hyperparameters in simulation

Parameters	Value	Description
γ	0.99	Discount factor of future reward
τ	0.01	Soft copy factor
Mini-batch size	32	—
Replay buffer size	10000	—
M	500	Episode number
T	600	Training steps for each episode
l_a	0.001	Learning rate for actors
l_c	0.001	Learning rate for critic
β₁	0.9	Exponential decay in ADAM
β₂	0.999	Exponential decay in ADAM
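For reference, a single ADAM update with the learning rate and decay factors of Table 1 can be sketched as follows. This is the standard scalar form of the optimizer, not code from the paper; the function name and the small constant eps = 1e-8 are assumptions.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a scalar parameter; returns the new (param, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=2.0, m=m, v=v, t=1)
print(param)   # the first step moves by roughly lr, regardless of the gradient scale
```

For the evader's actor, which ascends the Q value, the same routine could be called with the negated gradient.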

Table 2

Pursuit-evasion game setup

Parameters	Value	Description
(x₀, y₀)	(0, 0)	Initial position of the evader
Θ_e0	0	Initial orientation of the evader
V_p	1 m/s	Velocity of the pursuer
L	0.3 m	Distance from the front axle to the rear axle
a_max	0.5	Maximum steering angle of the pursuer
V_e	0.5 m/s	Velocity of the evader
b_max	1	Maximum steering angle of the evader
Δt	0.1	Time step size
d_capture	0.1 m	Capture radius
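The settings in Table 2 can be collected into a small configuration object, as sketched below. The field names are illustrative. The kinematic update is an assumption (a standard rear-axle car model driven by a steering angle) and may differ in detail from the model used in the problem formulation; only the capture radius and the 600-step limit are taken directly from this section.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GameConfig:
    """Values copied from Table 2; the field names are illustrative."""
    v_pursuer: float = 1.0      # m/s
    v_evader: float = 0.5       # m/s
    wheelbase: float = 0.3      # m
    a_max: float = 0.5          # pursuer steering limit
    b_max: float = 1.0          # evader steering limit
    dt: float = 0.1             # time step size
    d_capture: float = 0.1      # m
    max_steps: int = 600        # episode length limit from the text


def car_step(x, y, theta, steer, speed, cfg):
    """One Euler step of an ASSUMED kinematic car model (may differ from the paper's)."""
    x += speed * math.cos(theta) * cfg.dt
    y += speed * math.sin(theta) * cfg.dt
    theta += speed / cfg.wheelbase * math.tan(steer) * cfg.dt
    theta = (theta + math.pi) % (2.0 * math.pi) - math.pi   # wrap to [-pi, pi)
    return x, y, theta


def episode_done(pursuers, evader, step, cfg):
    """True if any pursuer is within the capture radius or the step limit is reached."""
    ex, ey = evader[0], evader[1]
    captured = any(math.hypot(px - ex, py - ey) <= cfg.d_capture for px, py, _ in pursuers)
    return captured or step >= cfg.max_steps


cfg = GameConfig()
pose = car_step(0.0, 0.0, 0.0, steer=0.0, speed=cfg.v_pursuer, cfg=cfg)
print(pose)
```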

4.1 Training results

Fig. 6 displays the changes in total reward and loss function with the number of training episodes. The training starts at the 34th episode. Both with and without the consideration of symmetry, the total reward decreases to a small value and remains stable after 400 episodes, and the loss function decreases as the number of training episodes increases. However, the total reward and the loss function decrease faster when symmetry is considered, which indicates that considering symmetry in the replay buffer effectively speeds up training. The trajectories of all players after 50 training episodes and 500 training episodes are displayed in Fig. 7 and Fig. 8, respectively.

Fig. 6

Total reward and loss function.

Fig. 7

Trajectories after 50 training episodes.

Fig. 8

Trajectories after 500 training episodes.

The initial positions of all players are marked with hollow circles and the terminal positions are marked with crosses. It can be seen that both Pursuer I and Pursuer II fail to catch the evader after 50 training episodes, whereas at least one pursuer successfully catches the evader, starting from a different initial position, after 500 training episodes. Fig. 9 and Fig. 10 show the membership functions in the FISs of the evader's and pursuers' actors, respectively, after tuning with the proposed Minmax-FDPG algorithm. Note that the FIS of the pursuers’ actor has three inputs ([φ₁, φ₂, φ₃]^T for Pursuer I or [φ₄, φ₅, φ₆]^T for Pursuer II), and the FIS of the evader’s actor has six inputs. The symbols "N", "Z" and "P" in the figures represent the linguistic values "Negative", "Zero" and "Positive", respectively.

Fig. 9

Membership functions in the FIS of the evader’s actor.

Fig. 10

Membership functions in the FIS of the pursuers’ actor.
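To make the role of the linguistic values "N", "Z" and "P" concrete, the following sketch evaluates a zero-order Takagi-Sugeno FIS with three fuzzy sets per input. Several choices here are assumptions made only for illustration: Gaussian membership functions, the centers and width, product inference over a full rule grid (3³ = 27 rules for three inputs), and weighted-average defuzzification. The tuned membership functions are those shown in the figures, not the ones used below.

```python
import itertools
import math


def gaussian(x, center, sigma):
    """Gaussian membership degree of x (the Gaussian shape is an assumption)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def fis_output(inputs, centers, sigma, rule_outputs):
    """Zero-order Takagi-Sugeno inference: product firing strengths, weighted average."""
    # Membership degree of every input in every fuzzy set (N, Z, P).
    degrees = [[gaussian(x, c, sigma) for c in centers] for x in inputs]
    num, den = 0.0, 0.0
    # One rule per combination of fuzzy sets across the inputs.
    for rule in itertools.product(range(len(centers)), repeat=len(inputs)):
        strength = math.prod(degrees[i][k] for i, k in enumerate(rule))
        num += strength * rule_outputs[rule]
        den += strength
    return num / den


centers = [-1.0, 0.0, 1.0]                      # hypothetical centers for N, Z, P
rules = list(itertools.product(range(3), repeat=3))
rule_outputs = {rule: 0.0 for rule in rules}    # zero initialization, as in Section 4
print(len(rules), fis_output([0.2, -0.4, 0.9], centers, 0.5, rule_outputs))
```

With all rule outputs at zero the FIS returns zero for any input; training then shapes the output by adjusting the rule outputs and membership functions.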

4.2 Confrontation between different control policies

In the proposed Minmax-FDPG method, the control policies of both sides of the game are improved simultaneously. However, in a zero-sum differential game such as the pursuit-evasion problem, the goals of the players are opposite, so the changes in total reward and loss function with the number of training episodes cannot demonstrate the effects of training comprehensively. A more persuasive verification approach is to conduct confrontations between control policies saved after different numbers of training episodes.

During the training, the actors of the pursuers and the evader are saved every 50 episodes. The collections of alternative actors for the pursuers and the evader are denoted as:

$\begin{matrix} A_{p} = \{A_{50}^{α}, A_{100}^{α}, \ldots, A_{500}^{α}\} \\ A_{e} = \{A_{50}^{β}, A_{100}^{β}, \ldots, A_{500}^{β}\} \end{matrix}$ (35)

In the preceding equation, the subscript of each actor indicates its number of training episodes.

We run 100 simulations for the confrontation between each pair of policies, and take the mean values of the total reward and the capture time over these simulations as the result of the confrontation. The initial state and termination condition of each simulation are set as in the training process. The results of these confrontations are shown in Fig. 11. The rows and columns of the two matrices displayed in Fig. 11 correspond to the policies taken by the pursuers and the evader, respectively. As shown in the figure, the total reward and capture time roughly decrease along the rows and increase along the columns. This means that the performance of the policies of both the pursuers and the evader improves as the number of training episodes increases.
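The confrontation procedure can be sketched as a double loop over saved checkpoints. In the snippet below, `confrontation_matrix` and `dummy_simulate` are hypothetical names; `dummy_simulate` is only a placeholder that returns made-up numbers so the loop can run, and it stands in for one full simulated game between a pursuer policy and an evader policy. It does not reproduce any result from Fig. 11.

```python
import random
from statistics import mean


def confrontation_matrix(pursuer_policies, evader_policies, simulate, n_runs=100):
    """Average (total reward, capture time) over n_runs for every policy pair."""
    rewards, times = [], []
    for p in pursuer_policies:            # rows: pursuer checkpoints
        reward_row, time_row = [], []
        for e in evader_policies:         # columns: evader checkpoints
            results = [simulate(p, e) for _ in range(n_runs)]
            reward_row.append(mean(r for r, _ in results))
            time_row.append(mean(t for _, t in results))
        rewards.append(reward_row)
        times.append(time_row)
    return rewards, times


rng = random.Random(0)


def dummy_simulate(pursuer_episodes, evader_episodes):
    """Placeholder for one simulated game; returns a made-up (reward, capture_time)."""
    skill_gap = (evader_episodes - pursuer_episodes) / 500.0
    capture_time = 30.0 + 20.0 * skill_gap + rng.uniform(-1.0, 1.0)
    return capture_time, capture_time     # toy choice: reward equals capture time


checkpoints = list(range(50, 501, 50))    # policies saved every 50 episodes
rewards, times = confrontation_matrix(checkpoints, checkpoints, dummy_simulate, n_runs=100)
print(len(rewards), len(rewards[0]))
```

Here the checkpoints are represented by their episode counts; in practice each entry would be a saved actor FIS.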

Fig. 11

Results of confrontations.

5 Conclusions

This paper proposes a novel RL framework, referred to as Minmax-FDPG, for solving the pursuit-evasion problem. The proposed method is a modification of the DDPG algorithm, which is an AC-based method. The structure of the proposed algorithm consists of two actors and one critic, where the actors represent the control policies of the different players. To give the learning result a practical meaning, FISs rather than neural networks are applied to approximate the control policies and the Q value function. The parameters in the FISs are tuned by the temporal difference method. During the training process, the parameters in the FISs of the different players’ actors are updated in opposite directions simultaneously: the pursuer's actor is updated in the direction that minimizes the Q value, while the evader's actor is updated in the direction that maximizes the Q value. A pursuit-evasion problem with two pursuers and one evader is taken as an example to illustrate the validity of the proposed algorithm. First, the pursuit-evasion problem is transformed into an MDP. Then the proposed Minmax-FDPG algorithm is adopted to optimize the actors and the critic. During the training, the symmetry in the problem is used to improve the replay buffer. To demonstrate the effects of training comprehensively, the FISs of the actors are saved every 50 episodes. After the training is complete, the specific practical meaning of each linguistic fuzzy rule is obtained. Lastly, several confrontations between the control policies with different training episodes and the analytical optimal control policies are conducted to show how the total reward and capture time change with the number of training episodes of the policies on both sides of the game.

The results indicate that, after training, the pursuers and the evader learn relatively good policies, and that the performance of the policies of both sides improves as the number of training episodes increases. In addition, considering the symmetry in the replay buffer significantly accelerates the training process.

In future work, more complex FIS structures could be used to solve more practical differential games or pursuit-evasion problems with larger numbers of pursuers and evaders.

Acknowledgments

The authors gratefully acknowledge support from National Defense Outstanding Youth Science Foundation (Grant No. 2018-JCJQ-ZQ-053), and Central University Basic Scientific Research Operating Expenses Special Fund Project Support (Grant No. NF2018001). Also, the authors would like to thank the anonymous reviewers, associate editor, and editor for their valuable and constructive comments and suggestions.

References

1. Isaacs, Differential games: a mathematical theory with applications to warfare and pursuit, control and optimization (1965).

2. Ho, Bryson and Baron, Differential games and optimal pursuit-evasion strategies, IEEE Transactions on Automatic Control 10(4) (1965), 385–389.

3. Liubarshchuk and Althoefer, The problem of approach in differential–difference games, International Journal of Game Theory 45 (2015).

4. Makkapati V.R., Sun and Tsiotras, Optimal evading strategies for two-pursuer/one-evader problems, Journal of Guidance, Control, and Dynamics 41(4) (2018), 851–862.

5. Lim S.H., Furukawa, Dissanayake and Durrant-Whyte H.F., A time-optimal control strategy for pursuit-evasion games problems, in: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA ’04), vol. 4 (2004), 3962–3967.

6. Sun, Tsiotras, Lolla, Subramani D.N. and Lermusiaux P.F.J., Pursuit-evasion games in dynamic flow fields via reachability set analysis, in: 2017 American Control Conference (ACC) (2017), 4595–4600.

7. Wang, Dong and Sun, Cooperative control for multi-player pursuit-evasion games with reinforcement learning, Neurocomputing 412 (2020), 101–114.

8. Liang, Wang, Liu and Liu, Guidance strategies for interceptor against active defense spacecraft in two-on-two engagement, Aerospace Science and Technology 96 (2020), 105529.

9. Staddon J.E.R., The dynamics of behavior: Review of Sutton and Barto: Reinforcement learning: An introduction (2nd ed.), Journal of the Experimental Analysis of Behavior 113(2) (2020), 485–491.

10. Liu, Liu, Wu and Zhang, A pursuit-evasion algorithm based on hierarchical reinforcement learning, in: International Conference on Measuring Technology and Mechatronics Automation, vol. 2 (2009), 482–486.

11. Kiumarsi, Vamvoudakis K.G., Modares and Lewis, Optimal and autonomous control using reinforcement learning: A survey, IEEE Transactions on Neural Networks and Learning Systems PP (2017), 1–21.

12. Jia, Wang and Shen, A continuous-time Markov decision process-based method with application in a pursuit-evasion example, IEEE Transactions on Systems, Man, and Cybernetics: Systems 46 (2015), 1–11.

13. Wang H.-N., Liu, Zhang Y.-Y., Feng D.-W., Huang, Li D.-S. and Zhang, Deep reinforcement learning: a survey, Frontiers of Information Technology & Electronic Engineering (2020).

14. Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra and Riedmiller, Playing Atari with deep reinforcement learning (2013).

15. Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver and Wierstra, Continuous control with deep reinforcement learning, CoRR (2015).

16. Wang, Wang and Yue, A fuzzy deterministic policy gradient algorithm for pursuit-evasion differential games, Neurocomputing 362 (2019), 106–117.

17. Desouky S.F. and Schwartz H.M., Self-learning fuzzy logic controllers for pursuit–evasion differential games, Robotics and Autonomous Systems 59(1) (2011), 22–33.

18. Awheda and Schwartz, A residual gradient fuzzy reinforcement learning algorithm for differential games, International Journal of Fuzzy Systems 19 (2017), 1058–1076.

19. Zhou, van Kampen E.-J. and Chu, Hybrid hierarchical reinforcement learning for online guidance and navigation with partial observability, Neurocomputing 331 (2019), 443–457.

20. Grondman, Busoniu, Lopes and Babuska, A survey of actor-critic reinforcement learning: Standard and natural policy gradients, IEEE Transactions on Systems Man and Cybernetics Part B-Cybernetics 42 (2012), 1291–1307.

21. Vamvoudakis K.G. and Lewis, Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica 46 (2010), 878–888.

22. Silver, Lever, Heess, Degris, Wierstra and Riedmiller, Deterministic policy gradient algorithms, in: Proceedings of the 31st International Conference on Machine Learning, Volume 32, ICML’14, JMLR.org (2014), I-387–I-395.

23. Song, Wei and Song, Neural-network-based synchronous iteration learning method for multi-player zero-sum games, Neurocomputing 242 (2017), 73–82.

24. Jouffe, Fuzzy inference system learning by reinforcement methods, IEEE Transactions on Systems, Man, and Cybernetics, Part C 28(3) (1998), 338–355.

25. Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, Petersen, Beattie, Sadik, Antonoglou, King, Kumaran, Wierstra, Legg and Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015), 529–533.

26. Riedmiller, Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method, in: Machine Learning: ECML 2005 (2005), 317–328.

27. Watkins J.C.H. and Dayan, Q-learning, Machine Learning 8 (1992), 279–292.

28. Gu, Lillicrap, Sutskever and Levine, Continuous deep Q-learning with model-based acceleration (2016).

29. Wang, Schaul, Hessel, Hasselt Hado V., Lanctot and Freitas N.D., Dueling network architectures for deep reinforcement learning, in: Proceedings of the 33rd International Conference on Machine Learning, Volume 48, ICML’16, JMLR.org (2016), 1995–2003.

30. Sutton, Mcallester, Singh and Mansour, Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems 12 (2000), 1057–1063.

31. Prokhorov D.V. and Wunsch, Adaptive critic design, IEEE Transactions on Neural Networks 8 (1997), 997–1007.

32. Takagi Tomohiro and Sugeno Michio, Fuzzy identification of systems and its applications to modeling and control, IEEE Transactions on Systems, Man, and Cybernetics 15 (1985), 116–132.

33. Bravo, Ruiz and Murrieta-Cid Rafael, A pursuit–evasion game between two identical differential drive robots, Journal of the Franklin Institute 357(10) (2020), 5773–5808.

34. Akametalu, Ghosh, Fisac and Tomlin, A minimum discounted reward Hamilton-Jacobi formulation for computing reachable sets (2018).