Two-phase selective decentralization to improve reinforcement learning systems with MDP

Abstract

In this paper, we explore the capability of selective decentralization to improve reinforcement learning performance for unknown systems using model-based approaches. In selective decentralization, we automatically select the best communication policies among agents. Our learning design, which is built on control system principles, includes two phases. First, we apply system identification to train an approximate model of the unknown system. Second, we find a suboptimal solution of the Hamilton–Jacobi–Bellman (HJB) equation to derive the suboptimal control. For linear systems, the HJB equation reduces to the well-known Riccati equation, which has a closed-form solution. For nonlinear systems, we discretize the approximate model as a Markov Decision Process (MDP) in order to determine the control using dynamic programming algorithms. Since the theoretical foundation of using an MDP to control a nonlinear system has not been thoroughly developed, we prove that the control law learned by the discrete-MDP approach is guaranteed to stabilize the system, which is the learning goal, given several sufficient conditions. These learning and control techniques can be applied in a centralized, completely decentralized or selectively decentralized manner. Our results show that selective decentralization outperforms the complete decentralization and the centralization approaches when the systems are completely decoupled or strongly interconnected.

Keywords

Decentralized control; Hamilton–Jacobi–Bellman equation; Markov process; multi-agent systems

1. Introduction

1.1. Overview of decentralized reinforcement learning

To deal with the complexity of AI systems, decentralization and multi-agent learning have been among the major approaches in reinforcement learning. Decentralization decouples the entire system’s state variables into subsystems using domain knowledge or partition techniques and assigns an agent to each subsystem. Each agent is responsible for learning the optimal control strategy for its assigned subsystem. With decentralization, the learning algorithms operate on fewer state variables and are less susceptible to uncertain system parameters [25]. In addition, decentralization makes the system more adaptive to structural changes than the corresponding centralized system [55]. Another benefit of decentralization is that if one agent fails in learning, the other agents can compensate for it in the overall learning problem, resulting in only graceful degradation of performance [13]. Although decentralization is a promising approach for large-scale reinforcement learning, it is likely to suffer from instability in the presence of interconnections among subsystems, regardless of the interconnection strength [20,25].

To overcome the stability issue, one of the key questions in decentralized learning is how to set up a communication policy among the learning agents. The question of which communication policy to use is still open because the number of communication policies grows as the Bell number, which is faster than exponential [49]. To the best of our knowledge, there are two classical approaches to designing the communication policy in decentralized learning: partial communication and multi-model switching. In partial communication, each agent is responsible for selecting the other agents to communicate with, depending on the agent’s state variables and communication costs [40]. Some recent state-of-the-art techniques in partial communication demonstrate how each agent decides the communication in Q-learning problems [3,4,57], partially ordered subsystems [54], fuzzy logic systems [23,48] and probabilistic control sharing systems [36]. In multi-model switching, the entire system has K policies that allow the agents to communicate, and a central communicator is responsible for switching the communication policy depending on the resulting performance [6,21,41,42]. The criteria for switching policies may also depend on domain-specific optimization of the problem, such as the power efficiency function in energy systems [11,35] and aerodynamic performance in hypersonic vehicle systems [19]. In addition, communication among agents depends on the characteristics of the tasks, or the final goals, of the entire system. From this perspective, the communication policy and learning algorithms can be categorized into fully cooperative tasks, explicit coordination mechanisms, fully competitive tasks and mixed tasks [13]. Although the communication policy problem has been broadly explored, the existing solutions still require full or partial knowledge about the agents’ connectivity and operating regimes. Other practical questions in decentralization are how to create and justify the subsystem decompositions, and how fast the decentralized learning algorithms converge.
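To give a feel for the growth rate mentioned above, the following minimal sketch computes the first Bell numbers with the Bell triangle. It assumes that a communication policy can be read as a partition of the agents into communicating groups, which is what a Bell number counts; the function name `bell_numbers` is introduced here for illustration only.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], the numbers of set partitions, via the Bell triangle."""
    bells = [1]          # B_0 = 1
    row = [1]            # first row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


# Number of ways to group n agents, for n = 0..10.
print(bell_numbers(10))
```

Under this reading, ten agents already admit 115975 groupings, which is why exhaustive search over communication policies quickly becomes impractical.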

From the theoretical point of view, a reinforcement learning problem can be considered an adaptive control problem [33], in which solving the Hamilton–Jacobi–Bellman (HJB) equation is the theoretical key in both reinforcement learning and control system theory. Most decentralization techniques focus on learning linear systems [20,55], in which the centralized and decentralized systems can be uniformly represented in matrix form. For a linear system, the HJB equation becomes the well-known Riccati equation, which has a complete solution [9]. However, in most real-world cases, the system is nonlinear and a closed-form solution of the HJB equation is very difficult to find. Solving the nonlinear HJB equation in a decentralized manner is even more difficult. Therefore, researchers have focused on approximation methods to tackle the nonlinear HJB equation, such as [1,24,47,52]. Generally, these efforts focus on feedback-linearizable nonlinear systems, for which a closed-form solution of the approximated HJB equation has been found [31]. Theoretically, the HJB equation can be solved with dynamic programming [16]. Therefore, a simple idea is to discretize the nonlinear system, convert it into a Markov Decision Process (MDP) and solve it by the policy iteration algorithm [50]. Such discretization of continuous-state nonlinear control systems has been studied in [27,38,39]. Results on MDP convergence for decentralized learning in Markov systems have been derived in [14,58]. In addition, the matrix properties of the MDP support the representation of decentralized learning and control. With this discretization approach, we successfully solved the nonlinear control problem in several case studies. However, to our knowledge, the theoretical proof of the existence and approximation of the MDP’s solution for the general-form HJB equation has not been widely explored.

In addition, adaptive control and reinforcement learning face another problem due to the unknown nature of the systems. This problem can be tackled by system identification techniques. System identification constructs an approximation that models the dynamical changes of the system and environment [47]. For linear system identification, gradient descent is one of the most robust methods, as shown in [26]. For nonlinear systems, neural networks are among the most well-known approaches for identification. Neural networks are known for their capability to approximate a large and general class of nonlinear functions over compact domains. The theoretical foundation and application of neural networks as universal function approximators in control systems can be found in [1,18,37].

1.2. State-of-the-art techniques and our contributions

Most recent state-of-the-art techniques in decentralized reinforcement learning are primarily extensions and adjustments of single-agent reinforcement learning. For example, in [4,30,44,56], each learning agent applies the Q-learning algorithm to choose its own action in a cooperative learning problem. In [17,43,53], the learning problems are in MDP format and each agent makes its decision by using a policy iteration scheme. In another example, [34] shows how each learning agent can apply its own policy computed by adaptive dynamic programming in a system-stabilizing problem. Overall, in these techniques, each learning agent still makes its own decisions independently. The communication among the agents is represented as an additional term in the learning algorithms. Although these approaches have the advantages of design simplicity and maximal parallel computing, they often require prior knowledge of the communication patterns, which may not be available in some learning problems. In addition, in collaborative learning problems, the notion of ‘collaboration’ is loose in these approaches, as each agent may not specify with whom it should work to make a joint decision.

In this paper, we make two major contributions:

Inspired by the model-switching ideas, we propose the selective decentralization method to learn how to control a system with completely unknown interconnections in a two-phase approach, system identification and control, for fully cooperative tasks. This method also allows the learning agents to learn a suboptimal communication policy when the agents’ connectivity and operating regimes are completely unknown. The fundamental difference between our proposed approach and the state-of-the-art approaches above is the notion of ‘collaboration’. In our approach, when different agents need to work together, they do not make individual decisions. Instead, all of the agents’ states and actions are combined at once, and they function as a single learning unit with a joint decision.

We design a discretized-MDP approach to tackle the nonlinear HJB equation in its most general form, under the assumption that the reinforcement learning system is completely unknown. This is another difference between our approach and some state-of-the-art approaches, in which the learning problems are limited to certain formats, especially feedback-linearizable nonlinear systems. The discretized-MDP approach supports the control phase in the nonlinear-system case. We also provide a theoretical analysis of the necessary conditions for the MDP’s discrete state vector to converge asymptotically to the real continuous state vector. In addition, we prove that the MDP’s solution is guaranteed to stabilize learning systems in general form when the systems satisfy certain conditions.

To our knowledge, the approach of using a decentralized method with system identification and control for unknown systems, especially beyond feedback-linearizable systems, is relatively unexplored. Our focus in this work is on nonlinear systems. However, we include several examples of linear systems to demonstrate how selective decentralization performs in a well-known and well-solved problem. We compare the learning and control performance of our selective decentralization method with the completely decentralized method and the centralized method using simulation studies. We also compare our discretized-MDP algorithm with the adaptive dynamic programming algorithm, one of the most well-studied algorithms, in a feedback-linearizable nonlinear problem, and show that our discretized-MDP algorithm learns faster.

2. Problem statement

In this paper, we focus on discrete-time, continuous-state, time-invariant systems in the general form $$x (t + 1) = f (x (t), u (t)) \tag{1}$$ where $x \in R^{N}$ is the N-dimensional bounded state vector, $u \in R^{M}$ is the M-dimensional bounded control vector, t is the iteration number, $x (0)$ is given and $f : R^{N} \times R^{M} \to R^{N}$ is an unknown, continuously differentiable function. Here, the symmetric bounds $[- χ, χ]$ and $[- μ, μ]$ for all components of x and u, respectively, are known. Let $p : R^{N} \to R$ and $q : R^{M} \to R$ be two continuously differentiable, negative semi-definite reward functions with the following properties $$p (x_{1}) ⩽ p (x_{2}) \Leftrightarrow ‖ x_{1} ‖ ⩾ ‖ x_{2} ‖, \qquad p (0) = 0 \tag{2}$$ $$q (u_{1}) ⩽ q (u_{2}) \Leftrightarrow ‖ u_{1} ‖ ⩾ ‖ u_{2} ‖, \qquad q (0) = 0 \tag{3}$$ where $‖ x ‖$ denotes the Euclidean norm (2-norm) of x. The main objective is to learn the control u such that $$x (t) \to 0, \quad u (t) \to 0 \quad \text{as } t \to \infty \tag{4}$$ To formulate a control or learning problem, we convert the objective in (4) into a more formal control problem with discount factor $0 < γ \to 1$ [44] $$J (x_{0}) = \sum_{t = 0}^{\infty} γ^{t} (p (x (t)) + q (u (t))) \tag{5}$$ Thus, the goal is to optimize $J (x_{0})$. The function $J (x)$ defined in (5) is called the state value function [50]. Since f is unknown, in the model-based approach, the intermediate goal is to find an approximation $\hat{f}$ such that, with the predicted state vector $$\hat{x} (t + 1) = \hat{f} (x (t), u (t)) \tag{6}$$ the identification error $$e (t) = ‖ x (t) - \hat{x} (t) ‖ \tag{7}$$ approaches 0 as $t \to \infty$.
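As an illustration of the discounted objective described around (5), the following minimal sketch evaluates a truncated discounted sum of rewards along one closed-loop trajectory. The scalar dynamics, the quadratic rewards, the feedback gain, the horizon and the value of γ are all assumptions chosen for the example, not quantities taken from this paper.

```python
def discounted_return(f, policy, p, q, x0, gamma=0.95, horizon=200):
    """Truncated discounted sum of rewards along one closed-loop trajectory."""
    x, total, discount = x0, 0.0, 1.0
    for _ in range(horizon):
        u = policy(x)
        total += discount * (p(x) + q(u))
        x = f(x, u)          # one step of the dynamics (1)
        discount *= gamma
    return total


# Hypothetical scalar system and quadratic rewards (assumptions for illustration).
def f(x, u):
    return 0.9 * x + 0.5 * u


def p(x):
    return -x * x            # p(0) = 0 and p decreases as |x| grows, as in (2)


def q(u):
    return -u * u            # q(0) = 0 and q decreases as |u| grows, as in (3)


j_no_control = discounted_return(f, lambda x: 0.0, p, q, x0=1.0)
j_feedback = discounted_return(f, lambda x: -0.8 * x, p, q, x0=1.0)
print(j_no_control, j_feedback)
```

Because the rewards are non-positive, a larger (less negative) value indicates a control law that drives the state and control to zero at lower cost.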

3. Learning the near-optimal control

3.1. Linear system

Consider the linear system $$x (t + 1) = Ax (t) + Bu (t) \tag{8}$$ in which B is a known $N \times M$ matrix and A is an unknown $N \times N$ matrix. Suppose that the reward functions are $p (x) = - x^{T} Qx$ and $q (u) = - u^{T} Ru$, where Q and R are positive-definite matrices. To compute the control vector u, we find the solution P of the Riccati equation [29] $$A^{T} PA - P - A^{T} PB {(B^{T} PB + R)}^{- 1} B^{T} PA + Q = 0 \tag{9}$$ We use the DARE algorithm implemented by Arnold et al. [2] to solve for P. At each iteration, we replace A by the approximator $\hat{A} (t)$ in (9), obtain the solution $\hat{P} (t)$, and compute the control vector $u (t)$ by $$u (t) = - {(R + B^{T} \hat{P} (t) B)}^{- 1} B^{T} \hat{P} (t) \hat{A} (t) x (t) \tag{10}$$ To find the approximator $\hat{A} (t)$, we could apply the techniques in [26].
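For intuition, the scalar case ($N = M = 1$) of the Riccati equation and of the control law can be sketched as below. This is a plain fixed-point (Riccati recursion) iteration written for illustration, not the DARE solver of [2]; the function names and the numerical values of a, b, q and r are assumptions.

```python
def solve_scalar_dare(a, b, q, r, iters=1000, tol=1e-12):
    """Iterate P <- a^2 P - (a P b)^2 / (b^2 P + r) + q until it stops changing."""
    p = q
    for _ in range(iters):
        p_next = a * a * p - (a * p * b) ** 2 / (b * b * p + r) + q
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p


def feedback_gain(a, b, p, r):
    """Scalar form of the control law: u = -gain * x."""
    return (b * p * a) / (r + b * p * b)


# Hypothetical unstable open-loop system (a > 1) with unit weights.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
p_sol = solve_scalar_dare(a, b, q, r)
gain = feedback_gain(a, b, p_sol, r)
residual = a * a * p_sol - p_sol - (a * p_sol * b) ** 2 / (b * b * p_sol + r) + q
closed_loop = a - b * gain
print(p_sol, gain, closed_loop)
```

In the two-phase scheme described above, `a` would be replaced at each iteration by the current estimate of the unknown system parameter before the gain is recomputed.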

3.2. Nonlinear system

Theoretically, the solution of the nonlinear control problem described by (1)–(5) is the solution of the corresponding HJB equation [31]. Since closed-form solutions of nonlinear HJB equations are in general unknown, and since we know the bounds of the state and control vectors, we discretize the state and control vectors to construct an MDP problem close to the underlying nonlinear function. We use the solution of the MDP problem as the near-optimal solution for the nonlinear system (1)–(5). Since the solution of an MDP problem has been extensively studied, for brevity we use the policy iteration algorithm to compute the optimal policy [50]. In this section, we focus on the discretization and on setting up the MDP.

3.2.1. Discretizing the state and control vector space

Let K be the number of intervals into which we uniformly divide each dimension of x and u. The entire state space is therefore divided into $K^{N}$ small hypercubes and the control space into $K^{M}$ small hypercubes. All points inside a hypercube are represented by the center of the hypercube. Points on the borders between two hypercubes are represented by the center of the ‘left’ hypercube. Mathematically, the discretization process is described by $$x [i] \to θ_{x} + χ / K \quad \forall i \in [1, N] \text{ and } x [i] \in [θ_{x}, θ_{x} + 2 χ / K) \tag{11}$$ $$u [i] \to θ_{u} + μ / K \quad \forall i \in [1, M] \text{ and } u [i] \in [θ_{u}, θ_{u} + 2 μ / K) \tag{12}$$ where $θ_{x} \in \{- χ, - χ + 2 χ / K, - χ + 4 χ / K, \dots, χ - 2 χ / K\}$ and $θ_{u} \in \{- μ, - μ + 2 μ / K, - μ + 4 μ / K, \dots, μ - 2 μ / K\}$ are the ‘left’ boundaries of the hypercubes.

Let $δ = \max \{2 χ / K, 2 μ / K\}$. It is easy to see that inside each small hypercube, the largest distance between any two points is bounded by $$\sqrt{δ^{2} + δ^{2} + \dots + δ^{2}} = \sqrt{N δ^{2}} = \sqrt{N} δ \tag{13}$$ in the state space and by $\sqrt{M} δ$ in the control space. The left side of (13) has N terms for the x dimension or M terms for the u dimension. Trivially, $K \to \infty \Leftrightarrow δ \to 0$, which means that the discretization becomes more precise.

From this point on, for any state vector x, we denote by $x_{dis}$ the discretized form of x; for any control vector u, we denote by $u_{dis}$ the discretized form of u. We also denote by $(x_{dis})$ and $(u_{dis})$ the hypercubes whose points are discretized to $x_{dis}$ and $u_{dis}$, respectively. Formally, from (11) and (12), we have $$(x_{dis}) = [x_{dis} (i) - χ / K, x_{dis} (i) + χ / K] \quad \forall i \in \{1, \dots, N\} \tag{14}$$ $$(u_{dis}) = [u_{dis} (i) - μ / K, u_{dis} (i) + μ / K] \quad \forall i \in \{1, \dots, M\} \tag{15}$$

Figure 1 demonstrates the discretization process in a simple two-dimensional space with $K = 7$. Here, each hypercube is a square with side length $2 χ / K$. In each square, every (continuous) state point is represented by the square center. For example, every point in the bottom-left square (shaded), for which $θ_{x} = - χ$, has the discretized form $(- χ + χ / K, - χ + χ / K)$. For points on the square borders, such as the intersection of the four squares at the top right, the discretized form is the center of the neighboring ‘bottom-left’ square.
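The mapping from a continuous vector to the center of its hypercube can be sketched as follows. The helper name `discretize` is hypothetical, and the sketch follows the half-open intervals written in (11), so a point lying exactly on a border is assigned to the cell whose lower boundary it is; clamping the upper bound into the last cell is a further assumption made so that every point of the bounded domain has a cell.

```python
import math


def discretize(v, bound, K):
    """Map each component of v in [-bound, bound] to the center of its cell."""
    width = 2.0 * bound / K                      # cell width 2*chi/K (or 2*mu/K)
    centers = []
    for value in v:
        idx = math.floor((value + bound) / width)
        idx = min(max(idx, 0), K - 1)            # clamp value == bound into the last cell
        centers.append(-bound + idx * width + width / 2.0)
    return centers


# Example with chi = 1 and K = 4, i.e. cells of width 0.5.
print(discretize([-1.0, 0.0, 0.99], bound=1.0, K=4))
```

By construction, each component of the discretized vector differs from the original component by at most half a cell width.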

3.2.2. Setting up the state transition matrix for the MDP problem

The state transition matrix for the MDP problem, which contains all conditional probabilities $P ({x^{'}}_{dis} | x_{dis}, u_{dis})$, has dimension $K^{N} \times K^{N} \times K^{M}$, where ${x^{'}}_{dis}$ denotes the next discrete state reached by executing action $u_{dis}$ at state $x_{dis}$. Let $x^{'} = f (x, u) \in R^{N}$ be the next state vector observed by executing action u at state x. Then, we denote by ${x^{'}}_{dis}$ the discrete form of $x^{'}$. For each triple $({x^{'}}_{dis}, x_{dis}, u_{dis})$, the conditional probability is $$P ({x^{'}}_{dis} | x_{dis}, u_{dis}) = \frac{∭_{(x_{dis}) \times (u_{dis}) \times ({x^{'}}_{dis})} d x \, d u \, d x^{'}}{∭_{(x_{dis}) \times (u_{dis}) \times C} d x \, d u \, d x^{'}} \tag{16}$$ where C is the subspace containing all possible values of $f (x, u)$ $\forall (x, u) \in (x_{dis}) \times (u_{dis})$. In our problem statement, since f is unknown, we replace f by $\hat{f}$, which is approximated by the neural network. Figure 2 illustrates a simple case of this conditional probability when $N = 1$. Although the integral could be approximated by the Monte Carlo method [12], a simpler method to approximate $P ({x^{'}}_{dis} | x_{dis}, u_{dis})$ is as follows.

1. Generate a large number S of points $(x, u)$ following the uniform distribution in $(x_{dis}) \times (u_{dis})$. Here, we emphasize that the computation of $P ({x^{'}}_{dis} | x_{dis}, u_{dis})$ does not use any sample $(x (t), u (t))$. These S points are randomly generated without any prior knowledge of the model to avoid bias.

2. Count the number of points T such that $\hat{f} (x, u) \in ({x^{'}}_{dis})$.

3. Then $T / S \to P ({x^{'}}_{dis} | x_{dis}, u_{dis})$ when $S \to \infty$.
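The sampling procedure above can be sketched for a scalar state and a scalar control as follows. The model `f_hat` used in the example is a hypothetical stand-in for the identified model; the helper names, the sample size, the seed and the choice to clamp out-of-range next states into the boundary cells are assumptions of this sketch.

```python
import math
import random


def cell_index(value, bound, K):
    """Index of the cell of [-bound, bound] (split into K cells) that contains value."""
    width = 2.0 * bound / K
    return min(max(math.floor((value + bound) / width), 0), K - 1)


def estimate_transition_row(f_hat, x_idx, u_idx, chi, mu, K, S=20000, seed=0):
    """Estimate P(x'_dis | x_dis, u_dis) for every next cell by uniform sampling."""
    rng = random.Random(seed)
    wx, wu = 2.0 * chi / K, 2.0 * mu / K
    x_lo, u_lo = -chi + x_idx * wx, -mu + u_idx * wu
    counts = [0] * K
    for _ in range(S):
        x = rng.uniform(x_lo, x_lo + wx)         # uniform point in the state cell
        u = rng.uniform(u_lo, u_lo + wu)         # uniform point in the control cell
        counts[cell_index(f_hat(x, u), chi, K)] += 1
    return [c / S for c in counts]               # T / S for each next cell


# Hypothetical identified model, used only to exercise the estimator.
row = estimate_transition_row(lambda x, u: 0.5 * x + 0.1 * u,
                              x_idx=3, u_idx=2, chi=1.0, mu=1.0, K=4)
print(row)
```

Repeating this for every pair of state cell and control cell fills the transition table used by the dynamic programming step.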

Fig. 1.

State and action space discretization in a simple 2D example.

Fig. 2.

An example of (16) in a one-dimensional state space. $⟨ 1 ⟩$ , the dashed surface, is the numerator in (16). $⟨ 2 ⟩$ , the bold surface, is the denominator of (16).

3.2.3. State value function in MDP problem

By Bellman’s principle of optimality [8], for the solution $u (t)$ of the HJB equation associated with (1)–(5), the value function in (5) satisfies $$J (x (t)) = p (x (t)) + q (u (t)) + \sum_{τ = t + 1}^{\infty} γ^{τ - t} (p (x (τ)) + q (u (τ))) = p (x (t)) + q (u (t)) + γ J (x (t + 1)) \tag{17}$$ Because f is stable at the origin, from (2) and (3), $J (0) = 0$. Since the state value function in the HJB equation (1)–(5) contains a discount factor, we define the corresponding value function in the MDP as $$R (x_{dis} (t)) = p (x_{dis} (t)) + q (u_{dis} (t)) + γ \sum_{\forall {x^{'}}_{dis}} P ({x^{'}}_{dis} | x_{dis}, u_{dis}) R ({x^{'}}_{dis} (t + 1)) \tag{18}$$ and $R (x_{dis}) = 0$ if $(x_{dis})$ contains 0 or has 0 on the boundary.
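A compact sketch of policy iteration on a discretized model is given below. It is a generic textbook-style implementation written for illustration, not the authors' code; the three-cell toy MDP, the reward weights, γ and the number of evaluation sweeps are assumptions.

```python
def policy_iteration(P, reward, gamma=0.95, eval_sweeps=500):
    """P[s][a] lists next-state probabilities; reward[s][a] plays the role of p + q."""
    n_states, n_actions = len(P), len(P[0])
    policy = [0] * n_states
    value = [0.0] * n_states

    def q_value(s, a):
        # Right-hand side of the MDP value recursion for state s and action a.
        return reward[s][a] + gamma * sum(
            P[s][a][s2] * value[s2] for s2 in range(n_states))

    while True:
        # Policy evaluation: repeated sweeps under the current policy.
        for _ in range(eval_sweeps):
            for s in range(n_states):
                value[s] = q_value(s, policy[s])
        # Policy improvement: switch to a strictly better action where one exists.
        stable = True
        for s in range(n_states):
            best = max(range(n_actions), key=lambda a: q_value(s, a))
            if q_value(s, best) > q_value(s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, value


# Hypothetical toy MDP: cell centers -1, 0, 1 and controls -1, 0, 1,
# with deterministic dynamics x' = clip(x + u).
centers = [-1.0, 0.0, 1.0]
actions = [-1.0, 0.0, 1.0]


def next_state(s, a):
    x_next = min(max(centers[s] + actions[a], -1.0), 1.0)
    return centers.index(x_next)


P = [[[1.0 if s2 == next_state(s, a) else 0.0 for s2 in range(3)]
      for a in range(3)] for s in range(3)]
reward = [[-centers[s] ** 2 - 0.1 * actions[a] ** 2 for a in range(3)]
          for s in range(3)]
policy, value = policy_iteration(P, reward)
print(policy, value)
```

In this toy case the learned policy pushes both outer cells toward the cell containing the origin and applies zero control there, so the value of the origin cell is zero.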

4. Analysis of the discretized MDP for near optimal nonlinear control

In this section, we examine several conditions under which the trajectories of the discrete state and control obtained by the discretized MDP method, denoted as $x_{MDP} (t)$ and $u_{MDP} (t)$, converge to $x (t)$ and $u (t)$ as $t \to \infty$. More specifically, we answer the following questions. First, suppose that we know an admissible control $u (t) = g (x (t))$ and discretize this admissible control (without the MDP policy iteration algorithm); what is the bound on $| x (t) - x_{MDP} (t) |$? In the long term, if at any time t the discrete state (computed or sampled by the MDP) is close to the real state (computed by the real system), then the MDP solution will be useful for controlling the real system. Second, without any knowledge of the admissible control, under which conditions can the MDP solution near-optimally stabilize the system? To simplify the analysis, in this section we assume that f is known. Although this assumption contradicts our initial problem statement, it is reasonable given that the neural network, as the function approximator $\hat{f}$, can approximate an arbitrary function given sufficient training samples [1,18,37].

4.1. The autonomous system

Consider the autonomous system $$x (t + 1) = f (x (t)) \tag{19}$$ When we linearize it using Taylor series expansion at a point p in the domain of f, we have $$f (x) \approx f (p) + M (x - p) \tag{20}$$ where M is the matrix of partial derivatives of f with respect to x at p $$M = \left. \begin{bmatrix} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{1}}{\partial x_{2}} & \dots & \frac{\partial f_{1}}{\partial x_{n}} \\ \frac{\partial f_{2}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{2}} & \dots & \frac{\partial f_{2}}{\partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_{n}}{\partial x_{1}} & \frac{\partial f_{n}}{\partial x_{2}} & \dots & \frac{\partial f_{n}}{\partial x_{n}} \end{bmatrix} \right|_{x = p} \tag{21}$$ In this section, we refer to M as the partial derivative matrix in general and to $M_{x}$, where the state appears in the subscript, as the partial derivative matrix at a specific state x.

Suppose that at time t, the region $(x_{MDP} (t))$ contains $x (t)$ as shown in (11). Let $C_{η}$ be the set of all $x (t + η)$ obtained by propagating all points in $(x_{MDP} (t))$ through f for η time steps. Obviously, $C_{η}$ has to be a closed region because it is the image of a closed region under a continuous function. Therefore, there exist two points $x_{1} (t + η)$ and $x_{2} (t + η)$ such that $| x_{1} (t + η) - x_{2} (t + η) |$ is the maximum over all pairs of points in $C_{η}$. There must exist two chains, $x_{1} (t), x_{1} (t + 1), \dots, x_{1} (t + η - 1)$ and $x_{2} (t), x_{2} (t + 1), \dots, x_{2} (t + η - 1)$, such that $x_{1} (t + η) = f (x_{1} (t + η - 1)) = \dots = f^{η} (x_{1} (t))$ and $x_{2} (t + η) = f (x_{2} (t + η - 1)) = \dots = f^{η} (x_{2} (t))$, where $f^{η}$ denotes the η-fold composition of f. Applying the Taylor series expansion, we have $$x_{1} (t + η) - x_{2} (t + η) = f^{η} (x_{1} (t)) - f^{η} (x_{2} (t)) = \frac{\partial f^{η}}{\partial x_{2} (t)} (x_{1} (t) - x_{2} (t)) + O (δ^{2}) \tag{22}$$ Applying the chain rule to $\frac{\partial f^{η}}{\partial x_{2} (t)} (x_{1} (t) - x_{2} (t))$, we have $$x_{1} (t + η) - x_{2} (t + η) = \frac{\partial f}{\partial (x_{2} (t + η - 1))} \times \frac{\partial f}{\partial (x_{2} (t + η - 2))} \times \dots \times \frac{\partial f}{\partial (x_{2} (t))} (x_{1} (t) - x_{2} (t)) + O (δ^{2}) \tag{23}$$ Therefore, $$‖ x_{1} (t + η) - x_{2} (t + η) ‖ ⩽ ‖ M_{x_{2} (t + η - 1)} \times M_{x_{2} (t + η - 2)} \times \dots \times M_{x_{2} (t)} (x_{1} (t) - x_{2} (t)) ‖ \tag{24}$$ where each matrix M is set up according to (21). From (21)–(24), we have the following necessary conditions for $x_{MDP} (t + η)$ to approach $x (t + η)$.

1. If all matrices M generated by (21) have no eigenvalue outside the unit circle in the complex plane, then $x_{MDP} (t + η)$ approaches $x (t + η)$ as $K \to \infty$.

The proof is as follows. Let λ be the eigenvalue with the largest magnitude among all matrices M. Then from (24) $$\begin{aligned} ‖ x_{1} (t + η) - x_{2} (t + η) ‖ & ⩽ ‖ M_{x_{2} (t + η - 1)} \times M_{x_{2} (t + η - 2)} \times \dots \times M_{x_{2} (t)} (x_{1} (t) - x_{2} (t)) ‖ \\ & ⩽ ‖ λ ‖^{η} ‖ x_{1} (t) - x_{2} (t) ‖ \end{aligned} \tag{25}$$ In (13), we showed that the distance between any two points in $(x_{dis})$ cannot be larger than the ‘main diagonal’ $δ \sqrt{N}$. Therefore, $$‖ x_{1} (t + η) - x_{2} (t + η) ‖ ⩽ ‖ λ ‖^{η} ‖ x_{1} (t) - x_{2} (t) ‖ ⩽ ‖ λ ‖^{η} δ \sqrt{N} \tag{26}$$ Since $‖ λ ‖ ⩽ 1$ under the stated condition, $‖ λ ‖^{η}$ remains finite as $η \to \infty$. Therefore, as $K \to \infty$ (equivalently $δ \to 0$), the bound $‖ λ ‖^{η} δ \sqrt{N}$ approaches 0. From the method we used in constructing the MDP, $x_{MDP} (t + η)$ also falls in $C_{η}$. Thus, $‖ x (t + η) - x_{MDP} (t + η) ‖ ⩽ ‖ x_{1} (t + η) - x_{2} (t + η) ‖$ also approaches 0.
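The bound in (26) can be checked numerically on a simple one-dimensional contraction. The map $f(x) = 0.5 \sin(x)$, the starting points and the cell width δ below are assumptions chosen for illustration; for this map the derivative magnitude never exceeds 0.5, so λ = 0.5 and N = 1.

```python
import math


def f(x):
    return 0.5 * math.sin(x)        # |f'(x)| = 0.5 * |cos(x)| <= 0.5


def propagate(x, steps):
    """Apply f repeatedly, i.e. evaluate the steps-fold composition of f at x."""
    for _ in range(steps):
        x = f(x)
    return x


lam, delta = 0.5, 0.03
x1, x2 = 0.70, 0.70 + delta         # two points assumed to lie in the same cell
gaps = [abs(propagate(x1, eta) - propagate(x2, eta)) for eta in range(6)]
bounds = [lam ** eta * delta for eta in range(6)]
for eta, (gap, bound) in enumerate(zip(gaps, bounds)):
    print(eta, gap, bound)
```

Each printed gap stays below the corresponding bound, and both shrink further when δ is reduced, which corresponds to increasing K.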

2. If the system (19) has an asymptotic equilibrium point $x^{*}$ such that the linearized matrix $M_{x^{*}}$ has all eigenvalues inside the unit circle of the complex plane, then $x_{MDP} (t + η)$ approaches $x (t + η)$ as $K \to \infty$.

The proof is as follows. Since the derivative of f is continuous, there must exist a region $C_{ε}$ of size ε around $x^{*}$ such that all of the derivative matrices M in that region have all eigenvalues within the complex unit circle. Let λ be the eigenvalue with the largest magnitude among these matrices. In addition, since (19) has an asymptotic equilibrium point, after a finite time $T$, $x (t)$ must be inside $C_{ε}$. Then, from (24) $$\begin{aligned} ‖ x_{1} (t + η) - x_{2} (t + η) ‖ & ⩽ ‖ M_{x_{2} (t + η - 1)} \times M_{x_{2} (t + η - 2)} \times \dots \times M_{x_{2} (t)} (x_{1} (t) - x_{2} (t)) ‖ \\ & = ‖ \underbrace{M_{x_{2} (t + η - 1)} \times M_{x_{2} (t + η - 2)} \times \dots \times M_{x_{2} (T)}}_{\text{this has } η \text{ factors}} \times \underbrace{M_{x_{2} (T - 1)} \times M_{x_{2} (T - 2)} \times \dots \times M_{x_{2} (t)}}_{\text{this has } T \text{ factors}} (x_{1} (t) - x_{2} (t)) ‖ \\ & ⩽ ‖ λ ‖^{η - T + 1} \times ‖ λ_{T} ‖ \times ‖ λ_{T - 1} ‖ \times \dots \times ‖ λ_{1} ‖ \times ‖ x_{1} (t) - x_{2} (t) ‖ \\ & ⩽ ‖ λ ‖^{η - T + 1} \times ‖ λ_{T} ‖ \times ‖ λ_{T - 1} ‖ \times \dots \times ‖ λ_{1} ‖ \times δ \sqrt{N} \end{aligned} \tag{27}$$ Because λ is within the complex unit circle, $‖ λ ‖^{η - T + 1}$ is finite as $η \to \infty$. The product $‖ λ ‖^{η - T + 1} \times ‖ λ_{T} ‖ \times ‖ λ_{T - 1} ‖ \times \dots \times ‖ λ_{1} ‖$ is also finite since T is finite. Therefore, $‖ λ ‖^{η - T + 1} \times ‖ λ_{T} ‖ \times ‖ λ_{T - 1} ‖ \times \dots \times ‖ λ_{1} ‖ \times δ \sqrt{N}$ approaches 0 as $K \to \infty$ (or $δ \to 0$). From the method we used in constructing the MDP, both $x_{MDP} (t + η)$ and $x (t + η)$ should be bounded by $x_{1} (t + η)$ and $x_{2} (t + η)$, which leads to $‖ x (t + η) - x_{MDP} (t + η) ‖$ approaching 0.

3. For a special case: if the system is asymptotically stable at 0 (regardless of the linearization), then $x_{MDP} (t + η)$ approaches $x (t + η)$ as $K \to \infty$.

The proof of this statement is simpler. For any discretization threshold δ, we can guarantee that the state $x (t)$ will fall inside the region $[- δ, δ]$ at some finite time T and remain in $[- δ, δ]$ $\forall t > T$. This fact implies that, with discretization, the MDP will have an absorbing state specified by the region $[- δ, δ]$. In addition, regardless of the starting states $x (0)$ and $x_{dis} (0)$, there must be a path toward the absorbing state/region. Therefore, the MDP will eventually bring $x_{dis} (t)$ to the absorbing state after some finite time L. Thus, after $\max (T, L)$, both $x_{dis} (t)$ and $x (t)$ will stay inside $[- δ, δ]$. Therefore, $‖ x (t) - x_{MDP} (t) ‖ ⩽ δ$ as $t \to \infty$.

Fig. 3.

The closeness between x (real system) and $x_{MDP}$ (MDP). The left figure corresponds to system (28). The right figure corresponds to system (29).

Fig. 4.

Derivative $\partial f / \partial x$ in system (28) on the left and system (29) on the right.

In Fig. 3 and Fig. 4, we show toy examples of one-dimensional systems to demonstrate the first necessary condition. In these figures, $x_{MDP}$ is computed from the MDP with the sampling method in [15]. The left side is the result for the system $$x (t + 1) = \sin (x (t)) + 0.1 e^{- {(x (t))}^{2}} \tag{28}$$ and the right side is the result for the system $$x (t + 1) = \sin (x (t)) + 1.1 e^{- {(x (t))}^{2}} \tag{29}$$ The state space in both of these systems is $[- 1.5, 1.5]$; the initial state $x (0)$ is 0.5 for both of them; and we discretize the entire state space into $K = 100$ regions. The derivative matrices (21) for systems (28) and (29) are the one-dimensional functions $\cos (x) - 0.2 x e^{- x^{2}}$ and $\cos (x) - 2.2 x e^{- x^{2}}$, respectively. As shown in Fig. 4, where we plot the derivatives of (28) and (29) over the domain $[- 1.5, 1.5]$, system (28) satisfies the first necessary condition, while system (29) does not. We observe that x and $x_{MDP}$ stay close to each other in system (28) but not in system (29).
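The two derivative expressions quoted above can be compared numerically with a short sketch such as the following. It only evaluates the derivative magnitudes on a uniform grid over $[-1.5, 1.5]$ and does not reproduce the figures; the grid size and the helper names are assumptions. Both expressions equal 1 at $x = 0$, so values close to that point sit at the edge of the unit-circle condition.

```python
import math


def d28(x):
    """Derivative of the right-hand side of system (28)."""
    return math.cos(x) - 0.2 * x * math.exp(-x * x)


def d29(x):
    """Derivative of the right-hand side of system (29)."""
    return math.cos(x) - 2.2 * x * math.exp(-x * x)


def max_abs_on_grid(g, lo=-1.5, hi=1.5, n=3001):
    """Largest |g(x)| over n equally spaced grid points in [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return max(abs(g(lo + i * step)) for i in range(n))


max28 = max_abs_on_grid(d28)
max29 = max_abs_on_grid(d29)
print(max28, max29)
```

The much larger derivative magnitude of system (29) is what allows neighboring points in a cell to separate over time.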

4.2. The non-autonomous system

When we linearize the general system (1) using Taylor series expansion at any point $⟨ x, u ⟩ = [p, q]$, we have $$f (x, u) \approx f (p, q) + M_{p} (x - p) + M_{q} (u - q) \tag{30}$$ where $M_{p}$ and $M_{q}$ are the partial derivative matrices of f at $[p, q]$ $$M_{p} = \left. \begin{bmatrix} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{1}}{\partial x_{2}} & \dots & \frac{\partial f_{1}}{\partial x_{n}} \\ \frac{\partial f_{2}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{2}} & \dots & \frac{\partial f_{2}}{\partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_{n}}{\partial x_{1}} & \frac{\partial f_{n}}{\partial x_{2}} & \dots & \frac{\partial f_{n}}{\partial x_{n}} \end{bmatrix} \right|_{x = p, \, u = q} \tag{31}$$ and $$M_{q} = \left. \begin{bmatrix} \frac{\partial f_{1}}{\partial u_{1}} & \frac{\partial f_{1}}{\partial u_{2}} & \dots & \frac{\partial f_{1}}{\partial u_{m}} \\ \frac{\partial f_{2}}{\partial u_{1}} & \frac{\partial f_{2}}{\partial u_{2}} & \dots & \frac{\partial f_{2}}{\partial u_{m}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_{n}}{\partial u_{1}} & \frac{\partial f_{n}}{\partial u_{2}} & \dots & \frac{\partial f_{n}}{\partial u_{m}} \end{bmatrix} \right|_{x = p, \, u = q} \tag{32}$$ Similar to the autonomous system, for the closed region $([x_{MDP} (t), u_{MDP} (t)])$ defined by (11) and (12), including the boundary, that contains $[x (t), u (t)]$, let $C_{η}$ be the set of all $x (t + η)$ obtained by propagating all points in $([x_{MDP} (t), u_{MDP} (t)])$ through f for η time steps. In the region $C_{η}$ containing all possible $x (t + η)$, there exist two points $x_{1} (t + η)$ and $x_{2} (t + η)$ such that $‖ x_{1} (t + η) - x_{2} (t + η) ‖$ is the maximum over all pairs of points in $C_{η}$. There must exist two chains, $[x_{1} (t), u_{1} (t)], [x_{1} (t + 1), u_{1} (t + 1)], \dots, [x_{1} (t + η), u_{1} (t + η)]$ and $[x_{2} (t), u_{2} (t)], [x_{2} (t + 1), u_{2} (t + 1)], \dots, [x_{2} (t + η), u_{2} (t + η)]$, such that $x_{1} (t + η) = f (x_{1} (t + η - 1), u_{1} (t + η - 1)) = f (f (x_{1} (t + η - 2), u_{1} (t + η - 2)), u_{1} (t + η - 1)) = \dots = f^{η} (x_{1} (t), u_{1} (t))$ and $x_{2} (t + η) = f (x_{2} (t + η - 1), u_{2} (t + η - 1)) = f (f (x_{2} (t + η - 2), u_{2} (t + η - 2)), u_{2} (t + η - 1)) = \dots = f^{η} (x_{2} (t), u_{2} (t))$. Applying the Taylor series expansion, we have $$\begin{aligned} x_{1} (t + η) - x_{2} (t + η) & = \frac{\partial f}{\partial (x_{2} (t + η - 1), u_{2} (t + η - 1))} \times ([x_{1} (t + η - 1), u_{1} (t + η - 1)] - [x_{2} (t + η - 1), u_{2} (t + η - 1)]) + O (δ^{2}) \\ & = M_{p, x_{2} (t + η - 1)} (x_{1} (t + η - 1) - x_{2} (t + η - 1)) + M_{q, u_{2} (t + η - 1)} (u_{1} (t + η - 1) - u_{2} (t + η - 1)) + O (δ^{2}) \end{aligned} \tag{33}$$ where $M_{p, x_{2}}$ and $M_{q, u_{2}}$ are the matrices $M_{p}$ (31) and $M_{q}$ (32) evaluated at $[x_{2}, u_{2}]$, respectively.

Suppose that we have an arbitrary control law $u = k (x)$ . Taking the derivative of the control law, we have $Δ u = M_{k} Δ x$ , where $\begin{matrix} (34) & M_{k} = \left. [\begin{matrix} \frac{\partial k_{1}}{\partial x_{1}} & \frac{\partial k_{1}}{\partial x_{2}} & \dots & \frac{\partial k_{1}}{\partial x_{n}} \\ \frac{\partial k_{2}}{\partial x_{1}} & \frac{\partial k_{2}}{\partial x_{2}} & \dots & \frac{\partial k_{2}}{\partial x_{n}} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ \frac{\partial k_{m}}{\partial x_{1}} & \frac{\partial k_{m}}{\partial x_{2}} & \dots & \frac{\partial k_{m}}{\partial x_{n}} \end{matrix}] \right|_{x = p} \end{matrix}$ For any state x, we denote by $M_{k, x}$ the matrix $M_{k}$ evaluated at state x.
Substituting (34) into (33), we have $\begin{array}{ll} & ‖ x_{1} (t + η) - x_{2} (t + η) ‖ \\ & = ‖ M_{p, x_{2} (t + η - 1)} (x_{1} (t + η - 1) - x_{2} (t + η - 1)) + M_{q, u_{2} (t + η - 1)} (u_{1} (t + η - 1) - u_{2} (t + η - 1)) ‖ \\ (35) & ⩽ ‖ (M_{p, x_{2} (t + η - 1)} + M_{q, u_{2} (t + η - 1)} M_{k, x_{2} (t + η - 1)}) (x_{1} (t + η - 1) - x_{2} (t + η - 1)) ‖ \end{array}$ Recursively applying the chain rule to $(x_{1} (t + η - 1) - x_{2} (t + η - 1))$ back to $[x (t), u (t)]$ , with the same argument as from (33) to (35), we have $\begin{array}{ll} & ‖ x_{1} (t + η) - x_{2} (t + η) ‖ \\ & ⩽ ‖ (M_{p, x_{2} (t + η - 1)} + M_{q, u_{2} (t + η - 1)} M_{k, x_{2} (t + η - 1)}) \\ & \quad \times (M_{p, x_{2} (t + η - 2)} + M_{q, u_{2} (t + η - 2)} M_{k, x_{2} (t + η - 2)}) \times \dots \\ & \quad \times (M_{p, x_{2} (t + 1)} + M_{q, u_{2} (t + 1)} M_{k, x_{2} (t + 1)}) \\ (36) & \quad \times (x_{1} (t) - x_{2} (t)) ‖ \end{array}$ From this point, as in the autonomous system, we obtain the conditions under which $x_{MDP} (t + η)$ approaches $x (t + η)$ .

1. If the matrices $M_{p} + M_{q} M_{k}$ generated by (31), (32) and (34) have no eigenvalue outside the unit circle on the complex plane, then $x_{MDP} (t + η)$ approaches $x (t + η)$ as $δ \to 0$ for any η.

2. If the system (1) has an asymptotic equilibrium point p such that the linearized matrix $M_{p} + M_{q} M_{k}$ at the equilibrium point has all eigenvalues inside the unit circle of the complex plane, then $x_{MDP} (t + η)$ approaches $x (t + η)$ as $δ \to 0$ for any η.

We omit the proofs of these two statements since they are nearly identical to the proofs given in the autonomous system section.

Fig. 5.

The closeness between x (real system) and $x_{MDP}$ (MDP). The left figure corresponds to system (37). The right figure corresponds to system (38).

In Fig. 5, we show toy examples of one-dimensional systems to demonstrate the first condition. As in the autonomous system examples, $x_{MDP}$ in this figure is computed from the MDP with the sampling method in [15]. The left panel shows the system and control law $\begin{array}{ll} & x (t + 1) = \sin (x (t)) + u (t) \\ (37) & u (t) = - 0.5 x (t) \end{array}$ and the right panel shows the system and control law $\begin{array}{ll} & x (t + 1) = \sin (x (t)) + u (t) \\ (38) & u (t) = - 2 x (t) \end{array}$ The state space of both systems is $[- 1, 1]$ ; the initial state is $x (0) = 0.5$ for both; and we discretize the entire state space into $K = 100$ regions. In (37), $M_{p} + M_{q} M_{k} = \cos (x (t)) - 0.5$ , which lies within $[0.0403, 0.5]$ . Therefore, (37) meets the first condition. In (38), $M_{p} + M_{q} M_{k} = \cos (x (t)) - 2$ , which lies within $[- 1.4597, - 1]$ , so its magnitude is never below 1. Therefore, (38) does not meet the condition. As shown in Fig. 5, $x_{MDP} (t)$ converges to $x (t)$ in system (37), but not in system (38).
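The slope ranges quoted for (37) and (38) can be checked numerically. The following is a minimal sketch, not the sampling-based MDP of [15]: it only evaluates the closed-loop derivative $\cos (x) + g$ on $[- 1, 1]$ for a feedback gain g, and tracks how the gap between two nearby trajectories evolves. The function names, the grid of 2001 samples and the second starting point 0.51 are assumptions made for illustration.

```python
import math

def slope_range(gain, lo=-1.0, hi=1.0, samples=2001):
    """Min and max of d/dx [sin(x) + gain * x] = cos(x) + gain on [lo, hi]."""
    slopes = [math.cos(lo + (hi - lo) * i / (samples - 1)) + gain
              for i in range(samples)]
    return min(slopes), max(slopes)

def gap_after(gain, steps, x_a=0.5, x_b=0.51):
    """Distance between two closed-loop trajectories after `steps` steps."""
    for _ in range(steps):
        x_a = math.sin(x_a) + gain * x_a
        x_b = math.sin(x_b) + gain * x_b
    return abs(x_a - x_b)

for gain in (-0.5, -2.0):  # control laws (37) and (38)
    low, high = slope_range(gain)
    print(f"gain={gain}: slope in [{low:.4f}, {high:.4f}], "
          f"gap after 10 steps = {gap_after(gain, 10):.3e}")
```

With gain $- 0.5$ the slope magnitude stays below 1 and the gap contracts at every step; with gain $- 2$ the slope magnitude is at least 1 everywhere, so the gap cannot shrink.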

4.3. The existence of an MDP solution that near-optimally stabilizes the system

In this section, we show the existence of the MDP solution when the system (1) is stable at the equilibrium point. Stability is defined as follows: there exists a small positive number ε such that if $‖ x ‖ < ε$ then $‖ f (x, 0) ‖ < ε$ . Under this assumption, when we choose K such that $χ / K < ε$ , the MDP has a special state $x_{MDP}^{*} = 0$ with the following properties:

The MDP’s optimal policy at $x_{MDP}^{*}$ is $u_{MDP}^{*} = 0$ .

All subsequent states in the MDP are also $x_{MDP}^{*}$ .

The proof of these properties is relatively simple due to the properties of the state and action reward functions in (2) and (3), whose optima are at 0. From this stability assumption on f, we prove the following statements.

1. If the system (1) is stable and the HJB equation (1)–(5) has a finite solution as $γ \to 1$ , then in the MDP, $x_{MDP}^{*} (t) = 0$ as $t \to \infty$ .

The proof of this statement is as follows. If the HJB equation (1)–(5) has a finite solution as $γ \to 1$ , then the control function $u (t)$ must be able to bring $x (t)$ to 0 in finite time. Otherwise, the state and action rewards are always negative and their sum diverges as $γ \to 1$ . Since $x (t)$ reaches 0 in finite time, there must exist a path in the MDP that reaches $x_{MDP}^{*}$ with positive probability. One of these paths is the discretization of the HJB solution $u (t)$ . Since policy iteration in the MDP has been proven to converge to the optimal policy [51], this policy cannot be worse than the policy induced by discretizing the HJB equation’s solution. Therefore, under the MDP’s optimal policy, there must exist a path from any state to $x_{MDP}^{*}$ with positive probability $ϕ > 0$ . With an infinite number of visits as $t \to \infty$ , the probability of not reaching $x_{MDP}^{*}$ is at most $\lim_{n \to \infty} {(1 - ϕ)}^{n} = 0$ .
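The last step of the proof can be illustrated numerically. The sketch below assumes, as the argument implicitly does, that each visit independently reaches $x_{MDP}^{*}$ with probability at least ϕ; the function name and the value $ϕ = 0.05$ are hypothetical and chosen only for illustration.

```python
def prob_not_reached(phi, visits):
    """Bound (1 - phi)**visits on the probability of never reaching the target,
    assuming each visit succeeds independently with probability at least phi."""
    if not 0.0 < phi <= 1.0:
        raise ValueError("phi must lie in (0, 1]")
    return (1.0 - phi) ** visits

for n in (1, 10, 100, 1000):
    print(n, prob_not_reached(0.05, n))
```

The bound decays geometrically in the number of visits, so even a small ϕ drives it toward zero.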

2. If all $M_{p}$ matrices (31) have their dominant eigenvalues within the unit circle $\forall x, u$ , and $x_{MDP} (t) = 0$ as $t \to \infty$ in the MDP solution for all starting states $x (0)$ , then applying the MDP’s control $u_{MDP} (t)$ to $x (t)$ yields $‖ x (t) ‖ ⩽ δ \sqrt{N}$ .

The proof of this statement is as follows. Since we apply $u_{MDP} (t)$ for all $x (t)$ in the $(x_{MDP} (t))$ region, the difference in the control cancels. Thus, equation (30) becomes $\begin{matrix} (39) & f (x, u) \approx f (p, q) + M_{p} (x - p) \end{matrix}$ Following the same argument from (31) to (34), we have $\begin{array}{ll} & ‖ x_{1} (t + η) - x_{2} (t + η) ‖ \\ & ⩽ ‖ (M_{p, x_{2} (t + η - 1)}) \times (M_{p, x_{2} (t + η - 2)}) \times \dots \\ (40) & \quad \times (M_{p, x_{2} (t + 1)}) (x_{1} (t) - x_{2} (t)) ‖ \end{array}$ Because the dominant eigenvalues of $M_{p}$ are within the unit circle, from (40), we have $\begin{array}{ll} & ‖ x (t) - x_{MDP} (t) ‖ \\ & ⩽ ‖ x_{1} (t + η) - x_{2} (t + η) ‖ \\ (41) & ⩽ ‖ x_{1} (t) - x_{2} (t) ‖ ⩽ δ \sqrt{N} \end{array}$ Therefore, if $x_{MDP} (t) \to 0$ , then $‖ x (t) ‖ ⩽ δ \sqrt{N}$ .

5. Learning control system with selective decentralization approach

5.1. Statement of selective decentralization

Let us rewrite system (1) as $\begin{matrix} (42) & Σ : x (t + 1) = f [x (t), u (t), θ] \end{matrix}$ where θ is an unknown parameter vector in $R^{N}$ . In the identification phase, the intermediate objective is to estimate θ using measurements of the overall system. In the problem of interest to us, the system is assumed to consist of r interconnected subsystems of low dimension. However, how these subsystems interconnect is unknown. If the state vectors of the subsystems $Σ_{1}, Σ_{2}, \dots, Σ_{r}$ are respectively $x_{1}, x_{2}, \dots, x_{r}$ , it is assumed that each subsystem can be described by the difference equation $\begin{matrix} (43) & Σ_{i} : x_{i} (t + 1) = f_{i} [x_{i} (t), u_{i} (t), θ_{i}] + σ_{i} [z_{i} (t)] \end{matrix}$ where the term $σ_{i}$ is assumed to be small, and $[x_{i}, z_{i}] = x^{T}$ (i.e., the elements of $z_{i}$ are state variables not contained in $x_{i}$ ). A decentralized approximated model can be set up as $\begin{matrix} (44) & \hat{x_{i}} (t + 1) = \hat{f_{i}} [x_{i} (t), z_{i} (t), u (t), θ (t)] \end{matrix}$ To be more specific, for the linear system, the decentralized model has the form $\begin{matrix} (45) & \hat{A} = [\begin{matrix} \hat{A_{1}} & \hat{a_{1, 2}} & \dots & \hat{a_{1, r}} \\ \hat{a_{2, 1}} & \hat{A_{2}} & \dots & \hat{a_{2, r}} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ \hat{a_{r, 1}} & \hat{a_{r, 2}} & \dots & \hat{A_{r}} \end{matrix}] \end{matrix}$ where the lower-case $\hat{a}$ stands for the estimated communication among the subsystems, which is expected to be minor.
The nonlinear decentralized model has the form $\begin{array}{ll} & \hat{x} (t + 1) = [\begin{matrix} \hat{x_{1}} (t + 1) \\ \hat{x_{2}} (t + 1) \\ ⋮ \\ \hat{x_{r}} (t + 1) \end{matrix}] = \hat{f} (x (t), u (t)) \\ (46) & = [\begin{matrix} \hat{f_{1}} (x_{1} (t), u_{1} (t)) \\ \hat{f_{2}} (x_{2} (t), u_{2} (t)) \\ ⋮ \\ \hat{f_{r}} (x_{r} (t), u_{r} (t)) \end{matrix}] \end{array}$ At this stage, the knowledge that each subsystem has about the components of z that affect it becomes important. Here, we assume an unknown decentralization structure: every subsystem $Σ_{i}$ knows the small set of variables in $z_{i}$ that might affect its outputs, but does not know exactly which variables do affect them.

Selective decentralization policy: The number of possible decentralization structures for r subsystems is $B (r)$ (the r-th Bell number), which grows super-exponentially. We set up a separate identification model for each such decentralization structure and adaptively switch among the models implementing the different decentralization policies to determine the best model.
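To give a sense of how quickly the number of candidate structures grows, the Bell numbers can be computed with the standard Bell triangle recurrence. This is a small illustrative sketch; the function name is an assumption and is not part of the learning design itself.

```python
def bell_numbers(r_max):
    """Return [B(0), B(1), ..., B(r_max)] computed with the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(r_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells

for r, b in enumerate(bell_numbers(10)):
    print(f"B({r}) = {b}")
```

For instance, $B (3) = 5$ while $B (10) = 115975$ , which is why exhaustively maintaining one identification model per structure becomes costly as r grows.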

Complete decentralization policy: The subsystems perform identification and calculate their local control using their own state and control subspace without any communication. In this work, we mention this naive approach to compare the control performance with the selective decentralization approach.

Fig. 6.

Illustrating selective decentralization in a simple linear system example, where the identification error is used to select the best decentralization scheme.

Fig. 7.

The learning design for selective decentralized control.

In addition, in this paper, we refer to centralized control, or centralization, as treating the whole system as one component. In this case, $r = 1$ and $B (r) = 1$ . The rest of the formulation is the same as for decentralization. An illustration of complete decentralization and centralization can be found in Fig. 6.

5.2. Selective decentralized control framework

Figure 7 shows the design of the learning control system in this work with two phases: identification and control. In the identification phase, we train the neural networks to acquire the functional approximators $\hat{f}$ , using $⟨ x (t), u (t) ⟩$ as the input tuples and $x (t + 1)$ as the outputs. The details of system identification are omitted in this paper since we have already presented them in [45]. In the control phase, to compute the near-optimal control, we use (9)–(10) for the linear system, and the policy iteration algorithm for the nonlinear system after setting up the corresponding MDP [50]. Here, the window size parameter Ω decides how frequently we call the identification phase. In other words, Ω decides the number of $⟨ x (t), u (t), x (t + 1) ⟩$ tuples used to train $\hat{f}$ .

Fig. 8.

The 3-mass mass-spring system: (upper) at the resting positions; (lower) the forces acting on the masses when they are not at the resting positions.

Selective decentralized control examines all of the $B (r)$ connection schemes among the subsystems and applies the control algorithm using the scheme with the lowest identification error, as shown in Fig. 6 for a toy linear system scenario. For example, with $r = 3$ , we have $B (3) = 5$ possible decentralization schemes: $\{\{1, 2, 3\}\}, \{\{1, 2\}, \{3\}\}, \{\{1, 3\}, \{2\}\}, \{\{1\}, \{2, 3\}\}$ and $\{\{1\}, \{2\}, \{3\}\}$ , which have 1, 2, 2, 2 and 3 subsystem(s), respectively. A subsystem uses only its own state and control variables to compute its approximator. For example, in the linear system, with scheme $\{\{1, 2\}, \{3\}\}$ , we have the form $\hat{A} = [\begin{matrix} \hat{A_{1, 2}} & 0 \\ 0 & \hat{A_{3}} \end{matrix}]$ . In this example, $\hat{A_{1, 2}} (t)$ is computed using only $x_{1} (t - 1), x_{2} (t - 1), u_{1} (t - 1)$ and $u_{2} (t - 1)$ , while $\hat{A_{3}} (t)$ is computed using only $x_{3} (t - 1)$ and $u_{3} (t - 1)$ . If scheme {{1, 2}, {3}} returns the lowest identification error, then from (10), we compute the next control [ $u_{1} (t), u_{2} (t)$ ] using only $\hat{A_{1, 2}} (t)$ and $u_{3} (t)$ using only $\hat{A_{3}} (t)$ . After the control is applied, the system moves to the next state $x (t + 1)$ and the identification-control process repeats.
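The enumeration of schemes and the choice of the scheme with the lowest identification error can be sketched as follows. This is only an illustration under stated assumptions: the helper names `set_partitions` and `canonical` are hypothetical, and the error values are made up for the example rather than taken from any experiment.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition

def canonical(partition):
    """Hashable, order-independent form of a partition."""
    return tuple(sorted(tuple(sorted(block)) for block in partition))

schemes = list(set_partitions([1, 2, 3]))

# Hypothetical identification errors, for illustration only.
errors = {
    ((1, 2, 3),): 0.30,
    ((1, 2), (3,)): 0.05,
    ((1, 3), (2,)): 0.40,
    ((1,), (2, 3)): 0.45,
    ((1,), (2,), (3,)): 0.50,
}

best = min(schemes, key=lambda p: errors[canonical(p)])
print(len(schemes), "schemes; selected:", canonical(best))
```

With these made-up errors, the selected scheme is {{1, 2}, {3}}, matching the example in the text.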

Let w be the window index. Then the window w covers the discrete time index from $t = (w - 1) Ω + 1$ to $t = w Ω$ . Let $E (w)$ be the window-identification error at window w, which is the average of $e (t)$ from $t = (w - 1) Ω + 1$ to $t = w Ω$ . Let $ρ_{1}$ and $ρ_{2}$ be two small numbers for thresholding. The pseudo code for selective decentralization is as follows:

6. Simulation results

6.1. Linear system

In this simulation, we set up the mass-spring system [69], which is a building block for automatic braking systems in the real world. Figure 8 shows the mass-spring system of three masses resting on a surface and moving horizontally. Masses (measured in kg) $m_{1}$ and $m_{3}$ connect to the fixed walls by springs (with elasticity constants measured in N/m) $k_{1}$ and $k_{3}$ . Mass $m_{2}$ stands between $m_{1}$ and $m_{3}$ , and connects to the other masses by springs $k_{12}$ and $k_{23}$ . The resting positions of $m_{1}$ , $m_{2}$ and $m_{3}$ are $p_{1}$ , $p_{2}$ and $p_{3}$ , respectively. Without loss of generality, suppose that $m_{1}$ , $m_{2}$ and $m_{3}$ are positioned such that $k_{1}$ , $k_{3}$ are compressed and $k_{12}$ , $k_{23}$ are stretched, as Fig. 8 (lower) shows.

Analyzing the forces acting on each mass, we have:

Mass $m_{1}$ has: force $\vec{F_{1, 1}}$ caused by the compressed $k_{1}$ pushing to the right, force $\vec{F_{12, 1}}$ caused by the stretched $k_{12}$ pulling to the right, and individual control force $\vec{u_{1}}$ pushing to the right in order to return to the resting point $p_{1}$ .

Mass $m_{2}$ has: force $\vec{F_{12, 2}}$ caused by the stretched $k_{12}$ pulling to the left, force $\vec{F_{23, 2}}$ caused by the stretched $k_{23}$ pulling to the right, and individual control force $\vec{u_{2}}$ pushing to the right in order to return to the resting point $p_{2}$ .

Mass $m_{3}$ has: force $\vec{F_{3, 3}}$ caused by the compressed $k_{3}$ pushing to the left, force $\vec{F_{23, 3}}$ caused by the stretched $k_{23}$ pulling to the left, and individual control force $\vec{u_{3}}$ pushing to the left in order to return to the resting point $p_{3}$ .

Writing Newton’s second law as vector equations [28] for these masses, we have $\begin{matrix} (47) & \left\{ \begin{matrix} m_{1} \vec{a_{1}} = \vec{F_{1, 1}} + \vec{F_{12, 1}} + \vec{u_{1}} \\ m_{2} \vec{a_{2}} = \vec{F_{12, 2}} + \vec{F_{23, 2}} + \vec{u_{2}} \\ m_{3} \vec{a_{3}} = \vec{F_{3, 3}} + \vec{F_{23, 3}} + \vec{u_{3}} \end{matrix} \right. \end{matrix}$ where $\vec{a_{1}}$ , $\vec{a_{2}}$ and $\vec{a_{3}}$ stand for the accelerations of $m_{1}$ , $m_{2}$ and $m_{3}$ , respectively. Let $(x_{1}, v_{1})$ , $(x_{2}, v_{2})$ and $(x_{3}, v_{3})$ denote the displacement and velocity of $m_{1}$ , $m_{2}$ and $m_{3}$ , respectively. Applying Hooke’s law for elastic springs [22] and linearizing (47) with a small time interval $Δ t$ , we have the system $\begin{array}{ll} & x (t + 1) = [\begin{matrix} 1 & Δ t & 0 & 0 & 0 & 0 \\ - \frac{(k_{1} + k_{12}) Δ t}{m_{1}} & 1 & \frac{k_{12} Δ t}{m_{1}} & 0 & 0 & 0 \\ 0 & 0 & 1 & Δ t & 0 & 0 \\ - \frac{k_{12} Δ t}{m_{2}} & 0 & \frac{(k_{12} + k_{23}) Δ t}{m_{2}} & 1 & - \frac{k_{23} Δ t}{m_{2}} & 0 \\ 0 & 0 & 0 & 0 & 1 & Δ t \\ 0 & 0 & \frac{k_{23} Δ t}{m_{3}} & 0 & - \frac{(k_{3} + k_{23}) Δ t}{m_{3}} & 1 \end{matrix}] x (t) \\ (48) & \quad + [\begin{matrix} 0 & 0 & 0 \\ \frac{Δ t}{m_{1}} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & \frac{Δ t}{m_{2}} & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \frac{Δ t}{m_{3}} \end{matrix}] u (t) \end{array}$ where $\begin{matrix} x = [\begin{matrix} x_{1} \\ v_{1} \\ x_{2} \\ v_{2} \\ x_{3} \\ v_{3} \end{matrix}] \text{ and } u = [\begin{matrix} u_{1} \\ u_{2} \\ u_{3} \end{matrix}] . \end{matrix}$ Bringing these masses to the resting positions implies that the velocities are $v_{1} = v_{2} = v_{3} = 0$ and the displacements are $x_{1} = x_{2} = x_{3} = 0$ , or $x = 0$ .

Fig. 9.

System (48) learning and control performance of the approaches: centralized reinforcement learning (RL), completely decentralized RL and selectively decentralized RL; top figure: state trajectory; bottom figure: control trajectory.

We set up the experiment with the following parameters. The masses are $m_{1} = m_{2} = m_{3} = 1$ (kg). The spring elastic constants are $k_{1} = k_{3} = 1$ (N/m) and $k_{12} = k_{23} = 0.5$ (N/m). The small time interval for linearization is $Δ t = 0.01$ (s). The discount factor in equation (5) is 0.9. Also, in (5), $p (x) = x^{T} x$ and $q (u) = u^{T} u$ . Initially, the displacements are $x_{1} = - 0.5$ (m), $x_{2} = - 0.3$ (m) and $x_{3} = 0.2$ (m), and the initial velocities of the masses are $v_{1} = v_{2} = v_{3} = 0$ (m/s). The learning rate is $α = 0.05$ .
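For readers who want to reproduce the setup, the matrices of system (48) can be assembled from these parameters as in the sketch below. The entries, including their signs, are transcribed as printed in (48); the last control row is scaled by $m_{3}$ , the mass it acts on. The function names are hypothetical, and the sketch only advances the open-loop model by one step; it does not implement the learning or control phases.

```python
def mass_spring_matrices(m, k_wall, k_mid, dt):
    """Build A (6x6) and B (6x3) of (48); state order is [x1, v1, x2, v2, x3, v3]."""
    m1, m2, m3 = m
    k1, k3 = k_wall
    k12, k23 = k_mid
    # Entries and signs are transcribed as printed in (48).
    A = [
        [1.0, dt, 0.0, 0.0, 0.0, 0.0],
        [-(k1 + k12) * dt / m1, 1.0, k12 * dt / m1, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, dt, 0.0, 0.0],
        [-k12 * dt / m2, 0.0, (k12 + k23) * dt / m2, 1.0, -k23 * dt / m2, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0, dt],
        [0.0, 0.0, k23 * dt / m3, 0.0, -(k3 + k23) * dt / m3, 1.0],
    ]
    B = [
        [0.0, 0.0, 0.0],
        [dt / m1, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, dt / m2, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, dt / m3],
    ]
    return A, B

def step(A, B, x, u):
    """One step of x(t + 1) = A x(t) + B u(t)."""
    return [
        sum(a * xj for a, xj in zip(row_a, x)) + sum(b * uj for b, uj in zip(row_b, u))
        for row_a, row_b in zip(A, B)
    ]

A, B = mass_spring_matrices(m=(1.0, 1.0, 1.0), k_wall=(1.0, 1.0),
                            k_mid=(0.5, 0.5), dt=0.01)
x0 = [-0.5, 0.0, -0.3, 0.0, 0.2, 0.0]   # initial displacements and velocities
x1 = step(A, B, x0, [0.0, 0.0, 0.0])     # one uncontrolled step
print(x1)
```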

Figure 9 shows that the selectively decentralized approach outperforms the centralized and the completely decentralized approaches in stabilizing the system. Here, $norm (X)$ and $norm (U)$ denote the 2-norms of the state and control trajectories, respectively. Both the completely decentralized RL and the centralized RL fail to stabilize the system, although the completely decentralized RL brings the masses closer to the resting points. On the other hand, the selectively decentralized RL stabilizes the system within 3 seconds by bringing the masses to their resting positions and stopping their movement.

6.2. Nonlinear system

In this example, we choose the system $\begin{matrix} (49) & x (t + 1) = \sin (Ax (t) + u (t)) \end{matrix}$ where $x, u \in R^{4}$ , matrix A is defined by normalizing $\tilde{A}$ into a Markov matrix where $\begin{matrix} (50) & \tilde{A} = [\begin{matrix} 0.7 & 0.3 & ψ \\ 0.2 & 0.8 \\ 1 \\ ψ & 1 \end{matrix}] \end{matrix}$ and the sin function is applied element-wise, $\begin{matrix} (51) & \sin (x) = [\begin{matrix} \sin (x_{1}) \\ \sin (x_{2}) \\ ⋮ \\ \sin (x_{n}) \end{matrix}] \end{matrix}$ and $x (0) = 0.2$ . Here, we assume that the bounds of x and u are known as $- 0.2 ⩽ x_{i}, u_{i} ⩽ 0.2 \ \forall i \in [1, 4]$ and that the true subsystem structure in (1) is $\{\{1, 2\}, \{3\}, \{4\}\}$ . The reward functions are $p (x) = - x^{T} x$ and $q (u) = - u^{T} u$ . The discount factor in (5) is $γ = 0.9$ .

For system approximation, we use a three-layer neural network with 30 hidden units and sigmoid activation functions, trained with backpropagation, for $\hat{f}$ . For each training step, we pass the training sample set $⟨ x (t), u (t) ⟩$ 2000 times. We set the window size $Ω = 50$ (Fig. 1). In addition, we run the experiment for at most 10000 iterations. Similar to the linear system case study, we set up the completely decoupled system by setting $ψ = 0$ and the strongly coupled system by setting $ψ = 0.1$ . In each state and control vector dimension, we divide the dimension into $K = 8$ regions, which makes the resolution threshold (13) 0.05.
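The resolution quoted above follows from dividing the known range $[- 0.2, 0.2]$ into $K = 8$ regions. The sketch below assumes a uniform grid per dimension, which is consistent with the value 0.05 but is an assumption about how threshold (13) is realized; the function names are hypothetical.

```python
def region_width(low, high, K):
    """Width of each of the K equal regions covering [low, high]."""
    return (high - low) / K

def region_index(value, low, high, K):
    """Index (0 .. K-1) of the region containing `value` on a uniform grid."""
    if not low <= value <= high:
        raise ValueError("value outside the known bounds")
    # The upper boundary is assigned to the last region.
    return min(int((value - low) / region_width(low, high, K)), K - 1)

def cell_count(K, dims):
    """Number of discrete cells when each of `dims` dimensions has K regions."""
    return K ** dims

print(region_width(-0.2, 0.2, 8))        # resolution per dimension
print(region_index(0.01, -0.2, 0.2, 8))  # region of a sample state component
print(cell_count(8, 4))                  # cells of the 4-dimensional state grid
```

With four state dimensions, a centralized grid already has $8^{4} = 4096$ state cells, whereas a one-dimensional subsystem has only 8.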

In Figs 10 and 11, we observe that the selectively decentralized system shows better control performance than the completely decentralized system and the centralized system. Similar to Figs 2 and 3, we use $norm (x)$ to denote the 2-norm of x. For ease of visualization, we only draw the results up to the 500th iteration, by which the selective decentralization is shown to converge. Here, we observe that when the system is completely decoupled, the centralized system converges to 0 significantly more slowly than the selectively decentralized system does. Surprisingly, the completely decentralized system does not converge within the maximum number of iterations in our experiment. In addition, when the system is strongly coupled, both the completely decentralized system and the centralized system fail to control the system within the maximum number of iterations in our experiment.

Fig. 10.

Comparison of control performance among the centralized system, the completely decentralized system and the selectively decentralized system when the system (49) is completely decoupled.

Fig. 11.

Comparison of control performance among the centralized system, the completely decentralized system and the selectively decentralized system when the system (49) is strongly coupled.

6.3. Comparison among the discretized-MDP, ADP and Q-learning approaches

In Fig. 12, we compare the learning performance of the Discretized-MDP, Adaptive Dynamic Programming (ADP) [59,65,70] and Q-learning [60] approaches. Q-learning is one of the most well-known techniques in reinforcement learning, following the temporal-difference (TD) principles [5]. ADP, which is one of the most promising approaches aiming for online learning, has been proven to stabilize nonlinear systems in feedback linearization form. The example used in this section is $\begin{matrix} (52) & x (t + 1) = \sin (Ax (t)) + u (t) \end{matrix}$ where $x, u \in R^{3}$ and matrix A is $\begin{matrix} (53) & A = [\begin{matrix} 0.6 & 0.2 & 0.2 \\ 0.2 & 0.6 & 0.2 \\ 0.2 & 0.2 & 0.6 \end{matrix}] \end{matrix}$ In addition, we ran these approaches without any decentralization. The starting state $x (0)$ is $[0.5, 0.5, 0.5]$ for all experiments. The implementation details for Q-learning follow [46]. The implementation of ADP follows [70]. For discretization in both the discretized-MDP and the Q-learning approaches, we set the resolution threshold (13) to 0.05. We observe that all of these techniques could stabilize the system; however, the ADP and discretized-MDP approaches are significantly superior to the Q-learning approach. The discretized-MDP approach also stabilizes the system faster than the ADP approach.
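As a reference point for this comparison, the benchmark system (52) with matrix (53) can be written down directly. The sketch below only defines the one-step dynamics and evaluates a single uncontrolled step from the stated $x (0)$ ; it does not implement the discretized-MDP, ADP or Q-learning controllers, and the function names are assumptions.

```python
import math

# Matrix A from (53).
A = [
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
]

def step(x, u):
    """One step of (52): x(t + 1) = sin(A x(t)) + u(t), with sin applied element-wise."""
    return [
        math.sin(sum(a * xj for a, xj in zip(row, x))) + ui
        for row, ui in zip(A, u)
    ]

def norm2(x):
    """2-norm of a state vector."""
    return math.sqrt(sum(v * v for v in x))

x0 = [0.5, 0.5, 0.5]
x1 = step(x0, [0.0, 0.0, 0.0])  # zero control, for illustration only
print(x1, norm2(x0), norm2(x1))
```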

Fig. 12.

System (52) comparison of control performance among the Discretized-MDP approach, the ADP approach and the Q-learning approach.

There are several points to note in Fig. 12. First, since the difference in convergence time among these approaches could be exponential, we draw the x-axis, which stands for the convergence time measured by the number of windows, in log scale. Therefore, the state trajectory (2-norm of x) may appear neither smooth nor differentiable. Second, since the Q-learning performance in [10] is measured by the average state trajectory over a window, the x-axis unit in Fig. 12 is the window index, with window size $Ω = 50$ . Therefore, the lines in Fig. 12 show the average of the state trajectory over each window.
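The per-window averaging described here can be sketched as follows. The sketch assumes non-overlapping windows of Ω consecutive time steps and drops a trailing incomplete window; the function name and the toy trajectory are hypothetical.

```python
import math

def window_averages(states, omega):
    """Average 2-norm of the state over consecutive, non-overlapping windows."""
    norms = [math.sqrt(sum(v * v for v in x)) for x in states]
    return [
        sum(norms[start:start + omega]) / omega
        for start in range(0, len(norms) - omega + 1, omega)
    ]

# Toy trajectory: 50 steps at norm 5 followed by 50 steps at the origin.
trajectory = [[3.0, 4.0]] * 50 + [[0.0, 0.0]] * 50
print(window_averages(trajectory, omega=50))
```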

7. Conclusions

7.1. Discussions

In this paper, we show that selective decentralization can improve the learning performance in both linear and nonlinear systems with several levels of interconnection among subsystems. Here, we measure the performance by the number of iterations, or samples, needed in learning. This measure of performance is useful for problems in which the number of training samples is limited. In addition, we show that the discrete-MDP technique could help in learning nonlinear control problems in general form.

Compared to adaptive dynamic programming (ADP) [61–68], which is one of the most popular approaches in reinforcement learning and adaptive control in recent years, our discrete-MDP approach is more limited in utilizing the capability of neural networks. In the ADP approach, the neural networks are used to approximate both the control function $u = k (x)$ and the state utility function $J (x)$ . In our approach, we only use the neural networks in system identification. From our point of view, when the system is completely unknown, it is difficult to initialize the admissible control [32] for the action neural networks, which is the necessary condition for convergence in ADP. Furthermore, the initialization of the state utility for the critic neural networks is another challenge in ADP for controlling unknown systems. Although [65,70] show techniques to initialize the state utility by arbitrary positive-definite functions, the necessary condition is that the state utility is non-negative, which is different from the state utility assumption in our paper. On the other hand, as we have shown that the discrete-MDP approach could approximate an admissible control for the system given some mild prerequisites, it is possible to use the result of the discrete-MDP approach as the initialization of the ADP’s action network.

In addition, this work handles the learning problem such that the identification and control could be executed consecutively and repeatedly. In most of the theoretical reinforcement learning work, especially on ADP [61–67], to tackle the unknown nature of the problem, the learning agent initially executes random actions to acquire a sufficient number of samples for one-time identification. The number of random actions could be between thousands and millions, depending on the system. This work shows that the learning agent may not need to execute any random actions: acting ‘optimally’ according to the most recent approximation of the system, even if the approximation may not be precise, could stabilize the system. In Fig. 12, we show that ADP could be executed in this manner, although the discrete-MDP shows faster learning speed.

There are several limitations in this paper. First, computing the discretization thresholds requires the distribution of the next state under the assumption that the current state and control vectors are uniformly distributed, and may require a number of ad hoc steps. Second, in selective decentralization, we still explore all $B (r)$ possible decoupling schemes, a number that grows super-exponentially. However, since the selectively decentralized system converges faster than the centralized system in most of the cases, we believe that the computationally heavy model-switching phase in the selectively decentralized system will be relatively short. Therefore, the selectively decentralized system may be more computationally efficient than the centralized system, which must run the learning algorithm on high-dimensional data for a long time.

7.2. Future works

In future work, we would develop this approach in three directions. First, we would tackle the issue of the number of decentralization schemes. A suitable idea is to predict the potentially good decentralization schemes given different global state inputs. This could be done with a neural network as the estimator. There are two possible designs for this task: one design takes the global state vector as the input and outputs the best decentralization schemes; the other takes the global state vector and the decentralization scheme as the input and outputs the estimated metric used to select the scheme (such as the estimated identification error). Second, we would extend this work toward more real-world applications, especially in systems biology, where the learning problems are known to be large and contain a large degree of uncertainty. Here, domain knowledge will be critical in setting up the system. Third, we would examine the performance of selective decentralization in combination with other learning algorithms, especially adaptive dynamic programming.

Acknowledgement

The research presented in this paper was supported by the United States National Science Foundation grant No. ECCS-1407925.

References

Abu-Khalaf and

F.L.

Lewis, Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach, Automatica 41(5) (2005), 779–791. doi:10.1016/j.automatica.2004.11.034.

W.F.

ArnoldIII. and

A.J.

Laub, Generalized eigenproblem algorithms and software for algebraic Riccati equations, in: Proceedings of the IEEE, Vol. 72, 1984, pp. 1746–1754.

Arslan and

Yüksel, Decentralized Q-learning for weakly acyclic stochastic dynamic games, in: IEEE Conference on Decision and Control, 2015, pp. 6743–6748.

Arslan and

Yüksel, Decentralized Q-learning for stochastic teams and games, IEEE Transactions on Automatic Control 62(4) (2017), 1545–1558. doi:10.1109/TAC.2016.2598476.

A.G.

Barto, Temporal difference learning, Scholarpedia 2(11) (2007), 1604. doi:10.4249/scholarpedia.1604.

Battistelli,

J.P.

Hespanha,

Mosca and

Tesi, Model-free adaptive switching control of time-varying plants, IEEE Transactions on Automatic Control 58(5) (2013), 1208–1220. doi:10.1109/TAC.2013.2243974.

R.W.

Beard,

G.N.

Saridis and

J.T.

Wen, Galerkin approximations of the generalized Hamilton–Jacobi–Bellman equation, Automatica 33(12) (1997), 2159–2177. doi:10.1016/S0005-1098(97)00128-3.

Bellman, On the theory of dynamic programming, Proceedings of the National Academy of Sciences 38(8) (1952), 716–719. doi:10.1073/pnas.38.8.716.

Bellon, Riccati Equations in Optimal Control Theory, 2008, available at https://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1045&context=math_theses.

10.

D.P.

Bertsekas, Dynamic Programming and Optimal Control, 3rd edn, Vol. II, Athena Scientific, Belmont, MA, 2011.

11.

Bian,

Jiang and

Z.-P.

Jiang, Decentralized adaptive optimal control of large-scale systems with application to power systems, IEEE Transactions on Industrial Electronics 62(4) (2015), 2439–2447. doi:10.1109/TIE.2014.2345343.

12.

C.M.

Bishop, Pattern Recognition, Machine Learning, Springer, 2006, pp. 537–541.

13.

Busoniu,

Babuska and

De Schutter, A comprehensive survey of multiagent reinforcement learning, IEEE Transactions on Systems, Man, And Cybernetics-Part C: Applications and Reviews 38(2) (2008), 156–172. doi:10.1109/TSMCC.2007.913919.

14.

H.S.

Chang, Decentralized learning in finite Markov chains: Revisited, IEEE Transactions on Automatic Control 54(7) (2009), 1648–1653. doi:10.1109/TAC.2009.2017977.

15. H.S. Chang, Hu, M.C. Fu and S.I. Marcus, Simulation-Based Algorithms for Markov Decision Processes, Springer Science & Business Media, 2013.

16. C.K. Chui and Chen, Linear Systems and Optimal Control, Springer Science & Business Media, 2012.

17. F.L. Da Silva, Glatt and A.H.R. Costa, MOO-MDP: An object-oriented representation for cooperative multiagent reinforcement learning, IEEE Transactions on Cybernetics (2017).

18. K.-I. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks 2(3) (1989), 183–192. doi:10.1016/0893-6080(89)90003-8.

19. Gao, Dou and Su, Multi-model switching control of hypersonic vehicle with variable scramjet inlet based on adaptive neural network, in: World Congress on Intelligent Control and Automation, 2016, pp. 1714–1719.

20. D.T. Gavel and Siljak, Decentralized adaptive control: Structural conditions for stability, IEEE Transactions on Automatic Control 34(4) (1989), 413–426. doi:10.1109/9.28016.

21. Han and K.S. Narendra, New concepts in adaptive control using multiple models, IEEE Transactions on Automatic Control 57(1) (2012), 78–89. doi:10.1109/TAC.2011.2152470.

22. Hooke, De Potentia Restitutiva, or of Spring Explaining the Power of Springing Bodies, John Martyn, London, UK, 1678.

23. Hua and S.X. Ding, Decentralized networked control system design using T–S fuzzy approach, IEEE Transactions on Fuzzy Systems 20(1) (2012), 9–21. doi:10.1109/TFUZZ.2011.2162735.

24. C.-S. Huang, Wang and Teo, Solving Hamilton–Jacobi–Bellman equations by a modified method of characteristics, Nonlinear Analysis: Theory, Methods & Applications 40(1) (2000), 279–293. doi:10.1016/S0362-546X(00)85016-6.

25. P.A. Ioannou, Decentralized adaptive control of interconnected systems, IEEE Transactions on Automatic Control 31(4) (1986), 291–298. doi:10.1109/TAC.1986.1104282.

26. K.J. Keesman, System Identification: An Introduction, Springer-Verlag, 2011, pp. 94–97. doi:10.1007/978-0-85729-522-4.

27. Kharroubi, Langrené and Pham, A numerical algorithm for fully nonlinear HJB equations: An approach by control randomization, Monte Carlo Methods and Applications 20(2) (2014), 145–165.

28. Kleppner and Kolenkow, An Introduction to Mechanics, Cambridge University Press, 2013.

29. Lancaster and Rodman, Algebraic Riccati Equations, Clarendon Press, 1995.

30. J.W. Lee and Jangmin, A multi-agent Q-learning framework for optimizing stock trading systems, in: Proc. International Conference on Database and Expert Systems Applications, 2002, pp. 153–162. doi:10.1007/3-540-46146-9_16.

31. F.L. Lewis and V.L. Syrmos, Optimal Control, John Wiley & Sons, 1995.

32. F.L. Lewis and Vrabie, Reinforcement learning and adaptive dynamic programming for feedback control, IEEE Circuits and Systems Magazine 9(3) (2009), 32–50. doi:10.1109/MCAS.2009.933854.

33. F.L. Lewis, Vrabie and K.G. Vamvoudakis, Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers, IEEE Control Systems Magazine 32(6) (2012), 76–105.

34. Liu, Wang and Li, Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online learning optimal control approach, IEEE Transactions on Neural Networks and Learning Systems 25(2) (2014), 418–428. doi:10.1109/TNNLS.2013.2280013.

35. Liu, Gu, Sheng, Meng, Wu and Chen, Decentralized multi-agent system-based cooperative frequency control for autonomous microgrids with communication constraints, IEEE Transactions on Sustainable Energy 5(2) (2014), 446–456. doi:10.1109/TSTE.2013.2293148.

36. Mahajan, Optimal decentralized control of coupled subsystems with control sharing, IEEE Transactions on Automatic Control 58(9) (2013), 2377–2382. doi:10.1109/TAC.2013.2251807.

37. W.T. Miller, P.J. Werbos and R.S. Sutton, Neural Networks for Control, MIT Press, 1995.

38. Munos and Moore, Variable resolution discretization in optimal control, Machine Learning 49(2–3) (2002), 291–323. doi:10.1023/A:1017992615625.

39. Munos and A.W. Moore, Variable resolution discretization for high-accuracy solutions of optimal control problems, in: International Joint Conference on Artificial Intelligence, 1999, p. 256.

40. Narendra, Oleng and Mukhopadhyay, Decentralised adaptive control with partial communication, IEE Proceedings - Control Theory and Applications 153(5) (2006), 546–555. doi:10.1049/ip-cta:20050284.

41. K.S. Narendra and Balakrishnan, Improving transient response of adaptive control systems using multiple models and switching, IEEE Transactions on Automatic Control 39(9) (1994), 1861–1866. doi:10.1109/9.317113.

42. K.S. Narendra and Mukhopadhyay, To communicate or not to communicate: A decision-theoretic approach to decentralized adaptive control, in: Advances in Computing and Communications, 2010, pp. 6369–6376.

43. D.T. Nguyen, Kumar and H.C. Lau, Policy gradient with value function approximation for collective multiagent planning, in: Proc. Neural Information Processing Systems Conference, 2017, pp. 4322–4332.

44. D.T. Nguyen, Yeoh, H.C. Lau, Zilberstein and Zhang, Decentralized multi-agent reinforcement learning in average-reward dynamic DCOPs, in: Proc. International Foundation for Autonomous Agents and Multiagent Systems, 2014, pp. 1341–1342.

45. Nguyen and Mukhopadhyay, Identification and optimal control of large-scale system using selective decentralization, in: Proc. IEEE International Conference on Systems, Man, and Cybernetics, 2016.

46. Nguyen and Mukhopadhyay, Selectively decentralized Q-learning, in: Proc. IEEE International Conference on Systems, Man, and Cybernetics, 2017, pp. 328–333.

47. Pillonetto, Dinuzzo, Chen, De Nicolao and Ljung, Kernel methods in system identification, machine learning and function estimation: A survey, Automatica 50(3) (2014), 657–682. doi:10.1016/j.automatica.2014.01.001.

48. Ranjbar-Sahraei, Shabaninia, Nemati and S.-D. Stan, A novel robust decentralized adaptive fuzzy control for swarm formation of multiagent systems, IEEE Transactions on Industrial Electronics 59(8) (2012), 3124–3134. doi:10.1109/TIE.2012.2183831.

49. G.-C. Rota, The number of partitions of a set, The American Mathematical Monthly 71(5) (1964), 498–504. doi:10.1080/00029890.1964.11992270.

50. Russell and Norvig, Artificial Intelligence: A Modern Approach, 3rd edn, Prentice Hall, 2010.

51. Russell and Norvig, Artificial Intelligence: A Modern Approach, 3rd edn, Pearson, 2010, pp. 830–841.

52. G.N. Saridis and C.-S.G. Lee, An approximation theory of optimal control for trainable manipulators, IEEE Transactions on Systems, Man and Cybernetics 9(3) (1979), 152–159. doi:10.1109/TSMC.1979.4310171.

53. Scharpff, D.M. Roijers, F.A. Oliehoek, M.T. Spaan and M.M. de Weerdt, Solving transition-independent multi-agent MDPs with sparse interactions, in: Proc. Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 3174–3180.

54. Shah and P.A. Parrilo, H2-optimal decentralized control over posets: A state-space solution for state-feedback, IEEE Transactions on Automatic Control 58(12) (2013), 3084–3096. doi:10.1109/TAC.2013.2281881.

55. Shi and S.K. Singh, Decentralized adaptive controller design for large-scale systems with higher order interconnections, IEEE Transactions on Automatic Control 37(8) (1992), 1106–1118. doi:10.1109/9.151092.

56. Tampuu, Matiisen, Kodelja, Kuzovkin, Korjus, Aru, Aru and Vicente, Multiagent cooperation and competition with deep reinforcement learning, in: PLoS ONE, Vol. 12, 2017.

57. W.L. Teacy, Chalkiadakis, Farinelli, Rogers, N.R. Jennings, McClean and Parr, Decentralized Bayesian reinforcement learning for online agent collaboration, in: International Foundation for Autonomous Agents and Multiagent Systems, 2012, pp. 417–424.

58. Vrancx, Verbeeck and Nowé, Decentralized learning in Markov games, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38(4) (2008), 976–981. doi:10.1109/TSMCB.2008.920998.

59. F.-Y. Wang, Zhang and Liu, Adaptive dynamic programming: An introduction, IEEE Computational Intelligence Magazine 4(2) (2009), 39–47. doi:10.1109/MCI.2009.932261.

60. C.J. Watkins and Dayan, Q-learning, Machine Learning 8(3–4) (1992), 279–292. doi:10.1007/BF00992698.

61. Wei, F.L. Lewis, Liu, Song and Lin, Discrete-time local value iteration adaptive dynamic programming: Convergence analysis, in: IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2017, pp. 1–17.

62. Wei, F.L. Lewis, Sun, Yan and Song, Discrete-time deterministic Q-learning: A novel convergence analysis, IEEE Transactions on Cybernetics 47(5) (2017), 1224–1237. doi:10.1109/TCYB.2016.2542923.

63. Wei, Liu and Lin, Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems, IEEE Transactions on Cybernetics 46(3) (2016), 840–853. doi:10.1109/TCYB.2015.2492242.

64. Wei, Liu and Lin, Discrete-time local iterative adaptive dynamic programming: Terminations and admissibility analysis, in: IEEE Transactions on Neural Networks and Learning Systems, 2016.

65. Wei, Liu, Lin and Song, Adaptive dynamic programming for discrete-time zero-sum games, IEEE Transactions on Neural Networks and Learning Systems 99 (2017), 1–13.

66. Wei, Liu, Lin and Song, Discrete-time optimal control via local policy iteration adaptive dynamic programming, IEEE Transactions on Cybernetics 47(10) (2017), 3367–3379. doi:10.1109/TCYB.2016.2586082.

67. Wei, Song and Yan, Data-driven zero-sum neuro-optimal control for a class of continuous-time unknown nonlinear systems with disturbance using ADP, IEEE Transactions on Neural Networks and Learning Systems 27(2) (2016), 444–458. doi:10.1109/TNNLS.2015.2464080.

68. Yang, Liu and Wang, Reinforcement learning for adaptive optimal control of unknown continuous-time nonlinear systems with input constraints, International Journal of Control 87(3) (2014), 553–566. doi:10.1080/00207179.2013.848292.

69. H.D. Young, R.A. Freedman and A.L. Ford, Elastic potential energy, in: University Physics, Pearson Education Inc., 2008, pp. 222–230.

70. Zhang, Liu, Luo and Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Springer, 2013.