Why the ‘selfish’ optimizing agents could solve the decentralized reinforcement learning problems

Abstract

Multidisciplinary Design Optimization (MDO) is a computational approach for optimizing the design of a complex system of systems that requires knowledge from multiple disciplines. In a former study, we found that the individual discipline feasible (IDF) approach, a type of MDO design technique, performed well in several benchmark test cases of decentralized Reinforcement Learning (RL) problems, in particular, stabilizing an unknown system. However, the earlier study could not explain why the overall system of systems, even with strongly coupled systems, could be stabilized when each agent focused only on stabilizing itself. In this work, we extend that study with a theoretical analysis of the MDO solution of RL problems. Through the analysis, we show that with the proper control law, each MDO agent should be able to bring its state closer to the 0-stable point regardless of how the other agents’ states impact the state of the whole system. This is the main reason why the ‘selfish’ MDO-IDF agents are successful in learning to stabilize the overall system. The simulation results, including benchmark test cases, verify our analysis. Therefore, we propose that MDO is a promising solution for many other decentralized RL problems.

Keywords

Reinforcement learning, multidisciplinary design optimization, adaptive control, individual discipline feasible

1. Introduction

Decentralized learning, also referred to as distributed or multi-agent learning, has generated multiple promising computational techniques to solve large-scale reinforcement learning (RL) problems. The significant advantages of decentralized RL are extensively discussed in [8,9,18,41,54]: decentralized RL not only reduces the dimensionality of the problem but also increases the robustness of the learning. However, to achieve a high-quality decentralized RL solution, we need to tackle four major challenges. First, for each learning agent, when and how should it interact, and with whom [1]? Second, how should the agents define a learning goal: should each agent focus only on its own goal, or should it incorporate the other agents’ goals in its learning goal [8]? Third, similar to other RL problems, decentralized learning has an additional difficulty due to the multi-disciplinary and partially-defined nature of the problem [15]. This challenge could be tackled by system identification [30,45,55]. For example, in a mass-spring problem, the agent could use a linear model to approximate how the position and velocity of the masses would change when an external (control) force is applied, given that the spring constants are unknown [35]. Fourth, even when the environment and the feedback on learning performance could be closely approximated, a closed-form solution for the learning problem is unknown in general. Therefore, researchers have been focusing on approximation methods, such as [1,7,17,39], to tackle the nonlinear Hamilton-Jacobi-Bellman (HJB) equation, which is the mathematical foundation of RL.

Multidisciplinary Design Optimization (MDO) [2,3,6,12,28], which has been intensively researched and applied in aerospace and mechanical engineering, is potentially a promising approach to tackle the challenges in decentralized RL. In MDO, the computational agents are well-defined and decomposed according to the domain knowledge of each discipline in a coupled optimization problem. For example, in aero-elastic optimization, there are two decentralized computational units: the aerodynamic unit applies the laws of fluid dynamics to manage the air pressure on the aircraft wing, and the structure unit applies material laws to manage the deflection and shape of the wing [10]. Other examples, especially on automatic vehicles, could be found in [13].

2. State of the art techniques

Most of the recent state-of-the-art techniques in decentralized reinforcement learning could be categorized into model-based and model-free techniques. Examples of model-free techniques, such as in [4,24,32,43], mostly cover Q-learning. Fundamentally, Q-learning stores the expected outcome of executing action u at scenario (also called state) x for each agent [46]. In these examples, each learning agent applies the Q-learning algorithm to choose its own action in a cooperative learning problem. The disadvantage of Q-learning and other model-free techniques is slow convergence. For example, for a small tic-tac-toe problem, it takes millions of iterations for the Q-learning agent to learn the optimal action plan [44]. On the other hand, the model-based techniques learn the underlying mechanisms regulating the changes of scenarios and environments in order to decide the action. For example, in [11,31,40], the learning problems are in MDP format and each agent makes its decision by using a policy iteration scheme. In another example, [26] shows how each learning agent can apply its own policy computed by adaptive dynamic programming in a system-stabilizing problem. Overall, in these techniques, each learning agent still makes its own decisions independently. The model-based techniques often converge faster than the model-free techniques. However, the model-based techniques are often more complex, rely on more assumptions, and require a substantial amount of domain knowledge to construct. These may not be available when learning and performing in mostly unknown problems.
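For reference, the tabular update that underlies the Q-learning techniques mentioned above can be sketched as follows. This is the standard textbook rule, not code from the cited works; the state and action labels, the reward, the learning rate `alpha` and the discount `gamma` are illustrative assumptions.

```python
from collections import defaultdict


def q_update(Q, x, u, reward, x_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update for the pair (state x, action u)."""
    # Best value currently stored for the next state.
    best_next = max(Q[(x_next, a)] for a in actions)
    # Move the stored value toward the one-step target.
    Q[(x, u)] += alpha * (reward + gamma * best_next - Q[(x, u)])
    return Q[(x, u)]


# Hypothetical two-action problem; unseen (state, action) pairs default to 0.
Q = defaultdict(float)
actions = ["left", "right"]
value = q_update(Q, x="s0", u="right", reward=1.0, x_next="s1", actions=actions)
```

Because the table holds one entry per state-action pair and each entry moves only a fraction `alpha` per visit, many visits are needed before the values settle, which is one way to see the slow convergence noted above.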

In related former work [36], we found that the individual discipline feasible (IDF) design approach in MDO performed well for several toy-example RL problems, compared to the commonly used centralized and recent decentralized techniques, including those using Markov Decision Processes and Q-learning [37]. In the IDF design, which could be called a ‘selfish’ design, each computational agent only aims to optimize its own objective function and uses the other agents’ information as constraints. In this setting, the agents tend to seek local optima of their objective functions, while the constraints and information from other agents ensure that the local solutions would eventually allow all agents to reach the objective. The exchanged constraints and information add computational complexity for each agent. For example, if one agent discretizes these constraints and information as extra parameters in its objective function, which are subject to changes in other agents, then its solution would need more subcases. The exchanged information among the agents could then be preprocessed or transformed into simpler forms to reduce the complexity of the optimization in each agent. As a brief technical description, we used system identification to approximate the unknown environment (or system) in the RL problem. In the system identification step, each learning agent also identified the impact of the other agents’ information on its learning performance, which was the central theme in MDO. From the identified model, each learning agent set up a Markov decision process (MDP) to compute the action/control solution [33]. Although [36] demonstrates that IDF-MDO is successful in solving the problem of stabilizing unknown systems, it was not clear why the whole system, even with strongly coupled sub-systems, could be stabilized when each agent focused only on stabilizing itself.

In this work, we extend the prior work [36] in several aspects, especially in the theoretical analysis of the MDO solution of RL problems. The major contributions of this work are:

We extend the scope of the learning problem toward linear systems and nonlinear systems with noise. In both the linear and nonlinear cases, the MDO approach succeeds in learning how to stabilize an unknown dynamic system. The MDO also shows better learning performance, compared to the centralized approach, when the system contains noise. To the best of our knowledge, MDO has not been widely applied in decentralized RL.

We derive the control laws for each agent differently in the linear and the nonlinear cases. These reflect two different strategies to derive the control laws for an MDO agent, given the inter-communication among the agents. The first strategy is to ‘counter’ the impact of the other agents’ states. The second strategy is to discretize the other agents’ states for setting up multi-model controls.

We provide the theoretical analysis showing that with the control laws derived in the second contribution, each MDO agent should be able to bring its state closer to the 0-stable point regardless of how the other agents’ state could impact its state. This is the main reason why the ‘selfish’ MDO-IDF agents are successful in learning to stabilize the entire system.

The structure of this work is as follows. First, we demonstrate the formulation of the decentralized RL problem to be solved by MDO. Second, we show how to derive the control law for each agent assuming that the learning problem is completely known. Third, moving closer to RL, we review system identification techniques for approximating an unknown system. Fourth, combining system identification and the control law, we show how the MDO agents learn to stabilize the entire system in several experimental and real-world examples with noise.

3. MDO in linear system of learning and control

Similar to many former system control and RL approaches [4,11,24,31,32,40,43], we start our theoretical analysis from the linear system due to its simplicity. In addition, recognizing the interconnection patterns among the learning agents is simple in linear cases because we can represent these patterns by the system matrix entries. In this section, we emphasize the fundamental difference between MDO RL agents and the classical (engineering) MDO agent. The classical MDO agent optimizes intermediate and well-defined criteria, which can be written in closed form; meanwhile, the MDO RL agents optimize aggregated long-term criteria, which are generally unknown and difficult to write in closed form.

3.1. Formulation of the linear system in MDO

In general, a linear system is written as $$ x (t + 1) = A x (t) + B u (t) \tag{1} $$ where x is the n-dimensional state vector, u is the m-dimensional control vector, A is an $n \times n$ matrix and B is an $n \times m$ matrix. In other words, A, which contains the correlation among the state dimensions, describes how the system state changes without any control, and B describes how the control vector impacts the state vector. In our problems of interest, suppose that system (1) contains K disjoint agents with dimensionality $n_{1} + n_{2} + \dots + n_{K} = n$ for the state vectors and $m_{1} + m_{2} + \dots + m_{K} = m$ for the control vectors. Let ${i_{1}}, {i_{2}}, \dots, {i_{K}}$ be the sets of x-indices and ${j_{1}}, {j_{2}}, \dots, {j_{K}}$ be the sets of u-indices for agents $1, 2, \dots, K$, correspondingly. These indices are pre-determined by multidisciplinary domain knowledge. We further assume each agent is only responsible for controlling its sub-state vector, which means B is a block matrix as follows $$ B_{i_{k}, j_{l}} = 0 \quad \forall 1 ⩽ k, l ⩽ K, k \neq l \tag{2} $$ Then, (1) could be rewritten as a system of K equations, one for each agent $$ \begin{cases} x_{i_{1}} (t + 1) = A_{i_{1}, i_{1}} x_{i_{1}} (t) + B_{i_{1}, j_{1}} u_{j_{1}} (t) + A_{i_{1}, i_{k \neq 1}} x_{i_{k \neq 1}} (t) \\ x_{i_{2}} (t + 1) = A_{i_{2}, i_{2}} x_{i_{2}} (t) + B_{i_{2}, j_{2}} u_{j_{2}} (t) + A_{i_{2}, i_{k \neq 2}} x_{i_{k \neq 2}} (t) \\ \vdots \\ x_{i_{K}} (t + 1) = A_{i_{K}, i_{K}} x_{i_{K}} (t) + B_{i_{K}, j_{K}} u_{j_{K}} (t) + A_{i_{K}, i_{k \neq K}} x_{i_{k \neq K}} (t) \end{cases} \tag{3} $$ where ${i_{k \neq 1}}, {i_{k \neq 2}}, \dots, {i_{k \neq K}}$ are the sets of x-indices not belonging to agents $1, 2, \dots, K$, correspondingly.
The set of equations in (3) means that an agent’s next state also depends on the other agents’ states. In addition, each agent is allowed to know the other agents’ states.
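The decomposition of (1) into the per-agent equations (3) can be checked numerically. The sketch below is only an illustration: the three-state, two-agent matrices, the index sets and the helper names (`matvec`, `sub`, `agent_step`) are hypothetical and not taken from the paper.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sub(M, rows, cols):
    """Extract the block of M with the given row and column indices."""
    return [[M[r][c] for c in cols] for r in rows]


def agent_step(A, B, x, u, i_k, j_k):
    """Next sub-state of one agent, following the per-agent form (3)."""
    others = [i for i in range(len(x)) if i not in i_k]
    own = matvec(sub(A, i_k, i_k), [x[i] for i in i_k])
    ctrl = matvec(sub(B, i_k, j_k), [u[j] for j in j_k])
    coupling = matvec(sub(A, i_k, others), [x[i] for i in others])
    return [a + b + c for a, b, c in zip(own, ctrl, coupling)]


# Hypothetical system: agent 1 owns states {0, 1} and control {0},
# agent 2 owns state {2} and control {1}. B respects the block rule (2).
A = [[0.9, 0.2, 0.1],
     [0.0, 1.1, 0.3],
     [0.4, 0.0, 0.8]]
B = [[1.0, 0.0],
     [0.5, 0.0],
     [0.0, 2.0]]
x = [1.0, -2.0, 0.5]
u = [0.3, -0.1]
index_sets = [([0, 1], [0]), ([2], [1])]

# Full update (1) versus the stacked per-agent updates (3).
full = [a + b for a, b in zip(matvec(A, x), matvec(B, u))]
per_agent = []
for i_k, j_k in index_sets:
    per_agent.extend(agent_step(A, B, x, u, i_k, j_k))
```

Both routes give the same next state, which is all that (3) claims: it is a regrouping of (1), with the off-diagonal blocks of A carrying the influence of the other agents.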

The objective is to find the series of $u (t)$ to stabilize (1) for any initial state $x (0)$ $$ x (t) ⟶ 0, \quad u (t) ⟶ 0 \quad \text{when } t ⟶ \infty \tag{4} $$ or, for every agent $$ x_{i_{k}} (t) ⟶ 0, \quad u_{j_{k}} (t) ⟶ 0 \quad \text{when } t ⟶ \infty \quad \forall k \tag{5} $$ It means that regardless of the starting point, the system state should be able to reach the desired and stable point. In reality, this point may be non-zero; for example, the desired and stable point of an automatic vehicle may be a velocity of 30 mph. However, there always exists a transformation such that this point is mathematically 0. Therefore, for the simplicity of theory and experiment, we often choose state $x = 0$ as the objective [25]. In the notion of optimal control, let Q and R be positive-definite matrices with size $n \times n$ and $m \times m$, correspondingly. In addition, similar to (2), Q and R are block matrices $$ Q_{i_{k}, i_{l}} = 0, \quad R_{j_{k}, j_{l}} = 0 \quad \forall 1 ⩽ k, l ⩽ K, k \neq l \tag{6} $$ which means that for the following objective: minimizing for any $x (0)$ $$ J (x (0)) = \sum_{t = 0}^{\infty} (x {(t)}^{T} Q x (t) + u {(t)}^{T} R u (t)) \tag{7} $$ the solution for (7) is also the solution of each agent’s objective of minimizing $$ J (x_{i_{k}} (0)) = \sum_{t = 0}^{\infty} (x_{i_{k}} {(t)}^{T} Q_{i_{k}, i_{k}} x_{i_{k}} (t) + u_{j_{k}} {(t)}^{T} R_{j_{k}, j_{k}} u_{j_{k}} (t)) \quad \forall k \tag{8} $$ Furthermore, $$ J (x (0)) = \sum_{k = 1}^{K} J (x_{i_{k}} (0)) \tag{9} $$ that is, the optimal value of the whole system is equal to the sum of the optimal values of all agents. Here, it is easy to see that the optimization (7)–(9) is an individual discipline feasible MDO [2,28], with constraints (3).
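To spell out the step from (7) to (8) and (9), the short derivation below uses only the block structure (6). It shows that the running cost is additive across agents along any trajectory; it does not by itself account for the coupling between agents through (3).

```latex
\begin{aligned}
x(t)^{T} Q \, x(t)
  &= \sum_{k=1}^{K} \sum_{l=1}^{K} x_{i_k}(t)^{T} Q_{i_k, i_l} \, x_{i_l}(t)
   = \sum_{k=1}^{K} x_{i_k}(t)^{T} Q_{i_k, i_k} \, x_{i_k}(t), \\
u(t)^{T} R \, u(t)
  &= \sum_{k=1}^{K} \sum_{l=1}^{K} u_{j_k}(t)^{T} R_{j_k, j_l} \, u_{j_l}(t)
   = \sum_{k=1}^{K} u_{j_k}(t)^{T} R_{j_k, j_k} \, u_{j_k}(t), \\
\sum_{t=0}^{\infty} \left( x(t)^{T} Q \, x(t) + u(t)^{T} R \, u(t) \right)
  &= \sum_{k=1}^{K} \sum_{t=0}^{\infty}
     \left( x_{i_k}(t)^{T} Q_{i_k, i_k} \, x_{i_k}(t)
          + u_{j_k}(t)^{T} R_{j_k, j_k} \, u_{j_k}(t) \right).
\end{aligned}
```

The cross terms vanish because $Q_{i_k, i_l} = 0$ and $R_{j_k, j_l} = 0$ for $k \neq l$, and exchanging the two sums in the last line assumes that each agent's series converges.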

3.2. The admissible (sub-optimal) control for the MDO problem in linear system

In this section, to simplify the notation, for every agent k, we rewrite $A_{i_{k}, i_{k}} = A_{k}$, $x_{i_{k}} = x_{k}$, $B_{i_{k}, j_{k}} = B_{k}$, $u_{j_{k}} = u_{k}$, $A_{i_{k}, i_{l \neq k}} = C_{k}$, $x_{i_{l \neq k}} = y_{k}$, $Q_{i_{k}, i_{k}} = Q_{k}$, $R_{j_{k}, j_{k}} = R_{k}$. Let $P_{k} = P_{i_{k}, i_{k}}$ be the solution of the Riccati equation [23] for agent k $$ A_{k}^{T} P_{k} A_{k} - P_{k} - A_{k}^{T} P_{k} B_{k} {(R_{k} + B_{k}^{T} P_{k} B_{k})}^{- 1} B_{k}^{T} P_{k} A_{k} + Q_{k} = 0 \tag{10} $$ Then, we have

Theorem 1.
Let matrix $K_{k}$ be defined as $$ K_{k} = {(R_{k} + B_{k}^{T} P_{k} B_{k})}^{- 1} B_{k}^{T} P_{k} A_{k} \tag{11} $$ The control $$ u_{k} (t) = - K_{k} x_{k} (t) - {(B_{k}^{T} B_{k})}^{- 1} B_{k}^{T} C_{k} y_{k} (t) \tag{12} $$ is guaranteed to stabilize the system (1)–(3).

The proof is as follows. Rewriting (3) in the notation of this section, for any agent k $$ x_{k} (t + 1) = A_{k} x_{k} (t) + B_{k} u_{k} (t) + C_{k} y_{k} (t) \tag{13} $$ We show that $x_{k} (t) ⟶ 0$ for every agent k. Substituting (12) into (13), we have $$ \begin{aligned} x_{k} (t + 1) &= A_{k} x_{k} (t) - B_{k} K_{k} x_{k} (t) - B_{k} {(B_{k}^{T} B_{k})}^{- 1} B_{k}^{T} C_{k} y_{k} (t) + C_{k} y_{k} (t) \\ &= (A_{k} - B_{k} K_{k}) x_{k} (t) \end{aligned} \tag{14} $$ The last step uses $B_{k} {(B_{k}^{T} B_{k})}^{- 1} B_{k}^{T} C_{k} y_{k} (t) = C_{k} y_{k} (t)$, which holds when $B_{k}$ is square and invertible or, more generally, when $C_{k} y_{k} (t)$ lies in the column space of $B_{k}$. $K_{k}$ is computed from the Riccati equation (10), which is guaranteed to stabilize the system $$ z_{k} (t + 1) = A_{k} z_{k} (t) + B_{k} u_{k} (t) \tag{15} $$ Then, according to [46], all eigenvalues of matrix $(A_{k} - B_{k} K_{k})$ have to be within the unit circle. Let λ denote the eigenvalue of $(A_{k} - B_{k} K_{k})$ with the largest magnitude. Expanding from (14), $$ ‖ x_{k} (t + s) ‖ = ‖ {(A_{k} - B_{k} K_{k})}^{s} x_{k} (t) ‖ ⩽ c ρ^{s} ‖ x_{k} (t) ‖ \tag{16} $$ for any ρ with $| λ | < ρ < 1$ and some constant $c ⩾ 1$ depending on ρ. In (16), $‖ x_{k} (t + s) ‖$ is bounded by a decaying geometric sequence and converges to 0 as $s ⟶ \infty$ since $| λ | < 1$.
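The mechanism in Theorem 1 can be illustrated with scalar agents, for which ${(B_{k}^{T} B_{k})}^{- 1} B_{k}^{T} C_{k}$ reduces to $C_{k} / B_{k}$. The sketch below is a minimal example under stated assumptions: the two-agent coefficients are hypothetical, $Q_{k} = R_{k} = 1$, the model is fully known, and the Riccati equation is solved by a plain fixed-point iteration rather than by the method of [23].

```python
def solve_scalar_riccati(a, b, q, r, iterations=500):
    """Fixed-point iteration for the scalar discrete-time Riccati equation."""
    p = q
    for _ in range(iterations):
        p = a * a * p - (a * p * b) ** 2 / (r + b * b * p) + q
    return p


def feedback_gain(a, b, q, r):
    """Scalar version of the gain K_k in (11)."""
    p = solve_scalar_riccati(a, b, q, r)
    return b * p * a / (r + b * b * p)


# Hypothetical coupled system: x_k(t+1) = a*x_k + b*u_k + c*y_k,
# where y_k is the state of the other agent.
agents = [
    {"a": 1.2, "b": 1.0, "c": 0.8},
    {"a": 0.9, "b": 0.5, "c": -1.1},
]
gains = [feedback_gain(ag["a"], ag["b"], 1.0, 1.0) for ag in agents]

x = [3.0, -2.0]
for _ in range(60):
    y = [x[1], x[0]]  # each agent observes the other agent's state
    # Control (12): stabilize own state and cancel the coupling term.
    u = [-gains[k] * x[k] - agents[k]["c"] * y[k] / agents[k]["b"]
         for k in range(2)]
    x = [agents[k]["a"] * x[k] + agents[k]["b"] * u[k] + agents[k]["c"] * y[k]
         for k in range(2)]
```

After the cancellation each state follows $x_{k} (t + 1) = (A_{k} - B_{k} K_{k}) x_{k} (t)$, so both states shrink geometrically even though neither agent accounts for the other's objective.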

4. MDO in nonlinear system learning and control

While most control-engineering systems built upon physical laws are linear, many real-world complex systems are nonlinear by nature [5]. Nonlinear systems are generally much more complex. In addition, no control solution is known that applies to all nonlinear systems. In this section, supposing that the nonlinear system satisfies the necessary conditions for a control solution to exist [37], we extend the MDO theory to nonlinear control problems in general form.

4.1. Formulation of nonlinear-system MDO

In this section, we reuse the MDO-IDF technique presented in [36] for nonlinear and continuous systems. Briefly, suppose that we have a nonlinear, bounded system that could be decoupled for K agents to cooperatively control $$ x (t + 1) = \begin{bmatrix} x_{i_{1}} (t + 1) \\ x_{i_{2}} (t + 1) \\ \vdots \\ x_{i_{K}} (t + 1) \end{bmatrix} = f (x (t), u (t)) = \begin{bmatrix} f_{1} (x_{i_{1}} (t), x_{i_{k \neq 1}} (t), u_{j_{1}} (t)) \\ f_{2} (x_{i_{2}} (t), x_{i_{k \neq 2}} (t), u_{j_{2}} (t)) \\ \vdots \\ f_{K} (x_{i_{K}} (t), x_{i_{k \neq K}} (t), u_{j_{K}} (t)) \end{bmatrix} \tag{17} $$ such that (17) has the 0-equilibrium point $$ f (x = 0, u = 0) = 0, \text{ or } f_{k} (x_{i_{k}} = 0, x_{i_{l \neq k}} = 0, u_{j_{k}} = 0) = 0 \quad \forall k \tag{18} $$ Here, the description, notation and properties of x, u and the agent indices ${i_{1}}, {i_{2}}, \dots, {i_{K}}, {j_{1}}, {j_{2}}, \dots, {j_{K}}$ are the same as in the linear system section. In system (17), as shown on the right side, each agent is only responsible for controlling its own sub-state vector, and it may know the other agents’ states. In addition, we assume the following continuity property: for any agent k and any $x_{i_{l \neq k}}$, agent k can reach any state within its state boundary. For the goal as shown in (4)–(5), for any agent k, we can choose a positive-definite function $p_{k} (x_{k})$ allowing us to set up the optimization goal: minimizing $$ J (x_{k} (0)) = \sum_{t = 0}^{\infty} γ^{t} p_{k} (x_{k} (t)) \tag{19} $$ where $0 < γ < 1$ is the discount factor. Theoretically, the choice of $p_{k} (x_{k})$ is important to guarantee the convergence of the multi-MDP solution.
It is easy to see that the solution to optimize (19) for every agent k is also the solution for the global optimization problem: minimizing $$ J (x (0)) = \sum_{t = 0}^{\infty} γ^{t} p (x (t)), \quad \text{where } p (x (t)) = \sum_{k = 1}^{K} p_{k} (x_{k} (t)) \tag{20} $$ This completes the formulation of the MDO-IDF nonlinear system.

4.2. Multiple MDP to control the nonlinear system by MDO-IDF

The key point in [36] is that each agent k has L MDP models from which to apply the policy-iteration algorithm [38] for learning and control. Here, L is the discretization resolution, or the number of discrete instances, into which agent k uniformly discretizes the other agents’ states $x_{i_{l \neq k}}$. This allows writing (17) for each agent k as $$ x_{i_{k}} (t + 1) = \begin{cases} f_{k, x_{i_{l \neq k}} = z_{1}} (x_{i_{k}} (t), u_{j_{k}} (t)) \\ f_{k, x_{i_{l \neq k}} = z_{2}} (x_{i_{k}} (t), u_{j_{k}} (t)) \\ \vdots \\ f_{k, x_{i_{l \neq k}} = z_{L}} (x_{i_{k}} (t), u_{j_{k}} (t)) \end{cases} \tag{21} $$ where $z_{1}, z_{2}, \dots, z_{L}$ are the discrete forms of $x_{i_{l \neq k}}$. From each equation on the right side of (21), we can convert the equation into an MDP as shown in [35,36]. In every iteration, agent k discretizes $x_{i_{l \neq k}}$ and looks up the corresponding MDP for learning and control.
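The lookup structure behind (21) can be sketched as follows. This is a simplified illustration, not the implementation of [36]: the other agents’ state is taken to be a scalar in $[-1, 1]$, the dynamics `f_k` are hypothetical, each frozen model stands in for the MDP that would be built from it, and the policy-iteration step is omitted.

```python
def make_models(f_k, bound, L):
    """Freeze the other agents' state at L bin centres, giving L models of (x, u)."""
    width = 2.0 * bound / L
    centres = [-bound + (l + 0.5) * width for l in range(L)]
    # The default argument z=z binds each centre to its own closure.
    models = [lambda x, u, z=z: f_k(x, z, u) for z in centres]
    return centres, models


def select_model(y, bound, L):
    """Index of the uniform bin of [-bound, bound] that contains y."""
    width = 2.0 * bound / L
    return min(max(int((y + bound) // width), 0), L - 1)


def f_k(x, y, u):
    """Hypothetical scalar dynamics of agent k (illustration only)."""
    return 0.5 * x + 0.2 * y * y + u


centres, models = make_models(f_k, bound=1.0, L=4)
idx = select_model(0.3, bound=1.0, L=4)  # the observed y = 0.3 falls in bin 2
x_next = models[idx](1.0, -0.1)          # prediction with y frozen at 0.25
```

A larger L makes the frozen value closer to the observed one, at the cost of building and solving more models per agent.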

4.3. Theoretical analysis of the multiple MDP solutions in MDO-IDF

Similar to the linear system analysis, for any agent k, we rewrite $$ x_{k} (t + 1) = f_{k} (x_{k} (t), y_{k} (t), u_{k} (t)) \tag{22} $$ As in [35], we linearize (22) at $x_{k} = p$, considering $y_{k}$ and $u_{k}$ as constants, by Taylor series expansion $$ f_{k} (x_{k} (t), y_{k} (t), u_{k} (t)) \approx f_{k} (p, y_{k} (t), u_{k} (t)) + S (x_{k} (t) - p) \tag{23} $$ where $$ S = \left. \begin{bmatrix} \frac{\partial f_{k_{1}}}{\partial x_{1}} & \frac{\partial f_{k_{1}}}{\partial x_{2}} & \dots & \frac{\partial f_{k_{1}}}{\partial x_{n_{k}}} \\ \frac{\partial f_{k_{2}}}{\partial x_{1}} & \frac{\partial f_{k_{2}}}{\partial x_{2}} & \dots & \frac{\partial f_{k_{2}}}{\partial x_{n_{k}}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_{k_{n_{k}}}}{\partial x_{1}} & \frac{\partial f_{k_{n_{k}}}}{\partial x_{2}} & \dots & \frac{\partial f_{k_{n_{k}}}}{\partial x_{n_{k}}} \end{bmatrix} \right|_{x_{k} = p; \; u_{k}, y_{k}} \tag{24} $$ in which $x_{1}, \dots, x_{n_{k}}$ denote the components of $x_{k}$. Then, a sufficient condition to ensure that the MDP solutions in each agent are close to the optimal stabilizing solution is that the eigenvalue of S with the largest magnitude lies within the unit circle for every agent k and all values of $x_{k}$, $u_{k}$ and $y_{k}$.
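The eigenvalue condition on S can be checked numerically at a given point. The sketch below is an illustration under assumptions: the two-dimensional dynamics `f_k` are hypothetical, S is estimated by central differences rather than derived analytically, and the check is done at a single point, whereas the condition above is required for all $x_{k}$, $u_{k}$ and $y_{k}$.

```python
import cmath
import math


def f_k(x, y, u):
    """Hypothetical two-dimensional agent dynamics (illustration only)."""
    return [0.5 * x[0] + 0.1 * math.sin(x[1]) + 0.2 * y,
            0.3 * x[0] * x[1] + 0.4 * x[1] + u]


def jacobian_wrt_state(f, p, y, u, h=1e-6):
    """Central-difference estimate of S in (24), with y and u held fixed."""
    n = len(p)
    S = [[0.0] * n for _ in range(n)]
    for col in range(n):
        hi = list(p)
        lo = list(p)
        hi[col] += h
        lo[col] -= h
        f_hi, f_lo = f(hi, y, u), f(lo, y, u)
        for row in range(n):
            S[row][col] = (f_hi[row] - f_lo[row]) / (2.0 * h)
    return S


def spectral_radius_2x2(S):
    """Largest eigenvalue magnitude of a 2x2 matrix via its characteristic polynomial."""
    trace = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return max(abs((trace + root) / 2.0), abs((trace - root) / 2.0))


S = jacobian_wrt_state(f_k, p=[0.0, 0.0], y=0.7, u=-0.2)
rho = spectral_radius_2x2(S)  # below 1 means the condition holds at this point
```

For this hypothetical `f_k` the Jacobian at the origin is upper triangular with diagonal entries 0.5 and 0.4, so the estimated radius is close to 0.5.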

Given the condition above, we can prove that the multi-MDP solution is guaranteed to stabilize (17) as follows. As in [35], we know that the MDP would eventually stabilize the separated system. However, it is unclear whether agent k’s state moving closer to 0 always helps the other agents to move their states closer to 0. Therefore, in the analysis, we focus on showing that any agent k has the ability to move closer to 0 regardless of the other agents’ states, so that the entire system would eventually converge. To achieve this, we prove the following theorems.

Theorem 2.
For any agent k, for any $y_{k}$ as constant, there always exists a control law to stabilize the system $$ f_{k} (x_{k} (t), y_{k}, u_{k} (t)) ⟶ 0 \quad \text{as } t ⟶ \infty $$ for any starting point $x_{k} (0)$.

The proof for this theorem relies on the following assumption on system (17): for any agent k and any $x_{i_{l \neq k}}$, agent k can reach any state within its state boundary. This assumption means that for any starting point $x_{k} (0)$ and any $y_{k}$, there always exists a sequence of controls $u_{k} (t)$ allowing $x_{k} (t) ⟶ 0$. Then, at $x_{k} (t) = 0$, for stabilization, $u_{k} (t)$ should be the solution of the equation $f_{k} (0, y_{k}, u_{k} (t)) = 0$. This solution should exist because of the continuity in (17). Because of this assumption, when we consider $u_{k}$ as the variable in the function $f_{k} (0, y_{k}, u_{k} (t))$, there would exist $u_{k}$ to reach any point in the closed hypersphere of radius δ centered at 0. Then, by the intermediate value theorem [42], the solution for $f_{k} (0, y_{k}, u_{k} (t)) = 0$ must exist.

Theorem 3.
For any agent k, there exists a function $p_{k} (x_{k})$ such that if k discretizes its state and action as in [11,35], then any MDP stabilizing k toward 0 would satisfy $$ p_{k} (x_{k}) = 0 \Leftrightarrow x_{k} = 0 $$ and $$ ‖ p_{k} (x_{k} (t + 1)) ‖ ⩽ ‖ p_{k} (x_{k} (t)) ‖ $$ and if there exists $u_{k} (t)$ such that $‖ f_{k} (x_{k}, y_{k}, u_{k} (t)) ‖ ⩽ ‖ p_{k} (x_{k} (t)) ‖$, then $$ ‖ p_{k} (x_{k} (t + 1)) ‖ < ‖ p_{k} (x_{k} (t)) ‖ $$

To prove this theorem, we briefly review the discretization process in [34] as follows. Suppose that each dimension of $x_{k}$ and $u_{k}$ is symmetrically bounded by $[- χ, χ]$ and $[- μ, μ]$, where $χ, μ > 0$ are known boundaries for $x_{k}$ and $u_{k}$, correspondingly. Let $M_{k}$, usually $M_{k} > L$, be the number of intervals into which we uniformly divide each dimension of $x_{k}$ and $u_{k}$. Therefore, the entire state space is divided into $M_{k}^{n_{k}}$ small hypercubes with edge length $2 χ / M_{k}$. The control space is divided into $M_{k}^{m_{k}}$ small hypercubes with edge length $2 μ / M_{k}$. All points inside a hypercube are discretely represented by the center of the hypercube. Points on the border between two hypercubes are represented by the center of the ‘left’ hypercube. Mathematically, the discretization process is described by the following formulas $$ x_{k} [i] ⟶ θ_{x} + χ / M_{k} \quad \forall i \in [1, n_{k}] \text{ and } x_{k} [i] \in [θ_{x}, θ_{x} + 2 χ / M_{k}) \tag{25} $$ $$ u_{k} [i] ⟶ θ_{u} + μ / M_{k} \quad \forall i \in [1, m_{k}] \text{ and } u_{k} [i] \in [θ_{u}, θ_{u} + 2 μ / M_{k}) \tag{26} $$ where $θ_{x} \in \{- χ, - χ + 2 χ / M_{k}, - χ + 4 χ / M_{k}, \dots, χ - 2 χ / M_{k}\}$ and $θ_{u} \in \{- μ, - μ + 2 μ / M_{k}, - μ + 4 μ / M_{k}, \dots, μ - 2 μ / M_{k}\}$ are the ‘left’ boundaries of the hypercubes. We denote $x_{dis}$ and $u_{dis}$ as the discrete state and control vectors of $x_{k}$ and $u_{k}$, correspondingly.
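The mapping in (25) can be sketched for a single dimension as follows. This is a minimal illustration, not the code of [34]: the function name `discretize` and the sample values are hypothetical, cells are treated as half-open on the right as written in (25), and the upper boundary value is clamped into the last cell, which is an assumption made here.

```python
import math


def discretize(value, bound, M):
    """Map a value in [-bound, bound] to the centre of its uniform cell."""
    edge = 2.0 * bound / M                      # cell edge length
    cell = int(math.floor((value + bound) / edge))
    cell = min(max(cell, 0), M - 1)             # clamp the boundary value
    left = -bound + cell * edge                 # 'left' boundary of the cell
    return left + bound / M                     # cell centre, as in (25)


# One dimension with bound 1 and M = 4 cells of length 0.5.
x_dis = [discretize(v, bound=1.0, M=4) for v in [0.3, -0.75, 1.0]]
```

The same function applies to each control dimension in (26) by passing the control bound instead of the state bound.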

The discretization in (25) and (26) allows partitioning the state space into different layers as follows:
– The 0th layer contains only the hypercube with the $0$-equilibrium point.
– Recursively, the hth layer ($h > 0$) contains all hypercubes outer-neighboring the $(h - 1)$th layer.
Figure 1 demonstrates the layers defined above in a two-dimensional state space. In addition, by the continuity assumption and the choice of $M_{k}$ to discretize the action space, for every state of layer $h > 0$, there must exist a discrete action transiting the next state toward the inner layers.

Fig. 1.
Demonstration of state-layers for discretization in two dimensions.

From the discretization in (25) and (26) and the layer definition above, we set up the penalty function as follows $$ p_{k} (x_{k}) = - β_{h} ‖ x_{k} ‖ \tag{27} $$ in which h denotes the layer to which $x_{k}$ belongs. $β_{h}$ are non-negative constants chosen as follows $$ \begin{cases} β_{0} = 0 \\ β_{1} = 1 \\ β_{h} = \frac{β_{h - 1} \sqrt{n_{k}}}{1 - γ} \quad \forall h > 1 \end{cases} \tag{28} $$

It is easy to see that $p_{k} (x_{k})$ defined in (27) and (28) would satisfy the conditions in Theorem 3. Denote δ as the size of the hypercube presented in (25), (26). At the hth layer, $‖ x_{dis} ‖$ is bounded by $$ ‖ x_{dis} ‖ = \sqrt{\sum_{i = 1}^{n_{k}} {(x_{dis} [i])}^{2}} < h δ \sqrt{n_{k}} \tag{29} $$ From the definition of $p_{k} (x_{k})$ in (27) and (28), for every state at the hth layer ($h > 0$), we have $$ \sum_{t = 0}^{\infty} γ^{t} β_{h} ‖ x_{dis} (t) ‖ < \sum_{t = 0}^{\infty} γ^{t} β_{h} h δ \sqrt{n_{k}} = \frac{β_{h} h δ \sqrt{n_{k}}}{1 - γ} = β_{h + 1} h δ \tag{30} $$ In (30), the left side is the upper bound of $J (x_{k})$ when executing actions such that the state does not transit toward the outer layers. Here, the upper bound is calculated in the worst-case scenario, when the state simply stays in the hth layer. In addition, the right side is the lower bound of $J (x_{k})$ when executing actions such that the state transits toward the outer layers. Therefore, for any state, the optimal policy should avoid actions that could transit toward the outer layers, which have larger $p_{k}$. This guarantees $‖ p_{k} (x_{k} (t + 1)) ‖ ⩽ ‖ p_{k} (x_{k} (t)) ‖$. Furthermore, from any state $x_{k}$, if there exists an action $u_{k}$ transiting the next state toward the inner layers, then the optimal policy should transit the state toward the inner layers, which means $‖ p_{k} (x_{k} (t + 1)) ‖ < ‖ p_{k} (x_{k} (t)) ‖$.
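The constants in (28) and the layer of a cell can be computed as in the sketch below. It is an illustration under assumptions: the layer index is taken to be the largest per-dimension offset from the centre cell, which assumes a grid with a single centre cell and square rings of neighbours as suggested by Fig. 1, and the function names are hypothetical.

```python
import math


def beta_sequence(n_k, gamma, layers):
    """Constants beta_0 .. beta_layers from the recursion (28)."""
    betas = [0.0, 1.0]
    for _ in range(2, layers + 1):
        betas.append(betas[-1] * math.sqrt(n_k) / (1.0 - gamma))
    return betas[: layers + 1]


def layer_of(cell_index, centre_index):
    """Layer of a cell: largest per-dimension offset from the centre cell (assumed)."""
    return max(abs(c - c0) for c, c0 in zip(cell_index, centre_index))


def penalty_magnitude(x, h, betas):
    """Magnitude of the penalty (27) for a state x lying in layer h."""
    return betas[h] * math.sqrt(sum(v * v for v in x))


betas = beta_sequence(n_k=2, gamma=0.5, layers=3)
h = layer_of((3, 1), (2, 2))  # one ring away from the centre cell
```

Each constant grows by the factor $\sqrt{n_{k}} / (1 - γ)$, which is exactly the factor produced by the geometric sum in (30); this is what makes staying in a layer forever cheaper than stepping outward once.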

Theorems 2 and 3 guarantee that every MDO-IDF agent k could drive its own system toward the equilibrium point $x_{k} (t + 1) = x_{k} (t) = 0$ regardless of the other agents’ states. However, the control law at the 0-stable point is the solution of $f_{k} (0, y_{k}, u_{k}) = 0$, which does not always have $u_{k} = 0$. Nevertheless, because (17) has the equilibrium point $f (0, 0) = 0$, it is easy to prove the following theorem.

Theorem 4.
The multiple-MDP approach, in which the MDPs are set up as in [11,35], would stabilize every MDO-IDF agent to the 0-stable point $[x_{k}, u_{k}] = 0$ when $N_{k} ⟶ \infty$, $M_{k} ⟶ \infty$ and $L ⟶ \infty$ (which implies higher discretizing resolution).

The proof for $N_{k} ⟶ \infty$, $M_{k} ⟶ \infty$ is essentially in [35]. The proof for $L ⟶ \infty$ is simple. First, because every agent state would move toward the inner layers, or inner hypercubes, as soon as possible (Theorem 3), $y_{k}$ would be closer to 0. For any agent k, let $y_{dis}$ denote the discretization of $y_{k}$. Because of the continuity property in (17), the solution of $f_{k} (0, y_{k}, u_{k}) = 0$, which could be written in the form $$ u_{k} = q (y_{k}), \quad q (0) = 0 \tag{31} $$ has to be continuous. By the Taylor series expansion, for every $y_{k}$ in the discretizing hypercube containing $y_{k} = 0$ $$ ‖ u_{k} - 0 ‖ = ‖ q (y_{k}) - q (0) ‖ ⩽ \left‖ \left. \frac{\partial q}{\partial y_{k}} \right|_{0} \right‖ ‖ y_{k} - 0 ‖ + O ({‖ y_{k} - 0 ‖}^{2}) \tag{32} $$ As $L ⟶ \infty$, the size of the hypercube for $y_{k}$ approaches 0, or $‖ y_{k} - 0 ‖ ⟶ 0$. Also, provided that q is smooth enough for the expansion (32) to hold, the norm of the Jacobian matrix $\left. \frac{\partial q}{\partial y_{k}} \right|_{0}$ is finite. Therefore, $‖ u_{k} ‖ ⟶ 0$.

5. MDO for reinforcement learning

In RL, MDO faces the additional problem of an unknown environment. In this work, without changing the characteristics of the RL problem, we assume that A in equation (1) is unknown, and so is f in equation (17). Here, we assume that B in equation (1) is known, to make the problem simpler while still keeping the system unknown. Therefore, for each agent, $A_{k}$ and $C_{k}$ (in equations (10)–(14)) are unknown in the linear system, and so is $f_{k}$ in the nonlinear system. To tackle the unknown environment, for each MDO agent, we apply system identification [21,27] to approximate $A_{k}$, $C_{k}$ and $f_{k}$ by ${\hat{A}}_{k}$, ${\hat{C}}_{k}$, and ${\hat{f}}_{k}$ as we did in [35]. Then, ${\hat{A}}_{k}$, ${\hat{C}}_{k}$, and ${\hat{f}}_{k}$ would be used in computing the control as shown in (11)–(12) for the linear system and in MDP construction for the nonlinear system.

Related to system identification, we reuse the ‘window size’ Ω concept in [35]. Briefly, the window size Ω decides how frequently we call the identification while applying control techniques. Here, the choice of Ω and of the system identification algorithm should minimize the identification error $$ e (t) = ‖ x (t) - \hat{x} (t) ‖ \tag{33} $$ where $$ {\hat{x}}_{k} (t) = {\hat{A}}_{k} x_{k} (t - 1) + B_{k} u_{k} (t - 1) + {\hat{C}}_{k} y_{k} (t - 1) \tag{34} $$ in the linear system, and $$ {\hat{x}}_{k} (t) = {\hat{f}}_{k} (x_{k} (t - 1), y_{k} (t - 1), u_{k} (t - 1)) \tag{35} $$ in the nonlinear system. For the linear system, which is computationally inexpensive, the identification algorithm estimates $A_{k}$, $C_{k}$ partially by setting $Ω = 1$ $$ A_{k} (t) = A_{k} (t - 1) + α \frac{(x (t) - \hat{x} (t)) x {(t - 1)}^{T}}{1 + x {(t - 1)}^{T} x (t - 1)} \tag{36} $$ $$ C_{k} (t) = C_{k} (t - 1) + α e (t) y {(t - 1)}^{T} 1 \tag{37} $$ in which α is the learning rate and $1$ is the column vector with the dimension of $y_{k}$. The theoretical analysis to derive (36) and (37) could be found in [24,31] using the gradient descent approach. For the nonlinear system, we apply the same neural network [14] technique implemented in Matlab [29] as we already showed in [4], for the MDO-IDF design.
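A normalized gradient step of the kind used for the linear identification can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: only the estimate of $A_{k}$ is updated, the known term $B_{k} u_{k}$ is passed in as `known_input`, the coupling term is left out, and the step is taken in the direction that reduces the squared prediction error. The matrices and helper names are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def norm(v):
    """Euclidean norm of a vector."""
    return sum(x * x for x in v) ** 0.5


def identification_step(A_hat, x_prev, x_now, known_input, alpha):
    """One normalized gradient step on the estimate A_hat."""
    predicted = [p + b for p, b in zip(matvec(A_hat, x_prev), known_input)]
    error = [xn - xp for xn, xp in zip(x_now, predicted)]
    scale = alpha / (1.0 + sum(v * v for v in x_prev))
    updated = [[A_hat[r][c] + scale * error[r] * x_prev[c]
                for c in range(len(x_prev))]
               for r in range(len(error))]
    return updated, error


# Hypothetical true dynamics; the estimate starts from the identity matrix.
A_true = [[1.0, 0.2], [-0.3, 0.9]]
A_hat = [[1.0, 0.0], [0.0, 1.0]]
x_prev = [1.0, 2.0]
known_input = [0.0, 0.0]  # stands for the known term B_k u_k(t - 1)
x_now = [a + b for a, b in zip(matvec(A_true, x_prev), known_input)]

A_new, err_before = identification_step(A_hat, x_prev, x_now, known_input, alpha=0.5)
_, err_after = identification_step(A_new, x_prev, x_now, known_input, alpha=0.5)
```

On the same sample, one step shrinks the prediction error by the factor $1 - α s$ with $s = x^{T} x / (1 + x^{T} x)$, so a learning rate between 0 and 1 always reduces the error without overshooting.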

6. Simulation

In this section, we test the MDO on several existing adaptive control problems and compare its performance with other state-of-the-art approaches. Our simulation experiments include benchmark problems, which help explain how the MDO agents interact and cooperate, and real-world problems for comparison.

6.1. MDO agents stabilize the linear system

6.1.1. Toy example: When the MDO agents would ‘work together’ or ‘work separately’

Fig. 2.

Two agents reach the stable point together in linear system.

The first simulation explores one interesting characteristic of the linear system: agent k behaves more like a separated system as the other agents get closer to their stable points. To be more precise, in (12) and (13), $x_{k} (t + 1) = A_{k} x_{k} (t) + B_{k} u_{k} (t) + C_{k} y_{k} (t) ⟶ A_{k} x_{k} (t) + B_{k} u_{k} (t)$ and $u_{k} (t) = - K_{k} x_{k} (t) - {(B_{k}^{T} B_{k})}^{- 1} B_{k}^{T} C_{k} y_{k} (t) ⟶ - K_{k} x_{k} (t)$ as $y_{k} ⟶ 0$. Therefore, we demonstrate two scenarios in a simple two-agent system: one in which the agents appear to stabilize their systems together, and one in which one agent drives its state to 0 much faster, leaving the other agent behaving as a separated system. We use the following example of a strongly coupled and autonomously unstable system:

$\begin{bmatrix} x_{1} (t + 1) \\ x_{2} (t + 1) \end{bmatrix} = \begin{bmatrix} 1 & 1.5 \\ 1.5 & 1 \end{bmatrix} \begin{bmatrix} x_{1} (t) \\ x_{2} (t) \end{bmatrix} + \begin{bmatrix} u_{1} (t) \\ u_{2} (t) \end{bmatrix} \qquad (38)$

Here, agent 1 is responsible for stabilizing $[x_{1}, u_{1}]$ and agent 2 is responsible for stabilizing $[x_{2}, u_{2}]$. In Fig. 2, the starting point is $x_{1} (0) = x_{2} (0) = 2$, the learning rate is $α = 0.05$, and the optimization target is to minimize $\sum_{t = 0}^{\infty} (x_{{i_{k}}} {(t)}^{T} x_{{i_{k}}} (t) + u_{{j_{k}}} {(t)}^{T} u_{{j_{k}}} (t))$, as in (8). For identification, we set the initial $A_{k} (0)$ to the identity matrix and $C_{k} (0)$ to the zero matrix. In this figure, the two agents follow each other closely, and both reach the stable point after about 20 iterations. In Fig. 3, the starting point is $x_{1} (0) = 6$, $x_{2} (0) = - 1$. Here, the state of agent 2 reaches its stable point much earlier (at the 10th iteration). After that, agent 1 reaches its stable point in a straightforward manner, which is the pattern of a separated system.
In addition, we can see that although agent 2 stabilizes its state faster, its control does not go to 0 at the same time, because it still needs to ‘cancel out’ the impact $C_{2} y_{2}$, which means that agent 1 has not reached 0 yet.
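The ‘cancel out’ behaviour can be reproduced with a short simulation of system (38). This is a sketch under strong assumptions: the dynamics are taken as known (no identification phase, so the learning transients of Figs 2 and 3 do not appear), and $K_{k}$ is assumed to be the scalar LQR gain of the separated system with unit state and control weights. The function names are illustrative.

```python
def scalar_lqr_gain(a, b, q=1.0, r=1.0, iters=200):
    """Iterate the scalar discrete-time Riccati equation and return the gain K."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return (b * p * a) / (r + b * b * p)

def simulate(x1, x2, steps=20):
    """Run both 'selfish' agents on system (38) with known dynamics."""
    a, b, c = 1.0, 1.0, 1.5          # A_k, B_k and C_k of each scalar agent
    k = scalar_lqr_gain(a, b)
    history = [(x1, x2, 0.0, 0.0)]
    for _ in range(steps):
        # u_k = -K_k x_k - (B_k^T B_k)^{-1} B_k^T C_k y_k, which is c / b for scalars
        u1 = -k * x1 - (c / b) * x2
        u2 = -k * x2 - (c / b) * x1
        x1, x2 = (a * x1 + c * x2 + b * u1,
                  a * x2 + c * x1 + b * u2)
        history.append((x1, x2, u1, u2))
    return k, history

gain, trace = simulate(6.0, -1.0)    # starting point of Fig. 3
first_u2 = trace[1][3]               # agent 2's first control action
final_x1, final_x2 = trace[-1][0], trace[-1][1]
```

Because each agent cancels the coupling exactly, every state contracts by the factor $1 - K_{k}$ per step regardless of the other agent, while agent 2's first control is dominated by the term $- 1.5 x_{1} (0)$ rather than by its own small state.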

Fig. 3.

One agent reaches the stable point earlier and leaves the other agent behaving as in a separated system.

6.1.2. A real-world problem: Automated vehicle mass-spring system

Fig. 4.

The automated vehicle mass-spring systems. Upper: at the default positions; lower: not at the default position.

Fig. 5.

Learning and control performance in mass-spring system. (a) MDO agents; (b) comparison among MDO, selective decentralization (SD) and adaptive dynamic programing (ADP). 100 iteration = 1 second.

In Figs 4 and 5, we show the performance of the MDO-IDF agents in learning to control the mass-spring system, which is a real-world problem and more complex than the simulation above. Here, we assume three automated vehicles travelling together (e.g. a large truck that needs to carry several cars). They are connected to each other by springs to avoid collision in rare scenarios (e.g. one vehicle's control unit suddenly loses power). The vehicles do not know the other vehicles’ parameters or the spring parameters. By default, the three vehicles travel with the same velocity and the distances among them should be kept constant. However, many disturbances can drive the vehicles away from their default positions during travel (e.g. unexpected road friction). Therefore, each vehicle (annotated as a mass m) has its own control unit u to help it return to the default position.

In the example of Fig. 4, when the automated vehicles are away from their default positions, the forces acting on each vehicle are:

– Vehicle $m_{1}$: force $\vec{F_{12, 1}}$ from spring $k_{12}$ pushing to the right, and individual control $\vec{u_{1}}$ pushing to the right toward the default position $p_{1}$.

– Vehicle $m_{2}$: force $\vec{F_{12, 2}}$ from spring $k_{12}$ pushing to the left, force $\vec{F_{23, 2}}$ from spring $k_{23}$ pushing to the right, and individual control $\vec{u_{2}}$ pushing to the right toward the default position $p_{2}$.

– Vehicle $m_{3}$: force $\vec{F_{23, 3}}$ from spring $k_{23}$ pushing to the left, and individual control $\vec{u_{3}}$ pushing to the left toward the default position $p_{3}$.

Writing Newton’s second law in vector form [22] for these vehicles, without loss of generality, we have

$\begin{cases} m_{1} \vec{a_{1}} = \vec{F_{12, 1}} + \vec{u_{1}} \\ m_{2} \vec{a_{2}} = \vec{F_{12, 2}} + \vec{F_{23, 2}} + \vec{u_{2}} \\ m_{3} \vec{a_{3}} = \vec{F_{23, 3}} + \vec{u_{3}} \end{cases} \qquad (39)$

where $\vec{a_{1}}$, $\vec{a_{2}}$ and $\vec{a_{3}}$ stand for the accelerations of $m_{1}$, $m_{2}$ and $m_{3}$, respectively. Let $(x_{1}, v_{1})$, $(x_{2}, v_{2})$ and $(x_{3}, v_{3})$ denote their displacements and velocities. Applying Hooke’s law for elastic springs [16] and discretizing (39) with a small time interval $Δ t$, we have the system

$x (t + 1) = \begin{bmatrix} 1 & Δ t & 0 & 0 & 0 & 0 \\ - \frac{k_{12} Δ t}{m_{1}} & 1 & \frac{k_{12} Δ t}{m_{1}} & 0 & 0 & 0 \\ 0 & 0 & 1 & Δ t & 0 & 0 \\ - \frac{k_{12} Δ t}{m_{2}} & 0 & \frac{(k_{12} + k_{23}) Δ t}{m_{2}} & 1 & - \frac{k_{23} Δ t}{m_{2}} & 0 \\ 0 & 0 & 0 & 0 & 1 & Δ t \\ 0 & 0 & \frac{k_{23} Δ t}{m_{3}} & 0 & - \frac{k_{23} Δ t}{m_{3}} & 1 \end{bmatrix} x (t) + \begin{bmatrix} 0 & 0 & 0 \\ \frac{Δ t}{m_{1}} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & \frac{Δ t}{m_{2}} & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \frac{Δ t}{m_{3}} \end{bmatrix} u (t) \qquad (40)$

where $x = {[x_{1}, v_{1}, x_{2}, v_{2}, x_{3}, v_{3}]}^{T}$ and $u = {[u_{1}, u_{2}, u_{3}]}^{T}$. In this experiment, we set the vehicles’ masses to $m_{1} = m_{2} = m_{3} = 1$ (kg). The spring constants are $k_{12} = k_{23} = 0.5$ (N/m). The remaining system parameter is $Δ t = 0.01$ (s). Also, in (5), (6), the stabilizing optimization uses $p (x) = x^{T} x$ and $q (u) = u^{T} u$, which implies that the vehicles should return to the default positions as soon as possible and maintain the same velocities, with a ‘reasonable trade-off’ against control effort. In this case, we have

$x (t + 1) = \begin{bmatrix} 1 & 0.01 & 0 & 0 & 0 & 0 \\ - 0.01 & 1 & 0.01 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0.01 & 0 & 0 \\ - 0.01 & 0 & 0.015 & 1 & - 0.005 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0.01 \\ 0 & 0 & 0.005 & 0 & - 0.005 & 1 \end{bmatrix} x (t) + \begin{bmatrix} 0 & 0 & 0 \\ 0.01 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0.01 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0.01 \end{bmatrix} u (t) \qquad (41)$

Here, the dimensionalities of x and u are $N = 6$ and $M = 3$. Three agents jointly stabilize the system. Indexing the six components of the state vector as $x_{1}, \ldots, x_{6}$, the agents are assigned $[x_{1}, x_{2}, u_{1}]$, $[x_{3}, x_{4}, u_{2}]$ and $[x_{5}, x_{6}, u_{3}]$, respectively. The starting point is $x (0) = (- 0.5, 0, - 0.3, 0, 0.2, 0)$ and the learning rate is $α = 0.001$.
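The agent assignment above induces, for each agent, the local form $x_{k} (t + 1) = A_{k} x_{k} (t) + B_{k} u_{k} (t) + C_{k} y_{k} (t)$, where $y_{k}$ collects the states owned by the other agents. The sketch below extracts these blocks from the numerical system (41) and checks that the three local updates together reproduce the global update. The helper names are illustrative, the kinematic rows use $Δ t = 0.01$ as in (40), and in the paper's setting $A_{k}$ and $C_{k}$ would be estimated rather than read off.

```python
dt = 0.01
A = [
    [1.0,   dt,  0.0,   0.0,  0.0,   0.0],
    [-0.01, 1.0, 0.01,  0.0,  0.0,   0.0],
    [0.0,   0.0, 1.0,   dt,   0.0,   0.0],
    [-0.01, 0.0, 0.015, 1.0, -0.005, 0.0],
    [0.0,   0.0, 0.0,   0.0,  1.0,   dt],
    [0.0,   0.0, 0.005, 0.0, -0.005, 1.0],
]
B = [
    [0.0, 0.0, 0.0], [0.01, 0.0, 0.0],
    [0.0, 0.0, 0.0], [0.0, 0.01, 0.0],
    [0.0, 0.0, 0.0], [0.0, 0.0, 0.01],
]
# (state indices, control index) per agent, zero-based.
AGENTS = [([0, 1], 0), ([2, 3], 1), ([4, 5], 2)]

def local_blocks(rows, u_col):
    """Return A_k, B_k, C_k and the indices of the other agents' states."""
    others = [j for j in range(len(A)) if j not in rows]
    A_k = [[A[i][j] for j in rows] for i in rows]
    C_k = [[A[i][j] for j in others] for i in rows]
    B_k = [B[i][u_col] for i in rows]      # each agent owns one control input
    return A_k, B_k, C_k, others

def local_step(rows, u_col, x, u):
    """x_k(t+1) = A_k x_k + B_k u_k + C_k y_k for one agent."""
    A_k, B_k, C_k, others = local_blocks(rows, u_col)
    x_k = [x[i] for i in rows]
    y_k = [x[j] for j in others]
    return [sum(a * v for a, v in zip(A_k[r], x_k))
            + B_k[r] * u[u_col]
            + sum(c * v for c, v in zip(C_k[r], y_k))
            for r in range(len(rows))]

def global_step(x, u):
    """x(t+1) = A x + B u for the whole system."""
    return [sum(A[i][j] * x[j] for j in range(6))
            + sum(B[i][m] * u[m] for m in range(3))
            for i in range(6)]

x0 = [-0.5, 0.0, -0.3, 0.0, 0.2, 0.0]
u0 = [0.1, -0.2, 0.3]                      # arbitrary test input
stacked = [v for rows, col in AGENTS for v in local_step(rows, col, x0, u0)]
A_1, B_1, C_1, others_1 = local_blocks(*AGENTS[0])
```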

In Fig. 5, we see that these agents eventually stabilize the system after some periods of oscillation due to learning. We also implement adaptive dynamic programming (ADP) [19], which is among the most well-developed recent techniques in RL and adaptive control, and the selective decentralization (SD) technique [35], which was shown to outperform ADP in several examples. Here, SD stabilizes the system state slightly faster than the MDO, but at the expense of a higher control cost. The ADP [19] is not able to stabilize the system. Investigating the issue with ADP, we found that the system (40), (41) is not Hurwitz; therefore, it is difficult for the ADP to initialize suitable parameters for system identification [19,20].

Fig. 6.

Upper: state and control of the agents in system (42). Lower: comparison among MDO, selective decentralization (SD) and ADP.

Fig. 7.

Upper: state and control of the agents in system (43). Lower: comparison among MDO, selective decentralization (SD) and ADP.

6.2. MDO agents stabilize the nonlinear system

Unlike in the linear system, in the nonlinear system an agent does not necessarily behave like a separated system when the other agents reach their stable points. Therefore, in this section, we demonstrate how the MDO agents stabilize the entire system in two toy examples. In the first example (Fig. 6), we have

$x (t + 1) = \begin{bmatrix} \sin (0.6 x_{1} (t) + 0.2 x_{2} (t) + 0.2 x_{3} (t) + u_{1} (t)) \\ \sin (0.2 x_{1} (t) + 0.6 x_{2} (t) + 0.2 x_{3} (t) + u_{2} (t)) \\ \sin (0.2 x_{1} (t) + 0.2 x_{2} (t) + 0.6 x_{3} (t) + u_{3} (t)) \end{bmatrix} \qquad (42)$

This example has three MDO agents, which are responsible for $[x_{1}, u_{1}]$, $[x_{2}, u_{2}]$ and $[x_{3}, u_{3}]$. It is easy to see that for the first agent, when $x_{2}, x_{3} ⟶ 0$, its system approaches $\sin (0.6 x_{1} (t) + u_{1} (t))$; that is, it behaves more like a separated system. In the second example (Fig. 7), we have

$x (t + 1) = \begin{bmatrix} \sin (0.4 x_{1} (t) + 0.3 x_{2} (t) + 0.3 x_{3} (t) + u_{1} (t)) \\ \sin (0.3 x_{1} (t) + 0.4 x_{2} (t) + 0.3 x_{3} (t) + u_{2} (t)) \\ \sin (x_{3} (t)) - 1 + \cos (0.3 x_{2} (t) + 0.3 x_{1} (t)) + u_{3} (t) \end{bmatrix} \qquad (43)$

Here, we also have three MDO agents, as in (42). It is easy to see that the third agent, whose ‘separated’ system is $\sin (x_{3} (t)) - 1 + u_{3} (t)$, does not have its separated stable point at $[x_{3} = 0, u_{3} = 0]$. When $x_{1}, x_{2} ⟶ 0$, its system approaches $\sin (x_{3} (t)) + u_{3} (t)$, which is no longer the separated form.
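The difference between the two examples can be checked numerically. The sketch below evaluates the third agent of (43) in its coupled form, with the other two agents' states at zero, and in its ‘separated’ form, where the coupling term is dropped, and contrasts it with the first agent of (42). The function names are illustrative.

```python
import math

def agent3_coupled(x1, x2, x3, u3):
    """Third row of (43): depends on the other agents through the cosine term."""
    return math.sin(x3) - 1.0 + math.cos(0.3 * x2 + 0.3 * x1) + u3

def agent3_separated(x3, u3):
    """'Separated' form of the third agent: the coupling term is dropped."""
    return math.sin(x3) - 1.0 + u3

def agent1_coupled(x1, x2, x3, u1):
    """First row of (42)."""
    return math.sin(0.6 * x1 + 0.2 * x2 + 0.2 * x3 + u1)

# With the other agents at zero, cos(0) = 1 cancels the constant -1,
# so the origin is an equilibrium of the coupled system ...
coupled_at_origin = agent3_coupled(0.0, 0.0, 0.0, 0.0)
# ... but not of the separated form, which maps the origin to -1.
separated_at_origin = agent3_separated(0.0, 0.0)
# In (42), the first agent reduces to sin(0.6 x1 + u1) when x2 = x3 = 0.
agent1_reduced = agent1_coupled(0.1, 0.0, 0.0, 0.02)
```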

For both (42) and (43), we set up the MDO-IDF agents and the discrete-MDP solution as shown in [4]. Briefly, each agent uniformly divides the bounded range of its state $(x_{k})$ and control $(u_{k})$, which is known to be between −0.2 and 0.2, into $M_{k} = N_{k} = 7$ regions. Each agent k also divides each dimension of $x_{l \neq k}$ into $L_{k} = 3$ regions. Therefore, each agent computes $3^{2} = 9$ MDPs of size $7 \times 7 \times 7$ for learning and control. The starting point is $x_{1} = 0.2$, $x_{2} = 0.16$, $x_{3} = - 0.2$ for both (42) and (43). For identification, each agent has neural networks of 50 hidden layers. We use a window size of $Ω = 50$. In each training round, the training data set is reused at most 1000 times (epochs) [26] to improve identification. The maximum number of iterations t is 5000.
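The discretization described above can be sketched as follows. This is an illustration only: the helper names are hypothetical, uniform bins with clamping at the range boundaries are assumed, and the $7 \times 7 \times 7$ table is interpreted here as (state region, control region, next-state region), which is an assumption about the layout rather than a detail given in the text.

```python
from itertools import product

LOW, HIGH = -0.2, 0.2   # known bounds of every state and control
OWN_BINS = 7            # regions for the agent's own state and control
OTHER_BINS = 3          # regions per dimension of the other agents' states

def bin_index(value, n_bins, low=LOW, high=HIGH):
    """Map a value in [low, high] to one of n_bins uniform regions."""
    frac = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(frac * n_bins)))

def mdp_keys(n_other_states=2):
    """One MDP per combination of discretized other-agent states."""
    return list(product(range(OTHER_BINS), repeat=n_other_states))

def empty_transition_table():
    """Counts indexed by [state region][control region][next-state region]."""
    return [[[0] * OWN_BINS for _ in range(OWN_BINS)] for _ in range(OWN_BINS)]

def select_mdp(other_states):
    """Pick the MDP that matches the current states of the other agents."""
    return tuple(bin_index(v, OTHER_BINS) for v in other_states)

# Agent 1 at the starting point x1 = 0.2, with x2 = 0.16 and x3 = -0.2.
tables = {key: empty_transition_table() for key in mdp_keys()}
active_key = select_mdp((0.16, -0.2))
own_state_bin = bin_index(0.2, OWN_BINS)
```

Since the number of tables is $L_{k}$ raised to the number of other-agent state dimensions, this layout makes the exponential memory growth with system size explicit.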

Overall, in both examples, the selective decentralization (SD), MDO and ADP agents all stabilize the systems. SD and MDO drive the systems toward the zero-equilibrium point faster than ADP. MDO and SD behave similarly during the first few tens of iterations. Here, the oscillation reflects the ‘learning’ period, when the agents still need to identify the system and make errors frequently. ADP has a smoother learning curve, which implies ‘less aggressive’ learning but slower convergence.

6.3. MDO in stabilizing noisy system

In this simulation section, we want to answer the following questions. First, to what extent could the MDO improve the system identification, compared to the centralized approach, given an increasing level of noise? Second, to what extent could the selective decentralization stabilize the system faster than the centralized approach, given an increasing level of noise? To simplify the analysis, this section reuses (42) with an added noise element:

$x (t + 1) = r (t) \times \begin{bmatrix} \sin (0.6 x_{1} (t) + 0.2 x_{2} (t) + 0.2 x_{3} (t) + u_{1} (t)) \\ \sin (0.2 x_{1} (t) + 0.6 x_{2} (t) + 0.2 x_{3} (t) + u_{2} (t)) \\ \sin (0.2 x_{1} (t) + 0.2 x_{2} (t) + 0.6 x_{3} (t) + u_{3} (t)) \end{bmatrix} \qquad (44)$

in which r is a random and unknown noise vector with an expected value of 0. We assume that r follows a multivariate normal distribution. Thus, the level of noise depends on the standard deviation of r.

Compared to the centralized approach, the MDO shows lower learning performance when the noise level is low. However, as the noise level increases, the MDO performance approaches and eventually exceeds the centralized performance, as shown in Fig. 9 and Fig. 10. Figures 8 and 9 show typical examples of the MDO performance in low-noise and high-noise scenarios. In the low-noise scenario, the standard deviation of r is 0.005; in the high-noise scenario, it is 0.1. Interestingly, in these figures, both the MDO and the centralized approach converge x to relatively similar levels. This phenomenon also appears as the noise level increases. Therefore, we conclude that the performance of the MDO is mostly decided by the action vector u.

Fig. 8.

State and control trajectory of the MDO and centralized system in small-noise scenarios.

Fig. 9.

State and control trajectory of the MDO and centralized system in large-noise scenarios.

Fig. 10.

The converging learning performance of the MDO and the centralized system when the noise standard deviation increases.

7. Discussions

In this work, we show that the MDO agents have two strategies to handle the impact of the other agents’ states on their own state. The first strategy is to ‘cancel out’ the impact, which is possible when each agent can write its own system in the format $x_{k} (t + 1) = F_{k} (x_{k} (t)) + G_{k} (u_{k} (t)) + H_{k} (y_{k} (t))$, as shown in the linear system. In this case, the control includes two components: one that handles the ‘separated form’ $x_{k} (t + 1) = F_{k} (x_{k} (t)) + G_{k} (u_{k} (t))$ and one, $U (y_{k} (t))$, chosen such that $G_{k} (U (y_{k} (t))) + H_{k} (y_{k} (t)) = 0$. The control law for the linear system is derived using this strategy: $U (y_{k} (t)) = - {(B_{k}^{T} B_{k})}^{- 1} B_{k}^{T} C_{k} y_{k} (t)$. The second strategy is to write the agent’s system as a set of multiple independent agent-systems, in which each agent-system corresponds to a discretized value of $y_{k}$. This strategy is applied when the overall system is in a general nonlinear format and cannot be rewritten as in the first strategy.
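The first strategy can be checked on a concrete agent. The sketch below takes the blocks of the first mass-spring agent in (41), with $y_{1}$ collecting the four states of the other two agents, computes the cancelling gain $- {(B_{k}^{T} B_{k})}^{- 1} B_{k}^{T} C_{k}$ that appears in the linear control law of Section 6.1.1, and verifies that $B_{k} U + C_{k} = 0$. The function name is illustrative and a single control input per agent is assumed.

```python
# Blocks of the first mass-spring agent, read off (41).
B_1 = [0.0, 0.01]                       # column vector: one control input
C_1 = [[0.0, 0.0, 0.0, 0.0],
       [0.01, 0.0, 0.0, 0.0]]           # coupling from the other agents' states

def cancelling_gain(B_k, C_k):
    """Return U = -(B^T B)^{-1} B^T C for a single-input agent (B is a column)."""
    btb = sum(b * b for b in B_k)                        # scalar B^T B
    btc = [sum(B_k[i] * C_k[i][j] for i in range(len(B_k)))
           for j in range(len(C_k[0]))]                  # row vector B^T C
    return [-v / btb for v in btc]

U_1 = cancelling_gain(B_1, C_1)

# Residual B U + C: zero means the coupling is cancelled exactly.
residual = [[B_1[i] * U_1[j] + C_1[i][j] for j in range(len(U_1))]
            for i in range(len(B_1))]
```

The cancellation is exact here because every column of $C_{1}$ lies in the range of $B_{1}$; when that is not the case, the same formula only gives the least-squares best cancellation and a nonzero residual remains.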

This work shows that choosing the right objective function is critical to ensure that the MDO solution stabilizes the system. As in (28)–(31), once the objective function is chosen such that ‘moving toward the stable equilibrium point is better than moving away from it’, the local MDO agents eventually bring their sub-states toward the local equilibrium points. In addition, in (9) and (20), the global optimal value is equal to the sum of the local optimal values because we choose quadratic objective functions. The sufficient condition is that the global optimal value improves whenever the local optimal value of one MDO agent improves, supposing that the other agents’ optimal values remain the same. The quadratic function is just one special case that satisfies this condition.

Our experimental results show that there is a trade-off between learning smoothness and convergence speed among these techniques. Both SD and MDO are ‘aggressive’ learners. These techniques have oscillating learning curves, and the agents’ state/action trajectories do not strictly decrease during the first few tens of iterations. In contrast, adaptive dynamic programming shows smooth learning curves, and its state/action trajectories decrease almost strictly. However, MDO and SD stabilize the system significantly faster than ADP.

Provided that the learning agents have learned the system well enough to skip the system identification phase, the ADP technique requires fewer computational resources, since it uses neural networks whose size (mostly determined by the input and output layers) grows linearly with the system size. In contrast, SD and MDO require storing the Markov transition matrices, whose size grows exponentially with the system size. Therefore, in practice, we may use MDO and SD for off-line training to take advantage of their lower sample requirements. The MDO and SD results could then be reused in ADP training to improve performance.

Because we place no restriction on the format of the nonlinear system, it is difficult to find a closed-form solution, even for sub-optimal control. Therefore, we applied the discretized MDP approach as the control algorithm for nonlinear systems. There are three disadvantages of applying MDP. First, this method requires several ad-hoc parameters to discretize the state and action spaces. Second, this method is memory consuming: the memory needed for storing the MDP grows exponentially with the system dimensionality. Third, the MDP solution assumes several tight conditions to ensure convergence. Therefore, in narrower system formats, such as systems in feedback linearization form, other control techniques such as Adaptive Dynamic Programming [5,23,25,33,36–38] would be more suitable.

Another limitation of this work is that the theoretical analysis assumes that system identification lets the MDO agent precisely estimate its state dynamics. However, this assumption does not always hold, given that the performance of system identification depends significantly on several other parameters, such as the learning rate [42] and the initialization of the estimators [34]. Also, it is possible that a combination of these parameters completely misidentifies the system, leading to uncontrollable cases. Therefore, in order to apply MDO, the implementer needs to carefully examine the performance of system identification prior to learning and control.

8. Conclusions

Compared to our prior work on applying MDO in decentralized RL [36], this work provides a more comprehensive analysis of how to set up the MDO and of the convergence of the MDO agents in learning and control problems. In this work, supposing that each MDO agent precisely learns its own system, the agent is guaranteed to bring its state closer to the stable point regardless of the other agents’ states. Therefore, the entire system moves closer to and eventually reaches the stable point $x = 0$, $u = 0$. In addition to stability, we also address the performance of the MDO system in a noisy environment, and show that the MDO outperforms the centralized approach when the noise level is high. Referring to the four challenges in decentralized RL, we can see that the MDO tackles these challenges as follows. First, there is no restriction on when and to whom an agent should communicate. Second, the MDO agent should define a local and limited goal. Third, each agent can use system identification to tackle the unknown nature of the RL problem.

Our simulation examples show that the MDO may have similar, if not better, performance in unknown system stabilization and learning compared with several state-of-the-art approaches, including selective decentralization (SD) [33] and adaptive dynamic programming (ADP) [19,47–53]. However, due to the lack of theoretical analysis of convergence speed in SD and ADP, we are not able to quantify the performance gained by MDO. In the linear system, which is easier to set up for real-world problems, we found that the Hurwitz dependency is one reason for the superior performance of MDO over ADP. When the system is non-Hurwitz, ADP faces much more difficulty in initialization, which further leads to poor performance. Since nonlinear systems are significantly more complex, we were unable to incorporate more real-world nonlinear problems in this work. This is left for future work, with intensive support from domain experts, such as in hydroengineering.

Acknowledgements

This work was funded by the INFEWS program (award ID 2017-67003-26057), which is supported via an interagency collaboration between United States Department of Agriculture and National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

References

1. Abu-Khalaf and F.L. Lewis, Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach, Automatica 41(5) (2005), 779–791. doi:10.1016/j.automatica.2004.11.034.

2. N.M. Alexandrov and M.Y. Hussaini, Multidisciplinary Design Optimization: State of the Art, SIAM, 1997.

3. N.M. Alexandrov and R.M. Lewis, Analytical and computational aspects of collaborative optimization for multidisciplinary design, AIAA Journal 40(2) (2002), 301–309. doi:10.2514/2.1646.

4. Arslan and Yuksel, Decentralized Q-learning for stochastic teams and games, IEEE Transactions on Automatic Control 62(4) (2016), 1545–1558. doi:10.1109/TAC.2016.2598476.

5. D.P. Atherton, Nonlinear Control Engineering, Van Nostrand Reinhold, New York, 1982.

6. R.J. Balling and Sobieszczanski-Sobieski, Optimization of coupled systems-a critical overview of approaches, AIAA Journal 34(1) (1996), 6–17. doi:10.2514/3.13015.

7. R.W. Beard, G.N. Saridis and J.T. Wen, Galerkin approximations of the generalized Hamilton–Jacobi–Bellman equation, Automatica 33(12) (1997), 2159–2177. doi:10.1016/S0005-1098(97)00128-3.

8. Busoniu, Babuska and De Schutter, A comprehensive survey of multiagent reinforcement learning, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 38(2) (2008). doi:10.1109/TSMCC.2007.913919.

9. Buşoniu, Babuška and De Schutter, Multi-Agent Reinforcement Learning: An Overview: Innovations in Multi-Agent Systems and Applications-1, Springer, 2010, pp. 183–221.

10. E.J. Cramer, Dennis, John, P.D. Frank, R.M. Lewis and G.R. Shubin, Problem formulation for multidisciplinary optimization, SIAM Journal on Optimization 4(4) (1994), 754–776. doi:10.1137/0804044.

11. F.L. Da Silva, Glatt and A.H.R. Costa, MOO-MDP: An object-oriented representation for cooperative multiagent reinforcement learning, IEEE Transactions on Cybernetics 99 (2017), 1–13.

12. Deb, Current trends in evolutionary multi-objective optimization, International Journal for Simulation and Multidisciplinary Design Optimization 1(1) (2007), 1–8. doi:10.1051/ijsmdo:2007001.

13. Giesing and J.-F. Barthelemy, A summary of industry MDO applications and needs, in: 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, 1998, p. 4737.

14. Hecht-Nielsen, Theory of the Backpropagation Neural Network: Neural Networks for Perception, Elsevier, 1992, pp. 65–93.

15. G.E. Hinton and T.J. Sejnowski, Unsupervised Learning: Foundations of Neural Computation, MIT Press, 1999.

16. Hooke, De Potentia Restitutiva, or of Spring Explaining the Power of Springing Bodies, Vol. 1678, John Martyn, London, UK, p. 23.

17. C.-S. Huang, Wang and Teo, Solving Hamilton–Jacobi–Bellman equations by a modified method of characteristics, Nonlinear Analysis: Theory, Methods & Applications 40(1) (2000), 279–293. doi:10.1016/S0362-546X(00)85016-6.

18. P.A. Ioannou, Decentralized adaptive control of interconnected systems, IEEE Transactions on Automatic Control 31(4) (1986), 291–298. doi:10.1109/TAC.1986.1104282.

19. Jiang and Jiang, Robust Adaptive Dynamic Programming, John Wiley & Sons, Inc., 2017.

20. Jiang and Jiang, Off-policy learning for a turbocharged diesel engine, from http://yu-jiang.github.io/radpbookdemos/Ch2Ex2/.

21. K.J. Keesman, System Identification: An Introduction, Springer-Verlag, 2011.

22. Kleppner and Kolenkow, An Introduction to Mechanics, Cambridge University Press, 2013.

23. Lancaster and Rodman, Algebraic Riccati Equations, Clarendon Press, 1995.

24. J.W. Lee and Jangmin, A multi-agent Q-learning framework for optimizing stock trading systems, in: International Conference on Database and Expert Systems Applications, Berlin, Germany, 2002, pp. 153–162. doi:10.1007/3-540-46146-9_16.

25. Liberzon, Calculus of Variations and Optimal Control Theory: A Concise Introduction, Princeton University Press, 2012.

26. Liu, Wang and Li, Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online learning optimal control approach, IEEE Transactions on Neural Networks and Learning Systems 25(2) (2014), 418–428. doi:10.1109/TNNLS.2013.2280013.

27. Lyzell, Initialization Methods for System Identification, Linköping University Electronic Press, 2009.

28. J.R. Martins and A.B. Lambe, Multidisciplinary design optimization: A survey of architectures, AIAA Journal 51(9) (2013), 2049–2075. doi:10.2514/1.J051895.

29. I.N.C. Mathwork and Train, Train Neural Network, retrieved from https://www.mathworks.com/help/nnet/ref/train.html on Dec 15, 2018.

30. K.S. Narendra and Parthasarathy, Identification and control of dynamical systems using neural networks, IEEE Transactions on Neural Networks 1(1) (1990), 4–27. doi:10.1109/72.80202.

31. D.T. Nguyen, Kumar and H.C. Lau, Policy gradient with value function approximation for collective multiagent planning, Advances in Neural Information Processing Systems (2017), 4322–4332.

32. D.T. Nguyen, Yeoh, H.C. Lau, Zilberstein and Zhang, Decentralized multi-agent reinforcement learning in average-reward dynamic DCOPs, in: Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 1341–1342.

33. Nguyen and Mukhopadhyay, Identification and optimal control of large-scale system using selective decentralization, in: Proc. IEEE International Conference on Systems, Man, and Cybernetics, Budapest, 2016, pp. 503–508.

34. Nguyen and Mukhopadhyay, Selectively decentralized Q-learning, in: Proc. IEEE International Conference on Systems, Man, and Cybernetics, Banff, Canada, 2017, pp. 328–333.

35. Nguyen and Mukhopadhyay, Two-phase selective decentralization to improve reinforcement learning systems with MDP, AI Communications 31(4), 319–337. doi:10.3233/AIC-180766.

36. Nguyen and Mukhopadhyay, Multidisciplinary Optimization in Decentralized Reinforcement Learning, in: 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 2017, pp. 779–784.

37. T.M. Nguyen, Selectively Decentralized Reinforcement Learning, 2018.

38. Russell and Norvig, Artificial Intelligence: A Modern Approach, 3rd edn, Prentice Hall, 2010.

39. G.N. Saridis and C.-S.G. Lee, An approximation theory of optimal control for trainable manipulators, IEEE Transactions on Systems, Man, and Cybernetics 9(3) (1979), 152–159. doi:10.1109/TSMC.1979.4310171.

40. Scharpff, D.M. Roijers, F.A. Oliehoek, M.T. Spaan and M.M. de Weerdt, Solving Transition-Independent Multi-Agent MDPs with Sparse Interactions, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 3174–3180.

41. Shi and S.K. Singh, Decentralized adaptive controller design for large-scale systems with higher order interconnections, IEEE Transactions on Automatic Control 37(8) (1992), 1106–1118. doi:10.1109/9.151092.

42. Stewart, Calculus, 6th edn, Thomson Learning, 2013.

43. Tampuu, Matiisen, Kodelja, Kuzovkin, Korjus, Aru, Aru and Vicente, Multiagent cooperation and competition with deep reinforcement learning, PLoS ONE 12(4), e0172395. doi:10.1371/journal.pone.0172395.

44. Van De Steeg, M.M. Drugan and Wiering, Temporal difference learning for the game tic-tac-toe 3d: Applying structure to neural networks, in: IEEE Symposium Series on Computational Intelligence, 2015, pp. 564–570.

45. J.-S. Wang and Y.-P. Chen, A fully automated recurrent neural network for unknown dynamic system identification and control, IEEE Transactions on Circuits and Systems I: Regular Papers 53(6) (2006), 1363–1372.

46. C.J. Watkins and Dayan, Q-learning, Machine Learning 8(3–4) (1992), 279–292. doi:10.1007/BF00992698.

47. Wei, F.L. Lewis, Liu, Song and Lin, Discrete-time local value iteration adaptive dynamic programming: Convergence analysis, IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017).

48. Wei, F.L. Lewis, Sun, Yan and Song, Discrete-time deterministic Q-learning: A novel convergence analysis, IEEE Transactions on Cybernetics 47(5) (2017), 1224–1237. doi:10.1109/TCYB.2016.2542923.

49. Wei, Liu and Lin, Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems, IEEE Transactions on Cybernetics 46(3) (2016), 840–853. doi:10.1109/TCYB.2015.2492242.

50. Wei, Liu and Lin, Discrete-time local iterative adaptive dynamic programming: Terminations and admissibility analysis, IEEE Transactions on Neural Networks and Learning Systems (2016).

51. Wei, Liu, Lin and Song, Adaptive dynamic programming for discrete-time zero-sum games, IEEE Transactions on Neural Networks and Learning Systems (2017).

52. Wei, Liu, Lin and Song, Discrete-time optimal control via local policy iteration adaptive dynamic programming, IEEE Transactions on Cybernetics (2017).

53. Wei, Song and Yan, Data-driven zero-sum neuro-optimal control for a class of continuous-time unknown nonlinear systems with disturbance using ADP, IEEE Transactions on Neural Networks and Learning Systems 27(2) (2016), 444–458. doi:10.1109/TNNLS.2015.2464080.

54. Weiss, Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, MIT Press, 1999.

55. Xiao-Qun and Jun-An, Parameter identification and backstepping control of uncertain Lü system, Chaos, Solitons & Fractals 18(4) (2003), 721–729. doi:10.1016/S0960-0779(02)00659-8.