Abstract
This paper proposes a stochastic optimal controller for networked control systems (NCS) with unknown dynamics and medium access constraints. The medium access constraint of the NCS is modelled as a Markov Decision Process (MDP) that switches modes depending on the channel access granted to the actuators. We then show that, under the MDP assumption, the NCS with medium access constraint can be modelled as a Markovian jump linear system. A stochastic optimal controller is then proposed that minimizes a quadratic cost function using a Q-learning algorithm. The resulting control algorithm simultaneously optimizes the quadratic cost function and allocates the network bandwidth judiciously by designing a scheduler. Two compensation strategies, transmit zero and zero-order hold, are studied for control inputs that fail to gain access to the channel. The proposed controller and scheduler are illustrated using experiments on networks and simulations on an industrial four-tank system. The advantage of the proposed approach is that the optimal controller and scheduler can be designed forward-in-time for NCS with unknown dynamics. This is a departure from traditional dynamic programming based approaches, which assume complete knowledge of the NCS dynamics and network constraints beforehand in order to solve the optimal control problem backward-in-time.
Keywords
Introduction
In industries, many control applications typically exchange information over a shared communication channel such as a field bus [1]. Such industrial control loops are prototypical examples of networked control systems (NCS), which are widely studied in the control literature [2, 3, 4]. Sharing finite network bandwidth gives rise to medium access constraints, that is, there are insufficient channels to accommodate all controller and sensor requests at a given time instant [5]. Network access constraints can not only degrade the NCS performance but, at times, even destabilize the system [6]. In such a situation, the controller design problem should include a scheduling and compensation policy for control inputs that did not gain access to the channel.
In the literature, three types of scheduling policies have been studied for dealing with medium access constraints: static [7, 8, 9], dynamic [10, 11, 12] and hybrid [13, 14]. In static scheduling policies the channel access is decided off-line, which leads to a simple implementation. However, the adoption of static protocols in industries has not been significant, and it is limited by the time-varying behaviour of the network access constraint. Dynamic protocols, on the other hand, use real-time feedback to meet the system performance requirements. The advantage of dynamic scheduling is its performance robustness to time-varying medium access constraints. However, dynamic scheduling is computationally intensive and requires continuous monitoring of the communication channels. Hybrid scheduling combines the simplicity of static protocols with the performance of dynamic protocols. The authors in [13] showed that hybrid scheduling is a promising approach to designing optimal controllers. In spite of these developments, combining control performance and scheduling in one framework remains a challenge.
To the best of our knowledge, Lincoln and Bernhardsson [15] were the first to study the combination of optimal control with scheduling, and they proposed a static scheduling protocol. The authors used a dynamic programming approach in which the search tree was pruned to avoid the combinatorial explosion problem. In [16] the authors proposed stability conditions for NCS with channel access constraints, assuming prior knowledge of the communication sequence. That investigation showed that the control performance is strongly coupled with the scheduling policy, and it relied on restrictive assumptions to prove robustness. These conclusions were reaffirmed in [17], where the authors proved that, even for a given medium access policy, realizing optimal controllers is hard for random access protocols. The authors in [18] stated that co-designing the optimal controller and scheduler is very complicated or even unsolvable.
To overcome the shortcomings of deterministic approaches, stochastic protocols were proposed. Guo and Wang [14] studied random access protocols that allowed only a subset of actuators to have medium access and receive the control input. At the next instant, a different set of actuators was given access. The switching between the two sets of actuators was defined using a Markov process. More recently, the investigation in [6] studied the problem of combining packet dropouts with medium access constraints and provided a Markov random group access protocol.
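The Markov switching between actuator sets described above can be illustrated with a minimal sketch. The function name `simulate_access_chain`, the two-group setup, and the transition probabilities are hypothetical and chosen only for illustration; they are not taken from [14] or [6].

```python
import random

def simulate_access_chain(transition, start, steps, seed=0):
    """Sample a sequence of actuator-group indices from a Markov chain.

    transition[i][j] is the assumed probability of moving from group i
    to group j; each row is expected to sum to 1.
    """
    rng = random.Random(seed)
    state = start
    sequence = [state]
    for _ in range(steps):
        # Draw the next group using the row of the current group as weights.
        state = rng.choices(range(len(transition)), weights=transition[state])[0]
        sequence.append(state)
    return sequence

# Hypothetical example: group 0 serves actuator 1, group 1 serves actuator 2.
P = [[0.3, 0.7],
     [0.6, 0.4]]
seq = simulate_access_chain(P, start=0, steps=10)
```

Each entry of `seq` names the actuator group that holds the channel at that time step; a strictly alternating protocol corresponds to the degenerate matrix `[[0, 1], [1, 0]]`.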
The problems of stochastic optimal controller and scheduler design using stochastic approaches have also been studied in [19, 9]. The authors proposed an LQG controller for NCS subjected to medium access constraints and proved that, with a periodic communication sequence, the detectability and observability of the NCS are preserved. The optimal performance of NCS with network bandwidth and coding constraints was studied in [20], where it was obtained using spectral factorization and a partial fraction technique. A robust control approach is proposed in [34], which confirms the effectiveness of the design in the presence of constraints and parameter uncertainties; a controller based on a generalization of the Lyapunov function and then a Linear-Quadratic Regulator (LQR) are applied with a prescribed degree of stability. Learning in an unknown environment with a genetic algorithm is studied in [39], and metaheuristics are of specific interest to decision makers because they can find good solutions to complex problems in a reasonable amount of time [38]. The authors propose instance-specific parameter tuning and discuss an automated approach that does not require explicit knowledge of the metaheuristic used. Methods to solve the response time variability problem are discussed in [36], which also shows how to design specific hybrid metaheuristics.
Two assumptions in all the above approaches render them unsuitable for industries: (i) information on network imperfections is known beforehand and (ii) high fidelity models are available. Usually, information on network imperfections is not available beforehand, and in practice obtaining high fidelity models in industries is cumbersome and costly. On the other hand, industrial controllers have access to a variety of data that can be aggregated in a centralized fashion using supervisory control and data acquisition (SCADA) systems. Therefore, industries require data-driven methods that solve the optimization problem forward-in-time.
In the literature, [35] discusses the applicability of Reinforcement Learning (RL) to multiple access design in order to reduce energy consumption and achieve low latency in wireless sensor networks (WSNs). While RL maximizes the long-term expected return of the agent, Heuristic Dynamic Programming (HDP) schemes solve optimal control problems forward-in-time using either value or policy iteration. Lewis et al. [21] proposed a Q-learning policy iteration method to solve for the optimal strategies of linear discrete-time systems without involving the process dynamics. The role of HDP in designing optimal controllers for NCS has been investigated only recently. Lewis et al. [22] proposed a Q-function based stochastic optimal and sub-optimal controller for networked control systems subjected to random communication delays and packet losses. Jegannathan and Xu [23] used a neuro dynamic programming approach for stochastic optimal controller design for uncertain non-linear NCS. A stochastic optimal controller for NCS subjected to packet losses and delays was proposed in [24]. These investigations illustrated the role of HDP in designing optimal controllers forward-in-time for NCS without involving the system dynamics. To the best of our knowledge, the role of HDP in co-designing the controller and scheduler in the context of NCS has not been investigated. In particular, the use of experimental data and action dependent heuristic dynamic programming has not been fully explored, with the exception of [23].
This investigation proposes a new Q-learning based stochastic optimal controller for NCS with medium access constraints. The main idea is to use Q-learning to build a model-free stochastic optimal controller that works forward-in-time. To design the controller, the medium access constraint is first modelled as a Markov Decision Process (MDP) in which the future states depend on the current states and the actions taken during the time step. Here the actions are the control inputs of the NCS, which are obtained from the known operating ranges of the actuators. The investigation next shows that the NCS with the channel access constraint modelled as an MDP is a Markovian jump linear system (MJLS) that switches states depending on the actions in that state. Q-learning then uses the MJLS model to design the stochastic optimal controller. The states of the Q-learning are the scheduling variables that denote the actuators that get access to the control signal. It should be pointed out that, although the proposed approach leads to a suboptimal controller design, it provides reasonable performance provided the actions are chosen to reflect the operating conditions. To obtain these actions, experiments on the network and simulations on the process are used. The experiments model the channel access constraint as a Markov chain, which can be combined with the actions (control inputs) to generate the MDP model for the medium access constraint. The MDP model is used in our controller design.
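To make the jump-linear structure concrete, the following is a minimal sketch of a single step of a linear plant whose input channel depends on the current access mode. The plant matrices, the helper names `matvec` and `mjls_step`, and the convention that unserved inputs are set to zero are illustrative assumptions, not the paper's model.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def mjls_step(A, B, x, u, access):
    """One step of x_next = A x + B (access * u), with an elementwise mask.

    access[i] is 1 if actuator i holds the channel in the current mode and
    0 otherwise. Unserved inputs are zeroed here (an assumed convention).
    """
    masked_u = [s * u_i for s, u_i in zip(access, u)]
    return [ax + bu for ax, bu in zip(matvec(A, x), matvec(B, masked_u))]

# Hypothetical 2-state, 2-actuator plant.
A = [[0.5, 0.0], [0.0, 0.5]]
B = [[1.0, 0.0], [0.0, 1.0]]
x = [2.0, 4.0]
u = [1.0, 1.0]
x_mode0 = mjls_step(A, B, x, u, access=[1, 0])  # only actuator 1 is served
x_mode1 = mjls_step(A, B, x, u, access=[0, 1])  # only actuator 2 is served
```

The same state and input produce different successor states in the two modes, which is the sense in which the closed loop jumps between linear dynamics as the access pattern changes.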
An interesting extension of the proposed controller is to industrial applications with periodic communication sequences. This is the most common scenario in industries, where the scheduling cycles are triggered by clocks. Under the periodicity assumption, it is shown that the controller design can be simplified to a finite horizon optimal controller. It is then possible to include the time-step as a state, along with the channel access variable, in the Q-learning. Simulations showed that such an implementation had significant computational benefits. Another interesting scheme, based on ensemble learning, is proposed by the authors of [37]; however, they use fuzzy logic and new optimization techniques, which must be analyzed further for practical application in industries.
The investigation is organized into five sections including the introduction. Section 2 presents the problem formulation and the background required for the analysis. The optimal controller design algorithm and its stability properties are described in Section 3. Section 4 presents the results, and conclusions are drawn from the obtained results in Section 5.
Notations: The following notations are used throughout the paper.
Background
The basic NCS structure considered in the paper is shown in Fig. 1, wherein the control information is transmitted over a shared communication channel. Sharing the network bandwidth with other applications limits the bandwidth available to the NCS, and this gives rise to medium access constraints. As a result, only a few actuators gain access to control signals at any given time instant. To use the network bandwidth judiciously and also achieve optimal performance in a medium access constrained NCS, the controller and scheduler need to be co-designed. Further, the compensation scheme for actuators that fail to gain access to the network also needs to be investigated.
Basic networked control systems configuration with medium access constraints.
The NCS is assumed to be a discrete-time linear time-invariant system
where
The following assumption is used to denote the medium access constraint and the working modes of sensor and actuators similar to the investigations in [25, 23]:
To simplify our analysis, we further assume that the states are directly measured by the sensors and transmitted to the controller. While, the control signals are transmitted as packets to the actuators via the
The medium access status of the control input at any time instant
The control input to the NCS at time instant
where
where
Combining the NCS dynamics in Eq. (1), the control input
From the definition of
The problem considered in this investigation is the design of the optimal controller that minimizes the stochastic cost function
where
Usually, in industries, prior knowledge of the process dynamics and network constraints is not available. Therefore, communication access constrained NCS in industries require optimal controllers that work forward-in-time and with unknown process dynamics.
In this section, stochastic optimal control of NCS with medium access constraints and unknown dynamics is proposed using the idea of Q-learning [26]. The design of the optimal controller requires knowledge of the information transmitted by the scheduler to the actuators that did not gain access to the network. Two approaches have been widely studied in the literature: (i) zero transmission and (ii) zero-order-hold [27, 28].
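The difference between the two compensation strategies can be sketched as follows. The function `apply_compensation` and its argument names are hypothetical; the sketch assumes that each actuator either receives the newly computed input or falls back on zero (zero transmission) or on its previously applied value (zero-order-hold).

```python
def apply_compensation(u_new, u_prev, access, strategy):
    """Return the input that each actuator actually applies.

    u_new    : inputs computed by the controller at this step
    u_prev   : inputs the actuators applied at the previous step
    access   : access[i] is 1 if actuator i received its packet, else 0
    strategy : "zero" (transmit zero) or "hold" (zero-order-hold)
    """
    if strategy not in ("zero", "hold"):
        raise ValueError("strategy must be 'zero' or 'hold'")
    applied = []
    for new, prev, granted in zip(u_new, u_prev, access):
        if granted:
            applied.append(new)    # packet arrived: use the fresh input
        elif strategy == "zero":
            applied.append(0.0)    # no packet: actuator input is zero
        else:
            applied.append(prev)   # no packet: keep the last applied input
    return applied

# Actuator 1 is served, actuator 2 is not.
u_zero = apply_compensation([1.0, 2.0], [0.5, 0.7], [1, 0], "zero")
u_hold = apply_compensation([1.0, 2.0], [0.5, 0.7], [1, 0], "hold")
```

Under zero-order-hold the previously applied input has to be remembered, which is consistent with the need for an augmented state vector in that case.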
The compensator for the zero transmission is given by Eq. (3). Then, combining the system dynamics Eq. (1), the medium access constrained NCS control input in Eq. (3) and the zero transmit scheduler, we obtain
The compensator for zero-order-hold strategy is given by
where
The dynamics of medium access constrained NCS with zero order hold compensator is given by
Using Eq. (9), an augmented state vector consisting of
where the augmented system matrices are given by
It is important to note that, when the system dynamics in Eqs. (3) and (9) are known, the Stochastic Riccati Equation (SRE) can be used to design the controller and scheduler backward-in-time. The stochastic cost function for a given time instant, assuming the zero transmission strategy, can be represented as
where
One can see that, the solution of SRE still requires the prior knowledge of
As stated earlier, obtaining process models in industries is cumbersome, and obtaining an analytical solution to the optimal controller problem is infeasible. Consequently, to design the controller and scheduler forward-in-time for NCS with unknown dynamics, we propose to use a Q-learning based approach. Q-learning is a form of reinforcement learning [30] primarily used in agent based systems, where the agent does not have any model of the environment. The agent has information only on the states, the numerical reward functions and the possible actions at each of these states.
In Q-learning, each state-action pair is assigned an estimated value called the Q-value. The rewards generated on reaching a state are used to update the Q-value estimate of that state. As the rewards are stochastic, the states need to be visited many times to update the Q-values. The optimal action dependent value function
where the rewards are computed as
In this investigation, the scheduling variable
where the step size,
Since the Q-updates are stochastic, the procedure is repeated for many iterations. A greedy action selection can then be used to select the optimal control actions and network schedule. The resulting design provides the stochastic optimal controller and scheduler that optimize the NCS performance. However, since the Q-learning algorithm works online, considerable time is spent on learning. To reduce the learning time, this investigation uses the SOFTMAX action selection [30]. To implement the SOFTMAX action selection, a Boltzmann probability distribution is used and is given by
The use of SOFTMAX in Eq. (16) speeds up the convergence of the Q-learning based optimal controller.
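A minimal sketch of Boltzmann (softmax) action selection is given below. It uses the textbook form, in which action a is chosen with probability proportional to exp(Q(s, a) / tau); the temperature parameter `temperature`, the function names, and the convention that larger Q-values are preferred are assumptions and may differ in detail from Eq. (16).

```python
import math
import random

def boltzmann_probabilities(q_values, temperature=1.0):
    """Map Q-values of one state to action-selection probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_action(q_values, temperature=1.0, rng=random):
    """Sample an action index according to the Boltzmann distribution."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs)[0]

probs = boltzmann_probabilities([1.0, 2.0, 0.5], temperature=0.5)
action = softmax_action([1.0, 2.0, 0.5], temperature=0.5, rng=random.Random(0))
```

A high temperature makes the choice nearly uniform (more exploration), while a low temperature approaches greedy selection.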
Using Assumption 2 and Remark 3, the convergence of the Q-learning based optimal controller design is guaranteed. The stochastic optimal controller approaches the optimal value as each state is visited an infinite number of times.
A special case of scheduling sequence studied in literature is the periodic communication. Here, we show that the periodic communication assumption can greatly reduce the computational complexity with Q-learning based optimal controller design. Furthermore, the periodic communication is a good choice for the design of Q-learning finite horizon optimal controller.
The periodic communication assumption has been studied in [19, 9].
The assumption above is useful because it preserves many properties, such as controllability and detectability, of the medium access constrained NCS. Furthermore, it models many industrial controllers that work on a schedule cycle triggered by clocks.
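One way to picture the periodic case is sketched below: the access pattern repeats after a fixed number of steps, so a learning state can be formed from the phase within the period together with the access pattern at that phase. The names `periodic_schedule` and `q_state` and the example sequence are hypothetical and only illustrate the indexing.

```python
def periodic_schedule(base_sequence, k):
    """Access pattern at step k when the sequence repeats every len(base_sequence) steps."""
    return base_sequence[k % len(base_sequence)]

def q_state(base_sequence, k):
    """Finite-horizon learning state: (phase within the period, access pattern)."""
    phase = k % len(base_sequence)
    return (phase, base_sequence[phase])

# Hypothetical period-2 sequence: actuator 1 is served, then actuator 2.
base = [(1, 0), (0, 1)]
states = [q_state(base, k) for k in range(4)]
```

Because only as many distinct states as the period length appear, the table to be learned stays small regardless of how long the controller runs.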
Using the periodic assumption, the optimal controller design problem in Eq. (6) can be reformulated as a finite horizon optimal control design problem with the control horizon equal to the periodicity of the communication sequence,
The Q-learning based data-driven optimal controller and scheduler co-design algorithm proposed for the infinite horizon optimal controller can be simplified in the case of periodic communication sequence, by taking the time-steps and scheduling variable
Figure 2 shows the block diagram of the stochastic optimal controller proposed in this investigation. The proposed approach is model-free and requires only the reward update from the system. Sensor measurements and the known actions can be used to determine the rewards. Once the rewards are obtained, the Q-value is updated using Eq. (15).
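The reward-driven update can be sketched with the standard tabular Q-learning rule. This is the textbook form and is assumed here rather than copied from Eq. (15); the step size `alpha`, the discount factor `gamma`, the state and action labels, and the reward value are all illustrative.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value.

    Q maps (state, action) pairs to value estimates; unseen pairs start at 0.
    """
    # Value of the best action available in the successor state.
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

Q = defaultdict(float)
actions = ["a0", "a1"]
first = q_update(Q, "s0", "a0", 1.0, "s1", actions, alpha=0.5)
second = q_update(Q, "s0", "a0", 1.0, "s1", actions, alpha=0.5)
```

Only the observed reward and the successor state enter the update, so no plant model is needed; in a regulation problem the reward would typically be the negative of a quadratic stage cost, which is again an assumption here.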
Block diagram of Q-learning based stochastic optimal controller.
Although the proposed Q-learning based optimal controller leads to a sub-optimal design, the design problem is simplified because no process knowledge is assumed. The proposed method uses only feedback information on the reward to design the optimal controller.
To illustrate the effectiveness of the proposed optimal controller, experiments on an industrial network and simulations on a four tank process are used. A detailed review of the four-tank process is presented in [33]. The four tank process used in this study is shown in Fig. 3. The four tank process is non-linear; the model is linearized and then discretized with a sampling time of
where
Description of the quadruple-tank process.
The four tank process is a medium access constrained NCS because, during any given control period, only one of the pumps and valves can receive the control signal. The regulation problem is to achieve mass balance in the four tanks. In other words, the liquid mass flow should be regulated to a steady state value in the presence of disturbances.
Modelling the network as a MDP requires experiments to determine the transition probabilities of
State trajectories of four tank process with static LQR.
The Markov chain thus generated gives the probability for the pumps/valves to get access to the control signals. Combining the Markov chain model with the possible actions, the MDP model required for designing the stochastic controller can be obtained. For example, provided pump 2 operates for a given period and the level of tank 1 reached
First, Fig. 4 indicates that the stochastic optimal control of medium access constrained NCS with known system dynamics and prior information on channel access as a Markov chain, and having initial condition of
State trajectories of four tank process with Q-learning based stochastic optimal controller with zero transmit compensation.
Communication sequence with (a) transmit zero and (b) zero-order-hold compensation for the four tank problem.
Second, Fig. 5 shows the state trajectories of the four tank process with medium access constraints, controlled using the Q-learning based stochastic optimal controller with the zero transmit compensation strategy. One can observe that the proposed controller, scheduler, and compensator strategy make the states converge to zero in steady state. The corresponding communication schedule is shown in Fig. 6a. This result shows that the proposed Q-learning based controller with the zero transmit strategy not only regulates the system state to zero optimally, but also schedules the network bandwidth effectively.
State trajectories of four tank process with Q-learning based stochastic optimal controller with zero-order-hold compensation.
Finally, Fig. 7 shows the state trajectories of the medium access constrained four tank process controlled using the Q-learning based stochastic optimal controller with the zero-order-hold compensation strategy. The controller regulates the states to zero in steady state. The scheduling sequence of the control input is shown in Fig. 6b. This result illustrates the ability of the Q-learning based optimal controller to regulate the system state and schedule the network bandwidth efficiently.
Reward function of the optimal controller.
The plot of the reward function versus time for 500 iterations is shown in Fig. 8 for zero-order-hold strategy. The plot shows that the reward function is monotonic and attains a steady state value equal to zero as the states approach zero. This result illustrates the convergence of the Q-learning algorithm.
Q-learning of the controller is a continuous, never-ending process in the presence of dynamic disturbances. However, in the ideal case of known disturbances occurring at predetermined times, the learning can be stopped once the controller provides the required control performance to the plant. The compensation scheme, however, cannot be derived from the proposed example. When the communication is periodic, the proposed method has the advantage of being computationally simple, at the cost of yielding a sub-optimal controller. The convergence of the proposed algorithm is not derived here; interested readers can refer to [21, 22] for a proof of convergence.
Conclusions
This paper presented a stochastic optimal controller for NCS with medium access constraints and unknown dynamics using Q-learning. NCS with medium access constraints was modelled as a Markovian jump linear system with the channel constraints modelled as a Markov decision process. Two compensation strategies were studied for the controllers, transmit zero and zero-order-hold.
Then the stochastic optimal controller using Q-learning was proposed. The resulting approach co-designed an optimal controller, compensator, and scheduler. The MDP assumption of the channel dynamics guaranteed the convergence of the Q-learning algorithm.
The investigation also showed that, when the communication sequence is periodic, the realization of a finite horizon optimal controller is computationally simple. Finally, the benefits of the proposed controller were illustrated on a four tank process. Our results showed that the Q-learning based stochastic controller optimized the NCS performance in the presence of medium access constraints.
Future directions of this investigation include implementing the proposed Q-learning based optimal controller in a real-scale industrial setting with NCS facing medium access constraints, investigating other compensation mechanisms for controllers that fail to get access to control signals, and studying the stability of the proposed controller. Robustness and convenience are also important parameters to be studied.
