Stochastic optimal controller design for medium access constrained networked control systems with unknown dynamics

Abstract

This paper proposes a stochastic optimal controller for networked control systems (NCS) with unknown dynamics and medium access constraints. The medium access constraint of the NCS is modelled as a Markov Decision Process (MDP) that switches modes depending on the channel access granted to the actuators. We then show that, under the MDP assumption, the NCS with medium access constraint can be modelled as a Markovian jump linear system. A stochastic optimal controller is then proposed that minimizes a quadratic cost function using the Q-learning algorithm. The resulting control algorithm simultaneously optimizes the quadratic cost function and allocates the network bandwidth judiciously by designing a scheduler. Two compensation strategies, transmit zero and zero-order hold, are studied for control inputs that fail to gain access to the channel. The proposed controller and scheduler are illustrated using experiments on networks and simulations on an industrial four-tank system. The advantage of the proposed approach is that the optimal controller and scheduler can be designed forward-in-time for NCS with unknown dynamics. This is a departure from traditional dynamic programming based approaches, which assume complete knowledge of the NCS dynamics and network constraints beforehand in order to solve the optimal controller problem backward-in-time.

Keywords

Networked control systems (NCSs), stochastic optimal controller, Q-learning, medium access constraints, Markov Decision Process (MDP)

1. Introduction

In industry, many control applications typically exchange information over a shared communication channel such as a field bus [1]. Such industrial control loops are prototypical examples of networked control systems (NCS), which are widely studied in the control literature [2, 3, 4]. Sharing a finite network bandwidth gives rise to medium access constraints, meaning that there are insufficient channels to accommodate the controller and sensor requests at any given time instant [5]. The network access constraints can not only degrade the NCS performance, but at times can also be severe enough to destabilize the system [6]. In such a situation, the controller design problem should include a scheduling and compensation policy for control inputs that did not gain access to the channel.

In the literature, three types of scheduling policies have been studied for dealing with medium access constraints: static [7, 8, 9], dynamic [10, 11, 12] and hybrid [13, 14]. In static scheduling policies the channel access is decided off-line, which leads to a simple implementation. Adoption of static protocols in industries has not been significant and is also limited by the time-varying behaviour of the network access constraint. Dynamic protocols, on the other hand, use real-time feedback to meet the system performance. The advantage of dynamic scheduling is performance robustness to time-varying medium access constraints. However, dynamic scheduling is computationally intensive and requires continuous monitoring of the communication channels. Hybrid scheduling combines the simplicity of static protocols with the performance of dynamic protocols. The authors in [13] showed that hybrid scheduling is a promising approach to design optimal controllers. In spite of these developments, combining control performance and scheduling in one framework still remains a challenge.

To our best knowledge, Lincoln and Bernhardsson [15] first studied combining optimal control with scheduling, and proposed a static scheduling protocol. The authors used a dynamic programming approach wherein a pruning strategy of the search tree was used to avoid the combinatorial explosion problem. In [16] the authors proposed stability conditions for NCS with channel access constraints considering prior knowledge of communication sequence. The investigation showed that the control performance is strongly coupled with the scheduling policy and used restricted assumptions to prove the robustness. The conclusions were reaffirmed in the investigation in [17], where even for a given medium access policy, the authors proved that realizing optimal controllers is hard for random access protocols. The authors in [18] stated that co-designing optimal controller and scheduler is very complicated or even unsolvable.

To overcome the shortcomings of deterministic approaches, stochastic protocols were proposed. Guo and Wang [14] studied random access protocols that allowed only a subset of actuators to have medium access and receive the control input. At the next instant, a different set of actuators was provided access. The switching between the two sets of actuators was defined using a Markov process. More recently, the investigation in [6] studied the problem of combining packet dropouts with medium access constraints and provided a Markov random group access protocol.

The problem of stochastic optimal controller and scheduler design using stochastic approaches has also been studied in [19, 9]. The authors proposed an LQG controller for NCS subjected to medium access constraints and proved that, with a periodic communication sequence, the detectability and observability of the NCS are preserved. The optimal performance of NCS with network bandwidth and coding constraints was studied in [20], where it was obtained using spectral factorization and partial fraction techniques. A robust control approach is proposed in [34], which confirms the effectiveness of the design in the presence of constraints and parameter uncertainties. A controller based on a generalization of the Lyapunov function and then a Linear-Quadratic Regulator (LQR) are applied with a prescribed degree of stability. The process of learning in an unknown environment with a Genetic Algorithm is studied in [39], and metaheuristics are of specific interest to decision makers because they can find good solutions to complex problems in a reasonable amount of time [38]. They propose instance-specific parameter tuning and discuss an automated approach that does not require explicit knowledge of the metaheuristic used. Methods to solve the response time variability problem are discussed in [36], which also shows how to design specific hybrid metaheuristics.

Two assumptions in all the above approaches render them unsuitable for industries: (i) information on network imperfections is known beforehand and (ii) high fidelity models are available. Usually, information on network imperfections is not available beforehand, and in practice obtaining high fidelity models in industries is cumbersome and costly. On the other hand, industrial controllers have access to a variety of data that can be aggregated in a centralized fashion using supervisory control and data acquisition (SCADA) systems. Therefore, industries require data-driven methods that solve the optimization problem forward-in-time.

In the literature, [35] discusses the applicability of Reinforcement Learning (RL) to multiple access design in order to reduce energy consumption and to achieve low latency in WSNs. While this approach maximizes the long-term expected return of the agent, Heuristic Dynamic Programming (HDP) schemes solve optimal control problems forward-in-time using either value or policy iterations. Lewis et al. [21] proposed a Q-learning policy iteration method to obtain the optimal strategies for linear discrete-time systems without involving the process dynamics. The role of HDP in designing optimal controllers for NCS has been investigated only recently. Lewis et al. [22] proposed a Q-function based stochastic optimal and sub-optimal controller for networked control systems subjected to random communication delays and packet losses. Jegannathan and Xu [23] used a neuro dynamic programming approach for stochastic optimal controller design for uncertain non-linear NCS. A stochastic optimal controller for NCS subjected to packet losses and delays was proposed in [24]. These investigations illustrated the role of HDP in designing optimal controllers forward-in-time for NCS without involving the system dynamics. To the best of our knowledge, the role of HDP in co-designing the controller and scheduler in the context of NCS has not been investigated. In particular, the use of experimental data and action dependent heuristic dynamic programming has not been fully explored, with the exception of [23].

This investigation proposes a new Q-learning based stochastic optimal controller for NCS with medium access constraints. The main idea is to use Q-learning to build a model-free stochastic optimal controller that works forward-in-time. To design the controller, the medium access constraint is first modelled as a Markov Decision Process (MDP), wherein the future states depend on the current states and the actions taken during the time step. Here the actions are the control inputs of the NCS, which are obtained by knowing the operating ranges of the actuators. The investigation next shows that the NCS with channel access constraint modelled as an MDP is a Markovian jump linear system (MJLS) that switches states depending on the actions in that state. Q-learning then uses the MJLS model to design the stochastic optimal controller. The states of the Q-learning are the scheduling variables that denote the actuators that get access to the control signal. It should be pointed out here that, although the proposed approach leads to a suboptimal controller design, it provides reasonable performance given proper choices of actions that reflect the operating conditions. To obtain these actions, experiments on the network and simulations on the process are used. The experiments model the channel access constraint as a Markov chain, and this can be combined with the actions (control inputs) to generate the MDP model for the medium access constraint. The MDP model is used in our controller design.

An interesting extension of the proposed controller is to industrial applications having periodic communication sequences. This is the most common scenario in industries, wherein the scheduling cycles are triggered by clocks. With the periodicity assumption, it is shown that the controller design can be simplified to a finite horizon optimal controller. It is then possible to include the time-step as a state along with the channel access variable in the Q-learning. Simulations showed that such an implementation had significant computational benefits. Another interesting new scheme of ensemble learning is proposed by the authors of [37]; however, they use fuzzy logic and new optimization techniques which must be further analyzed for practical application in industries.

The investigation is organized into six sections including the introduction. Section 2 presents the problem formulation and the background required for the analysis. The optimal controller design algorithm and its stability properties are described in Section 3. Section 4 presents the results, Section 5 discusses them, and Section 6 draws the conclusions.

Notations: The following notations are used throughout the paper. $P>0~(\geqslant 0)$ denotes a real symmetric positive definite (semidefinite) matrix. A diagonal matrix is written as $diag\{\ldots\}$. $E(.)$ stands for the expectation operator.

2. Background

The basic NCS structure considered in the paper is shown in Fig. 1, wherein the control information is transmitted over a shared communication channel. Sharing the network bandwidth with other applications limits the bandwidth available to the NCS, and this gives rise to medium access constraints. As a result, only a few actuators gain access to control signals at any given time instant. To use the network bandwidth judiciously and also achieve optimal performance in medium access constrained NCS, the controller and scheduler need to be co-designed. Further, the compensation scheme for actuators failing to gain access to the network also needs to be investigated.

Figure 1.

Basic networked control systems configuration with medium access constraints.

The NCS is assumed to be a discrete-time linear time-invariant system

$\displaystyle x(k+1)=Ax(k)+Bu(k)+Ev(k),\qquad y(k)=Cx(k)$ (1)

where $x=[x_{1},\ldots,x_{n}]^{T}\in\mathbb{R}^{n}$ is the state vector,

$\displaystyle E(v(k)^{T}v(k))=V(k)=0~~\text{otherwise}$ (2)

The following assumption is used to denote the medium access constraint and the working modes of sensor and actuators similar to the investigations in [25, 23]:

Assumption 1. (a) Sensor is time-driven while the controller and actuator are event driven. (b) the communication network is subjected to medium access constraint and only $p$ among the $m$ control inputs can be transmitted during any time-epoch as shown in Fig. 1.

To simplify our analysis, we further assume that the states are directly measured by the sensors and transmitted to the controller, while the control signals are transmitted as packets to the actuators via the $p$ communication channels.

The medium access status of the control input at any time instant $k$ is encoded using a binary-valued function $\sigma_{i}(k)$, $i\in\{1,2,\ldots,m\}$, as in [9]. A value of $\sigma_{i}(k)=1$ indicates that actuator $i$ receives the control input packet during time instant $k$; otherwise, the packet is not transmitted. Then, at any instant $k$, the medium access status of the actuators is represented by the $m$-to-$p$ communication sequence $\sigma(k)=[\sigma_{1}(k),\ldots,\sigma_{m}(k)]^{T}$. This communication sequence is generated during each time period by the scheduler.

The control input to the NCS at time instant $k$ is given by

$\displaystyle \bar{u}(k)=M_{\sigma(k)}\,u(k),\quad k=0,1,\ldots$ (3)

where $M_{\sigma(k)}=diag(\sigma(k))$ is the communication matrix that encodes the scheduler's decision.
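As a minimal illustration of Eq. (3), the following Python sketch builds $M_{\sigma(k)}$ from a binary schedule and applies it to a control vector. The function name `zero_transmit` and all numerical values are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def zero_transmit(sigma, u):
    """Return u_bar = M_sigma u with M_sigma = diag(sigma), as in Eq. (3)."""
    M_sigma = np.diag(sigma)   # communication matrix for this time instant
    return M_sigma @ u

# Hypothetical case: m = 2 actuators, p = 1 channel, actuator 1 is scheduled.
sigma = np.array([1, 0])
u = np.array([0.4, -0.7])
u_bar = zero_transmit(sigma, u)   # the unscheduled input is replaced by zero
```

Here the second actuator receives zero because $\sigma_{2}(k)=0$, which is the transmit-zero behaviour implied by Eq. (3).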

Assumption 2. The matrix $M_{\sigma(k)}$ can be modeled as a Markov Decision Process (MDP) as follows:

$\displaystyle Pr\{\sigma_{l}(k+1)=j\,|\,\sigma_{l}(k)=i,\bar{u}(k)=a\}=P_{ij}^{a},\qquad \sum_{j}P_{ij}^{a}=1$ (4)

where $P_{ij}^{a}$ denotes the transition probability from state $i$ to state $j$ under action $a$. The actions $a$ in our case are the control inputs. These actions are obtained by studying the operating range of the control signals and from knowledge of the process operations.
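To make the role of the action-dependent transition probabilities in Eq. (4) concrete, the sketch below samples a short mode trajectory from a small transition model. The two action labels (`"low"`, `"high"`), the probability values and the helper `next_mode` are hypothetical and chosen only for illustration.

```python
import random

def next_mode(P, i, a, rng=random):
    """Sample the next mode j with probability P[a][i][j], as in Eq. (4)."""
    row = P[a][i]
    assert abs(sum(row) - 1.0) < 1e-9   # each row must sum to one
    return rng.choices(range(len(row)), weights=row, k=1)[0]

# Hypothetical two-mode, two-action transition model (illustrative values).
P = {
    "low":  [[0.9, 0.1], [0.2, 0.8]],
    "high": [[0.5, 0.5], [0.6, 0.4]],
}

rng = random.Random(0)          # fixed seed so the run is repeatable
trajectory = [0]
for _ in range(10):
    trajectory.append(next_mode(P, trajectory[-1], "low", rng))
```

The row-sum check mirrors the normalization condition $\sum_{j}P_{ij}^{a}=1$.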

Combining the NCS dynamics in Eq. (1), the control input $\bar{u}$ and Assumption 2, we can represent the system dynamics as a Markov jump system with multiple modes that depend on the channel conditions and control actions $\bar{u}(k)$ .

$\displaystyle x(k+1)=Ax(k)+B\bar{u}(k)+Gv(k)$ (5)

From the definition of $\bar{u}(k)$ and Assumption 2, $B\bar{u}(k)$ is also an MDP, and the NCS with medium access constraint is a Markov jump linear system.

The problem considered in the investigation is the design of the optimal controller that minimizes the stochastic cost function

$\displaystyle J_{k}=\underset{\sigma,v(k)}{\operatorname{E}}\sum_{j=k}^{\infty}(x_{j}^{T}Qx_{j}+\bar{u}_{j}^{T}R\bar{u}_{j})$ (6)

where $Q$ and $R$ are symmetric positive semi-definite and symmetric positive definite matrices, respectively. Optimizing the cost function in Eq. (6) requires complete knowledge of the process dynamics and network constraints beforehand; the Stochastic Riccati Equation can then be solved backward-in-time. The design of stochastic controllers under this assumption can be found in [20].

Usually, in industries, prior knowledge of the process dynamics and network constraints is not available. Therefore, communication access constrained NCS in industries require optimal controllers that work forward-in-time and with unknown process dynamics.

3. Optimal controller and scheduler design

In this section, stochastic optimal control of NCS with medium access constraints and unknown dynamics is proposed using the idea of Q-learning [26]. The design of the optimal controller requires knowledge of the information transmitted by the scheduler for the actuators that did not gain access to the network. In the literature, two approaches have been widely studied: (i) zero transmission and (ii) zero-order-hold [27, 28].

The compensator for the zero transmission is given by Eq. (3). Then, combining the system dynamics Eq. (1), the medium access constrained NCS control input in Eq. (3) and the zero transmit scheduler, we obtain

$x(k+1)=Ax(k)+BM_{\sigma(k)}u(k)+Gv(k)$ (7)

Remark 1. Design of stochastic optimal controller with zero transmission scheduler in Eq. (3) for known NCS dynamics with medium access constraint has been studied in [19, 9].

The compensator for zero-order-hold strategy is given by

$\displaystyle \bar{u}(k)=M_{\sigma(k)}u(k)+M_{\overline{\sigma}(k)}u(k-1)$ (8)

where $M_{\overline{\sigma}(k)}=I-M_{\sigma(k)}$ selects the actuators that are not scheduled to receive control signals from the controller.
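The zero-order-hold compensator of Eq. (8) can be sketched in a few lines of Python. The function name `zero_order_hold` and the sample vectors are assumptions made for illustration only.

```python
import numpy as np

def zero_order_hold(sigma, u, u_prev):
    """Return u_bar = M_sigma u(k) + (I - M_sigma) u(k-1), as in Eq. (8)."""
    M_sigma = np.diag(sigma)
    M_sigma_bar = np.eye(len(sigma)) - M_sigma   # selects unscheduled actuators
    return M_sigma @ u + M_sigma_bar @ u_prev

# Hypothetical case: only actuator 2 gains access at this instant.
sigma = np.array([0, 1])
u_bar = zero_order_hold(sigma, np.array([0.4, -0.7]), np.array([0.1, 0.2]))
```

Actuator 1 keeps its previous value while actuator 2 receives the new command, which is the behaviour that distinguishes this strategy from the transmit-zero compensator of Eq. (3).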

The dynamics of medium access constrained NCS with zero order hold compensator is given by

$\displaystyle x(k+1)=Ax(k)+BM_{\sigma(k)}u(k)+BM_{\overline{\sigma}(k)}u(k-1)$ (9)

Using Eq. (9), an augmented state vector $z(k)=[x(k)~u(k-1)]^{T}\in\Re^{n+(m-p)}$ is defined such that

$z(k+1)=\tilde{A}z(k)+\tilde{B}M_{\sigma(k)}u(k)+Gv(k)$ (10)

where the augmented system matrices are given by

$\tilde{A}=\begin{bmatrix}A&BM_{\overline{\sigma}(k)}\\ 0&M_{\overline{\sigma}(k)}\end{bmatrix}$

$\tilde{B}=\begin{bmatrix}BM_{\sigma(k)}\\ M_{\sigma(k)}\end{bmatrix}$
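A quick numerical check of the augmented form can be sketched as follows. The code builds $\tilde{A}$ and $\tilde{B}$ for one schedule and compares one step of Eq. (10), without the noise term, against a direct evaluation of Eqs. (8) and (9). The system matrices and vectors are hypothetical, and the sketch assumes that the full $m$-dimensional held input is stored in the augmented state.

```python
import numpy as np

def augmented_matrices(A, B, sigma):
    """Build A_tilde and B_tilde of Eq. (10) for one schedule sigma."""
    n, m = B.shape
    M = np.diag(sigma).astype(float)
    M_bar = np.eye(m) - M
    A_tilde = np.block([[A, B @ M_bar], [np.zeros((m, n)), M_bar]])
    B_tilde = np.vstack([B @ M, M])
    return A_tilde, B_tilde, M

# Hypothetical system with n = 2 states and m = 2 inputs.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0, 0.0], [0.5, 2.0]])
sigma = np.array([1, 0])
x = np.array([1.0, -1.0])
u_prev = np.array([0.3, 0.6])
u = np.array([0.2, -0.4])

A_t, B_t, M = augmented_matrices(A, B, sigma)
z_next = A_t @ np.concatenate([x, u_prev]) + B_t @ M @ u

# Direct evaluation of Eqs. (8) and (9), noise omitted.
u_bar = M @ u + (np.eye(2) - M) @ u_prev
x_next = A @ x + B @ u_bar
```

Because $M_{\sigma(k)}$ is a 0/1 diagonal matrix, $M_{\sigma(k)}M_{\sigma(k)}=M_{\sigma(k)}$, so the upper block of `z_next` matches `x_next` and the lower block matches the held input `u_bar`.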

Remark 2. To simplify our analysis, this investigation presents the optimal control approach for the zero transmission policy, although the results obtained can be generalized to the zero-order-hold strategy as well.

3.1 Optimal control

It is important to note that, when the system dynamics in Eqs. (3) and (9) are known, the Stochastic Riccati Equation (SRE) can be used to design the controller and scheduler backward-in-time. The stochastic cost function for a given time instant, assuming the zero transmission strategy, can be represented as

$\displaystyle J(k)=\underset{\sigma,v}{\operatorname{E}}(x(k)^{T}P(k)x(k)+\bar{u}(k)^{T}R\bar{u}(k))$ (11)

where $P(k)\geqslant 0$ is the solution of the SRE described in [29]. Assuming the model dynamics to be known, the stochastic optimal controller with zero transmission compensation is given by

$\displaystyle K(k)=-\left(R+\underset{\sigma,v}{\operatorname{E}}\left((BM_{\sigma(k)})^{T}P(k+1)BM_{\sigma(k)}\right)\right)^{-1}\underset{\sigma,v}{\operatorname{E}}\left((BM_{\sigma(k)})^{T}P(k+1)A\right)$ (12)

One can see that the solution of the SRE still requires prior knowledge of $\sigma(k)$ for computing the controller gains using the backward iteration algorithm. Enumerating the scheduling sequences for the backward iteration can cause a combinatorial explosion, making the controller design intractable.
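For intuition only, the gain in Eq. (12) can be evaluated numerically when the expectation over $\sigma(k)$ is replaced by a weighted sum over a finite set of schedules. The sketch below assumes that the products in Eq. (12) are read with the transposes needed for the dimensions to agree, and that the schedule probabilities `probs` and the matrix `P_next` are supplied by the user; the function name `expected_gain` and all numbers are hypothetical.

```python
import numpy as np

def expected_gain(A, B, R, P_next, schedules, probs):
    """Evaluate Eq. (12) with the expectation taken over a finite schedule set."""
    n, m = B.shape
    S = np.zeros((m, m))
    T = np.zeros((m, n))
    for sigma, prob in zip(schedules, probs):
        BM = B @ np.diag(sigma)            # B M_sigma for this schedule
        S += prob * (BM.T @ P_next @ BM)
        T += prob * (BM.T @ P_next @ A)
    return -np.linalg.solve(R + S, T)      # -(R + S)^{-1} T

# Hypothetical data: two inputs, one of which is scheduled at a time.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0, 0.0], [0.5, 2.0]])
R = np.eye(2)
P_next = np.eye(2)
K = expected_gain(A, B, R, P_next, schedules=[[1, 0], [0, 1]], probs=[0.5, 0.5])
```

With a single schedule in which every actuator is served, the expression reduces to the familiar one-step linear-quadratic gain, which is a convenient sanity check. The sketch also makes the difficulty visible: `P_next` and the schedule probabilities must be known in advance, which is exactly what the backward iteration requires.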

3.2 Q-learning based stochastic optimal controller

As stated earlier, obtaining process models in industries is cumbersome, and without such models obtaining an analytical solution to the optimal controller problem is infeasible. Consequently, to design the controller and scheduler for NCS with unknown dynamics forward-in-time, we propose to use a Q-learning based approach. Q-learning is a form of reinforcement learning [30] primarily used in agent based systems, where the agent does not have any model of the environment. The agent has information only on the states, the numerical reward functions and the possible actions at each of these states.

In Q-learning, each state-action pair is assigned an estimated value called the Q-value. The rewards generated on reaching a state are used to update the Q-value estimate of that state. As the rewards are stochastic, the states need to be visited many times to update the Q-values. The optimal action dependent value function $Q(x_{k},u_{k},\sigma_{k})$ is equal to the stochastic cost function $J(k)$. Therefore, the Q-function for the medium access constrained NCS is defined in terms of the expected value using the Bellman equation [31] as

$\displaystyle Q(x(k),u(k),\sigma(k))=\underset{\sigma,v}{\operatorname{E}}\big(r(x(k),u(k),\sigma(k))\big)+J(k+1)$ (13)

where the rewards are computed as

$\displaystyle r(x(k),u(k),\sigma(k))=x(k)^{T}Qx(k)+(u(k)\sigma(k))^{T}R(u(k)\sigma(k))$ (14)

3.2.1 Infinite horizon controller

In this investigation, the scheduling variable $\sigma(k)$ is taken as the state, whereas the control inputs are the actions. The control inputs that form the actions are selected based on the maximum and minimum actuation values of the actuators as well as knowledge of the process. The rewards are generated at each state using the quadratic cost objective function, and the Q-value of each state for the corresponding action is updated. The Q-value update is given by

$\displaystyle Q(s,a)(k+1)=Q(s,a)(k)+\alpha(k)\big[r(k)+\delta\,\underset{a^{\prime}}{\operatorname{max}}\,Q(s^{\prime},a^{\prime})(k)-Q(s,a)(k)\big]$ (15)

where the step size $\alpha$, with $0<\alpha\leqslant 1$, is constant, $\delta>0$ is the discount factor, $s,a$ denote the current state and action, and $s^{\prime},a^{\prime}$ the future state and action. It should be noted that, because the objective in this investigation is to minimize the quadratic cost function of the optimal controller, the negative of the Q-value is maximized to obtain the rewards.
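A single tabular update of the form in Eq. (15) can be sketched as below. The table layout (a dictionary keyed by state-action pairs), the action labels, and the values of `alpha` and `delta` are assumptions for illustration, and the reward is passed in as a negative cost so that maximizing the Q-value corresponds to minimizing the quadratic cost.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, delta=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + delta * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + delta * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)            # unseen state-action pairs start at zero
actions = ["low", "high"]         # hypothetical discretized control inputs
Q[(1, "high")] = 2.0              # pretend an earlier visit produced this value

# State 0 and state 1 stand for two schedules sigma; r is a negative cost.
new_value = q_update(Q, s=0, a="low", r=-1.0, s_next=1, actions=actions)
```

In this example the temporal-difference term is $-1.0+0.9\times 2.0-0=0.8$, so the updated entry is $0.1\times 0.8=0.08$.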

Since the Q-updates are stochastic, the procedure is repeated for many iterations. A greedy action selection can then be used to select the optimal control actions and network schedule. The resulting design provides the stochastic optimal controller and scheduler that optimize the NCS performance. However, since the Q-learning algorithm works online, considerable time is spent on learning. To reduce the learning time, this investigation uses the SOFTMAX action selection [30]. To implement the SOFTMAX action selection, a Boltzmann probability distribution is used, given by

$P(a|s)=\frac{e^{\frac{Q(s,a)}{k}}}{\sum_{j}e^{\frac{Q(s,a_{j})}{k}}}$ (16)

The use of SOFTMAX in Eq. (16) speeds the convergence of Q-learning based optimal controller.
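The Boltzmann selection rule of Eq. (16) can be sketched as follows. The parameter called `temperature` plays the role of the scalar that divides $Q(s,a)$ in Eq. (16); its name, its default value and the sample Q-values are assumptions of this sketch.

```python
import math
import random

def softmax_probabilities(q_values, temperature=1.0):
    """Boltzmann distribution over actions, as in Eq. (16)."""
    q_max = max(q_values)   # subtracting the maximum avoids overflow in exp
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_action(q_values, temperature=1.0, rng=random):
    """Draw an action index according to the Boltzmann probabilities."""
    probs = softmax_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

probs = softmax_probabilities([1.0, 2.0, 3.0], temperature=1.0)
action = softmax_action([1.0, 2.0, 3.0], rng=random.Random(0))
```

Subtracting the largest Q-value before exponentiating leaves the probabilities unchanged but keeps the computation numerically stable. A large temperature makes the choice nearly uniform, while a small one approaches greedy selection.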

Remark 3. It should be noted that the convergence of the Q-learning algorithm to the optimal policy is guaranteed in an MDP if each action is executed in each state an infinite number of times and the learning rate is decayed appropriately [32].

Using Assumption 2 and Remark 3, the convergence of the Q-learning based optimal controller design is guaranteed. The stochastic optimal controller approaches the optimal value as each action is executed in each state an infinite number of times.

3.3 Periodic communication sequence and finite horizon optimal controller

A special case of scheduling sequence studied in literature is the periodic communication. Here, we show that the periodic communication assumption can greatly reduce the computational complexity with Q-learning based optimal controller design. Furthermore, the periodic communication is a good choice for the design of Q-learning finite horizon optimal controller.

The periodic communication assumption has been studied in [19, 9].

Assumption 3. The controller scheduling sequence is assumed to be periodic with periodicity $\mathcal{N}$ .

The assumption above is useful as it preserves many useful properties such as controllability and detectability of medium access constrained NCS. Furthermore, it models many industrial controllers that work based on schedule cycle triggered by clocks.

Using the periodic assumption, the optimal controller design problem in Eq. (6) can be reformulated as a finite horizon optimal control design problem with the control horizon equal to the periodicity of the communication sequence, $\mathcal{N}$. The cost function for the finite horizon optimal controller problem with periodicity of the communication sequence equal to $\mathcal{N}$ is given by

$\displaystyle J_{k}=\underset{\sigma}{\operatorname{E}}\sum_{j=k}^{\mathcal{N}-1}(x(j)^{T}Qx(j)+u(j)^{T}Ru(j))+x(\mathcal{N})^{T}Q(\mathcal{N})x(\mathcal{N})+u(\mathcal{N})^{T}Q(\mathcal{N})x(\mathcal{N})$ (17)

The Q-learning based data-driven optimal controller and scheduler co-design algorithm proposed for the infinite horizon optimal controller can be simplified in the case of periodic communication sequence, by taking the time-steps and scheduling variable $\sigma(k)$ as the states of the Q-learning algorithm and possible control inputs as actions.
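One possible way to take both the time-step and the scheduling variable as the Q-learning state is to map each pair to a single table index, as sketched below. The encoding, the function name `state_index`, and the values of `period` and `num_schedules` are assumptions of this sketch rather than the scheme used in the paper.

```python
def state_index(step, sigma_index, num_schedules, period):
    """Map (time step within the period, schedule index) to one table index."""
    return (step % period) * num_schedules + sigma_index

# Hypothetical sizes: period N = 4 and two admissible schedules.
period, num_schedules = 4, 2
indices = {
    state_index(t, s, num_schedules, period)
    for t in range(period)
    for s in range(num_schedules)
}
```

With this encoding the Q-table has `period * num_schedules` rows, so its size grows linearly with the period instead of with the number of possible schedule sequences.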

Figure 2 shows the block diagram of the stochastic optimal controller proposed in this investigation. The proposed approach is model free and requires only the reward update from the system. Sensor measurements and the known actions can be used to determine the rewards. Once the rewards are obtained the Q-value is updated using Eq. (15).

Figure 2.

Block diagram of Q-learning based stochastic optimal controller.

Although the proposed Q-learning based optimal controller leads to a sub-optimal design, the design problem is simplified as no process knowledge is assumed. The proposed method uses only feedback information on the reward to design the optimal controller.

4. Results

To illustrate the effectiveness of the proposed optimal controller, experiments on an industrial network and simulations on a four-tank process are used. A detailed review of the four-tank process is presented in [33]. The four-tank process used in this study is shown in Fig. 3. Since the four-tank process is non-linear, the model is linearized and then discretized with a sampling time of $20~s$. The discrete-time model of the quadruple-tank process is given in Eq. (18), where $x$ denotes the mass of the liquid in the four tanks and $F_{1},F_{2}$ the flows in tanks 1 and 2, respectively. The flows to the tanks are obtained using two pumps and inlet valves. The two pumps and the valves together model the control signals. The sensors measure the height of the liquid in the four tanks.

Figure 3.

Description of the quadruple process.

The four-tank process is a medium access constrained NCS because, during any given control period, only one of the pumps and valves can receive the control signal. The regulation problem is to achieve mass balance in the four tanks. In other words, the liquid mass flow should be regulated to a steady state value in the presence of disturbances.

$\displaystyle x(k+1)=\begin{bmatrix}0.8839&0&0.1823&0\\ 0&0.8725&0&0.1957\\ 0&0&0.8058&0\\ 0&0&0&0.7901\end{bmatrix}x(k)+\begin{bmatrix}4.2335&0.5791\\ 0.5729&3.7392\\ 0&5.3965\\ 4.9002&0\end{bmatrix}\begin{bmatrix}F_{1}(k)\\ F_{2}(k)\end{bmatrix}$ (18)
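As a quick check of the model in Eq. (18), the sketch below computes the spectral radius of the state matrix and propagates the state with zero input. The matrices are copied from Eq. (18); the helper name `step_zero_transmit`, the schedule and the number of steps are illustrative assumptions.

```python
import numpy as np

A = np.array([[0.8839, 0.0, 0.1823, 0.0],
              [0.0, 0.8725, 0.0, 0.1957],
              [0.0, 0.0, 0.8058, 0.0],
              [0.0, 0.0, 0.0, 0.7901]])
B = np.array([[4.2335, 0.5791],
              [0.5729, 3.7392],
              [0.0, 5.3965],
              [4.9002, 0.0]])

# A is upper triangular, so its eigenvalues are its diagonal entries.
spectral_radius = max(abs(np.linalg.eigvals(A)))

def step_zero_transmit(x, u, sigma):
    """One noise-free step of Eq. (7) with transmit-zero compensation."""
    return A @ x + B @ (np.diag(sigma) @ u)

x = np.array([500.0, 500.0, 500.0, 500.0])
for _ in range(200):
    x = step_zero_transmit(x, np.zeros(2), np.array([1, 0]))
```

Since all eigenvalues lie inside the unit circle, the linearized model is open-loop stable and the zero-input response decays to the origin. The controller therefore shapes the transient and the cost rather than stabilizing an unstable plant.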

Modelling the network as an MDP requires experiments to determine the transition probabilities of $\sigma(k)$. To obtain the transition probabilities, experiments are conducted on a distributed control system with a MODBUS over TCP/IP network. The transmitted control packets are recorded using Wireshark. To obtain the Markov chain, the network loading of other applications connected to the network is deliberately increased during the experiments, and the corresponding probabilities are recorded. The experiments result in a Markov chain model for the medium access constraint. The transition probabilities obtained from the experiments are given in Eq. (19).

Figure 4.

State trajectories of four tank process with static LQR.

The Markov chain thus generated gives the probability for the pumps/valves to get access to the control signals. Combining the Markov chain model with the possible actions, the MDP model required for designing the stochastic controller can be obtained. For example, provided pump 2 operates for a given period and the level of tank 1 has reached $h_{1}$, it is highly probable that pump 1 is turned ON or that the flow rate into tank 1 is low. Such information, when embedded in the Markov chain model, can be used to obtain the MDP required for the Q-learning based optimal stochastic controller.

$\Pi=\begin{bmatrix}0.3367&0.6633\\ 0.6815&0.3185\end{bmatrix}$ (19)
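To see what the measured chain in Eq. (19) implies for long-run channel access, the sketch below checks that each row sums to one and computes the stationary distribution by power iteration. The iteration count and the starting vector are arbitrary choices of this sketch.

```python
import numpy as np

Pi = np.array([[0.3367, 0.6633],
               [0.6815, 0.3185]])   # transition probabilities from Eq. (19)

# Each row must be a probability distribution.
assert np.allclose(Pi.sum(axis=1), 1.0)

# Power iteration for the stationary distribution, which satisfies pi = pi @ Pi.
pi = np.array([0.5, 0.5])
for _ in range(200):
    pi = pi @ Pi
```

For a two-state chain the stationary distribution is also available in closed form as $[p_{21},\,p_{12}]/(p_{12}+p_{21})$, which here is close to an even split between the two access states.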

First, Fig. 4 shows the state trajectories under stochastic optimal control of the medium access constrained NCS with known system dynamics and prior information on channel access as a Markov chain, for the initial condition $x_{0}=[500~500~500~500]$. The optimal controller is obtained by solving the SRE backward-in-time. The state trajectory shows that the controller makes the states converge to zero while ensuring stable NCS operation.

Figure 5.

State trajectories of four tank process with Q-learning based stochastic optimal controller with zero transmit compensation.

Figure 6.

Communication sequence with (a) transmit zero and (b) zero-order-hold compensation for the four tank problem.

Second, Fig. 5 shows the state trajectories of the four-tank process with medium access constraints, controlled using the Q-learning based stochastic optimal controller with the zero transmit compensation strategy. One can observe that the proposed controller, scheduler, and compensator strategy make the states converge to zero in steady-state condition. The corresponding communication schedule is shown in Fig. 6a. This result shows that the proposed Q-learning based controller with the zero transmit strategy not only regulates the system state to zero optimally, but also schedules the network bandwidth effectively.

Figure 7.

State trajectories of four tank process with Q-learning based stochastic optimal controller with zero-order-hold compensation.

Finally, Fig. 7 shows the state trajectories of the medium access constrained four-tank process controlled using the Q-learning based stochastic optimal controller with the zero-order-hold compensation strategy. The controller regulates the states to zero in steady-state conditions. The scheduling sequence of the control input is shown in Fig. 6b. This result illustrates the ability of the Q-learning based optimal controller to regulate the system state and schedule the network bandwidth efficiently.

Figure 8.

Reward function of the optimal controller.

The plot of the reward function versus time for 500 iterations is shown in Fig. 8 for zero-order-hold strategy. The plot shows that the reward function is monotonic and attains a steady state value equal to zero as the states approach zero. This result illustrates the convergence of the Q-learning algorithm.

Remark 4. The optimal performance with the zero transmit and zero-order-hold compensation schemes was not much different with the proposed controller. Therefore, conclusions on the best compensation scheme cannot be drawn, at least from the proposed example.

5. Discussion

The Q-learning of the controller is a continuous and never-ending process in the presence of dynamic disturbances. However, in the ideal case of known disturbances occurring at predetermined times, the learning can be stopped when the controller provides the required control performance to the plant. The best compensation scheme, however, cannot be derived from the proposed example. The proposed method has the advantage of being computationally simple when the communication is periodic, at the cost of yielding a sub-optimal controller. The convergence of the proposed algorithm is not derived here; interested readers can refer to [21, 22] for a proof of convergence.

6. Conclusions

This paper presented a stochastic optimal controller for NCS with medium access constraints and unknown dynamics using Q-learning. NCS with medium access constraints was modelled as a Markovian jump linear system with the channel constraints modelled as a Markov decision process. Two compensation strategies were studied for the controllers, transmit zero and zero-order-hold.

Then the stochastic optimal controller using Q-learning was proposed. The resulting approach co-designed an optimal controller, compensator, and scheduler. The MDP assumption of the channel dynamics guaranteed the convergence of the Q-learning algorithm.

The investigation also showed that, when the communication sequence is periodic, realization of the finite-horizon optimal controller is computationally simple. Finally, the benefits of the proposed controller were illustrated on a four-tank process. Our results showed that the Q-learning based stochastic controller optimized the NCS performance in the presence of medium access constraints.

Future directions of this investigation include implementing the proposed Q-learning based optimal controller in a real-scale industrial setting with NCS facing medium access constraints, examining other compensation mechanisms for control inputs that fail to gain access to the channel, and studying the stability of the proposed controller. Robustness and convenience are also important aspects to be studied.

References

1. Hespanha, Naghshtabrizi. A survey of recent results in networked control systems. Proceedings of the IEEE 2007; 95(1): 138.

2. Srinivasan, Vallabhan, Ramaswamy, Kotta. Adaptive LQR controller for networked control systems subjected to random communication delays. In American Control Conference (ACC), IEEE 2013; 783-787.

3. Yang. Networked control system: A brief survey. IEEE Proceedings Control Theory and Applications 2006; 153(4): 403-412.

4. Zhao, Liu, Rees. Integrated predictive control and scheduling co-design for networked control systems. IET Control Theory & Applications 2008; 2(1): 7-15.

5. Ionete, Cela. Structural properties and stabilization of NCS with medium access constraints. In Decision and Control, 45th IEEE Conference on, IEEE 2006; 1141-1146.

6. Zhu, Guo, Yang, Wang. Networked optimal control with random medium access protocol and packet dropouts. Mathematical Problems in Engineering 2015.

7. Branicky, Phillips, Zhang. Scheduling and feedback co-design for networked control systems. In Decision and Control, 2002, Proceedings of the 41st IEEE Conference on 2002 Dec; 2: 1211-1217.

8. Brockett. Stabilization of motor networks. In Decision and Control, Proceedings of the 34th IEEE Conference on 1995 Dec; 2: 1484-1488.

9. Zhang, Hristu-Varsakelis. Communication and control co-design for networked control systems. Automatica 2006; 42(6): 953-958.

10. Guo, Jin. A switching system approach to actuator assignment with limited channels. Int J Robust Nonlinear Control 2010; 20: 1407-1426. doi: 10.1002/rnc.1522.

11. Guo. A switching system approach to sensor and actuator assignment for stabilisation via limited multi-packet transmitting channels. International Journal of Control 2011; 84(1): 78-93.

12. Guo. Communications and control co-design: A combined dynamic-static scheduling approach. Science China Information Sciences 2012; 55(11): 2495-2507.

13. Gaid MEMB, Cela, Hamam. Optimal integrated control and scheduling of networked control systems with communication constraints: Application to a car suspension system. Control Systems Technology, IEEE Transactions on 2006 July; 14(4): 776-787.

14. Wang, Guo. Control with a random access protocol and packet dropouts. International Journal of Systems Science, (ahead-of-print) 2015; 1-9.

15. Lincoln, Bernhardsson. LQR optimization of linear system switching. Automatic Control, IEEE Transactions on 2002 Oct; 47(10): 1701-1705.

16. Xie. Optimal control of networked systems with limited communication: A combined heuristic and convex optimization approach. In Decision and Control, Proceedings, 42nd IEEE Conference on, IEEE 2003; 2: 1194-1199.

17. Tobagi. Multiaccess protocols in packet communication systems. Communications, IEEE Transactions on 1980 Apr; 28(4): 468-488.

18. Matveev, Savkin. The problem of LQG optimal control via a limited capacity communication channel. Systems & Control Letters 2004; 53(1): 51-64.

19. Zhang, Hristu-Varsakelis. LQG control under limited communication. In Decision and Control, 2005 and 2005 European Control Conference, CDC-ECC’05, 44th IEEE Conference on, IEEE 2005; 185-190.

20. Zhan, Guan, Zhang, Yuan. Optimal tracking performance of MIMO control systems with communication constraints and a code scheme. International Journal of Systems Science 2015; 46(3): 464-473.

21. Al-Tamimi, Lewis, Abu-Khalaf. Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica 2007; 43(3): 473-481.

22. Jagannathan, Lewis. Stochastic optimal control of unknown linear networked control system in the presence of random delays and packet losses. Automatica 2012; 48(6): 1017-1030.

23. Jagannathan. Stochastic optimal controller design for uncertain nonlinear networked control system via neuro dynamic programming. Neural Networks and Learning Systems, IEEE Transactions on 2013; 24(3): 471-484.

24. Jagannathan, Lewis. Stochastic optimal design for unknown linear discrete-time system zero-sum games in input-output form under communication constraints. Asian Journal of Control 2014; 16(5): 1263-1276.

25. Liou, Ray. A stochastic regulator for integrated communication and control systems: Part I – formulation of control law. Journal of Dynamic Systems, Measurement, and Control 1991; 113(4): 604-611.

26. Watkins CJCH. Learning from delayed rewards. PhD Thesis, University of Cambridge, England, 1989.

27. Hristu-Varsakelis. On the period of communication policies for networked control systems, and the question of zero-order holding. In Decision and Control, 2007 46th IEEE Conference on, IEEE 2007; 38-43.

28. Schenato. To zero or to hold control inputs with lossy links. Automatic Control, IEEE Transactions on 2009; 54(5): 1093-1099.

29. Wonham. On a matrix Riccati equation of stochastic control. SIAM Journal on Control 1968; 6(4): 681-697.

30. Sutton, Barto. Reinforcement learning: An introduction, adaptive computation and machine learning series, The MIT Press, Cambridge, Massachusetts, 1997.

31. Lewis, Syrmos. Optimal control, John Wiley & Sons, 1995.

32. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning 1994; 16(3): 185-202.

33. Subathra, Seshadhri, Radhakrishnan. A comparative study of neuro fuzzy and recurrent neuro fuzzy model-based controllers for real-time industrial processes. Systems Science & Control Engineering 2015; 3(1): 412-426.

34. Kouamana, Velosa. Robust control and synchronization of chaotic systems with actuator constraints. Handbook of Research on Artificial Intelligence Techniques and Algorithms. Ed. Pandian Vasant. Hershey: IGI Global, 2015; 1-43.

35. Fatemeh, Maihami. Distributed learning algorithm applications to the scheduling of wireless sensor networks. Handbook of Research on Novel Soft Computing Intelligent Algorithms: Theory and Practical Applications. Ed. Pandian M. Vasant. Hershey: IGI Global, 2014; 860-91.

36. Alberto, Corominas, Pastor. Pure and hybrid metaheuristics for the response time variability problem. Meta-Heuristics Optimization Algorithms in Engineering, Business, Economics, and Finance. Ed. Pandian M. Vasant. Hershey: IGI Global, 2013; 275-311.

37. Liu JNK, Wang. Fuzzy integral-based kernel regression ensemble and its application. Handbook of Research on Artificial Intelligence Techniques and Algorithms. Ed. Pandian Vasant. Hershey: IGI Global, 2015; 378-410.

38. Jana, Beullens, Wang. Instance-specific parameter tuning for meta-heuristics. Meta-Heuristics Optimization Algorithms in Engineering, Business, Economics, and Finance. Ed. Pandian M. Vasant. Hershey: IGI Global, 2013; 136-70.

39. Singh. Evaluation of genetic algorithm as learning system in rigid space interpretation. Handbook of Research on Novel Soft Computing Intelligent Algorithms: Theory and Practical Applications. Ed. Pandian M. Vasant. Hershey: IGI Global, 2014; 475-510.