Multi-agent cooperative optimal scheduling strategy of integrated energy system in urban area under extreme events

Abstract

As global climate change intensifies and natural disasters occur more frequently, extreme events have seriously affected energy systems in urban areas and pose great challenges to the supply and scheduling of urban energy systems. To better integrate and manage the various energy resources in urban areas, a Deep Q-Learning Network-Quasi Upper Confidence Bound model is constructed using deep reinforcement learning to learn the mapping between the states and behaviors of the energy system. Deep learning is used to fit the complex nonlinear models so that the entire energy system can be optimized, and the experiments are compared with and verified against a real energy system. The improved deep reinforcement learning algorithm is compared with the Q-learning model, the PDWoLF PHC algorithm model, the Quasi Upper Confidence Bound algorithm model and the Deep Q-Learning Network algorithm model. The results show that the research algorithm has the smallest instantaneous area control error and the smallest absolute frequency deviation, and that its average absolute frequency deviation is 45%–73% lower than that of the other algorithms; over time, the unit output power of the research algorithm flexibly tracks the stochastic square wave loads. Therefore, the proposed system strategy can provide a feasible solution for meeting the challenges of extreme events and can promote the sustainable development and safe operation of urban energy systems.

Keywords

Integrated energy systems, multi-intelligence, deep reinforcement learning, sampling mechanisms, extreme events

1. Introduction

Human production and daily life depend on the organic integration of comprehensive energy systems. However, energy systems in urban areas face severe challenges under extreme events caused by both natural and anthropogenic factors [1]. Extreme events such as fires, earthquakes, and floods not only have a great impact on the lives of urban residents, but also place new demands on the supply and dispatch of urban energy systems (UES) [2]. Under extreme events, UES scheduling faces several challenges. First, extreme events may damage energy facilities or cause them to fail, which affects the supply reliability of the energy system [3]. Second, the drastic changes in energy demand triggered by extreme events require the energy system to adjust rapidly within a short period of time [4]. Moreover, the uncertainty and complexity caused by extreme events make the scheduling decisions of energy systems very difficult. The Deep Q-Learning Network (DQN) is a deep reinforcement learning algorithm (DRLA) that is capable of making intelligent decisions in uncertain environments and has strong exploration capabilities [5]. To cope with these challenges, this research constructs a system strategy, based on deep reinforcement learning and incorporating the Quasi-Upper Confidence Bound (Q-UCB), that enhances the robustness and adaptability of the energy system. The aim is to propose a multi-agent cooperative optimal scheduling strategy for integrated energy systems (IES) in urban areas under extreme events.
The innovation of this research is to combine the advantages of the two algorithms and to propose corresponding coping strategies and methods for the uncertainty and complexity brought by extreme events. This is expected to provide theoretical guidance and practical reference for the scheduling decisions of urban regional energy systems under extreme events, and to improve the level of intelligence and the optimization effect of the energy system. The study is divided into four parts. The first part analyzes the research status of the improved DRLA. The second part describes the co-optimization function of the improved DRLA and the process of model construction. The third part compares and analyzes the improved DRLA model. The last part summarizes the whole paper. The DQN-UCB algorithm is designed to solve the optimal scheduling problem of the urban regional integrated energy system and to improve energy efficiency, system robustness and reliability. During training, the agent feeds the current state into the DQN network, which outputs the corresponding behavior selection probability. In the DQN-UCB algorithm, the introduction of the UCB algorithm helps agents explore the state space better and improves the convergence speed and stability of the algorithm.

2. Related works

DQN, a reinforcement learning algorithm based on value functions, is mainly used for decision making and action selection. The multi-objective optimization problem in textile manufacturing is becoming more and more challenging, and traditional methods cannot handle the high-dimensional decision space. For this reason, He et al. introduced DQN into a multi-agent system. Their experiments demonstrate that the proposed system can reach the optimal solution of the textile ozone oxidation process and that its performance is better than that of traditional methods [6]. To obtain a marketing investment strategy that increases brand visibility in marketing scenarios, the Vargas-Perez team developed a deep reinforcement learning agent based on a double deep Q-network algorithm, and the results showed that the decision support system facilitates optimization in an online dynamic learning environment [7]. To address the difficulty of controlling the window of the hydraulic support, Yang et al. presented a deep reinforcement learning method for regulating the action of the window based on a three-dimensional simulation platform. Three-dimensional simulation experiments verified the validity of the method: it improves the efficiency of top coal mining and achieves better economic benefits [8]. To let a vehicle agent learn from errors through its actions and its interaction with the environment, Quek et al. used a deep Q network based on fractional and pixel inputs to realize agent learning in the vehicle. Experiments confirmed that this enables the self-driving car to learn maneuvering operations and to gradually gain the ability to navigate successfully and avoid obstacles [9].
To improve on older network intrusion detection methods and raise their detection rate, Yang's team proposed a dedicated detection model based on deep learning, in which features of encrypted malicious network traffic are extracted automatically. Experiments show that the model is able to differentiate between normal and abnormal encrypted network traffic with an accuracy of 99.95% [10].

Multi-agent cooperative optimal scheduling achieves the optimal performance of the overall system through collaboration and optimal decision-making among multiple agents. To realize consistent tracking of a multi-agent system with actuator saturation, Chu et al. constructed a series of nested ellipsoid invariant sets related to the consistency error, and showed experimentally that the scheduling gain parameter can improve the convergence speed of consistent tracking [11]. To set a reasonable discharge price for the grid and EVs, Zhang et al. presented a negotiation strategy for the participation of EVs in optimal dispatch in the multi-agent case, and the numerical cases show that the negotiation model is effective in balancing the interests of the grid and EVs as well as in peak shifting [12]. Traditional manual programmable logic controller systems face the problems of load imbalance and irrational bin allocation in the industrial load sector. For this reason, Chen et al. proposed various optimization models with multi-agent systems, which were proven to be intuitive for upward and hierarchical bin allocation [13]. To efficiently determine the optimal active and reactive power of dispatchable energy sources, Elgamal's team proposed a new multi-agent control system for the energy management of microgrids, and the results showed that the model could dispatch agents on the DG and ESS buses to handle the optimal economic operation [14]. To apply closed-loop scheduling to demand response in energy systems, Campos et al. incorporated hybrid actions into the SAC framework, and the results showed that the algorithm quickly avoids violating constraints and continuously improves towards the optimal solution [15].

In summary, the DQN algorithm performs well on problems with a high-dimensional state space and a large action space and has achieved excellent results in some complex tasks, while multi-agent cooperative optimal scheduling can realize the cooperative scheduling and efficient use of different resources and thereby improve efficiency. To address the coordination and control issues of urban regional integrated energy systems under extreme events, this paper studies and improves the DQN algorithm in order to construct an allocation and scheduling scheme that is better suited to the energy management and optimization of urban regional integrated energy systems. The innovation of the research model lies in its multi-agent design, which enables collaborative optimization among the agents. This collaborative optimization method can better handle the various constraints and goals in the comprehensive energy system of urban areas and improve the overall energy utilization efficiency.

3. Co-optimization model construction based on DQN in integrated energy system

The study builds on DQN and introduces the Q-UCB algorithm to further improve the system structure, resulting in the DQN-UCB algorithm, which is designed to improve the scheduling of the UES and to manage energy effectively.

3.1 Construction of collaborative optimization model of integrated energy system based on improved DQN algorithm

DQN, as one of the DRLAs, is able to deal with high-dimensional states in complex environments. By learning and optimizing the decision-making of the agents, DQN can improve the efficiency of the integrated system and achieve rational allocation and utilization of the integrated energy resources [16]. It can also support collaborative learning among the agents, improving the decision-making strategy of the IES by jointly optimizing decisions so as to adapt to the energy demand and its changes in the urban area under extreme events. The computational representation of the $Q$ network is shown in Eq. (1).

$\displaystyle Q(s,a\left|w\right.)\approx Q^{\pi}(s,a)$ (1)

In Eq. (1), $w$ is the network parameter; $s$ is the state; $a$ is the action of the agent; $Q$ is the state-action value function, also known as the $Q$ function. The $Q$ network processes high-dimensional data such as images to realize the learning of the agent, and DQN handles the bias of the temporal-difference method by introducing a target network that is used when updating the parameters of the $Q$ network. The loss function used to update the parameters of the $Q$ network is shown in Eq. (2).

$\displaystyle l(w)=E_{s,a,r,s_{1}}\left[\left(r+\gamma\max\limits_{a_{1}}Q(s_{1},a_{1}\left|\overline{w}\right.)-Q(s,a\left|w\right.)\right)^{2}\right]$ (2)

In Eq. (2), $r$ is the reward value obtained by the agent interacting with the environment; $\gamma$ is the discount factor; $s_{1}$ is the next state after $s$; $a_{1}$ is the action of the agent in the next state $s_{1}$; and $Q(s_{1},a_{1}\left|{\overline{w}}\right.)$ is the target $Q$ network, whose structure is identical to that of the $Q$ network and which provides the $Q$ value of the next action. After the gradient is obtained by differentiating $l(w)$, the parameters can be updated by the gradient descent method. The mechanism by which DQN realizes action mapping is shown in Fig. 1.
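The loss in Eq. (2) can be illustrated with a minimal sketch. It assumes the standard DQN form, in which the online network is evaluated at the current pair $(s,a)$ and the target network at the next state $s_{1}$; the function name `dqn_loss`, the toy Q-values and the value $\gamma=0.9$ are illustrative assumptions, not values taken from the paper.

```python
def dqn_loss(q_online, actions, rewards, next_q_target, gamma=0.9):
    """Mean squared temporal-difference error over a batch (sketch of Eq. (2)).

    q_online:      list of rows, Q(s, . | w) from the online network
    actions:       list of ints, the action a taken in each transition
    rewards:       list of floats, the reward r of each transition
    next_q_target: list of rows, Q(s_1, . | w_bar) from the target network
    gamma:         discount factor (assumed value, not from the paper)
    """
    squared_errors = []
    for q_row, a, r, next_row in zip(q_online, actions, rewards, next_q_target):
        target = r + gamma * max(next_row)   # r + gamma * max_{a_1} Q(s_1, a_1 | w_bar)
        squared_errors.append((target - q_row[a]) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Toy batch: two transitions, three actions (hypothetical numbers).
q_online = [[1.0, 2.0, 0.5], [0.0, 1.0, 3.0]]
next_q_target = [[0.5, 1.5, 1.0], [2.0, 0.0, 1.0]]
actions = [1, 2]
rewards = [1.0, -1.0]

loss = dqn_loss(q_online, actions, rewards, next_q_target, gamma=0.9)
print(f"batch TD loss: {loss:.5f}")
```

Because the target row comes from a separate set of parameters, the quantity being regressed towards does not move with every gradient step, which is the stabilizing effect the target network is introduced for.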

Figure 1.

Mapping from images to actions by using DQN.

Figure 1 shows the schematic diagram of the image-to-action mapping using DQN. DQN realizes this mapping in four steps: it takes the original image as input, processes it with convolutional layers, passes the result through a fully connected neural network, and finally outputs the Q-values of the actions. Through the interaction between the agent and the environment, DQN obtains the reward value and updates the state, and the resulting transitions are stored as samples in a memory unit by the experience replay mechanism. The loss used to update the network parameters of DQN is shown in Eq. (3).

$\displaystyle L_{k}(Q_{k})=E_{(s,a,r,s_{1})\sim U(D)}\left[\left(Q_{\overline{k}}-Q_{k}(s,a;\theta_{k})\right)^{2}\right]$ (3)

In Eq. (3), $Q_{k}(s,a;\theta_{k})$ serves as the output of current value function; $\theta_{k}$ is the parameter of current network $Q_{k}$ ; $Q_{\overline{k}}$ is the output of target value network (TVN). The relevant output expression is showcased in Eq. (4).

$\displaystyle Q_{\overline{k}}=r+\gamma\max\limits_{a_{1}}Q(s_{1},a_{1}\left|\theta_{\overline{k}}\right.)$ (4)

In Eq. (4), $r$ is the reward value obtained by the agent interacting with the environment; $\gamma$ is the discount factor; $\theta_{\overline{k}}$ is the parameter of the target network $Q_{\overline{k}}$ at the $k$-th iteration. If the parameter $\theta$ of the current value network is copied to the TVN parameter $\theta_{\overline{k}}$, the expression of the TVN is shown in Eq. (5).

$\displaystyle\theta_{\overline{k}}=\theta_{k+l}$ (5)

In Eq. (5), $l$ is the number of iterations between copies, and $\theta_{k+l}$ is the current network parameter after $l$ iterations. This delayed parameter update reduces the correlation between the current network $Q$ value and the target $Q$ value, and therefore effectively improves the stability of DQN. However, UES are prone to energy supply risks and energy scheduling difficulties under extreme events, and traditional deep Q-networks cannot cope with this complex situation [17]. Under extreme events, the operating environment of integrated energy systems in urban areas may change dramatically. Q-UCB is therefore introduced to help agents adapt to these changes and to enhance the robustness of the system, because it encourages agents to explore new states and actions. At the same time, the Q-UCB algorithm helps agents find a balance between exploration and exploitation, so as to achieve collaborative optimization of the whole system. Through cooperation between multiple agents, the impact of extreme events can be handled better and the stability and reliability of the system can be improved. In urban district energy systems, extreme events may arise from the uncertainty of weather, load and other factors, and Q-UCB can help agents choose appropriate strategies to cope with these situations.

Figure 2.

UCB Algorithm network structure.

Figure 2 shows the algorithm structure of Q-UCB. The computational complexity of the Q-UCB algorithm is relatively high, especially when the number of agents is large, so it may increase the computational burden of the system and require more computational resources [18]. The Q-value function is updated as shown in Eq. (6).

$\displaystyle Q^{\pi}(s_{t},a_{t})\leftarrow Q^{\pi}(s_{t},a_{t})+a\left(r_{t+1}+\gamma Q^{\pi}(s_{t+1},a_{t+1})-Q^{\pi}(s_{t},a_{t})\right)$ (6)

In Eq. (6), the coefficient $a$ multiplying the bracketed term is the learning rate (not to be confused with the action $a_{t}$). The advantage of the UCB algorithm is that it synthesizes the historical reward information of the bandit arms and finds a balance between exploration and exploitation, selecting actions according to the average reward value and the size of an exploration term for each action. The expression of the resulting index value is shown in Eq. (7).

$\displaystyle X_{k}=\overline{x_{k}}+\sqrt{\frac{2\ln n}{n_{k}}}$ (7)

In Eq. (7), $\overline{x_{k}}$ is the average reward value of the $k$-th action; $\sqrt{\frac{2\ln n}{n_{k}}}$ is the confidence term for that action; $n_{k}$ is the number of times the action has been selected; and $n$ is the total number of trials over all actions so far. However, the UCB algorithm makes strong assumptions about the reward distribution of the bandit arms and may not adapt effectively when the reward distribution changes. For this reason, Q-UCB is incorporated into DQN to construct the DQN-UCB algorithm.
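As an illustration of Eq. (7), the following minimal sketch computes the index $X_{k}$ for each action and selects the largest. The function names and the toy statistics are hypothetical, and trying every untried action once before applying the formula is a common convention assumed here, since the index is undefined for $n_{k}=0$.

```python
import math


def ucb_index(mean_reward, n_total, n_k):
    """X_k = mean reward + sqrt(2 ln n / n_k), as in Eq. (7)."""
    return mean_reward + math.sqrt(2.0 * math.log(n_total) / n_k)


def select_action(mean_rewards, counts):
    """Return the index of the action with the largest UCB index.

    Actions that have never been tried are selected first (assumed
    convention), because Eq. (7) would divide by zero for n_k = 0.
    """
    for k, n_k in enumerate(counts):
        if n_k == 0:
            return k
    n_total = sum(counts)
    scores = [ucb_index(m, n_total, n_k) for m, n_k in zip(mean_rewards, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical statistics for three actions after 100 trials in total.
means = [0.50, 0.45, 0.10]
counts = [50, 5, 45]
chosen = select_action(means, counts)
print("selected action:", chosen)
```

In this toy case the second action is chosen even though its average reward is slightly lower than that of the first, because it has been tried only 5 times and its confidence term is correspondingly large.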

3.2 Design of a synergistic optimization model for integrated energy systems in urban areas

DQN-UCB combines the deep neural network of Q-learning with the upper confidence bound algorithm. It realizes the decision making of the agent by learning the value function, and by combining the two algorithms it can explore unknown regions during continuous learning while synthesizing existing knowledge in its decisions [19]. The UCB algorithm makes full use of historical information as part of the value estimate of each action, so that the optimal action decision can be derived. When the estimated Q-value deviates from the true Q-value, the error is bounded using Hoeffding's inequality, as shown in Eq. (8).

$\displaystyle Q_{k+1}(s_{k},a_{k})-Q(s_{k},a_{k})\leqslant b_{r}$ (8)

In Eq. (8) $b_{r}$ is the confidence reward and its expression is shown in Eq. (9).

$\displaystyle b_{r}=c\sqrt{\frac{\log(\left|S\right|\left|A\right|k/q)}{\tau}}$ (9)

In Eq. (9), $c$ is an absolute constant greater than 0; $S$ serves as the set of all possible states of the external circumstance; $A$ serves as the set of possible actions generated by the intelligent body; $k$ serves as the total of iterations so far; $\tau$ serves as the quantity of times the intelligent body has accessed the state and action pairs; and $q$ is the confidence factor. The final algorithmic framework of DQN-UCB is shown in Fig. 3.
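The behavior of the confidence reward in Eq. (9) can be sketched as follows. The function name `confidence_bonus` and all numerical values ($c$, $|S|$, $|A|$, $k$, $q$) are assumptions chosen for illustration; the paper does not report them in this section.

```python
import math


def confidence_bonus(c, n_states, n_actions, k, q, tau):
    """b_r = c * sqrt(log(|S| |A| k / q) / tau), as in Eq. (9).

    c:         positive constant
    n_states:  |S|, number of possible states
    n_actions: |A|, number of possible actions
    k:         total number of iterations so far
    q:         confidence factor
    tau:       number of visits to the current state-action pair
    """
    if tau <= 0:
        raise ValueError("tau must be positive: the pair must have been visited")
    return c * math.sqrt(math.log(n_states * n_actions * k / q) / tau)


# Hypothetical setting: 10 states, 5 actions, iteration 100, q = 0.05.
rarely_visited = confidence_bonus(1.0, 10, 5, 100, 0.05, tau=4)
often_visited = confidence_bonus(1.0, 10, 5, 100, 0.05, tau=400)
print(f"bonus after 4 visits:   {rarely_visited:.4f}")
print(f"bonus after 400 visits: {often_visited:.4f}")
```

The bonus shrinks in proportion to $1/\sqrt{\tau}$, so state-action pairs that have been visited often receive little extra credit, while rarely visited pairs are made more attractive.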

Figure 3.

Algorithm framework of DQN-UCB.

Figure 3 shows the algorithmic framework of DQN-UCB. Specifically, the DQN-UCB algorithm consists of the following steps. First, the deep neural network used to approximate the value function is initialized. Second, the UCB parameters, which balance the degree of exploration and exploitation, are initialized, and an action is selected based on the current state; using the trained deep neural network together with the UCB parameters combines the exploitation of existing knowledge with the exploration of unknown regions. Third, the selected action is executed, feedback from the environment is observed, and the relevant parameters are updated based on this feedback to improve the decision. Fourth, based on the updated network parameters, an action is selected again and executed, and the feedback is observed. This process is repeated until a preset stopping condition is reached. Through continuous iterative learning and optimization, the DQN-UCB algorithm gradually improves the decision-making ability of the agents and eventually realizes multi-agent collaborative optimal scheduling of the energy system. The loss function fed by the UCB calculator is shown in Eq. (10).

$\displaystyle L=(Q_{\overline{k}}+b_{r}-Q_{k}(s,a;\theta_{k}))^{2}$ (10)

In Eq. (10) $b_{r}$ is the confidence reward; $Q_{k}(s,a;\theta_{k})$ is the output of the current value function. The expression parameters are updated by back propagation of gradient of neural network as shown in Eq. (11).

$\displaystyle\theta_{k+1}=\theta_{k}-a\nabla_{\theta_{k}}L_{k}(\theta_{k})$ (11)

In Eq. (11), $a$ is the learning rate. As the network parameters are updated in real time, the target value computed with the target network parameters is shown in Eq. (12).

$\displaystyle Y_{k}^{\textit{DQN--UCB}}=\begin{cases}r, & \text{if the episode terminates at step } k+1\\ r+\gamma\max\limits_{a_{1}}Q(s_{1},a_{1}\left|\theta_{\overline{k}}\right.), & \text{otherwise}\end{cases}$ (12)

In Eq. (12), $r$ is the reward value obtained by the agent interacting with the environment; $\gamma$ is the discount factor; $\theta_{\overline{k}}$ is the parameter of the target network $Q_{\overline{k}}$ at the $k$-th iteration. For the sampling mechanism in DQN, a priority sampling method linked to UCB is proposed: after each execution of an action, samples are extracted and normalized, so that high-quality samples are selected preferentially when determining the optimal action. The probability update formula for each sample selection is shown in Eq. (13).

$\displaystyle p_{k,i}=p_{k,i}+c\left(r_{k,i}+\sqrt{\frac{\log(\left|S\right|\left|A\right|k/q)}{\tau_{i}}}\right)$ (13)

In Eq. (13), $c$ is an absolute constant greater than 0; $S$ serves as the set of all possible states of the external circumstance; $A$ serves as the set of possible actions produced by the intelligent body. The expression of $p_{k,i}$ can be shown in Eq. (14).

$\displaystyle p_{k,i}=\frac{p_{k,i}}{\sum\limits_{j=1}^{N}{p_{k,j}}}$ (14)
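Eqs. (13) and (14) can be sketched together as a priority update followed by normalization. The helper names, the toy rewards and visit counts, and the use of `random.choices` to draw a minibatch are illustrative assumptions; the sketch also assumes that the resulting priorities stay positive, which Eq. (13) alone does not guarantee when rewards are negative.

```python
import math
import random


def update_priorities(priorities, rewards, visits, c, n_states, n_actions, k, q):
    """Eq. (13): raise each priority by c * (reward + UCB confidence term)."""
    log_term = math.log(n_states * n_actions * k / q)
    return [
        p + c * (r + math.sqrt(log_term / tau))
        for p, r, tau in zip(priorities, rewards, visits)
    ]


def normalize(priorities):
    """Eq. (14): turn priorities into selection probabilities."""
    total = sum(priorities)
    if total <= 0 or min(priorities) < 0:
        raise ValueError("priorities must be non-negative with a positive sum")
    return [p / total for p in priorities]


# Three stored samples with hypothetical rewards and visit counts.
priorities = [1.0, 1.0, 1.0]
rewards = [0.2, 0.8, 0.2]
visits = [10, 2, 2]

priorities = update_priorities(priorities, rewards, visits,
                               c=1.0, n_states=10, n_actions=5, k=100, q=0.05)
probs = normalize(priorities)

# Draw a minibatch of sample indices according to the probabilities.
rng = random.Random(0)
batch = rng.choices(range(len(probs)), weights=probs, k=4)
print("selection probabilities:", [round(p, 3) for p in probs])
print("sampled indices:", batch)
```

In this toy case the second sample receives the highest probability because it combines the largest reward with few visits, while the first sample, visited ten times, is drawn least often.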

In Eq. (14), $p_{k,i}$ is the probability of selecting the $i$-th sample and $N$ is the total number of samples. The power system is an important component of the integrated energy system and the core carrier for energy conversion, transmission, distribution, consumption, and other links. The integrated energy system is based on the power system and, through collaborative planning of multiple energy sources, achieves complementary and mutually beneficial effects among the various heterogeneous energy subsystems. The result is a new type of integrated energy system that meets diversified energy demand, improves energy utilization efficiency, and promotes sustainable energy development. To avoid interruption of or damage to the urban regional energy system, which may seriously affect the city's energy services such as power supply, heating and cooling, and to enhance the resilience of the urban regional IES, the multi-regional cooperative distribution system of Automatic Generation Control (AGC) for the power supply system is integrated into the research system. The framework model of AGC is shown in Fig. 4.

Figure 4.

Distributed multi region AGC architecture.

Figure 4 shows the framework diagram of the multi-region distributed AGC system, which is used to dispatch and control the generating units in the power system. Its main function is to adjust power generation in accordance with changes in load demand so as to maintain the balance of the power system [20]. The AGC system can also monitor the demand and supply of energy in real time and, through intelligent scheduling algorithms, allocate energy to different regions and users for efficient use. Based on the magnitude of the instantaneous area control error, the reward function is defined as shown in Eq. (15).

$\displaystyle R(k)=-\eta\left|f(k)\right|-(1-\eta)\left[\mathrm{ACE}(k)^{2}\right]/1000$ (15)

In Eq. (15), ACE serves as the instantaneous error value of area control; $\left|{f(k)}\right|$ serves as the absolute value (ABA) of frequency deviation; $\eta$ is the weighting coefficient of $\left|{f(k)}\right|$ ; and $1-\eta$ is the weighting coefficient of ACE(k). The final DQN-UCB strategy for the fused multi-distributed AGC system is shown in Fig. 5.
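The reward in Eq. (15) can be written as a short function. The name `agc_reward`, the weighting $\eta=0.5$ and the two example operating points are assumptions for illustration only; the paper does not state the value of $\eta$ in this section.

```python
def agc_reward(freq_dev_hz, ace_mw, eta=0.5):
    """R(k) = -eta * |f(k)| - (1 - eta) * ACE(k)^2 / 1000, as in Eq. (15).

    freq_dev_hz: frequency deviation at step k, in Hz
    ace_mw:      instantaneous area control error at step k, in MW
    eta:         weighting coefficient in [0, 1] (assumed value)
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return -eta * abs(freq_dev_hz) - (1.0 - eta) * ace_mw ** 2 / 1000.0


# Two hypothetical operating points: a well-controlled step and a poor one.
good_step = agc_reward(freq_dev_hz=0.01, ace_mw=12.0)
poor_step = agc_reward(freq_dev_hz=0.02, ace_mw=57.0)
print(f"reward, small deviation: {good_step:.4f}")
print(f"reward, large deviation: {poor_step:.4f}")
```

The reward is never positive and equals zero only when both the frequency deviation and the ACE vanish; because the ACE enters quadratically, large control errors are penalized much more heavily than small ones.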

Figure 5.

DQN-UCB system model flowchart.

Figure 5 shows the flowchart of the DQN-UCB algorithm. The weights of the action network and of the target action network are initialized and the parameters are set; the action is then executed to obtain the new state, and the reward function value is calculated. A random number is then drawn: if it satisfies the selection condition, the sample is taken and updated; otherwise, the procedure returns to the previous step. Next, UCB is used to derive the confidence level together with the Q-values of all actions. The samples are then normalized by their probability of being selected, and the weights $\theta$ are updated to obtain the total power and the output. Finally, the study uses the DQN-UCB algorithm model for collaborative optimization of urban comprehensive energy systems. To test the control performance of DQN-UCB for urban multi-area integrated energy coordination under extreme events, simulation experiments are conducted for different algorithms under random square wave perturbation, and the resulting indices are given in Table 1.

Table 1

Statistical table of simulation experiment indicators for different algorithms

Region	Index	Q	PDWoLF PHC ($\lambda$)	QUCB	DQN	DQN-UCB
Zone 1	$|\Delta f|$/Hz	0.02	0.02	0.02	0.01	0.01
	$|\mathrm{ACE}|$/MW	56.89	47.59	34.64	25.12	12.04
	CPS1/%	194.82	195.36	197.18	198.29	199.62
Zone 2	$|\Delta f|$/Hz	0.02	0.02	0.08	0.01	0.01
	$|\mathrm{ACE}|$/MW	58.10	48.34	35.16	22.39	10.73
	CPS1/%	195.23	196.22	197.52	198.38	199.51
Zone 3	$|\Delta f|$/Hz	0.02	0.02	0.02	0.01	0.01
	$|\mathrm{ACE}|$/MW	47.44	43.88	29.47	14.66	8.20
	CPS1/%	195.47	196.74	198.20	199.04	199.91

As can be seen from Table 1, in region 1 DQN-UCB has the smallest instantaneous area control error and the smallest ABA of frequency deviation. Compared with the other algorithms, its average ABA of frequency deviation is reduced by 45%–73% and its instantaneous area control error is reduced by 52%–78%, while its CPS1 value is the highest, about 1.4–4.8 percentage points above those of the other algorithms. In region 2, the instantaneous area control error of DQN-UCB is 47.38 MW, 37.62 MW, 24.44 MW, and 11.66 MW lower than that of the Q algorithm, the PDWoLF PHC algorithm, the QUCB algorithm, and the DQN algorithm, respectively, and its frequency deviation, at about 0.01 Hz, is the lowest value. In region 3, the DQN-UCB algorithm likewise attains the minimum values of instantaneous area control error and ABA of frequency deviation. It can be concluded that the DQN-UCB algorithm is stable, highly controllable and adaptable in coping with irregular surges and dips in the power system.
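The ACE comparisons quoted above can be reproduced from the rounded entries of Table 1 with a few lines of arithmetic. This is only a consistency check on the tabulated averages; the dictionary and function names are illustrative, and differences of about 0.01 MW from the figures in the text would be expected if the original values were computed from unrounded data.

```python
# |ACE| averages in MW, copied from Table 1.
ace_zone1 = {"Q": 56.89, "PDWoLF PHC": 47.59, "QUCB": 34.64, "DQN": 25.12, "DQN-UCB": 12.04}
ace_zone2 = {"Q": 58.10, "PDWoLF PHC": 48.34, "QUCB": 35.16, "DQN": 22.39, "DQN-UCB": 10.73}


def reduction_pct(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# Zone 1: relative ACE reduction of DQN-UCB against each other algorithm.
zone1_reductions = {
    name: reduction_pct(ace, ace_zone1["DQN-UCB"])
    for name, ace in ace_zone1.items() if name != "DQN-UCB"
}

# Zone 2: absolute ACE gap (MW) between each other algorithm and DQN-UCB.
zone2_gaps = {
    name: ace - ace_zone2["DQN-UCB"]
    for name, ace in ace_zone2.items() if name != "DQN-UCB"
}

for name, pct in zone1_reductions.items():
    print(f"Zone 1, vs {name}: ACE reduced by {pct:.1f}%")
for name, gap in zone2_gaps.items():
    print(f"Zone 2, vs {name}: ACE lower by {gap:.2f} MW")
```

With the rounded table entries, the Zone 1 reductions fall between roughly 52% (against DQN) and 79% (against the Q algorithm), in line with the range stated in the text.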

4. Performance evaluation of DQN-UCB for integrated energy co-optimization of urban systems

To verify the stability and coordination capability of DQN-UCB under extreme event disturbances, the performance of the research algorithm model is compared with that of other algorithms to assess the effectiveness of DQN-UCB in energy systems.

4.1 DQN-UCB algorithm parameter design and performance evaluation

When extreme events occur, the urban area IES faces problems such as energy supply interruption, equipment damage and failure, and system recovery and reconstruction. A state-action data set, an environmental parameter data set, a regional control data set, a continuous step load disturbance data set, and an energy utilization and power grid income data set were selected for experimental analysis. First, the controller output of DQN-UCB is studied; the results are illustrated in Fig. 6.

Figure 6.

Controller output curve of DQN-UCB.

Figure 6 shows the controller output curve of DQN-UCB. The output power (OP) of the DQN-UCB unit flexibly tracks the random square wave load over time, with the load fluctuating between 0 and 2100 MW. At 2.5 h the DQN-UCB controller curve is about 150 MW below the random square wave load, and at 3.5 h it is about 250 MW above the load; for the rest of the time, the DQN-UCB controller curve fits the random square wave load almost completely, which shows that the adaptability and stability of the DQN-UCB algorithm can satisfy the actual demand. To test the convergence of the DQN-UCB algorithm, DQN-UCB and the Q algorithm, PDWoLF PHC algorithm, QUCB algorithm, and DQN algorithm are trained on the same simulation training set, and the training results are shown in Fig. 7.

Figure 7.

Comparison of DQN-UCB learning effectiveness and convergence values.

Figure 7(a) shows the controller output curve of the DQN-UCB algorithm: the OP of the DQN-UCB controller quickly tracks the load perturbation curves, ranging from $-$800 MW to 800 MW. Figure 7(b) shows the CPS1 curves of the DQN-UCB algorithm in region A and region B. In region A, DQN-UCB quickly converges to a value of about 200 MW at about 2100 s, and in region B it reaches a stable converged value at about 2900 s. Figure 7(c) shows the frequency variation of the DQN-UCB algorithm: in region A it reaches a stable frequency of around 50 Hz at about 3200 s, and in region B it reaches a stable frequency of around 50.01 Hz at 4100 s. Figure 7(d) shows the convergence of the average reward value of the five algorithms. The DQN-UCB algorithm converges faster than the other four algorithms and trains better: it converges to the optimal solution at about 100 s, which is faster than the Q algorithm, PDWoLF PHC algorithm, QUCB algorithm, and DQN algorithm by 2200 s, 2500 s, 2100 s, and 3400 s, respectively. In summary, the DQN-UCB algorithm learns and converges better and has superior dynamic performance. Next, the dynamic control performance of the DQN-UCB, QUCB, and DQN algorithms in a complex environment is studied. The results are shown in Fig. 8.

Figure 8.

Algorithm controller output curve.

Figure 8(a) shows the controller output curve of the DQN algorithm. The DQN algorithm achieves balanced coordination that follows the pace of the disturbance changes, but the size of its coverage varies with time; at 40–50 s the coverage is larger, with a range of 0.95–1.1 Hz. Figure 8(b) shows the controller output curve of the QUCB algorithm, which has the largest coverage over time; its magnitude range is larger, at 0.85–1.15 Hz, during 15–38 s. Figure 8(c) shows the controller output curve of the DQN-UCB algorithm. At the same moment, the DQN-UCB algorithm has the smallest coverage of the three algorithms, and at 20–30 s its amplitude range is about 0.03–0.06 Hz and 0.05–0.09 Hz smaller than those of the DQN algorithm and the QUCB algorithm, respectively. The DQN-UCB algorithm therefore has the smallest deviation and achieves the best dynamic control effect.

4.2 Test results of DQN-UCB co-optimization model

For testing the stability of the DQN-UCB algorithm in response to extreme events, the instantaneous error value of area control, frequency deviation, and CPS1 indicators are selected and compared and analyzed with the Q algorithm model, the PDWoLF PHC algorithm model, the QUCB algorithm model, and the DQN algorithm model, and the outcomes are indicated in Fig. 9.

Figure 9.

System performance comparison under random load disturbance.

Figure 9(a) compares the instantaneous error values for area control: the error value of DQN-UCB is 1.02 MW, which is 1.41 MW, 1.28 MW, 1.32 MW, and 1.16 MW lower than those of the Q algorithm, the PDWoLF PHC algorithm, the QUCB algorithm, and the DQN algorithm, respectively. Figure 9(b) compares the CPS1 values: DQN-UCB has the highest CPS1 value, 198.0%, which is 1.0%, 0.5%, 0.6%, and 0.25% higher than those of the Q algorithm, the PDWoLF PHC algorithm, the QUCB algorithm, and the DQN algorithm, respectively. Figure 9(c) compares the frequency deviation values: DQN-UCB has the lowest frequency deviation, 0.0042 Hz, which is 0.0020 Hz, 0.0015 Hz, 0.0018 Hz, and 0.0014 Hz lower than those of the Q algorithm, the PDWoLF PHC algorithm, the QUCB algorithm, and the DQN algorithm, respectively. It can be seen that, in the face of extreme events, the DQN-UCB algorithm coordinates integrated energy more effectively. A comparative test is then carried out using continuous step load perturbation, and the results are shown in Fig. 10.
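For reference, the relation between the three indicators compared above can be sketched in a few lines of Python. The sketch assumes the widely used NERC-style definitions, ACE = ΔP_tie − 10·B·Δf and CPS1 = (2 − CF) × 100% with CF = mean[(ACE / (−10·B)) · Δf] / ε1²; whether the study applies exactly these forms is an assumption, and the function names, the frequency bias B, the bound ε1, and the sample values are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def area_control_error(tie_line_dev_mw: float, freq_dev_hz: float,
                       bias_mw_per_0p1hz: float) -> float:
    """ACE = dP_tie - 10 * B * df, with B (negative) in MW per 0.1 Hz."""
    return tie_line_dev_mw - 10.0 * bias_mw_per_0p1hz * freq_dev_hz


def cps1_percent(ace_mw: Sequence[float], freq_dev_hz: Sequence[float],
                 bias_mw_per_0p1hz: float, epsilon1_hz: float) -> float:
    """CPS1 = (2 - CF) * 100, where CF is the mean of
    (ACE / (-10 B)) * df divided by epsilon1 squared."""
    if len(ace_mw) != len(freq_dev_hz) or len(ace_mw) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    products = [(a / (-10.0 * bias_mw_per_0p1hz)) * df
                for a, df in zip(ace_mw, freq_dev_hz)]
    compliance_factor = fmean(products) / epsilon1_hz ** 2
    return (2.0 - compliance_factor) * 100.0


# Hypothetical one-minute averages; B and epsilon1 are assumed values.
ace = [5.0, -5.0]          # MW
delta_f = [0.01, -0.01]    # Hz
print(cps1_percent(ace, delta_f, bias_mw_per_0p1hz=-50.0, epsilon1_hz=0.02))
```

Under these definitions CPS1 approaches 200% as the product of ACE and frequency deviation approaches zero, which is why a smaller area control error and a smaller frequency deviation go together with a CPS1 value close to 200%.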

Figure 10.

Controller output curve under continuous step load disturbance.

Figure 10(a) shows the controller output curve of the QUCB algorithm under continuous step load perturbation; the average frequency deviation of the QUCB algorithm is 0.0013. Figure 10(b) shows the controller output curve of the DQN algorithm under continuous step load perturbation; its average frequency deviation is 0.0011. Figure 10(c) shows the controller output curve of the DQN-UCB algorithm under continuous step load perturbation; its average frequency deviation is 0.00054, a 94.60% improvement over the QUCB and DQN algorithms. The DQN-UCB algorithm thus shows the smallest offset and strong dynamic control performance. Finally, to verify the advantages of the urban comprehensive energy optimization system based on the research algorithm in terms of energy utilization efficiency and grid benefits, a comparative analysis is conducted between the systems based on the DQN-UCB, PPO-UCB, and LinUCB algorithms. The final experimental results are shown in Fig. 11.
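One common way to express such a comparison is the relative reduction with respect to each baseline, (baseline − improved) / baseline. The short Python sketch below applies this definition to the three average frequency deviation values reported above; the function name `relative_reduction` and the choice of this particular definition are assumptions made for illustration.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` lies below `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# Average frequency deviation values as reported for Fig. 10.
baselines = {"QUCB": 0.0013, "DQN": 0.0011}
dqn_ucb = 0.00054

reductions = {name: relative_reduction(value, dqn_ucb)
              for name, value in baselines.items()}
for name, pct in reductions.items():
    print(f"DQN-UCB vs {name}: {pct:.1f}% lower")
```

Under this definition the reductions are roughly 58% and 51%, which lies within the 45%–73% range reported for the average absolute frequency deviation. The 94.60% figure would then correspond to a different baseline or definition, which is not specified in the text.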

Figure 11.

Comparison of energy utilization efficiency and economic benefits.

Figure 11(a) shows the economic benefits of the systems based on the different algorithms. The economic benefits of the system based on the DQN-UCB algorithm exceed 60000 yuan after 3 months and continue to grow over time. Figure 11(b) compares the energy utilization rates of the urban comprehensive energy systems based on the different algorithms. The energy utilization rate of the system based on the DQN-UCB algorithm exceeds 0.8 after 6 months and reaches 0.9 or above after 21 months. By comparison, the energy utilization efficiency of the systems based on the PPO-UCB and LinUCB algorithms stays below 0.8 throughout, and its growth over time is unsatisfactory. In summary, in the face of extreme events the DQN-UCB algorithm delivers higher power than the other algorithms, along with better control, greater stability, and faster and more flexible coordination.

5. Conclusion

Extreme events are often unpredictable and pose a serious threat to urban energy supplies. Improving the ability of the energy system to cope with emergencies and ensuring a stable energy supply are important guarantees for stable socioeconomic development. Therefore, building on a deep reinforcement learning algorithm, this study integrates the upper confidence bound (UCB) algorithm, which makes full use of historical information to make optimal action decisions, and builds an optimization system for cooperative scheduling of urban integrated energy. The results show that the DQN-UCB algorithm has the smallest instantaneous area control error and the smallest absolute value of frequency deviation; its average absolute frequency deviation is 45%–73% lower than that of the other algorithms. The OP of the DQN-UCB controller rapidly follows the load perturbation curve and ranges from $-$800 MW to 800 MW. In the comparison of frequency deviation values, the DQN-UCB algorithm has the lowest value, 0.0042 Hz, and under continuous step load perturbation its average frequency deviation is 0.00054, a 94.60% improvement over the QUCB and DQN algorithms. In the comparison of controller output curves under white noise load perturbation, the DQN-UCB algorithm tracks load changes more effectively and more smoothly than the QUCB and DQN algorithms, and it has a higher power generation efficiency, which is 100 MW and 80 MW higher than that of the QUCB algorithm and the DQN algorithm, respectively. This demonstrates that the DQN-UCB algorithm provides a reliable scheme for the cooperative optimization of integrated energy in urban areas under extreme events.
A limitation of this study is that the system is suited to integrated energy systems with the power system at their core; for more complex multi-regional integrated energy systems, additional factors and practical applications need further consideration.

Footnotes

Funding

The work was financially supported by Science and Technology Projects from State Grid Corporation of China (Research and demonstration of multi-agent cooperation and interaction technology for urban regional integrated energy system supporting grid toughness improvement, No. 5400-202317577A-3-2-ZN).
