Abstract
As global climate change intensifies and natural disasters occur more frequently, extreme events have seriously affected energy systems in urban areas and have brought great challenges to the supply and scheduling of urban energy systems. To better integrate and manage the various energy resources in urban areas, a Deep Q-Learning Network-Quasi Upper Confidence Bound model is constructed using deep reinforcement learning to learn the mapping between the states and behaviors of the energy system. Deep learning is used to fit complex nonlinear models in order to optimize the entire energy system, and the experiment is compared with and verified against a real energy system. The improved deep reinforcement learning algorithm is compared with the Q-learning model, the PDWoLF PHC algorithm model, the Quasi Upper Confidence Bound algorithm model and the Deep Q-Learning Network algorithm model. The results show that the research algorithm has the smallest instantaneous area control error and the smallest absolute value of frequency deviation, and its average absolute frequency deviation is 45%–73% lower than that of the other algorithms; over time, the unit output power of the research algorithm is able to flexibly track the stochastic square wave loads. Therefore, the proposed system strategies can provide feasible solutions to meet the challenges of extreme events and promote the sustainable development and safe operation of urban energy systems.
Keywords
Introduction
Human production and daily life cannot be separated from the organic integration of comprehensive energy systems; however, energy systems in urban areas face severe challenges under extreme events caused by both natural and anthropogenic factors [1]. Extreme events such as fires, earthquakes, and floods not only have a great impact on the lives of urban residents but also place new demands on the supply and dispatch of urban energy systems (UES) [2]. Under extreme events, UES scheduling faces several challenges. First, extreme events may damage energy facilities or cause them to fail, which affects the supply reliability of the energy system [3]. Second, the drastic changes in energy demand triggered by extreme events require the energy system to adjust rapidly within a short period of time [4]. Moreover, the uncertainty and complexity caused by extreme events make scheduling decisions for energy systems very difficult. The Deep Q-Learning Network (DQN) is a deep reinforcement learning algorithm (DRLA) that is capable of making intelligent decisions in uncertain environments and has strong exploration capabilities [5]. To cope with these challenges, this research constructs a system strategy that enhances the robustness and adaptability of the energy system based on deep reinforcement learning incorporating the Quasi-Upper Confidence Bound (Q-UCB) algorithm, and aims to propose a multi-agent collaborative optimal scheduling strategy for integrated energy systems (IES) in urban areas under extreme events.
The innovation of this research is to combine the advantages of the two algorithms and to propose corresponding coping strategies and methods for the uncertainty and complexity brought by extreme events. This is expected to provide theoretical guidance and practical reference for the scheduling decisions of urban regional energy systems under extreme events, and to improve the level of intelligence and the optimization effect of the energy system. The study is divided into four parts. The first part analyzes the research status of the improved DRLA. The second part describes the co-optimization function of the improved DRLA and the process of model construction. The third part compares and analyzes the improved DRLA model. The last part summarizes the whole paper. The DQN-UCB algorithm is designed to solve the optimal scheduling problem of the urban regional integrated energy system and to improve energy efficiency, system robustness, and reliability. During training, the agent feeds the current state into the DQN network, which outputs the corresponding behavior selection probability. In the DQN-UCB algorithm, introducing the UCB algorithm helps agents explore the state space better and improves the convergence speed and stability of the algorithm.
Related works
DQN, a reinforcement learning algorithm based on value functions, is mainly used for decision making and action selection. Multi-objective optimization in textile manufacturing is becoming increasingly challenging, and traditional methods cannot handle the high-dimensional decision space. For this reason, He et al. introduced DQN into a multi-agent system. Their experiments demonstrate that the proposed system can find the optimal solution for the textile ozone oxidation process and that it outperforms traditional methods [6]. To obtain a marketing investment strategy that increases brand visibility in marketing scenarios, the Vargas-Perez team developed a deep reinforcement learning agent based on a double deep Q-network algorithm, and the results showed that the decision support system facilitates optimization in an online dynamic learning environment [7]. To address the difficulty of controlling the window of a hydraulic support, Yang et al. presented a deep reinforcement learning method for regulating the action of the window based on a three-dimensional simulation platform; three-dimensional simulation experiments verified the validity of the method, which improves the efficiency of top coal mining and achieves better economic benefits [8]. To enable a vehicle agent to learn from errors through its actions and its interaction with the environment, Quek et al. used a deep Q network based on fractional and pixel inputs to realize agent learning in the vehicle, and experiments confirmed that the self-driving car could learn maneuvering operations and gradually gain the ability to navigate successfully and avoid obstacles [9].
To improve on older network intrusion detection methods and raise their detection rate, Yang’s team proposed a deep learning based detection model in which features are extracted automatically from encrypted malicious network traffic; experiments show that the model can differentiate between normal and abnormal encrypted network traffic with an accuracy of 99.95% [10].
Multi-agent cooperative optimal scheduling can achieve the optimal performance of the overall system through collaboration and optimal decision-making among multiple agents. To realize consistent tracking of a multi-agent system with actuator saturation, Chu et al. constructed a series of nested ellipsoid invariant sets related to the consistency error, and experiments showed that the scheduling gain parameter can improve the convergence speed of consistent tracking [11]. To set a reasonable discharge price for the grid and EVs, Zhang et al. presented a negotiation strategy for the participation of EVs in optimal dispatch in a multi-agent setting, and the numerical case studies show that the negotiation model is effective in balancing the interests of the grid and EVs as well as in peak shifting [12]. Traditional manual programmable logic controller systems face the problems of load imbalance and irrational bin allocation in the industrial load sector. For this reason, Chen et al. proposed various optimization models with multi-agent systems, which were proven to be intuitive for upward and hierarchical bin allocation [13]. To efficiently determine the optimal active and reactive power of dispatchable energy sources, Elgamal’s team proposed a new multi-agent control system for the energy management of microgrids, and the results showed that the model could dispatch agents on the DG and ESS buses to handle the optimal economic operation [14]. For closed-loop scheduling of demand response in energy systems, Campos et al. incorporated hybrid actions into the SAC framework, and the results showed that the algorithm can quickly avoid violating constraints and continuously improve towards the optimal solution [15].
In summary, the DQN algorithm performs well on problems with high-dimensional state spaces and large action spaces and has achieved excellent results in some complex tasks, while multi-agent cooperative optimal scheduling can realize the cooperative scheduling and efficient use of different resources and enhance efficiency. To address the coordination and control issues of urban regional integrated energy systems under extreme events, this paper studies and improves the DQN algorithm to construct an allocation and scheduling system better suited to energy management and optimization in urban regional integrated energy systems. The innovation of the research model lies in its multi-agent design, which enables collaborative optimization among the agents. This collaborative optimization method can better handle the various constraints and goals in the comprehensive energy system of urban areas and improve the overall energy utilization efficiency.
Co-optimization model construction based on DQN in integrated energy system
The study builds on DQN and introduces the Q-UCB algorithm to further improve the system structure, yielding the DQN-UCB algorithm, which is designed to improve the scheduling of the UES and realize effective energy management.
Construction of collaborative optimization model of integrated energy system based on improved DQN algorithm
DQN, as one of the DRLAs, can deal with high-dimensional state spaces in complex environments. By learning and optimizing the agents' decision-making, DQN can improve the efficiency of the integrated system and achieve rational allocation and utilization of integrated energy resources [16]. It can also realize collaborative learning among agents, improving the decision-making strategy of the IES by jointly optimizing decisions to adapt to energy demand and its changes in urban areas under extreme events.
In Eq. (1),
In Eq. (2),
Mapping from images to actions by using DQN.
Figure 1 shows the schematic diagram of image-to-action mapping using DQN. DQN realizes the mapping in four steps: it takes the original image as input, applies convolutional layers, passes the result through a fully connected neural network, and finally outputs the Q-value of each action. Through interaction between the agent and the environment, DQN obtains the reward value and updates the state, and the experience replay mechanism stores these interactions as samples in a memory unit. The update of the DQN network parameters is shown in Eq. (3).
In Eq. (3),
In Eq. (4),
In Eq. (5),
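The experience replay mechanism and the Q-value update described above can be illustrated with a minimal sketch. It assumes the standard one-step DQN target (reward plus discounted maximum target-network Q-value), which may differ in detail from the paper's Eq. (3); the names `ReplayBuffer`, `td_target`, and `squared_td_error` and the discount factor `gamma=0.9` are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) samples."""

    def __init__(self, capacity):
        # The oldest samples are discarded automatically once capacity is reached.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling weakens the correlation between consecutive steps.
        pool = list(self.memory)
        return rng.sample(pool, min(batch_size, len(pool)))


def td_target(reward, next_q_values, done, gamma=0.9):
    """Assumed standard DQN target: r + gamma * max_a' Q_target(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, target):
    """Per-sample loss that the online network is trained to reduce."""
    return (target - q_value) ** 2


# Usage: store five transitions in a buffer that keeps only the latest three.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store(step, 0, 1.0, step + 1, False)
batch = buffer.sample(batch_size=2)
```

In a full DQN the target would be computed from a separate target network and the loss would be minimized by gradient descent over the sampled batch; the sketch only shows the data flow.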
UCB Algorithm network structure.
Figure 2 shows the algorithm structure of Q-UCB. The computational complexity of the Q-UCB algorithm is relatively high, especially when the number of agents is large, which may increase the computational burden of the system and require more computational resources [18]. The Q-value function is updated as shown in Eq. (6).
In Eq. (6)
In Eq. (7),
DQN-UCB incorporates the upper confidence bound algorithm into the deep Q-learning neural network and realizes the agent's decision-making by learning the value function. By combining the two algorithms, DQN-UCB can explore the unknown domain during continuous learning while drawing on existing knowledge when making decisions [19]. The UCB algorithm makes full use of historical information when evaluating the value function of each action, so that the optimal action decision can be derived. When the estimated Q-value deviates from the true Q-value, the error is bounded using Hoeffding’s inequality, whose expression is shown in Eq. (8).
In Eq. (8)
In Eq. (9),
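How a Hoeffding-style bound turns into an action-selection rule can be sketched as follows. This is the classical UCB1 form, in which the exploration bonus is sqrt(c * ln(t) / n) for an action tried n times out of t total steps; it is an assumption that the paper's Eqs. (8) and (9) take a comparable form, and the function names and the constant `c=2.0` are hypothetical.

```python
import math


def ucb_bonus(total_steps, action_count, c=2.0):
    """Hoeffding-style exploration bonus sqrt(c * ln(t) / n) (UCB1 form)."""
    if action_count == 0:
        # An untried action has an unbounded confidence interval.
        return math.inf
    return math.sqrt(c * math.log(max(total_steps, 1)) / action_count)


def select_action(q_values, action_counts, total_steps, c=2.0):
    """Return the index of the action maximizing Q(a) + bonus(a)."""
    scores = [
        q + ucb_bonus(total_steps, n, c)
        for q, n in zip(q_values, action_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage: a rarely tried action can win despite a lower Q estimate.
choice = select_action([1.0, 0.9], [1000, 1], total_steps=1001)
```

The bonus shrinks as an action accumulates visits, so the rule drifts from exploration toward exploitation without a separately tuned exploration schedule.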
The algorithm framework of DQN-UCB is shown in the figure.
Figure 3 shows the algorithmic framework of DQN-UCB. Specifically, the DQN-UCB algorithm consists of the following steps. First, the deep neural network used to approximate the value function is initialized. Second, the UCB parameters, which balance exploration and exploitation, are initialized. Third, an action is selected based on the current state; using the trained deep neural network together with the UCB parameters combines the exploitation of existing knowledge with the exploration of the unknown domain. Fourth, the selected action is executed, feedback from the environment is observed, and the relevant parameters are updated based on the feedback to improve the decision. Fifth, based on the updated network parameters, an action is selected again and executed, and feedback is observed. This process is repeated until a preset stopping condition is reached. Through continuous iterative learning and optimization, the DQN-UCB algorithm can gradually improve the decision-making ability of the agents and eventually realize multi-agent collaborative optimal scheduling of the energy system. The expression of the loss function input by the UCB calculator is shown in Eq. (10).
In Eq. (10)
In Eq. (11)
In Eq. (12),
In Eq. (13),
In Eq. (14),
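The iterative procedure described above (initialize the value estimator and UCB parameters, select an action, execute it, observe feedback, update, repeat) can be sketched in a few lines. To keep the example self-contained, a Q-table stands in for the deep neural network, so this is a tabular analogue rather than the paper's DQN-UCB; the environment `toy_step`, the function names, and all hyperparameters are hypothetical.

```python
import math


def toy_step(state, action):
    """Hypothetical 3-level imbalance model: actions 0/1/2 lower/hold/raise output.

    State 1 means generation matches load; the reward penalizes any mismatch.
    """
    next_state = min(2, max(0, state + action - 1))
    return next_state, -abs(next_state - 1)


def run_ucb_q_sketch(step_fn, n_states, n_actions, steps=200,
                     alpha=0.5, gamma=0.9, c=2.0):
    # Initialize the value estimator (a table here, a network in DQN-UCB).
    q = [[0.0] * n_actions for _ in range(n_states)]
    # Initialize the UCB statistics: visit counts per state-action pair.
    counts = [[0] * n_actions for _ in range(n_states)]
    state = 0
    for t in range(1, steps + 1):
        def score(a, s=state, t=t):
            n = counts[s][a]
            if n == 0:
                return math.inf  # try every action at least once
            return q[s][a] + math.sqrt(c * math.log(t) / n)

        # Select an action, execute it, and observe the feedback.
        action = max(range(n_actions), key=score)
        next_state, reward = step_fn(state, action)
        # Update the count and move the estimate toward the one-step target.
        counts[state][action] += 1
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q, counts


def greedy_policy(q):
    """Best-known action for each state after training."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


q_table, visit_counts = run_ucb_q_sketch(toy_step, n_states=3, n_actions=3)
policy = greedy_policy(q_table)
```

In this toy setting the learned policy raises output when generation is low, holds when it is balanced, and lowers it when it is high. A fixed step budget plays the role of the preset stopping condition.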
Distributed multi region AGC architecture.
Figure 4 shows the framework diagram of the distributed multi-region AGC system, which dispatches and controls the generating units in the power system; its main function is to adjust power generation in accordance with changes in load demand so as to maintain the balance of the power system [20]. The AGC system can also monitor energy demand and supply in real time and, through intelligent scheduling algorithms, allocate energy to different regions and users to realize efficient energy use. After the magnitude of the optimal instantaneous area control error is processed, the reward function is as shown in Eq. (15).
In Eq. (15), ACE serves as the instantaneous error value of area control;
DQN-UCB system model flowchart.
Figure 5 shows the flowchart of the DQN-UCB algorithm. The action network weights and target action network weights are initialized and the parameters are set; the action is then executed to obtain the new state, and the reward function value is calculated. A random number is then selected: if it satisfies the condition, a sample is taken and updated; otherwise the procedure returns to the previous step. UCB is then used to derive the confidence level together with the Q-values of all actions. Then the sample is normalized by the probability of being selected and the weights
Statistical table of simulation experiment indicators for different algorithms
As can be seen from Table 1, in region 1 DQN-UCB has the smallest instantaneous area control error and the smallest absolute value (ABA) of frequency deviation: its average absolute frequency deviation is 45%–73% lower than that of the other algorithms, and its instantaneous area control error is 52%–78% lower. Its CPS1 value is also the highest, about 1.4–4.8 MW higher than that of the other algorithms. In region 2, the instantaneous area control error of DQN-UCB is 47.38 MW, 37.62 MW, 24.44 MW, and 11.66 MW lower than that of the Q algorithm, the PDWoLF PHC algorithm, the QUCB algorithm, and the DQN algorithm, respectively, and its frequency deviation of less than 0.01 Hz is the lowest value. In region 3, the DQN-UCB algorithm likewise has the minimum instantaneous area control error and absolute value of frequency deviation. It can be concluded that the DQN-UCB algorithm is stable, highly controllable, and adaptable in coping with irregular surges and dips in the power system.
To verify the stability and coordination capability of DQN-UCB when facing extreme event disturbances, the performance of the research algorithm model is compared with that of other algorithms to assess the effectiveness of DQN-UCB in energy systems.
DQN-UCB algorithm parameter design and performance evaluation
When extreme events occur, the urban area IES must deal with problems such as energy supply interruption, equipment damage and failure, and system recovery and reconstruction. A state-action data set, an environmental parameter data set, a regional control data set, a continuous step load disturbance data set, and an energy utilization and power grid income data set were selected for experimental analysis. First, the controller output of DQN-UCB is studied; the results are illustrated in Fig. 6.
Controller output curve of DQN-UCB.
Figure 6 shows the controller output curve of DQN-UCB. The output power (OP) of the DQN-UCB unit flexibly tracks the random square wave load over time, while the load itself fluctuates between 0 and 2100 MA. At 2.5 h the DQN-UCB controller curve is about 150 MA below the random square wave load, and at 3.5 h it is about 250 MA above it; for the rest of the time the controller curve fits the random square wave load almost completely, which shows that the adaptability and stability of the DQN-UCB algorithm can satisfy actual demand. Meanwhile, to test the convergence of the DQN-UCB algorithm, DQN-UCB and the Q algorithm, PDWoLF PHC algorithm, QUCB algorithm, and DQN algorithm are trained on the same simulation training set, and the training results are shown in Fig. 7.
Comparison of DQN-UCB learning effectiveness and convergence values.
Figure 7(a) shows the controller output curve of the DQN-UCB algorithm. It demonstrates that the OP of the DQN-UCB controller can quickly track the load perturbation curves, and its OP ranges from
Algorithm controller output curve.
Figure 8(a) shows the controller output curve of the DQN algorithm. The DQN algorithm achieves balanced coordination consistent with the pace of the disturbance changes, but the size of the coverage varies with time; at 40–50 s the coverage is larger, and the power range is 0.95–1.1 Hz. Figure 8(b) shows the controller output curve of the QUCB algorithm, which has the largest coverage over time; at 15–38 s the amplitude range is larger and the power range is 0.85–1.15 Hz. Figure 8(c) shows the controller output curve of the DQN-UCB algorithm. At the same moment the DQN-UCB algorithm has the smallest coverage of the three algorithms, and at 20–30 s its amplitude range is about 0.03–0.06 Hz and 0.05–0.09 Hz smaller than the power ranges of the DQN algorithm and the QUCB algorithm, respectively. It can therefore be seen that the DQN-UCB algorithm has the smallest power deviation and achieves a better dynamic control effect.
To test the stability of the DQN-UCB algorithm in response to extreme events, the instantaneous area control error, frequency deviation, and CPS1 indicators are selected and compared with those of the Q algorithm model, the PDWoLF PHC algorithm model, the QUCB algorithm model, and the DQN algorithm model; the results are shown in Fig. 9.
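For readers unfamiliar with the CPS1 indicator, the sketch below computes it under the commonly used NERC-style definition, CPS1 = (2 - CF) x 100% with CF = mean(ACE / (-10B) x ΔF) / ε1². It is an assumption that the paper's CPS1 follows this definition; the function name, argument names, and the sample numbers are hypothetical and are not taken from the paper's data.

```python
def cps1_percent(ace_mw, delta_f_hz, bias_mw_per_0p1hz, epsilon1_hz):
    """CPS1 in percent from per-minute averages of ACE and frequency deviation.

    bias_mw_per_0p1hz is the frequency bias B (negative by convention);
    epsilon1_hz is the target bound on the frequency error.
    """
    if not ace_mw or len(ace_mw) != len(delta_f_hz):
        raise ValueError("need equally long, non-empty ACE and frequency series")
    products = [
        ace / (-10.0 * bias_mw_per_0p1hz) * df
        for ace, df in zip(ace_mw, delta_f_hz)
    ]
    compliance_factor = (sum(products) / len(products)) / epsilon1_hz ** 2
    return (2.0 - compliance_factor) * 100.0


# Usage with made-up numbers: a perfectly balanced area scores exactly 200%.
balanced = cps1_percent([0.0, 0.0], [0.01, -0.01], -100.0, 0.02)
stressed = cps1_percent([10.0], [0.01], -100.0, 0.02)
```

Under this definition, values close to 200% indicate that the area control error rarely aggravates the frequency deviation.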
System performance comparison under random load disturbance.
Figure 9(a) compares the instantaneous area control error values: the error of DQN-UCB is 1.02 MW, which is lower than that of the Q algorithm, PDWoLF PHC algorithm, QUCB algorithm, and DQN algorithm by 1.41 MW, 1.28 MW, 1.32 MW, and 1.16 MW, respectively. Figure 9(b) compares the CPS1 values: DQN-UCB has the highest CPS1 value of 198.0%, which is 1.0%, 0.5%, 0.6%, and 0.25% higher than that of the Q algorithm, PDWoLF PHC algorithm, QUCB algorithm, and DQN algorithm, respectively. Figure 9(c) compares the frequency deviation values: DQN-UCB has the lowest frequency deviation of 0.0042 Hz, which is lower than that of the Q algorithm, PDWoLF PHC algorithm, QUCB algorithm, and DQN algorithm by 0.0020 Hz, 0.0015 Hz, 0.0018 Hz, and 0.0014 Hz, respectively. It can be seen that, in the face of extreme events, DQN-UCB coordinates the integrated energy system more effectively. A comparative test is then carried out using continuous step load perturbation, and the results are shown in Fig. 10.
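The absolute differences in area control error quoted above can be converted into relative reductions with a short calculation. The sketch assumes the quoted margins are plain differences from each baseline's error, so that each baseline equals 1.02 MW plus its margin; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(value, margin):
    """Percent reduction of `value` relative to a baseline that is `margin` higher."""
    baseline = value + margin
    return 100.0 * margin / baseline


dqn_ucb_ace_mw = 1.02
margins_mw = {"Q": 1.41, "PDWoLF PHC": 1.28, "QUCB": 1.32, "DQN": 1.16}

reductions = {
    name: round(relative_reduction(dqn_ucb_ace_mw, margin), 1)
    for name, margin in margins_mw.items()
}
# reductions -> {"Q": 58.0, "PDWoLF PHC": 55.7, "QUCB": 56.4, "DQN": 53.2}
```

Read this way, the instantaneous area control error of DQN-UCB under random load disturbance is roughly 53% to 58% lower than that of each comparison algorithm.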
Controller output curve under continuous step load disturbance.
Figure 10(a) shows the controller output curve of the QUCB algorithm under continuous step load perturbation; the average frequency deviation of the QUCB algorithm is 0.0013. Figure 10(b) shows the controller output curve of the DQN algorithm under continuous step load perturbation; the average frequency deviation of the DQN algorithm is 0.0011. Figure 10(c) shows the controller output curve of the DQN-UCB algorithm under continuous step load perturbation; the average frequency deviation of the DQN-UCB algorithm is 0.00054. Compared with the QUCB and DQN algorithms, the DQN-UCB algorithm shows a 94.60% improvement in offset and has strong dynamic control performance. Finally, to verify the advantages of the urban comprehensive energy optimization system based on the research algorithm in terms of energy utilization efficiency and grid benefits, a comparative analysis is conducted among systems based on the DQN-UCB, PPO-UCB, and LinUCB algorithms. The final experimental results are shown in Fig. 11.
Comparison of energy utilization efficiency and economic benefits.
Figure 11(a) shows the economic benefits of systems based on different algorithms. The economic benefits of the system based on the DQN-UCB algorithm exceed 60000 yuan after 3 months and continue to grow over time. Figure 11(b) compares the energy utilization rates of urban comprehensive energy systems based on different algorithms. The energy utilization rate of the system based on the DQN-UCB algorithm exceeds 0.8 after six months and reaches 0.9 or above after 21 months. By contrast, the energy utilization efficiency of the systems based on the PPO-UCB and LinUCB algorithms remains below 0.8 throughout, and its growth over time is unsatisfactory. In summary, in the face of extreme events the DQN-UCB algorithm has higher power than the other algorithms, along with better control, greater stability, and faster and more flexible coordination.
Extreme events are often unpredictable and pose a serious threat to urban energy supplies. Improving the ability of the energy system to cope with emergencies and ensuring the stability of energy supply are important guarantees for the stable development of the social economy. Therefore, based on the deep reinforcement learning algorithm, this study integrates the upper confidence bound (UCB) algorithm, which can make full use of historical information and make optimal action decisions, and builds an optimization system for urban integrated energy cooperative scheduling. The results show that DQN-UCB has the smallest instantaneous area control error and absolute value (ABA) of frequency deviation, with an average absolute frequency deviation 45%–73% lower than that of the other algorithms; at the same time, the OP of the DQN-UCB controller can rapidly follow the load perturbation curves, and its OP ranges
Footnotes
Funding
The work was financially supported by Science and Technology Projects from State Grid Corporation of China (Research and demonstration of multi-agent cooperation and interaction technology for urban regional integrated energy system supporting grid toughness improvement, No.: 5400-202317577A-3-2-ZN).
