Anti-interference technology for wireless communication based on greedy reinforcement learning algorithm

Abstract

As one of the most valuable areas of the telecommunications industry, wireless communication has shown great potential for development in the 21st century. With the widespread adoption of smartphones and 5G technology, building a high-quality wireless transmission network has become a key problem that urgently needs to be solved. The study establishes an interference-resistant wireless communication model and, on that basis, proposes a greedy reinforcement learning algorithm that directly retains high-value actions in order to avoid extensive network computation. The results show that, on the handwritten digit image set, the algorithm achieves a fitness value of 99.1 at about 19 iterations and then converges toward 99.9. This indicates that the algorithm, which incorporates a dual network structure and experience replay, converges quickly, can effectively enhance the learning efficiency of anti-interference in wireless communication systems, and provides a new reference method for the development of anti-interference technology for wireless communication.

Keywords

deep learning, reinforcement learning, greedy algorithm, wireless communication, anti-interference

Introduction

With the advance of communication technology, wireless communication (WC) is being used in more and more fields, and the communication space faces an increasingly complex electromagnetic environment. A stable and secure WC frequency band has always been the goal of technical experts, because environmental interference and malicious man-made attacks constantly affect normal communication.¹ As a result, more and more research focuses on anti-jamming technologies for WC. Current anti-jamming techniques can be divided into three types: time domain, frequency domain, and spatial domain.² Time-domain anti-jamming techniques, such as instantaneous communication, rely on the sequential change of the signal over time. Instantaneous communication techniques reduce the chance of a communication signal being detected by shortening the duration of the signal in the channel, which raises the threshold for interfering with the transmission.³,⁴ With the diffusion of this technology, the probability of communication messages being jammed is considerably lower, and it is therefore widely applied. Although a number of anti-jamming methods have emerged in the time, frequency, and spatial domains, they still struggle to overcome several inherent problems. Firstly, current methods usually fail to obtain an optimal policy and tend to fall into a locally optimal policy.
Secondly, because communication devices have limited computing power, the algorithms used today far exceed what the devices can tolerate and cannot be deployed on a large scale; excessive computation also increases communication latency and degrades communication quality.⁵ To make matters worse, deep learning-based channel analysis techniques are beginning to be used to attack WC, and general anti-jamming techniques struggle to counter them, so more efficient anti-jamming algorithms need to be proposed. Recently, the application of reinforcement learning to decision problems has provided new ideas for updating anti-jamming techniques in the field of communications. Therefore, the study constructs an anti-jamming model and designs a greedy reinforcement learning algorithm in order to further improve the effectiveness of anti-jamming techniques in communications. The remainder of the study is divided into four parts. The second part is a literature review on WC techniques and reinforcement learning. The third part presents the design of the anti-interference WC system model and the greedy reinforcement learning algorithm. The fourth part tests the greedy reinforcement learning algorithm and its communication interference immunity to verify the method. The final part summarizes the study and offers suggestions for future work.

Related works

To deal more comprehensively with attacks from malicious interferers in WC, scholars at home and abroad have carried out numerous studies and experiments. Gandhi et al. proposed a scheme based on network clustering and fuzzy reinforcement learning for maximizing network lifetime. The results showed that the method performs better than traditional methods.⁶ Ranjan et al. used the interference index (II) as the key quantity for minimizing the interference between secondary nodes in order to maximize the system capacity. To validate the outcomes, an efficient greedy algorithm was incorporated, and the final result confirmed that introducing the II can provide a 60% gain in cognitive radio (CR) network capacity.⁷ Liu et al. proposed a method based on generalized complementary coding scrambling multiple access (GCCSMA) to improve the operating efficiency of radio spectrum technology. This coding method combines complementary codes and multiple-input multiple-output technology to overcome the problem of code duplication during use. The results verified that the method is effective.⁸ Wang et al. proposed a distributed resource allocation (RA) algorithm based on coordinated Q learning to address channel interference during RA and improve communication quality. The outcomes demonstrated that the method has superior convergence performance and better network transmission capability than existing methods, and that coordinated Q learning effectively reduces the overhead of RA.⁹ Zhao et al. proposed a new technology based on the TACS architecture to achieve safer and more efficient train autonomous detour systems, and introduced 5G technology into this framework to make up for the shortcomings of existing communication technology. Compared with traditional methods, the algorithm converges significantly faster.¹⁰

With the rapid development of reinforcement learning technology, many scholars have begun to apply it directly to the field of network communication and RA. Yan et al. proposed a deep learning based anti-interference decision method for communication security to enhance communication anti-interference ability in the process of smart city development. The process integrates the analysis of interference strength and channel gain. The results showed that the algorithm has a network capacity of 960 bits when the number of links is 300 and makes autonomous decisions with high reliability.¹¹ Liang et al. proposed a transfer learning model based on a convolutional neural network to reconstruct compressed signals, addressing the situation where current deep learning algorithms cannot adapt to large sample data. The experiment used ultra-wideband radar echo signals and the Modified National Institute of Standards and Technology (MNIST) handwriting dataset to compare the performance of the constructed model. The results showed that the model has superior performance under different noise levels.¹² Mu et al. proposed a machine learning-based allocation scheme for intelligent spectrum partitioning to better analyze the interference immunity of wireless body area networks (WBANs). The results showed that the method adapts more quickly to the rapidly changing WBAN topology, while the system has excellent anti-interference performance and stability.¹³ Lu et al. proposed a reinforcement learning framework based on drone-assisted anti-jamming cells to improve the performance of cellular systems against smart jammers. UAVs are combined with deep reinforcement learning and transfer learning to improve the intelligent anti-interference capabilities of cellular systems. Simulation showed that the algorithm can significantly reduce the bit error rate and save drone energy consumption.¹⁴

In summary, with the rapid development of communication technology, wireless networks are facing increasingly severe interference problems, which include not only interference caused by natural factors, but also security threats caused by malicious attacks. In order to improve the anti-interference ability of wireless communication systems, researchers have proposed a variety of algorithms and methods. At present, reinforcement learning algorithms have been widely used in wireless communication networks. Most of the research focuses on the allocation of communication resources and network attacks. Few studies have integrated greedy algorithms into reinforcement learning to improve reinforcement learning, thereby better solving the problem of wireless communication anti-interference. In addition, the greedy algorithm performs well in scenarios such as resource allocation and power control due to its simple and efficient characteristics. Through the local optimal selection strategy, the greedy algorithm can quickly find the approximate optimal solution, which is suitable for real-time systems that require fast response. In view of this, the study proposes a wireless communication anti-interference technology based on greedy reinforcement learning algorithm. First, an anti-interference wireless communication model based on deep learning is constructed, and then the communication anti-interference model is optimized in combination with the greedy learning algorithm to achieve efficient operation of anti-interference wireless communication data tasks, and it is expected that it can effectively improve communication performance.

Anti-interference techniques for WC incorporating greedy-reinforcement learning algorithms

To improve the anti-interference capability of WC, the study combines a reinforcement learning algorithm with a greedy algorithm to form a new anti-interference technique for WC based on a greedy-reinforcement learning algorithm. This section explains the technique in two parts.

Deep learning-based model for interference-resistant WC systems

In this WC model, the sender sends information to the receiver over the transmission channel under the influence of multiple interfering attackers. At time $k$ , the sender (e.g. a wireless transmission device) starts transmitting data to the receiver with a transmission power of $P_{S} (k)$ .

At the same time, $L$ interference attackers send meaningless interference signals with power $P_{J}^{l} (k) \in \{P_{J}^{1} (k), P_{J}^{2} (k), \dots, P_{J}^{L} (k)\}$ to jam the transmission frequency band. The jamming power $P_{J}^{l} (k)$ of the $l$ th attacker is selected from several power levels. In modeling the communication system, the study assumes that each jamming attacker interferes with only one channel.¹⁵ At time $k$ , the sender chooses one of the $N$ optional communication bands for transmission, which is denoted by $x^{(k)}$ . The jammers choose their jamming attack frequency bands, which are represented in the system as $\{y_{1}^{(k)}, y_{2}^{(k)}, \dots, y_{L}^{(k)}\}$ . $h_{s}$ and $h_{l}$ indicate the channel power gains from the sender and from the $l$ th interference attacker to the receiver, respectively.¹⁶ The specific anti-interference WC system model is shown in Figure 1.

Figure 1.

Model of a jam-resistant WC system.

As can be seen in Figure 1, to defend against jamming attacks, the sender must select an unblocked secure channel $x^{(k)}$ and a suitable transmit power $P_{S} (k)$ . A variable transmit power is chosen because, subject to the same average power constraint, the variable transmit power model always has better communication efficiency than the constant transmit power model. At time $k$ , the receiver receives the signal and reports the signal-to-interference-plus-noise ratio (SINR) to the sender via the feedback channel.¹⁷ The SINR is calculated as shown in equation (1).

\mathrm{SINR} (k) = \frac{P_{S} (k) h_{s}}{β + \sum_{l = 1}^{L} P_{J}^{l} (k) h_{l} f (x^{(k)} = y_{l}^{(k)})}

(1)
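As a quick numerical illustration of equation (1), the following Python sketch evaluates the SINR for a single time slot. The function name `sinr` and its argument names are illustrative and do not come from the original study; each jammer is given its own gain so that $h_{l}$ may differ between attackers.

```python
def sinr(p_s, h_s, beta, jam_powers, jam_gains, x, jam_channels):
    """SINR of equation (1) for one time slot.

    p_s: sender transmit power, h_s: sender channel power gain,
    beta: receiver noise, x: channel chosen by the sender,
    jam_powers[l], jam_gains[l], jam_channels[l]: power, gain and
    chosen channel of jammer l (hypothetical argument layout).
    """
    # Only jammers sitting on the sender's channel contribute (indicator f).
    interference = sum(
        p_j * h_l
        for p_j, h_l, y in zip(jam_powers, jam_gains, jam_channels)
        if y == x
    )
    return p_s * h_s / (beta + interference)


# Sender on channel 3 at 10 W; an 8 W jammer on channel 3, a 4 W jammer on channel 7.
print(sinr(10, 0.5, 1, [8, 4], [0.5, 0.5], 3, [3, 7]))  # 5 / (1 + 4) = 1.0
```

If the sender hops to a channel that no jammer occupies, the interference term vanishes and the same call returns 5.0.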

In equation (1), $β$ represents the receiver noise; $P_{J}^{l} (k)$ represents the interference power selected by the $l$ th attacker; and $f (ξ)$ represents an indicator function that equals 1 when $ξ$ holds and 0 otherwise. When the channel is blocked by an interferer at time $k$ , the sender must retransmit the signal, which incurs additional energy consumption denoted as $C_{m}$ . When the interference power takes its maximum value $P_{J}^{L} (k)$ , the channel is considered to be completely blocked. To balance the energy saving of the communication equipment against communication performance, the utility of the communication system is defined as expressed in equation (2).

u_{s}^{(k)} = \mathrm{SINR} (k) - C_{m} f (P_{J}^{l} (k) = P_{J}^{L} (k)) f (x^{(k)} = y_{l}^{(k)}) - \frac{C_{s} P_{S} (k)}{P_{S}^{\max}}

(2)
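A minimal sketch of equation (2) in Python is given below. It assumes that the SINR has already been computed and that the two indicator functions are collapsed into a single boolean `blocked`, which is true when a jammer on the sender's channel uses the maximum jamming power; this reading of the index $l$ is an interpretation, and the function and argument names are hypothetical.

```python
def utility(sinr_value, p_s, p_s_max, blocked, c_m=1.0, c_s=2.0):
    """Utility of equation (2) for one time slot.

    blocked: True when the sender's channel is jammed at maximum power,
    in which case the retransmission cost c_m is charged (assumed reading).
    c_m and c_s default to the values used in the simulations.
    """
    retransmission_cost = c_m if blocked else 0.0
    power_cost = c_s * p_s / p_s_max      # cost grows linearly with transmit power
    return sinr_value - retransmission_cost - power_cost


print(utility(5.0, p_s=10, p_s_max=10, blocked=False))  # 5.0 - 0 - 2.0 = 3.0
print(utility(0.5, p_s=5, p_s_max=10, blocked=True))    # 0.5 - 1.0 - 1.0 = -1.5
```

The second call shows why a blocked channel is costly: the sender pays for both the wasted power and the retransmission while collecting a low SINR.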

In equation (2), $P_{S}^{\max}$ indicates the maximum transmission power; $R$ indicates the number of transmit power levels of the sender; $I$ indicates the number of jamming power levels of the interferers; $L$ represents the number of interference attackers; $N$ indicates the number of communication bands; $x^{(k)}$ indicates the communication band selected by the sender; $y_{l}^{(k)}$ indicates the attack band selected by the $l$ th attacker; $h_{s}$ indicates the channel power gain (CPG) of the sender; $h_{l}$ indicates the CPG of the $l$ th interfering attacker; $C_{s}$ indicates the transmission loss per unit power; $C_{m}$ denotes the unit loss of retransmission; and $u_{s}^{(k)}$ denotes the utility of the communication system at time $k$ . To systematically optimize the anti-interference model for WC, the study introduces deep learning to improve it. The implementation process of the reinforcement learning algorithm is shown in Figure 2.

Figure 2.

Specific implementation process of the reinforcement learning algorithm.

In the reinforcement learning model, the agent constantly interacts with its environment: at each time step it receives feedback from the environment indicating its current state, which it uses to update its policy. The sum of the rewards of this process is expressed in equation (3).

H_{t} = \sum_{k = 0}^{\infty} {γ_{1}}^{k} r_{t + k^{'}}

(3)
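To make equation (3) concrete, the sketch below computes the discounted sum for a finite list of rewards. Two assumptions are made that are not stated in the text: the infinite sum is truncated to the rewards that are available, and the reward index is taken to be $t + k$. The function name is illustrative.

```python
def discounted_return(rewards, gamma):
    """Finite-horizon version of equation (3): sum over k of gamma**k * rewards[k]."""
    total = 0.0
    # Work backwards so that each step applies H = r + gamma * H_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A smaller decay factor makes the agent short-sighted, since rewards far in the future contribute almost nothing to the sum.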

In equation (3), $H_{t}$ is the sum of the rewards; $γ_{1}$ is the decay factor; $r_{t}$ is the reward received by the agent from the environment; and $k^{'}$ is the coefficient. At time $t$ , the state of the environment is recorded as $S_{t}$ , and the next state $S_{t + 1}$ is obtained based on the rewards and actions. The goal of the agent in the reinforcement learning process is to maximize the sum of the rewards $H_{t}$ from each state; when a reinforcement learning problem has the Markov property, the model can be defined as a Markov decision process.¹⁸ This process is often represented by a five-tuple $(S, A, P, R, γ)$ , where $S$ is the set of environmental states; $A$ is the set of actions available to the agent; $P$ represents the transition probabilities; $R$ represents the reward function; and $γ$ is the discount factor. When a system model is available, dynamic programming is often used to solve reinforcement learning problems. That is, a policy evaluation algorithm is used to calculate the value function of a policy, while the optimal policy is learned through value iteration or policy iteration. The value function predicts the cumulative future return, as expressed in equation (4).

v_{π} (s) = E (H_{t} \mid S_{t} = s)

(4)

In equation (4), $v_{π}$ denotes the state value function under policy $π$ . In reinforcement learning, the main task of the agent is to find the optimal policy and use it to maximize the reward. The optimal policy is denoted as $π_{*}$ and can be obtained by comparing different policies, that is, by choosing the policy with the largest state value function or the largest action value function, as expressed in equation (5).

\begin{cases} v_{*} (s) = \max_{π} v_{π} (s) \\ q_{*} (s, a) = \max_{π} q_{π} (s, a) \end{cases}

(5)

In equation (5), $a$ represents a specific action. The fusion of all the above formulas leads to the final reinforcement learning objective of solving the optimal strategy, which can be calculated as shown in equation (6).

π_{*} (a | s) = \begin{cases} 1, & \text{if } a = \underset{a \in A}{\arg\max}\, q_{*} (s, a) \\ 0, & \text{otherwise} \end{cases}

(6)
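Equation (6) can be illustrated with a small table of action values. The sketch below assumes that the optimal action values are already known and stored as a nested list `q[s][a]`; the table layout and the function name are hypothetical, and ties are broken in favor of the lowest action index.

```python
def greedy_policy(q):
    """Equation (6): map q[s][a] to pi[s][a], 1 for the argmax action and 0 otherwise."""
    policy = []
    for row in q:
        # max() returns the first maximal index, so ties go to the lowest action.
        best = max(range(len(row)), key=row.__getitem__)
        policy.append([1.0 if a == best else 0.0 for a in range(len(row))])
    return policy


q_table = [[1.0, 3.0],   # state 0: action 1 is best
           [2.0, 0.5]]   # state 1: action 0 is best
print(greedy_policy(q_table))  # [[0.0, 1.0], [1.0, 0.0]]
```

The resulting policy is deterministic: in every state it puts all probability mass on one action.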

Combined with the above analysis, the resulting deep reinforcement learning structure is shown in Figure 3.

Figure 3.

Deep reinforcement learning architecture.

Design of anti-interference model for WC in view of greedy-reinforcement learning algorithm

In WC anti-jamming engineering, processing large amounts of data is a complex, large-scale systems task for researchers. To achieve efficient operation of jam-resistant WC data tasks, the study combines a greedy algorithm with a reinforcement learning algorithm to form a $(τ, ε)$-greedy reinforcement learning algorithm. The specific algorithm architecture is shown in Figure 4.

Figure 4.

Specific architecture of the greedy reinforcement learning algorithm.

As can be seen in Figure 4, module A saves actions and states and feeds the resulting parameters into the network, ultimately determining whether to retain the previous high-value action or to calculate the highest-value action with the network. Module B reduces the coupling of the network data through a dual network structure, calculates the $Q$ value, and updates the network parameters; the prioritized experience replay module improves the utilization of the samples by means of the Sum-tree structure. In the reinforcement learning part of the module, the $Q$ function determines the $Q$ value, as expressed in equation (7).

Q (s^{(k)}, a^{(k)}) = E \left[ u_{s}^{(k)} + γ \max_{a^{'} \in A} Q (s^{(k + 1)}, a^{'}) \mid s^{(k)}, a^{(k)} \right]

(7)
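In practice the expectation in equation (7) is approximated from samples. The following sketch shows one tabular update step under assumptions that are not part of the study: the $Q$ values are stored in a nested dictionary `q[state][action]`, and a hypothetical step size `alpha` moves the stored value toward the sampled target.

```python
def q_update(q, s, a, u, s_next, gamma=0.6, alpha=0.1):
    """One sample-based step toward the fixed point of equation (7).

    q: dict of dicts, q[state][action]; u: observed utility;
    alpha: hypothetical learning rate (not specified in the text).
    """
    target = u + gamma * max(q[s_next].values())   # u + gamma * max_a' Q(s', a')
    q[s][a] += alpha * (target - q[s][a])          # move the estimate toward the target
    return q[s][a]


q = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 2.0}}
print(q_update(q, s=0, a=1, u=1.0, s_next=1, gamma=0.5, alpha=0.5))  # 0 + 0.5 * (2.0 - 0) = 1.0
```

The study replaces the table with a neural network, but the target on the right-hand side has the same form.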

In equation (7), $s^{(k + 1)}$ represents the next state reached when the sender takes action $a^{(k)}$ in state $s^{(k)}$ ; $A$ represents the set of actions that can be chosen in state $s^{(k + 1)}$ ; $γ$ represents the discount factor, which reflects the uncertainty of the future reward received by the sender; and $a^{'}$ represents a candidate action in $A$ . In the traditional $ε$-greedy algorithm, the sender usually chooses the action with the largest $Q$ value, while each action is selected at random with probability $ε / | A |$ ( $| A |$ indicates the number of elements in the action space). The value of $ε$ decreases during the learning process, giving the sender sufficient opportunity to explore in the initial learning phase. To avoid falling into a local optimal solution while the algorithm is running, the study introduces a probability parameter $τ$ , the probability of directly executing the previous action $a^{(k - 1)}$ at time $k$ without calculating the $Q$ value. The resulting $(τ, ε)$-greedy policy is given in equation (8).

π (a^{(k)} | s^{(k)}) = \begin{cases} a^{(k - 1)}, & \text{with probability } τ \\ a_{r}, & \text{with probability } ε \\ \underset{a^{'} \in A}{\arg\max}\, Q (s^{(k)}, a^{'}), & \text{with probability } 1 - τ - ε \end{cases}

(8)
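The three branches of equation (8) can be sketched as a single sampling routine. The code below is an illustration under stated assumptions: `q_fn` is a hypothetical callable that returns the list of $Q$ values only when it is actually needed, $τ + ε \leq 1$ is assumed, and when no previous action exists the routine falls through to the random or greedy branch.

```python
import random


def select_action(q_fn, n_actions, prev_action, tau, eps, rng=random):
    """Sample an action according to equation (8).

    q_fn: callable returning [Q(s, a') for a' in range(n_actions)];
    it is only called in the greedy branch, so retained and random
    actions cost no network evaluation.
    """
    draw = rng.random()
    if prev_action is not None and draw < tau:
        return prev_action                      # retain a^(k-1)
    if draw < tau + eps:
        return rng.randrange(n_actions)         # random action a_r
    q_values = q_fn()
    return max(range(n_actions), key=q_values.__getitem__)  # greedy action


print(select_action(lambda: [0.1, 0.9, 0.3], 3, prev_action=2, tau=0.0, eps=0.0))  # 1
```

Passing the $Q$ values as a callable mirrors the motivation in the text: the more often the previous action is retained, the less often the network has to be evaluated.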

In equation (8), $π$ denotes the action strategy and $a_{r}$ denotes a random action. To identify valuable actions, the study uses as a threshold the average utility over the previous $T$ time slots, which is denoted as $\bar{u}_{s}^{(k - 1)}$ and expressed in equation (9).

\bar{u}_{s}^{(k - 1)} = \frac{1}{T} \sum_{i = 1}^{T} u_{s}^{(k - i)}

(9)

The value of the action $a^{(k)}$ is measured by the difference between $u_{s}^{(k)}$ and $\bar{u}_{s}^{(k - 1)}$ . Under the $(τ, ε)$-greedy rule, the sender directly adopts the action $a^{(k)}$ again in the next time slot $k + 1$ with probability $τ$ . To speed up learning, the sender retains actions that contribute more to the system: the larger the value of $u_{s}^{(k)} - \bar{u}_{s}^{(k - 1)}$ , the higher the probability that the action $a^{(k)}$ is directly adopted again. For this reason, a Gaussian-like function is used to calculate the value of $τ$ , as shown in equation (10).

τ = \begin{cases} 1 - \frac{1}{\sqrt{2 π} σ_{1}} \exp \left( - \frac{(u_{s}^{(k)} - \bar{u}_{s}^{(k - 1)})^{2}}{2 σ_{1}^{2}} \right), & u_{s}^{(k)} > \bar{u}_{s}^{(k - 1)} \\ \frac{1}{\sqrt{2 π} σ_{2}} \exp \left( - \frac{(u_{s}^{(k)} - \bar{u}_{s}^{(k - 1)})^{2}}{2 σ_{2}^{2}} \right), & \text{otherwise} \end{cases}

(10)
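Equations (9) and (10) together turn a utility history into a retention probability. The sketch below is one possible reading: it assumes the usual negative exponent of a Gaussian, so that a larger utility gain gives a larger $τ$, and it averages over fewer than $T$ values when the history is still short. The function names are illustrative, and the default parameters are the simulation settings $T = 5$ , $σ_{1} = 0.8$ , $σ_{2} = 85$ .

```python
import math


def average_utility(history, T=5):
    """Equation (9): mean of the last T utilities (fewer if the history is short).

    history must contain at least one value.
    """
    window = history[-T:]
    return sum(window) / len(window)


def retention_probability(u_now, u_avg, sigma1=0.8, sigma2=85.0):
    """Equation (10) with a negative Gaussian exponent (assumed sign)."""
    d2 = (u_now - u_avg) ** 2
    if u_now > u_avg:
        # Action beat the recent average: tau rises toward 1 as the gain grows.
        return 1.0 - math.exp(-d2 / (2 * sigma1 ** 2)) / (math.sqrt(2 * math.pi) * sigma1)
    # Action did not beat the average: tau stays small.
    return math.exp(-d2 / (2 * sigma2 ** 2)) / (math.sqrt(2 * math.pi) * sigma2)


u_bar = average_utility([1, 2, 3, 4, 5, 6])   # mean of the last five values
print(u_bar)                                   # 4.0
print(retention_probability(6.0, u_bar) > retention_probability(5.0, u_bar))  # True
```

With these settings an action that merely matches the average is retained with a probability below 0.005, while an action two utility units above the average is retained with a probability close to 0.98.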

In equation (10), $σ_{1}$ and $σ_{2}$ are parameters that control the step size of $τ$ . By the properties of the Gaussian function, the larger these two parameters are, the smoother the variation of $τ$ , and vice versa. Reinforcement learning is still used throughout the system to select the optimal action, which ensures the convergence of the algorithm in the same way as for the $ε$-greedy algorithm. A convolutional neural network (CNN) is used to update the $Q$ value. The corresponding structure is shown in Figure 5.

Figure 5.

Structure of CNN model.

Based on the output of the CNN, the sender can choose the optimal transmit power and channel.¹⁹ In the research method, two convolutional neural networks with the same structure are constructed, denoted as $Q_{1}$ and $Q_{2}$ . The action with the largest $Q$ value in the $Q_{1}$ network is denoted as $a_{\max}^{(k)}$ and is used to compute the target value. The detailed calculation is shown in equation (11).

\begin{cases} a_{\max}^{(k)} = \underset{a^{'}}{\arg\max}\, Q_{1} (φ^{(k + 1)}, a^{'}; θ_{1}^{(k)}) \\ Q_{\mathrm{target}}^{(k)} = u_{s}^{(k)} + γ Q_{2} (φ^{(k + 1)}, a_{\max}^{(k)}; θ_{2}^{(k)}) \end{cases}

(11)
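The two lines of equation (11) can be illustrated with plain lists standing in for the network outputs. In the sketch below, `q1_next` and `q2_next` are hypothetical lists holding the $Q$ values that the two networks assign to every action in the next state; the function name is likewise illustrative.

```python
def double_q_target(u, q1_next, q2_next, gamma=0.6):
    """Equation (11): Q1 selects the action, Q2 evaluates it.

    Returns the selected action index and the target value.
    """
    a_max = max(range(len(q1_next)), key=q1_next.__getitem__)  # argmax under Q1
    return a_max, u + gamma * q2_next[a_max]                   # value taken from Q2


print(double_q_target(1.0, [0.2, 0.9, 0.4], [1.0, 0.5, 2.0], gamma=0.5))  # (1, 1.25)
```

A single-network target would use the largest entry of `q2_next` (2.0 here) and return 2.0; decoupling action selection from evaluation gives 1.25 instead, which is how the dual structure limits overestimated targets.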

In equation (11), $θ_{1}^{(k)}$ and $θ_{2}^{(k)}$ represent the network parameters of $Q_{1}$ and $Q_{2}$ at time $k$ ; $φ^{(k + 1)}$ represents the next state; and $a_{\max}^{(k)}$ represents the action with the maximum $Q$ value given by the $Q_{1}$ network in state $φ^{(k + 1)}$ . Next, in the prioritized experience replay module, $e^{(k)} = [φ^{(k)}, φ^{(k + 1)}, a^{(k)}, u_{s}^{(k)}]$ represents the experience sample at time $k$ , and all samples are stored in the Sum-tree. The $M$ experience samples are indexed by the ordinal number $i$ . The probability of sampling the $i$ th sample, whose priority is $q_{i}$ , is calculated in equation (12).

P_{i} = \frac{q_{i}}{\sum_{j = 1}^{k} q_{j}}

(12)
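The Sum-tree mentioned above makes sampling according to equation (12) efficient: leaves hold the priorities, every internal node holds the sum of its children, and a sample is drawn by descending from the root. The following is a minimal array-based sketch rather than the structure used in the study; the class and method names are hypothetical.

```python
import random


class SumTree:
    """Binary tree whose leaves hold priorities q_i and whose root holds their sum."""

    def __init__(self, capacity):
        self.capacity = capacity
        # nodes[1] is the root; leaf i lives at index capacity + i.
        self.nodes = [0.0] * (2 * capacity)

    def update(self, i, priority):
        """Set the priority of sample i and refresh the sums on the path to the root."""
        pos = i + self.capacity
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.nodes[1]

    def find(self, value):
        """Return the sample whose priority interval contains value in [0, total)."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if value < self.nodes[left]:
                pos = left
            else:
                value -= self.nodes[left]
                pos = left + 1
        return pos - self.capacity

    def sample(self, rng=random):
        """Draw sample i with probability q_i / sum_j q_j, as in equation (12)."""
        return self.find(rng.random() * self.total())

    def probability(self, i):
        return self.nodes[i + self.capacity] / self.total()


tree = SumTree(4)
for i, q in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, q)
print(tree.total(), tree.probability(3))  # 10.0 0.4
```

Both `update` and `find` touch one node per tree level, so their cost grows with the logarithm of the buffer size instead of linearly.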

In addition, when evaluating the priority of the experience samples, the temporal difference (TD) error needs to be calculated, as shown in equation (13).

ψ_{i} = Q_{\mathrm{target}}^{(i)} - Q (φ^{(i)}, a^{(i)})

(13)

In equation (13), $ψ_{i}$ denotes the TD error. The network parameters are then updated by stochastic gradient descent on the loss function given in equation (14).

L (θ_{1}^{(k)}) = \frac{1}{M} \sum_{i = 1}^{M} ω_{i} ψ_{i}^{2}

(14)

In equation (14), $ω_{i}$ denotes the importance-sampling weight, which is expressed in equation (15).

ω_{i} = \frac{{(M \cdot P_{i})}^{- λ}}{\max_{1 \leq j \leq k} ω_{j}}

(15)
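Equations (12), (14), and (15) can be combined in a few lines. The sketch below is an illustration under simplifying assumptions: all $M$ stored samples form the training batch, every priority is strictly positive, and the normalization in equation (15) divides by the largest unnormalized weight. The function names are hypothetical.

```python
def sampling_probabilities(priorities):
    """Equation (12): P_i = q_i / sum_j q_j."""
    total = sum(priorities)
    return [q / total for q in priorities]


def importance_weights(probs, lam):
    """Equation (15): (M * P_i) ** (-lam), scaled so the largest weight is 1.

    Requires every probability to be strictly positive.
    """
    m = len(probs)
    raw = [(m * p) ** (-lam) for p in probs]
    top = max(raw)
    return [w / top for w in raw]


def weighted_loss(td_errors, weights):
    """Equation (14): mean of w_i * psi_i ** 2 over the M samples."""
    return sum(w * e * e for w, e in zip(weights, td_errors)) / len(td_errors)


probs = sampling_probabilities([1.0, 3.0])
print(probs)                                     # [0.25, 0.75]
print(importance_weights(probs, lam=0.0))        # [1.0, 1.0]
print(weighted_loss([1.0, 2.0], [1.0, 0.5]))     # (1.0 + 0.5 * 4.0) / 2 = 1.5
```

Frequently sampled experiences receive smaller weights, which offsets the bias that prioritized sampling would otherwise introduce into the gradient.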

In equation (15), $λ$ controls the degree of importance sampling and takes values from 0 to 1; $λ = 0$ and $λ = 1$ correspond to no and full importance-sampling correction, respectively. After the network parameters $θ_{1}^{(k)}$ of $Q_{1}$ are updated, the TD error is calculated again, and the priority of all experience samples is updated according to $q_{j} = | ψ_{j} |$ . Unlike the $Q_{1}$ network, the parameters $θ_{2}^{(k)}$ of the $Q_{2}$ network are not updated immediately, but are replaced by the network parameters of $Q_{1}$ at an interval of $f$ time steps. The resulting $(τ, ε)$-greedy PDDQN (Prioritized Double Deep Q Network) algorithm flow is shown in Figure 6.

Figure 6.

Basic flow of the PDDQN.
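The delayed update of the $Q_{2}$ network in this flow can be sketched as a periodic parameter copy. The code below assumes that the parameters are held in ordinary Python containers and that step 0 also triggers a copy; the function name and the dictionary layout are hypothetical, and the default interval is the simulation setting $f = 10$ .

```python
import copy


def maybe_sync_target(step, theta1, theta2, f=10):
    """Return the parameters Q2 should use after this step.

    Every f steps Q2 receives a deep copy of Q1's parameters;
    in between, Q2 keeps its current (older) parameters.
    """
    if step % f == 0:
        return copy.deepcopy(theta1)
    return theta2


theta1 = {"w": [1, 2]}
theta2 = {"w": [0, 0]}
print(maybe_sync_target(7, theta1, theta2))   # {'w': [0, 0]}
print(maybe_sync_target(10, theta1, theta2))  # {'w': [1, 2]}
```

Copying rather than sharing the parameters keeps the target values fixed while $Q_{1}$ continues to change between synchronizations.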

In Figure 6, the PDDQN model uses a greedy algorithm to solve the resource allocation problem in wireless communication and control the power. It also uses a deep learning algorithm to dynamically adjust the communication parameters and can adapt to the ever-changing communication environment, optimizing the communication performance through continuous learning. In the wireless communication anti-interference problem, the greedy reinforcement learning algorithm can be used to adjust the communication parameters to resist external interference and quickly respond to strategy implementation.

Performance test and application effect of anti-interference system for WC

In order to better analyze the functionality of the system constructed in this study, the performance of the WC anti-jamming system and the actual simulation results of this technology are analyzed in this section.

Performance testing of anti-interference systems for WC

Once the design optimization of the WC anti-interference system has been completed, the performance of the research model in actual operation needs to be compared with other methods to test its validity. To verify the application of the research method to the WC system, the experimental environment and basic parameters were first set. The basic environment for the experiments is shown in Table 1.

Table 1.

The experimental basic environmental parameters.

Parameter	Setting
Implementation platform	Simulink
Operating system	Windows 10
Operating environment	MATLAB
PC memory	12 GB
CPU clock frequency	2.62 GHz
GPU	RTX-2070
CPU model	i7-8700
Data storage	MySQL database
Data regression analysis platform	SPSS 26.0

To ensure that the experiments are conducted in a reasonable manner, the relevant parameters in the performance simulation experiments are set as follows: $P_{S} = [1, 5, 10] W$ , $P_{J} = [0, 4, 8, 10] W$ , $h_{s} = h_{j} = 0.5$ , $β = 1$ , $C_{m} = 1$ , $C_{s} = 2$ , $γ = 0.6$ , $T = 5$ , $σ_{1} = 0.8$ , $σ_{2} = 85$ , $W = 32$ , $M = 10$ , and $f = 10$ . The Improved BP Neural Network (IBP), the Improved Convolutional Neural Network (ICNN), and the method from the literature (20) were selected for comparison with the method in this study (PDDQN). Apart from algorithm-specific parameters, the experimental parameters were set identically for all algorithms. The ImageNet dataset and the MNIST dataset were selected as the basis for analyzing the performance of the different models. The convergence of the models for the different algorithms is shown in Figure 7.

Figure 7.

Comparison of convergence of different algorithms. (a) ImageNet dataset, (b) MNIST dataset.

Figure 7(a) compares the change in fitness on the ImageNet dataset. As the number of iterations increases, the fitness values of all four algorithms fluctuate in a zigzag pattern and never settle at stable values; at 120 iterations, the research method has the largest fitness value, 94.8, which indicates faster convergence and better optimization ability. Figure 7(b) shows the convergence test on the MNIST dataset. At around 19 iterations, the fitness value of the research method reaches 99.1 and then converges toward 99.9, whereas the remaining three algorithms only begin to approach a steady fitness at 140 iterations, with values smaller than that of the research method. These results indicate that the research method achieves better fitness and reaches a converged state more quickly, at which point it begins to find the optimal parameters. The four algorithms were then compared in terms of accuracy on the two datasets, as shown in Figure 8.

Figure 8.

Accuracy of the four algorithms running on different datasets. (a) ImageNet dataset, (b) MNIST dataset.

In Figure 8(a), the highest accuracy rates of PDDQN, literature (20), ICNN, and IBP are 92.1%, 82.3%, 74.2%, and 65.3%, respectively, at a run time of 0.05 s. The research method begins to approach its maximum accuracy once the run time reaches 0.035 s. Figure 8(b) shows that the accuracy of the IBP method remains low on the MNIST dataset; it increases with run time, but only slightly, with a highest value of around 70.0%. The accuracy of the research method rises steeply and reaches a maximum of 96.2% at around 0.032 s. These results indicate that, under the same experimental conditions, the research method has higher accuracy and reaches a steady state more quickly. A comparison of the time taken by the different algorithms to reach steady state on the two datasets is presented in Figure 9.

Figure 9.

Time taken by different models to reach steady state. (a) ImageNet dataset, (b) MNIST dataset.

In Figure 9(a), the running times of PDDQN, literature (20), ICNN, and IBP start to stabilize when the number of iterations is 91, at 0.0768 s, 0.0791 s, 0.0801 s, and 0.0823 s, respectively. The times taken by the four methods to reach a stable state are 0.0501 s, 0.0571 s, 0.0556 s, and 0.0601 s, respectively. The research method reaches a stable state in less time than the other algorithms, which to a certain extent improves the operational efficiency of WC anti-interference.

Analysis of simulation results of anti-interference experiments

Based on the above analysis of model performance, the study then conducted an experimental simulation of interference immunity and compared the results of different algorithms. Firstly, the SINR performance was compared with the same average power $P$ under the operating conditions $N = 32, L = 2$ and $N = 32, L = 8$ , respectively, as shown in Figure 10.

Figure 10.

Comparison of different reinforcement learning algorithms. (a) Comparison of SINR values of different algorithms when N = 32 and L = 2, (b) comparison of SINR values of different algorithms when N = 32 and L = 8.

In Figure 10(a), where $N = 32$ and $L = 2$ , the SINR of the research method reaches 4.53 at time node 200, while the SINR values of ICNN and literature (20) are only 3.41 and 3.82, respectively. In terms of convergence speed, the research method converges at time node 200, while the other three methods start converging after time node 1000. In Figure 10(b), where $N = 32$ and $L = 8$ , the performance of all methods is reduced. The research method still has the maximum SINR value at time node 200, reaching 400, while the corresponding values for literature (20), ICNN, and IBP are 3.41, 3.23, and 2.54, respectively. This indicates that the convergence rate of the research method starts to decrease when the number of interfering attackers increases, but it is still faster than that of the other algorithms. The incorporation of the dual network structure and experience replay in the research method effectively improves the performance and learning efficiency of the WC system against interference. Finally, the research method is compared with a reinforcement learning algorithm based on $ε$-greedy under the same environment variables, as shown in Figure 11.

Figure 11. Comparison of SINR for different greedy reinforcement learning algorithms. (a) Comparison of SINR of different algorithms when N = 32 and L = 2, (b) comparison of SINR of different algorithms when N = 32 and L = 8.

In Figure 11(a), the numbers of selectable communication bands and interfering attackers are 32 and 2, respectively, that is, $N = 32, L = 2$. At time node 200, incorporating the $(\tau, \varepsilon)$-greedy algorithm raises the SINR of the reinforcement learning algorithm rapidly from 2.61 to 3.32, and the convergence speed also improves considerably. In Figure 11(b), where $N = 32, L = 8$, a partial increase in SINR can be seen for both algorithms. At time node 200, the research method and the comparison method have SINR values of 2.52 and 2.91, respectively, but the improvement in convergence speed is not significant. This indicates that, when the chance of finding the optimal communication strategy is lower, retaining the previous action contributes little to overall communication performance, and the convergence time mostly exceeds 1000 time nodes.
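The difference between the two selection rules compared in Figure 11 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes that $\tau$ acts as a reward threshold (for example a target SINR) above which the previous action is simply retained, and that `q_fn` is a hypothetical callable standing in for the network forward pass that returns one Q-value per channel.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Plain epsilon-greedy: explore with probability epsilon, else argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def tau_epsilon_greedy(q_fn, prev_action, prev_reward, tau, epsilon, rng):
    """Retain the previous action if it paid off, else fall back to epsilon-greedy.

    Assumption (not taken from the paper): tau is a threshold on the reward
    obtained by the previous action. q_fn is a hypothetical zero-argument
    callable that evaluates the Q-network for the current state.
    """
    if prev_action is not None and prev_reward >= tau:
        # High-value action is kept directly; q_fn is never called here,
        # so the network computation is skipped for this step.
        return prev_action
    return epsilon_greedy(q_fn(), epsilon, rng)
```

In this sketch the saving comes from the retention branch, where no Q-values are computed at all. When jammers cover more channels, the previous reward clears the threshold less often, so the rule falls back to ordinary $\varepsilon$-greedy most of the time, which is consistent with the limited gain reported for $L = 8$.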

Conclusion

The development of WC technologies such as 5G has placed higher demands on communication security. In this context, the study constructs an anti-interference model for WC and uses it to design a greedy action retention algorithm that improves convergence speed and the utilization of communication computing resources. The results show that the algorithm reaches a maximum fitness value of 94.8 on the ImageNet dataset, with fast convergence and strong optimum-seeking capability. On the MNIST dataset, the IBP method peaks at 70.0%, whereas the proposed method shows a cliff-like increase in operational accuracy and reaches a maximum accuracy of 96.2% near 0.032 s. Moreover, at around 89 iterations the running times of the three methods PDDQN, ICNN and IBP begin to stabilize, at 0.0501 s, 0.0556 s, and 0.0601 s, respectively. The designed method takes significantly less time to reach a stable state than the other algorithms, which to a certain extent improves the running efficiency of WC anti-interference. With 32 channels and 8 jammers, the SINR values of ICNN and IBP at time node 200 are 3.23 and 2.54, respectively, while the proposed method reaches a maximum of 4.00. With 32 selectable communication bands and 2 jamming attackers, the reinforcement learning algorithm incorporating the greedy algorithm improves the SINR at time node 200 from 2.61 to 3.32 within a very short time and greatly accelerates convergence, which significantly enhances the anti-interference capability of WC.

However, the study did not consider the computing resources consumed by deep learning. Future work should combine the method with wireless-driven edge computing to improve its applicability to low-power networks.

Statements and declarations

Conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. Jiao, Sun, Fang, et al. An overview of wireless communication technology using deep learning. China Commun 2021; 18(12): 1–36.

2. Cao, Wei, Pei. A channel allocation method based on dual-population differential evolution in wireless sensor networks. Int J Sens Netw 2021; 36(1): 50–58.

3. Wang, Zhang, et al. A survey on deploying mobile deep learning applications: a systemic and technical perspective. Digit Commun Netw 2022; 8(1): 1–17.

4. Xia, Fattah SMM, Babar. A survey on UAV-enabled edge computing: resource management perspective. ACM Comput Surv 2023; 56(3): 1–36.

5. Luo, Wang, Xia, et al. Path planning for UAV communication networks: related technologies, solutions, and opportunities. ACM Comput Surv 2023; 55(9): 1–37.

6. Gandhi, Vikas, Ratnam, et al. Grid clustering and fuzzy reinforcement‐learning based energy‐efficient data aggregation scheme for distributed WSN. IET Commun 2020; 14(16): 2840–2848.

7. Ranjan, Agrawal, Joshi. Interference mitigation and capacity enhancement of cognitive radio networks using modified greedy algorithm/channel assignment and power allocation techniques. IET Commun 2020; 14(9): 1502–1509.

8. Liu, Huang, Chang, et al. Generalized complementary coded scrambling multiple access for MIMO communications. IEEE T Veh Technol 2021; 70(12): 13047–13061.

9. Wang, Qian. Self-adaptive resource allocation in underwater acoustic interference channel: a reinforcement learning approach. IEEE Internet Things 2019; 7(4): 2816–2827.

10. Zhao, Liu, Yang, et al. Future 5G-oriented system for urban rail transit: opportunities and challenges. China Commun 2021; 18(2): 1–12.

11. Yan, Wang, et al. A communication security anti-interference decision model using deep learning in intelligent industrial IoT environment. Soft Comput 2022; 26(16): 7993–8002.

12. Liang, Zhao. A transfer learning approach for compressed sensing in 6G-IoT. IEEE Internet Things 2021; 8(20): 15276–15283.

13. Wei, et al. Spectrum allocation scheme for intelligent partition based on machine learning for inter-WBAN interference. IEEE Wirel Commun 2020; 27(5): 32–37.

14. Xiao, Dai, et al. UAV-aided cellular communications with deep reinforcement learning against jamming. IEEE Wirel Commun 2020; 27(4): 48–53.

15. Ramezanpour, Mosavi. Two-stage beamforming for rejecting interferences using deep neural networks. IEEE Syst J 2020; 15(3): 4439–4447.

16. Shi, Niu, et al. Efficient jamming identification in wireless communication: using small sample data driven naive bayes classifier. IEEE Wirel Commun Le 2021; 10(7): 1375–1379.

17. Ohtsuki. Machine learning in 6G wireless communications. IEICE T Commun 2023; 106(2): 75–83.

18. Sun, Zhao, et al. End-to-end learning of secure wireless communications: confidential transmission and authentication. IEEE Wirel Commun 2020; 27(5): 88–95.

19. Cao, Liu. Deep AI enabled ubiquitous wireless sensing: a survey. ACM Comput Surv 2021; 54(2): 1–35.

20. Waqas, Halim, et al. The role of artificial intelligence and machine learning in wireless networks security: principle, practice and challenges. Artif Intell Rev 2022; 55(7): 5215–5261.