Abstract
With the increasing emphasis on network security, monitoring and identifying abnormal network traffic has become a research hotspot. To identify abnormal network traffic efficiently, this study uses linear discriminant analysis (LDA) to process the data features of network traffic, separating features of different categories through mapping. Principal component analysis (PCA) is then used to normalize the features and construct a new feature matrix with a low degree of fit. The features in this matrix are independent of one another while retaining the differences between samples. Finally, a support vector machine (SVM) classifies the feature matrix to complete anomaly recognition. The experimental results show that the proposed feature extraction algorithm achieves a maximum feature separation of 95%, completes classification in as little as 39 ms, and has an error rate as low as 0.5%. The classifier achieves a maximum recognition rate of 98.6% across the different types of abnormal traffic, and its highest accuracy reaches 99.1%. In summary, the improved classifier performs well and can be used in network traffic recognition applications.
Introduction
In the Internet era, the scale and complexity of network traffic keep increasing. Some of this traffic reflects abnormal behavior, such as malicious attacks and data leaks, so network traffic identification is of great significance for network security and user privacy protection. 1 Traditional network traffic recognition algorithms usually rely on feature extraction followed by classifier training, but this approach has shortcomings. First, traditional feature extraction requires considerable time and manpower; moreover, it is easily affected by data bias, resulting in unclear feature separation and lower recognition accuracy. 2 Second, traditional classifier training tends to ignore data redundancy, resulting in recognition errors. 3 Therefore, a new type of network traffic monitoring technology, namely anomaly detection, has emerged. This detection method starts from network traffic feature training and classifies abnormal traffic by analyzing the differences between types of traffic. Because many machine learning algorithms can be applied to abnormal traffic detection, selecting appropriate ones can achieve twice the result with half the effort. Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are both machine learning algorithms often used for feature extraction and classification tasks. LDA is supervised and separates data by category, whereas PCA is unsupervised and ignores sample categories. Abnormal traffic data is not only large in volume but also mixed with normal traffic data, making it difficult to distinguish.
Therefore, abnormal traffic data can first be processed with the LDA algorithm for category differentiation and then with the PCA algorithm for deeper feature processing. Beyond feature processing, recognizing abnormal data also requires classifying the processed features. The Support Vector Machine (SVM) is a classic classifier with excellent high-dimensional data processing capabilities. Therefore, this study adopts an improved SVM algorithm to recognize and classify abnormal data features. The paper is organized into five sections. The first section briefly introduces the research direction and the innovations of the paper. The second section reviews existing network anomaly recognition algorithms. In the third section, an improved LDA-PCA-SVM classification algorithm is designed according to the characteristics of abnormal network traffic. The fourth section analyzes the performance of the classification algorithm. The fifth section summarizes the research results and discusses future prospects.
Related works
Network anomaly traffic identification is a technique for securing the network by detecting anomalous traffic. It can be achieved with techniques such as machine learning, deep learning, data analytics and firewalls, and a variety of algorithms have been developed for it. Wang et al. 4 developed a stochastic matrix theory method for anomaly monitoring and localization of data in distribution networks. Each feeder in the distribution network forms a corresponding data matrix; the method represents the anomaly state of the data through linear eigenvalues and localizes the anomalous data through eigenvectors. The experimental results show that the method is highly robust. For the problem of network intrusion, Xiao et al. 5 proposed a feature similarity fuzzy clustering algorithm for anomalous data detection. The algorithm computes the similarity of user data from web page weight data and locates anomalous data based on the information returned by the web page window. The experimental results show that the algorithm has high accuracy. Song and Kim 6 extended network anomaly monitoring to in-vehicle network systems and developed a machine learning anomaly monitoring method. The method uses noisy pseudo-normal data to detect anomalies and trains on pseudo-normal flows produced by a generator. The algorithm monitored vehicle network anomalies effectively and was superior to other monitoring algorithms. Niu et al. 7 optimized machine learning and proposed a heuristic statistical testing method that combines statistics with machine learning: it counts the lengths of network traffic sequences and operates on them to measure the degree of anomaly in the data. The experimental results show that the method identifies traffic with high accuracy. Lyamin et al. 8 improved the security of network data in wireless communication. Their method is based on the DCC algorithm and mainly addresses malicious blocking of network traffic during communication. The study combines the DCC algorithm with a hybrid interference detector to statistically analyze wireless network traffic. The experimental results show that the method detects anomalous data efficiently.
The LDA algorithm is a method used in statistics, pattern recognition and machine learning, mainly for finding combinations of features that separate classes. Mendes et al. 9 used the LDA algorithm to detect squamous intraepithelial lesions in medicine. First, mass spectrometry was used to analyze plasma and differentiate lesion plasma from normal plasma. Then the mass spectrometry results were analyzed and used for prediction with the LDA algorithm. The experimental results show that the detection method combining LDA and mass spectrometry has a detection accuracy of 77% for plasma and a prediction accuracy of 80%. Liu et al. 10 developed a carbon nanotube sensor array for LDA-SVM model training. The array can be used to recognize whether mammalian cells are dead or alive. The experimental results show that the accuracy of cell recognition after training with the LDA-SVM model is 87.5%–93.8%. Fierro et al. 11 used the LDA method to detect the water vapor mass mixing ratio in high-altitude lightning. In the LDA algorithm, the total water vapor mass outside the location of the observed lightning is removed by an equal amount to reduce its impact on the detection results. The experimental results show that the algorithm detects the water vapor mass mixing ratio of lightning with high accuracy. Pontes et al. 12 combined the LDA algorithm with the Ant Colony Optimization (ACO) algorithm, a meta-heuristic that mimics the cooperative behavior of ants. This meta-heuristic increases the probability of discarding non-informative variables and can be used to construct a more concise model. Experimental results show that the optimized LDA-ACO algorithm can be used for model testing at a faster speed.
To sum up, abnormal network traffic behavior is a kind of harmful behavior caused by illegal attacks. Although many researchers have developed different recognition techniques combining multiple algorithms, there are few fusion recognition algorithms combining LDA, PCA and SVM. Since LDA and PCA algorithms have good feature extraction and separation performance, LDA-PCA algorithm is used in this study to process the features of abnormal network traffic data, and SVM method is used to build a feature classifier to complete the feature recognition of abnormal network traffic.
Research on network abnormal traffic identification algorithm based on LDA-PCA-SVM
Network abnormal traffic monitoring is currently a new research hotspot in the field of networks. Therefore, selecting appropriate algorithms for network abnormal traffic monitoring has important value. This study proposes to use LDA algorithm and PCA algorithm to process the features of network data. Firstly, LDA algorithm is used for data mapping dimensionality reduction, then PCA algorithm is used for normalization, and finally SVM algorithm is used for classification and recognition.
Feature classification of abnormal network traffic behavior
Abnormal network traffic behavior is one of the most common types of abnormal network security behavior, and its attack object is network traffic. Network traffic is defined as the set of IP data generated at a specific network observation point during a specific time period. These data sets represent the sum of the traffic information of the network device in that period. 13 Because network traffic effectively reflects the running state of the network and the device, it is often used as the carrier of abnormal attack behavior.
Figure 1 shows the composition of network traffic data and how it is monitored. Figure 1(a) is the composition diagram. The fields of network traffic data have different meanings and include input and output IP addresses, input and output ports, data protocols, number of fields, data packets, data flows, and data flow duration. The data flow field contains complete traffic information and can accurately describe all network traffic behaviors in a specific period of time. By classifying, extracting and analyzing different features of network behaviors, the security performance of network traffic behaviors in this period can be obtained. 14 Figure 1(b) shows the network traffic monitoring diagram. First, network traffic is collected by different data collectors and aggregators. Then, in the summarized and aggregated flow database, the network data flow analyzer analyzes the traffic data to monitor data behavior.
Schematic diagram of network traffic data composition and monitoring. (a) Schematic diagram of the composition of network traffic data. (b) Network traffic monitoring chart.
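The flow fields described above can be pictured as a simple record type. The following is a minimal sketch only: the `FlowRecord` class, its field names (including `byte_count`) and the derived rate features are illustrative assumptions, not a schema given in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One aggregated flow, loosely following the fields named in the text.

    All field names are hypothetical; the study does not specify a schema.
    """
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str      # e.g. "TCP" or "UDP"
    byte_count: int    # total bytes carried by the flow (assumed field)
    packet_count: int  # total packets carried by the flow
    duration_s: float  # flow duration in seconds

    def to_features(self) -> list[float]:
        """Turn the record into numeric features for a downstream classifier."""
        # Guard against zero-length or empty flows to avoid division by zero.
        duration = max(self.duration_s, 1e-9)
        packets = max(self.packet_count, 1)
        return [
            float(self.src_port),
            float(self.dst_port),
            self.byte_count / duration,    # bytes per second
            self.packet_count / duration,  # packets per second
            self.byte_count / packets,     # mean packet size
        ]


# Made-up example flow (addresses are documentation/private ranges).
flow = FlowRecord("10.0.0.5", "192.0.2.7", 51514, 443, "TCP",
                  byte_count=12_000, packet_count=24, duration_s=2.0)
features = flow.to_features()
```

Rates such as bytes per second are used here because they are comparable across flows of different lengths; any real feature set would depend on the collector in use.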
Abnormal network traffic refers to the traffic data in a specific period of time that is inconsistent with the characteristics of normal data and causes suspicion. Abnormal network traffic is an operation behavior that sends abnormal traffic data. When abnormal data is detected in network traffic, it indicates that the information of the device has been illegally leaked. Currently, there are three common abnormal behaviors of network traffic, which are operation context exception, environment exception and access exception. Context exception refers to abnormal behavior during traffic monitoring. An access exception refers to an access exception caused by unexpected events. Environmental anomalies are the most complex, and the causes of such anomalies are difficult to be directly analyzed from the data set of network traffic, and need to be analyzed and screened through other permissions.
Figure 2 shows the abnormal behavior of the network traffic environment. The cause of abnormal network traffic on the victim host is that hackers attack multiple puppet machines by manipulating one puppet machine. Then the virus data in the puppet machines is transmitted to the victim host, causing a network exception on the victim host. With the increasing attention of Internet users to network security, many algorithms have been developed to monitor abnormal behavior of network traffic. The common principle of abnormal traffic behavior identification is to classify and extract the features of network traffic data, and to establish the optimal classification order by constructing a multi-layer classification model. 15
Schematic diagram of abnormal behavior in network traffic environment.
Figure 3 shows the schematic diagram of abnormal behavior identification of network traffic. Before monitoring abnormal network traffic, data must be preprocessed. The data set is split into a training set and a test set, then the feature classification algorithm is applied to extract the data features, and finally the feature type is identified by the detection model. Although this method can effectively monitor and identify abnormal behaviors of network traffic, classification errors still occur because the data classes are unbalanced. In practice, different technicians consider different directions in anomaly monitoring and identification, so they collect different data and build different feature classification models. Therefore, the classification accuracy of every feature category cannot be improved at the same time. To address the low accuracy of abnormal traffic data classification, this study uses the LDA method to identify data characteristics, normalizes the feature set with the PCA method, and finally designs a feature classifier with the SVM algorithm to complete the feature recognition.
Framework diagram for identifying abnormal behavior in network traffic.
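Because the text notes that unbalanced classes cause classification errors, one common way to perform the training/test split is to stratify it by class. The sketch below illustrates that idea under stated assumptions: the function name `stratified_split`, the 30% test ratio and the toy labels are hypothetical and are not taken from this study.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_ratio=0.3, seed=0):
    """Split (sample, label) pairs so that each class keeps roughly the
    same share in the training and test sets."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    rng = random.Random(seed)
    train, test = [], []
    for y, items in by_class.items():
        rng.shuffle(items)
        # Keep at least one test sample per class when the class has 2+ items.
        n_test = max(1, round(len(items) * test_ratio)) if len(items) > 1 else 0
        test += [(x, y) for x in items[:n_test]]
        train += [(x, y) for x in items[n_test:]]
    return train, test


# Toy, made-up data: 10 "normal" flows and 4 "dos" flows.
samples = list(range(14))
labels = ["normal"] * 10 + ["dos"] * 4
train, test = stratified_split(samples, labels)
```

With this split, a rare class cannot end up entirely in the training set or entirely in the test set, which keeps per-class evaluation meaningful.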
Research on abnormal behavior recognition algorithm of network traffic based on improved LDA-PCA-SVM
In the field of feature extraction, the LDA and PCA algorithms are the most commonly used representative methods. LDA is a subspace learning algorithm that projects data features into a subspace while preserving the integrity of the features as far as possible. It is a supervised algorithm. 16
Abnormal traffic data has the characteristic of complex categories, so a supervised feature classification algorithm is needed to process it. However, although a single LDA algorithm can distinguish abnormal data from normal data, there is still a problem of incomplete processing of abnormal data. Therefore, the study also introduced the PCA algorithm to process the classified abnormal data. The PCA algorithm can map traffic data with high-dimensional features to a low-dimensional space for visualization processing. Suppose there is a set
In equation (1),
Figure 4 is a schematic diagram of the mapping of different types of samples, showing the projections obtained under different mapping directions. The mapping direction on the right allows better sample separation without overlap. This shows that a well-chosen mapping direction improves the degree of separation between the two types of samples, so it is necessary to calculate the inter-class dispersion matrix.
Mapping diagrams for different types of samples.
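The role of the mapping direction can be illustrated with the standard two-class Fisher criterion, in which the direction is proportional to the inverse within-class scatter matrix applied to the difference of the class means. The sketch below is a textbook formulation for 2-D samples and may differ in detail from the formulation used in this study; all function names and the sample points are made up for illustration.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, mu):
    """2x2 scatter matrix of one class: sum of (x - mu)(x - mu)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Return w proportional to S_w^{-1} (mu_a - mu_b) for 2-D samples."""
    mu_a, mu_b = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    # Within-class scatter is the sum of the per-class scatter matrices.
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(points, w):
    """Project each 2-D point onto the direction w (a scalar per point)."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]


# Made-up 2-D samples standing in for two traffic classes.
normal = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
abnormal = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0), (7.0, 6.0)]
w = fisher_direction(normal, abnormal)
proj_normal, proj_abnormal = project(normal, w), project(abnormal, w)
```

On these toy points the two sets of projected values do not overlap, which is the one-dimensional analogue of the well-separated mapping described for Figure 4.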
In equation (2),
In equation (3),
In equation (4),
In equation (5),
In equation (6),
In equation (7),
To efficiently calculate the information contained in the matrix in time, a new matrix
Then, the linear operation of the feature matrix is carried out through the variance threshold
In equation (10),
Figure 5 shows the flow chart of feature processing under the LDA-PCA algorithm. The feature separation algorithm proposed in this study combines the LDA and PCA algorithms to separate the data features in the sample set. First, the category of the target sample set is entered, and the inter-class dispersion matrix is calculated.
Feature processing flowchart under LDA-PCA algorithm.
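The PCA stage of this flow can be sketched as follows. This is a generic illustration, assuming that components are kept until a cumulative explained-variance threshold is reached; the 0.95 threshold, the function names and the sample rows are assumptions, and the eigenvectors are found with simple power iteration rather than whatever solver the study used.

```python
import math


def covariance(rows):
    """Sample covariance matrix and centered data for a list of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def top_eigenpair(mat, iters=500):
    """Dominant eigenpair of a symmetric PSD matrix via power iteration.

    Caveat: the fixed start vector must not be orthogonal to the
    dominant eigenvector, which holds for the toy data below.
    """
    d = len(mat)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return 0.0, v
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(d)) for i in range(d))
    return lam, v


def pca(rows, variance_threshold=0.95):
    """Keep the fewest components whose explained variance reaches the threshold."""
    cov, centered = covariance(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    if total == 0:
        raise ValueError("data has zero variance")
    components, explained = [], 0.0
    while len(components) < d and explained / total < variance_threshold:
        lam, v = top_eigenpair(cov)
        components.append(v)
        explained += lam
        # Deflation: subtract the component just found from the matrix.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    projected = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
                 for c in centered]
    return projected, explained / total


# Made-up rows: the second feature is roughly twice the first, the third is noise.
rows = [[1.0, 2.1, 0.5], [2.0, 3.9, 0.4], [3.0, 6.2, 0.6],
        [4.0, 7.8, 0.5], [5.0, 10.1, 0.4]]
projected, ratio = pca(rows)
```

Because two of the three toy features are almost perfectly correlated, a single component already exceeds the threshold, which shows how redundant features collapse into fewer independent ones.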
The SVM classification algorithm is popular in sample recognition and classification. Identifying abnormal network traffic is a process of classifying and recognizing the characteristics of abnormal data, so the algorithm is well suited to this task. It works by finding, in a given sample set, the separating boundary that maximizes the classification distance between the two types of samples. 19
In sample set
In equation (11),
Figure 6 shows the sample classification diagram under the SVM algorithm. In Figure 6, the classification effect between the two kinds of samples mainly depends on the plane interval constructed by the intermediate linear equation. The larger the interval, the more obvious the classification effect. 20,21 To obtain the optimal sample classification distance, a relaxation factor is introduced.
Schematic diagram of sample classification under SVM algorithm.
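The effect of a relaxation factor can be illustrated with a soft-margin linear SVM trained on the hinge loss, where samples that fall inside the margin are penalized rather than forbidden. The sketch below uses a simplified stochastic subgradient scheme; it is not the improved SVM of this study, and the hyperparameters `c`, `lr` and `epochs`, the function names and the toy data (assumed to be standardized, zero-centered features) are all illustrative assumptions.

```python
import random


def train_linear_svm(samples, labels, c=1.0, lr=0.01, epochs=200, seed=0):
    """Soft-margin linear SVM trained by per-sample subgradient steps.

    Each step uses the per-sample objective
        0.5 * ||w||^2 + c * max(0, 1 - y * (w.x + b)),
    with labels y in {+1, -1}. The hinge term plays the role of the
    slack (relaxation) variables.
    """
    rng = random.Random(seed)
    d = len(samples[0])
    w, b = [0.0] * d, 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: hinge term is active.
                w = [wj - lr * (wj - c * y * xj) for wj, xj in zip(w, x)]
                b += lr * c * y
            else:
                # Only the regularizer contributes to the subgradient.
                w = [wj - lr * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating line."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Made-up, zero-centered 2-D features: -1 = normal, +1 = abnormal.
samples = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (-3.0, -1.0),
           (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (3.0, 1.0)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]
w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

A larger `c` penalizes margin violations more heavily and approaches hard-margin behavior, while a smaller `c` tolerates more violations in exchange for a wider margin.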
In equation (12),
Performance analysis of network abnormal traffic identification algorithm based on LDA-PCA-SVM
To improve the monitoring of large-scale abnormal network traffic data, this study uses the LDA and PCA algorithms to process data features. After the features are effectively separated, they are classified by an SVM classifier. To test the recognition performance of these algorithms, their performance is analyzed on the typical WIDE dataset.
Performance analysis of feature extraction algorithm based on LDA-PCA
To explore the recognition performance of the proposed algorithm, the typical WIDE dataset is used as the experimental set. The WIDE dataset has a reliable distribution of test and training sets, which reduces the redundancy of the dataset. In this experiment, a single LDA algorithm, a single PCA algorithm and the Scale-Invariant Feature Transform (SIFT) algorithm were used for comparative analysis. These three comparison algorithms are widely used because of their good feature processing effect.
Experimental environment table.
Figure 7 compares the feature separation degree of the algorithms under different feature dimensions. Figure 7(a) and (b) show the feature separation degree on the training set and the test set, respectively. The overall separation degree on the training set is superior to that on the test set, which verifies the validity of the data set. The separation degree of all four algorithms increases as the feature dimension increases. The LDA-PCA algorithm proposed in this paper has the highest degree of feature separation because it takes data overfitting into account and introduces inter-class and intra-class dispersion parameters during operation, thus improving the separation of feature data. On the training set, the LDA-PCA algorithm achieves a separation degree of 0.95 when the feature dimension is 7, so the algorithm has an excellent feature separation effect.
Comparison of algorithm feature separation under different feature dimensions. (a) Algorithmic feature separation for different dimensions under training set. (b) Algorithmic feature separation for different dimensions under test set.
Figure 8 illustrates the effect of feature mapping under the various algorithms. Figure 8(a)–(d) show the feature mapping results of the SIFT, PCA, LDA-PCA and LDA algorithms, respectively. The SIFT algorithm has the worst feature mapping effect, with different types of feature points not completely separated. It is followed by the PCA algorithm and then the LDA algorithm, while the LDA-PCA algorithm has the best mapping effect. The LDA-PCA mapping clearly separates the two types of feature points, with clear boundaries and a low degree of coincidence. The mapping effect is good because the selection of inter-class and intra-class dispersion parameters helps to control the mapping direction of the feature sample set. A well-chosen mapping direction reduces the mapping coincidence rate of feature points of different types. Therefore, the LDA-PCA algorithm has a good mapping effect and can clearly separate different types of feature points.
Feature mapping renderings under different algorithms. (a) Feature mapping diagram under SIFT algorithm. (b) Feature mapping diagram under PCA algorithm. (c) Feature mapping diagram under LDA-PCA algorithm. (d) Feature mapping diagram under LDA algorithm.
The algorithms used in Figure 9(a)–(d) are the SIFT, PCA, LDA-PCA and LDA algorithms, respectively. Among the four algorithms, LDA-PCA has the shortest response time and the lowest error rate in feature processing, and SIFT performs worst. The shortest response time of the LDA-PCA algorithm is 39 ms and its lowest error rate is 0.5%, much better than the 112 ms response time and 5.7% error rate of the SIFT algorithm. The response time reflects the efficiency of the algorithm, and the error rate reflects the reliability of its results. Therefore, the proposed LDA-PCA algorithm has high operational efficiency and reliability.
Response time and error rate of feature processing under different algorithms. (a) Feature processing time and error rate under SIFT algorithm. (b) Feature processing time and error rate under PCA algorithm. (c) Feature processing time and error rate under LDA-PCA algorithm. (d) Feature processing time and error rate under LDA algorithm.
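Response time and error rate of the kind compared above can be measured with a small harness such as the one sketched below. The study does not state whether its response time is per sample or per batch, so the per-sample average used here is an assumption, as are the function names and the toy threshold classifier.

```python
import time


def measure(classify, samples, labels):
    """Return (mean response time in ms per sample, error rate in [0, 1])."""
    start = time.perf_counter()
    predictions = [classify(x) for x in samples]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # A prediction counts as an error when it differs from the true label.
    errors = sum(p != y for p, y in zip(predictions, labels))
    return elapsed_ms / len(samples), errors / len(samples)


def toy_classifier(x):
    """Hypothetical stand-in: flags a flow when its first feature exceeds 5."""
    return "abnormal" if x[0] > 5 else "normal"


# Made-up samples; the last one is deliberately misclassified by the toy rule.
samples = [(1.0,), (7.0,), (3.0,), (9.0,), (6.0,)]
labels = ["normal", "abnormal", "normal", "abnormal", "normal"]
mean_ms, error_rate = measure(toy_classifier, samples, labels)
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.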
These results show that the LDA-PCA algorithm proposed in this study performs well. Many new feature processing algorithms continue to be developed in the field of anomaly data detection. Among them, the Late Potentiogram (LPM) algorithm, the pseudo 3D encoder, the 3D Multi-level Memory mechanism Convolutional Neural Network (3D MM-CNN) and the Variational Abnormal Behavior Detection (VABD) algorithm have been widely used in anomaly recognition and have achieved excellent results. Therefore, this study used LPM, 3D MM-CNN and VABD as comparison algorithms for performance analysis.
Accuracy of feature classification under different algorithms.
Feature classification and recognition performance analysis based on LDA-PCA-SVM classifier
The proposed LDA-PCA-SVM classifier is intended for the classification and recognition of different types of features. At present, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and the K-Nearest Neighbor (KNN) algorithm are also commonly used in feature classification and recognition. These three algorithms have considerable classification performance, so they are used as comparison algorithms to analyze the performance of the LDA-PCA-SVM classifier.
Table of characteristics and types of network abnormal traffic data.
Figure 10 shows the recognition rate of multiple types of abnormal data under different algorithms. The data types identified in Figure 10(a)–(d) are Probe, U2R, R2L and DoS, respectively. For all four data types, the LDA-PCA-SVM algorithm has the highest recognition rate, followed by the KNN and RNN algorithms, with the CNN algorithm last. The highest recognition rates of the LDA-PCA-SVM algorithm for Probe, U2R, R2L and DoS are 95.2%, 98.6%, 92.1% and 97.4%, respectively. Because the LDA-PCA-SVM algorithm processes the data in depth, it reduces the dimensionality of the features and the operational errors while maintaining data diversity, which ultimately yields better recognition rates. In summary, the LDA-PCA-SVM algorithm has good feature classification and recognition ability.
Recognition rate of multiple types of abnormal data under different algorithms. (a) Recognition rate of Probe type features using different algorithms. (b) Recognition rate of U2R type features using different algorithms. (c) Recognition rate of R2L type features using different algorithms. (d) Recognition rate of DoS type features using different algorithms.
Figure 11 shows the recognition accuracy of multiple types of abnormal data under different algorithms. The data types of Figure 11(a)–(d) are Probe, U2R, R2L and DoS, respectively. For all four data types, the LDA-PCA-SVM algorithm has the highest recognition accuracy. For Probe, U2R, R2L and DoS, its highest accuracy reaches 96.7%, 98.1%, 99.1% and 96.8%, respectively. The LDA-PCA-SVM algorithm achieves high accuracy because the LDA-PCA algorithm separates features more accurately when extracting them from the data, and the SVM classifier proposed in this study optimizes the linear classification equation to obtain the best classification distance. Therefore, the proposed LDA-PCA-SVM has high recognition accuracy.
Accuracy of identifying multiple types of abnormal data under different algorithms.
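Per-class figures like those in Figures 10 and 11 can be computed from true and predicted labels. The study does not define its metrics, so the sketch below assumes that "recognition rate" means per-class recall and that "accuracy" means one-vs-rest accuracy for each class; the function name and the labels are made up for illustration and are not the study's data.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: (recall, one_vs_rest_accuracy)} from two label lists."""
    n = len(y_true)
    pairs = Counter(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    metrics = {}
    for c in classes:
        tp = pairs[(c, c)]
        fn = sum(v for (t, p), v in pairs.items() if t == c and p != c)
        fp = sum(v for (t, p), v in pairs.items() if t != c and p == c)
        tn = n - tp - fn - fp
        # Recall: share of true samples of class c that were recognized.
        recall = tp / (tp + fn) if tp + fn else 0.0
        # One-vs-rest accuracy: share of all samples handled correctly for c.
        accuracy = (tp + tn) / n
        metrics[c] = (recall, accuracy)
    return metrics


# Made-up labels for illustration only.
y_true = ["Probe", "Probe", "DoS", "DoS", "U2R", "R2L", "DoS", "Probe"]
y_pred = ["Probe", "DoS", "DoS", "DoS", "U2R", "Probe", "DoS", "Probe"]
metrics = per_class_metrics(y_true, y_pred)
```

Reporting both values per class matters for unbalanced data: a rare class such as R2L in this toy example can have high one-vs-rest accuracy while its recall is zero.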
Conclusion
Abnormal network traffic is a common network security hazard that has attracted wide attention, and identifying it has become a research hotspot. To classify and identify abnormal network traffic effectively, this study uses the LDA method to separate the characteristics of network traffic. By introducing inter-class and intra-class dispersion, the best feature mapping direction is selected to complete effective feature separation. To improve classification and recognition, the PCA method is used to normalize the features. Finally, a relaxation factor is introduced to optimize the SVM, enhance the classification effect and complete the identification of abnormal traffic data. The separation achieved by the LDA-PCA method for abnormal features increases as the feature dimension increases. When the feature dimension is 7, the maximum separation degree of the algorithm reaches 95%, with a good feature mapping effect. The highest recognition rates of the improved LDA-PCA-SVM classifier for Probe, U2R, R2L and DoS were 95.2%, 98.6%, 92.1% and 97.4%, respectively, and the corresponding accuracy rates reached 96.7%, 98.1%, 99.1% and 96.8%, indicating that the classifier performs well. In summary, the proposed LDA-PCA-SVM has considerable development prospects in large-scale abnormal network traffic recognition. However, because the recognition process requires running three algorithms (LDA, PCA and SVM), its computational efficiency is at a disadvantage. Future work can optimize and improve its operational efficiency.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Liaoning Provincial Department of Education’s Basic Research Project for Universities [Grant No. JYTMS20230711] and Liaoning Province Science and Technology Plan Joint Program (Fund) Project [Grant No. 2023JH2/101700009].
