Abstract
To address the excessive dependence on manual work, low detection accuracy and poor real-time performance of current probe flow anomaly detection in power system network security, a detection method that combines the information entropy of probe flow with random forest classification is proposed. Firstly, the network probe stream data are captured and aggregated in real time to extract network stream metadata. Secondly, feature selection is carried out on the network metadata by calculating the Pearson correlation coefficient and the maximum mutual information coefficient. Finally, information entropy and the random forest algorithm are combined to detect anomalies in probe traffic based on the selected key feature groups, and the probe traffic is accurately classified through multiple rounds of incremental learning. The results show that the proposed method can quickly locate anomalies in probe traffic and analyze the abnormal points, which greatly reduces the workload of the application platform for power system security monitoring while achieving high detection accuracy. It effectively improves the reliability and early warning ability of power system network security.
Introduction
With the rapid development of information technology, the amount of information across industries has shown an explosive growth trend, and China’s electric power industry has also entered the era of big data. New technologies such as big data, artificial intelligence, the Internet of Things, and cloud computing have been widely used in the power industry [1]. However, the operation of a power information system relies on a fast-developing, open network platform. In operation, it constantly faces security issues such as network attacks, hardware and software failures, and human sabotage. Because the power information system network is open, network attackers can exploit network vulnerabilities to destroy or steal useful information through improper means [2].
In order to control the security risks faced by the power network in time, a large number of security monitoring and defense equipment have been deployed by the power enterprises. These devices mainly include: firewalls, forward and backward isolation devices, servers, switches, vertical encryption devices, anti-virus systems, intrusion monitoring systems and network security situation assessment systems [3].
Given the high-speed links and large data volumes of power networks, real-time anomaly detection should minimize the processing of and access to data. At present, most network traffic detection algorithms are flow-based. For example, in [4], an improved local anomaly factor detection method, KTLAD, is proposed; it is based on density detection, calculates the degree of separation between each flow packet and nearby flow packets, and improves the timeliness of flow anomaly detection. In [5–9], machine learning methods are used for anomaly detection and feature extraction. Yu et al. [6] first proposed a network flow prediction method based on MLP-NN, whose main contribution is to improve on the accuracy of traditional AR methods. Shi et al. [9] proposed an intelligent network flow identification method based on the GHSOM neural network algorithm.
In addition, many detection tools and methods, such as NetFlow, TCPDump, Openflow, Wireshark and IDSs, are used to detect and collect network traffic. DDoS, worms, C & S and other attacks are detected and analyzed with these tools. However, these flow technologies and methods rely on packet sampling or filtering algorithms, or depend excessively on expert knowledge bases. This may limit the detection range of network attacks and impair both the correct extraction of original IP network characteristics and the accurate detection of abnormal network attacks.
To overcome the insufficient information and low detection accuracy of network traffic analysis that relies on a single traffic tool or intrusion detection tool, power enterprises deploy flow probes to collect traffic in the network. Following the proximity principle, provincial and local units mirror the original IP flow of network equipment links to the flow probes, and the flow probes then send the parsed flow metadata over the network to the monitoring platform for analysis and processing.
At present, power enterprises face the following problems when they use flow probes to detect abnormal network traffic: (1) When traffic probes are deployed, multiple ports of routers (or switches) must be mirrored. Mismatched or ineffective configurations may occur, which lead to incomplete traffic mirroring and affect threat analysis. (2) During the operation of the monitoring platform, the probes deployed throughout the network are distributed across various units. Configuration changes, mirror line anomalies and optical module anomalies in any unit may lead to incomplete flow mirroring; in extreme situations, network security policy changes or network channel anomalies may prevent a probe from sending any data at all. (3) There are so many probes in the whole network that routine manual inspection cannot ensure the integrity of the traffic mirror configuration across the network. (4) If the probe traffic contains a large amount of abnormal attack information flow, the security administrator cannot judge and deal with it directly by hand.
In the above serious situation, the network anomaly information flow detection based on probe traffic urgently needs an automatic detection model, which can regularly detect the integrity of each probe traffic mirror, to ensure that the system can normally analyze threats and security alarms.
This paper aims to propose a new generation method, which uses the random forest learning method to realize the autonomous classification of power network probe flow. Through the entropy calculation and correlation analysis of network characteristics, the fast anomaly detection of power network probe large-scale flow is realized, and the problems of low detection rate and poor generalization ability of the model are solved.
The innovations of this paper are as follows: (1) This paper proposes an effective solution based on network flow clustering self-learning and multi-angle feature information fusion and classification. The method addresses the problem that the traffic mirror of current power network probes is prone to being missing or incomplete, which can only be discovered manually and with delay. (2) To solve the problem of unified collection of distributed traffic probe data, this paper establishes a unified metadata collection standard for multi-source heterogeneous data. (3) The proposed method has been tested relatively comprehensively on simulated probe streams. It detects different anomalies of the probe stream in time and, compared with other methods, achieves high detection accuracy and stable performance.
The rest of the paper is structured as follows. In Section 1, we describe the current research situation. In Section 2, we analyze the characteristics of network flow and propose a feature aggregation method. In Section 3, we use random forest to detect the network flow anomaly. In Section 4, the advantages of the proposed method are verified by experiments. Finally, the paper summarizes the work of the paper and the direction of further efforts in the future.
Related works
At present, the data carried by power system security detection mainly depends on the network traffic detected by various security detection devices (such as network traffic probe, intrusion detection system, etc.). In order to ensure the reliable operation of power system services, build a credible network environment, and reduce the harm of various abnormal events to power communication networks and their services, the detection of network traffic anomalies becomes more important.
Network security data acquisition system usually cannot guarantee stable and reliable operation for a long time. Occasionally, incomplete data, loss and malicious content will appear in the collected records. This may be due to various reasons, such as instrument failure, power failure, communication line failure or attack [10].
With the power system entering the network information age, power companies face a huge volume of dynamic, real-time data every day. It is impossible to judge the integrity and reliability of the collected network data by relying only on human resources and experience. Network data flow is a typical time series: it is dynamic, high-dimensional and unbounded, and it changes all the time, so traditional processing methods have many disadvantages. Therefore, anomaly detection technology for network data flow should take the characteristics of the data flow into account in order to achieve better anomaly detection on streaming data [11].
There are two main methods for anomaly detection of network traffic: active detection and passive detection. Active detection refers to sending data packets to the network actively and obtaining network traffic information through the feedback of data packets. Passive detection refers to the use of detection tools to collect traffic information in a fixed location [12].
The probe mirror flow with IPFIX used in power systems is one of the effective methods of passive detection. It mainly collects important network metadata, which describes information such as network communication protocols, bandwidth utilization, and node failure conditions.
In [13], Li et al. pointed out: “In order to support IPv6 traffic monitoring in a large-scale network, flow-level traffic measurement methods with IPFIX will be more useful; In IPFIX, the flexible template structure that can carry flow information elements supports various traffic monitoring applications”. In order to overcome the problems of large amount of data transmission in the network of IoT nodes and the inability of real-time computing, Petr Matoušek et al. propose a new IoT monitoring model based on extended IPFIX records. The model employs a passive monitoring probe that observes IoT traffic and collects metadata from IoT protocols [14].
The main technical methods for passive detection include threshold detection method [15, 19], statistical detection method [16, 23], data mining method [17, 25] and feature extraction method [21, 24]. In [15], aiming at the problem that traditional detection methods based on information entropy mostly adopt fixed thresholds and cannot adapt to the changing network environment dynamically, an improved burst traffic detection method based on information entropy is proposed. This method can dynamically adjust the threshold value according to the entropy value of normal historical traffic, and detect the burst traffic caused by DDoS and Flash Crowd attacks through experiments. A traffic anomaly detection model based on statistics is presented in [16]. The model uses variance model for statistics, sets the confidence interval of the relevant parameters in the measurable set, and then detects the network traffic in real-time. The value falling into the confidence interval is normal traffic, and the value outside the confidence interval is abnormal traffic.
In [20], to address the problems of clustering-based network anomaly detection methods, such as inaccurate selection of initial clustering centers and ineffective identification of aspherical clusters, an improved K-means clustering algorithm is proposed to distinguish abnormal from normal network traffic behaviors. The authors use the distance ratio to judge the clustering effect and improve the accuracy of the clustering results. In [21], to address the poor accuracy, unstable power supply and frequent traffic congestion in current power information systems, a network data flow monitoring method combining non-linear feature decomposition and adaptive spectrum detection is proposed. This method reconstructs the network traffic data, performs spectral analysis on them, and extracts traffic features; the spectrum features are then blindly separated by a matched filter detector, so that the traffic is monitored effectively.
At present, many researchers analyze network attack anomalies based on processing the data streams generated by synchronous phasor measurement units (PMUs) [26–28]. In [26], a classification method is proposed that uses a CNN-based data filter to extract event features from the PMU packet data stream accumulated in the PMU buffer. The authors of [28] propose a distributed PDMA detection framework for PMU data based on emerging deep learning technologies, and point out that more and more researchers use deep learning methods to detect PDMAs. However, this work requires all measurement time series in the power system, and its performance may deteriorate as the observed power system expands.
As the preceding review shows, although many researchers at home and abroad have studied network flow anomaly detection, the current methods cannot be simply extended to complex power information systems. This is because power network data flows are characterized by large data volumes, continuity in time, unknown faults and a high likelihood of network attack.
In order to further improve the performance of the entropy-based random forest method in anomaly detection and make it suitable for real-time detection of network flow anomalies in power information systems, this paper firstly proposes a feature selection method based on the Pearson correlation coefficient and the maximum mutual information coefficient to improve the accuracy of feature selection. Secondly, the SMOTE algorithm is used to improve the sample balance. Finally, the random forest method is used to detect and classify network flow anomalies in the power information system.
Network flow aggregation method
Network flow definition
The network flow data comes from the original data collected by the flow probe deployed in the system. The flow probe data will be aggregated into the platform of the power system security monitoring and management center, and the metadata information of the data flow can be obtained by filtering, analyzing and mining. These metadata are in a custom standard format and are mainly used for communication between flow probes and power system security monitoring platforms.
An IP network flow is described by its five-tuple:

$$F(IP) = \{SIP,\ Sport,\ DIP,\ Dport,\ Protocol\}$$

where F (IP) denotes the IP network flow, SIP denotes the source IP address, Sport denotes the source port, DIP denotes the destination IP address, Dport denotes the destination port, and Protocol denotes the protocol field. Because network data flow is generally bidirectional, the source and destination fields can be interchanged.
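Because the source and destination fields of a bidirectional flow can be interchanged, both directions of a connection should aggregate into the same flow. The following is a minimal Python sketch of one way to do this; the names `FlowKey` and `canonical_key` and the ordering rule (the smaller address/port pair is placed first) are assumptions made for illustration and are not taken from the paper.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Five-tuple identifying a flow (field names are illustrative)."""
    sip: str
    sport: int
    dip: str
    dport: int
    protocol: str


def canonical_key(sip, sport, dip, dport, protocol):
    """Return the same key for both directions of one connection.

    Assumption: the endpoint that compares smaller is stored first, so the
    forward and backward packets of a connection share one key.
    """
    a, b = (sip, sport), (dip, dport)
    first, second = (a, b) if a <= b else (b, a)
    return FlowKey(first[0], first[1], second[0], second[1], protocol)


forward = canonical_key("10.0.0.5", 51234, "10.0.0.9", 53, "UDP")
backward = canonical_key("10.0.0.9", 53, "10.0.0.5", 51234, "UDP")
print(forward == backward)
```

A key of this kind can serve as the dictionary key under which packets or metadata records are grouped before statistical features are computed.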
In the practical engineering practice of this paper, 112 information elements are selected, and the specific format is shown in Table 1.
Coding format of information element
Metadata is formatted as a string, as described in the following table. Each information element occupies a fixed position in the string. Elements are separated by ∧, and the last element is also followed by ∧. If an information element does not exist in a metadata record, its position is left empty, that is, two ∧ separators are adjacent. If an extracted information element itself contains a ∧ sign, the sign must be escaped with the escape string % % %.
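The encoding rule can be illustrated with a short Python sketch. It rests on two assumptions that the text does not settle: the escape string is taken to be the three characters `%%%` without spaces, and escaping is taken to mean that a literal ∧ inside a value is replaced by the escape string. The function names are hypothetical.

```python
SEP = "∧"      # field separator described in the text
ESC = "%%%"    # assumed spelling of the escape string


def encode_metadata(elements):
    """Join information elements; None marks an element that does not exist."""
    parts = ["" if e is None else str(e).replace(SEP, ESC) for e in elements]
    return SEP.join(parts) + SEP   # the last element is also followed by SEP


def decode_metadata(record):
    """Split a metadata string back into its fixed-position elements."""
    if not record.endswith(SEP):
        raise ValueError("metadata record must end with the separator")
    fields = record[:-len(SEP)].split(SEP)
    return [f.replace(ESC, SEP) if f else None for f in fields]


record = encode_metadata(["dns", None, "a∧b", "53"])
print(record)
print(decode_metadata(record))
```

Under these assumptions a value that itself contains `%%%` cannot be told apart from an escaped separator, so a production format would need an additional rule for that case.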
For example, DNS metadata string data is shown in Fig. 1. The metadata is explained in detail as shown in Fig. 2.

DNS metadata string data.

Detailed description of DNS sample metadata field.
To realize anomaly detection and classification based on network probe data flow, a novel feature extraction algorithm for aggregating power system probe traffic data is proposed, as shown in Algorithm 1. It follows the packet aggregation feature extraction algorithm in [29] and extracts feature values that are important for anomalous traffic detection from the original probe metadata collected.
Flow eigenvector analysis
After extracting the five-tuple of network data package to get the aggregated stream, it is easy to get the basic attributes of a stream: source and destination IP address, port number and the included network protocol. These features can only roughly locate a network connection behavior, and cannot directly reflect whether the network behavior is abnormal. Therefore, further fine-grained feature extraction of network flow is needed.
According to the time correlation, data statistical correlation and protocol contained in network flow, 23 statistical eigenvectors are extracted to analyze the anomaly of probe flow. The specific contents are shown in Table 2.
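To make the idea of statistical flow features concrete, the sketch below computes a handful of them (duration, forward inter-arrival statistics, average packet sizes, forward/backward ratio) for one aggregated flow. The packet representation `(timestamp_ms, size_bytes, is_forward)` and the dictionary keys are assumptions for illustration; they are not the 23 features of Table 2.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute a few per-flow statistics.

    packets: list of (timestamp_ms, size_bytes, is_forward), sorted by time.
    """
    fwd = [(t, s) for t, s, is_fwd in packets if is_fwd]
    bwd = [(t, s) for t, s, is_fwd in packets if not is_fwd]
    # Inter-arrival times between consecutive forward packets.
    gaps = [t2 - t1 for (t1, _), (t2, _) in zip(fwd, fwd[1:])]
    return {
        "duration_ms": packets[-1][0] - packets[0][0],
        "fwd_packets": len(fwd),
        "bwd_packets": len(bwd),
        "fwd_iat_mean_ms": mean(gaps) if gaps else 0.0,
        "fwd_iat_std_ms": pstdev(gaps) if gaps else 0.0,
        "fwd_bytes_mean": mean(s for _, s in fwd) if fwd else 0.0,
        "bwd_bytes_mean": mean(s for _, s in bwd) if bwd else 0.0,
        "fwd_bwd_packet_ratio": len(fwd) / len(bwd) if bwd else float("inf"),
    }


packets = [(0, 60, True), (10, 120, False), (30, 60, True),
           (70, 200, False), (90, 60, True)]
features = flow_features(packets)
print(features)
```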
Statistical characteristic set of network flow
The quality of feature selection directly affects the result of network traffic anomaly determination. Although the traffic feature set in Table 2 can be obtained from traffic statistics, not all network flows yield every feature in Table 2. Nor is it appropriate to use the full feature set in Table 2 for every anomaly classification, because different anomalies often present different feature changes.
In order to better distinguish the correlation between different features, Pearson correlation coefficient [30] is used to calculate the correlation between features.
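The Pearson correlation coefficient between two variables X and Y is defined as the quotient of their covariance and the product of their standard deviations:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

Because $\mu_X = E(X)$ and $\sigma_X^2 = E(X^2) - E^2(X)$, and likewise for Y, this can be written as

$$\rho_{X,Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}}$$

The value of ρX,Y lies in the interval [–1, 1], where –1 means complete negative correlation and 1 means complete positive correlation.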
Assuming three features A, B, and C, the Pearson correlation coefficient matrix T between features can be calculated according to formula (5):

$$T = \begin{pmatrix} \rho_{A,A} & \rho_{A,B} & \rho_{A,C} \\ \rho_{B,A} & \rho_{B,B} & \rho_{B,C} \\ \rho_{C,A} & \rho_{C,B} & \rho_{C,C} \end{pmatrix} \tag{5}$$

The matrix T is symmetric, so the correlation values in its upper and lower triangles coincide, and the diagonal autocorrelation values carry no information. Therefore, the matrix T is further simplified to its lower triangle:

$$T = \begin{pmatrix} \rho_{B,A} & \\ \rho_{C,A} & \rho_{C,B} \end{pmatrix} \tag{6}$$

In formula (6), if the value of ρB,A lies strictly between –1 and 1, both features are selected; otherwise, features A and B are completely positively or negatively correlated, and only one of them is selected.
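The pairwise filtering step can be sketched in Python with NumPy. The text only states that completely correlated features are reduced to one; the cut-off `threshold=0.95` used here is an assumption that relaxes this rule for noisy data, and the function name is hypothetical.

```python
import numpy as np


def drop_redundant_features(X, names, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with a kept one.

    threshold is an assumed cut-off on |rho|; the text itself only names
    the limiting case |rho| = 1.
    """
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    keep = []
    for j in range(len(names)):
        if all(abs(corr[j, i]) < threshold for i in keep):
            keep.append(j)
    return [names[j] for j in keep], corr


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2.0 * a + 1.0            # completely correlated with a
c = rng.normal(size=200)     # independent of a
kept, corr = drop_redundant_features(np.column_stack([a, b, c]), ["A", "B", "C"])
print(kept)
```

Only the lower triangle of `corr` is consulted, which mirrors the simplification of the matrix T. A constant feature yields NaN correlations and is dropped by this sketch, which may or may not be the desired behavior.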
For the features selected by the Pearson correlation coefficient method, the maximum mutual information coefficient (MIC) is used to measure the strength of association between each feature and the classification label, and further feature selection is carried out:

$$mic(X,Y) = \max_{a \times b < B} \frac{I(X,Y)}{\log_2 \min(a,b)} \tag{7}$$

In formula (7), a and b are the numbers of grid cells into which the X and Y directions are partitioned, and B is a user-defined bound, generally set to the 0.6 power of the number of data points. I (X, Y) denotes the mutual information between X and Y, calculated as follows:

$$I(X,Y) = \iint P(X,Y)\,\log_2 \frac{P(X,Y)}{P(X)\,P(Y)}\; dX\, dY \tag{8}$$

In the formula, P (X, Y) denotes the joint probability density function of X and Y, and P (X) and P (Y) denote the marginal probability density functions of X and Y, respectively. The normalized value in formula (7) lies in [0, 1]; the greater the value, the stronger the association between the feature and the label, that is, the greater the gain the feature brings to classification. If the selection threshold is set to ɛ, a feature is selected when its value exceeds ɛ.
The mic (X, Y) between each feature X and the type Y is calculated by formula (7) or (8), and the feature with mic greater than the threshold is selected. Table 3 shows the calculated results.
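A simplified approximation of this score between one numeric feature and the class label can be sketched as follows. It is not the full MIC algorithm: the feature axis is cut into equal-frequency bins instead of searching for the optimal grid, and the label axis uses the classes themselves as its partition. The function names are hypothetical, and the grid budget of n to the power 0.6 follows the value given in the text.

```python
import numpy as np


def _equal_freq_bins(values, k):
    """Assign each value to one of k bins of roughly equal size (by rank)."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * k // len(values)).astype(int)


def _mutual_info_bits(bx, by, a, b):
    """Mutual information (in bits) of two discrete codes on an a-by-b grid."""
    joint = np.zeros((a, b))
    np.add.at(joint, (bx, by), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())


def approx_mic(feature, labels, exponent=0.6):
    """Approximate MIC between a numeric feature and class labels."""
    x = np.asarray(feature, dtype=float).ravel()
    classes, by = np.unique(np.asarray(labels).ravel(), return_inverse=True)
    b = len(classes)
    if b < 2:
        return 0.0
    budget = len(x) ** exponent      # grid sizes must satisfy a * b < budget
    best, a = 0.0, 2
    while a * b < budget:
        mi = _mutual_info_bits(_equal_freq_bins(x, a), by, a, b)
        best = max(best, mi / np.log2(min(a, b)))
        a += 1
    return best


rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
informative = labels + rng.normal(scale=0.05, size=1000)
noise = rng.normal(size=1000)
print(approx_mic(informative, labels), approx_mic(noise, labels))
```

Features whose score exceeds the chosen threshold ɛ would be kept; the informative feature in this synthetic example scores far higher than the pure-noise feature.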
Correlation strength between features and classification categories
After extracting and analyzing the behavior characteristics of different anomalous network flows, it is necessary to use powerful machine learning algorithm to train and learn the training sets of known categories after feature extraction and labeling. In this paper, random forest algorithm is selected as the classification learning algorithm of abnormal flow in power probe network. Random forest refers to a classifier that uses multiple trees to train and predict samples. The classifier was first proposed by Leo Breiman and Adele Cutler [31].
This algorithm is accurate, robust and easy to use. Especially on data with high feature dimension, it offers high recognition accuracy and strong generalization ability, and because its trees are trained independently, training is fast. Table 2 lists more than 20 network features, which form high-dimensional feature vectors. The random forest algorithm adapts well to the high-dimensional feature vectors of network flows and achieves accurate and efficient classification of probe flows.
Algorithmic flow
In order to accurately identify anomalies of probe network flow in the power information system, a process for identifying probe network flow anomalies is designed on the basis of the network flow data aggregation method proposed in the previous section, as shown in Fig. 3.

Basic flow of anomaly flow identification.
According to Fig. 3, the basic flow framework for anomaly flow identification is divided into two parts:
(1) Construction of classification model
Firstly, an experimental environment is built in which different abnormal network flows are run in a simulated or real isolated environment. The abnormal network flows are captured by network capture tools such as Wireshark or Tcpdump, or by real power system network probes. After processing, abnormal network traffic data with identified categories are obtained as the training data set. The training data are aggregated into network flows, and statistical features are then extracted according to the feature extraction method. After feature extraction, a corresponding feature vector is obtained for each detected aggregate flow.
(2) Machine learning classification
In order to classify feature vectors accurately and quickly, the feature vectors need a series of preprocessing steps: one-hot encoding of some features, feature selection that combines the Pearson correlation coefficient and the maximum mutual information coefficient, the SMOTE algorithm to address the class imbalance of probe traffic data, and Z-Score standardization of the features.
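Two of these preprocessing steps, Z-Score standardization and SMOTE-style oversampling, can be sketched with NumPy alone. The oversampler below is a simplified illustration of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) and not a reference implementation; the function names and the default `k=5` are assumptions.

```python
import numpy as np


def zscore(X):
    """Standardize each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd[sd == 0] = 1.0                      # leave constant columns unchanged
    return (X - mu) / sd


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)                      # needs at least two minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)         # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))           # position on the connecting segment
    return X_min[base] + gap * (X_min[partner] - X_min[base])


rng = np.random.default_rng(1)
minority = rng.normal(size=(10, 3))
synthetic = smote_sketch(minority, n_new=25)
standardized = zscore(np.vstack([minority, synthetic]))
print(synthetic.shape, standardized.mean(axis=0).round(6))
```

In practice the oversampling should be applied to the training portion only, so that synthetic samples never leak into the data used for evaluation.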
The random forest algorithm is used to train the classification model. Through multiple iterations of training, the training parameters are tuned to achieve good results. The model is then tested with ten-fold cross-validation, which is commonly used in machine learning, to determine whether it achieves the desired effect. The input training data set is divided into ten folds; in each round, nine folds (90% of the data) are used as training data to build the model and the remaining fold is used as test data to verify it, and the accuracy, recall rate and false positive rate of the model are measured.
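The ten-fold procedure can be sketched with scikit-learn as follows. The data here are synthetic (generated with `make_classification`) and stand in for the labelled probe flow features, which are not available; the sample and feature counts are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the labelled feature vectors (assumption).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
# Ten folds: each round trains on nine folds and tests on the remaining one.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=folds,
                        scoring=("accuracy", "recall", "precision"))

mean_accuracy = scores["test_accuracy"].mean()
print(round(mean_accuracy, 3), round(scores["test_recall"].mean(), 3))
```

Stratified folds are used in this sketch so that each fold keeps roughly the same class proportions, which matters when abnormal flows are rare.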
The trained classification model can then be verified in the real network environment of the power system. Real-time network traffic data are captured by the traffic probe, and the feature vectors obtained through data stream aggregation and feature extraction are used as the input of the classification model. The model outputs the class of each probe network stream, which determines whether there is an abnormality.
Experimental environment
In order to verify the power system probe network flow anomaly detection method proposed in this paper, appropriate network flow data must be selected. Although some public network security test data sets exist, such as DARPA1998, NSL-KDD, UNSW_NB15, and ISCX-Slow-2016, these benchmark data mainly contain network attack anomalies, which differ to some extent from the probe network flow anomalies of power systems. Therefore, this article does not directly use these internationally renowned data sets to evaluate the performance of the proposed algorithm.
This paper sufficiently investigates various situations of abnormal network flow of probes in power system, and finds that there are several main reasons for abnormal network flow of probes: (1) serious loss of mirror data of network flow; (2) abnormal data flow of external network flow; (3) abrupt large-scale malicious attack data flow in network.
Based on anomaly detection for these problems, this paper designs a network simulation test data stream, which first collects the normal data stream of power system, then randomly injects large-scale attack data stream into the normal data stream or creates anomaly data stream in the normal data stream (For example, delete some normal data streams, change the flag bit of network connection, increase UDP timeout invalid messages, etc.).
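The construction of such a test stream can be illustrated with a small Python sketch that drops a fraction of normal flows (emulating loss of mirrored data) and inserts attack flows at random positions. The function name, the `loss_rate` parameter and the labels are hypothetical and only illustrate the idea; they do not reproduce the data generation actually used in the experiments.

```python
import random


def simulate_probe_stream(normal_flows, attack_flows, loss_rate=0.2, seed=0):
    """Return a labelled test stream and the number of dropped normal flows."""
    rng = random.Random(seed)
    # Emulate mirror data loss: each normal flow survives with prob. 1 - loss_rate.
    stream = [(flow, "normal") for flow in normal_flows
              if rng.random() >= loss_rate]
    n_lost = len(normal_flows) - len(stream)
    # Inject attack flows at random positions, keeping normal flows in order.
    for flow in attack_flows:
        stream.insert(rng.randint(0, len(stream)), (flow, "attack"))
    return stream, n_lost


normal = [f"normal-{i}" for i in range(100)]
attack = ["attack-a", "attack-b", "attack-c"]
stream, n_lost = simulate_probe_stream(normal, attack, loss_rate=0.2, seed=1)
print(len(stream), n_lost)
```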
The experimental environment is tested in Windows 10 home version 64-bit operating system, Anaconda3 (64-bit), Python 3.7, 8.0GB of memory, Intel (R) Core i3-4005U CPU @ 1.7 GHz. Two kinds of malicious attack network flow and two kinds of anomalies of increasing and losing network are tested in the experiment. The experimental data sets are saved in CSV file format. Details are shown in Table 4.
Training data set
For the detection classification model, different evaluation indices are needed to evaluate its performance. In this paper, the binary confusion matrix of the classifier is used to evaluate the detection performance of the proposed model on power system probe flow anomalies. The matrix contains the following counts. True Positive (TP): the number of abnormal network flows classified as abnormal by the model. False Positive (FP): the number of non-abnormal network flows classified as abnormal by the model. True Negative (TN): the number of non-abnormal network flows classified as non-abnormal by the model. False Negative (FN): the number of abnormal network flows classified as non-abnormal by the model.
The index values of Accuracy, Precision, Recall, F_Score, TPR and FPR are calculated respectively. The calculation formula is shown in formula (9)–(14).
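For reference, the standard definitions of these indices in terms of TP, FP, TN and FN can be written as a short Python function. It is assumed here that F_Score denotes the usual F1 score (the harmonic mean of Precision and Recall); the counts in the example are invented for illustration and are not experimental results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary classifier."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        # Assumed to be the F1 score: harmonic mean of precision and recall.
        "f_score": 2 * precision * recall / (precision + recall),
        "tpr": recall,                 # true positive rate equals recall
        "fpr": fp / (fp + tn),         # false positive rate
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fp=10, tn=880, fn=20)
print({name: round(value, 4) for name, value in metrics.items()})
```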
Through feature extraction, encoding and machine learning classification detection of test data sets, good results are obtained.
First, the experiment obtained information such as the duration of the network flow, the distribution of IP addresses, the average arrival time interval of forward packets, the standard deviation of the arrival time interval of forward packets, the average number of bytes of forward packets, the average number of bytes of backward packets, the ratio of the numbers of forward and backward packets, and the ratio of the numbers of forward and backward packet bytes.
According to the features selected in Table 2, Fig. 4 lists all the statistical eigenvalue information contained in the attack synthesis flow.

All statistical eigenvalues contained in the attack stream.
The statistical distribution of this information is shown in Figs. 5–9.

Network Flow Duration.

Main IP address map.

Average time interval and standard deviation of time interval of forward packet.

Average number of bytes of forward and backward data packets.

Ratio of forward and backward packet numbers and bytes.
From Fig. 5, it can be seen that the duration of the existing abnormal flow is generally long, while the duration of the normal business flow is relatively stable. Most of the duration of the abnormal network flow exceeds 2000 milliseconds, and the duration of the normal business flow is concentrated below 200 milliseconds.
In Fig. 6, it can be clearly seen that there are two abnormal network IP addresses in the detected network, which appear 3216 and 2790 times, respectively, within a short time.
The statistics of the average time interval and the standard deviation of the time interval of forward packets in Fig. 7 show that the anomalous network flows stand out clearly: both their average time interval and its standard deviation are very large, while both values are very small for normal traffic flows.
From the results of Fig. 8, it can be seen that the average bytes of forward and backward packets in anomalous network flows are relatively large, generally reaching more than 200 bytes, some even exceeding 1000 bytes, while the number of bytes of normal traffic network flows is basically less than 200 bytes.
As can be seen from Fig. 9, because the experiment inserts two abnormal botnet flows into the normal traffic flow, and current bot malware mainly requests commands in pull mode, the abnormal botnet traffic inevitably differs from normal traffic in the ratio of upstream to downstream data volume and follows a certain pattern. In Fig. 8, it is found that most anomalous network flows are in the range of 0-4000 bytes, and relatively concentrated in the range of 3000-4000.
Secondly, in order to further verify the correctness of the model algorithm, this paper extracts some training data to verify the model. The detection results of the proposed information entropy and random forest method are shown in Fig. 10. It can be seen that the classification accuracy for power probe network traffic in the prediction results basically reaches 97%. The confusion matrix shows 6373 normal background traffic flows, with 67 erroneously identified as Neris anomalous network traffic and 28 as rbot anomalous network traffic; for Neris, it shows 3399 anomalous network flows, with only 165 misidentified as normal background traffic. This demonstrates that, after model training, the network flow features extracted in this paper can identify and predict the abnormal network flow of the power system well.

Model training results of some test sets.
Thirdly, in the same case, for the same data set of power system probe anomaly network flow, other classification models in scikit-learn machine learning library are used to train and predict. It is found that other algorithms have low classification accuracy, high false alarm rate and error rate, which are not as good as the methods proposed in this paper. This experiment compares the SVM, Naive Bayes and Random Trees algorithms, and the indicators are shown in Figs. 11 and 12.

Comparison of Classification Algorithms under the Same Data Set.

Classification values of different models for Neris and rbot abnormal network flows.
As can be seen from Fig. 11, the Accuracy value of the proposed method can reach 99.6%, while that of the SVM method is 89.4%, and that of the NaiveBayes method is 49.6%. The comparison of the other parameters (TPR, FPR, F_Score, Precision and Recall) is the same as that of the Accuracy value, and the performance of the other three methods is lower than that of the proposed method.
As can be seen from Fig. 12, the accuracy of the proposed method reaches 99.9% and 99.6% for the classification detection of the two kinds of anomalous network flows, Neris and rbot, respectively. Other methods, such as SVM and NaiveBayes, achieve lower accuracy than the proposed method and clearly cannot be used directly to detect and predict abnormal mirrored network flows in power systems.
Finally, in order to compare our proposed method with other methods (Random Trees [32], SVM [33] and Naive Bayes [34]), we selected the ISOT dataset for test verification. The ISOT dataset [35] is a botnet dataset consisting of malicious DNS traffic and a benign dataset consisting of DNS traffic.
In this experiment, the number of trees (n_estimators) of the random forest algorithm is set to 100, and the remaining parameters use the model default values. The maximum depth (max_depth) of the decision tree algorithm is set to 3, and its minimum number of samples per leaf node (min_samples_leaf) is set to 5. The SVM algorithm uses a linear kernel, and the Naive Bayes algorithm uses the GaussianNB model.
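These settings map directly onto scikit-learn estimators, as the following sketch shows. The hyperparameter values are those stated in the text; the synthetic data set, the 70/30 split and the `random_state` values are assumptions added so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "decision_tree": DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                            random_state=0),
    "svm_linear": SVC(kernel="linear"),
    "naive_bayes": GaussianNB(),
}

# Synthetic stand-in for the 12 selected flow features (assumption).
X, y = make_classification(n_samples=500, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

accuracy = {name: model.fit(X_train, y_train).score(X_test, y_test)
            for name, model in models.items()}
print(accuracy)
```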
The experiment first selected 12 features for model training based on the mutual information detection results. These 12 features are: ‘TN_FP’, ‘TN_BP’, ‘RN_FBP’, ‘B_MIN_FP’, ‘B_AVG_FP’, ‘B_MAX_FP’, ‘B_MIN_BP’, ‘B_AVG_BP’, ‘B_STA_BP’, ‘T_MIN_FP’, ‘T_AVG_FP’, ‘D_NF’. To address the sample imbalance in these features, the SMOTE algorithm is used to expand the sample size. Then, the PCA algorithm is used to reduce the feature dimension of the training and test data, and 5 to 7 important features are retained for model training to shorten the model detection time.
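The core idea of SMOTE, creating synthetic minority samples by interpolating between a sample and one of its nearest minority neighbours, can be sketched in a few lines. The code below is a simplified illustration and not the implementation used in the experiment: the function name `smote_like_oversample`, the neighbour count `k`, and the two-dimensional example points are all hypothetical, and the PCA step is not shown.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_samples = smote_like_oversample(minority, n_new=6)
print(len(new_samples), "synthetic samples created")
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already covered by the minority class.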
The comparison results are shown in Table 5. The test results show that the performance of our proposed method is stable, with several detection indicators exceeding 98%. The detection performance of the other methods is also acceptable, but their four indicators are not well balanced and some values are low. For example, on the ISOT dataset the recall of the Random Trees method is only 85.88%, the SVM method takes the longest time (more than 3 seconds), and the Naive Bayes method has the lowest F-Score.
Table 5. The results of experimental comparison.
In the experiments with the SVM algorithm, the search for the hyperplane that optimally separates the data must continue until the computation converges, and the complexity of the traffic data makes this process time-consuming. In addition, because decision trees and random forests are non-linear supervised classification models, data inseparability is not an issue for them. Random forest improves on the decision tree by using a voting mechanism over multiple decision trees, which mitigates the decision tree's tendency to overfit. The experimental results show that the overall classification performance of the random forest is better than that of the decision tree. However, since random forest needs a set of samples to train each tree, its time consumption is higher than that of decision tree classification.
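The voting mechanism mentioned above can be illustrated with a plain majority vote over the labels predicted by individual trees. This is a conceptual sketch only: the function name `majority_vote` and the example labels are hypothetical, and library implementations may differ (scikit-learn's random forest, for instance, averages the trees' class probability estimates rather than counting hard votes).

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the most trees.

    On a tie, the label that appears first in the input wins.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of five trees for one network flow.
votes = ["abnormal", "normal", "abnormal", "abnormal", "normal"]
decision = majority_vote(votes)
print(decision)
```

A single tree that overfits can be outvoted by the others, which is the intuition behind the reduced overfitting of the ensemble.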
At present, some electric power companies in China that use traffic probes to detect abnormal network traffic encounter problems such as loss of mirrored traffic and anomalies that cannot be detected in time. To address these problems, this paper proposes a novel traffic anomaly classification method based on random forest and information entropy.
The method first analyzes several causes of probe traffic abnormality, and then applies techniques such as dimensionality reduction and normalization to handle the extraction, standardization, and missing-value completion of traffic features in the distributed probe network.
Second, by calculating the information entropy of traffic features over time segments, together with the Pearson correlation coefficient and the maximum mutual information coefficient (MIC), the associations between network traffic features can be found quickly and accurately, which improves the accuracy of model training.
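As a small illustration of two of the quantities named here, the sketch below computes the Shannon entropy of a feature within each time window and the Pearson correlation coefficient between two feature series. It uses the standard textbook definitions and is not the paper's implementation; the window contents (destination ports) and the two feature series are hypothetical, and MIC is not covered.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical destination ports observed in two consecutive time windows.
windows = [
    [80, 80, 80, 80],        # a single port: no uncertainty
    [80, 443, 22, 8080],     # four distinct ports: maximal spread
]
entropies = [shannon_entropy(w) for w in windows]

# Hypothetical per-window packet counts and byte counts of the same flows.
packets = [1, 2, 3, 4]
byte_counts = [2, 4, 6, 8]
r = pearson(packets, byte_counts)
print(entropies, r)
```

A sudden change in the per-window entropy of a feature, or a pair of features with a correlation close to 1 or -1, is the kind of signal that the feature analysis described above looks for.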
Finally, to verify the effectiveness and superiority of the proposed method, a simulated test flow of power network probes is designed, and the proposed method is compared with other methods under different indicators through multiple experiments.
The results show that the detection performance of the proposed method is generally better than that of the other methods. The extracted feature indicators perform well and the correlation analysis is accurate, so the method can basically meet the needs of anomaly detection and prediction for power system probe flows, and it offers a new approach to probe flow anomaly detection in power systems.
In the future, we will further optimize the proposed method, improve the training effect, accuracy and generalization of the model, and extend the work in the following directions: (1) Train and test the method on different network flows to determine its generalization ability. (2) Continue to search for new methods of extracting and calculating correlation indicators of network flow features under different network attack scenarios, in order to further discover abnormal attack behaviors such as multi-step network attacks and hidden, large-scale distributed denial-of-service attacks. (3) Explore a parallel version of the proposed method to achieve network traffic anomaly detection in large-scale, heterogeneous, and cloud environments. (4) Apply the proposed method in engineering practice, perform real-time detection and analysis on massive power network flows, and identify the shortcomings of model training and detection in a complex power network environment, so as to find optimization approaches and further improve the availability and performance of the method.
Acknowledgments
The authors are grateful to the anonymous reviewers for their detailed and accurate comments on the amendments to this paper. This work is supported in part by the Natural Science Foundation of China (Nos. U196620027, 51777015), and also supported by the project of “Practical Innovation and Enhancement of Entrepreneurial Ability (No. SJCX201970)” for Professional Degree Postgraduates of Changsha University of Science & Technology; Open fund project of Hunan Provincial Key Laboratory of Processing of Big Data on Transportation (No. A1605); the key scientific and technological project of “Research and Application of Key Technologies for Network Security Situational Awareness of Electric Power Monitoring System (No. ZDKJXM20170002)” of China Southern Power Grid Corporation.
