Abstract
To improve the accuracy and recall of clustering mining for large-scale network anomaly data and to shorten mining time, this study proposes a clustering mining method based on selective collaborative learning. A machine learning anomaly detection model and a strong classifier for large-scale network data are designed through collaborative training and selective ensemble learning, and correlation variable analysis is used to obtain the dissimilarity measure of the data. The network anomaly data are processed by fuzzy fusion, and the nearest neighbor algorithm is used to cluster them. The method reaches a clustering mining accuracy of 98.16% and a recall of up to 98.38%, with a mining time of only 2.5 s, indicating that it improves the clustering mining of large-scale network anomaly data.
Introduction
Anomaly data mining focuses on data objects whose behavior is inconsistent with that of most other data objects or deviates from normal behavior. Such objects are usually discarded as noise by other data mining functions, but in some cases they contain extremely important information [1]. The task of anomaly data mining is to find these objects, which are hidden among the majority of ordinary data objects, and to uncover the valuable information behind them [2, 3]. Anomaly data mining usually involves three steps: defining abnormal data, selecting appropriate algorithms, and analyzing the significance of the abnormal data [4].
Scholars have studied this problem and made some progress. Liu et al. [5] proposed an optical fiber network anomaly analysis algorithm based on Bayesian partition data mining. A Bayesian approach is used to quantitatively classify the features of data samples, and the prior probability is corrected through maximization analysis; the mining characteristic parameters and probability coefficients are set according to the type of abnormal information; and the sample data are mined according to the Bayesian partition. This method has high recognition accuracy, but it converges slowly and its recognition efficiency is poor. Kang et al. [6] proposed a Naive Bayes-based regional anomaly data mining method, which reconstructs the Euclidean acquisition measurement strategy between data vectors by using the illusion space and calculates the deviation ratio of each data item. The abnormal data are classified to obtain the corresponding set and trigger, and regional abnormal data are mined according to the probabilistic transient calculation of the data nodes. This method mines anomaly data with higher accuracy, but its mining efficiency is poor. Chen et al. [7] proposed a power grid abnormal data identification method based on an improved generative adversarial network. The generator and discriminator of a WGAN are trained alternately to learn the distribution characteristics of power generation statistics and to generate samples. The original abnormal samples are augmented, the expansion proportion of abnormal samples is determined, and outliers are identified by applying the isolation forest algorithm to the expanded, balanced data set. This method has high recall, but it is limited by long mining time and low mining efficiency.
To improve the accuracy and recall of clustering mining for large-scale network abnormal data and to shorten mining time, this study proposes a clustering mining method based on selective collaborative learning. A machine learning anomaly detection model is designed through collaborative training and selective ensemble learning, a strong classifier for large-scale network data is designed through the collaborative training algorithm, and the collaborative training process for large-scale network data is given. This allows the collected network abnormal data to be classified more accurately, which is the basis for improving the accuracy of clustering mining. Correlation variable analysis is then used to obtain the dissimilarity measure of the data, joint association rule analysis is used for fuzzy fusion of the abnormal network data, and hybrid weighted block matching of large-scale network abnormal data is realized, which improves the recall of clustering mining. Finally, the nearest neighbor algorithm is used to cluster the large-scale network abnormal data, which improves mining efficiency and reduces mining time.
Large-scale network anomaly data mining
Meaning of abnormal data
Abnormal data is defined as follows: an abnormal data object is inconsistent with most data objects, deviates significantly from other data objects, and does not conform to the general pattern or behavior of the data. Such objects usually appear as noise in the clustering process: they either belong to no cluster or form small patterns within a cluster.
Large-scale network anomaly data mining process
Data mining follows a well-defined process. It extracts previously unknown, meaningful information from a given data set that can be used to make decisions or enrich knowledge. Before mining a data set, researchers must first decide on the mining steps and on the methods for achieving the objective of each step, and make a clear plan so that the process is carried out in an orderly way [8, 9]. In general, data analysis in a specific data mining application consists of three stages:
The first stage: Data set preparation phase
In this stage, the problem to be solved is described and the data set is obtained. The stage includes data cleaning, data integration, data transformation and data reduction. Data cleaning supplements the missing attributes of data objects, removes noise and deletes duplicate values by analyzing the characteristics and distribution of the data objects [10]. Data integration brings together, physically and logically, the data obtained from a variety of heterogeneous sources and stores them in a unified data store. Data transformation smooths and generalizes the integrated data objects and converts them into a form that is more convenient for analysis. Data reduction compresses the data set: the reduced data set yields almost the same mining results as the original, while its smaller size speeds up analysis [11, 12, 13].
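As a minimal sketch of the preparation steps described above, the following Python function removes duplicate records, fills missing attributes with the column mean, and rescales each attribute to [0, 1]. The function name `prepare`, the use of `None` for a missing value, and the choice of mean imputation and min-max scaling are assumptions made for illustration; the paper does not specify which cleaning or transformation operators it uses.

```python
from statistics import mean

def prepare(records):
    """Clean numeric records: drop duplicates, impute gaps, rescale to [0, 1]."""
    # Data cleaning: delete duplicate records, keeping first-seen order.
    unique = list(dict.fromkeys(tuple(r) for r in records))
    n_cols = len(unique[0])
    # Data cleaning: fill a missing attribute (None) with its column mean.
    col_mean = [mean(r[j] for r in unique if r[j] is not None) for j in range(n_cols)]
    filled = [[col_mean[j] if r[j] is None else r[j] for j in range(n_cols)] for r in unique]
    # Data transformation: min-max scale each column; constant columns map to 0.
    lo = [min(r[j] for r in filled) for j in range(n_cols)]
    hi = [max(r[j] for r in filled) for j in range(n_cols)]
    return [
        [(r[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0 for j in range(n_cols)]
        for r in filled
    ]

raw = [[1.0, 10.0], [1.0, 10.0], [3.0, None], [5.0, 30.0]]
cleaned = prepare(raw)
```

The sketch assumes every column has at least one observed value; a column made up entirely of missing values would need separate handling.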
The second stage: Data mining stage
In this stage, various algorithms and tools are used to mine the deeply hidden information in the data set. The process consists of determining the mining task, selecting the mining technology, determining the mining method and analyzing the data set. First, the mining task is determined according to the specific requirements, so that the mining method can be selected accurately and the subsequent work completed efficiently and smoothly. Next, an appropriate data mining technology is selected. Then the data mining method is determined: each method suits particular mining tasks, so the most efficient algorithm should be chosen from the available algorithms [14]. The choice should be based on the characteristics of the specific data set and on the goals and requirements of the analysts [15]. Finally, the selected algorithm is used for iterative search, and hidden, useful and novel patterns are mined to complete the analysis of the data set.
The third stage: Knowledge expression and analysis stage
The knowledge and patterns mined in the previous stage are used to explain real-world events and predict possible future events. Decision makers can make decisions according to the mined patterns in two ways. The first is pattern evaluation: domain experts evaluate the accuracy of the results, delete irrelevant and redundant patterns, and store the remaining patterns in the pattern library or present them to users. The second is knowledge representation, which expresses the models and knowledge in a form that is easy to understand and apply. In a concrete data mining project, the above stages are iterated repeatedly.
Clustering mining method of large-scale network abnormal data
Collaborative training and selective ensemble learning
Traditional semi-supervised collaborative training (co-training) requires that the data set used to train the base classifiers have two sufficiently redundant views. During collaborative training, one classifier labels data randomly selected from the data set, and the other classifier learns from the data labeled by the former [16]. In this way, a strong classifier can be obtained from the learning data set. However, the collaborative training algorithm is limited by the diversity of the classifiers during training and learning. To this end, this study proposes RFCL, a time series anomaly detection algorithm based on collaborative learning combined with selective ensemble learning, and combines it with the multi-scale wavelet transform to help the collaborative learning algorithm select unlabeled sequences for classifier training; the result is used for anomaly detection of time series patterns. Semi-supervised collaborative training belongs to the field of machine learning [17]. The machine learning model is shown in Fig. 1.
Machine learning model.
Machine learning is divided into supervised learning, semi-supervised learning and unsupervised learning [18]. The relationship among the three is shown in Fig. 2.
Relationship between learning style and data set.
The data set used to train the base classifiers of the co-training algorithm must have two sufficiently redundant views (data sets): first, the two views are independent of each other with respect to the same problem; second, each view can explain the problem on its own. During collaborative training, one classifier labels data randomly selected from the data set, and the other classifier learns from the data labeled by the former. In this way, a classifier based on the learning data set can be obtained. This assumption can also be expressed as follows: let
View independence hypothesis in collaborative training.
It can also be explained that let the classification function
Bipartite diagram of labeled samples and unlabeled samples.
The value of sample space
Collaborative training process.
In Fig. 5, the training set is divided into a labeled training set and an unlabeled training set. Two classifiers are trained on the labeled data, one for each view, and each classifies data from the unlabeled training set; the data labeled by one classifier are then passed to the other classifier, which learns from them. The data labeled in this way are moved from the unlabeled training set to the labeled training set, and the cycle is iterated.
The purpose of the collaborative training algorithm is to train the base classifiers using a large amount of unlabeled data and a small amount of labeled data, and thereby obtain a strong classifier. During collaborative training, the labeled training set grows over time. The process of the collaborative training algorithm can be expressed as follows:
Input:
Marked data set
Algorithm:
The unlabeled data set
The classifier
The classifier
The classifier
The classifier
Add these newly marked data to the marked data set
Randomly select
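The loop above can be illustrated with a small, self-contained Python sketch. It is not the RFCL implementation: the nearest-centroid base classifier, the rule of labeling the single most confident sample per step, and the names `centroids`, `predict` and `co_train` are all assumptions chosen to keep the example short. Each view is given as a tuple of feature indices.

```python
def centroids(X, y, view):
    """Fit a nearest-centroid classifier on one view (a tuple of feature indices)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(view))
        for k, j in enumerate(view):
            acc[k] += x[j]
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(model, x, view):
    """Return (label, margin); a larger margin means a more confident prediction."""
    dists = sorted(
        (sum((x[j] - m[k]) ** 2 for k, j in enumerate(view)) ** 0.5, c)
        for c, m in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin

def co_train(X_lab, y_lab, X_unlab, view_a, view_b, rounds=10):
    """Grow the labeled set by letting the two view classifiers label data in turn."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(rounds):
        for view in (view_a, view_b):
            if not pool:
                break
            model = centroids(X_lab, y_lab, view)
            # This view's classifier labels the pooled sample it is most sure about.
            best = max(range(len(pool)), key=lambda i: predict(model, pool[i], view)[1])
            label, _ = predict(model, pool[best], view)
            # The newly labeled sample joins the shared labeled set, so the
            # other view's classifier learns from it on its next turn.
            X_lab.append(pool.pop(best))
            y_lab.append(label)
    return centroids(X_lab, y_lab, view_a), centroids(X_lab, y_lab, view_b), X_lab, y_lab

X_l = [(0.0, 0.0), (10.0, 10.0)]
y_l = ["normal", "attack"]
X_u = [(1.0, 0.5), (9.0, 9.5), (0.5, 1.0), (9.5, 9.0)]
model_a, model_b, X_all, y_all = co_train(X_l, y_l, X_u, view_a=(0,), view_b=(1,))
```

Starting from two labeled samples, the labeled set grows until the unlabeled pool is empty, which mirrors the statement that the amount of labeled data increases during collaborative training.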
Based on the similarity analysis, the numerical attribute features and categorical attribute features of the abnormal network data are extracted and fused, and the optimized mining algorithm for abnormal network data is improved. Under mixed attributes, the weighted filtering function for abnormal network data mining is:
Where,
When the data satisfies the wide-sense stationary condition, the fuzzy centroid coefficient
Fuzzy fusion processing is applied to the abnormal network data, and the classification fuzzy set of the abnormal network data is constructed based on the fuzzy centroid dissimilarity measure, as follows:
The fuzzy fusion weighting coefficient of abnormal network data meets the following requirements:
A modulation parameter is introduced into the adjacent target data mining node nodei, and the allocation rules
Where
In anomaly detection, a clustering algorithm groups data objects according to given requirements and rules, using the similarity between data objects as the criterion for dividing clusters. It is an unsupervised classification method. Here, the classical nearest neighbor clustering algorithm is adopted to detect abnormal behavior, and the data are clustered into k clusters by simple iteration.
The nearest neighbor algorithm is a classical clustering algorithm that is easy to implement and has strong flexibility and adaptability. The accuracy of the clustering algorithm directly affects the effect of network anomaly detection. This paper uses the entropy method to weight each attribute of the data objects, which determines the attribute weights objectively and improves clustering accuracy. Normal and abnormal behavior are clustered first; a statistics-based cluster pattern recognition method then discriminates the cluster patterns, completing the monitoring of and defense against large-scale intrusion behavior.
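The entropy weighting mentioned above can be sketched as follows. The paper does not give its exact formula, so this example assumes the standard entropy weight method: each attribute column is normalized to proportions, its entropy is computed, and attributes with lower entropy (more variation across objects) receive larger weights. The function name `entropy_weights` is hypothetical, and the inputs are assumed to be non-negative with at least two data objects.

```python
import math

def entropy_weights(X):
    """Return one weight per attribute using the standard entropy weight method."""
    n, m = len(X), len(X[0])
    divergence = []
    for j in range(m):
        col = [row[j] for row in X]
        total = sum(col)
        # Proportion of each object within attribute j (uniform if the column is all zero).
        p = [v / total for v in col] if total > 0 else [1.0 / n] * n
        # Normalized entropy in [0, 1]; terms with p = 0 contribute nothing.
        e = -sum(q * math.log(q) for q in p if q > 0) / math.log(n)
        # Degree of divergence: low entropy means the attribute is informative.
        divergence.append(max(0.0, 1.0 - e))
    s = sum(divergence)
    return [d / s for d in divergence] if s > 0 else [1.0 / m] * m

# The first attribute is constant (uninformative); the second varies.
weights = entropy_weights([[1.0, 1.0], [1.0, 5.0], [1.0, 9.0]])
```

In this example almost all of the weight goes to the second attribute, because the first attribute takes the same value for every object and so cannot help separate clusters.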
For nearest neighbor clustering algorithm, suppose there is a data object set
Algorithm description:
arbitrarily select a data object calculate the distance assuming that there are cluster centers calculate the distance from each data object to the center of each cluster in turn, and divide it into the nearest cluster by comparison until all
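A minimal sketch of nearest neighbor clustering, under stated assumptions, is given below. It uses the common threshold-based variant: the first object seeds a cluster, and each later object joins its nearest cluster if the weighted distance to that center is within a threshold, otherwise it starts a new cluster. The `threshold` parameter, the incremental center update, and the function names are assumptions for illustration; the paper's own procedure for determining the number of clusters may differ.

```python
import math

def weighted_distance(a, b, w):
    """Weighted Euclidean distance between two data objects."""
    return math.sqrt(sum(wk * (x - y) ** 2 for x, y, wk in zip(a, b, w)))

def nearest_neighbor_clusters(data, threshold, weights=None):
    """Cluster data in one pass; returns (centers, members)."""
    weights = weights or [1.0] * len(data[0])
    centers, members = [], []
    for x in data:
        k, nearest = None, None
        if centers:
            # Distance from x to every current cluster center, then pick the closest.
            dists = [weighted_distance(x, c, weights) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            nearest = dists[k]
        if nearest is None or nearest > threshold:
            # No cluster is close enough: x becomes the center of a new cluster.
            centers.append(list(x))
            members.append([x])
        else:
            # Assign x to the nearest cluster and recompute that center as the mean.
            members[k].append(x)
            n = len(members[k])
            centers[k] = [sum(col) / n for col in zip(*members[k])]
    return centers, members

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (0.5, 0.5)]
centers, members = nearest_neighbor_clusters(points, threshold=3.0)
```

The `weights` argument is where attribute weights such as those from the entropy method could be passed; with no weights the distance reduces to the ordinary Euclidean distance.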
When a clustering algorithm is applied to intrusion detection, it usually clusters a data set containing both normal and abnormal data, and then compares the number of data objects in each cluster with a given threshold. If the number is below the threshold, the cluster is an abnormal cluster and all data objects in it are abnormal data objects. This approach assumes that the data set contains far more normal data than abnormal data. In practice, however, intrusions such as DoS and probe attacks often generate a large amount of data while normal behavior generates only a small amount, so the approach has difficulty detecting anomalies in this case. This study therefore takes a different approach. The basic idea is as follows: first, a data set containing only normal data is used as the training set and is clustered with the above algorithm for automatically determining the number of clusters. Then, for each data object to be mined, the distance to each cluster center is calculated to find the nearest distance and the corresponding nearest class, and the nearest distance is compared with a threshold to judge whether the data object is normal or abnormal. The threshold range is jointly determined by the cluster radius of the nearest class and the standard deviation of the cluster radius. A better coefficient
Firstly, the data set containing only normal data is trained by the improved algorithm. It is assumed that the clustering centers of the trained data are
Abnormal data clustering mining model
Clustering-based anomaly mining is usually divided into two stages: a training stage and a mining stage. In the training stage, the normal data are clustered with the algorithm for automatically determining the number of clusters, which yields the optimal cluster centers and number of clusters. In the mining stage, the data in the test set are judged normal or abnormal according to the cluster centers obtained in the training stage and the abnormal data judgment method described above. The clustering-based mining model for abnormal data is shown in Fig. 6.
Abnormal data clustering mining model.
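The two stages can be sketched in Python as follows. The training stage summarizes clusters of normal data by center, radius and spread; the mining stage flags a data object as abnormal when its distance to the nearest center exceeds a threshold. The threshold form used here (mean member-to-center distance plus a coefficient times the standard deviation of those distances) and the default `coeff=3.0` are assumptions, since the paper's exact threshold formula and coefficient are not recoverable from the text; the function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def fit_normal_profile(clusters):
    """Training stage: summarize each cluster of normal data as (center, radius, spread)."""
    profile = []
    for pts in clusters:
        center = [mean(col) for col in zip(*pts)]
        dists = [math.dist(p, center) for p in pts]
        # Radius: mean member-to-center distance; spread: its standard deviation.
        profile.append((center, mean(dists), pstdev(dists)))
    return profile

def is_abnormal(x, profile, coeff=3.0):
    """Mining stage: compare the nearest-center distance with the cluster's threshold."""
    center, radius, spread = min(profile, key=lambda c: math.dist(x, c[0]))
    return math.dist(x, center) > radius + coeff * spread

normal_clusters = [
    [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)],
]
profile = fit_normal_profile(normal_clusters)
flags = [is_abnormal(p, profile) for p in [(1.0, 1.5), (5.0, 5.0), (11.0, 11.2)]]
```

Because the profile is built from normal data only, the judgment does not depend on normal records outnumbering abnormal ones in the data being mined.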
Experimental scheme
To validate the proposed algorithm for automatically determining the number of clusters in abnormal data mining, the simulation experiment uses the KDD Cup 1999 data set for network intrusion mining. The data set consists of network connection records collected by Lincoln Laboratory on a simulated U.S. Air Force LAN. Because the full data set is very large, the “kddcup.data_10_percent” subset is commonly used for intrusion mining experiments. Each connection in the data set is a sequence of packets from start to end within a time period, during which data are transmitted from a source IP to a destination IP under a predefined protocol. Each network connection is labeled as either normal or abnormal. The subset contains a total of 494021 records, of which 97278 are normal and the remaining 396743 are abnormal intrusion records. The categories and names of the intrusion attacks are shown in Table 1.
Categories and names of intrusion attacks
Accuracy of clustering mining large-scale network anomaly data
The proposed method is compared with the methods of References [5], [6] and [7] in terms of the clustering mining accuracy for large-scale network abnormal data. The results are shown in Table 2.
Accuracy of clustering mining of large-scale network abnormal data
When the number of iterations is 50, the clustering mining accuracy for large-scale network anomaly data is 56.72% for the method of Reference [5], 59.17% for the method of Reference [6], 57.32% for the method of Reference [7], and 99.42% for the proposed method. When the number of iterations is 300, the accuracy is 70.13% for the method of Reference [5], 68.22% for the method of Reference [6], 59.36% for the method of Reference [7], and 99.86% for the proposed method, which is the highest.
The proposed method is also compared with the methods of References [5], [6] and [7] in terms of the clustering mining time for large-scale network anomaly data. The results are shown in Table 3.
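The paper does not state how accuracy and recall are computed. As a hedged illustration, the sketch below uses the standard definitions: accuracy is the fraction of records classified correctly, and recall is the fraction of truly abnormal records that are detected. The function name and the label strings `"abnormal"` and `"normal"` are assumptions for this example.

```python
def accuracy_and_recall(y_true, y_pred, positive="abnormal"):
    """Standard accuracy and recall, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    # True positives: abnormal records detected; false negatives: abnormal records missed.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = correct / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

truth = ["abnormal", "abnormal", "abnormal", "normal", "normal"]
preds = ["abnormal", "abnormal", "normal", "normal", "abnormal"]
acc, rec = accuracy_and_recall(truth, preds)
```

In this toy example three of five records are classified correctly and two of three abnormal records are detected, so accuracy and recall differ even on the same predictions.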
Cluster mining time of large-scale network anomaly data
When the number of iterations is 100, the clustering mining time for large-scale network anomaly data is 96.32 s for the method of Reference [5], 79.2 s for the method of Reference [6], 68.2 s for the method of Reference [7], and 0.5 s for the proposed method. When the number of iterations is 300, the time is 98.5 s for the method of Reference [5], 175.8 s for the method of Reference [6], 137.9 s for the method of Reference [7], and 2.5 s for the proposed method, which is the shortest.
The proposed method is compared with the methods of References [5], [6] and [7] in terms of the recall of clustering mining for large-scale network abnormal data. The results are shown in Table 4.
Recall of large-scale network anomaly data clustering mining
When the number of iterations is 150, the recall of clustering mining for large-scale network anomaly data is 62.18% for the method of Reference [5], 70.25% for the method of Reference [6], 73.21% for the method of Reference [7], and 97.45% for the proposed method. When the number of iterations is 300, the recall is 52.26% for the method of Reference [5], 69.32% for the method of Reference [6], 68.21% for the method of Reference [7], and 99.56% for the proposed method, which is the highest.
Abnormal data are usually data objects that are inconsistent with most of the data, yet they carry valuable information. To improve the accuracy and recall of clustering mining for large-scale network abnormal data and to shorten mining time, this study proposes a clustering mining method based on selective collaborative learning. The machine learning anomaly detection model is designed through collaborative training and selective ensemble learning; the sample space is described by a bipartite graph; the strong classifier for large-scale network data is designed through the collaborative training algorithm; the collaborative training process for large-scale network data is given; and joint association rule analysis is used for fuzzy fusion of the abnormal network data. The fuzzy fusion weighting coefficient of the abnormal network data is obtained, hybrid weighted block matching of large-scale network abnormal data is realized, and the data are clustered with the nearest neighbor algorithm. The effectiveness of the proposed method is verified by experiments, with the following results:
When the number of iterations is 300, the clustering mining accuracy of the proposed method is 99.86%, while the accuracies of the three comparison methods are 70.13%, 68.22% and 59.36%, respectively, indicating that the accuracy of the proposed method is much higher. At the same number of iterations, the clustering mining time of the proposed method is 2.5 s, while the times of the three comparison methods are 98.5 s, 175.8 s and 137.9 s, respectively, indicating that the proposed method is significantly faster and more efficient. The recall of the proposed method is 99.56%, while the recall rates of the three comparison methods are 52.26%, 69.32% and 68.21%, respectively, indicating that the recall of the proposed method is also much higher.
Taken together, these results show that the proposed method effectively improves the accuracy and recall of clustering mining for large-scale network abnormal data, shortens the mining time and improves mining efficiency, so it can serve as a reference for applications of clustering mining to large-scale network abnormal data. However, the research still has limitations. For example, this paper studies only the clustering mining of large-scale abnormal data and does not consider small-scale data. Future work will address this case in order to extend the conclusions and improve the method.
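To make the timing comparison concrete, the short calculation below divides each comparison method's mining time at 300 iterations by the 2.5 s of the proposed method. The times are the ones reported above; the variable names are illustrative only.

```python
proposed_time = 2.5  # seconds at 300 iterations, as reported above
baseline_times = {"Reference [5]": 98.5, "Reference [6]": 175.8, "Reference [7]": 137.9}

# Ratio of each comparison method's time to the proposed method's time.
speedups = {name: round(t / proposed_time, 2) for name, t in baseline_times.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio}x the mining time of the proposed method")
```

On these figures the comparison methods take roughly 39, 70 and 55 times as long as the proposed method at 300 iterations.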
