Abstract
At present, network abnormal data detection algorithms suffer from low efficiency, low accuracy and a very high false negative rate, and the location accuracy of abnormal data is not ideal. An intelligent detection method for network abnormal data based on space-time nearest neighbor and likelihood ratio test is proposed. A time interval adjustment algorithm based on the change smoothness judgement strategy and the adaptive data change rule is used to adaptively adjust the data acquisition time interval according to network performance parameters and thereby collect network data. Grid partition is used to convert source data points into appropriate granularity to complete the data preprocessing. Based on the maximum a posteriori probability, the measured values of the data to be detected at several moments are selected as the time nearest neighbor points, and the abnormal degree of the data is quantified. The likelihood ratio test is then used to determine whether the data is abnormal. The abnormal alarm information is aggregated and arranged according to size. The two alarm times with the maximum difference value are used as the boundary, and the multi-point dislocation combined abnormal location method is used to locate the detection result. Experimental results show that the average detection time of the proposed algorithm is 0.21 s and the average false negative rate is 2.8%, while the accuracy of abnormal data detection and the positioning accuracy are high. The proposed algorithm can detect network abnormal data efficiently, which lays a foundation for the development of this field.
Introduction
With the rapid development of various mobile Internet applications, the Internet has become an indispensable platform for daily life. However, as network traffic increases, various kinds of abnormal traffic also appear, which seriously affects the quality of network communication and the security of user hosts [1]. Especially for networks in a dynamic data environment, the traffic anomaly of backbone links can be directly used as a basis for detecting network failures. Therefore, the detection of abnormal data has great practical significance. In recent years, researchers have obtained some excellent results.
Reference [2] proposes a network abnormal data detection model based on the selection of intrusion features. The correlation dimension method is used to extract and mine data features from network channel data and to optimize the information features of the correlation dimension, so as to realize the identification and classification of intrusion information. Combined with the fuzzy C-means clustering algorithm, effective mining and detection of network abnormal data is achieved. Experimental results show that this detection model can improve the detection efficiency of network abnormal data and intrusion information, but the detection accuracy is low. Reference [3] proposes a method for detecting abnormal data in optical fiber networks based on an improved genetic algorithm. By adjusting the crossover probability and mutation probability, this method improves the genetic algorithm to avoid local optimal solutions. Based on the analysis of the genetic algorithm, the problem of abnormal data detection in optical fiber networks is converted into the problem of obtaining the best solution, and the improved genetic algorithm is used to detect the abnormal data. Experimental results show that the method is simple and accurate, but the data is not preprocessed before detection, which leads to low detection efficiency. Reference [4] proposes a network abnormal data detection method based on migration technology and D-S evidence theory. This method generalizes the traditional network abnormal detection technology and has a high detection rate for unknown network anomalies, but the location accuracy of abnormal data is low. Reference [5] proposes a network abnormal detection model based on feature selection by information gain. The model has short time consumption, but the false negative rate is high.
To address the problems of current algorithms and methods, an intelligent detection algorithm for network anomaly data based on space-time nearest neighbor and likelihood ratio test is proposed. The specific structure is:
The time interval adjustment algorithm based on the adaptive data change law and the change smoothness judgment strategy is used to collect network data in a dynamic data environment, which lays the foundation for the subsequent detection of abnormal data.
Grid partition is used to transform the source data into appropriate granularity to achieve data preprocessing. Thus, the efficiency of abnormal data detection is improved.
Space-time nearest neighbor and likelihood ratio test are used for the intelligent detection of network anomaly data, so as to improve the accuracy of abnormal data detection.
High-precision location of the abnormal data link is realized based on the multi-point dislocation combined anomaly location method.
The feasibility of the intelligent detection algorithm for network anomaly data based on space-time nearest neighbor and likelihood ratio test is verified by simulation experiments.
The whole article is summarized and future research is discussed.
Material and methods
Collection of network data in dynamic data environment
In order to improve the detection accuracy of abnormal data, the time interval of data collection should be adjusted according to the data change rule. In simple terms, the time interval of data collection can be adjusted automatically according to the data change. When data change is relatively gentle, the interval time of data collection should be increased automatically to save bandwidth [6, 7]. When the data changes violently, the interval time of data collection is reduced so that the performance of managed network can be more accurately analyzed.
In conclusion, the problem of data acquisition in abnormal data detection can be transformed into the problem of the smoothness of data change and the increase and decrease of interval time of data acquisition. Based on this requirement, the data acquisition algorithm based on adaptive data change rule is divided into following parts: the judgement strategy of smoothness of data change and the time interval adjustment algorithm of adaptive data change rule.
In this algorithm, the way of judging smoothness of data is to evaluate the current data smoothness through the evaluation of change of data collected in a period. Because the change of network data is random, it is difficult to find a function or a curve to simulate or implement the graphical representation of network data. Therefore, we mainly investigate the data fluctuation and carry out a quantitative evaluation of fluctuation. According to the quantized value of fluctuation, a method for judging data change smoothness is obtained. The detailed process is as follows:
Suppose that the sampling value of performance data D at moment t is D_i, where i = 1, 2, 3, ⋯. |D_i - D_{i-1}| is called the change value of data at moment t. A period m (m > 0) is selected, whose unit is the second, and it is called the interval of data change.
If there are k sampling points in the time interval [t - m, t], namely D_j, j = 1, 2, 3, ⋯, k, obviously D_k = D_i,
According to formula (1), only using the
Another period n (n ≥ m > 0) is selected, and its unit is the second, which is called the analysis interval of data change smoothing. Assuming that there are p sampling points in the time interval [t - n, t], they are recorded as D_j, j = 1, 2, 3, ⋯, p. Obviously D_p = D_i, p ≥ k.
flatDegree (m, n) is defined as the smoothness of data change of sampling point D_i in the data change interval m and the analysis interval n of smoothness of data change at the moment t, that is to say, the judgement strategy of data change smoothness, n > m > 0. The value of flatDegree (m, n) is obtained by introducing formula (1) and formula (2) into formula (3):
The concrete meaning of the smoothness flatDegree (m, n) of data change is as follows: when the value is relatively large, the data changes acutely; when the value is relatively small, the data changes relatively gently.
According to the calculation and analysis above, data change smoothness is a decimal which is greater than 0 and less than 1. In the data collection algorithm based on the adaptive data change rule, a critical threshold of data change smoothness can be set based on the precision requirement. After the first sampling, the system calculates the data change smoothness at the sampling point. If this value is larger than the threshold set in advance, the current interval is too large and the data changes acutely. In this case, the data collection time interval should be reduced to improve the sampling precision and, in turn, the accuracy of abnormal data detection [8, 9]. If the value is less than the critical threshold, the current data change is relatively gentle, and the sampling time interval should be increased.
In the time interval adjustment algorithm based on the adaptive data change law, the data collection system first judges the change trend of network performance parameters at the current moment according to the strategy given above, and then automatically adjusts the time interval of data collection according to this result [10, 11]. When the data changes gently, the time interval of data collection is increased. When the data changes acutely, the time interval of data collection is reduced.
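The smoothness judgement can be illustrated with a minimal Python sketch. Formulas (1) to (3) are not reproduced here, so the definition used below is only an assumption: `flat_degree` is taken as the share of the total change in the analysis interval [t - n, t] that falls inside the data change interval [t - m, t], which yields a value between 0 and 1 (inclusive in this sketch) that grows when the most recent samples change acutely. The function names, the `times`/`values` lists and the `threshold` argument are hypothetical.

```python
def total_change(times, values, start, end):
    """Sum |D_j - D_(j-1)| over consecutive samples whose timestamps lie in [start, end].

    Assumes `times` is sorted in ascending order and aligned with `values`.
    """
    window = [v for ts, v in zip(times, values) if start <= ts <= end]
    return sum(abs(b - a) for a, b in zip(window, window[1:]))


def flat_degree(times, values, t, m, n):
    """Assumed smoothness score: change inside [t - m, t] divided by change inside [t - n, t]."""
    if not n >= m > 0:
        raise ValueError("expected n >= m > 0")
    recent = total_change(times, values, t - m, t)
    overall = total_change(times, values, t - n, t)
    # A window with no change at all is treated as perfectly gentle.
    return recent / overall if overall > 0 else 0.0


def changes_acutely(times, values, t, m, n, threshold):
    """Compare the assumed smoothness score with the critical threshold."""
    return flat_degree(times, values, t, m, n) > threshold
```

For a steadily rising series sampled once per second, `flat_degree(times, values, t=5, m=2, n=5)` stays well below 1, whereas a series that is constant and then jumps in the last two seconds gives a value of 1, so a threshold such as 0.5 would separate the two cases.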
In order to get the final acquisition results, the following parameters are set:
The benchmark time interval t1 of data collection: it is the minimum time unit in this algorithm. The increase or decrease of the time interval of data collection in the algorithm must be an integer multiple of this time unit.
The initial time interval t2 of data collection: when the system runs, the initial time interval of data collection is an integer multiple of the benchmark time interval of data collection.
The maximum interval t3 of data collection: when the algorithm automatically adjusts the time interval of data collection, the interval must not be larger than this value, and it is an integer multiple of the benchmark time interval. Generally, the quantization interval m of data change should be larger than this value.
The minimum interval t4 of data collection: when the algorithm automatically adjusts the time interval of data collection, the interval must not be less than this value, and it must be an integer multiple of the benchmark time interval [12].
According to the above setting of parameters, the results of network data acquisition in dynamic data environment are obtained:
Where, f (x) denotes the data collection sequence. The value of flatDegree (m, n) controls the data collection: the corresponding time interval of data acquisition is chosen automatically according to the range in which flatDegree (m, n) falls. When flatDegree (m, n) is relatively large, a small time interval is selected. Conversely, when flatDegree (m, n) is relatively small, a large time interval is selected.
In conclusion, according to the evaluation of data changes in a period, a strategy for judging the smoothness of data change is established. The time interval of data collection is adjusted according to the change trend of network performance parameters at the current moment. When the data changes gently, the time interval of data collection is increased. When the data changes acutely, the time interval of data collection is reduced.
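The adjustment rule can be sketched in Python as follows. This is a minimal illustration rather than the formula of the paper: it assumes that the interval moves by exactly one benchmark unit t1 per decision and is clamped to [t4, t3]. The function name `next_interval` and the `threshold` argument are hypothetical, and all intervals are integers in seconds.

```python
def next_interval(current, flat, threshold, t1, t3, t4):
    """Return the next data collection interval.

    current   -- interval currently in use (starts at the initial interval t2)
    flat      -- smoothness value flatDegree(m, n) at the current sampling point
    threshold -- critical threshold of data change smoothness
    t1        -- benchmark interval (smallest adjustment step)
    t3, t4    -- maximum and minimum allowed intervals
    """
    for name, value in (("current", current), ("t3", t3), ("t4", t4)):
        if value % t1 != 0:
            raise ValueError(f"{name} must be an integer multiple of t1")
    if flat > threshold:
        candidate = current - t1   # data changes acutely: sample more often
    elif flat < threshold:
        candidate = current + t1   # data changes gently: sample less often
    else:
        candidate = current
    # Never leave the range [t4, t3].
    return max(t4, min(t3, candidate))
```

With t1 = 5, t3 = 30 and t4 = 5, a current interval of 10 s shrinks to 5 s when the smoothness value exceeds the threshold and grows to 15 s when it is below the threshold, and it is never pushed outside the 5 s to 30 s range.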
The data obtained from section 2.1 is preprocessed. The main purpose is to transform source data into appropriate granularity and then input it to the detection algorithm, so as to enhance the detection efficiency of abnormal data. The process is as follows:
If A = {A_1, A_2, ⋯, A_d} denotes a set of bounded attributes, and U denotes a d-dimensional data space, the formula is:
Suppose that V = (v_1, v_2, ⋯, v_d) denotes a d-dimensional data point in U, where the value v_i is taken from A_i. By dividing each attribute dimension into N intervals, the data space is divided into non-intersecting super cuboids.
A grid C is a super cuboid obtained by taking one interval in each dimension. C can be expressed as:
Where, c_i denotes that the symbol dimension is an effective value. C obtained from formula (6) contains many super cuboids that hold uniform or non-uniform data points. Based on data stream processing, an S function is used to convert each attribute value to the interval (0, 1), and the conversion results are then divided equally into N parts. The main idea is to divide less on the regional interval with large data density and to divide more on the regional interval with sparse data, which is shown in Fig. 1.

Fig. 1. Function conversion.
Based on above contents, the process that the S function is used to convert the source data into the appropriate granularity can be expressed by formula (7):
Where, e denotes the average value of historical data. g denotes the standard deviation of historical data.
In data preprocessing, the method based on grid partition is used to divide the data space. Grid partition divides every dimension of the data space, so that the whole data space is divided into a limited number of super cuboids. The S function is used to divide the uniform or non-uniform data points in the super cuboids into the appropriate granularity, which greatly reduces the computational complexity and storage complexity of the system and improves the efficiency of abnormal data detection.
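The preprocessing step can be sketched as follows. Formula (7) is not reproduced here, so the sketch assumes that the S function is the logistic sigmoid of the standardized value (v - e) / g, where e and g are the mean and standard deviation of the historical data as defined above. The function names `s_function`, `grid_index` and `grid_cell` are hypothetical.

```python
import math


def s_function(v, e, g):
    """Assumed S function: logistic sigmoid of the standardized attribute value."""
    if g <= 0:
        raise ValueError("standard deviation g must be positive")
    z = (v - e) / g
    # Numerically stable evaluation for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def grid_index(v, e, g, n_parts):
    """Map one attribute value to one of n_parts equal slices of the unit interval."""
    s = s_function(v, e, g)
    return min(int(s * n_parts), n_parts - 1)


def grid_cell(point, means, stds, n_parts):
    """Map a d-dimensional point to the index tuple of the grid cell that contains it."""
    return tuple(grid_index(v, e, g, n_parts)
                 for v, e, g in zip(point, means, stds))
```

A value equal to the historical mean maps to the middle slice, while extreme values saturate in the first or last slice, so every point receives a bounded cell index regardless of its raw range.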
Based on above contents, an intelligent detection of network abnormal data based on space-time nearest neighbor and likelihood ratio test is proposed to detect abnormal data. The process is as follows:
(1) Selection of space-time nearest neighbor points
If the data to be detected is
The number l of time nearest neighbor points and the number k of space nearest neighbor points are determined. For the selection of time nearest neighbor node
For the selection of k node nearest neighbor nodes of node space
Let us suppose that the network includes at least historical measurement data with l length in a dynamic data environment. At the (t - i)-th time, the absolute value of difference of measurement data between the node
Where,
Let us suppose that the network works normally before the moment t. The nodes corresponding to the smallest k elements in vector
Where,
According to Bayes formula, there are:
According to formula (11), we can obtain:
The above process is iterated. Meanwhile, the posterior probabilities that node is the k nearest neighbor point of node
In the process (1) of selecting space-time nearest neighbor points, based on the maximum a posteriori probability and the stationarity hypothesis of measured data, we select the nodes whose measured data are closest to those of the node to be detected during normal working time as the space-time neighbor points. When the data of the node to be detected is abnormal, the measured data of this node deviates, and the difference in measured data between the node and its space-time neighbor points changes significantly [13, 14].
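The neighbor selection can be illustrated with a simplified Python sketch. It does not implement the iterative posterior probability update of formulas (8) to (12); instead it ranks candidate nodes by the mean absolute difference of their last l historical measurements, which is a plainly simpler stand-in for the same idea that the closest nodes under normal operation become neighbors. The `history` dictionary and both function names are hypothetical.

```python
def select_time_neighbors(history, target, l):
    """Return the l most recent historical measurements of the target node."""
    series = history[target]
    if len(series) < l:
        raise ValueError("not enough historical measurements")
    return series[-l:]


def select_space_neighbors(history, target, l, k):
    """Return the k nodes whose last l measurements are closest to the target's.

    history maps a node name to its measurements before moment t, oldest first,
    with all nodes sampled at the same moments.
    """
    target_recent = select_time_neighbors(history, target, l)
    scores = {}
    for node, series in history.items():
        if node == target:
            continue
        recent = series[-l:]
        scores[node] = sum(abs(a - b) for a, b in zip(target_recent, recent)) / l
    # Smallest mean absolute difference first; ties are broken by node name.
    return sorted(scores, key=lambda node: (scores[node], node))[:k]
```

In this sketch a node that tracks the target closely during normal operation ranks ahead of a node with a constant offset, which in turn ranks ahead of a node from an unrelated group.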
According to (1), we can see that the smaller the absolute value of difference between measured data, the greater the similarity. The weighted summation is carried out for the difference value between the data
Where, An,Δt denotes the weighting coefficient of measured difference between the data
Introducing formula (14) and formula (15) into formula (13), we can obtain the abnormal degree value
(3) Likelihood ratio test
For the node data
The method of likelihood ratio test is used for the judgment based on formula (17).
Where,
Where,
According to (24), the decision threshold when the false alarm rate is 1 - α can be determined by combining abnormal parameters and historical measurement data. Therefore, the result D (V ; t) of whole dynamic data network abnormal data detection can be expressed as:
Where, ξ denotes the detection factor, which can reduce the false negative rate of algorithm and improve the accuracy of abnormal data detection. V denotes a set of network data.
In conclusion, the historical data in abnormal data detection is used to select space-time nearest neighbor points of data to be detected. The abnormal degree of data to be detected is quantified by the abnormality between the data to be detected and the nearest neighbor data. Based on the value of abnormal degree, the hypothesis test model is built. The likelihood ratio test is used to judge whether the data to be detected is abnormal.
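The quantification and decision steps can be sketched as follows. The weighting coefficients and the hypothesis model of the paper are not reproduced here, so the sketch makes two labeled assumptions: the abnormal degree is a weighted mean of absolute differences to the neighbor values, and the abnormal degree follows a Gaussian distribution with the same standard deviation `sigma` under the normal and abnormal hypotheses. The names `abnormal_degree`, `is_abnormal`, `mu_normal`, `mu_abnormal` and `log_threshold` are hypothetical.

```python
import math


def abnormal_degree(value, neighbor_values, weights):
    """Weighted mean absolute difference between a value and its neighbor values."""
    if not weights or len(neighbor_values) != len(weights):
        raise ValueError("need one positive weight per neighbor value")
    total = sum(weights)
    return sum(w * abs(value - x) for w, x in zip(weights, neighbor_values)) / total


def gaussian_log_pdf(x, mu, sigma):
    """Logarithm of the normal density N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma * sigma) - (x - mu) ** 2 / (2.0 * sigma * sigma)


def is_abnormal(degree, mu_normal, mu_abnormal, sigma, log_threshold=0.0):
    """Likelihood ratio test: reject the normal hypothesis when the log ratio is large."""
    log_ratio = (gaussian_log_pdf(degree, mu_abnormal, sigma)
                 - gaussian_log_pdf(degree, mu_normal, sigma))
    return log_ratio > log_threshold
```

Raising `log_threshold` lowers the false alarm rate at the cost of more missed detections; in practice the threshold and the distribution parameters would be estimated from historical measurement data.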
According to the result in section 2.3, the abnormality of data can be detected accurately. But we need to locate the abnormality in order to judge the accurate link and location of abnormal data. In order to improve the accuracy of abnormal data location, multi-point dislocation combined anomaly location method is used to locate detection results.
If there are b detection links, we can collect all alarm information W′_1, W′_2, ⋯, W′_b in the data processing end over a period when a data processing terminal gives an alarm. After this aggregation, all the alarm information is arranged in a certain way. Supposing that they are arranged by alarming frequency, we can obtain
According to (26), it is possible to judge the specific location of abnormal data as long as we find a demarcation point and divide the alarming times into two sets. The difference value D_i between adjacent alarm information is calculated.
The two pieces of alarm information with the maximum difference D_max are taken as the demarcation point, and all the alarm information is divided into set X and set Y, such that any element in set X is larger than any element in set Y. The alarm information in set X indicates that the attack intensity is relatively large, so the links not processed by the corresponding data processing end are normal links and carry no abnormal data. The alarm information in set Y indicates that the attack intensity is relatively small, so the links not processed by the corresponding data processing end are the location of abnormal data. In conclusion, the judgment function f (X|Y) of abnormal data position is obtained:
The accurate position of abnormal data can be obtained by formula (28).
In abnormal data location, the abnormal alarm information is aggregated, and the difference value between adjacent values is calculated. The two alarm times with the maximum difference value are used as the boundary, and all the alarm information is divided into two sets. We can determine that the untreated link corresponding to the data processing end without alarm is abnormal, and the other links are normal. The untreated links in the data processing ends corresponding to the set of large elements are normal, and the untreated links in the data processing ends corresponding to the set of small elements are abnormal. Thus, the abnormal data can be located accurately.
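The boundary search described above can be sketched in Python. The sketch assumes that the alarm information of each data processing end is summarized as one number (for example an alarm count); the `alarm_counts` dictionary and the function name `split_alarms` are hypothetical, and the judgment function of formula (28) is not reproduced.

```python
def split_alarms(alarm_counts):
    """Split data processing ends into set X (large values) and set Y (small values).

    alarm_counts maps the name of a data processing end to its aggregated alarm value.
    The boundary is placed at the largest gap between adjacent sorted values.
    """
    if len(alarm_counts) < 2:
        raise ValueError("need at least two data processing ends")
    ranked = sorted(alarm_counts.items(), key=lambda item: item[1], reverse=True)
    # Difference D_i between each pair of adjacent values in descending order.
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps))
    set_x = [name for name, _ in ranked[:cut + 1]]
    set_y = [name for name, _ in ranked[cut + 1:]]
    return set_x, set_y
```

Following the rule stated above, the links left untreated by the ends in `set_y` would be reported as the location of abnormal data, while those associated with `set_x` would be treated as normal.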
In order to verify the overall effect of the proposed algorithm, the experimental platform was built on MATLAB 2017. The simulation was carried out on the component random data set. A network consisting of 210 sensors was designed. The sensor data could be divided into three categories based on the neighbor relationship of sensors, and each category included 70 sensors. Figure 2 shows a part of the experimental environment.

Fig. 2. Experimental environment.
Experimental indicators:
Detection efficiency of abnormal data
Accuracy of abnormal data detection
False negative rate of abnormal data detection
Location precision of abnormal data
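For reference, the accuracy and false negative rate indicators can be computed from labeled detection results as in the following sketch. It uses the standard confusion-matrix definitions, which are an assumption here because the exact formulas used in the experiments are not stated; the function name `detection_metrics` and the 0/1 label encoding (1 for abnormal) are hypothetical.

```python
def detection_metrics(truth, predicted):
    """Return (accuracy, false_negative_rate) for 0/1 labels, where 1 means abnormal."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("truth and predicted must be non-empty and equally long")
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    accuracy = (tp + tn) / len(truth)
    # False negative rate: share of truly abnormal points that were missed.
    false_negative_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, false_negative_rate
```

With four abnormal points of which one is missed, and one false alarm among six normal points, the sketch gives an accuracy of 0.8 and a false negative rate of 0.25.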
Experimental results:
As shown in Fig. 3, the time consumption curve of the optical fiber network anomaly detection method based on the improved genetic algorithm fluctuated greatly, and its average detection time was 0.55 s. The time consumption curve of the network anomaly data detection algorithm based on space-time nearest neighbor and likelihood ratio test was stable, and its average detection time was 0.21 s. This showed that the proposed algorithm had high efficiency in detecting abnormal data. With the increase of data to be detected, the time consumption of this algorithm did not show an increasing trend, and the overall operation effect was good. In the data preprocessing of this algorithm, grid partition was used to divide the data space, and the S function was used to divide the uniform or non-uniform data points in the super cuboids into the appropriate granularity, which improved the detection efficiency.

Fig. 3. Comparison of detection efficiency with different abnormal data detection methods.
Figure 4(a) shows the experimental model used to evaluate the detection accuracy for abnormal data, which is a random waveform recorded during network operation. The black dots in the graph denote abnormal data points, and the white dots denote the disturbed data. On this basis, the accuracy of the proposed algorithm was verified.

Fig. 4. Comparison of detection accuracy with different abnormal data detection methods.
According to Fig. 4, the detection accuracy of the network anomaly data detection method based on intrusion feature selection was poor on the network running waveform. Although all the abnormal data could be detected, the disturbed data was also detected as abnormal data, which showed that this method could not achieve abnormal data detection with high accuracy. The intelligent detection algorithm for network anomaly data based on space-time nearest neighbor and likelihood ratio test detected abnormal data accurately and was not affected by the disturbed data. Compared with the method based on intrusion feature selection, it was more reliable.
The false negative rate is the missing rate, which is an important index for evaluating an abnormal data detection method. According to Fig. 5, the average false negative rate of the network abnormal detection method based on information gain feature selection was 16.8%, while the average false negative rate of the intelligent detection algorithm for network abnormal data based on space-time nearest neighbor and likelihood ratio test was 2.8%. These data showed that the proposed algorithm was more scientific and practical.

Fig. 5. Comparison of false negative rates with different abnormal data detection methods.
In Fig. 6, the location accuracy curve of the network anomaly data detection method based on migration technology and D-S evidence theory was relatively stable. With the increase of detection data, its location accuracy was not influenced, but it did not exceed 80%. The location accuracy of the intelligent detection algorithm for network abnormal data based on space-time nearest neighbor and likelihood ratio test was also not influenced by the increase of detection data, and its highest location accuracy reached 99%. This comparison showed that the location accuracy of the proposed algorithm was better than that of current algorithms.

Fig. 6. Comparison of location accuracy in different anomaly data detection methods.
With the increase of network data traffic, massive data gradually emerges in work and life. Therefore, the detection of abnormal data is very important. To address the problems of current algorithms and methods, an intelligent detection algorithm for network abnormal data based on space-time nearest neighbor and likelihood ratio test is proposed. Based on grid partition, the source data is transformed into the appropriate granularity and then input into the detection algorithm, which solves the problem of low efficiency of abnormal data detection in current algorithms and methods. The data collection time interval is adjusted according to the judgement strategy of data change smoothness and the adaptive data change rule, which improves the accuracy of data sampling and the accuracy of abnormal data detection. Setting the detection factor reduces the false negative rate of the algorithm and improves the accuracy of abnormal data detection. The location accuracy of abnormal data is improved by the multi-point dislocation combined anomaly location method. In the next step, different anomaly data detection algorithms will be developed for different types of attack to further improve the detection accuracy.
Acknowledgments
Construction of Innovative Talents Platform in Intelligent Commerce based on IDPO Process Theory, transformation of education research project in Jinan University (No. JG2018006).
Research and Large-scale Application of Education Cloud Platform based on Big Data and Gaming Learning, Guangdong Province Science Technology Plan Foundation (No. 2016B010124008).
Construction of “Internet+” service architecture for University Student based on IDPO Process Theory, Collaborative Innovation and Platform Environment Project (No. 2016B040401003).
