Abstract
Classifier-based detection of anomalous network traffic is generally regarded as a general-purpose approach with good detection performance and high precision. To detect abnormal network traffic more efficiently and thereby protect Internet users, this paper studies a network traffic anomaly detection algorithm based on the Mahout classifier. To address the time correlation of detection statistics and the influence of abnormal samples on detection accuracy, the detection window is shifted so that the first detection point is removed and the applicable detection point is added. A BP neural network algorithm and a Bayesian network model are used to predict the probability that a node is anomalous, which optimizes detection precision and reorganizes the anomalous points in the training set. For the anomalies found by the detection model, an emergency response method is proposed, so that anomalies are not only detected but also handled.
Introduction
After long-term statistics and verification, classifier-based detection is generally regarded, among the many methods for detecting anomalous network traffic, as a general-purpose approach with good detection performance and high precision [1]. Several problems nevertheless remain in current practice. First, building an anomaly detection system requires a large amount of training data. The training workload is heavy, data collection and processing are complex, and a single computer cannot analyze such large volumes of data on its own [2]. Second, as the network environment and external conditions change, the original classifier no longer fits the network, which reduces the accuracy of dynamic classification, so the classifier must be re-optimized to adapt to the new environment [25–27]. However, every round of weight retraining is expensive, and maintenance costs are high. Third, even after abnormal network traffic has been found [28–30], it is difficult to respond to it appropriately [3]. A network traffic anomaly detection algorithm based on the Mahout classifier is therefore both urgent and necessary.
This paper studies a network traffic anomaly detection algorithm based on the Mahout classifier. To address the time correlation of detection statistics and the influence of abnormal samples on precision, a BP neural network algorithm and a Bayesian network model are used to predict the probability that a node is anomalous, to optimize detection accuracy, and to reorganize the anomalous points added to the training set.
The Mahout-classifier-based detection method can improve the adaptability and accuracy of the training set while reducing material and resource consumption. Compared with traditional methods, it is more intelligent and innovative. The article has five parts. The first part introduces the research background and significance; the second part reviews the current state of research; the third part introduces the research methods and data sources; the fourth part tests the system and, for the anomalies found by the detection model, proposes an emergency response method so that anomalies can be handled as well as detected; the fifth part summarizes the paper.
Related work
Because network anomaly detection is of great significance and bears directly on people's daily lives and property, researchers at home and abroad have studied it in depth [4]. Many researchers have contributed to abnormal traffic detection using different methods from different angles [5]. In protocol-based abnormal traffic detection, researchers focus on the particular behavior of network protocols under abnormal traffic and apply different analysis methods to it. In machine-learning-based detection, scholars generally hold that a model must first analyze and collect statistics on the original data to form a basic pattern for autonomous statistical analysis, which suits the detection of anomalous attacks without authentication-based judgment under a variety of conditions and environments [6]. Machine learning is also considered a good way to detect abnormal traffic in previously unseen data. In classifier-based detection, scholars hold that the choice of classification method is crucial to the classification result, and they select methods that support both multi-dimensional data statistics and extension [7]. In summary, abnormal network traffic detection has already been studied, but further work is still needed.
LMS and BP algorithm
Least mean square (LMS) algorithm for network anomaly detection
Many methods for detecting anomalous network traffic are in use today, and the boundaries between them are not sharp. Traffic anomaly detection can be divided into two steps: collecting the data source and detecting the abnormal traffic [8]. By data collection method, detection methods can be classified into acquisition based on data flows, acquisition of network protocol traffic such as IP and TCP, and acquisition by packet capture [9]. By the location of the monitoring points, they can be divided into anomaly detection at the end node of a single link and large-scale distributed anomaly detection across nodes. In this paper, the preprocessing and analysis of the data source is moved to a big data processing platform [10]. Taking advantage of the speed, accuracy and reliability of the distributed platform, the current mainstream detection algorithms are studied, together with the support vector selection method used for classification. To relieve the bottleneck in training the data set, a training method is proposed that balances the number of training rounds against training effect and accuracy, and a Bayesian network probability analysis is proposed to decide whether a special detection point is truly abnormal before it is added to the training set. When an anomaly is detected, an adaptive exception handling method automatically adjusts the processing method, processing order, processing parameters and boundary conditions during processing and analysis so that they match the data to be processed [11]. The best result is obtained by matching the distribution characteristics and structural features of the data.
The least mean square (LMS) algorithm is one of the most typical adaptive filtering algorithms. A generic gradient estimate can degrade filter performance, so the LMS algorithm uses a special gradient estimate to avoid these adverse effects. LMS is widely used in practice because it is computationally convenient and simple to implement [12]. For an adaptive linear combiner, the error surface function of the LMS algorithm differs from that of other algorithms and can be expressed as
The update formula of the LMS algorithm, based on the steepest descent method, is then:
where the input vector at time n is X(n) = [x_0(n) x_1(n) ⋯ x_{L−1}(n)]^T, the weight coefficient vector at time n is W(n) = [w_0(n) w_1(n) ⋯ w_{L−1}(n)]^T, and L is the order of the filter. μ is the step factor, whose main function is to control the stability and convergence of the algorithm. For the adaptive LMS process to converge, certain convergence conditions must be met. If the input signal is stationary and independent of the weight coefficient vector, Equation (3) must also be satisfied. Simplifying gives:
where λmax is the largest eigenvalue of the input autocorrelation matrix R. In principle, the LMS algorithm is a closed-loop adaptive system that can be regarded as a low-pass filter. Its computation is simple and easy to carry out, so it is also efficient. The LMS algorithm has limitations, however. Because the weight vector is adjusted using an estimate of the gradient rather than its true value, the adaptive process itself introduces noise. As the number of iterations grows, the large noise components in the gradient estimate gradually decay. If the step size is fixed, however, there is an unavoidable trade-off between convergence speed, tracking ability and steady-state misadjustment [13].
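As a minimal sketch of the LMS recursion discussed above, the Python code below assumes the standard Widrow-Hoff form, W(n+1) = W(n) + 2μe(n)X(n) with e(n) = d(n) − W^T(n)X(n), and the matching mean-convergence bound 0 < μ < 1/λmax. The tapped-delay-line input, the FIR "unknown system" `h_true`, the noise level and all names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def lms_identify(x, d, num_taps, mu):
    """Adapt an FIR filter with LMS so that its output tracks d(n).

    Assumed update (Widrow-Hoff form):
        e(n)   = d(n) - W(n)^T X(n)
        W(n+1) = W(n) + 2 * mu * e(n) * X(n)
    """
    w = np.zeros(num_taps)
    errors = np.zeros(len(x))
    for n in range(num_taps - 1, len(x)):
        # Tap-input vector X(n) = [x(n), x(n-1), ..., x(n-L+1)]
        x_n = x[n - num_taps + 1:n + 1][::-1]
        e = d[n] - w @ x_n
        w = w + 2.0 * mu * e * x_n
        errors[n] = e
    return w, errors

# Hypothetical test signal: white Gaussian input through a made-up FIR system.
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
h_true = np.array([0.5, -0.3, 0.2, 0.1])
d = np.convolve(x, h_true)[:len(x)] + 0.01 * rng.standard_normal(len(x))

# Estimate the input autocorrelation matrix R and its largest eigenvalue
# so that the step factor can be checked against 0 < mu < 1 / lambda_max.
L = len(h_true)
taps = np.array([x[n - L + 1:n + 1][::-1] for n in range(L - 1, len(x))])
R = taps.T @ taps / len(taps)
lambda_max = np.linalg.eigvalsh(R).max()

mu = 0.01
w, errors = lms_identify(x, d, L, mu)
print("step bound 1/lambda_max:", 1.0 / lambda_max)
print("estimated weights:", np.round(w, 3))
```

A larger `mu` shortens convergence but raises the steady-state misadjustment, which is the fixed-step trade-off noted in the text.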
Mahout is open-source software built on Hadoop. Its developers have integrated a number of mainstream classical algorithms into it, so users can invoke them conveniently without recompiling, which makes it an effective integrated toolkit for running algorithms [14]. Through the Hadoop distributed platform, Mahout extends these algorithms to cloud computing; in other words, Mahout is an algorithm library built on distributed Hadoop. Mahout also recasts the classical algorithms in Map/Reduce form, which significantly increases the data volume they can handle and their processing performance [15].
The BP algorithm is a supervised gradient descent algorithm with the following steps. First, the initial weights and thresholds are assigned random numbers in (−1, 1); uniformly distributed random numbers, for example, can be used so that the network is not saturated by large weights. Then a data pair (Xk, Tk) is selected from the training samples and the input vector is applied to the input layer (m = 0), so that for all input nodes i:
where the superscript k is the sample index. The signal is then propagated forward through the network using the relation:
The output of every node j in each layer is calculated, starting from the first layer, until the outputs of the nodes j in the output layer are obtained. The error value of each output-layer node is then calculated (for the Sigmoid function):
This error is derived from the difference between the actual output and the desired target value. The error value of each node in the preceding layers is then calculated:
These values are obtained by propagating the error backwards layer by layer, with p = m, m−1, m−2, ..., 1, until the error of every node in every layer has been computed. Using the weight correction formula:
All connection weights are corrected. In general, η = 0.01∼1 is called the training rate coefficient. The procedure then returns to the second step and repeats for the next input sample until the result converges to within a given accuracy. The algorithm presented in this paper has three distinct features: a parallel processing mode [16], sensitivity in information expansion, and a structure of algebraic and logical operations.
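The BP steps above (random initialization in (−1, 1), forward propagation, Sigmoid error terms, layer-by-layer back propagation, weight correction with rate η) can be sketched as follows. This is a minimal illustration under assumptions: one hidden layer, thresholds implemented as additive biases, the usual delta rule δ = (t − y)·y·(1 − y) for Sigmoid outputs, and a toy logical-OR data set. None of the names or sizes come from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp(X, T, hidden=3, eta=0.5, epochs=5000, seed=0):
    """Per-sample back propagation for a one-hidden-layer Sigmoid network."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # Step 1: weights and thresholds (biases) drawn uniformly from (-1, 1).
    W1 = rng.uniform(-1, 1, (hidden, n_in))
    b1 = rng.uniform(-1, 1, hidden)
    W2 = rng.uniform(-1, 1, (n_out, hidden))
    b2 = rng.uniform(-1, 1, n_out)
    for _ in range(epochs):
        for x, t in zip(X, T):
            # Step 2: forward propagation of one sample (Xk, Tk).
            h = sigmoid(W1 @ x + b1)
            y = sigmoid(W2 @ h + b2)
            # Step 3: output-layer error term for the Sigmoid function.
            delta_out = (t - y) * y * (1.0 - y)
            # Step 4: propagate the error back to the hidden layer.
            delta_hid = h * (1.0 - h) * (W2.T @ delta_out)
            # Step 5: correct all connection weights with rate eta.
            W2 += eta * np.outer(delta_out, h)
            b2 += eta * delta_out
            W1 += eta * np.outer(delta_hid, x)
            b1 += eta * delta_hid
    return W1, b1, W2, b2

def predict(params, X):
    W1, b1, W2, b2 = params
    return sigmoid(sigmoid(X @ W1.T + b1) @ W2.T + b2)

# Toy data (assumption): logical OR of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [1]], dtype=float)
params = train_bp(X, T)
print(np.round(predict(params, X), 3))
```

In a traffic setting the rows of `X` would be feature vectors extracted from flows and `T` the normal/abnormal labels; that mapping is an assumption here, since the text does not specify the features.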
In abnormal traffic detection, the training set is conventionally retrained at fixed time intervals. The experimental environment settings are shown in Figs. 1 and 2. The test data window is shifted forward, and old training data are removed from the detection set as new training data are added [17]. At the same time of day and the same location, network traffic is roughly the same and the data volume hardly fluctuates, so retraining the data set under normal conditions wastes considerable material and resources [18]. To address this, this paper seeks a training method that balances adaptability and accuracy: training in real-time response to anomalies. The data set is automatically retrained only when abnormal points are found, which reduces material and resource consumption while improving detection accuracy [19]. To test abnormal network traffic, the algorithm is used to simulate the Hash value of a text under the following 7 conditions. The Hash results are given in hexadecimal in Table 1. The simulation results show that changing a single bit of the initial value or making a subtle change to the key changes the Hash value substantially. This high sensitivity to the initial value and the key shows that the one-way Hash performance of the algorithm is very good [20].
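The anomaly-triggered retraining policy described above can be illustrated with a small sketch. The detector below is a deliberately simple z-score model standing in for the Mahout classifier, and every anomalous point is admitted to the window, whereas the paper uses a Bayesian network to decide this. The class name, threshold and data are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev

class AnomalyTriggeredTrainer:
    """Refit the model only when an anomalous point is observed."""

    def __init__(self, initial_window, threshold=3.0):
        # Fixed-length window: appending a point drops the oldest one.
        self.window = deque(initial_window, maxlen=len(initial_window))
        self.threshold = threshold
        self.retrain_count = 0
        self._fit()

    def _fit(self):
        # Placeholder "training": estimate mean and spread of the window.
        self.mu = mean(self.window)
        self.sigma = pstdev(self.window) or 1e-9
        self.retrain_count += 1

    def observe(self, value):
        is_anomaly = abs(value - self.mu) / self.sigma > self.threshold
        if is_anomaly:
            # Shift the window to the anomalous point: the first
            # detection point is removed and the new point is added.
            self.window.append(value)
            self._fit()
        return is_anomaly

trainer = AnomalyTriggeredTrainer([10, 11, 9, 10, 12, 10, 9, 11])
flags = [trainer.observe(v) for v in [10, 11, 50, 10]]
print(flags, trainer.retrain_count)
```

Under this policy the number of refits tracks the number of anomalies rather than the number of elapsed intervals, which is where the resource saving claimed in the text would come from.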

Fig. 1. Experimental environment setting interface.

Fig. 2. Experimental environment settings.
Table 1. Hexadecimal representation of simulation results
The confusion and diffusion properties are then analyzed statistically. A Hash function should make the relationship between a message and its Hash value as obscure as possible. In the binary representation of the result each bit can only be 1 or 0, so the ideal diffusion effect is that a slight change in the initial value changes each bit of the result with a probability of 50% [21, 22]. The test proceeds as follows: in each trial, a plaintext is selected from the plaintext space and its Hash value is computed. One bit of the plaintext is then changed to obtain another Hash result, and the two results are compared to find the number of changed bits Bi [23, 24]. Over 2048 trials, the maximum number of changed bits is 84 and the minimum is 45, which reflects the good diffusion effect of the algorithm. After N = 256, 512, 1024 and 2048 such trials, the average number of Hash bits changed by a 1-bit change in the plaintext is obtained, together with the average change probability P and the deviations ΔB and ΔP.
Table 2. Statistics of the number of changed bits
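The one-bit diffusion test described above can be reproduced in outline as follows. Because the paper's own Hash construction is not given in the text, the first 128 bits of SHA-256 are used here purely as a stand-in. The definitions of ΔB and ΔP as sample standard deviations are the ones commonly used for this test and are an assumption, as are all function names.

```python
import hashlib
import random

HASH_BITS = 128

def hash128(message: bytes) -> int:
    # Stand-in hash (assumption): first 128 bits of SHA-256.
    return int.from_bytes(hashlib.sha256(message).digest()[:16], "big")

def changed_bits(message: bytes, bit_index: int) -> int:
    """Flip one plaintext bit and count how many hash bits differ."""
    flipped = bytearray(message)
    flipped[bit_index // 8] ^= 1 << (bit_index % 8)
    return bin(hash128(message) ^ hash128(bytes(flipped))).count("1")

def diffusion_stats(trials, msg_len=64, seed=0):
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        msg = bytes(rng.getrandbits(8) for _ in range(msg_len))
        counts.append(changed_bits(msg, rng.randrange(msg_len * 8)))
    mean_b = sum(counts) / trials                   # average changed bits
    mean_p = mean_b / HASH_BITS * 100               # average change probability, %
    delta_b = (sum((c - mean_b) ** 2 for c in counts) / (trials - 1)) ** 0.5
    delta_p = delta_b / HASH_BITS * 100
    return mean_b, mean_p, delta_b, delta_p, min(counts), max(counts)

for n in (256, 512, 1024, 2048):
    b, p, db, dp, lo, hi = diffusion_stats(n)
    print(f"N={n}: B={b:.2f} P={p:.2f}% dB={db:.2f} dP={dp:.2f}% min={lo} max={hi}")
```

For an ideal 128-bit hash the average should sit near 64 bits and 50%, which is the benchmark the paper compares against.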
Tables 1 and 2 show that the average number of changed bits is B = 63.97 and the average change probability is P = 49.98%, very close to the ideal values of 64 bits and 50%. ΔB and ΔP indicate the stability of the Hash function's confusion and diffusion; the smaller they are, the more stable it is. The Δ values calculated in this paper are very small, so the algorithm's ability to confuse and diffuse the plaintext is strong and stable.
Based on the above, the following algorithm is tested. A Butterworth filter of order 2 with a normalized cutoff frequency of 0.4 Hz is used to simulate the unknown system. The input signal x(n) is Gaussian white noise with zero mean and unit variance. The channel noise v(n) is uncorrelated with x(n) and is also Gaussian white noise, with zero mean and a standard deviation of 0.1. The number of sampling points is 1000. At sample 500 the system changes, with the normalized cutoff frequency becoming 0.25 Hz. 200 independent simulations are run and their statistical average is calculated.
As shown in Fig. 3, the steady-state misadjustment is proportional to the forgetting factor. To observe the steady-state performance of the algorithm, the other algorithm parameters are held fixed and the SNR is set to 10 dB. The steady-state behavior of the two algorithms is then compared as a function of the forgetting factor λ to show their steady-state misadjustment. The solid line e1 represents the improved algorithm, and the dashed line e2 represents the algorithm before the change. In the range 0 < λ < 1, the smaller the forgetting factor, the stronger the tracking ability of the system and the more stable the system is. The advantage of the improved algorithm in steady-state misadjustment is therefore obvious. Network traffic anomaly detection under different dimensions is shown in Fig. 5.
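To make the role of the forgetting factor concrete, the sketch below runs a conventional exponentially weighted RLS filter on a system that changes abruptly at sample 500, mirroring the simulation setup in outline. It is not the paper's improved algorithm, whose details are not given. Two short FIR responses replace the Butterworth systems so that the example depends only on NumPy, and all names and coefficient values are assumptions.

```python
import numpy as np

def rls_identify(x, d, num_taps, lam=0.98, delta=0.01):
    """Standard RLS with forgetting factor lam (0 < lam <= 1)."""
    w = np.zeros(num_taps)
    P = np.eye(num_taps) / delta           # inverse correlation estimate
    errors = np.zeros(len(x))
    for n in range(num_taps - 1, len(x)):
        x_n = x[n - num_taps + 1:n + 1][::-1]
        Px = P @ x_n
        k = Px / (lam + x_n @ Px)          # gain vector
        e = d[n] - w @ x_n                 # a priori error
        w = w + k * e
        P = (P - np.outer(k, Px)) / lam    # P is symmetric, so x^T P = (P x)^T
        errors[n] = e
    return w, errors

rng = np.random.default_rng(1)
N = 1000
x = rng.standard_normal(N)                 # zero mean, unit variance input
v = 0.1 * rng.standard_normal(N)           # channel noise, standard deviation 0.1
h_before = np.array([0.6, 0.3, -0.2])      # hypothetical system, samples 0..499
h_after = np.array([0.2, -0.4, 0.5])       # hypothetical system, samples 500..999
d = np.empty(N)
d[:500] = np.convolve(x, h_before)[:500]
d[500:] = np.convolve(x, h_after)[500:N]
d += v

w_final, errors = rls_identify(x, d, num_taps=3, lam=0.98)
print("weights after the change:", np.round(w_final, 3))
```

Lowering `lam` shortens the filter's memory, so it recovers faster after the change at the cost of noisier weights in steady state; averaging the squared error over many independent runs, as the text does with 200, gives the learning curve.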

Fig. 3. Steady-state performance of the algorithm before and after improvement at a signal-to-noise ratio of 10 dB.
To show the absorbing ability of the two coatings more intuitively, the 2-D contour absorption characteristics and the three-dimensional wave-absorbing effects are described respectively. Fig. 4 shows that the high-absorption intensity distribution of design scheme 1 is clearly better than that of design scheme 2. The range in which the reflection loss of design scheme 2 is below −10 dB (the contour region marked by the arrow in the figure) is clearly wider than that of design scheme 1, especially in the 2∼3 GHz range. These observations are consistent with the analysis results in the table above. The reflection loss peaks of the two design schemes reach −55.82 dB and −60.49 dB respectively. In addition, both schemes absorb the TM mode clearly better than the TE mode. To explain this, design scheme 1 is taken as an example. Fig. 4 shows how the reflection loss varies with the incident angle for the different polarization modes at a specific frequency. In the TE mode, the reflection loss increases monotonically with the angle for both 8 GHz and 12 GHz waves and is close to 0 at 89 degrees. In the TM mode, the reflection loss decreases monotonically to a peak as the angle increases and then rises slowly until it is close to 0. The peak position is known as the Brewster angle, which exists only in the TM mode and acts as an energy trap for electromagnetic waves.

Fig. 4. Relationship between the reflection loss and the angle of incidence under different polarization waves.

Fig. 5. Network traffic anomaly detection under different dimensions.
Fig. 6 shows that the algorithm proposed in this paper tracks very quickly. With the traditional algorithm in its optimal state, the improved algorithm e1 still converges before the traditional algorithm e2 at the start. After a sudden change in the system, the improved algorithm converges again within a very small number of iterations, clearly faster than the traditional algorithm, which demonstrates its strong tracking ability. The improved RLS algorithm thus has a good initial convergence rate and can also detect harmonic components accurately.
The collected data file is inspected with Wireshark, where the parameters of the network traffic in the data set can be viewed directly. Network traffic is captured over a period of time according to packet timestamps, and the data from different data sets are merged into new data sets for experimental testing. Fig. 6 shows how the number of IP packets changes before and after the attack, with the number of IP packets on the vertical axis and the IP address on the horizontal axis, that is, the relationship between the destination IP address and the number of IP packets. The distribution of the data changes significantly, and this change in distribution is the basis for judging that an anomaly has occurred, which demonstrates the effectiveness of anomaly detection based on IP packet entropy. A worm attack must be detected from its own characteristics: to spread successfully, a scanning worm has to scan the source port numbers of the data. The distribution of source port numbers and its change can therefore also be used for anomaly detection.
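The idea of judging an anomaly from a change in the distribution of destination IP addresses can be sketched with Shannon entropy. The addresses, packet counts and decision threshold below are made up for illustration, and the text does not state how the paper computes or thresholds its entropy statistic. The same function could be applied to source port numbers.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def entropy_shift(baseline, current):
    """Change in entropy between a baseline window and the current window."""
    return shannon_entropy(current) - shannon_entropy(baseline)

# Hypothetical windows of destination IPs, one entry per packet.
normal = [f"10.0.0.{i}" for i in range(8) for _ in range(10)]
attack = normal + ["10.0.0.3"] * 720     # traffic concentrates on one host

h_normal = shannon_entropy(normal)
h_attack = shannon_entropy(attack)
THRESHOLD_BITS = 1.0                     # assumed decision threshold
is_anomalous = abs(entropy_shift(normal, attack)) > THRESHOLD_BITS
print(h_normal, h_attack, is_anomalous)
```

A flood aimed at one destination lowers the entropy, while a scanning worm that touches many addresses or ports tends to raise it, so the sketch tests the magnitude of the shift rather than its sign.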

Fig. 6. Learning curves of different algorithms when the signal-to-noise ratio is 10 dB.
In recent years the network environment has become increasingly complex and network attacks increasingly hidden. Users need to strengthen their own awareness of protection, and network security personnel need to maintain and supervise network traffic and detect abnormal traffic to keep users safe on the Internet. Research on network traffic anomaly detection is therefore of great significance. This paper studies a network traffic anomaly detection algorithm based on the Mahout classifier. Because the data volume, training site and deployment site of a classifier differ, a model built on the original training set does not detect new data well, and current training also consumes a large amount of resources. This paper therefore adopts a training method that responds to anomalies in real time. The training set is retrained only when an anomaly is added: the detection window is shifted to the anomalous point, the first detection point is removed and the applicable detection point is added. Experimental results show that this method improves the adaptability and accuracy of the training set while reducing material and resource consumption.
