Abstract
For the unsupervised learning based clustering algorithm, the intrusion detection rate is low, and the training sample based on supervised learning clustering algorithm is insufficient. A semi-supervised kernel fuzzy C-means clustering algorithm based on artificial fish swarm optimization (AFSA-KFCM) is proposed. Firstly, the kernel function is used to change the distance function in the traditional semi-supervised fuzzy C-means clustering algorithm to define a new objective function, thus improving the probabilistic constraints of the fuzzy C-means algorithm. Then, the artificial fish swarm algorithm with strong global optimization ability is used to improve the KFCM sensitivity to the initial cluster center and easy to fall into the local extremum, thus improving the convergence speed and improving the classification effect. The test results in the Wine and IRIS public datasets show that the AFSA-KFCM clustering algorithm is superior to the traditional algorithm in clustering accuracy and time efficiency. At the same time, the experimental results in KDDCUP99 experimental data show that the algorithm can obtain the ideal detection rate and false detection rate in intrusion detection.
Keywords
Introduction
With the rapid development of the Internet, the various industries in the society are completely inseparable from the Internet. People are increasingly dependent on it, making the security of computer networks even more important. However, while the Internet has brought us convenience, it has also exposed its vulnerability, because security vulnerabilities are ubiquitous in network protocols. At the same time, the network attack behavior is becoming more and more complex, and it is necessary to perform multiple levels of protection on the network to be effective. At present, traditional security measures such as data encryption, firewall and intrusion detection cannot effectively solve multiple security problems [1–4]. Traditional firewalls can only statically restrict access, lack of mobility and real-time responsiveness.
In response to the above problems, the concept of Intrusion Detection System (IDS) [3] was proposed. It is connected in series on the network. When a network attack occurs, IDS will automatically detect the intrusion attack and judge and alarm the intrusion behavior [5,6, 5,6]. In recent years, the application of artificial intelligence and machine learning technology in IDS has become a hot spot, effectively achieving the high detection rate and low false positive rate of IDS [7]. For example, Wang Yucheng proposed an abnormal intrusion detection algorithm against Web attacks based on K-Nearest Neighbor (KNN) clustering algorithm [8]. The KNN clustering analysis algorithm in machine learning is used to cluster the data to detect whether there is abnormal data. Zhang Xia [9] proposed a network intrusion detection method based on machine learning algorithm. At present, the mainstream of machine learning is divided into supervised learning, unsupervised learning, semi-supervised and intensive learning [10]. Zhao Jianhua et al. [11] applied supervised SOM neural networks in intrusion detection. Lin Dongmao et al. [12] proposed a network intrusion detection algorithm based on unsupervised immune optimization layering. The data is learned in the immune network, the data is compressed by a small-scale network, the identification characteristics of the data are enhanced, and the network is analyzed by a hierarchical clustering method. However, both supervised learning and unsupervised learning have major disadvantages: 1) Supervised learning is not very good for new types of attacks, which is unrealistic for dynamic networks; 2) Unsupervised learning is in the field of intrusion detection. There are also related applications, but if the mixed attack type is mistaken as a normal type during its learning, this will result in a high false positive rate in the later learning process. Therefore, this paper chooses fuzzy C-means (FCM) clustering in semi-supervised clustering as the main means to achieve network intrusion detection.
However, the FCM clustering algorithm is an iterative descent algorithm with good local search ability and not easy to converge to global optimality. As a biomimetic technique that artificially imitates the living habits of animal groups, the group intelligent optimization algorithm has a great application prospect in data mining. The group intelligent optimization algorithm has strong global optimization ability. Therefore, in order to improve the problems of FCM clustering algorithm, the group intelligent optimization algorithm is introduced into fuzzy clustering. For example, Nguyen D D et al. [13] introduced the genetic algorithm into the FCM clustering algorithm to obtain the GA-FCM algorithm, and used the genetic algorithm to optimize the initial clustering center of FCM. On this basis, DingY et al. [14] proposed a hybrid GA-KFCM method based on the kernel functions FCM and GA. The artificial fish swarm algorithm is a new global optimization algorithm proposed by Li Xiaolei et al. in 2002 on the basis of intelligent behavior research of animal groups [15,16, 15,16]. As an intelligent optimization algorithm that imitates fish foraging, the artificial fish swarm algorithm can make the artificial fish jump out of the local optimal solution in the process of optimization, and achieve the global optimization result.
Therefore, a semi-supervised kernel fuzzy C-means (AFSA-KFCM) clustering algorithm based on artificial fish swarm optimization is proposed and tested on the KDD Cup 99 network intrusion detection database. Firstly, the Gaussian kernel function is used to change the Euclidean distance in the FCM algorithm to define the objective function, and the probabilistic constraints in the FCM algorithm are changed, so that the robustness of the algorithm to noise and isolated points is enhanced. Then, the artificial fish swarm algorithm with strong global optimization ability is used to improve the KFCM sensitivity to the initial cluster center and easy to fall into the local extremum, thus improving the convergence speed and improving the classification effect. The test results verify the applicability and advancement of the proposed algorithm.
Semi-supervised clustering learning
The main function of intrusion detection is to divide the data packets in the network into normal and abnormal classes. Therefore, its essence is a problem of pattern classification. The goal of machine learning based on cluster analysis is to divide the data to divide the data with higher similarity into the same group and divide the data with less similarity or similarity into the same group.
Principle of cluster analysis
Through the introduction of the appeal, it can be seen that the essence of cluster analysis is a multi-dimensional sample classification problem. In order to classify samples, it is necessary to determine the degree of similarity between data samples, and no parameters such as attributes, quantities, etc. are required. The determination of this degree of similarity is to introduce the concept of “distance". Let X ={ x1, x2, . . . , x
p
} and Y ={ y1, y2, . . . , y
p
} be two sample data with dimension p, then the absolute distance is calculated as shown in formula (1) [17].
The similarity coefficient between the i-th sample and the j-th sample can be obtained from formula (2).
The correlation coefficient between the i-th indicator and the j-th index can be obtained from formula (3).
Where
The clustering algorithm based on unsupervised learning has low intrusion detection rate, while the clustering algorithm based on supervised learning has insufficient training samples. Therefore, semi-supervised learning is introduced into the intrusion detection algorithm, and the fuzzy clustering analysis method is adopted. In the data model of fuzzy clustering analysis, the definition of the original data matrix is as follows:
In order to divide the features of all the objects m to group the data sets, the expression defining the data objects is as shown in formula (5).
Where, x i = { xi1, xi2, . . . , x im } = 1, 2, . . . , n.
Let the tag data and the unknown tag data be represented in the form of a matrix:
The objective function of semi-supervised fuzzy clustering is defined as:
Where n u is the number of unknown tag data samples.
The fuzzy clustering analysis algorithm can be roughly divided into four types [18, 19]: 1) pedigree clustering method, 2) clustering method based on equivalence relation, 3) graph theory clustering method and 4) objective function based clustering method. Because the objective cluster-based fuzzy clustering method uses traditional nonlinear programming principles to deal with data mining problems, it is simple to implement and has attracted the attention of most research institutions and scientists. The FCM clustering algorithm belongs to the objective function-based method and exhibits excellent performance in the face of a large amount of data. Therefore, the research object of this paper is the FCM clustering algorithm.
The objective function of the traditional FCM clustering algorithm uses the Euclidean distance as the measure, which results in its poor classification ability and cannot effectively deal with the nonlinear sample. Therefore, Hu D et al. [20] proposed to introduce the function kernel function satisfying the Mercer condition into cluster analysis, and achieve more accurate clustering effect through nonlinear mapping. The principle is to use a nonlinear transformation to map the sample set into the high-dimensional space F, then the objective function J (U, V) of the fuzzy kernel function clustering is:
Where, the integer c (2 ≤ c ≤ n) is the number to be classified, the set X = {x1, x2, . . . , x
n
} is the set of samples, n is the number of cluster samples, V = {v1, v2, . . . , v
c
} is the cluster center of the sample set, Φ (·) is the nonlinear transform function, Φ (x
k
) is the sample in the feature space F In the mapping, Φ (v
i
) is the mapping of the cluster center in the feature space F,
It can be seen from the above analysis that the FCM algorithm is only suitable for processing spherical clusters and has poor adaptability to other data types. In order to solve this problem, people introduce a kernel function in such an algorithm, and propose a KFCM algorithm based on kernel function. Therefore, using the Gaussian kernel function to change the Euclidean distance in the FCM algorithm to define the objective function, that is, by replacing the original Euclidean distance ∥x
k
- v
i
∥ with ∥Φ (x
k
) - Φ (v
i
)∥, the probabilistic constraints in the FCM algorithm are changed, so that it is robust to noise and isolated points
Taking the minimum value of the objective function J (U, V) as the clustering criterion, the partial derivative of u
ik
and Φ (v
i
) is zero, as follows:
Among them, the kernel function selects the Gaussian kernel function that satisfies the Mercer condition, namely:
Where σ is the scale parameter. Therefore, equation (9) can be written as:
Substituting the formula (13) into the formula (10) and the formula (11), we can get the iterative way of U, V:
Equation (14) is the expression of the membership matrix. According to the kernel function K (x
i
, x
j
) =〈 Φ (x
i
) , Φ (x
j
) 〉, combined with the formula (15), we can deduce:
Where K (x k , x j ) can be calculated by formula (12).
If ∥U new - U old ∥ < ɛ, the iteration is terminated, and the required cluster center v i and membership matrix u ik are obtained, otherwise, the formula (7) is continued to update the membership matrix u ik . ɛ > 0 is the convergence precision.
Artificial Fish Swarm Algorithm
The artificial fish swarm algorithm [16] (AFSA) is used as an optimization algorithm to introduce animal owners. The so-called animal locality is an entity that simulates animals or autonomous robots and can generate adaptive intelligent behavior in complex environments. The AFSA uses a bottom-up design pattern with faster convergence. The basic idea of AFSA is: Suppose that in a fish pond, fish is looking for food by looking for it in the fish pond or following other fish. AFSA uses this model to simulate fish. Foraging behavior, cluster behavior and rear-end behavior in order to achieve the purpose of excellence. Assume that AF is a virtual entity of real fish, called artificial fish, and its visual concept diagram is shown in Fig. 1.

Artificial fish visual concept map.
In Fig. 1, X represents the current state of AF, visual is its field of view, and Xv is the position of the viewpoint at a certain moment. If the state of this position is better than the current state, AF moves to Xnext; otherwise AF finds other positions. The main parameters in the algorithm are: visual, step, congestion factor λ, try_number. The four basic behaviors of artificial fish schools include: 1) foraging behavior, 2) cluster behavior, 3) rear-end behavior, and 4) random behavior.
In the proposed AFSA-KFCM algorithm, encoding is performed using an encoding format based on the initial cluster center, that is, n initial cluster centers constitute the position of each artificial fish, and s is the dimension of the sample vector. The position of the artificial fish is a n × s-dimensional variable, so the position code of the artificial fish is as follows:
Where c ij , j = 1, 2, …, s represents the cluster center represented by each individual in the population.
The fitness function is used to evaluate the pros and cons of individuals in the population and to guide the search. In the KFCM algorithm, the objective function f of an individual in a population is derived from equation (16). According to the idea of FCM, the smaller the value of the objective function is, the more obvious the clustering effect is, that is, the better the classification effect. Thus, the fitness function of an artificial fish group can be defined by means of this objective function:
As mentioned above, the KFCM clustering algorithm is solved using the objective function optimization to the minimum value. However, such a method tends to cause the final solution to fall into a local minimum point and the convergence speed is slow. As a swarm intelligence algorithm, artificial fish schools have strong global search characteristics. Therefore, this paper proposes an effective combination of AFSA and KFCM clustering algorithm to improve clustering effect and optimization precision.
The specific implementation steps of the AFSA-KFCM clustering algorithm are as follows:
Step 1: Initialization: set the parameters in the KFCM, the fuzzy index m, the convergence precision, the number of clusters c; initialize the fish population size N, and randomly initialize the membership matrix;
Step 2: Calculate the cluster center of each type according to formula (6), as the code of the initial artificial fish, repeat N times and initialize N fish. Calculate the food concentration value of each type of artificial fish, and record the artificial fish with the largest food concentration value on the bulletin board. Initialize the parameters of the artificial fish.
Step 3: Simulate the behaviors of foraging, clustering, and rear-end according to the AFSA algorithm. According to formula (18), the fitness function values obtained after simulating the execution of each behavior of the artificial fish are calculated and compared, and the behavior with the largest value is selected and the behavior is actually performed.
Step 4: Perform an optimal individual retention strategy for individual behavior of artificial fish;
Step 5: Update the bulletin board record with the artificial fish with the highest food concentration value.
Step 6: Repeat steps 3) ∼4) until the fishnet algorithm is reached at the maximum number of iterations, and the optimal value is recorded in the output bulletin board.
Step 7: After the above steps are completed, the global optimal solution obtained by the AFSA algorithm is used as the initial clustering center, and the KFCM algorithm is called.
Simulation experiment results
Experimental environment configuration and data set
The operating environment of all the experiments in this paper is MATLAB R2015b under Window7 system, Intel(R) Core(TM) i5-3210M CPU @ 2.50 GHz, and 4GB of memory. In the experimental process of this paper, set the maximum number of iterations = 200, the number of clusters = 3, the fuzzy index m = 2, σ = 0.8. The number of artificial fish = 60, visual = 6, step = 0.4, ɛ = 0.001, ɛ = 0.8. The four algorithms were repeated 50 times and the average of each indicator was calculated.
Clustering effect analysis
In order to verify the effectiveness and advancement of the AFSA-KFCM clustering algorithm, simulation experiments were performed using Iris standard data and Wine datasets from the UCI machine learning database (http://archive.ics.uci.edu/ml/datasets/). The UCI database is a database for machine learning proposed by the University of California Irvine, which currently has 335 datasets. A comparative analysis of FCM [12], KFCM [20], GA-KFCM [14] and AFSA-KFCM was performed on the Iris dataset and the Wine dataset. The parameters of the two sets of experimental data sets are shown in Table 1. The clustering accuracy and running time of the four algorithms on different data sets are shown in Tables 2 and 3, respectively.
It can be seen from Tables 2 and 3 that the clustering effect of the FCM algorithm and the KFCM algorithm is relatively poor due to the greater sensitivity to the initial value. The GA-FCM algorithm utilizes the advantages of the group intelligence optimization algorithm to improve the local search ability, optimize the clustering effect, and improve the accuracy. The AFSA-KFCM algorithm is significantly higher than the FCM, KFCM, and GA-FCM algorithms in clustering accuracy. Due to the addition of the kernel function and the optimization of the AFSA algorithm, the time consumption of the AFSA-KFCM algorithm is slightly higher than that of the FCM and KFCM algorithms, but the time to achieve clustering is reduced compared with the GA-FCM algorithm.
All parameters of the 2 experimental datasets
All parameters of the 2 experimental datasets
Experimental results on the Iris dataset
Experimental results on the Wine dataset
The intrusion detection experiment used the KDDCup 99 dataset [21], which collected 9 weeks of raw network connection data, including a total of 4 attack types (DOS, Probing, R2 L, and U2 R). The performance of the intrusion detection algorithm mainly uses the detection rate (DR). Through this indicator, the detection effect of the algorithm can be well measured.
The intrusion detection rates of the four algorithms for the four attack types are shown in Fig. 2(a), Fig. 2(b), Fig. 2(c), and Fig. 2(d), respectively. It can be seen that the AFSA-KFCM clustering algorithm proposed in this paper has the highest detection rate results for DoS, U2R and Probe, especially DoS. This is because the kernel function is introduced into the FCM clustering analysis, and the nonlinear mapping is used to improve the adaptability to multiple data types. At the same time, the problem of KFCM falling into local minimum is solved by AFSA.

Intrusion Detection Rate for 4 Attack Types.
This paper proposes an AFSA-KFCM clustering algorithm and its application in network intrusion detection. The kernel function is used to change the distance function in the traditional semi-supervised FCM clustering algorithm to define a new objective function. AFSA is used to improve the KFCM’s sensitivity to the initial clustering center and easy to fall into local extremum, thus improving the convergence speed and improving the classification effect. The test results in the UCI public data set verify the accuracy and efficiency of the AFSA-KFCM clustering algorithm. At the same time, a higher detection rate is obtained in the KDDCUP99 intrusion detection data set.
