Abstract
The semi-supervised learning (SSL) problems are often solved by graph based algorithms, semi-definite programmings etc. These methods always require high space complexities, and thus are not efficient for network intrusion detection systems. Apart from the space complexity challenge, a network intrusion detection system should be able to handle the distribution drifting of data flow as well. A common solution for this concept drift problem is by SSL. In this paper, an incremental SSL training framework is proposed to combine the low space complexity advantage of topology learning and SSL for network intrusion detection. First, the unsupervised self-organizing incremental neural network is extended to process labeled and unlabeled information incrementally. Second, a kernel function is constructed from the training results of the previous step. Finally, a kernel machine is trained with the constructed kernel function. The proposed method reduces the space complexity of SSL to the magnitude similar to supervised learning. The experiments are carried out on the NSL-KDD datasets, and the results show that the proposed method outperforms the mainstream methods such as Transductive Support Vector Machine and Label Propagation.
Keywords
Introduction
The network intrusion detection problem has been a prevailing research topic in the last decade. The intrusion detection task is special in at least three ways. First, the data volume in the network is large and is still growing with the expansion of the Internet. Second, the distribution of the data is constantly changing, thus requiring better generalization abilities from the machine learning algorithms used in intrusion detection. Third, the classification problem in network intrusion detection is nonlinear. In general, it brings several challenges in machine learning including nonlinear separation, incremental learning and semi-supervised learning altogether.
Based on the availability of labeled data, intrusion detection is divided into four major solutions, namely the unsupervised, supervised, semi-supervised and one-class classification based intrusion detections. Unsupervised detection methods attempt to discover undesired patterns without any labeled information. It is usually implemented with unsupervised clustering methods. Supervised detection employs traditional supervised classification methods to learn models on the labeled dataset. Semi-supervised detection is based largely on semi-supervised learning (SSL), which employs information from both labeled and unlabeled data to train classifiers. In one class classification based intrusion detection, the labeled data is available without any intrusion samples. There are also hybrid implementations that combine supervised learning and one-class learning [19, 32]. But an IDS that combine SSL and other machine learning setting is not fully explored.
While supervised learning based IDSes are often limited by the quality of training datasets, one class classification based intrusion detection are widely used to discover unknown intrusions. One such example is implemented by one class support vector machines (OCSVM) [21]. The drawback of OCSVM is that normal data is assumed to be distributed within a spherical area, which might not always be true. Besides, it is incapable of incorporating information from unlabeled data. Therefore its performance depends largely on a sufficient training dataset. Other methods utilize the one class modeling principle such as [14] havn’t seriously considered the insufficiency of the normal samples.
Unsupervised and semi-supervised detection are widely used in IDSes. The unsupervised detection takes the clustering assumption [2], where data points in one cluster are assumed to belong to the same class. But this is potentially false and may cause over fitting problems. The SSL methods are devised so that by including unlabeled data in training to increase the classification accuracy. One of the drawbacks of SSL based detection is that it is only feasible when the data is distributed according to some assumptions. For example, the widely used Transductive Support Vector Machine (TSVM) [12] assumes that samples belonged to different classes are separated by a low density area in a high dimensional space. Graph based SSL such as [24] has the problem of low space and time efficiency. In fuzzy set based intrusion detections [5] and entropy based methods [10], the previous drawbacks are remedied. Their drawback is that they are not online thus not being able to adapt to concept drifts.
The online learning methods can learn without storing data, and can be divided into different categories by the solutions of non-linear separation. In the well known Support Vector Machine [6], the kernel functions are employed. Another solution is lazy learning, i.e. to represent data with samples in the same space to the input. Competitive learning neural networks are common implementations of online lazy learning. There are two kinds of classic competitive learning neural networks, namely the Self-Organizing Map (SOM) [20] and Growing Neural Gas (GNG) [7]. The main drawback of SOM is its compromised performance to learn complex topology without a proper predefined graph. GNG can learn arbitrary topology structure, however the weakness is its endless growing of neuron population, which make it less suitable for large data streams. In the Self-Organizing Incremental Neural Network (SOINN) [8] and enhanced SOINN (ESOINN) [9], arbitrary topology structure can be learned without suffering from endless growing of neurons.
There are a number of researches which integrate SSL and online learning for intrusion detection [22], most of which are based on SOM [11, 31], where SOM is used as a clustering method. Then clustering centers are labeled for constructing a nearest neighbor classifier. The weak points of SOM based SSL are that labeling of clustering centers is after the online training and potentially offline. Moreover, the limitation of SOM for learning complex topology structure is inherited. Semi-supervised GNG [23] is proposed for strictly online construction of nearest neighbor classifiers. The drawback of GNG based SSL is that the endless growing neurons problem of GNG is not remedied. Online SSL based on ESOINN is proposed in [17, 26], in which the sample count of learning result is reduced. However, the drawback of this method is it works only on convex shapes.
In this paper, we propose a semi-supervised incremental method that combines incremental learning, nonlinearity modeling and SSL. It is able to consolidate information from labeled and unlabeled data to increase classification accuracy, and able to update learned model on new data to increase efficiency. Our key contributions are listed as below. A knowledge reuse framework for ESOINN is proposed in the form of kernel function construction. An incremental semi-supervised kernel function construction method based on ESOINN is proposed to enhance the generalization abilities of SVMs. Combining the proposed framework and SVM, the algorithm performs SSL with the same space efficiency to a supervised SVM. A extension of ESOINN to processing large volumes of labeled data and unlabeled data called mixture SOINN (MSOINN) is proposed. Moreover, MSOINN can process the labeled and unlabeled data both incrementally.
Experiments are carried out on the NSL-KDD [27] dataset. Comparison results on the NSL-KDD dataset show that the proposed method has better generalization ability than OCSVM, Label Propagation and TSVM etc.
The rest of the paper is organized as follows. Section 2 gives an introduction to the embedding problem and self-organizing incremental neural networks. In Section 3 the framework of our proposed method is given. Section 4 is the simulation results and in Section 5 is the conclusion and future perspectives.
The embedding problem and topology learning neural networks
A brief introduction to linear embedding
Given a matrix of pairwise distances
For a new data item, its dimensionality reduction can be calculated by two steps. First, calculate [29]
Then the embedding is accomplished as
Assume a data set {X} with data points X (1) , X (2) , . . ., , the learning task of GNG [7] and single layered SOINN [26] is that after a single pass scan of the dataset to represent the data by neurons i with weights . The topology structure of the dataset is preserved in the graph comprised of edges connecting the neurons.
The single layered SOINN [26] learning algorithm is comprised of three steps. First, when a new data item is “fired” into the network, a decision step is run to calculate the necessity to insert a new neuron into the set
In step 1, each neuron controls a spherical area in the Euclidean space. The center of the sphere is the neuron itself, and the diameter is the vector to its farthest topological neighbor by metric of Euclidean distance or the smallest distance to its neighbors if it has no topological neighbor. When a new sample V arrives and it is not controlled by its nearest neighbor W
s
nor second nearest neighbor W
t
in
In step 2, first an edge is added to link W
s
and W
t
, if there’s not already one there. Then if in step 1 the input vector V is not controlled by the winner neurons, update W
s
and its topological neighbors towards the new sample as
Step 3 is run with the probability of 1/λ. It generates and refines the clustering result by assuming the convexity of data. Since the objective of the algorithm framework is not for clustering under convexity assumption, the final step in the original ESOINN algorithm is not adapted in our framework.
We present the semi-supervised intrusion detection framework as the following four steps. The framework is also illustrated in Fig. 1. Train the MSOINN neural network on the mixture of labeled and unlabeled data to obtain the weights of neurons and the edges representing topology structures. Run flow algorithms on the Euclidean graph from the result of neural network training to gain similarity measures between the neurons. Embed the neurons with the similarity measures calculated in the previous step. Construct the kernel function for calculation of similarity matrix of labeled samples.
The key idea of the proposed framework is to transform the learned SOINN results into a kernel function. Through this process, the ESOINN results can be reused to enhance the generalization abilities of kernel machines.
Semi-supervised extension to the single layered SOINN
In this section, we propose inverse competitive learning that takes into account labeled information for more generalized data representation learning than ESOINN. Comparing to the existing implementation [1], it can learn stable SSL presentations even given ill labeled information by incorporating self-organized learning into conflict processing.
The intuition behind competitive and inverse competitive learning is that when the label of W s (i.e. the nearest neighbor to the input sample V) is not the same as V’s, the winner weight vector and its topological neighbors are moved in an inverse direction from the current sample. The implementation details are as follows.
The proposed MSOINN is comprised of three steps as the ESOINN algorithm. Step 1 is the same as the unsupervised ESOINN while the most of the improvements are in step 2. In step 2, there are W
s
is labeled and labels of W
s
and V do not conflict. W
s
is labeled and labels of W
s
and V is not the same. W
s
is not labeled.
For situation 1, W s and its neighbors are towards V with the magnitude of . The updating of age and c is the same as ESOINN in this situation.
For situation 2, presence of V is designed to affect W
s
and W
t
. To ensure the effect of samples fall into the ‘gap’ area is not over weighted than the other samples, two weight values are defined as follows.
If V=W s =W t , then w1 = 0, w2 = 0. ζ ≥ 1 is the SSL parameter. After weights w1 and w2 are calculated, Then W s and W t are both updated with the magnitude of , respectively. The reason for this weighting strategy is that information from sample V with smaller distance to winner node is considered more credential. After that c and age will be updated according to the weights as c s = c s + w1, c t = c t + w2, ages,k = ages,k + w1 (k ∈ η s ), aget,k = aget,k + w2 (k ∈ η t ).
For situation 3, label the winner neuron with the new sample’s label and check for label conflicts between winner neuron and its neighbors and remove the edges causing conflicts. The rest of situation 3 is the same to situation 1.
In this section we elaborate our method of embedding neurons with graph similarity. Embedding data items with graph similarity is not new, for example in Isomap [28]. The drawback of graph similarity usage in Isomap is that the graph is constructed with all the data items in input space, and graph algorithms are often slow. In our case we only embed the neurons with graph similarity.
In order to calculate the graph similarity through flow network algorithms, weights of edges should first be assigned. We use the following choice of weight for an edge from the topology learning results
This is a logistic function with squared distance as the input. Other kernel functions such as the popular RBF function can be a choice as well, depending on the specific application settings. After that, the cut value between points can be calculated using algorithms such as Edmonds-Karp [4]. Store the results as a matrix
The embedding of neurons can be carried out by procedures introduced in section 3.1. That is, first calculate the inner product matrix
According to [13, 29], given a distance vector P to the neurons, the embedding can be calculated as the following two steps. The inner product vector H is calculated using equation (3). Then the embedding is obtained as
In order to embed an item from the input data space with the above procedure, similarities between an input point and the neurons must be calculated first. In order to stress the locality properties of Voronoi regions [7, 20] we choose the similarity measure specified by the following equation
Set P
i
= 1 and ∀j ≠ i, P
j
= 0 if S
i
is equal to 0. After that the inner product vector H can be calculated from P and the embedding of the input pattern can be carried out with the linear embedding procedures described in equation (3). However, if a kernel function is constructed to define the similarity between data items, the embedding results would not be necessary. The kernel function is constructed as
This kernel function can be used to train kernel machines, and in our work we chose the famous SVM because of its ability to find the optimal separation hyperplane.
It is possible for the kernel function in equation (16) to generate a indefinite matrix. Thought SVM has shown good performance in Minkowski space, to ensure stability of training kernel machines a semi-definite kernel matrix is often required. The “reflection” technique in [13] is employed to solve this problem. First, during training, a reflected inner product matrix is produces as which is semi-definite because all the signatures (i.e. diagonals along
As a result, the kernel function is constructed as
The complete algorithm for metric learning and embedding is given in Algorithm 1
1: Initialize set N with the first 2 samples from {S}
2:
3: Draw one input vector as V from {S}.
4: Find winner W s and second winner W t from neuron set N.
5: Calculate whether to insert new neuron by Delaunay triangulation principle.
6:
7: Add a neuron k with weight V to N, and set winning times c k = 0
8: .
9:
10: Perform negative competitive learning.
11: Increase the winning times as c s (t) = c s (t - 1) + ω.
12:
13: Perform positive competitive learning.
14: Increase the winning times as c s (t) = c s (t - 1) +1.
15:
16: Update age of edges as
17:
18: Remove the edge connecting m and n.
19:
20:
21:
22: Remove neurons with less than 1 neighbors.
23:
24:
25: Calculate the embeddings of neurons with equation (1) with parameter l.
26: Output the kernel function defined by equation (17)
The experiments are carried out on the NSL-KDD [27] dataset. The dataset is comprised of three parts, namely the training dataset, the test dataset and a ‘test21’ dataset which contains a small number of samples that are hard to classify. The proposed method is implemented with numpy [3]. Other methods for comparison include two supervised learning methods, a one class learning methods and two SSL methods. The SSL method TSVM is implemented by Svmlight [16], and it is trained with an RBF kernel. Label propagation, and Support Vector Machine are implemented with Sklearn [25]. The supervised methods including and Adaboost are carried out by Weka [15]. The reason for choosing Sequential Minimal Optimization (SMO) and Adaboost it because of their high efficiency and they are already used in intrusion detection researches. The supervised methods provide baselines for the false alarm rate. Label propagation and TSVM are the mainstream SSL methods. Besides, OCSVM is another baseline method which is often used in outlier detection applications including intrusion detection. The parameters are selected on the training dataset with cross validation and grid search. The experiments are carried out on an computer with 8G RAM and a 64bit OS. The comparison results are listed in Table 1 and 2.
On the test dataset (in Table 1) where the unlabeled dataset is abundant, the proposed method shows significant advantage over other methods (17.9% better than TSVM in detection rate), and is proven to have better generalization abilities on large amount of unlabeled datasets. Comparing to the other baseline method OCSVM, the proposed method shows better stability when the labeled information is not sufficient. Especially when the labeled dataset size is less than 251, the OCSVM method is unacceptable because of high false alarm rate. This is because when the labeled dataset is not sufficient, the assumption of OCSVM that normal samples reside in a spherical area dose not hold true, while the proposed method dose not take any assumption of the dataset. The label propagation failed the experiment when the dataset is large due to its high space complexity, and the proposed method is constant in space complexity. The average training time of the proposed method is below 15 seconds and TSVM usually requires several hours to finish training. In summery, the experiments on the NSL-KDD test dataset prove the better generalization ability of the proposed method, and the improved space complexity. On the test21 dataset, the overall trend is similar but is more chaotic. This is because that the distribution of data is skewed. One evidence is that the OSCSVM can generate a false alarm rate as low as 14.9% with a labeled dataset with only 2 samples. This means that the normal dataset is highly concentrated in a small area. Despite this distribution skewing, the proposed method is able to outperform the existing supervised and SSL methods. In summery, the experiments on the NSL-KDD dataset can prove the ability to learn accuracy classifies under concept drift with the proposed method.
To demonstrate the necessity of incorporating dimensionality reduction techniques as described in the proposed framework. The proposed framework combining the semi-supervised MSOINN and linear embedding shows an improvement over MSOINN in most cases. This is because of the effect of dimensionality reduction in the kernel construction process. In summery, the proposed method is a space effective alternative to TSVM. In the area of intrusion detection, the proposed method provides low space complexity, and requires only a small number of labeled samples.
Conclusion
In this paper, a semi-supervised learning method based on incremental kernel construction is proposed. First, an SSL extension to ESOINN is proposed to process the mixture of labeled and unlabeled data. Second, a knowledge reuse framework is proposed to construct a kernel function utilizing the knowledge stored in the trained neural network. Finally, we use the kernel function to train SVMs and apply it to intrusion detection. The proposed method shows significant advantage over supervised learning methods and TSVM, and it is more stable than OCSVM when the labeled dataset is small. Though not tested by experiments, the proposed framework can be used in multi class SSL problems. Besides, the proposed framework should work with other topology algorithms and applications. For example, it can be used in combine of GNG providing that semi-supervised extensions are made. These algorithmic and application extensions are to be explored in future works.
Footnotes
Acknowledgment
This work was partly supported by National Natural Science Foundations of China (No. 61301148 and No. 61272061), the fundamental research funds for the central universities of China (No. 531107040263, 531107040276), Hunan Natural Science Foundation of China.
