Abstract
This paper presents a cascade of ensemble-based artificial neural network for multi-class intrusion detection (CANID) in computer network traffic. The proposed system learns a number of neural-networks connected as a cascade with each network trained using a small sample of training examples. The proposed cascade structure uses the trained neural network as a filter to partition the training data and hence a relatively small sample of training examples are used along with a boosting-based learning algorithm to learn an optimal set of neural network parameters for each successive partition. The performance of the proposed approach is evaluated and compared on the standard KDD CUP 1999 dataset as well as a very recent dataset, UNSW-NB15, composed of contemporary synthesized attack activities. Experimental results show that our proposed approach can efficiently detect various types of cyber attacks in computer networks.
Introduction
With the exponential growth in the number of online services, the number of computing devices and users connected to computer networks and the Internet has significantly increased. In parallel, there has been a myriad of associated security vulnerabilities and intrusive behaviors causing disclosure of sensitive information, disruption and unauthorized access of online services and systems, and unauthorized access to restricted resources. Some common forms of attacks include the scanning of port, probing for information,injecting viruses and worms, and DoS [1].
The ever-increasing types of attacks demands for a highly effective intrusion detection system in place that can not only detect well documented forms of intrusions but can also learn to detect new attack forms. Classical intrusion detection systems are commonly divided into signature-based systems, anomaly-based systems or a hybrid of the two types. Signature based systems detect intrusion by monitoring system usage patterns and comparing these patterns against well known signatures of misuses. Such intrusion detection systems require a frequent update of intrusion signature database to be effective. Anomaly based systems uses the notion of normal behavior verses intrusion and a classification algorithm is used to mark any activity as normal or intrusion. In contrast, hybrid approaches combine both signature-based and behavior-based systems. Recently, Such systems have received much attention due to their generalization characteristics [3]. Examples of these techniques include, amongst others, artificial neural networks (ANNs), principal component analysis (PCA), support vector machines (SVMs), and nearest neighbor (k-NN) classifiers [4–8].
Although supervised learning-based techniques such as SVMs, ANNs, decision trees, and ensemble learning have been used in the past to create IDS, they fail in detecting the infrequently occurring attacks or have a very high false positive rate. Moreover, due to the availability of large training data, the learning time and memory requirements of these algorithms is very large. To overcome these difficulties a novel filtering mechanism is introduced that when used along with boosting-based neural network learning results into effective intrusion detection system without requiring the processing of very large sparse matrices.
Our main observation regarding the performance of multi-class learning algorithms is that for skewed datasets such algorithms totally ignore the the sparsely represented classes and learn a solution that favors only the most dominant classes at the expensive of totally ignoring other classes. Such algorithms still attains a high overall accuracy due to the dominance of few classes in the overall distribution. The well cited network intrusion dataset from the KDD-CUP 99 [9] is an example of such learning problems which contains three classes constituting about 98% of training data whereas the remaining classes representing the less common intrusions constitute only about 2% of the total data.
The remainder of this paper is organized as follows. Section 2 describes the boosting-based neural network learning algorithm, the novel example filtering mechanism and the cascade structure for detecting intrusion. A detailed description of the adopted datasets, experimental settings to evaluate the proposed approach, and the corresponding results is provided in Section 3. Finally, Section 4 concludes the paper with discussions and highlights for some future directions.
Proposed method
This section begins with a brief review of boosting-based ANN learning [10, 11] that can be used to learn weights of a given neural network using AdaBoost. This review is followed by the presentation of a cascade structure and an associated example filtering mechanism used to learn an effective multi-class classifier by combining several binary classifiers connected as a decision tree or cascade. The cascade structure is a generalization of one-vs-remaining encoding strategy of building a multi-class classifier by combining several binary classifiers in the form of a tree structure. The cascade is designed greedily by selecting the most discriminative partition of the set of classes and a classifier is then trained using a binary encoding of the partition. Once the classifier has been trained, the training data is divided into two parts by using the labels assigned to the examples by the trained classifier. Each partition, hopefully, contains examples belonging to a subset of the total classes and hence the learning problem is divided into two smaller multi-class learning problems. This procedure is repeated for the two sub-problems and hence a multi-class decision cascade structure is learned.
AdaBoost based neural network learning
AdaBoost [12] is one of the most successful boosting algorithms that constructs a highly accurate classifier ensemble of moderately accurate classifier instances called base learners. It takes n labeled training examples, , as input and iteratively selects T classifiers, h
t
, by maintaining an adaptive weight distribution over the training examples. AdaBoost constructs the final ensemble by taking a linear combination of the selected classifiers using:
Single-node decision trees, commonly referred to as stumps, have been frequently used as base classifiers in AdaBoost [13, 14]. Baig et al. [10] introduced a new representation of a decision stump using homogeneous coordinate to learn weights of a single layer perceptron. Their method, called Boostron, represents a decision stump as a product of a weight vector and an extended feature vector given as . This new representation of stumps when used with the confidence rated version of AdaBoost [13] learns weights of a linear classifier equivalent to a Perceptron.
The Boostron algorithm has been extended further to learn parameters of an ANN with a single hidden layer and a single output neuron [11] such as the one shown in Fig. 1. The extension of Boostron uses a transformed set of examples and a layer-wise iterative traversal of neurons in the network to fine tune its parameters. It uses two simple problem reductions: Learning an output neuron is reduced to that of Perceptron learning, and Learning a hidden neuron is reduced to that of Perceptron learning,
to learn weights of the hidden neurons and the output neuron by the use of Boostron algorithm during an iterative traversal of the neurons in a given feed-forward ANN. The main algorithm by Baig et al. [11] that uses these reductions along with the Boostron algorithm is shown as Algorithm 1.
is a training instance and y i ∈ {-1, + 1} the corresponding class label
P is the number of iterations over ANN layers
Randomly initialize all weights in the range (0 1)
Randomly assign features to each hidden neuron.
Compute transformed training examples where
Use the Boostron algorithm and the transformed training examples, , to learn the weights of H j where k = 0, 1, …, m0
Obtain weights of of hidden neuron H j
Compute transformed training examples
where
Use the Boostron algorithm and the training examples , to learn the weights of the output neuron O1
Output the learned ANN weights
The boosting-based ANN learning method can learn effective network parameters for a given problem but, like many other supervised learning algorithms, it also suffers from the drawback of overfitting, especially in case of very skewed distribution of training examples typically found in an IDS having a large number of classes. To overcome such limitations, a novel cascade structure and an example filtering method is introduced, which can be used to handle learning in case of a large number of classes and handle skewed distribution of trainingexamples.
The construction of the proposed cascade is based on the observation that it might be possible to partition the classes so that the training examples belonging to classes in one part of the partite can be accurately discriminated from the classes in other parts. Once we have found such a partition the procedure can be recursively used to divide the resulting problems into even smaller subproblems containing less classes and finally combine these solutions to form a classifier. The the 23-class KDD-cup intrusion detection dataset, used in our experiments, an accurate classifier can be obtained to separate the instance of a single dominant class remain in a partition, or the number of examples reaching a partition are less than a predefined threshold.
is a training instance and y i ∈ 1, 2, …, K are labels, and l is the number of partitions to use
1:
2: Label the leaf node with the dominating class in the training data
3: return
4:
5:
6: Create a binary classification problem by relabeling y i ∈ P j as +1 or -1.
7: Learn a binary classifier B using Boosting based ANN learning algorithm.
8: Choose this partition if it results in the most accurate classifier B amongst all such classifiers
9:
10: Partition the training data D into two parts D1 and D2 using the predictions of the best classifier B
11: Recursively repeat the above steps for each partition
Algorithm 2 takes as input a K-class learning problem and uses a partitioning mechanism to construct a binary classification learning problem to be used for partitioning the training data. For a problem involving K classes there are 2 K possible partitions to choose from and hence finding an optimal partition of K-classes into two sets becomes intractable even for a moderate number of classes. Therefore, in our experiments we only considered K different partitions of the K-classes each obtained by dividing total classes into two sets such that one set contains only one of the classes while the second set contains all remaining classes. Such partitioning is exactly equivalent to encoding of classes using one-vs-remaining strategy along with an example filtering step. The class best discriminated from the remaining classes is used to divide the learning problem into smaller sub-problems in each cascade stage. Figure 2a shows a the structure of such a cascade.
A traversal of the cascade starting at the root and then traversing the subtree corresponding to the predicted class of an instance can be used to assign a label to . A recursive traversing process used to assigning a label to an instance is shown as Algorithm 3.
Cascaded classifier C
1:
2: set class label of the node as the predicted label of .
3: return
4:
5: Use the classifier at the root of C to compute the label y of
6: Recursively compute label of by using subtree corresponding to the computed label.
Experimental settings and results
A detailed description of the adopted datasets, the experimental settings and the obtained results are presented in this section. Three experiments have been performed to evaluate the proposed cascade structure for two intrusion detection datasets: KDD Cup 99 and UNSW-NB15. A comparison of the proposed cascade with the standard feed-forward ANN trained with sigmoid activation function in the hidden layer is also provided for the recent UNSW-NB15 dataset.
Datasets description
A subset of KDD Cup 99 (KDD’99) dataset [9] has been used to empirically evaluate the proposed method. Since its first use in the International Knowledge Discovery and Data Mining Tools Competition in 1999, it has been a gold standard intrusion detection dataset used by a large number of researchers and their experimental work [15–19]. A detailed description of the dataset can be found in [] and a summary of the example distribution is given in Table 1. The dataset contains very few dominant classes and hence it is an interesting optimization problem because a large class of algorithms converge to suboptimal solution and ignore the sparse classes still attaining high overall accuracy.
A recent network intrusion detection dataset, UNSW-NB15 [20], comprising contemporary attacks, has also been used to evaluate the proposed network intrusion detection system. This dataset contains a hybrid of modern normal and attack behaviors represented using 49 features and containing nine attack categories. A partition of the overall dataset into training/testing datasets is also provided. The training partition consists of 175,341 instances whereas the testing dataset contains 82,332 instances. Table 2 lists the overall class distribution in the test and training data.
Experimental settings
In our first set of experiments with the KDD-cup dataset, five iterations of 2-fold cross-validation have been used to evaluate the learned classifier for the KDD-cup intrusion detection dataset. In each iteration, the dataset has been randomly split into two non-overlapping partitions. A small sample of training examples from one of the partitions have been used for training while the examples in the other partition have been used for testing. Average values of various performance measures are reported in the following section on results. In the second set of experiments with the UNSW-NB15 dataset, a small fraction (about 2%) of the randomly selected dataset has been used for training and the whole testing dataset is used for evaluating the performance of proposed method.
While building the classifiers, classes have always been partitioned into two sets one containing a single class and the second containing all the remaining classes. For example, a classifier discriminating
A boosting-based ANN with twenty hidden neurons and one output neuron has been used as a classifier at each stage of the proposed cascade structure. The resulting cascade structure similar to the one shown in Fig. 2b with a ANN used as classifier in each internal node of the cascade has been used. Example filtering process that uses an ANN classifier corresponding to a node eliminated one of the classes at each stage of the cascade and hence the corresponding examples are also eliminated at successive stages.
Results
The proposed system has been evaluated using the accuracy, precision, recall and F1-score measures over the KDD’99 and UNSW-NB15 intrusion detection datasets. Since a major objective of any network intrusion detection system is to discriminate normal network traffic from intrusion therefore the first set of results presents the performance of the proposed system for discriminating the normal traffic from that representing some form of intrusion.
For the two datasets, confusion matrices along with the four performance measures for detecting intrusion without marking the actual intrusion type is given in Table 3. For the KDD’99 dataset these performance measures have been computed using results of a single fold whereas the test results of a complete run are reported for the UNSW-NB15 dataset. For the KDD’99 dataset the trained cascade has a very low false positive rate (i.e. normal traffic marked as intrusion) of 3.77% and a very low false negative rate (i.e. intrusion detected as normal traffic) of 1.26%. The values of accuracy, precision, recall, and F1-score for this single experiment are also very reasonable. For the UNSW-NB15 dataset the values of false positive and false negative rates are relatively poor than the corresponding values for the KDD’99 dataset.
The next set of results presents the overall performance of the system using five runs of two-fold cross validation scheme for the KDD’99 dataset as described above. Table 4 reports the fold-wise and average test performance of the trained system for the entire testing dataset. From the reported results it is obvious the the proposed learning strategy has resulted into an intrusion detection system with fairly high values for overall accuracy of 99.36% with both precision and recall having value above 0.97 and F1-score greater than 0.96. These high values have been obtained for a larger testing dataset consisting of 50% of the overall data whereas a very small fraction of the training data (about 1% only) has been used for training the classifier.
Table 5 presents a further insight into the results by providing a class-wise average values of the four performance measures for eight dominant classes. These results have been obtained by computing the corresponding values for each of the five two-fold runs and the average values of the obtained results are reported. The classifier trained for intrusion detection has high accuracy for fifteen classes but very low values for the remaining measures. As these classes have a sparse representation in the overall training and testing datasets, therefore the system has been able to achieve high overall values of performance measures even without having high values for these classes.
A similar set of results for the UNSW-NB15 dataset is summarized in Table 6. From these results, it is revealed that the proposed system can detect intrusion successfully but determining the type of intrusion is poorly marked for a number of cases. The average values of accuracy, precision, recall are 86.40, 53.19 and 60.71 respectively. It is also important to note that unlike a typical intrusion detection system, the proposed scheme marks the less frequently occurring classes as intrusion because of the cascade structure however the actual label of such instances might be incorrect.
The last set of results compares the proposed cascade-based approach with a two-layer neural network having twenty hidden neurons with sigmoid activation function. In the previous experiments, only a small fraction (about 5%) training data has been used for learning a classifier whereas in this experiment a larger subset (about 30%) of randomly chosen training examples have been used for comparing the two algorithms. Each experiment has been performed several times and the average performance values for detecting intrusion are reported in Table 7. The proposed approach obviously outperforms the standard feed-forward neural network for detecting intrusion. Because of the cascade structure and the filtering mechanism used, each filtered example contributes to the error accumulation only once. The overall change, i.e. Proposed –ANN, in the four performance measures are reported in Table 8 and it is obvious that the overall improvement obtained by using the proposed approach is significant. By comparing results presented in Tables 3 and 7, we can also make an interesting observation that the results obtained with a smaller fraction (5%) of training dataset are better than those obtained when a larger fraction (30%) of training data is used to build the classifier.
Conclusion and discussion
An effective method of learning a multi-class classifier along with the results obtained for two intrusion detection datasets have been presented. The proposed method uses a cascade of boosting-based ANN to create an effective multi-class classifier and is similar, in principle, to the one-vs-remaining strategy with an additional filtering of examples step. The intrusion detection system trained using the proposed method has very high overall accuracy, precision, recall, and F1-score for the KDD’99 dataset while these measures are relatively lower for the UNSW-NB15 intrusion detection dataset. The reported results also reveled that the trained classifier had high performance for most of the well-represented classes. Although the intrusion detection rate of the classifier trained using the proposed structure has been very high but for extremely sparse classes the proposed intrusion detection system has been unable to discriminate between various types of intrusions.
Two orthogonal research directions can be taken from this point onwards which include i) the proposed structure for learning a classifier can be tested for more classification tasks involving a large number of classes ii) the proposed intrusion detection system can be further refined using a filtering and example weighting strategy that favors the spare classes a little more than the remaining classes. Theoretical analysis of the proposed cascade for learning an effective classifier handling a large number of classes can also be done as a future direction.
Footnotes
Acknowledgments
The authors acknowledge the support of Lahore University of Management Sciences(LUMS) and the Higher Education Commission of Pakistan (H.E.C) for conducting this research work. The third author would like to acknowledge the support provided by King Abdulaziz City for Science and Technology (KACST) through the Science and Technology Unit at King Fahd University of Petroleum and Minerals (KFUPM) during this work through project No. 11-INF1658-04 as part of the National Science, Technology and Innovation Plan.
